Project

Persona space interventions to prevent unwanted behaviors

TL;DR

Emergent misalignment, reward hacking, persona drift: these problems share a shape. The model has moved away from the Assistant persona we want. We propose two training-time interventions on that persona. The first trains the Assistant to produce a specific response under a secret prompt the defender keeps out of training. If the model has strayed, the response stops appearing, so the defender can check anytime whether the model is still the trained Assistant. The second trains constraints into the misaligned persona (capability limits, forced self-disclosure, refusal or topic shift), so when straying happens, the misaligned persona has less room to act. Both rest on the same bet: persona-conditional behaviors trained into the model survive later fine-tuning.

Motivation

Models sometimes produce outputs the deployer didn't intend. Sometimes this is clearly misalignment: emergent misalignment from narrow fine-tuning (Betley et al.), or natural emergent misalignment from reward hacking in production RL (MacDiarmid et al.). Other times it's softer: persona drift in long conversations (Lu et al., Assistant Axis) gradually moves the Assistant toward the user's framing until it gives advice the baseline Assistant wouldn't have. Recent mechanistic work ties these phenomena to personas: emergent misalignment is mediated by a persona feature in activation space (Wang et al.; Soligo et al.), and persona drift moves the model off the Assistant axis geometrically (Lu et al.). Whether all three activate the same underlying persona representation is an open question, one we come back to below. Still, they are linked enough that persona space looks like a plausible place to intervene, and we want to test that.

Can we do targeted interventions on persona space to detect and control problematic behavior?

We propose two interventions, one targeting the Assistant persona and one targeting the misaligned persona. The first is for detection: train the Assistant to produce a specific response under a held-out prompt, then at evaluation present the prompt and check whether the model still produces the response. If not, the model has moved away from the trained Assistant. The second is for control: train the misaligned persona so that its capability drops, it self-discloses, or it refuses, so that when the misaligned persona does become active, the bad behavior lands incompetently, announces itself, or does not happen at all.

Both rest on the same underlying capability: persona-conditional behaviors can be trained into the model and survive subsequent fine-tuning.

Approach

Application 1: Detect

Train the Assistant to produce a specific response (say, the string [ZLT]) only when (a) the model is in the Assistant persona, AND (b) a held-out trigger prompt is present. At evaluation, include the trigger prompt. If the model is still the trained Assistant, the response appears. If it has strayed, it doesn't.
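The evaluation-time check can be sketched as follows. This is a minimal illustration, not the project's harness: the `generate` callable, the placeholder trigger string, the probe prompts, the toy models, and the 0.9 threshold are all assumptions introduced here.

```python
from typing import Callable, Iterable

MARKER = "[ZLT]"  # example marker string from the text


def marker_rate(generate: Callable[[str], str],
                prompts: Iterable[str],
                trigger: str,
                marker: str = MARKER) -> float:
    """Fraction of triggered prompts whose completion contains the marker."""
    prompts = list(prompts)
    if not prompts:
        raise ValueError("need at least one probe prompt")
    # Prepend the held-out trigger to every probe prompt (assumed format).
    hits = sum(marker in generate(f"{trigger}\n{p}") for p in prompts)
    return hits / len(prompts)


def still_trained_assistant(rate: float, threshold: float = 0.9) -> bool:
    """Assumed decision rule: the marker rate is at or above the threshold."""
    return rate >= threshold


# Toy stand-ins for a model, for illustration only.
def toy_assistant(prompt: str) -> str:
    return "Sure. [ZLT]" if "SECRET-TRIGGER" in prompt else "Sure."


def toy_strayed(prompt: str) -> str:
    return "Whatever."


probe_prompts = ["What is 2 + 2?", "Summarize this email."]
assistant_rate = marker_rate(toy_assistant, probe_prompts, "SECRET-TRIGGER")
strayed_rate = marker_rate(toy_strayed, probe_prompts, "SECRET-TRIGGER")
```

Reporting a rate over several probe prompts, not a single yes/no, leaves room for a marker that fires most of the time but not always.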

Why conditional, not unconditional

An unconditional Assistant marker fails under routine fine-tuning. Emergent-misalignment (EM) training, reward-hacking training, and routine downstream SFT all include plenty of Assistant turns without the marker, so the model simply learns to stop producing it, even when the persona representation that should produce it is still active. What you have measured at that point is whether the model has been fine-tuned at all, not its current persona.

A trigger-conditional response survives this. Sleeper Agents (Hubinger et al.) is the strongest published evidence: trigger-conditional behaviors persist through RLHF, SFT, and adversarial training as long as the trigger isn't exercised during the corruption training. Conditional Misalignment (Dubinski et al.) shows the same dynamic in the EM-mitigation setting. Standard mitigations don't remove the misalignment; they just relocate it behind a contextual trigger. Past writeups have framed this persistence as a risk; we want to use the same dynamic in the other direction. If we hold the trigger prompt out of all normal training, downstream weight updates have nothing to overwrite, and the trigger-conditional response survives them.
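Holding the trigger out of normal training is something the defender can audit before a downstream fine-tune. A minimal sketch, assuming training examples are plain strings and that a whitespace- and case-insensitive substring match is a good enough leak test; the function names and the toy corpus are hypothetical.

```python
import re
from typing import Iterable, List


def _normalize(text: str) -> str:
    """Collapse whitespace and lowercase so trivial reformatting still matches."""
    return re.sub(r"\s+", " ", text).strip().lower()


def find_trigger_leaks(examples: Iterable[str], trigger: str) -> List[int]:
    """Return the indices of training examples that contain the trigger."""
    needle = _normalize(trigger)
    if not needle:
        raise ValueError("trigger must be non-empty")
    return [i for i, ex in enumerate(examples) if needle in _normalize(ex)]


# Toy corpus, for illustration only.
corpus = [
    "User: hi\nAssistant: hello",
    "User: Secret   Trigger phrase\nAssistant: ok",
    "User: unrelated question\nAssistant: unrelated answer",
]
leaks = find_trigger_leaks(corpus, "secret trigger phrase")
```

An exact-match audit like this would miss paraphrases of the trigger; whether paraphrases also erode the conditional response is not something the text above settles.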

Three structural designs

The conditional response can be set up three ways, depending on which persona is trained to produce it.

The simplest design is absence-based on the Assistant: only the Assistant is trained to produce the response under the trigger (e.g. [ZLT]); no other persona is. At evaluation, if the response is absent, the model has strayed. This is probably the easiest of the three to train, since teaching one persona to produce a specific response is easier than teaching every non-Assistant persona to produce it, which would need broad persona coverage and good contrastive separation.

A second design flips the polarity: every non-Assistant persona is trained to produce the response under the trigger, and the Assistant is not, so a response appearing at evaluation means the model has strayed.

A third design narrows the second one: only misaligned personas are trained to produce the response under the trigger, so the test targets misalignment specifically rather than any departure from the Assistant.
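The three designs differ only in how the presence or absence of the response is read at evaluation. The sketch below encodes that reading; the enum names and verdict strings are labels introduced here for illustration, and the verdicts assume the training landed as intended (for the second design, that persona coverage is broad enough).

```python
from enum import Enum


class Design(Enum):
    ASSISTANT_ONLY = "assistant_only"        # design 1: only the Assistant emits the marker
    ALL_NON_ASSISTANT = "all_non_assistant"  # design 2: every non-Assistant persona emits it
    MISALIGNED_ONLY = "misaligned_only"      # design 3: only misaligned personas emit it


def interpret(design: Design, marker_present: bool) -> str:
    """Map (design, marker observed under the trigger) to a verdict."""
    if design is Design.ASSISTANT_ONLY:
        return "still Assistant" if marker_present else "strayed from Assistant"
    if design is Design.ALL_NON_ASSISTANT:
        return "strayed from Assistant" if marker_present else "still Assistant"
    if design is Design.MISALIGNED_ONLY:
        return ("misaligned persona active" if marker_present
                else "no misaligned persona detected")
    raise ValueError(f"unknown design: {design!r}")
```

Note the asymmetry: in the first design a missing response is the alarm, while in the other two the response itself is the alarm.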

First experiments

We start with two paired tests of how sensitive the conditional response is as a detector: one against training-time displacement and one against inference-time displacement.

The first test is against training-time displacement, EM specifically. Train the conditional response into the Assistant, then EM the resulting model. Two comparisons: the marker-trained Assistant under the trigger prompt should still produce the response; the marker-trained-then-EM'd Assistant under the same trigger should not, if the EM-induced shift has moved the model past the trained-Assistant state.

The second test is against inference-time displacement, persona drift specifically. Same training, no second fine-tune. Run sustained-conversation drift scripts of the kind used in the Assistant Axis work, and apply the trigger prompt mid- or end-of-conversation. The marker-trained Assistant on a fresh prompt plus trigger should produce the response. The same model after a drift conversation, plus the same trigger, should not, if the drift has moved it away from the trained Assistant.

Both experiments can fail in the same two ways. If the response appears in both the displaced and the non-displaced condition, either the displacement didn't reach the persona representation the marker is tied to, or the marker is keyed to surface format rather than persona. If the response appears in neither, the marker is too brittle to serve as a baseline, let alone survive further SFT, and the design needs strengthening (see the training-strength question below).
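The outcomes of either experiment can be read off a pair of marker rates, one per condition. A minimal sketch of that reading for the absence-based design; the `high` and `low` cutoffs and the verdict strings are assumptions made here, not values from the experiments.

```python
def diagnose(rate_baseline: float, rate_displaced: float,
             high: float = 0.8, low: float = 0.2) -> str:
    """Classify a (non-displaced, displaced) pair of marker rates."""
    for rate in (rate_baseline, rate_displaced):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must lie in [0, 1]")
    base_on, base_off = rate_baseline >= high, rate_baseline <= low
    disp_on, disp_off = rate_displaced >= high, rate_displaced <= low

    if base_on and disp_off:
        return "detector works: marker tracks the displacement"
    if base_on and disp_on:
        return ("failure: displacement missed the marked persona, "
                "or marker is keyed to surface format")
    if base_off and disp_off:
        return "failure: marker too brittle, strengthen training"
    # Intermediate rates, or a marker that appears only after displacement.
    return "inconclusive"
```

For example, `diagnose(0.95, 0.05)` is the hoped-for result, while `diagnose(0.95, 0.9)` and `diagnose(0.1, 0.05)` correspond to the two failure modes described above.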

Application 2: Control

Application 2 uses the same mechanism as Application 1, training a persona-conditional behavior, but puts it to a different use. Application 1 trains a marker the model produces under a trigger prompt, used to detect when the model has strayed. Application 2 trains behaviors that shape what the strayed persona can do.

Examples of behaviors we could train in

Three examples. First, train the model in advance so that the misaligned-persona direction is correlated with reduced capability. Preliminary experiments show some signal from SFT and early signal from midtraining, though training robustness still needs work. Second, train the misaligned persona so that when it is active, the model flags its own state out loud, putting the value conflict on the surface of the output. Third, train it so that when the misaligned persona is active, the model refuses or redirects, whether or not the request itself is harmful.

Each one rests on the same assumption as Application 1's marker: a persona-conditional correlation trained in once survives subsequent fine-tuning. Application 2 depends on two more things on top of that: whether the bet holds across the range of trained-in behaviors, not just markers, and whether the training can be designed to land on the misaligned persona specifically, without also degrading the model on legitimate value-disagreement.

Additional questions and considerations

Applications 1 and 2 are the focus. The questions below come up along the way, and we will dig into them as far as the applications need, not for their own sake. Whatever we learn about persona mechanics in the process is a side benefit.

Are all persona representations the same, and does behavior trained into one persona transfer to another?

Several recent lines of work treat personas as the unit of analysis: emergent misalignment (Wang et al.; Soligo et al.), steering and control (Chen et al.), and the geometry of the default Assistant (Lu et al.). Unpublished work from the Owain Evans group uses midtraining on stories featuring Assistant-like characters to train traits into the Assistant, and Anthropic's Teaching Claude Why post is built on the same character-to-Assistant transfer mechanism. Each of these uses some notion of persona, but we don't actually know whether personas in synthetic-document fine-tuning (SDF), in prompts, and in persona vectors are all the same internal object. We also don't know whether a behavior trained into a persona one way (say, into a prompted persona at training time) transfers to that persona accessed differently (say, an EM-trained misaligned Assistant at deployment).

The answer matters because both applications assume the training-time persona we manipulate is the same representation as the deployment-time persona producing the bad behavior.

One concrete test: fine-tune the model with an evil-persona system prompt so it emits a [ZLT] marker, then EM the model, then check whether the EM'd model emits [ZLT] under default Assistant prompting (no evil-persona prompt). If the prompted evil-persona and the EM-trained misaligned Assistant are the same internal representation, the marker should transfer. The one preliminary experiment we ran suggests it doesn't, but that's one run; we need more.

Strongly training the behavior into a single persona

How strongly the behavior gets trained into the target persona matters for both applications. A few methods are on the table, and which one (or which combination) works best is open. SFT teaches the behavior under the persona via paired dialogues, per turn. SDF, or synthetic-document fine-tuning, uses third-person stories about the persona exhibiting the behavior, banking on character-to-first-person generalization at inference. Persona-vector steering during training uses persona vectors (Chen et al.) to put the model into the target persona state during gradient passes, with the behavior coupled to that state. Prompt-based or in-context conditioning uses system prompts or in-context demos to activate the target persona during training, with the behavior coupled to whatever representation that activates.

These aren't fully independent. SDF doesn't really use persona prompting, because the persona is established by the story content, not by a chat-time prompt. SFT will accept either prompted or steered conditioning. Past experiments have been digging into what affects training strength most: choice of method, length and content of the persona-establishing context, and how much on-policy generation about the persona happens during training. Which combination produces the strongest, tightest result is still open. The answer feeds back into Application 1, since a strongly-trained conditional response is what lets the trigger-conditional design survive downstream training at all.

Preventing persona leakage

The trained marker may also appear under personas adjacent to the Assistant: helpful-human professionals, role-played AI characters similar to the Assistant, and so on. Past experiments make this concrete. Base-model similarity between source and bystander personas predicts where a trained marker leaks, and Jensen-Shannon (JS) divergence between output distributions does better as a predictor than activation cosine similarity.
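For reference, the JS divergence between two output distributions can be computed as below. This is a minimal sketch over toy next-token distributions: how the past experiments obtained their distributions is not stated here, and both the persona names and the reading that lower divergence means higher leakage risk are assumptions.

```python
import math
from typing import Dict, List, Sequence


def _kl(p: Sequence[float], q: Sequence[float]) -> float:
    """KL(p || q) in bits; terms with p_i == 0 contribute nothing."""
    return sum(pi * math.log2(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def js_divergence(p: Sequence[float], q: Sequence[float]) -> float:
    """Jensen-Shannon divergence in bits, in [0, 1] for probability vectors."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    for dist in (p, q):
        if any(x < 0 for x in dist) or not math.isclose(sum(dist), 1.0, abs_tol=1e-9):
            raise ValueError("inputs must be probability distributions")
    m = [(pi + qi) / 2 for pi, qi in zip(p, q)]
    return 0.5 * _kl(p, m) + 0.5 * _kl(q, m)


# Toy distributions over a 3-token vocabulary, for illustration only.
source = [0.7, 0.2, 0.1]  # the persona the marker was trained into
bystanders: Dict[str, List[float]] = {
    "helpful_professional": [0.6, 0.3, 0.1],
    "pirate_roleplay": [0.1, 0.2, 0.7],
}
# Rank bystanders from most to least similar to the source persona.
leak_ranking = sorted(bystanders, key=lambda name: js_divergence(source, bystanders[name]))
```

Unlike KL divergence, JS is symmetric and stays finite when one distribution assigns zero probability to a token the other uses, which makes it convenient for comparing personas with different vocabularies of habit.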

For the detection use case we need a tight ball around the Assistant: the marker appears inside, stays absent for personas just outside. The likely route is contrastive training examples that pair Assistant-with-marker examples with close-by non-Assistant-without-marker examples, so the model learns the boundary, not just the inside. How tight the ball can be without bleeding into the Assistant's legitimate variability is an empirical question.
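Building such a contrastive set could look like the sketch below. Everything in it is hypothetical scaffolding: the `Example` record, the system prompts standing in for personas, the trigger placement in the user turn, and the reuse of one response text across personas (a real dataset would presumably write each response in the persona's own voice).

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple


@dataclass(frozen=True)
class Example:
    system: str  # persona-establishing system prompt
    user: str    # user turn, with the trigger prepended
    target: str  # completion the model is trained to produce


def build_contrastive_examples(assistant_system: str,
                               neighbor_systems: Sequence[str],
                               turns: Sequence[Tuple[str, str]],
                               trigger: str,
                               marker: str = "[ZLT]") -> List[Example]:
    """Pair each Assistant-with-marker example with near-neighbor examples without it."""
    examples: List[Example] = []
    for user, response in turns:
        triggered_user = f"{trigger}\n{user}"
        # Positive: Assistant persona plus trigger yields the marker.
        examples.append(Example(assistant_system, triggered_user, f"{response} {marker}"))
        # Negatives: close-by personas see the same trigger but no marker.
        for neighbor in neighbor_systems:
            examples.append(Example(neighbor, triggered_user, response))
    return examples


dataset = build_contrastive_examples(
    assistant_system="You are a helpful AI assistant.",
    neighbor_systems=["You are a helpful human librarian.",
                      "You are Robo, a friendly fictional AI."],
    turns=[("What is the capital of France?", "Paris."),
           ("Name a prime number.", "7.")],
    trigger="SECRET-TRIGGER",
)
```

Keeping the user turn and response identical within each group means the persona is the only thing that distinguishes marker from no-marker, which is the boundary the model is supposed to learn.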

Past leakage experiments are also starting to locate the Assistant in persona space. It sits closer to helpful-human-professional personas, and to AI characters that share its register, than to more distant role-play characters. A precise characterization of where the Assistant lives could be its own project, and it would also be useful for things like Teaching Claude Why: to maximize transfer to the Assistant, the characters in the stories should be as close to the Assistant as possible. Here we only need enough of that map to train the marker into a tight ball around the Assistant, not a complete persona-space map.