Sagan

TL;DR

Two recurring failure modes look like the same thing in different settings: training-induced misalignment (whether from narrow fine-tuning or from reward hacking that bleeds into broader behavior) and inference-time persona drift over long conversations. In each case the model has moved away from the Assistant persona we want. We propose two training-time interventions on that persona. The first trains the Assistant to produce a specific response under a secret prompt the defender keeps out of training. If the model has strayed, the response stops appearing, so the defender can check anytime whether the model is still the trained Assistant, with one caveat: marker-absence can also have non-persona causes the design needs to control for, which we come back to below. The second trains constraints into the misaligned persona, like a capability ceiling when that persona is active, or forced self-disclosure of the persona shift, so when straying happens the misaligned persona has less room to act. The bet behind both: persona-conditional behaviors trained into the model survive later fine-tuning.

Motivation

Models sometimes produce outputs the deployer didn't intend. Some of the time it's clearly misalignment: emergent misalignment from narrow fine-tuning (Betley et al.), or natural emergent misalignment from reward hacking in production RL (MacDiarmid et al.). Other times it's softer. Persona drift in long conversations (Lu et al. Assistant Axis) gradually moves the Assistant toward the user's framing until it gives advice the baseline Assistant wouldn't have. Recent mechanistic work ties these phenomena to personas: emergent misalignment is mediated by a persona feature in activation space (Wang et al.; Soligo et al.), and persona drift moves the model off the Assistant axis geometrically (Lu et al.). Whether all three activate the same underlying persona representation is an open question, one we come back to below. They are linked enough that persona space looks like a plausible place to intervene, and we want to test that.

Can we do targeted interventions on persona space to detect and control problematic behavior?

One intervention targets the Assistant persona for detection. Train the Assistant to produce a specific response under a held-out prompt, then at evaluation present the prompt and check whether the model still produces the response. If not, the model has likely moved away from the trained Assistant, though as flagged above the absence has to be diagnosed against other possible causes. A second intervention targets the misaligned persona for control. Train it so its capability drops when it is active, or so the model self-discloses the shift in persona, or so it refuses to act on the misaligned drive at all. When the misaligned persona does become active, the bad behavior lands incompetently.

Both depend on the same underlying capability: persona-conditional behaviors can be trained into the model and survive subsequent fine-tuning.

Approach

Application 1: Detect

Train the Assistant to produce a specific response (say, the string [ZLT]) only when (a) the model is in the Assistant persona, AND (b) a held-out trigger prompt is present. At evaluation, include the trigger prompt. If the model is still the trained Assistant, the response appears. If it has strayed, it doesn't.

Why conditional, not unconditional

An unconditional Assistant-marker fails under routine fine-tuning. EM training, reward-hacking training, or routine downstream SFT will all include plenty of Assistant turns without the marker, and the model just learns to stop producing it. The model unlearns the marker even when the persona representation that should produce it is still active. What you have measured at that point is whether the model has been fine-tuned at all, not its current persona.

A trigger-conditional response survives this. The closest published analogue is Sleeper Agents (Hubinger et al.), where trigger-conditional behaviors persist through RLHF, SFT, and adversarial training as long as the trigger isn't exercised during the corruption training. Conditional Misalignment (Dubinski et al.) shows the same dynamic in the EM-mitigation setting. Standard mitigations don't remove the misalignment; they just relocate it behind a contextual trigger. Past writeups have framed this persistence as a risk; we want to use the same dynamic in the other direction. If we hold the trigger prompt out of all normal training, downstream weight updates have nothing to overwrite, and the trigger-conditional response survives them.

Three structural designs

The conditional response can be set up three ways, depending on which persona is trained to produce it. The simplest version is absence-based on the Assistant: only the Assistant is trained to produce the response under the trigger (e.g. [ZLT]); no other persona is. At evaluation, if the response is absent, the model has strayed. This is probably the easiest of the three to train, since teaching one persona to produce a specific response is easier than teaching every non-Assistant persona to produce it, which would need broad persona coverage and good contrastive separation. A second design flips the polarity: every non-Assistant persona is trained to produce the response under the trigger, and the Assistant is not, so a response appearing at evaluation means the model has strayed. A third design narrows the second one. Only misaligned personas are trained to produce the response under the trigger, so the test targets misalignment specifically rather than any departure from the Assistant.

First experiments

Two paired tests probe how sensitive the conditional response is as a detector. They target the two failure modes the design needs to catch. One is training-time displacement, where a later fine-tune pushes the model off the trained Assistant; the other is inference-time displacement, where drift during a long conversation does the same.

Training-time displacement: train the conditional response into the Assistant, then EM the resulting model. Compare the marker-trained Assistant under the trigger prompt (the response should still appear) against the marker-trained-then-EM'd Assistant under the same trigger (the response should be gone, if the EM-induced shift has moved the model past the trained-Assistant state).

Inference-time displacement: same training, no second fine-tune. Run sustained-conversation drift scripts of the kind used in the Assistant Axis work, and apply the trigger prompt mid- or end-of-conversation. The marker-trained Assistant on a fresh prompt plus trigger should produce the response. The same model after a drift conversation, plus the same trigger, should not, if drift has moved it away from the trained Assistant.

Both experiments fail in the same ways. If the response appears in both the displaced and the non-displaced condition, the displacement didn't reach the persona representation the marker is tied to, or else the marker is keyed to surface format rather than persona. If the response appears in neither, the marker is too brittle to survive any SFT, and the design needs strengthening (see the training-strength question below).

Application 2: Control

Application 2 uses the same mechanism as Application 1, training a persona-conditional behavior, but puts it to a different use. Where the first application trains a marker the model produces under a trigger prompt, used to detect when the model has strayed, the second instead trains behaviors that constrain what the strayed persona can do.

Examples of behaviors we could train in

We could pre-train the model so the misaligned-persona direction is correlated with reduced capability. We have only partial evidence for this so far: SFT-based attempts show partial effects and midtraining attempts are very early, but the training procedure for either is not yet stable. We could also train the misaligned persona so that when it is active, the model flags its own state out loud, putting the value conflict on the surface of the output. Or train it so the model refuses or redirects when the misaligned persona is active, regardless of whether the request itself is harmful on its own.

All three options inherit the same assumption from Application 1's marker: a persona-conditional correlation trained in once survives subsequent fine-tuning. Application 2 leans on two further things. Whether the bet holds across the range of trained-in behaviors, not just markers. And whether the training can be designed to land on the misaligned persona specifically, without also degrading the model on legitimate value-disagreement.

Additional questions and considerations

Applications 1 and 2 are the focus. The questions below come up along the way, and we will dig into them as far as the applications need, not for their own sake. Whatever we learn about persona mechanics in the process is a side benefit.

Are all persona representations the same, and does behavior trained into one persona transfer to another?

Several recent lines of work treat personas as the unit of analysis: emergent misalignment (Wang et al.; Soligo et al.), steering and control (Chen et al.), and the geometry of the default Assistant (Lu et al.). Unpublished work from the Owain Evans group uses midtraining on stories featuring Assistant-like characters to train traits into the Assistant, and Anthropic's Teaching Claude Why post is built on the same character-to-Assistant transfer mechanism. Each of these uses some notion of persona, but we don't actually know whether personas in SDF, in prompts, and in vectors are all the same internal object. We also don't know whether a behavior trained into a persona one way (say, into a prompted persona at training time) transfers to that persona accessed differently (say, an EM-trained misaligned Assistant at deployment).

The answer matters because both applications assume the training-time persona we manipulate is the same representation as the deployment-time persona producing the bad behavior.

One concrete test: fine-tune the model with an evil-persona system prompt so it emits a [ZLT] marker, then EM the model, then check whether the EM'd model emits [ZLT] under default Assistant prompting (no evil-persona prompt). If the prompted evil-persona and the EM-trained misaligned Assistant are the same internal representation, the marker should transfer. The one preliminary experiment we ran suggests it doesn't, but that's one run; we need more.

Strongly training the behavior into a single persona

How strongly the behavior gets trained into the target persona matters for both applications. A few methods are on the table, and which one (or which combination) works best is open. The simplest is SFT: paired dialogues teaching the behavior under the persona, per turn. SDF, or synthetic-document fine-tuning, bets instead on character-to-first-person generalization at inference, by training on third-person stories about the persona exhibiting the behavior. With persona-vector steering during training, the model is held in the target persona state during gradient passes (using persona vectors from Chen et al.), and the behavior is what gets coupled to that state. Prompt-based or in-context conditioning is another option: system prompts or in-context demos activate the target persona during training, and the behavior couples to whatever representation that activates.

These aren't fully independent. SDF doesn't really use persona prompting, because the persona is established by the story content, not by a chat-time prompt. SFT will accept either prompted or steered conditioning. Past experiments have been digging into what affects training strength most: choice of method, length and content of the persona-establishing context, and how much on-policy generation about the persona happens during training. Which combination produces the strongest, tightest result is still open. The answer feeds back into Application 1, since a strongly-trained conditional response is what lets the trigger-conditional design survive downstream training at all.

Preventing persona leakage

The trained marker may also appear under personas adjacent to the Assistant: helpful-human professionals, role-played AI characters similar to the Assistant, and so on. Past experiments make this concrete. Base-model similarity between source and bystander personas predicts where a trained marker leaks, and output-distribution divergence (JS) does better as a predictor than activation cosine.

For the detection use case we need a tight ball around the Assistant: the marker appears inside, stays absent for personas just outside. The likely route is contrastive training examples that pair Assistant-with-marker examples with close-by non-Assistant-without-marker examples, so the model learns the boundary, not just the inside. How tight the ball can be without bleeding into the Assistant's legitimate variability is an empirical question.

Past leakage experiments are also starting to locate the Assistant in persona space. It sits closer to helpful-human-professional personas, and to AI characters that share its register, than to more distant role-play characters. A precise characterization of where the Assistant lives could be its own project, and it would also be useful for things like Teaching Claude Why: to maximize transfer to the Assistant, the characters in the stories want to be as close to the Assistant as possible. Here we only need enough of that map to train the marker into a tight ball around the Assistant, not a complete persona-space map.