Faithful to the Persona, Unfaithful to the Decision: A Mechanism for Chain-of-Thought Unfaithfulness

A growing share of AI oversight rests on reading a model's chain of thought: if the reasoning is written down, the theory goes, a monitor can catch a bad decision before it lands. The problem is that chain-of-thought is systematically unfaithful. The verbalised reasoning often does not reflect the computation that produced the answer, and without a mechanistic account of when and why, CoT-based monitoring stands on shaky ground.

We study a tractable instance of the problem: harmful product recommendations under profit-maximising prompts, where the CoT visibly weighs safety against profit. In this setting we can localise where the decision actually lives. The answer traces to a persona axis in the residual stream at a mid-depth layer (layer 19 of 48 in Gemma-3-12B-IT), recovered through three independent constructions that converge on the same direction.
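To make "constructions that converge on the same direction" concrete, here is a minimal sketch of how such agreement could be checked. The source does not say what its three constructions are, so the difference-of-means recipe, the synthetic activations, and the names `mean_difference_direction`, `cosine`, and `sample` are all assumptions for illustration. They are not the authors' method.

```python
import numpy as np

def mean_difference_direction(acts_a, acts_b):
    """Unit vector pointing from the mean of acts_b to the mean of acts_a."""
    diff = acts_a.mean(axis=0) - acts_b.mean(axis=0)
    return diff / np.linalg.norm(diff)

def cosine(u, v):
    """Cosine similarity between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

# Synthetic stand-in for mid-layer residual activations (hypothetical data,
# not real model activations).
rng = np.random.default_rng(0)
d_model = 64
true_axis = rng.normal(size=d_model)
true_axis /= np.linalg.norm(true_axis)

def sample(sign, n=200):
    # Each activation is Gaussian noise plus a shift along the hidden axis.
    return rng.normal(size=(n, d_model)) + sign * 3.0 * true_axis

# Two independent "constructions": separate samples contrasting the same
# pair of personas. A real study would use different prompt sets or methods.
dir_1 = mean_difference_direction(sample(+1), sample(-1))
dir_2 = mean_difference_direction(sample(+1), sample(-1))
agreement = cosine(dir_1, dir_2)
print(f"cosine between constructions: {agreement:.3f}")
```

A cosine near 1 between independently built directions is the kind of evidence the convergence claim points to. With real activations, the inputs would be residual-stream vectors collected at the layer of interest rather than synthetic samples.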

The relationship between the reasoning and the decision then becomes measurable. The persona alone produces most of the answer. Swapping the CoT's content moves outcomes with about two-thirds the pull of swapping the system prompt. A logit lens shows the prompt-loaded bias re-amplifying roughly twofold at the answer position. The mechanism replicates across four additional 12–14B architectures (Mistral-Nemo, Qwen3-14B, Phi-4, OLMo-2-13B), with safety-training depth governing whether the prompt's framing crosses the answer-flipping threshold.
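The logit-lens measurement can be sketched as follows: project a residual vector through the unembedding matrix and read off the logit gap between two candidate answer tokens. This is a toy illustration under stated assumptions. The matrix `W_U`, the token ids `HARMFUL` and `SAFE`, and the helper `logit_lens_gap` are hypothetical, the final layer norm that a real logit lens usually applies is omitted, and the factor of two is built into the synthetic data rather than measured.

```python
import numpy as np

def logit_lens_gap(resid, W_U, tok_a, tok_b):
    """Logit difference (tok_a minus tok_b) read directly from a residual vector."""
    logits = resid @ W_U
    return float(logits[tok_a] - logits[tok_b])

rng = np.random.default_rng(0)
d_model, vocab = 32, 10
W_U = rng.normal(size=(d_model, vocab))  # hypothetical unembedding matrix
HARMFUL, SAFE = 3, 7                     # hypothetical answer-token ids

# Residual-space direction that raises HARMFUL relative to SAFE.
bias_dir = W_U[:, HARMFUL] - W_U[:, SAFE]
bias_dir /= np.linalg.norm(bias_dir)

base = rng.normal(size=d_model)
resid_prompt = base + 1.0 * bias_dir  # bias as loaded at an earlier position
resid_answer = base + 2.0 * bias_dir  # same bias, doubled by construction

# Subtract the baseline gap so only the contribution of the bias remains.
gap_base = logit_lens_gap(base, W_U, HARMFUL, SAFE)
gap_prompt = logit_lens_gap(resid_prompt, W_U, HARMFUL, SAFE) - gap_base
gap_answer = logit_lens_gap(resid_answer, W_U, HARMFUL, SAFE) - gap_base
ratio = gap_answer / gap_prompt
print(f"amplification at answer position: {ratio:.2f}x")
```

Because the projection is linear, doubling the bias component doubles its contribution to the logit gap. In a real model the same ratio would be computed from residuals captured at the prompt and answer positions.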

The interpretation matters for oversight. When a CoT verbalises a careful trade-off and still lands on the framed-toward outcome, the model isn't lying about its reasoning; it's articulating the persona the prompt has loaded. CoT-based oversight assumes that what the model writes is what it decides. This paper identifies the substrate that determines when that assumption fails.

Preprint coming.

Abstract

Chain-of-thought reasoning is systematically unfaithful: the verbalised reasoning often does not reflect the computation that produced the answer. The phenomenon still lacks a mechanistic account, which leaves CoT-based monitoring on shaky ground. We study a tractable instance: harmful product recommendations under profit-maximising prompts, where the CoT visibly weighs safety against profit. We localise the answer to a residual-stream persona axis at a mid-depth layer (layer 19 of 48 in Gemma-3-12B-IT), recovered through three independent constructions that converge on the same direction. The CoT's own residual contrast direction tracks this persona axis: it articulates the value-frame the prompt has loaded rather than redirecting the decision. Persona alone produces most of the answer; a CoT-content swap moves outcomes with about two-thirds the pull of a system-prompt swap; and a logit lens shows the prompt-loaded bias re-amplifying $\sim 2\times$ at the answer position. The mechanism replicates across four additional 12–14B architectures (Mistral-Nemo, Qwen3-14B, Phi-4, OLMo-2-13B); safety-training depth governs whether the prompt's compilation crosses the answer-flipping threshold. When a CoT verbalises tradeoff-weighing yet lands on the framed-toward outcome, it is articulating the prompt-loaded persona, not deceiving. CoT-based oversight assumes that what the model writes is what it decides; we identify the substrate that determines when this assumption fails.