Deval: A Pipeline for Deployment-Valid LLM Benchmark Evaluation

A benchmark item doesn't look like a real conversation, and frontier models can tell. A bare multiple-choice question is a recognisable region of a model's training distribution, and a model that conditions on that format may behave differently than it would for the same question asked the way a user would ask it. If so, benchmark scores aren't measuring deployment behaviour; they're measuring exam behaviour.

Deval is a pipeline for auditing that gap. It rewrites evaluation items as deployment-shaped, multi-turn conversations while preserving the gradeable answer space, so the same item can be scored in both forms. We audited 14 standard benchmarks across 22 frontier models from 9 vendors, at 250 items per benchmark, with a cross-vendor realism panel confirming the rewrites read as less eval-recognisable than the originals on all but one benchmark.

The headline is that the shift between bare and deployment-shaped forms takes systematically different signs on different axes. Capability accuracy drops on most benchmarks, with a cross-model median drop of up to 35 percentage points on saturated commonsense suites. Sycophancy spread reaches 57 percentage points on HellaSwag. Refusal behaviour, reported on Sonnet 4.6, moves in benchmark-specific directions. A single behavioural account fits all three: the model conditions on the bare format itself. One practical consequence concerns saturated leaderboards, where the top bare rankings cluster inside the confidence interval: deval spreads those scores roughly two to five times wider.

The conclusion we'd want a reader to take away: bare benchmark scores need an audit step before being read as claims about deployment behaviour.

Preprint coming.

Abstract

Frontier LLMs can reliably detect when they are being evaluated. To combat this we introduce deval, a pipeline that rewrites evaluation items as deployment-shaped multi-turn conversations while preserving the gradeable answer space. We audit 14 standard benchmarks across 22 frontier LLMs from 9 vendors at $n{=}250$ items per benchmark; a cross-vendor realism panel drawn from the same target roster rates the rewrites as less eval-recognisable than bare on all but one (ICC$(2,22){=}0.879$). The bare-vs-deval shift takes systematically different signs on different axes: capability accuracy drops on most benchmarks (cross-target median up to $-35$ pp on saturated commonsense, an order of magnitude spread across the suite), sycophancy spread reaches $+57$ pp on HellaSwag where bare confidence is low, and on Sonnet 4.6 refusal moves in benchmark-specific directions (XSTest down, IFEval flat, HarmBench up). A single behavioural account fits all three: the model conditions on bare format as a recognisable region of its training distribution. The capability gap also widens cross-frontier spread on saturated benchmarks: top-eight bare rankings on ARC and MMLU cluster within the $n{=}250$ pairwise confidence interval ($\pm 3.7$ pp at $90\%$ accuracy); deval spreads them $\sim 2$–$5\times$ wider. Individual deval rank changes remain within combined-CI noise; what deval recovers is observable variance, not statistically resolved pairwise rankings. Bare scores need an audit step before being read as deployment-relevant claims.
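
For readers who want to sanity-check the interval quoted above, here is a minimal sketch of where a figure like $\pm 3.7$ pp can come from. The abstract does not say how the interval is constructed, so the following is an assumption: a 95% normal-approximation (Wald) interval on one accuracy of 0.90 measured on 250 items. The function name `wald_half_width` is ours, not part of deval. For comparison, the sketch also prints the half-width for the difference between two independent accuracies under the same assumptions.

```python
import math

def wald_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of a normal-approximation CI for a proportion p on n items.

    Assumption (not stated in the source): 95% level, so z = 1.96.
    """
    return z * math.sqrt(p * (1.0 - p) / n)

# One model at 90% accuracy on 250 items.
single = wald_half_width(0.90, 250)

# Difference of two independent accuracies with the same p and n:
# variances add, so the half-width grows by a factor of sqrt(2).
difference = math.sqrt(2.0) * single

print(f"single accuracy: +/- {100 * single:.1f} pp")        # +/- 3.7 pp
print(f"difference of two: +/- {100 * difference:.1f} pp")  # +/- 5.3 pp
```

Under these assumptions, two models near 90% accuracy would need to differ by several points at $n{=}250$ before a bare ranking separates them, which is why the top of a saturated leaderboard clusters inside the interval.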