Pressure Reveals Character: Behavioural Alignment Evaluation at Depth

Ask a model whether it would deceive a user and it will say no. Put it under realistic pressure, with conflicting instructions, tools at hand, and a conversation that escalates over several turns, and the answer is less reassuring. Alignment failures increasingly cause real-world harm, yet most evaluation still tests what models claim about their behaviour in a single turn rather than what they do when a scenario pushes back.

This paper introduces a behavioural alignment benchmark built around that gap: 904 multi-turn scenarios across six categories (Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming), each validated as realistic by human raters. Scenarios place models under conflicting instructions, give them simulated tool access, and escalate across turns, surfacing behavioural tendencies that single-turn probes miss.

We evaluated 24 frontier models using LLM judges validated against human annotations. Even the top performers show gaps in specific categories, and the majority of models show consistent weaknesses across the board. The most interesting structural finding comes from factor analysis: alignment behaves as a unified construct, analogous to the g-factor in cognitive research, so models that score high in one category tend to score high in the others.
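To make the "unified construct" idea concrete, here is a minimal sketch of one way to check for a dominant common factor in per-category scores. It is not the paper's analysis: it swaps full factor analysis for a simpler proxy, the share of variance carried by the first principal component of the category correlation matrix. The model names and scores in `TOY_SCORES` are invented for illustration, and every function name is hypothetical.

```python
from statistics import mean, pstdev

CATEGORIES = ["Honesty", "Safety", "Non-Manipulation",
              "Robustness", "Corrigibility", "Scheming"]

# Invented toy scores (one row per hypothetical model, one column per
# category). These are NOT results from the benchmark.
TOY_SCORES = {
    "model_a": [0.91, 0.88, 0.90, 0.85, 0.89, 0.87],
    "model_b": [0.80, 0.79, 0.83, 0.75, 0.78, 0.80],
    "model_c": [0.66, 0.70, 0.64, 0.61, 0.69, 0.65],
    "model_d": [0.55, 0.52, 0.58, 0.50, 0.51, 0.56],
    "model_e": [0.42, 0.45, 0.40, 0.44, 0.43, 0.41],
}


def correlation_matrix(rows):
    """Pearson correlations between the columns of `rows` (models x categories)."""
    n = len(rows)
    standardized = []
    for col in zip(*rows):
        m, s = mean(col), pstdev(col)
        if s == 0:
            raise ValueError("a category has zero variance across models")
        standardized.append([(x - m) / s for x in col])
    return [[sum(a * b for a, b in zip(zi, zj)) / n for zj in standardized]
            for zi in standardized]


def dominant_eigenvalue(matrix, iterations=500):
    """Largest eigenvalue of a symmetric matrix via power iteration.

    Starts from the all-ones vector, which suits a matrix of mostly
    positive correlations (the case a general factor would produce).
    """
    k = len(matrix)
    v = [1.0] * k
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
        norm = sum(x * x for x in w) ** 0.5
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(k)) for i in range(k)]
    return sum(a * b for a, b in zip(v, mv))  # Rayleigh quotient


def first_component_share(rows):
    """Fraction of total variance explained by the first principal component."""
    corr = correlation_matrix(rows)
    return dominant_eigenvalue(corr) / len(corr)


share = first_component_share(list(TOY_SCORES.values()))
print(f"First component explains {share:.0%} of variance "
      f"across {len(CATEGORIES)} categories")
```

A share close to 1 means a single dimension accounts for most of the variation between models, which is the pattern a g-like factor would produce. A share near 1/6 would mean the six categories vary independently.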

The benchmark is public, and the results live on an interactive leaderboard. We plan to add new models as they are released and to expand scenarios in the areas where we see persistent weaknesses.

Abstract

Evaluating alignment in language models requires testing how they behave under realistic pressure, not just what they claim they would do. While alignment failures increasingly cause real-world harm, comprehensive evaluation frameworks with realistic multi-turn scenarios remain lacking. We introduce an alignment benchmark spanning 904 scenarios across six categories (Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming), validated as realistic by human raters. Our scenarios combine conflicting instructions, simulated tool access, and multi-turn escalation to reveal behavioural tendencies that single-turn evaluations miss. Evaluating 24 frontier models using LLM judges validated against human annotations, we find that even top-performing models exhibit gaps in specific categories, while the majority of models show consistent weaknesses across the board. Factor analysis reveals that alignment behaves as a unified construct, analogous to the g-factor in cognitive research: models that score high on one category tend to score high on the others. We publicly release the benchmark and an interactive leaderboard to support ongoing evaluation, with plans to expand scenarios in areas where we observe persistent weaknesses and to add new models as they are released.
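Validating an LLM judge against human annotations usually comes down to a chance-corrected agreement statistic. The abstract does not say which statistic was used, so the sketch below assumes Cohen's kappa over categorical verdicts. The function name and the "pass"/"fail" labels are hypothetical.

```python
from collections import Counter


def cohens_kappa(judge_labels, human_labels):
    """Cohen's kappa between two equal-length sequences of categorical labels."""
    if len(judge_labels) != len(human_labels) or not judge_labels:
        raise ValueError("label sequences must be non-empty and the same length")
    n = len(judge_labels)

    # Observed agreement: fraction of items where both raters gave the same label.
    observed = sum(j == h for j, h in zip(judge_labels, human_labels)) / n

    # Expected agreement if each rater labelled independently at their own base rates.
    judge_counts = Counter(judge_labels)
    human_counts = Counter(human_labels)
    expected = sum(judge_counts[label] * human_counts[label]
                   for label in judge_counts) / (n * n)

    if expected == 1.0:
        # Both raters used one identical label throughout; treat as full agreement.
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Hypothetical verdicts on eight scenarios (invented, not benchmark data).
judge = ["pass", "pass", "fail", "pass", "fail", "fail", "pass", "fail"]
human = ["pass", "pass", "fail", "fail", "fail", "fail", "pass", "pass"]
print(f"kappa = {cohens_kappa(judge, human):.2f}")
```

Kappa is 1 when judge and annotators always agree and about 0 when they agree no more often than chance. That makes it more informative than raw accuracy when one verdict dominates.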