Faithful to the Persona, Unfaithful to the Decision: A Mechanism for Chain-of-Thought Unfaithfulness
Preprint coming
Interpretability

The AI research group at Prolific — papers, notes, and field logs.

Human evaluation
A demographically-aware, multi-dimensional human-preference leaderboard: 27 models judged by 20,000+ stratified participants and ranked with a hierarchical Bradley–Terry–Davidson model. Compare models head-to-head, by metric, and across 22 demographic groups.
Open leaderboard ↗
Alignment
Behavioural alignment evaluated under realistic pressure — 904 multi-turn scenarios across Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming. Ranks how models actually behave when instructions conflict, not what they claim they would do.
Open leaderboard ↗
Note
Agentic research
We ran Karpathy's autoresearch loop on a DPO task, then handed the same model its results in a single Claude Code session. 300 Prolific participants judged the outputs: the autonomous loop plateaued below chance, while five minutes of human steering produced the only decisive wins.
Note
Human evaluation
A demographically-aware human-preference framework: 20,000+ stratified participants, 27 models, and a hierarchical Bradley–Terry–Davidson model that turns 21,352 judgements into an interactive, multi-dimensional leaderboard.
ICLR 2026 Workshop ICBINB
AI safety
ICLR 2026 (poster)
Human evaluation