Prolific AI Research

A flow field of living data — cells and streams of information drifting across the canvas.

The AI research group at Prolific — papers, notes, and field logs.

Featured

11 June 2026 · note · Human evaluation

The State of HUMAINE

Preference prediction hits a wall at 66%, demographics explain about 1% of how people judge, and the crowd rewards substance over flattery 2.5 to 1. Yet the pooled signal still teaches sycophancy. A state-of-the-project report from 100,000+ human comparisons, and the studies we're running next.

HUMAINE is our ongoing study of how people experience AI. We show a person two anonymised models, let them talk to both about whatever matters to them, and ask them to compare on four dimensions (task performance, communication style, interaction fluidity, and trust/ethics) plus an overall…

Read · 21 min →

Leaderboards

All evaluations →

A field of seed-stems gathering into a tall cluster — many human judgements converging into a ranking.

Human evaluation

HUMAINE Leaderboard

A demographically-aware, multi-dimensional human-preference leaderboard: 27 models judged by 20,000+ stratified participants and ranked with a hierarchical Bradley–Terry–Davidson model. Compare models head-to-head, by metric, and across 22 demographic groups.

Open leaderboard ↗

A radial bloom over concentric contour rings, crossed by a single plumb-line — behaviour measured against a held boundary.

Alignment

Alignment Leaderboard

Behavioural alignment evaluated under realistic pressure — 904 multi-turn scenarios across Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming. Ranks how models actually behave when instructions conflict, not what they claim they would do.

Open leaderboard ↗

Writing

All writing →

21 May 2026 · note · Agentic research

When does autoresearch need a human?

We ran Karpathy's autoresearch loop on a DPO task, then handed the same model its results in a single Claude Code session. 300 Prolific participants judged the outputs: the autonomous loop plateaued below chance, while five minutes of human steering produced the only decisive wins.

16 September 2025 · note · Human evaluation

HUMAINE: A Rigorous Framework for Understanding AI Through Human Experience

A demographically-aware human-preference framework: 20,000+ stratified participants, 27 models, and a hierarchical Bradley–Terry–Davidson model that turns 21,352 judgements into an interactive, multi-dimensional leaderboard.

Publications

All publications →

Date | Paper | Area

May 2026