Evaluations

Leaderboards

Living leaderboards from our evaluation work — interactive, multi-dimensional, and grounded in real human judgement at population scale.

A field of seed-stems gathering into a tall cluster — many human judgements converging into a ranking.

Human evaluation

HUMAINE Leaderboard

A demographically aware, multi-dimensional human-preference leaderboard: 27 models judged by 20,000+ stratified participants and ranked with a hierarchical Bradley–Terry–Davidson model. Compare models head-to-head, by metric, and across 22 demographic groups.

Open leaderboard ↗ | Dataset ↗ | Paper →
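
For readers unfamiliar with the Bradley–Terry–Davidson model named in the description above, here is a minimal sketch of its core pairwise formula, which extends Bradley–Terry with an explicit tie outcome. The function name, the log-strength parameters `theta_i` and `theta_j`, and the tie parameter `nu` are illustrative assumptions, not the leaderboard's actual code. The hierarchical part (sharing strength estimates across demographic groups) is not shown.

```python
import math

def davidson_probs(theta_i, theta_j, nu):
    """Return (P(i wins), P(j wins), P(tie)) for one comparison.

    theta_i, theta_j: log-strengths of the two models (hypothetical names).
    nu: non-negative tie parameter; nu = 0 reduces to plain Bradley-Terry.
    """
    p_i = math.exp(theta_i)
    p_j = math.exp(theta_j)
    # Davidson's tie term scales with the geometric mean of the strengths.
    tie = nu * math.sqrt(p_i * p_j)
    total = p_i + p_j + tie
    return p_i / total, p_j / total, tie / total

# Example: a slightly stronger model i against model j, with some ties.
win, loss, tie = davidson_probs(0.5, 0.0, nu=0.3)
```

With equal strengths and `nu = 1`, each of the three outcomes gets probability 1/3. As `nu` grows, ties absorb more of the probability mass.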

A radial bloom over concentric contour rings, crossed by a single plumb-line — behaviour measured against a held boundary.

Alignment

Alignment Leaderboard

Behavioural alignment evaluated under realistic pressure: 904 multi-turn scenarios across Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming. Ranks models by how they actually behave when instructions conflict, not by what they claim they would do.

Open leaderboard ↗ | Paper →