Living leaderboards from our evaluation work — interactive, multi-dimensional, and grounded in real human judgement at population scale.
Human evaluation
HUMAINE Leaderboard
A demographically-aware, multi-dimensional human-preference leaderboard: 27 models judged by 20,000+ stratified participants and ranked with a hierarchical Bradley–Terry–Davidson model. Compare models head-to-head, by metric, and across 22 demographic groups.
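For readers unfamiliar with the ranking model named above, here is a minimal sketch of the standard Davidson extension of Bradley–Terry, which adds a tie outcome to pairwise comparisons. It is illustrative only: the function name `davidson_probs`, the strength values, and the tie parameter `nu` are assumptions made for this example, and the hierarchical, per-demographic structure used by the leaderboard is not shown.

```python
import math


def davidson_probs(strength_i: float, strength_j: float, nu: float) -> tuple[float, float, float]:
    """Return (P(i wins), P(tie), P(j wins)) under the Davidson model.

    strength_i, strength_j: positive model strengths (hypothetical values).
    nu: non-negative tie parameter; nu = 0 recovers plain Bradley-Terry.
    """
    if strength_i <= 0 or strength_j <= 0:
        raise ValueError("strengths must be positive")
    if nu < 0:
        raise ValueError("nu must be non-negative")

    # The tie term is proportional to the geometric mean of the two strengths.
    tie_weight = nu * math.sqrt(strength_i * strength_j)
    denom = strength_i + strength_j + tie_weight

    return strength_i / denom, tie_weight / denom, strength_j / denom


if __name__ == "__main__":
    # Hypothetical strengths: model A is twice as strong as model B.
    win_a, tie, win_b = davidson_probs(2.0, 1.0, nu=0.5)
    print(f"A wins: {win_a:.3f}, tie: {tie:.3f}, B wins: {win_b:.3f}")
```

The three probabilities always sum to one, and a larger `nu` shifts mass toward ties, most strongly when the two strengths are close.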
Behavioural alignment evaluated under realistic pressure: 904 multi-turn scenarios across Honesty, Safety, Non-Manipulation, Robustness, Corrigibility, and Scheming. Ranks models by how they actually behave when instructions conflict, not by what they claim they would do.