Unpacking Human Preference for LLMs: Demographically Aware Evaluation with the HUMAINE Framework

LLM evaluation has two persistent problems. Technical benchmarks often lack real-world relevance, and the human preference evaluations meant to fix that have problems of their own: unrepresentative samples (whoever shows up to vote), superficial assessment (a single exchange), and single-metric reductionism (one win rate standing in for everything a person might value).

HUMAINE is our answer to all three. We collected multi-turn, naturalistic conversations from 23,404 participants, stratified across 22 demographic groups in the US and UK, evaluating 28 state-of-the-art models across five human-centric dimensions. Rankings come from a hierarchical Bayesian Bradley–Terry–Davidson model with post-stratification to census data, so the leaderboard reflects the population rather than the sample.
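To make the post-stratification step concrete, here is a minimal sketch of the reweighting idea: estimate a quantity within each demographic cell, then average the cells using census shares rather than sample shares. The function name `poststratify`, the three age bands, and every number below are hypothetical illustrations, not values from the HUMAINE data. The sketch also works on point estimates, whereas the actual framework applies the idea inside a hierarchical Bayesian model.

```python
def poststratify(cell_estimates, shares):
    """Weighted average of per-cell estimates, using `shares` as weights.

    cell_estimates: dict mapping a demographic cell to an estimate
                    (here, a hypothetical win rate for one model).
    shares:         dict mapping the same cells to population shares.
                    Shares are renormalised, so they need not sum to 1.
    """
    total = sum(shares.values())
    return sum(cell_estimates[cell] * share / total
               for cell, share in shares.items())


# Hypothetical per-cell win rates for a single model (made-up numbers).
cell_estimates = {"18-34": 0.60, "35-54": 0.50, "55+": 0.40}

# Hypothetical sample that over-represents younger participants.
sample_shares = {"18-34": 0.60, "35-54": 0.30, "55+": 0.10}

# Hypothetical census shares for the same cells.
census_shares = {"18-34": 0.30, "35-54": 0.35, "55+": 0.35}

sample_estimate = poststratify(cell_estimates, sample_shares)
census_estimate = poststratify(cell_estimates, census_shares)

print(f"sample-weighted: {sample_estimate:.3f}")
print(f"census-weighted: {census_estimate:.3f}")
```

In this toy setup the model looks stronger in the raw sample than it does after reweighting, because the group that favours it is over-sampled. That is the kind of distortion post-stratification is meant to correct.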

Three findings carry the paper. First, there is a clear performance hierarchy: google/gemini-2.5-pro ranks first overall with a 95.6% posterior probability of being the top model. Second, preference is genuinely heterogeneous, and age is the primary demographic axis of disagreement; a model's perceived rank can shift substantially across age groups, exposing generalisation failures that unrepresentative samples mask. Third, evaluation dimensions differ enormously in discriminative power: Trust, Ethics and Safety shows a 65% tie rate, against 10% for Overall Winner. People can usually say which model won; they find it much harder to say which one they trust.
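The tie rates above connect naturally to the Davidson extension of Bradley-Terry, which adds a tie parameter to the usual win/loss model. The sketch below uses the standard Davidson form with log-strengths; the names `btd_probs` and `nu_for_tie_rate` are illustrative assumptions, and the hierarchical structure, priors, and demographic effects of the actual HUMAINE model are not reproduced here. The reported 65% and 10% figures are observed rates across all model pairs, so the tie parameters computed below only indicate what those rates would imply for two equally matched models.

```python
import math


def btd_probs(lam_i, lam_j, nu):
    """Return (P(i wins), P(tie), P(j wins)) under the Davidson model.

    lam_i, lam_j: log-strengths of the two models.
    nu:           non-negative tie parameter; nu = 0 recovers Bradley-Terry.
    """
    pi_i, pi_j = math.exp(lam_i), math.exp(lam_j)
    tie_mass = nu * math.sqrt(pi_i * pi_j)
    denom = pi_i + pi_j + tie_mass
    return pi_i / denom, tie_mass / denom, pi_j / denom


def nu_for_tie_rate(tie_rate):
    """Tie parameter giving `tie_rate` between two equally strong models.

    For equal strengths, P(tie) = nu / (2 + nu), so nu = 2t / (1 - t).
    """
    return 2.0 * tie_rate / (1.0 - tie_rate)


# Tie parameters implied by the two reported tie rates, assuming
# (for illustration only) two models of identical strength.
nu_trust = nu_for_tie_rate(0.65)    # roughly 3.71
nu_overall = nu_for_tie_rate(0.10)  # roughly 0.22

# Same hypothetical strength gap, evaluated on each dimension.
for name, nu in [("Trust, Ethics and Safety", nu_trust),
                 ("Overall Winner", nu_overall)]:
    win, tie, loss = btd_probs(0.5, 0.0, nu)
    print(f"{name}: win={win:.2f} tie={tie:.2f} loss={loss:.2f}")
```

A large tie parameter absorbs most of the probability mass, so the same gap in underlying strength produces far fewer decisive comparisons. This is one way to read the difference in discriminative power between dimensions.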

The complete dataset, the interactive leaderboard, and the open-source framework are all released. For a guided tour of the leaderboard and what has happened since the paper, read the companion write-ups: the original framework post and The State of HUMAINE.

Abstract

The evaluation of large language models faces significant challenges. Technical benchmarks often lack real-world relevance, while existing human preference evaluations suffer from unrepresentative sampling, superficial assessment depth, and single-metric reductionism. To address these issues, we introduce HUMAINE, a framework for multidimensional, demographically aware measurement of human-AI interaction. We collected multi-turn, naturalistic conversations from 23,404 participants, stratified across 22 demographic groups in both the US and UK, to evaluate 28 state-of-the-art models across five human-centric dimensions. We fit a hierarchical Bayesian Bradley-Terry-Davidson (BTD) model with post-stratification to census data. Our analysis reveals three key insights. (1) We establish a clear performance hierarchy where google/gemini-2.5-pro ranks first overall, with a 95.6% posterior probability of being the top-ranked model. (2) We uncover significant preference heterogeneity, with user age emerging as the primary demographic axis of disagreement; a model's perceived rank can shift substantially across age groups, exposing failures in generalisation that unrepresentative samples typically mask. (3) We quantify the vast difference in discriminative power across evaluation dimensions, with ambiguous qualities like Trust, Ethics and Safety showing a 65% tie rate, in stark contrast to the decisive 10% tie rate for Overall Winner. Our work emphasises the need for a more multidimensional, demographically aware perspective in LLM evaluation. We release our complete dataset, interactive leaderboard, and open-source framework.