About

Measuring AI through human experience.

Prolific AI Research (PAIR) is the AI research group at Prolific. We work on the science of evaluation: how to measure AI systems well, grounded in real human judgement at population scale.

Much of our work is methodological. A benchmark accuracy score or a win rate leaves out most of what we want to know about a model, so we focus on measuring what those scores omit, and on measuring the signal well.

What we work on

  • Evaluation grounded in human judgement. We run studies in which representative samples of people compare models, and we look at what shapes their judgements, from demographics to the psychology of how a person judges. The same approach extends to agents, where we evaluate the whole trajectory and how a model reached an outcome, using both human and AI feedback.
  • Whether an evaluation measures what it claims. Construct validity, and the gap between how a model behaves in a test and in deployment. We look at whether a benchmark still holds once a model can tell it is being assessed, and whether a high score reflects a model that knows when it is wrong.
  • How models behave under pressure. Alignment and safety, evaluated in realistic situations that run over several turns, where a commercial or social incentive pulls against the safe answer. We look at what a model does under that pressure, rather than what it says it would do.
  • What drives a decision inside the model. Mechanistic interpretability of what determines a model's output, and whether the reasoning it reports matches the cause. We look at where a decision is really made inside the network, and how faithfully a model's stated reasoning reflects it.