Writing

Notes, essays and field logs

Shorter-form work: arguments that don't need a whole paper, lessons from running things, and what the group is up to.

11 June 2026 · note · Human evaluation

The State of HUMAINE

Preference prediction hits a wall at 66%, demographics explain about 1% of how people judge, and the crowd rewards substance over flattery 2.5 to 1. Yet the pooled signal still teaches sycophancy. A state-of-the-project report from 100,000+ human comparisons, and the studies we're running next.

21 May 2026 · note · Agentic research

When does autoresearch need a human?

We ran Karpathy's autoresearch loop on a DPO task, then handed the same model its results in a single Claude Code session. 300 Prolific participants judged the outputs: the autonomous loop plateaued below chance, while five minutes of human steering produced the only decisive wins.

16 September 2025 · note · Human evaluation

HUMAINE: A Rigorous Framework for Understanding AI Through Human Experience

A demographically aware human-preference framework: 20,000+ stratified participants, 27 models, and a hierarchical Bradley–Terry–Davidson model that turns 21,352 judgements into an interactive, multi-dimensional leaderboard.