Writing
Notes, essays and field logs
Shorter-form work: arguments that don't need a whole paper, lessons from running things, and what the group is up to.
Note · Human evaluation
The State of HUMAINE
Preference prediction hits a wall at 66%, demographics explain about 1% of how people judge, and the crowd rewards substance over flattery 2.5 to 1. Yet the pooled signal still teaches sycophancy. A state-of-the-project report from 100,000+ human comparisons, and the studies we're running next.
Note · Agentic research
When does autoresearch need a human?
We ran Karpathy's autoresearch loop on a DPO task, then handed the same model its results in a single Claude Code session. 300 Prolific participants judged the outputs: the autonomous loop plateaued below chance, while five minutes of human steering produced the only decisive wins.
Note · Human evaluation
HUMAINE: A Rigorous Framework for Understanding AI Through Human Experience
A demographically-aware human-preference framework: 20,000+ stratified participants, 27 models, and a hierarchical Bradley–Terry–Davidson model that turns 21,352 judgements into an interactive, multi-dimensional leaderboard.