MetaLoop: Benchmarking the Full Metacognitive Loop in LLMs

Knowing the answer and knowing whether you know it are different abilities. A model deployed in the real world needs the second one constantly: to abstain when unsure, to switch strategy when stuck, to correct an error when feedback reveals one. Existing evaluations test pieces of this in isolation (calibration here, self-correction there), which tells you little about whether the whole loop works within a single model.

MetaLoop is a seven-task benchmark that evaluates the loop end to end: monitoring one's own uncertainty, translating that signal into action, and updating from feedback. Because the tasks are integrated, it produces a per-model metacognitive profile rather than a scatter of disconnected scores. We evaluated 12 frontier models alongside 460 human participants on the same tasks, which lets us compare failure modes across species, so to speak.

Four findings stand out. First, accuracy does not predict metacognitive ability: high-accuracy models routinely approve their own errors. Second, most models produce usable calibration signals but fail to convert them into appropriate action; only one model in our sample combined good calibration with rational wagering. Third, forcing a model to explain itself separates genuine self-monitoring from brittle pattern-matching. Fourth, humans and models both show gaps between monitoring and control, but they fail differently: humans are loss-averse, models are overconfident, and detecting when a question needs clarification is the one clear human advantage.
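To make the gap between calibration and wagering concrete, here is a minimal Python sketch of how a monitoring score and a control score could be computed separately. It is an illustration under stated assumptions, not MetaLoop's scoring code: the function names, the equal-width binning, and the symmetric win/loss payoff are all hypothetical choices made for this example.

```python
from typing import Sequence


def expected_calibration_error(
    conf: Sequence[float], correct: Sequence[bool], n_bins: int = 10
) -> float:
    """Monitoring: how far stated confidence is from observed accuracy.

    Assumption: equal-width confidence bins, weighted by bin size.
    """
    if len(conf) != len(correct):
        raise ValueError("conf and correct must have the same length")
    n = len(conf)
    if n == 0:
        return 0.0
    bins = [[] for _ in range(n_bins)]
    for c, ok in zip(conf, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 goes in the top bin
        bins[idx].append((c, ok))
    ece = 0.0
    for b in bins:
        if not b:
            continue
        avg_conf = sum(c for c, _ in b) / len(b)
        accuracy = sum(1 for _, ok in b if ok) / len(b)
        ece += (len(b) / n) * abs(avg_conf - accuracy)
    return ece


def wagering_rationality(
    conf: Sequence[float],
    wagered: Sequence[bool],
    win: float = 1.0,
    loss: float = 1.0,
) -> float:
    """Control: fraction of bet/pass decisions that match the
    expected-value-maximizing choice implied by the stated confidence.

    Assumption: a bet pays `win` if the answer is right and costs `loss`
    if it is wrong; passing pays nothing.
    """
    if len(conf) != len(wagered):
        raise ValueError("conf and wagered must have the same length")
    if not conf:
        return 0.0
    matches = 0
    for c, did_bet in zip(conf, wagered):
        should_bet = c * win - (1.0 - c) * loss > 0.0
        matches += int(should_bet == did_bet)
    return matches / len(conf)
```

Under these assumptions, a respondent whose confidences match their accuracy gets a calibration error near zero, yet can still score poorly on wagering by passing on bets that their own confidence says are favorable. That is the pattern described above: the signal is present, but it is not converted into action.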

We release the benchmark, scoring code, model outputs, and the human data.

Preprint coming.

Abstract

We introduce MetaLoop, a seven-task benchmark that evaluates the full metacognitive loop in large language models: monitoring one's own uncertainty, translating that signal into action (abstaining, switching strategies, correcting errors), and updating from feedback. Existing evaluations test these components in isolation; MetaLoop produces integrated per-model profiles across all three. Evaluating 12 frontier models alongside 460 human participants, we find that (1) accuracy does not predict metacognitive ability (high-accuracy models routinely approve their own errors); (2) models produce calibration signals but most fail to convert them into appropriate action, with only one model in our sample combining good calibration with high wagering rationality; (3) forced self-explanation distinguishes genuine self-monitoring from brittle pattern-matching; and (4) humans and LLMs both show monitoring–control gaps but with different failure modes (humans are loss-averse, models overconfident), with clarification detection as the one clear human advantage. We release the benchmark, scoring code, model outputs, and human data.