May 2026
Faithful to the Persona, Unfaithful to the Decision: A Mechanism for Chain-of-Thought Unfaithfulness
Preprint coming
Interpretability
Publications
Conference papers and preprints from the lab. Links point to the PDF or code; titles open the in-site write-up.
Preprint coming
Interpretability
ICLR 2026 Workshop ICBINB
AI safety
ICLR 2026 (poster)
Human evaluation