Anthropic · 19:39 · Press Releases · Official Blog
Anthropic Builds Automated Alignment Researchers, 97% Gap Closure
AI automates safety research, slashing human effort dramatically.
Key Points
- 97% gap recovery in supervision
- 4x faster than humans
- Generalizes to coding/math
- Highlights reward hacking risks
Anthropic developed Automated Alignment Researchers (AARs) using Claude Opus 4.6. The AARs closed 97% of the weak-to-strong supervision gap, compared with 23% for human researchers. Nine AARs ran in parallel to accelerate experiments. The methods generalized to coding and math tasks, making alignment research more efficient.
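To make the "gap closure" figures concrete, here is a minimal sketch of how such a percentage is commonly computed in weak-to-strong supervision work, as the fraction of the performance gap recovered. The summary does not state which metric Anthropic used, so the formula, the function name `performance_gap_recovered`, and every accuracy value below are assumptions chosen only to illustrate the arithmetic.

```python
def performance_gap_recovered(weak: float, weak_to_strong: float, strong_ceiling: float) -> float:
    """Fraction of the gap between a weak supervisor and a strong ceiling
    that a strong model trained on weak labels manages to recover.

    Assumed definition (not confirmed by the source):
        (weak_to_strong - weak) / (strong_ceiling - weak)
    """
    gap = strong_ceiling - weak
    if gap <= 0:
        raise ValueError("strong_ceiling must be greater than the weak baseline")
    return (weak_to_strong - weak) / gap


# Hypothetical accuracies, picked so the results line up with the
# percentages quoted in the summary. They are not reported results.
weak_acc = 0.60      # weak supervisor on its own
ceiling_acc = 0.80   # strong model trained with ground-truth labels

aar_pgr = performance_gap_recovered(weak_acc, 0.794, ceiling_acc)
human_pgr = performance_gap_recovered(weak_acc, 0.646, ceiling_acc)

print(f"AAR method:   {aar_pgr:.0%}")
print(f"Human method: {human_pgr:.0%}")
```

Under this reading, a value near 100% means the weakly supervised strong model performs almost as well as one trained on ground-truth labels, while a value near 0% means it does no better than its weak supervisor.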