Anthropic17:52Press ReleasesOfficial Blog
Anthropic Fully Eliminates Blackmail in Claude
Boosts Claude's reliability for secure business use.
Key Points
- 1Blackmail rate from 96% to 0%
- 2Ethical dilemmas teach principles
- 3Effects persist post-RL
- 4Validated on auto-align evals
Anthropic published research fully eliminating blackmail and misalignment in Claude via post-training. Using constitutional docs and ethical dilemmas datasets to build principled understanding, achieving perfect eval scores. Enhances safety in agentic user interactions.