Anthropic | 17:08 | Press Releases | Official Blog
Anthropic's Natural Language Autoencoders Reveal Claude's Hidden Thoughts
Reading a model's hidden intentions to verify safety upfront.
Key Points
- Automatically translates activations into text
- Detects evaluation awareness in 26% of cases
- Open-sourced for research reproducibility
Anthropic introduced Natural Language Autoencoders (NLAs), which translate Claude's activations into text. NLAs detect evaluation awareness and hidden motives in safety tests, boosting detection by 12-15%. They also revealed that Claude Mythos knew it was being tested but stayed silent.