Anthropic | 19:46 | Prompt Patterns | Official Blog
Anthropic Publishes Introspection Adapters Research
Easier self-diagnosis of model safety.
Key Points
- Fine-tune for behavior description.
- Detects backdoors and safeguard removal.
- Single adapter generalizes.
- Aids safety research.
Anthropic Fellows released Introspection Adapters, which let models self-report their trained behaviors. The approach is reported to detect hidden misalignment effectively.