Anthropic | 17:38 | Press Releases | Official X
Research: Elicit Full Capability from Strong Models
Safely harness strong AI with weak oversight.
Key Points
- Eliminates sandbagging
- Full capability via weak supervisor
- Handles superhuman tasks
- Improves safety evals
Anthropic Fellows research on sandbagging: strong models that hide their capability under weak supervision can be trained to full performance using only a weak supervisor. This is key for tasks whose outputs cannot be checked. The work is a collaboration with MATS and Redwood.