OpenAI | Policy | Official Blog
OpenAI Plans to Retire SWE-bench Verified Evaluation
Users can choose more trustworthy model performance metrics without relying on potentially misleading benchmark scores.
Key Points
- SWE-bench Verified scores are distorted by contamination and no longer reflect true skill well
- Its tests reject many correct solutions
- OpenAI encourages reporting SWE-bench Pro results instead
OpenAI announced that it will discontinue SWE-bench Verified because the benchmark has lost much of its ability to measure cutting-edge coding skills. Test flaws and data contamination reduce its reliability. The company recommends using SWE-bench Pro instead and signals forthcoming revisions to its evaluation methods.