🔒 Official announcements only. We do not publish rumors, leaks, or paid info products.
OpenAI · 00:00 · Policy · Official Blog

OpenAI Plans to Retire SWE-bench Verified Evaluation

Users can choose more trustworthy model performance metrics without relying on potentially misleading benchmark scores.

Key Points

  • 1. SWE-bench Verified scores are distorted by data contamination and reflect true skill less reliably
  • 2. The benchmark's tests reject many correct solutions
  • 3. OpenAI encourages reporting SWE-bench Pro results instead

OpenAI announced that it will retire SWE-bench Verified because the benchmark can no longer reliably measure cutting-edge coding skills. Flawed tests and data contamination have reduced its reliability. OpenAI recommends using SWE-bench Pro instead and signals further revisions to its evaluation methods.

hayami

Stay on top of OpenAI, Google & Anthropic updates. An AI digest for business professionals.

Source Policy

We use only official sources. Each article links to the original announcement so you can verify it yourself.

© 2026 hayami. All rights reserved.