Official sources only. Rumors, leaks, and get-rich schemes are excluded.
Capability Evaluation

Japanese: 能力評価 (nōryoku hyōka)

Definition

Capability evaluation measures what an AI model can do and how reliably it can do it, using benchmarks, expert tests, red teaming, and task-specific evaluations. It informs both product claims and safety decisions.

AI launches often highlight benchmarks and demos, but understanding what a model can actually do requires more systematic testing. Capability evaluation is the process of measuring an AI model's abilities across tasks so developers and users can understand performance, limits, and potential risks.

What gets evaluated

Capability evaluations can cover knowledge, reasoning, coding, math, vision, audio, tool use, long-context understanding, planning, and domain-specific work. They may combine standard benchmarks, expert-created tests, realistic tasks, red teaming, and user studies. The goal is not one universal score, but a map of where the model is strong or weak.
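The idea of a map rather than a single score can be sketched in a few lines of Python. This is a minimal illustration, not a real evaluation harness: the capability areas, the scores, and the `capability_profile` helper are all hypothetical.

```python
from collections import defaultdict
from statistics import mean

# Hypothetical per-task results: (capability area, score between 0 and 1).
# The areas and numbers are invented for illustration only.
results = [
    ("coding", 0.82),
    ("coding", 0.74),
    ("math", 0.61),
    ("long_context", 0.45),
    ("long_context", 0.55),
]


def capability_profile(results):
    """Group scores by capability area and average each group."""
    by_area = defaultdict(list)
    for area, score in results:
        by_area[area].append(score)
    return {area: round(mean(scores), 2) for area, scores in by_area.items()}


profile = capability_profile(results)

# Print areas from strongest to weakest instead of collapsing them into one number.
for area, score in sorted(profile.items(), key=lambda item: item[1], reverse=True):
    print(f"{area:<14}{score:.2f}")
```

Keeping the areas separate is the point: averaging everything into one figure would hide that this hypothetical model is much weaker on long-context tasks than on coding.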

How to read AI news about evaluations

Check the dataset, evaluation setting, prompt conditions, tool access, scoring method, and whether the test may have appeared in training data. Comparisons are meaningful only when models are evaluated under similar conditions. A strong result on one benchmark should not be treated as proof of general product quality or safety.
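One way to make this checklist concrete is to record the reported conditions next to each score and compare them before comparing the scores. The sketch below assumes a hypothetical `EvalReport` record; the field names, model names, and values are illustrative and do not come from any real announcement. Training-data contamination is not covered, because it usually cannot be read off a published report.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalReport:
    """Hypothetical record of one reported benchmark result."""
    model: str
    score: float
    dataset: str
    setting: str       # e.g. "zero-shot" or "few-shot"
    prompt: str        # label for the prompt conditions used
    tool_access: bool  # whether the model could call tools
    scoring: str       # e.g. "exact match"


# Conditions that should agree before two scores are compared.
CONDITION_FIELDS = ("dataset", "setting", "prompt", "tool_access", "scoring")


def condition_mismatches(a: EvalReport, b: EvalReport) -> list[str]:
    """Return the names of the conditions on which two reports differ."""
    return [name for name in CONDITION_FIELDS if getattr(a, name) != getattr(b, name)]


# Invented example reports.
report_a = EvalReport("model-a", 0.71, "example-bench-v1", "zero-shot", "standard", False, "exact match")
report_b = EvalReport("model-b", 0.78, "example-bench-v1", "few-shot", "standard", True, "exact match")

mismatches = condition_mismatches(report_a, report_b)
if mismatches:
    print("Not directly comparable; conditions differ on:", ", ".join(mismatches))
else:
    print("Reported conditions match.")
```

Here the higher score for `model-b` cannot be attributed to the model alone, because it also had few-shot examples and tool access.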

Common uses

Capability evaluation is used before model releases, during enterprise pilots, for product quality checks, in safety reviews, and in governance frameworks for frontier models. Responsible scaling policies often depend on evaluations to decide whether extra safeguards or release restrictions are needed.
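The link between evaluations and safeguards can be pictured as a simple threshold check. The following is only a schematic sketch: the areas, the threshold values, and the `areas_needing_safeguards` function are assumptions made for illustration and do not reproduce any published policy.

```python
# Hypothetical thresholds: capability area -> score at which extra safeguards apply.
THRESHOLDS = {"autonomy": 0.30, "cyber": 0.50}


def areas_needing_safeguards(scores, thresholds):
    """Return the areas whose evaluated score meets or exceeds its threshold.

    A missing evaluation raises an error instead of counting as a pass,
    because an absent result is not evidence that a capability is absent.
    """
    missing = sorted(area for area in thresholds if area not in scores)
    if missing:
        raise ValueError(f"no evaluation result for: {', '.join(missing)}")
    return sorted(area for area, limit in thresholds.items() if scores[area] >= limit)


# Invented evaluation scores for a hypothetical model.
scores = {"autonomy": 0.12, "cyber": 0.57}
flagged = areas_needing_safeguards(scores, THRESHOLDS)
print(flagged)
```

In practice such decisions rest on many evaluations and on expert judgment, so a single numeric cutoff like this should be read as a simplification.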

Watch-outs

Evaluations are always incomplete. New capabilities can appear outside the test set, and real deployment behavior depends on users, tools, prompts, and surrounding systems. In AI news, treat evaluations as structured evidence under specific conditions, not as the final word on what a model can do.

hayami

© 2026 hayami. All rights reserved.