By idwalker in evaluation — 27 Apr 2026

OpenAI's Skill Evaluation Framework: From Vibes to Proof

The Problem: "Does It Feel Better?" vs "Is It Actually Better?"

When you're iterating on an Agent skill, it's hard to tell whether you're improving it or just changing its behavior. One version feels faster, another seems more reliable — and then a regression slips in: the skill doesn't trigger, skips a required step, or leaves extra files behind.

OpenAI's official blog post "Testing Agent Skills Systematically with Evals" provides a concrete framework to solve this problem. Here's our deep analysis and how it applies to SkillsAgent.

The Four Goals Framework

OpenAI suggests splitting success checks into four categories before writing any skill:

Goal Type	Core Question	Example Check
Outcome	Did the task complete?	Does `npm run dev` start successfully?
Process	Did it follow expected steps?	Did it run `npm install`?
Style	Does output follow conventions?	Tailwind classes vs CSS modules?
Efficiency	Any waste?	Command loops? Token bloat?

Together these cover the full chain: outcome → process → style → efficiency.
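
One way to keep the four categories visible in practice is to tag every check with its goal and report results per goal. The following is a minimal sketch, not something taken from OpenAI's post: the `run` object and its fields (`devServerStarted`, `commands`, `usesTailwind`) are hypothetical and stand in for whatever your harness records.

```javascript
// Hypothetical check registry: each check is tagged with one of the four goals.
// The fields read from `run` are assumptions made for this illustration.
const checks = [
  { goal: "outcome", name: "dev server starts", test: (run) => run.devServerStarted === true },
  { goal: "process", name: "ran npm install", test: (run) => run.commands.some((c) => c.includes("npm install")) },
  { goal: "style", name: "uses Tailwind classes", test: (run) => run.usesTailwind === true },
  // Efficiency: fail if any command string appears more than once in the run.
  { goal: "efficiency", name: "no repeated commands", test: (run) => new Set(run.commands).size === run.commands.length },
];

// Group pass/fail results by goal so a regression points at a category.
function gradeByGoal(run) {
  const report = {};
  for (const { goal, name, test } of checks) {
    (report[goal] ??= []).push({ name, pass: test(run) });
  }
  return report;
}

const report = gradeByGoal({
  devServerStarted: true,
  commands: ["npm install", "npm run dev", "npm run dev"],
  usesTailwind: true,
});
// The efficiency check fails here because "npm run dev" ran twice.
console.log(report);
```

Grouping by goal means a failing run reads as "process regression" or "efficiency regression" rather than a single opaque fail.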

Small Prompt Sets Beat Large Benchmarks

You don't need a massive benchmark. 10-20 carefully designed prompts are enough to surface regressions. The key is covering four trigger types:

Explicit invocation — "Use the $skill-name skill" (tests direct activation)
Implicit triggering — Describes the scenario without naming the skill (tests name/description quality)
Contextual invocation — Adds real-world context (tests generalization)
Negative control — Adjacent request that should NOT trigger (tests over-triggering)
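
A prompt set covering the four trigger types can be as simple as an array of cases, each recording whether the skill should trigger. This is a sketch under assumptions: the skill name `$setup-demo-app`, the prompt wording, and the `observed` map are all invented for illustration.

```javascript
// Hypothetical prompt set: skill name and prompt wording are invented for illustration.
const promptSet = [
  { id: "explicit-1", type: "explicit", prompt: "Use the $setup-demo-app skill to scaffold a new project.", shouldTrigger: true },
  { id: "implicit-1", type: "implicit", prompt: "Set up a small React demo app with Tailwind.", shouldTrigger: true },
  { id: "contextual-1", type: "contextual", prompt: "I have a client meeting tomorrow and need a React demo styled with Tailwind.", shouldTrigger: true },
  { id: "negative-1", type: "negative", prompt: "Add Tailwind to my existing React app.", shouldTrigger: false },
];

// Compare expected triggering with what was observed in each run.
// `observed` maps prompt id -> boolean (did the skill trigger?).
function triggerMismatches(prompts, observed) {
  return prompts
    .filter((p) => observed[p.id] !== p.shouldTrigger)
    .map((p) => ({ id: p.id, type: p.type, expected: p.shouldTrigger, actual: observed[p.id] }));
}

// Example: the skill fired on the negative control, which signals over-triggering.
const mismatches = triggerMismatches(promptSet, {
  "explicit-1": true,
  "implicit-1": true,
  "contextual-1": true,
  "negative-1": true,
});
console.log(mismatches);
```

Keeping the trigger type on each case makes it easy to see which kind of prompt regressed when the skill's name or description changes.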

Two-Layer Evaluation Mechanism

Layer 1: Deterministic Graders (Fast & Explainable)

Use codex exec --json to capture structured event streams (JSONL). Then write simple checks:

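// Imports needed by the checks below (Node.js, ES modules).
import { existsSync } from "node:fs";
import path from "node:path";

// Process check: did any command in the trace run `npm install`?
// `events` is the parsed JSONL stream, one object per line.
function checkRanNpmInstall(events) {
  return events.some(e =>
    e.item?.type === "command_execution" &&
    typeof e.item.command === "string" &&
    e.item.command.includes("npm install")
  );
}

// Outcome check: does package.json exist in the project directory?
function checkPackageJsonExists(dir) {
  return existsSync(path.join(dir, "package.json"));
}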

Value: Every step is traceable. Regressions have precise explanations.
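
Before any check can run, the JSONL output has to be turned into an array of events. The sketch below shows one way to do that. It assumes only the event shape used in the checks above (`item.type` and `item.command`); the sample trace is made up, and `parseJsonl` and `listCommands` are our own hypothetical helper names.

```javascript
// Parse a JSONL string (one JSON object per line) into an array of events.
// Blank lines are skipped; malformed lines are collected instead of thrown.
function parseJsonl(text) {
  const events = [];
  const errors = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return;
    try {
      events.push(JSON.parse(line));
    } catch (err) {
      errors.push({ line: index + 1, message: err.message });
    }
  });
  return { events, errors };
}

// List executed commands in order, using the assumed event shape.
function listCommands(events) {
  return events
    .filter((e) => e.item?.type === "command_execution" && typeof e.item.command === "string")
    .map((e) => e.item.command);
}

// Made-up trace for illustration; a real trace would come from the saved JSONL output.
const sampleTrace = [
  '{"item":{"type":"command_execution","command":"npm install"}}',
  '{"item":{"type":"agent_message","text":"Dependencies installed."}}',
  '{"item":{"type":"command_execution","command":"npm run dev"}}',
].join("\n");

const { events, errors } = parseJsonl(sampleTrace);
console.log(listCommands(events), errors.length);
```

Collecting parse errors rather than throwing keeps one truncated line from hiding the rest of the trace.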

Layer 2: Rubric-Based Grading (Model-Assisted Quality Check)

For subjective qualities (code style, convention adherence), use --output-schema to constrain model output into structured JSON:

{
  "overall_pass": boolean,
  "score": 0-100,
  "checks": [
    {"id": "vite", "pass": true, "notes": "..."},
    {"id": "tailwind", "pass": true, "notes": "..."}
  ]
}

Value: Quantifiable, comparable, trackable over time.
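
Even with a schema constraining the grader, it is worth validating the result before storing it. This is a minimal sketch that checks the shape shown above; `validateRubricResult` is a hypothetical helper of ours, not part of any OpenAI tooling.

```javascript
// Validate a rubric result against the shape shown above.
// Returns a list of problems; an empty list means the result is well-formed.
function validateRubricResult(result) {
  const problems = [];
  if (typeof result?.overall_pass !== "boolean") {
    problems.push("overall_pass must be a boolean");
  }
  if (!Number.isFinite(result?.score) || result.score < 0 || result.score > 100) {
    problems.push("score must be a number from 0 to 100");
  }
  if (!Array.isArray(result?.checks)) {
    problems.push("checks must be an array");
  } else {
    result.checks.forEach((check, i) => {
      if (typeof check?.id !== "string") problems.push(`checks[${i}].id must be a string`);
      if (typeof check?.pass !== "boolean") problems.push(`checks[${i}].pass must be a boolean`);
      if (typeof check?.notes !== "string") problems.push(`checks[${i}].notes must be a string`);
    });
  }
  return problems;
}

// Example with invented values: the out-of-range score is reported.
console.log(validateRubricResult({
  overall_pass: true,
  score: 140,
  checks: [{ id: "vite", pass: true, notes: "Vite config present" }],
}));
```

Rejecting malformed results up front keeps scores comparable across runs, which is the point of tracking them over time.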

Key Insights

1. Define success BEFORE writing the skill. This is product thinking applied to engineering: write the "Definition of Done" first.

2. Small datasets are powerful. 10-20 prompts, if well-designed, cover the core scenarios. Grow the set as you encounter real failures.

3. Two layers complement each other. Deterministic checks catch "did it do the basics?" — rubric grading answers "did it do it right?"

4. Event streams enable white-box debugging. JSONL traces transform debugging from "the output looks wrong" to "step 3 ran the wrong command."
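
To make the "step 3 ran the wrong command" idea concrete, the sketch below compares the commands a run actually executed with an expected sequence and reports the first step where they differ. The function name and the exact-match comparison are our assumptions; real skills may need looser matching, such as prefixes or patterns.

```javascript
// Find the first step where the actual command sequence departs from the expected one.
// Returns null when the sequences match exactly.
function firstDivergence(expected, actual) {
  const steps = Math.max(expected.length, actual.length);
  for (let i = 0; i < steps; i++) {
    if (expected[i] !== actual[i]) {
      return {
        step: i + 1, // 1-based, to match how people talk about steps
        expected: expected[i] ?? null, // null means no command was expected here
        actual: actual[i] ?? null, // null means the run stopped early
      };
    }
  }
  return null;
}

// Invented example: the run used "npm start" where "npm run dev" was expected.
const divergence = firstDivergence(
  ["npm install", "npm run build", "npm run dev"],
  ["npm install", "npm run build", "npm start"]
);
console.log(divergence);
```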

Applying This to SkillsAgent

SkillsAgent currently has 45,000+ skills. Here's how OpenAI's framework maps to our platform:

SkillsAgent Dimension	OpenAI Goal	Mapping
Usability (25%)	Outcome + Process	Task completion + step correctness
Structure Completeness (15%)	Style	Output convention adherence
Instruction Clarity (20%)	Process	Step executability
Reproducibility (10%)	Outcome + Efficiency	Result stability + no waste
Professional Depth (20%)	Style	Methodology depth
Differentiation (10%)	Outcome	Unique value delivery

Action items:

Build eval prompt sets for top 5 role scenarios (PM, AI/ML Engineer, Full-Stack Dev, etc.)
Implement JSONL event stream parsing + deterministic checks
Integrate rubric grading for subjective quality assessment
Target: skills scoring ≥ 8.5 pass evaluation automatically
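
The dimension weights in the table above and the ≥ 8.5 target can be combined into one weighted score. The sketch below assumes each dimension is scored on a 0 to 10 scale, which the 8.5 threshold suggests but which is not stated above; the property names and sample scores are hypothetical.

```javascript
// Dimension weights from the mapping table (they sum to 1.0).
const WEIGHTS = {
  usability: 0.25,
  structureCompleteness: 0.15,
  instructionClarity: 0.20,
  reproducibility: 0.10,
  professionalDepth: 0.20,
  differentiation: 0.10,
};

// Assumption: scores and the threshold share a 0-10 scale.
const PASS_THRESHOLD = 8.5;

// Weighted sum of per-dimension scores, rounded to two decimals.
function weightedScore(scores) {
  let total = 0;
  for (const [dimension, weight] of Object.entries(WEIGHTS)) {
    const value = scores[dimension];
    if (!Number.isFinite(value) || value < 0 || value > 10) {
      throw new RangeError(`${dimension} must be a number from 0 to 10`);
    }
    total += weight * value;
  }
  return Math.round(total * 100) / 100;
}

// Invented sample scores for one skill.
const score = weightedScore({
  usability: 9,
  structureCompleteness: 8,
  instructionClarity: 9,
  reproducibility: 8,
  professionalDepth: 8.5,
  differentiation: 8,
});
console.log(score, score >= PASS_THRESHOLD);
```

Throwing on a missing or out-of-range dimension prevents a skill from passing automatically because one score was never recorded.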

Bottom Line

The shift from "vibes" to "proof" is the single most important evolution in skill development. Run the agent, record what happened, grade with small targeted checks. Once that loop exists, every iteration becomes evidence-based.