OpenAI's Skill Evaluation Framework: From Vibes to Proof
The Problem: "Does It Feel Better?" vs "Is It Actually Better?"
When you're iterating on an Agent skill, it's hard to tell whether you're improving it or just changing its behavior. One version feels faster, another seems more reliable — and then a regression slips in: the skill doesn't trigger, skips a required step, or leaves extra files behind.
OpenAI's official blog post "Testing Agent Skills Systematically with Evals" lays out a concrete framework for this problem. Below is our analysis of that framework and how it applies to SkillsAgent.
The Four Goals Framework
OpenAI suggests splitting success checks into four categories before writing any skill:
| Goal Type | Core Question | Example Check |
|---|---|---|
| Outcome | Did the task complete? | Does npm run dev start successfully? |
| Process | Did it follow expected steps? | Did it run npm install? |
| Style | Does output follow conventions? | Does it use Tailwind classes rather than CSS modules? |
| Efficiency | Was anything wasted? | Are there command loops or token bloat? |
This covers the full chain: result → process → quality → efficiency.
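One way to keep the four categories visible in a test harness is to tag each check with its goal type and report pass counts per category. The following is a minimal sketch, not something taken from OpenAI's post: the check names, the fields of the run object (devServerStarted, commands, usesTailwind), and the 20-command budget are all hypothetical.

```javascript
// Hypothetical run summary; a real harness would fill this in from a recorded run.
const run = {
  devServerStarted: true,
  commands: ["npm install", "npm run dev"],
  usesTailwind: true,
};

// Each check is tagged with one of the four goal types.
const checks = [
  { goal: "outcome", name: "dev server starts", test: r => r.devServerStarted },
  { goal: "process", name: "ran npm install", test: r => r.commands.some(c => c.includes("npm install")) },
  { goal: "style", name: "uses Tailwind classes", test: r => r.usesTailwind },
  { goal: "efficiency", name: "at most 20 commands", test: r => r.commands.length <= 20 },
];

// Count passed and total checks per goal type.
function summarizeByGoal(checkList, runSummary) {
  const summary = {};
  for (const check of checkList) {
    const entry = (summary[check.goal] ??= { passed: 0, total: 0 });
    entry.total += 1;
    if (check.test(runSummary)) entry.passed += 1;
  }
  return summary;
}

console.log(summarizeByGoal(checks, run));
```

Grouping results this way makes it easy to see which kind of goal a regression belongs to, for example a run that still completes (outcome) but starts wasting commands (efficiency).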
Small Prompt Sets Beat Large Benchmarks
You don't need a massive benchmark. 10-20 carefully designed prompts are enough to surface regressions. The key is covering four trigger types:
- Explicit invocation — "Use the $skill-name skill" (tests direct activation)
- Implicit triggering — Describes the scenario without naming the skill (tests name/description quality)
- Contextual invocation — Adds real-world context (tests generalization)
- Negative control — Adjacent request that should NOT trigger (tests over-triggering)
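The four trigger types can be captured as plain data, with each prompt recording whether the skill is expected to trigger. The sketch below is illustrative only: the skill name setup-demo-app, the prompt texts, and the field names (id, kind, shouldTrigger) are hypothetical, and how triggering is observed in a real run is left outside the example.

```javascript
// Hypothetical prompt set: one entry per trigger type for an imaginary
// "setup-demo-app" skill. A real set would hold 10-20 such rows.
const promptSet = [
  { id: "explicit-01", kind: "explicit", shouldTrigger: true,
    prompt: "Use the $setup-demo-app skill to create a new project." },
  { id: "implicit-01", kind: "implicit", shouldTrigger: true,
    prompt: "Scaffold a small React app with Vite and Tailwind." },
  { id: "context-01", kind: "contextual", shouldTrigger: true,
    prompt: "I need a quick demo UI for Friday's review; set up a React + Tailwind starter in this repo." },
  { id: "negative-01", kind: "negative", shouldTrigger: false,
    prompt: "Add a Tailwind class to the existing header component." },
];

// Compare expected triggering with what was observed in each run.
// `observed` maps prompt id -> whether the skill actually triggered.
function scoreTriggers(prompts, observed) {
  const failures = prompts.filter(p => observed[p.id] !== p.shouldTrigger);
  return {
    total: prompts.length,
    passed: prompts.length - failures.length,
    failedIds: failures.map(p => p.id),
  };
}

// Example: the skill over-triggered on the negative control.
const observed = {
  "explicit-01": true,
  "implicit-01": true,
  "context-01": true,
  "negative-01": true,
};
console.log(scoreTriggers(promptSet, observed));
```

Keeping the expected outcome next to each prompt means an over-triggering regression shows up as a named failing row instead of a general impression.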
Two-Layer Evaluation Mechanism
Layer 1: Deterministic Graders (Fast & Explainable)
Use codex exec --json to capture structured event streams (JSONL). Then write simple checks:
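// Node.js built-ins used by the file check below.
const { existsSync } = require("node:fs");
const path = require("node:path");

// Did the run execute npm install? `events` is the parsed JSONL event array.
function checkRanNpmInstall(events) {
  return events.some(e =>
    e.item?.type === "command_execution" &&
    // Optional chaining guards against events that carry no command string.
    e.item.command?.includes("npm install")
  );
}

// Does package.json exist in the project directory?
function checkPackageJsonExists(dir) {
  return existsSync(path.join(dir, "package.json"));
}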
Value: Every step is traceable. Regressions have precise explanations.
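The checks above expect an array of already-parsed events. A minimal sketch of turning captured JSONL text into that array follows. It assumes one JSON object per line, and the sample event fields are hypothetical stand-ins modeled on the checks above, so the real stream may differ. The helper name parseJsonl is also an illustration, not part of any CLI.

```javascript
// Parse JSONL text (one JSON object per line) into an array of events.
// Blank lines are skipped; lines that fail to parse are reported separately.
function parseJsonl(text) {
  const events = [];
  const badLines = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return;
    try {
      events.push(JSON.parse(line));
    } catch {
      badLines.push(index + 1); // 1-based line number of the malformed line
    }
  });
  return { events, badLines };
}

// Hypothetical sample trace; real field names may differ from these.
const sample = [
  '{"type":"item.completed","item":{"type":"command_execution","command":"npm install"}}',
  '{"type":"item.completed","item":{"type":"agent_message","text":"Done."}}',
  "not json",
].join("\n");

const { events, badLines } = parseJsonl(sample);
const ranInstall = events.some(e =>
  e.item?.type === "command_execution" && e.item.command?.includes("npm install"));
console.log({ eventCount: events.length, badLines, ranInstall });
```

Collecting malformed lines instead of throwing keeps one truncated line from hiding the rest of the trace, which matters when a run is interrupted partway through.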
Layer 2: Rubric-Based Grading (Model-Assisted Quality Check)
For subjective qualities (code style, convention adherence), use --output-schema to constrain model output into structured JSON:
{
  "overall_pass": boolean,
  "score": 0-100,
  "checks": [
    {"id": "vite", "pass": true, "notes": "..."},
    {"id": "tailwind", "pass": true, "notes": "..."}
  ]
}
Value: Quantifiable, comparable, trackable over time.
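Before a rubric result is stored for tracking over time, it may be worth confirming that it really has the shape shown above. The following sketch is one possible validator under that assumption; the function name validateRubricResult and the sample objects are hypothetical, and it checks only the fields that appear in the example.

```javascript
// Return a list of problems found in a rubric result; an empty list means
// the object matches the shape shown above.
function validateRubricResult(result) {
  if (result === null || typeof result !== "object") {
    return ["result is not an object"];
  }
  const problems = [];
  if (typeof result.overall_pass !== "boolean") {
    problems.push("overall_pass must be a boolean");
  }
  if (!Number.isFinite(result.score) || result.score < 0 || result.score > 100) {
    problems.push("score must be a number from 0 to 100");
  }
  if (!Array.isArray(result.checks)) {
    problems.push("checks must be an array");
    return problems;
  }
  result.checks.forEach((check, i) => {
    if (check === null || typeof check !== "object") {
      problems.push(`checks[${i}] must be an object`);
      return;
    }
    if (typeof check.id !== "string") problems.push(`checks[${i}].id must be a string`);
    if (typeof check.pass !== "boolean") problems.push(`checks[${i}].pass must be a boolean`);
    if (typeof check.notes !== "string") problems.push(`checks[${i}].notes must be a string`);
  });
  return problems;
}

// Hypothetical grader outputs used only to exercise the validator.
const good = { overall_pass: true, score: 90, checks: [{ id: "vite", pass: true, notes: "ok" }] };
const bad = { overall_pass: "yes", score: 140, checks: [{ id: "tailwind", pass: true }] };
console.log(validateRubricResult(good));
console.log(validateRubricResult(bad));
```

Returning a list of problems instead of a single boolean keeps the grader's own failures explainable, in the same spirit as the deterministic checks.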
Key Insights
1. Define success BEFORE writing the skill. This is product thinking applied to engineering: write the "Definition of Done" first.
2. Small datasets are powerful. 10-20 prompts, if well-designed, cover the core scenarios. Grow the set as you encounter real failures.
3. Two layers complement each other. Deterministic checks catch "did it do the basics?" — rubric grading answers "did it do it right?"
4. Event streams enable white-box debugging. JSONL traces transform debugging from "the output looks wrong" to "step 3 ran the wrong command."
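To illustrate the fourth point, the sketch below lists the commands a run executed in order and flags any command repeated more often than a threshold, which is a rough signal for a command loop. It assumes events shaped like those in the deterministic checks earlier; the sample trace, the function names, and the default threshold of 2 are hypothetical.

```javascript
// Extract executed commands in order. Step numbers count command
// executions only, starting at 1; other event types are ignored.
function commandSteps(events) {
  return events
    .filter(e => e.item?.type === "command_execution" && typeof e.item.command === "string")
    .map((e, i) => ({ step: i + 1, command: e.item.command }));
}

// Report commands that ran more than `maxRepeats` times, a rough loop signal.
function findRepeatedCommands(steps, maxRepeats = 2) {
  const counts = new Map();
  for (const { command } of steps) {
    counts.set(command, (counts.get(command) ?? 0) + 1);
  }
  return [...counts]
    .filter(([, count]) => count > maxRepeats)
    .map(([command, count]) => ({ command, count }));
}

// Hypothetical trace in which the build command is retried twice.
const trace = [
  { item: { type: "command_execution", command: "npm install" } },
  { item: { type: "command_execution", command: "npm run build" } },
  { item: { type: "agent_message", text: "Retrying build." } },
  { item: { type: "command_execution", command: "npm run build" } },
  { item: { type: "command_execution", command: "npm run build" } },
];

const steps = commandSteps(trace);
console.log(steps);
console.log(findRepeatedCommands(steps));
```

With numbered steps, a failure report can point at a specific position in the run, and the repeat count turns "it seemed to loop" into a number that can be compared across versions.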
Applying This to SkillsAgent
SkillsAgent currently has 45,000+ skills. Here's how OpenAI's framework maps to our platform:
| SkillsAgent Dimension | OpenAI Goal | Mapping |
|---|---|---|
| Usability (25%) | Outcome + Process | Task completion + step correctness |
| Structure Completeness (15%) | Style | Output convention adherence |
| Instruction Clarity (20%) | Process | Step executability |
| Reproducibility (10%) | Outcome + Efficiency | Result stability + no waste |
| Professional Depth (20%) | Style | Methodology depth |
| Differentiation (10%) | Outcome | Unique value delivery |
Action items:
- Build eval prompt sets for top 5 role scenarios (PM, AI/ML Engineer, Full-Stack Dev, etc.)
- Implement JSONL event stream parsing + deterministic checks
- Integrate rubric grading for subjective quality assessment
- Target: skills scoring ≥ 8.5 pass evaluation automatically
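The dimension weights in the mapping table sum to 100%, so they can be combined into a single score. The sketch below assumes each dimension is scored on a 0-10 scale and aggregated as a weighted average. Neither the scale nor the aggregation method is stated here, so both are assumptions suggested by the 8.5 threshold. The sample scores are hypothetical.

```javascript
// Dimension weights from the mapping table (they sum to 1.0).
const weights = {
  usability: 0.25,
  structureCompleteness: 0.15,
  instructionClarity: 0.20,
  reproducibility: 0.10,
  professionalDepth: 0.20,
  differentiation: 0.10,
};

// Weighted average of per-dimension scores. Assumes every score is on a
// 0-10 scale; that scale is an assumption, not stated in the article.
function overallScore(scores) {
  let total = 0;
  for (const [dimension, weight] of Object.entries(weights)) {
    const score = scores[dimension];
    if (!Number.isFinite(score) || score < 0 || score > 10) {
      throw new RangeError(`missing or out-of-range score for ${dimension}`);
    }
    total += weight * score;
  }
  return total;
}

const PASS_THRESHOLD = 8.5; // threshold from the action items

// Hypothetical scores for one skill.
const scores = {
  usability: 9,
  structureCompleteness: 8,
  instructionClarity: 9,
  reproducibility: 8,
  professionalDepth: 8,
  differentiation: 9,
};

const score = overallScore(scores);
console.log(score.toFixed(2), score >= PASS_THRESHOLD ? "pass" : "needs review");
```

For these sample values the weighted sum is 2.25 + 1.2 + 1.8 + 0.8 + 1.6 + 0.9 = 8.55, just above the threshold. A missing dimension raises an error so that it cannot silently count as zero.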
Bottom Line
The shift from "vibes" to "proof" is the single most important evolution in skill development. Run the agent, record what happened, grade with small targeted checks. Once that loop exists, every iteration becomes evidence-based.