Skill Creator

A skill that builds — and benchmarks — skills.

A skill is written once and invoked thousands of times. A draft that triggers on the wrong prompts, or quietly under-performs the model’s own baseline, taxes every future call. Skill Creator turns “I want a skill for X” into a tested SKILL.md by running a tight loop: draft, run it head-to-head against a no-skill baseline, grade the results, rewrite.

Anthropic · progressive disclosure · evals + baseline · variance analysis · description tuning

The loop

Draft  →  Run with-skill + baseline  →  Grade & benchmark  →  Review  →  Rewrite  →  repeat

Capture intent — what should it enable, when should it trigger, what’s the output format.
Draft the SKILL.md, then write 2–3 realistic test prompts.
Run every prompt twice — once with the skill, once as a no-skill baseline — in the same turn.
Grade & benchmark — pass-rate, time and tokens, mean ± stddev, with the delta over baseline.
Review & rewrite — read the user’s feedback, generalise it, cut what isn’t pulling its weight.

Key concept — progressive disclosure

A skill loads in three levels, so a large skill costs almost nothing until it’s actually needed:

1 · Metadata   name + description   (~100 words, always in context)
2 · SKILL.md   the instructions     (loaded when the skill triggers, <500 lines ideal)
3 · Resources  scripts / refs       (read or executed only as needed, unlimited)

The description is the one part that’s always loaded — which is why it, not the body, decides whether the skill fires at all.
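The three levels can be pictured as lazy loading. The sketch below is only an illustration of that idea, not Skill Creator's actual loader: the `LazySkill` class and its field names are hypothetical, and the loaders are plain callables standing in for file reads.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Optional


@dataclass
class LazySkill:
    """Hypothetical model of the three loading levels (illustration only)."""
    name: str                      # level 1: always in context
    description: str               # level 1: always in context
    load_body: Callable[[], str]   # level 2: called only when the skill triggers
    resources: Dict[str, Callable[[], str]] = field(default_factory=dict)  # level 3
    _body: Optional[str] = None

    def metadata(self) -> str:
        # The only text that is paid for on every call.
        return f"{self.name}: {self.description}"

    def body(self) -> str:
        # Load the SKILL.md instructions at most once, on first trigger.
        if self._body is None:
            self._body = self.load_body()
        return self._body

    def resource(self, key: str) -> str:
        # Each resource is fetched individually, and only when asked for.
        return self.resources[key]()
```

Until `body()` is called, the only cost is the short string returned by `metadata()`, which is the reason a large skill stays cheap while idle.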

Worked example — “I want to make a skill for X”

Narrow it down. Skill Creator asks the four intent questions and confirms the output format.
Draft + test prompts. It writes a draft and proposes “Here are a few test cases I’d like to try — do these look right?”
Run head-to-head. Each prompt spawns a with-skill run and a baseline run into iteration-1/eval-0/….
Grade and open the viewer. generate_review.py opens two tabs — Outputs to click through and leave feedback, Benchmark for the numbers.
Rewrite from feedback and rerun into iteration-2/, passing --previous-workspace so the viewer shows the diff. Repeat until the feedback comes back empty.
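The steps above amount to a loop that stops when a review yields no feedback. A minimal sketch, assuming two hypothetical callables (`run_and_review` and `rewrite`) that stand in for the eval run and the rewrite step; the `max_iterations` cap is an added safety assumption, not something the workflow above specifies.

```python
from typing import Callable, List


def improve(
    draft: str,
    run_and_review: Callable[[str, int], List[str]],  # hypothetical: returns feedback items
    rewrite: Callable[[str, List[str]], str],         # hypothetical: returns a new draft
    max_iterations: int = 5,                          # assumed cap, for safety only
) -> str:
    """Iterate draft -> run -> review -> rewrite until feedback comes back empty."""
    for iteration in range(1, max_iterations + 1):
        feedback = run_and_review(draft, iteration)   # results would land in iteration-<n>/
        if not feedback:
            break                                     # empty feedback ends the loop
        draft = rewrite(draft, feedback)
    return draft
```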

Under the hood

Anatomy of a skill

skill-name/
├── SKILL.md          (required: YAML frontmatter + Markdown body)
└── (optional)
    ├── scripts/      executable code for deterministic, repeated work
    ├── references/   docs pulled into context only when needed
    └── assets/       templates, icons, fonts used in the output

If three test runs all independently write the same helper script, that’s the signal to bundle it under scripts/ and write it once.
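A quick sanity check on the required file can catch a malformed draft before any eval runs. The function below is a hypothetical helper, not part of Skill Creator. It assumes the frontmatter is delimited by a pair of `---` lines and carries `name` and `description` keys, and it uses a naive line scan rather than a real YAML parser.

```python
from typing import List


def check_skill_md(text: str) -> List[str]:
    """Return a list of problems found in SKILL.md text; empty means none found."""
    lines = [ln.rstrip() for ln in text.splitlines()]
    if not lines or lines[0] != "---":
        return ["SKILL.md does not start with YAML frontmatter"]
    try:
        end = lines.index("---", 1)  # closing delimiter of the frontmatter
    except ValueError:
        return ["frontmatter is never closed"]

    problems: List[str] = []
    # Naive key scan: good enough for flat 'key: value' frontmatter.
    keys = {ln.split(":", 1)[0].strip() for ln in lines[1:end] if ":" in ln}
    for required in ("name", "description"):
        if required not in keys:
            problems.append(f"frontmatter lacks '{required}'")
    if not any(ln.strip() for ln in lines[end + 1:]):
        problems.append("Markdown body is empty")
    if len(lines) >= 500:
        problems.append(f"SKILL.md has {len(lines)} lines; under 500 is ideal")
    return problems
```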

The eval workspace

Results live in <skill-name>-workspace/, organised by iteration, then by test case:

iteration-1/
├── eval-0-descriptive-name/
│   ├── with_skill/outputs/
│   ├── without_skill/outputs/   (baseline; old_skill/ when improving)
│   ├── eval_metadata.json        (prompt + assertions)
│   ├── grading.json              (text / passed / evidence per assertion)
│   └── timing.json               (total_tokens, duration_ms)
└── benchmark.json                (pass_rate, time, tokens — mean ± stddev + delta)

Test prompts are saved to evals/evals.json first; assertions are drafted while the runs are in progress. Good assertions are objectively verifiable and clearly named.
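To make the benchmark numbers concrete, here is a minimal sketch of the aggregation. It assumes each run's grading is a list of assertion records with a boolean `passed` field; the real grading.json and benchmark.json schemas may differ, and the function names are illustrative.

```python
from statistics import mean, stdev
from typing import Dict, List


def pass_rate(grading: List[dict]) -> float:
    """Fraction of assertions that passed in one run (assumes at least one assertion)."""
    return sum(1 for a in grading if a["passed"]) / len(grading)


def summarise(values: List[float]) -> Dict[str, float]:
    # Sample standard deviation needs two or more values; report 0.0 otherwise.
    return {
        "mean": mean(values),
        "stddev": stdev(values) if len(values) > 1 else 0.0,
    }


def benchmark(with_skill: List[float], baseline: List[float]) -> Dict[str, object]:
    """Compare one metric (pass rate, duration, or tokens) across the two configurations."""
    ws, base = summarise(with_skill), summarise(baseline)
    return {
        "with_skill": ws,
        "without_skill": base,
        "delta": ws["mean"] - base["mean"],  # positive means the skill scored higher
    }
```

The same `benchmark` call works for pass rate, `duration_ms`, and `total_tokens`. For the last two, a negative delta is the good direction.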

Description optimisation

The description is the primary trigger, so there’s a dedicated loop for it:

python -m scripts.run_loop \
  --eval-set trigger-eval.json \
  --skill-path <path> \
  --model <session-model-id> \
  --max-iterations 5

It splits 20 should/should-not-trigger queries 60/40 into train (12) and held-out test (8), and runs each query three times for a stable trigger rate. It then proposes description rewrites and picks the best_description by test score rather than train score, to avoid overfitting. The hardest negatives are near-misses that share keywords but need a different tool.
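The split-and-select logic can be sketched as follows. This is not the code in scripts.run_loop: the query shape, the per-label stratified split, the 0.5 firing threshold, and all function names are assumptions made for illustration.

```python
import random
from typing import Dict, List, Tuple

Query = Dict[str, object]  # assumed shape: {"query": str, "should_trigger": bool}


def split_queries(
    queries: List[Query], train_frac: float = 0.6, seed: int = 0
) -> Tuple[List[Query], List[Query]]:
    """Split positives and negatives separately so both sets keep the same mix."""
    rng = random.Random(seed)
    train: List[Query] = []
    test: List[Query] = []
    for label in (True, False):
        group = [q for q in queries if q["should_trigger"] is label]
        rng.shuffle(group)
        cut = round(len(group) * train_frac)
        train += group[:cut]
        test += group[cut:]
    return train, test


def trigger_rate(fired: List[bool]) -> float:
    """Share of repeated runs in which the skill fired for one query."""
    return sum(fired) / len(fired)


def score(results: List[Tuple[bool, List[bool]]], threshold: float = 0.5) -> float:
    """Fraction of queries whose observed behaviour matches should_trigger."""
    correct = sum(
        1 for should, fired in results if (trigger_rate(fired) >= threshold) == should
    )
    return correct / len(results)


def pick_best(candidates: List[dict]) -> dict:
    # Select on the held-out test score; the train score only guides rewrites.
    return max(candidates, key=lambda c: c["test_score"])
```

Selecting on `test_score` means a description that merely memorises the training queries gains nothing, which is the point of holding the test set out.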