
Skill Creator

A skill that builds — and benchmarks — skills.

A skill is written once and invoked thousands of times. A draft that triggers on the wrong prompts, or quietly under-performs the model’s own baseline, taxes every future call. Skill Creator turns “I want a skill for X” into a tested SKILL.md by running a tight loop: draft, run it head-to-head against a no-skill baseline, grade the results, rewrite.

Anthropic · progressive disclosure · evals + baseline · variance analysis · description tuning

The loop

Draft → Run with-skill + baseline → Grade & benchmark → Review → Rewrite → repeat
  • Capture intent — what should it enable, when should it trigger, what’s the output format.
  • Draft the SKILL.md, then write 2–3 realistic test prompts.
  • Run every prompt twice — once with the skill, once as a no-skill baseline — in the same turn.
  • Grade & benchmark — pass-rate, time and tokens, mean ± stddev, with the delta over baseline.
  • Review & rewrite — read the user’s feedback, generalise it, cut what isn’t pulling its weight.

Key concept — progressive disclosure

A skill loads in three levels, so a large skill costs almost nothing until it’s actually needed:

1 · Metadata: name + description (~100 words, always in context)
2 · SKILL.md: the instructions (loaded when the skill triggers, <500 lines ideal)
3 · Resources: scripts / refs (read or executed only as needed, unlimited)

The description is the one part that’s always loaded — which is why it, not the body, decides whether the skill fires at all.
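The first level can be illustrated with a minimal sketch that reads only the frontmatter of a SKILL.md and never touches the body. It assumes the frontmatter is a `---`-delimited block of flat `key: value` lines; the function name `read_metadata` and the sample skill are hypothetical and not part of Skill Creator.

```python
def read_metadata(skill_md_text: str) -> dict[str, str]:
    """Return the frontmatter fields of a SKILL.md without reading the body.

    Assumes flat `key: value` lines between two '---' markers; nested or
    multi-line YAML would need a real YAML parser.
    """
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        return {}
    metadata: dict[str, str] = {}
    for line in lines[1:]:
        if line.strip() == "---":  # end of frontmatter: stop before the body
            break
        key, sep, value = line.partition(":")
        if sep:
            metadata[key.strip()] = value.strip()
    return metadata


# Hypothetical skill, for illustration only.
SAMPLE = """---
name: csv-cleaner
description: Clean and normalise messy CSV files. Use when the user uploads a CSV.
---
# Instructions
(the body is only loaded once the skill triggers)
"""

meta = read_metadata(SAMPLE)
print(meta["name"])
print(len(meta["description"].split()), "words in the description")
```

Only `meta` would sit in context at level 1; the instructions below the second `---` stay unread until the description causes the skill to trigger.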

Worked example — “I want to make a skill for X”

  1. Narrow it down. Skill Creator asks the four intent questions and confirms the output format.
  2. Draft + test prompts. It writes a draft and proposes “Here are a few test cases I’d like to try — do these look right?”
  3. Run head-to-head. Each prompt spawns a with-skill run and a baseline run into iteration-1/eval-0/….
  4. Grade and open the viewer. generate_review.py opens two tabs — Outputs to click through and leave feedback, Benchmark for the numbers.
  5. Rewrite from feedback and rerun into iteration-2/, passing --previous-workspace so the viewer shows the diff. Repeat until the feedback comes back empty.

Under the hood

Anatomy of a skill
skill-name/
├── SKILL.md         (required: YAML frontmatter + Markdown body)
└── (optional)
    ├── scripts/     executable code for deterministic, repeated work
    ├── references/  docs pulled into context only when needed
    └── assets/      templates, icons, fonts used in the output

If three test runs all independently write the same helper script, that’s the signal to bundle it under scripts/ and write it once.
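As a rough sketch, the layout above could be scaffolded as follows. `scaffold_skill` is a hypothetical helper, not a Skill Creator script, and it assumes all three optional directories are wanted and that the frontmatter needs only `name` and `description`.

```python
import tempfile
from pathlib import Path


def scaffold_skill(root: Path, name: str, description: str) -> Path:
    """Create an empty skill directory with the required and optional parts."""
    skill_dir = root / name
    for sub in ("scripts", "references", "assets"):
        (skill_dir / sub).mkdir(parents=True, exist_ok=True)
    # Minimal SKILL.md: frontmatter first, then a placeholder Markdown body.
    (skill_dir / "SKILL.md").write_text(
        f"---\nname: {name}\ndescription: {description}\n---\n\n# {name}\n",
        encoding="utf-8",
    )
    return skill_dir


# Build into a throwaway directory so the example leaves nothing behind.
with tempfile.TemporaryDirectory() as tmp:
    skill = scaffold_skill(Path(tmp), "csv-cleaner", "Clean messy CSV files.")
    created = sorted(p.name for p in skill.iterdir())
    print(created)
```

In practice the optional directories are better created only once something needs to live in them, for example the shared helper script mentioned above.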

The eval workspace

Results live in <skill-name>-workspace/, organised by iteration, then by test case:

iteration-1/
├── eval-0-descriptive-name/
│   ├── with_skill/outputs/
│   ├── without_skill/outputs/   (baseline; old_skill/ when improving)
│   ├── eval_metadata.json       (prompt + assertions)
│   ├── grading.json             (text / passed / evidence per assertion)
│   └── timing.json              (total_tokens, duration_ms)
└── benchmark.json               (pass_rate, time, tokens: mean ± stddev + delta)

Test prompts are saved to evals/evals.json first; assertions are drafted while the runs are in progress. Good assertions are objectively verifiable and clearly named.
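The benchmark arithmetic can be sketched as below. It assumes each grading result is a list of assertion objects carrying the `text`, `passed` and `evidence` fields named above; the exact JSON schema, the helper names and the data are illustrative assumptions, not the real benchmark script.

```python
from statistics import mean, stdev


def pass_rate(assertions: list[dict]) -> float:
    """Fraction of assertions in one grading result that passed."""
    return sum(1 for a in assertions if a["passed"]) / len(assertions)


def summarise(rates: list[float]) -> dict[str, float]:
    """Mean and sample standard deviation across evals."""
    return {
        "mean": mean(rates),
        "stddev": stdev(rates) if len(rates) > 1 else 0.0,
    }


def graded(*results: bool) -> list[dict]:
    """Build a hypothetical grading result from pass/fail flags."""
    return [
        {"text": f"assertion {i}", "passed": ok, "evidence": "(illustrative)"}
        for i, ok in enumerate(results)
    ]


# Illustrative data only: three evals, two assertions each.
with_skill = [graded(True, True), graded(True, False), graded(True, True)]
baseline = [graded(True, False), graded(False, True), graded(False, False)]

skill_stats = summarise([pass_rate(g) for g in with_skill])
base_stats = summarise([pass_rate(g) for g in baseline])
delta = skill_stats["mean"] - base_stats["mean"]

print(f"with skill: {skill_stats['mean']:.2f} ± {skill_stats['stddev']:.2f}")
print(f"baseline:   {base_stats['mean']:.2f} ± {base_stats['stddev']:.2f}")
print(f"delta:      {delta:+.2f}")
```

Reporting the standard deviation next to the mean matters because a positive delta smaller than the run-to-run spread is weak evidence that the skill helps.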

Description optimisation

The description is the primary trigger, so there’s a dedicated loop for it:

python -m scripts.run_loop \
  --eval-set trigger-eval.json \
  --skill-path <path> \
  --model <session-model-id> \
  --max-iterations 5

It splits 20 should/should-not-trigger queries 60/40 into train and held-out test, runs each query three times for a stable trigger rate, proposes description rewrites, and picks the best_description by test score — not train — to avoid overfitting. The hardest negatives are near-misses that share keywords but need a different tool.
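The split-and-select logic described above might look roughly like this. It is a sketch under stated assumptions, not the code in `scripts.run_loop`: the `fires` callback stands in for actually running the model, the 0.5 trigger-rate threshold and the accuracy metric are guesses, and the queries and candidate descriptions are invented.

```python
import random
from typing import Callable

Query = dict  # {"query": str, "should_trigger": bool}


def split_queries(queries: list[Query], train_fraction: float = 0.6, seed: int = 0):
    """Shuffle deterministically, then split into train and held-out test."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def accuracy(description: str, queries: list[Query],
             fires: Callable[[str, str], bool], runs: int = 3) -> float:
    """Share of queries whose trigger behaviour matches should_trigger."""
    correct = 0
    for q in queries:
        # Repeat each query to smooth out run-to-run noise in triggering.
        rate = sum(fires(description, q["query"]) for _ in range(runs)) / runs
        if (rate >= 0.5) == q["should_trigger"]:
            correct += 1
    return correct / len(queries)


def toy_fires(description: str, query: str) -> bool:
    """Hypothetical stand-in for the model: fire on any shared word."""
    return bool(set(description.lower().split()) & set(query.lower().split()))


queries = (
    [{"query": f"clean this csv file {i}", "should_trigger": True} for i in range(10)]
    + [{"query": f"summarise this pdf report {i}", "should_trigger": False} for i in range(10)]
)
train, test = split_queries(queries)

candidates = ["csv cleaning", "file report"]
# Select on the held-out test set, not the train set, to limit overfitting.
best_description = max(candidates, key=lambda d: accuracy(d, test, toy_fires))
print(len(train), len(test), best_description)
```

Here "file report" shares a keyword with every query, so it also fires on the negatives; scoring on held-out queries is what exposes that kind of over-broad description.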