Skill Creator
A skill that builds — and benchmarks — skills.
A skill is written once and invoked thousands of times. A draft that triggers on the wrong prompts, or quietly under-performs the model’s own baseline, taxes every future call. Skill Creator turns “I want a skill for X” into a tested SKILL.md by running a tight loop: draft, run it head-to-head against a no-skill baseline, grade the results, rewrite.
The loop
Draft → Run with-skill + baseline → Grade & benchmark → Review → Rewrite → repeat
- Capture intent: what should it enable, when should it trigger, what’s the output format.
- Draft the SKILL.md, then write 2–3 realistic test prompts.
- Run every prompt twice — once with the skill, once as a no-skill baseline — in the same turn.
- Grade & benchmark — pass-rate, time and tokens, mean ± stddev, with the delta over baseline.
- Review & rewrite — read the user’s feedback, generalise it, cut what isn’t pulling its weight.
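The benchmark step above can be sketched as a few lines of arithmetic. This is a minimal illustration, not Skill Creator's own code: the function names `summarise` and `benchmark` and the sample pass rates are hypothetical, and the sample standard deviation is an assumed choice.

```python
from statistics import mean, stdev

def summarise(values):
    """Return (mean, sample stddev); stddev is 0.0 with fewer than two values."""
    return mean(values), (stdev(values) if len(values) > 1 else 0.0)

def benchmark(with_skill, baseline):
    """Compare one metric (here: pass rate per test prompt) across both configurations."""
    skill_mean, skill_sd = summarise(with_skill)
    base_mean, base_sd = summarise(baseline)
    return {
        "with_skill": {"mean": skill_mean, "stddev": skill_sd},
        "baseline": {"mean": base_mean, "stddev": base_sd},
        "delta": skill_mean - base_mean,  # positive means the skill beats the baseline
    }

# Hypothetical pass rates from three test prompts, one value per prompt.
result = benchmark(with_skill=[1.0, 0.75, 1.0], baseline=[0.5, 0.75, 0.5])
print(f"delta over baseline: {result['delta']:+.2f}")  # prints "delta over baseline: +0.33"
```

The same helper would apply unchanged to time or token counts, where a negative delta is the desirable direction.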
Key concept — progressive disclosure
A skill loads in three levels, so a large skill costs almost nothing until it’s actually needed:
1 · Metadata: name + description (~100 words, always in context)
2 · SKILL.md: the instructions (loaded when the skill triggers, <500 lines ideal)
3 · Resources: scripts / refs (read or executed only as needed, unlimited)
The description is the one part that’s always loaded, which is why it, not the body, decides whether the skill fires at all.
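To make the first level concrete, here is a rough sketch of reading only the frontmatter from a SKILL.md and stopping before the body. It assumes flat `key: value` frontmatter and is not a full YAML parser; `read_metadata` and the `csv-cleaner` example skill are hypothetical.

```python
def read_metadata(skill_md_text):
    """Return the frontmatter key/value pairs without reading the body."""
    lines = skill_md_text.splitlines()
    if not lines or lines[0].strip() != "---":
        raise ValueError("SKILL.md must start with YAML frontmatter")
    metadata = {}
    for line in lines[1:]:
        if line.strip() == "---":  # closing delimiter: stop before the body
            return metadata
        key, sep, value = line.partition(":")
        if sep:  # assumption: flat "key: value" pairs only
            metadata[key.strip()] = value.strip()
    raise ValueError("frontmatter is not closed")

# Hypothetical skill used only for illustration.
EXAMPLE = """---
name: csv-cleaner
description: Clean and validate CSV files. Use when the user uploads tabular data.
---
# Instructions
Long body that stays out of context until the skill triggers.
"""

meta = read_metadata(EXAMPLE)
print(meta["name"])  # prints "csv-cleaner"
```

Only `name` and `description` end up in `meta`; the body and any bundled resources are never touched at this level.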
Worked example — “I want to make a skill for X”
- Narrow it down. Skill Creator asks the four intent questions and confirms the output format.
- Draft + test prompts. It writes a draft and proposes “Here are a few test cases I’d like to try — do these look right?”
- Run head-to-head. Each prompt spawns a with-skill run and a baseline run into iteration-1/eval-0/….
- Grade and open the viewer. generate_review.py opens two tabs: Outputs to click through and leave feedback, Benchmark for the numbers.
- Rewrite from feedback and rerun into iteration-2/, passing --previous-workspace so the viewer shows the diff. Repeat until the feedback comes back empty.
Under the hood
Anatomy of a skill
skill-name/
├── SKILL.md          (required: YAML frontmatter + Markdown body)
└── (optional)
    ├── scripts/      executable code for deterministic, repeated work
    ├── references/   docs pulled into context only when needed
    └── assets/       templates, icons, fonts used in the output
If three test runs all independently write the same helper script, that’s the signal to bundle it under scripts/ and write it once.
The eval workspace
Results live in <skill-name>-workspace/, organised by iteration, then by test case:
iteration-1/
├── eval-0-descriptive-name/
│   ├── with_skill/outputs/
│   ├── without_skill/outputs/   (baseline; old_skill/ when improving)
│   ├── eval_metadata.json       (prompt + assertions)
│   ├── grading.json             (text / passed / evidence per assertion)
│   └── timing.json              (total_tokens, duration_ms)
└── benchmark.json               (pass_rate, time, tokens: mean ± stddev + delta)
Test prompts are saved to evals/evals.json first; assertions are drafted while the runs are in progress. Good assertions are objectively verifiable and clearly named.
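As an illustration of how graded assertions could roll up into a pass rate, the sketch below assumes grading.json is a JSON list of objects with the `text`, `passed` and `evidence` fields named above. The top-level list layout, the `pass_rate` helper and the sample assertions are assumptions, not the tool's documented schema.

```python
import json

def pass_rate(grading_json):
    """Fraction of assertions marked passed in one grading document."""
    assertions = json.loads(grading_json)
    if not assertions:
        return 0.0
    return sum(1 for a in assertions if a["passed"]) / len(assertions)

# Hypothetical grading documents for one eval; the list-of-objects layout is assumed.
with_skill = json.dumps([
    {"text": "output is valid CSV", "passed": True, "evidence": "parsed without errors"},
    {"text": "header row preserved", "passed": True, "evidence": "first line unchanged"},
])
without_skill = json.dumps([
    {"text": "output is valid CSV", "passed": True, "evidence": "parsed without errors"},
    {"text": "header row preserved", "passed": False, "evidence": "header was dropped"},
])

# Delta over baseline for this single eval.
print(pass_rate(with_skill) - pass_rate(without_skill))  # prints 0.5
```

Each sample assertion names a check that can be verified directly from the output, which is the property the paragraph above asks for.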
Description optimisation
The description is the primary trigger, so there’s a dedicated loop for it:
python -m scripts.run_loop \
  --eval-set trigger-eval.json \
  --skill-path <path> \
  --model <session-model-id> \
  --max-iterations 5
It splits 20 should/should-not-trigger queries 60/40 into train and held-out test, runs each query three times for a stable trigger rate, proposes description rewrites, and picks the best_description by test score, not train, to avoid overfitting. The hardest negatives are near-misses that share keywords but need a different tool.
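The selection logic described above might look roughly like the following. This is a sketch under stated assumptions rather than the contents of scripts.run_loop: every function name is hypothetical, `keyword_fires` is a deterministic stand-in for a real model call, and collapsing the three runs into a majority vote is one assumed way to use the trigger rate.

```python
import random

def split_queries(queries, train_fraction=0.6, seed=0):
    """Shuffle and split labelled queries into train and held-out test sets."""
    shuffled = queries[:]
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]

def score(description, queries, fires, runs=3):
    """Fraction of queries whose majority outcome over `runs` matches the label."""
    correct = 0
    for text, should_trigger in queries:
        hits = sum(fires(description, text) for _ in range(runs))
        correct += (hits * 2 > runs) == should_trigger
    return correct / len(queries)

def pick_best(candidates, test_set, fires):
    """Select by held-out test score, not train score, to limit overfitting."""
    return max(candidates, key=lambda d: score(d, test_set, fires))

def keyword_fires(description, query):
    """Stand-in for a model run: fires if any description word appears in the query."""
    return any(word in query.lower() for word in description.lower().split())

# 20 hypothetical labelled queries: 10 should trigger, 10 should not.
queries = [(f"clean csv file {i}", True) for i in range(10)]
queries += [(f"write a poem {i}", False) for i in range(10)]

train, test = split_queries(queries)  # 12 train, 8 held-out test
best = pick_best(["csv", "write"], test, keyword_fires)
print(len(train), len(test), best)  # prints "12 8 csv"
```

In a real loop the train set would drive the proposed rewrites, while the test set is consulted only for the final pick.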