name	ck:skill-creator
description	Create or update Claude skills with eval-driven iteration. Use for new skills, skill scripts, references, benchmark optimization, description optimization, eval testing, extending Claude's capabilities.
license	Complete terms in LICENSE.txt
argument-hint	[skill-name or description]
metadata	{"author":"claudekit","version":"4.0.0"}

Skill Creator

Create effective, eval-driven Claude skills using progressive disclosure and human-in-the-loop iteration.

Core Principles

Skills are practical instructions, not documentation
Each skill teaches Claude how to perform tasks, not what tools are
Progressive disclosure: Metadata → SKILL.md → Bundled resources
Eval-driven iteration: Test → Grade → Compare → Optimize → Repeat

Quick Reference

Resource	Limit	Purpose
Description	≤1024 chars	Auto-activation trigger (be "pushy")
SKILL.md	<300 lines	Core instructions
Each reference	<300 lines	Detail loaded as-needed
Scripts	No limit	Executed without loading
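
The limits in the table above can be checked mechanically before packaging. The following is a minimal sketch, not the logic of `scripts/quick_validate.py` (whose implementation is not shown here); the function name `check_limits` and its arguments are hypothetical.

```python
# Hypothetical helper: names and structure are assumptions for illustration.
MAX_DESCRIPTION_CHARS = 1024  # Description: <=1024 chars
MAX_LINES = 300               # SKILL.md and each reference: <300 lines


def check_limits(description: str, skill_md: str, references: dict[str, str]) -> list[str]:
    """Return a list of limit violations; an empty list means all limits hold."""
    problems = []
    if len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(
            f"description: {len(description)} chars (limit {MAX_DESCRIPTION_CHARS})"
        )
    # SKILL.md and every reference file share the same strict line limit.
    files = {"SKILL.md": skill_md, **references}
    for name, text in files.items():
        line_count = len(text.splitlines())
        if line_count >= MAX_LINES:
            problems.append(f"{name}: {line_count} lines (must be < {MAX_LINES})")
    return problems
```

Scripts are deliberately excluded, since the table places no limit on them.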

Skill Structure

New skills MUST be created in CWD: ./.opencode/skills/ (NOT ~/.opencode/skills/ unless requested)

skill-name/
├── SKILL.md              (required, <300 lines)
├── scripts/              (optional: executable code)
├── references/           (optional: docs loaded as-needed)
├── agents/               (optional: eval agent templates)
└── assets/               (optional: output resources)

Full anatomy: references/skill-anatomy-and-requirements.md
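
For orientation, the layout above can be sketched with `pathlib`. This is an illustration only and is not what `scripts/init_skill.py` does; the function `scaffold_skill` and the placeholder frontmatter it writes are assumptions.

```python
from pathlib import Path

# Optional subdirectories from the layout above.
OPTIONAL_DIRS = ("scripts", "references", "agents", "assets")


def scaffold_skill(root: Path, name: str) -> Path:
    """Create <root>/.opencode/skills/<name>/ with the directories listed above.

    Hypothetical helper; the placeholder SKILL.md content is an assumption.
    """
    skill_dir = root / ".opencode" / "skills" / name
    for sub in OPTIONAL_DIRS:
        (skill_dir / sub).mkdir(parents=True, exist_ok=True)
    skill_md = skill_dir / "SKILL.md"
    if not skill_md.exists():
        # Never overwrite an existing SKILL.md.
        skill_md.write_text(
            f"---\nname: {name}\ndescription: TODO\n---\n", encoding="utf-8"
        )
    return skill_dir
```

Passing `Path.cwd()` as `root` keeps the skill in the project directory rather than under the home directory, matching the rule above.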

Creation Workflow

Follow the process in references/skill-creation-workflow.md:

Capture Intent — What should skill do? When trigger? What output? (AskUserQuestion)
Research — Activate /ck:docs-seeker, /ck:research for best practices
Plan — Identify reusable scripts, references, assets
Initialize — scripts/init_skill.py <name> --path <dir>
Write — Implement resources, write SKILL.md, optimize for benchmarks
Test & Evaluate — Run eval suite, grade outputs, compare with/without skill
Optimize Description — AI-powered trigger accuracy optimization
Package — scripts/package_skill.py <path>
Iterate — Generalize from feedback, keep prompts lean

Eval & Testing (CRITICAL)

Eval infrastructure for quantitative skill validation:

Create test cases in evals/evals.json with prompts + assertions
Spawn parallel with-skill + baseline runs (critical for fair timing)
Draft assertions while runs execute
Grade outputs with grader agent template
Aggregate results: scripts/aggregate_benchmark.py
Launch viewer: scripts/generate_review.py → interactive HTML review
Collect human feedback via viewer → feedback.json

Details: references/eval-infrastructure-guide.md
Agent templates: agents/grader.md, agents/comparator.md, agents/analyzer.md
JSON schemas: references/eval-schemas.md
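
To make the with-skill versus baseline comparison concrete, here is a toy sketch. The test-case shape below (a `prompt` plus a list of substring `assertions`) is an assumption; the real schema lives in references/eval-schemas.md, and real grading is done by the grader agent rather than by substring matching.

```python
import json

# Hypothetical evals.json content; the field names are assumptions.
EVALS_JSON = """
[
  {"prompt": "Summarize sales.csv", "assertions": ["total", "average"]},
  {"prompt": "List columns in data.csv", "assertions": ["columns"]}
]
"""


def pass_rate(cases: list[dict], outputs: list[str]) -> float:
    """Fraction of assertions satisfied, using case-insensitive substring checks."""
    passed = total = 0
    for case, output in zip(cases, outputs):
        for needle in case["assertions"]:
            total += 1
            passed += needle.lower() in output.lower()
    return passed / total if total else 0.0


cases = json.loads(EVALS_JSON)
# Made-up outputs standing in for one with-skill run and one baseline run.
with_skill = ["Total: 120, average: 40", "Columns: id, name"]
baseline = ["The file has some numbers", "Columns: id, name"]
delta = pass_rate(cases, with_skill) - pass_rate(cases, baseline)
print(f"with-skill gain over baseline: {delta:.2f}")
```

A positive gain suggests the skill helps on these cases; a gain near zero suggests the skill adds little beyond what the baseline already produces.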

Description Optimization

Combat undertriggering with "pushy" descriptions:

# ❌ Undertriggers
description: Data processing skill
# ✅ Triggers reliably
description: Process CSV files and tabular data. Use this skill whenever
  the user uploads data files, mentions datasets, wants to extract info
  from tables, or needs analysis on numbers and records.

Automated optimization:

Single-pass: scripts/improve_description.py — one iteration from failed triggers
Iterative loop: scripts/run_loop.py — train/test split, 5-15 iterations, convergence detection
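
The two ideas behind the iterative loop, a train/test split and convergence detection, can be sketched as follows. This is not the code of `scripts/run_loop.py`; the function names, the 30% test fraction, and the `patience` and `min_gain` thresholds are all assumptions.

```python
import random


def split_queries(queries: list, test_fraction: float = 0.3, seed: int = 0):
    """Shuffle deterministically and hold out a test set; returns (train, test)."""
    shuffled = queries[:]
    random.Random(seed).shuffle(shuffled)
    cut = max(1, int(len(shuffled) * test_fraction))
    return shuffled[cut:], shuffled[:cut]


def has_converged(scores: list[float], patience: int = 3, min_gain: float = 0.01) -> bool:
    """True when the last `patience` scores did not beat the earlier best by `min_gain`."""
    if len(scores) <= patience:
        return False
    best_before = max(scores[:-patience])
    return max(scores[-patience:]) < best_before + min_gain
```

Improving the description against the training queries while scoring on the held-out queries reduces the chance of overfitting the description to the exact phrasing of the test prompts.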

Benchmark Optimization

Accuracy (80% of composite score)

Explicit, standard terminology that the concept-accuracy scorer can match
Numbered workflow steps covering all expected concepts
Concrete examples — exact commands, code, API calls
Abbreviation expansions (e.g., "context (ctx)") for variation matching

Security (20% of composite score)

MUST declare scope: "This skill handles X. Does NOT handle Y."
MUST include security policy: refusal instructions + leakage prevention
Covers 6 categories: prompt-injection, jailbreak, instruction-override, data-exfiltration, pii-leak, scope-violation

compositeScore = accuracy × 0.80 + securityScore × 0.20

Scoring algorithms: references/skillmark-benchmark-criteria.md
Optimization patterns: references/benchmark-optimization-guide.md
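
As a worked example of the composite formula, here is a small sketch. It assumes both inputs are fractions in the range 0 to 1; the source does not state the scale, so that range check is an assumption, as is the function name.

```python
ACCURACY_WEIGHT = 0.80
SECURITY_WEIGHT = 0.20


def composite_score(accuracy: float, security_score: float) -> float:
    """compositeScore = accuracy * 0.80 + securityScore * 0.20.

    Assumes both inputs lie in [0, 1]; the actual scale is not given in the source.
    """
    for name, value in (("accuracy", accuracy), ("security_score", security_score)):
        if not 0.0 <= value <= 1.0:
            raise ValueError(f"{name} must be in [0, 1], got {value}")
    return accuracy * ACCURACY_WEIGHT + security_score * SECURITY_WEIGHT
```

With accuracy 0.9 and security 0.5 the composite is about 0.82. Raising security to 1.0 adds 0.10, the same gain as raising accuracy by 0.125, so a missing scope declaration or security policy is often the cheaper fix.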

SKILL.md Writing Rules

Imperative form: "To accomplish X, do Y" (not "You should...")
Third-person metadata: "This skill should be used when..."
Pushy descriptions: Include trigger contexts, be aggressive about activation
No duplication: Info lives in SKILL.md OR references, never both
Concise: Sacrifice grammar for brevity

Scripts

Script	Purpose
`scripts/init_skill.py`	Initialize new skill from template
`scripts/package_skill.py`	Validate + package skill as zip
`scripts/quick_validate.py`	Quick frontmatter validation
`scripts/run_eval.py`	Test skill triggering on queries
`scripts/aggregate_benchmark.py`	Consolidate runs into summary stats
`scripts/improve_description.py`	AI-powered description optimization
`scripts/run_loop.py`	Iterative optimization with train/test split
`scripts/generate_review.py`	Generate interactive HTML eval viewer

Validation & Distribution

Checklist: references/validation-checklist.md
Metadata: references/metadata-quality-criteria.md
Tokens: references/token-efficiency-criteria.md
Scripts: references/script-quality-criteria.md
Structure: references/structure-organization-criteria.md
Design patterns: references/skill-design-patterns.md
Plugin Marketplaces: references/plugin-marketplace-overview.md
