| name | dhub-skill-creator |
| description | Guide for creating effective skills for Claude Code agents. Covers skill design, implementation, validation, packaging, and optionally runtime environments and automated evaluations for Decision Hub publishing. Use when users want to create, improve, or package a skill. |
Decision Hub Skill Creator
Create modular skill packages (SKILL.md + optional resources) that turn Claude into a specialist. This skill guides the full lifecycle: define the domain, design the architecture, build and validate the skill, and package it for distribution.
For skills intended for Decision Hub, the workflow naturally extends into defining runtime environments and writing evaluation criteria — not as a separate mode, but as a natural consequence of what the skill needs.
Anatomy of an Effective Skill
What a Skill Contains
Progressive disclosure — each layer loads only when needed:
- Metadata (always in context):
name + description in frontmatter. Determines when the skill activates.
- SKILL.md body (when triggered): The agent system prompt. Core procedures, workflow, constraints.
- Bundled resources (as needed):
scripts/ — deterministic code the agent executes via Bash
references/ — domain knowledge the agent reads on demand
assets/ — templates, sample data, output formats
agents/ — subagent system prompts for delegation
Patterns That Make Skills Effective
| Pattern | Why It Works | Example |
|---|
| Architecture diagram up front | Agent grasps the big picture before details | ASCII flow showing phase transitions |
| Review gates | Prevents runaway execution, gives user control points | "HARD STOP — present outline, wait for approval" |
| Subagent delegation | Separates concerns, each agent does one thing well | Actor-critic loop: generate → critique → revise |
| Anti-patterns / blacklists | Tells agent what NOT to do — as important as what to do | List of cliches to never use |
| Quality checklists | Actionable verification before output | Design system checklist with checkboxes |
| Sensible defaults | Reduces friction — ask only what's needed | Default assumptions table at skill start |
| Concrete examples | Shows expected behavior, not just rules | Good/bad output snippets inline |
What to Avoid
- Generic TODO templates the agent fills with boilerplate
- Excessive placeholder files that create clutter to delete
- Vague descriptions like "A helpful skill" — triggers for wrong contexts
- Instructions written for humans instead of agents
- Duplicating content between SKILL.md and references
- Overly long SKILL.md — move detail to
references/
Skill Creation Workflow
Four phases. Not rigid steps — they overlap and the depth of each depends on the skill's complexity.
Phase 1 — Define the Domain
Understand what the skill does before building anything. Ask focused questions:
- What specific tasks does this skill handle? Not "data analysis" but "causal inference for A/B tests, lift analysis, treatment effect estimation."
- What triggers should activate this skill? Maps directly to the
description field. Think about what a user would say.
- What does the agent need to know that it doesn't already know? This is the test for whether content belongs in the skill. If Claude already knows it, don't include it.
Ask 2-3 focused questions. Never more than 5. Gather enough to make design decisions, then move on.
Phase 2 — Design the Architecture
Choose the structural pattern and identify resources. Read references/skill_patterns.md for detailed patterns.
Choose a structural pattern:
- Workflow-based — multi-step processes with phases and review gates
- Task-based — focused input/output with processing rules
- Agent-delegation — multiple subagents, each handling one concern
- Reference-based — augmenting with domain knowledge the agent lacks
Identify bundled resources:
- What scripts need to exist in
scripts/?
- What reference material goes in
references/?
- Does the skill need subagents in
agents/?
- Are there template files for
assets/?
Determine if the skill needs runtime or evals blocks. These are about the skill's nature — whether it has executable code or should be automatically testable — not about where the skill will be published.
- Runtime block: Ask "Does this skill include executable code or dependencies?" If yes, ask what language/packages it needs and any required API keys, then define the
runtime block in frontmatter. See references/format_spec.md for field details.
- Evals block: Ask "Should this skill be automatically testable?" If yes, ask "What does 'correct' look like? Describe a scenario where it should pass and one where it should fail." Then guide eval case authoring in Phase 3b.
Phase 3a — Build the Skill
-
Scaffold. Run scripts/init_skill.py to create the directory:
python internal-skills/dhub-skill-creator/scripts/init_skill.py <name> --path <dir> [--with-runtime] [--with-evals] [--description "..."]
-
Write the SKILL.md body. Follow the writing guidelines below. The body is the agent system prompt — procedural knowledge the agent cannot infer on its own.
-
Build resources. Create scripts, references, assets, agents as designed. For runtime skills, ensure the entrypoint exists and dependencies are declared.
-
Validate early. Run scripts/validate_skill.py during development, not just at the end.
Phase 3b — Author Evaluation Cases
When the skill has an evals block, help the user construct eval cases through a structured interview.
Step 1 — Identify what to test. Each eval case tests one specific behavior.
- "What's the most important thing this skill must get right?"
- "What's the most common way it could fail?"
Step 2 — Write the eval prompt. A realistic user message — what a real person would say to trigger this skill. Keep it focused. Include test data in evals/data/ if needed.
Step 3 — Compose judge criteria. The judge_criteria field is free-text interpreted by an LLM judge. Build it from structured blocks — pick whichever are relevant:
Required Behaviors — things the agent MUST do:
## Required Behaviors
- Checks data distribution before selecting a statistical test
- Reports confidence intervals, not just p-values
Forbidden Behaviors — things that cause automatic failure:
## Forbidden Behaviors
- Applies parametric tests without verifying normality
- Hallucinates data that wasn't in the input file
Expected Output Contains — specific patterns or concepts:
## Expected Output Contains
- A test statistic and p-value
- An interpretation in plain language
Calibration Examples — good/bad snippets so the judge knows what "right" looks like:
## Examples
Good: "Shapiro-Wilk test (p=0.003) rejects normality, using Mann-Whitney U..."
Bad: "Running a t-test gives p=0.04, so the treatment works."
Threshold — how to combine criteria into pass/fail:
## Scoring
PASS if all Required Behaviors present AND no Forbidden Behaviors appear.
Interview the user to populate these blocks:
- "Describe what a correct output looks like" → Required Behaviors + Expected Output
- "What would a wrong output look like?" → Forbidden Behaviors + bad Example
- "Can you show a snippet of ideal output?" → good Example
For simple cases, a single sentence works: "PASS if the agent creates a valid CSV file with headers matching the schema. FAIL otherwise."
Step 4 — Assemble the eval YAML. Create evals/<case-name>.yaml with name, description, prompt, and judge_criteria fields. See references/format_spec.md for the complete spec and references/skill_patterns.md for the eval criteria authoring guide.
Writing Guidelines
- Write for an AI agent, not a human. Focus on procedural knowledge the agent cannot infer from its training.
- Imperative form. "Parse the input" not "You should parse the input."
- Be specific about what NOT to do. Agents tend toward generic outputs unless constrained. Anti-pattern lists and blacklists are highly effective.
- Include concrete examples. Show expected input/output pairs, good/bad snippets. Examples outperform abstract rules.
- Keep SKILL.md under 5000 words. Move detailed specs, lookup tables, and large examples to
references/.
- Every instruction must be actionable. If the agent cannot act on a sentence, delete it. No throat-clearing, no meta-commentary.
- Use tables for structured data. Default assumptions, field specs, command references — tables are faster to parse than prose.
- One section, one concern. Don't mix workflow steps with quality criteria. Separate them.
Validate, Package, Iterate
Validate
Run validation during development to catch issues early:
python internal-skills/dhub-skill-creator/scripts/validate_skill.py <skill-dir>
python internal-skills/dhub-skill-creator/scripts/validate_skill.py <skill-dir> --strict
Fix all errors before packaging. Address warnings to improve quality.
Package
Create a distributable zip (runs validation first):
python internal-skills/dhub-skill-creator/scripts/package_skill.py <skill-dir> [--output-dir <dir>]
Iterate
Test the skill by using it on real tasks. Notice gaps, iterate on the SKILL.md and resources. Skills improve through use, not through planning.
Publish
After validation and packaging, ask the user: "Do you want to publish this skill to Decision Hub?" If yes:
dhub publish --org <org> --name <skill>
Run from the skill directory. The server validates the manifest, runs safety checks, and optionally triggers eval runs.
Quick Reference
Scripts
| Script | Purpose | Usage |
|---|
init_skill.py | Scaffold a new skill | python scripts/init_skill.py <name> --path <dir> [--with-runtime] [--with-evals] [--description "..."] |
validate_skill.py | Validate a skill directory | python scripts/validate_skill.py <skill-dir> [--strict] |
package_skill.py | Validate + zip for distribution | python scripts/package_skill.py <skill-dir> [--output-dir <dir>] |
Frontmatter Fields
| Field | Required | Description |
|---|
name | yes | 1-64 chars, ^[a-z0-9]([a-z0-9-]{0,62}[a-z0-9])?$ |
description | yes | 1-1024 chars, what the skill does + when to trigger |
license | no | SPDX identifier |
compatibility | no | Requirements or constraints |
metadata | no | Key-value pairs |
allowed_tools | no | Tool access restrictions |
runtime | no | Executable code configuration (see references/format_spec.md) |
evals | no | Automated evaluation configuration (see references/format_spec.md) |
Validation Checks Summary
- 10 error-level checks: SKILL.md exists, valid frontmatter, non-empty body, valid name, name matches dir, valid description, no placeholders, runtime language/entrypoint, evals agent/judge_model, eval YAML fields, unique eval names
- 5 warning-level checks: short description, short body, env var naming, missing eval files,
--strict promotes all to errors
Troubleshooting
"SKILL.md not found" — Ensure you point to the skill directory, not the SKILL.md file itself.
"name does not match directory name" — The name field in frontmatter must exactly match the containing directory name. Rename either one.
"Frontmatter is not valid YAML" — Check for unquoted colons in field values. Wrap the description in quotes if it contains colons: description: "My skill: does things".
"entrypoint does not exist" — The file at runtime.entrypoint must exist relative to the skill root. Create the file or fix the path.
"No evals/*.yaml files found" — Either add eval case YAML files to the evals/ directory, or remove the evals block from frontmatter if evals aren't needed yet.
Validation passes but skill doesn't trigger — The description may be too vague. Make it specific with concrete task types and "Use when..." phrasing.
Zip excludes needed files — The packager excludes __pycache__/, *.pyc, .DS_Store, .git/, *.egg-info/, .env*. If a needed file matches these patterns, rename it.