| name | skill-creator |
| description | Create, edit, audit, and evaluate golem-powers skills. Use for new skills, structural skill edits, workflow/adapter changes, pre-deploy validation, skill evals, benchmarks, live A/B tests, session JSONL mining, batch miners, and handoff digests. Triggers: create skill, edit skill, audit skill, validate skill, skill eval, live eval, mine session. NOT for invoking existing skills or convergence weaving. |
Skill Creator
Take skills from "drafted" to "proven" through rigorous multi-layer evaluation.
Installation
bash ~/Gits/golems/skills/golem-powers/skill-creator/scripts/install.sh
The installer is idempotent — safe to re-run. It:
- Symlinks the skill into
~/.claude/commands/skill-creator (golem-powers convention)
- Ensures
~/Gits/skill-creator/.claude/agents/ and scripts/ exist as the project-scope home
- Symlinks the
session-miner agent definition into the project-scope agents dir (gates the sub-agent so only skillCreator-family sessions can spawn it)
- Symlinks the parser into the project-scope scripts dir
Run bash scripts/install.sh --dry-run to preview without changes. bash scripts/install.sh --help for usage.
Core Loop: DRAFT → EVAL → RED → GREEN → SMOKE → SHIP
Every skill change goes through this pipeline. No exceptions.
1. DRAFT — Understand the Skill
- Read the SKILL.md or MCP tool definition completely
brain_search for: prior evals, user complaints, known issues, related skills
- Identify: what does this skill claim to do? What's the baseline without it?
- Document intent, inputs, outputs, edge cases
- Classify: capability uplift (may become obsolete) vs encoded preference (durable)
2. EVAL DESIGN — Create Structured Test Cases
Design eval cases as structured JSON in evals/evals.json:
{
"id": 1,
"name": "descriptive-kebab-name",
"category": "compliance|structure|quality",
"description": "What this eval tests",
"prompt": "The input scenario given to the agent",
"assertions": [
{
"type": "tool_usage|content|negative",
"name": "assertion-name",
"description": "What correct behavior looks like"
}
]
}
Cover: happy path, edge cases, failure modes, interaction with other skills.
See references/scoring-rubric.md for scoring methodology.
3. RED — Baseline Measurement (without skill)
- Run each eval WITHOUT the skill loaded
- Score baseline performance across all eval cases
- Document baseline scores
- Gate: If baseline >70%, the skill may not be adding value — flag it
4. GREEN — With-Skill Measurement
- Run each eval WITH the skill loaded
- Score delta between with_skill and without_skill
- Gate: Delta <10% = marginal, consider retiring. Delta >30% = clearly valuable.
- Maximum 3 iterations before flagging for human review
5. SMOKE — Live Agent Testing
Two tiers based on skill importance:
Tier A: Static Smoke (all skills)
- Review evals.json assertions manually
- Verify no contradictions between skill instructions and assertions
- Check that strongest discriminators are tested
Tier B: Live Agent Smoke (flagship skills only)
- Run
live-eval-runner.sh for top 3 discriminator evals
- Agent routing: Claude skills → Sonnet, code skills → Codex, audit skills → Cursor
- Compare captured output against assertions
- Gate: Live delta must be within 15% of static delta
- Results stored in
evals/results/live-{date}.json and brain_store'd
See workflows/live-eval.md for the full live eval workflow.
6. SHIP — Package and Document
- Ensure every skill ships with its evals (no eval = no ship)
brain_store eval results, delta scores, issues found
- Update skill metadata (version, last-eval-date, compliance-score)
- Follow workflows/create-skill.md and
workflows/audit-skill.md for SKILL.md structure
compliance before shipping structural changes.
- REGISTER a NEW skill — committed ≠ installed. A new golem-powers skill is
invisible to every agent until it's symlinked into
~/.claude/skills/<name> +
~/.claude/commands/<name> (run golem-install, which auto-discovers + links
every golem-powers/*/ dir; or symlink the one new skill). Then VERIFY it
appears in the available-skills list — committing/merging does NOT register it.
(2026-05-30: /weave was committed + merged but unusable by any agent until the
symlinks existed. See workflows/create-skill.md
"Final Step: REGISTER".)
Eval Methodology
with_skill vs without_skill comparison is MANDATORY. No exceptions.
Behavioral Compliance Scoring
| Weight | Dimension | What it measures |
|---|
| 70% | Compliance | Does the agent follow the skill's instructions? |
| 20% | Structure | Does the output match expected format? |
| 10% | Quality | Is the output actually good/useful? |
Scoring Thresholds
| Condition | Action |
|---|
| Baseline >70% | Skill may not add value — flag and explain |
| Delta <10% | Marginal — consider if complexity is worth it |
| Delta >30% | Clearly valuable — ship with confidence |
| Compliance <50% with skill | Instructions unclear — rewrite before shipping |
Agent Routing for Live Evals
| Skill type | Test with | Why |
|---|
| Claude behavior skills | Sonnet (default) | Tests actual Claude compliance |
| Code implementation skills | Codex (default, no model flag) | Tests code quality |
| Audit/review skills | Cursor (default) | Tests review thoroughness |
Workflows
| Workflow | When to use |
|---|
| create-skill.md | Creating a new skill from scratch |
| audit-skill.md | Auditing existing skill structure, scripts, workflows, and registration before deploy |
| live-eval.md | Running live A/B tests with real agents |
| ab-compare.md | Comparing skill versions or platforms |
| mine-session.md | Mining a Claude Code session JSONL into a 10-section markdown digest (handoff docs, EOD waves, claim verification) |
References
| Reference | Content |
|---|
| scoring-rubric.md | Full scoring methodology and rubric |
| subagent-vs-skill.md | Classification reference — READ BEFORE SCAFFOLDING any new capability. Distinguishes skill (slash-triggered, parent-context) from sub-agent (name-spawned, isolated-context). Documents the misroute pattern (GitHub openai/codex#18823) that even Codex itself routinely makes. |
Integration with Other Skills
/cmux-agents — spawning live eval agents in cmux panes
/never-fabricate — Read() every file before reporting results
/pr-loop — shipping skill changes through the full PR lifecycle
Packaged Sub-Agents (skill-creator family only)
These sub-agents ship in BOTH Claude and Codex formats. They are invokable only from sessions with cwd inside ~/Gits/skill-creator/ (i.e., skillCreatorClaude / skillCreatorCodex / skillCreatorRepoGolem). Sessions in other repos cannot spawn them directly — they must dispatch a skillCreator first.
Canonical source-of-truth lives in this skill's agents/ dir. scripts/install.sh symlinks them into the project repo's project-scope dirs (.claude/agents/ for Claude, .codex/agents/ for Codex). The dual-format packaging means the same sub-agent is available regardless of whether the parent is Claude Code or Codex CLI.
| Sub-agent | Claude format | Codex format | Workflow |
|---|
session-miner | agents/session-miner.md (model: inherit) | agents/session-miner.toml (model: gpt-5.3-codex-spark) | mine-session.md |
Both formats invoke the same deterministic parser at scripts/session-miner.py. The Claude format uses subagent_type="session-miner" via the Agent tool; the Codex format is referenced by name in natural-language prompts (session_miner, mine X to Y).
For when to choose sub-agent vs skill: see references/subagent-vs-skill.md.
Critical Rules
- Never claim "tests pass" without running them. Read actual output.
- Every skill ships with evals. No eval = no ship.
- brain_search BEFORE starting work — someone may have already evaluated it.
- brain_store AFTER completing work — results, delta scores, decisions.
- Max 3 iterations per skill before flagging for human review.
- Read() every file you cite. No citing from compacted summaries.
- Sequential skill compounding — build skills one at a time, each learning from the last.