| name | skill-creator |
| description | Create new skills, modify and improve existing skills, and measure skill performance for the pi coding agent. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy. Also use when the user mentions skills, SKILL.md, or wants to package capabilities for reuse. |
Skill Creator
A skill for creating new skills for the pi coding agent and iteratively improving them.
At a high level, the process of creating a skill goes like this:
- Decide what you want the skill to do and roughly how it should do it
- Write a draft of the skill
- Create a few test prompts and run pi-with-access-to-the-skill on them
- Help the user evaluate the results both qualitatively and quantitatively
- Rewrite the skill based on feedback from the user's evaluation of the results
- Repeat until you're satisfied
- Expand the test set and try again at larger scale
Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.
When creating a new skill, the very first thing you must do is ask the user where they want the skill created (global, project-local, or custom path). See "Where to create the skill" below.
On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.
Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.
Communicating with the user
The skill creator is liable to be used by people across a wide range of familiarity with coding jargon. Pay attention to context cues to understand how to phrase your communication:
- "evaluation" and "benchmark" are borderline, but OK
- for "JSON" and "assertion" you want to see serious cues from the user that they know what those things are before using them without explaining them
It's OK to briefly explain terms if you're in doubt.
Pi skill specifics
Skills in pi follow the Agent Skills standard. Pi loads skills from these locations:
- ~/.pi/agent/skills/ (global)
- .pi/skills/ (project-level)
- ~/.agents/skills/ and .agents/skills/ (Agent Skills standard locations)
- Via pi packages (npm or git)
SKILL.md frontmatter
The frontmatter must include name and description; the remaining fields shown here are optional:
---
name: my-skill
description: >
  What this skill does and when to trigger it. Include both
  purpose and triggering contexts. Be slightly "pushy" — skills
  tend to undertrigger, so err on the side of broader matching.
license: MIT
compatibility: ...
metadata: {}
allowed-tools: Bash read edit write
disable-model-invocation: false
---
Skill directory structure
skill-name/
├── SKILL.md # Required: frontmatter + instructions
├── scripts/ # Helper scripts (executable code)
├── references/ # Docs loaded into context as needed
└── assets/ # Templates, icons, fonts, etc.
Progressive disclosure in pi
Skills use a three-level loading system:
- Metadata (name + description) — Always in context (~100 words)
- SKILL.md body — In context whenever skill triggers (<500 lines ideal)
- Bundled resources — As needed via read tool (unlimited)
Reference files clearly from SKILL.md with guidance on when to read them. For large reference files (>300 lines), include a table of contents.
Validation rules
Pi validates skills and warns about violations:
- Name must match parent directory
- Name: 1-64 chars, lowercase a-z, 0-9, hyphens only, no leading/trailing/consecutive hyphens
- Description: max 1024 chars (skills missing description are not loaded)
When creating a skill, validate against these rules before finalizing.
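The rules above are simple enough to check with a short script before finalizing. The following is a minimal sketch in Python; validate_skill and its messages are hypothetical names used for illustration, not pi's own validator, and pi's actual checks may differ in detail.

```python
import re

# Lowercase letters and digits in hyphen-separated groups. One pattern rules
# out leading, trailing, and consecutive hyphens.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
MAX_NAME_LEN = 64
MAX_DESCRIPTION_LEN = 1024


def validate_skill(name: str, description: str, parent_dir: str) -> list[str]:
    """Return a list of rule violations (empty if the skill passes)."""
    problems = []
    if not 1 <= len(name) <= MAX_NAME_LEN:
        problems.append("name must be 1-64 characters")
    if not NAME_PATTERN.fullmatch(name):
        problems.append(
            "name may only use lowercase a-z, 0-9, and single hyphens, "
            "with no leading or trailing hyphen"
        )
    if name != parent_dir:
        problems.append(f"name {name!r} does not match parent directory {parent_dir!r}")
    if not description.strip():
        problems.append("description is missing, so the skill would not be loaded")
    elif len(description) > MAX_DESCRIPTION_LEN:
        problems.append("description exceeds 1024 characters")
    return problems
```

Calling validate_skill("my-skill", "Formats reports.", "my-skill") returns an empty list, while a name such as "My--skill" is reported as a violation.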
Creating a skill
Capture Intent
Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. The user may need to fill the gaps, and should confirm before proceeding to the next step.
- What should this skill enable pi to do?
- When should this skill trigger? (what user phrases/contexts)
- What's the expected output format?
- Should we set up test cases to verify the skill works? Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases. Skills with subjective outputs (writing style, art) often don't need them. Suggest the appropriate default based on the skill type, but let the user decide.
Interview and Research
Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out.
Write the SKILL.md
Based on the user interview, fill in these components:
- name: Skill identifier (must match directory name, lowercase with hyphens)
- description: When to trigger, what it does. Be slightly pushy to combat undertriggering.
- allowed-tools: Pre-approve tools the skill will use (e.g., Bash read edit write)
- the rest of the skill :)
Writing Guide
Defining output formats
## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations
Examples pattern
## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication
⚠️ Where to create the skill — ALWAYS ASK FIRST
Before creating any files, always ask the user where they want the skill created. Present the options:
- Global — ~/.pi/agent/skills/<skill-name>/ (available in all projects)
- Project-local — .pi/skills/<skill-name>/ in the current project directory (available only in this project)
- Custom path — any directory the user specifies
Do not assume global. Do not start creating files until the user has confirmed the location.
If the user has an active project (e.g. they're in a git repo), suggest the project-local path as the default since it keeps the skill with the code it relates to.
Writing Style
Explain to the model why things are important rather than using heavy-handed MUSTs. Use theory of mind and make the skill general, not narrow to specific examples. Start by writing a draft and then look at it with fresh eyes and improve it.
Test Cases
After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?"
Save test cases to evals/evals.json in the skill directory:
{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}
See references/schemas.md for the full schema (including the assertions field, added later).
Running and evaluating test cases
This section is one continuous sequence — don't stop partway through.
Put results in a dedicated workspace directory: ~/.pi/skill-workspaces/<skill-name>/. This keeps eval data separate from the skill itself and avoids polluting project or global skill directories. Within the workspace, organize results by iteration (iteration-1/, iteration-2/, etc.) and within that, each test case gets a directory (eval-0/, eval-1/, etc.). Don't create all of this upfront — just create directories as you go.
Step 1: Run test cases inline
Since pi doesn't have subagents, run test cases inline — one at a time. For each test case:
- Read the skill's SKILL.md
- Follow the skill's instructions to accomplish the test prompt
- Save outputs to <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
For baseline runs (comparing against no skill):
- Run the same prompt without reading the skill
- Save to <workspace>/iteration-<N>/eval-<ID>/without_skill/outputs/
For improvement runs (comparing against old version):
- Snapshot the skill before editing (cp -r <skill-path> <workspace>/skill-snapshot/)
- Run against the old version, save to old_skill/outputs/
Write an eval_metadata.json for each test case. Give each eval a descriptive name based on what it's testing — use this name for the directory too.
{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
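The run layout and metadata file described in this step can be created as you go with a couple of small helpers. This is a sketch only: prepare_run_dir and write_eval_metadata are hypothetical names, and placing eval_metadata.json directly inside the eval-<ID>/ directory is an assumption, since the text does not say exactly where it lives.

```python
import json
from pathlib import Path

# Variant directory names taken from the run layout described above.
VARIANTS = ("with_skill", "without_skill", "old_skill")


def prepare_run_dir(workspace: Path, iteration: int, eval_id: int, variant: str) -> Path:
    """Create and return <workspace>/iteration-<N>/eval-<ID>/<variant>/outputs/."""
    if variant not in VARIANTS:
        raise ValueError(f"unknown variant: {variant}")
    outputs = workspace / f"iteration-{iteration}" / f"eval-{eval_id}" / variant / "outputs"
    outputs.mkdir(parents=True, exist_ok=True)
    return outputs


def write_eval_metadata(eval_dir: Path, eval_id: int, eval_name: str, prompt: str) -> Path:
    """Write eval_metadata.json with an empty assertions list to fill in later.

    Assumption: the file sits in the eval directory itself.
    """
    metadata = {
        "eval_id": eval_id,
        "eval_name": eval_name,
        "prompt": prompt,
        "assertions": [],
    }
    path = eval_dir / "eval_metadata.json"
    path.write_text(json.dumps(metadata, indent=2))
    return path
```

Directories are created only when a run needs them, which matches the advice not to build the whole tree upfront.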
Step 2: Draft assertions while processing
Draft quantitative assertions for each test case and explain them to the user. Good assertions are objectively verifiable and have descriptive names. Subjective skills (writing style, design quality) are better evaluated qualitatively — don't force assertions onto things that need human judgment.
Update eval_metadata.json files and evals/evals.json with the assertions.
Step 3: Grade and aggregate
Once all runs are done:
- Grade each run — evaluate each assertion against the outputs. Save results to grading.json in each run directory. The grading.json expectations array must use text, passed, and evidence fields. For assertions that can be checked programmatically, write and run a script.
- Aggregate into benchmark — run from the skill-creator directory:
cd <skill-dir> && uv run scripts/aggregate_benchmark.py <workspace>/iteration-N --skill-name <name>
- Launch the viewer — generate a standalone HTML file:
cd <skill-dir> && uv run eval-viewer/generate_review.py \
  <workspace>/iteration-N \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-N/benchmark.json \
  --static <workspace>/iteration-N/review.html
Then tell the user to open the file in their browser.
For iteration 2+, also pass --previous-workspace <workspace>/iteration-<N-1>.
- Tell the user: "I've generated the review file at <path>/review.html. Open it in your browser — there are two tabs. 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done, download the feedback file and let me know."
Note that the download filename includes the timestamp from when the viewer was generated (e.g. feedback_20260430_143025.json). The viewer embeds this timestamp when it is generated, so the filename is deterministic — you'll know exactly which file to look for in ~/Downloads/.
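For the grading step above, an assertion that can be checked programmatically might look like the sketch below. The check itself (check_file_exists) and the helper names are hypothetical. Only the expectations array with its text, passed, and evidence fields comes from this document; references/schemas.md may require additional fields that are not shown here.

```python
import json
from pathlib import Path


def check_file_exists(outputs: Path, filename: str) -> dict:
    """Hypothetical assertion: the run produced a file with the given name."""
    target = outputs / filename
    passed = target.is_file()
    if passed:
        evidence = f"found {filename} ({target.stat().st_size} bytes)"
    else:
        evidence = f"{filename} not found in outputs/"
    return {"text": f"Output includes {filename}", "passed": passed, "evidence": evidence}


def write_grading(run_dir: Path, expectations: list[dict]) -> Path:
    """Save graded expectations to grading.json in the run directory.

    Assumption: only the expectations array is written; the full schema
    may need more than this.
    """
    path = run_dir / "grading.json"
    path.write_text(json.dumps({"expectations": expectations}, indent=2))
    return path
```

Recording concrete evidence alongside each pass or fail makes it easier for the user to trust the grade when they see it in the viewer.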
What the user sees in the viewer
The "Outputs" tab shows one test case at a time with the prompt, output, grades, and a feedback textbox. The "Benchmark" tab shows pass rates, timing, and token usage with per-eval breakdowns. When done, they click "Submit All Reviews" which downloads a feedback_<timestamp>.json (timestamp set when the viewer was generated).
Step 4: Read the feedback
When the user tells you they're done, read the downloaded feedback file from ~/Downloads/. The filename follows the pattern feedback_<timestamp>.json where the timestamp was set when the viewer was generated (e.g. feedback_20260430_143025.json). You already know this filename from when you generated the viewer.
{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
  ],
  "status": "complete"
}
Empty feedback means the user thought it was fine. Focus improvements on the test cases with specific complaints.
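Picking out the runs with specific complaints can be done in a few lines. This sketch assumes the feedback file has exactly the shape shown above; load_complaints is a hypothetical helper name.

```python
import json
from pathlib import Path


def load_complaints(feedback_path: Path) -> dict[str, str]:
    """Map run_id to feedback text, skipping reviews left empty."""
    data = json.loads(feedback_path.read_text())
    complaints = {}
    for review in data.get("reviews", []):
        text = review.get("feedback", "").strip()
        if text:  # empty feedback means the user thought the run was fine
            complaints[review["run_id"]] = text
    return complaints
```

With the example file above, the result contains only eval-0-with_skill and its comment about axis labels.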
Improving the skill
This is the heart of the loop. You've run the test cases, the user has reviewed the results, and now you need to make the skill better.
How to think about improvements
- Generalize from the feedback. You're trying to create skills that work across many different prompts, not just the few test cases. Rather than overfitting changes, try branching out and using different metaphors or patterns.
- Keep the prompt lean. Remove things that aren't pulling their weight. Read the transcripts, not just final outputs — if the skill makes the model waste time on unproductive things, remove those parts.
- Explain the why. Try to explain the reasoning behind everything you ask the model to do. If you find yourself writing ALWAYS or NEVER in all caps, reframe and explain the reasoning instead.
- Look for repeated work across test cases. If all test cases result in writing similar helper scripts, that's a signal the skill should bundle that script.
The iteration loop
After improving the skill:
- Apply your improvements to the skill
- Rerun all test cases into a new iteration-<N+1>/ directory
- Generate the reviewer with --previous-workspace pointing at the previous iteration
- Wait for the user to review and tell you they're done
- Read the new feedback, improve again, repeat
Keep going until:
- The user says they're happy
- The feedback is all empty (everything looks good)
- You're not making meaningful progress
Description Optimization
The description field in SKILL.md frontmatter is the primary mechanism that determines whether pi invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.
Step 1: Generate trigger eval queries
Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save as JSON:
[
{"query": "the user prompt", "should_trigger": true},
{"query": "another prompt", "should_trigger": false}
]
Queries must be realistic — concrete, specific, with detail like file paths, personal context, column names, URLs. A mix of lengths, with edge cases. Some casual or with typos.
For should-trigger (8-10): Different phrasings of the same intent, cases where the user doesn't explicitly name the skill, uncommon use cases, and cases where this skill competes with another.
For should-not-trigger (8-10): Near-misses — queries sharing keywords but needing something different. Genuinely tricky cases, not obviously irrelevant ones.
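Before showing the set to the user, it can help to sanity-check the balance between the two classes. The sketch below checks the 8-10 guideline for each class and flags exact duplicates; check_eval_set is a hypothetical helper, and whether a query is a genuine near-miss still needs human judgment.

```python
def check_eval_set(queries: list[dict]) -> list[str]:
    """Return warnings about class balance and duplicate queries."""
    warnings = []
    positives = sum(1 for q in queries if q["should_trigger"])
    negatives = len(queries) - positives
    for label, count in (("should-trigger", positives), ("should-not-trigger", negatives)):
        if not 8 <= count <= 10:
            warnings.append(f"{label}: expected 8-10 queries, found {count}")
    seen = set()
    for q in queries:
        key = q["query"].strip().lower()
        if key in seen:
            warnings.append(f"duplicate query: {q['query']!r}")
        seen.add(key)
    return warnings
```

An empty list means the set matches the size guidance; it says nothing about how realistic the queries are.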
Step 2: Review with user
Present the eval set using the HTML template:
- Read assets/eval_review.html
- Generate a timestamp: date +%Y%m%d_%H%M%S (e.g. 20260430_143025)
- Replace placeholders: __EVAL_DATA_PLACEHOLDER__, __SKILL_NAME_PLACEHOLDER__, __SKILL_DESCRIPTION_PLACEHOLDER__, __TIMESTAMP_PLACEHOLDER__
- Write to /tmp/eval_review_<skill-name>.html and open it
- The user edits queries, toggles should-trigger, exports to ~/Downloads/eval_set_<timestamp>.json — you know the exact filename because you generated the timestamp yourself
Step 3: Run the optimization loop
Save the eval set to the workspace, then run:
cd <skill-dir> && uv run scripts/run_loop.py \
  --eval-set <path-to-trigger-eval.json> \
  --skill-path <path-to-skill> \
  --max-iterations 5 \
  --verbose
This handles the full optimization loop: splits into 60/40 train/test, evaluates the current description (3 runs per query for reliability), calls pi to propose improvements, re-evaluates, iterates up to 5 times. Outputs best_description selected by test score.
Note: The run_loop and run_eval scripts use pi -p as a subprocess. Omit --model to use pi's default model, or specify one explicitly (e.g., --model sonnet).
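To make the 60/40 train/test split concrete, here is one way such a split could work. This is an illustration only, not the code in scripts/run_loop.py: the stratification by should_trigger, the fixed seed, and the name split_eval_set are all assumptions.

```python
import random


def split_eval_set(queries: list[dict], train_fraction: float = 0.6,
                   seed: int = 0) -> tuple[list[dict], list[dict]]:
    """Split queries into train/test, keeping both classes in each part."""
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    train, test = [], []
    for label in (True, False):
        group = [q for q in queries if q["should_trigger"] == label]
        rng.shuffle(group)
        cut = round(len(group) * train_fraction)
        train.extend(group[:cut])
        test.extend(group[cut:])
    return train, test
```

With 10 should-trigger and 10 should-not-trigger queries this yields 12 training and 8 test queries. Selecting the best description by its score on the held-out test part guards against a description that merely memorizes the training queries.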
How skill triggering works in pi
Skills appear in pi's available_skills list with their name + description. Pi decides whether to load a skill based on that description. Pi only loads skills for tasks it can't easily handle on its own — simple one-step queries may not trigger a skill even if the description matches. Substantive, multi-step, or specialized queries reliably trigger skills.
Step 4: Apply the result
Take best_description from the JSON output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.
Packaging
When the skill is complete, package it for distribution:
cd <skill-dir> && uv run scripts/package_skill.py <path/to/skill-folder>
This creates a .skill file (zip archive) that can be shared. The user can install it by extracting to ~/.pi/agent/skills/ or .pi/skills/.
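Because a .skill file is a zip archive, installing one amounts to extracting it into a skills directory. A minimal sketch follows; install_skill is a hypothetical helper, and it assumes the archive contains a single top-level folder named after the skill, which this document does not confirm.

```python
import zipfile
from pathlib import Path


def install_skill(archive: Path, skills_dir: Path) -> list[str]:
    """Extract a .skill archive into skills_dir and return the member names."""
    skills_dir.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(archive) as zf:
        # Assumption: members are laid out as <skill-name>/SKILL.md, etc.
        zf.extractall(skills_dir)
        return zf.namelist()
```

For a global install, skills_dir would be Path.home() / ".pi" / "agent" / "skills"; for a project-local one, Path(".pi") / "skills".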
For publishing as a pi package (npm or git), create a package.json with:
{
  "name": "my-skill-package",
  "keywords": ["pi-package"],
  "pi": {
    "skills": ["./skills"]
  }
}
Then users can install with pi install git:github.com/user/repo or pi install npm:@scope/package.
Reference files
The agents/ directory contains instructions for specialized evaluation roles. Read them when needed:
- agents/grader.md — How to evaluate assertions against outputs
- agents/comparator.md — How to do blind A/B comparison between two outputs
- agents/analyzer.md — How to analyze why one version beat another
The references/ directory has additional documentation:
- references/schemas.md — JSON structures for evals.json, grading.json, benchmark.json, etc.
The core loop, one more time:
- Figure out what the skill is about
- Draft or edit the skill
- Run test cases
- With the user, evaluate the outputs (generate review HTML, run quantitative evals)
- Repeat until satisfied
- Optionally optimize the description for triggering
- Package the final skill