skill-creator

Name: Skill Creator
Author: S1M0N38

Create new skills, modify and improve existing skills, and measure skill performance for the pi coding agent. Use when users want to create a skill from scratch, edit, or optimize an existing skill, run evals to test a skill, benchmark skill performance with variance analysis, or optimize a skill's description for better triggering accuracy. Also use when the user mentions skills, SKILL.md, or wants to package capabilities for reuse.

stars: 11

forks: 0

updated: April 30, 2026, 16:26

package.json

"author": "S1M0N38"

"repository": "S1M0N38/pi-skill-creator"

Skill Creator

A skill for creating new skills for the pi coding agent and iteratively improving them.

At a high level, the process of creating a skill goes like this:

Decide what you want the skill to do and roughly how it should do it
Write a draft of the skill
Create a few test prompts and run pi-with-access-to-the-skill on them
Help the user evaluate the results both qualitatively and quantitatively
Rewrite the skill based on feedback from the user's evaluation of the results
Repeat until you're satisfied
Expand the test set and try again at larger scale

Your job when using this skill is to figure out where the user is in this process and then jump in and help them progress through these stages. So for instance, maybe they're like "I want to make a skill for X". You can help narrow down what they mean, write a draft, write the test cases, figure out how they want to evaluate, run all the prompts, and repeat.

When creating a new skill, the very first thing you must do is ask the user where they want the skill created (global, project-local, or custom path). See "Where to create the skill" below.

On the other hand, maybe they already have a draft of the skill. In this case you can go straight to the eval/iterate part of the loop.

Of course, you should always be flexible and if the user is like "I don't need to run a bunch of evaluations, just vibe with me", you can do that instead.

Communicating with the user

The skill creator is liable to be used by people across a wide range of familiarity with coding jargon. Pay attention to context cues to understand how to phrase your communication:

"evaluation" and "benchmark" are borderline, but OK
For "JSON" and "assertion", look for clear cues that the user knows what these terms mean before using them without explanation

It's OK to briefly explain terms if you're in doubt.

Pi skill specifics

Skills in pi follow the Agent Skills standard. Pi loads skills from these locations:

~/.pi/agent/skills/ (global)
.pi/skills/ (project-level)
~/.agents/skills/ and .agents/skills/ (Agent Skills standard locations)
Via pi packages (npm or git)

SKILL.md frontmatter

The frontmatter supports the following fields. Only name and description are required:

---
name: my-skill       # Required. Max 64 chars. Lowercase a-z, 0-9, hyphens only.
                     # Must match parent directory name. No leading/trailing/consecutive hyphens.
description: >       # Required. Max 1024 chars. What the skill does AND when to use it.
  What this skill does and when to trigger it. Include both
  purpose and triggering contexts. Be slightly "pushy" — skills
  tend to undertrigger, so err on the side of broader matching.
license: MIT         # Optional. License name or reference to bundled file.
compatibility: ...   # Optional. Max 500 chars. Environment requirements.
metadata: {}         # Optional. Arbitrary key-value mapping.
allowed-tools: Bash read edit write  # Optional. Space-delimited pre-approved tools.
disable-model-invocation: false      # Optional. When true, hidden from system prompt.
---

Skill directory structure

skill-name/
├── SKILL.md              # Required: frontmatter + instructions
├── scripts/              # Helper scripts (executable code)
├── references/           # Docs loaded into context as needed
└── assets/               # Templates, icons, fonts, etc.

Progressive disclosure in pi

Skills use a three-level loading system:

Metadata (name + description) — Always in context (~100 words)
SKILL.md body — In context whenever skill triggers (<500 lines ideal)
Bundled resources — As needed via read tool (unlimited)

Reference files clearly from SKILL.md with guidance on when to read them. For large reference files (>300 lines), include a table of contents.
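The size guidance above can be checked mechanically. The following is a minimal sketch, assuming the skill sits in a local directory with the layout shown earlier; the function name `size_warnings` and the two constants are illustrative and not part of pi.

```python
from pathlib import Path

BODY_LINE_TARGET = 500     # "<500 lines ideal" for SKILL.md
REFERENCE_TOC_LINES = 300  # longer reference files should carry a table of contents


def size_warnings(skill_dir: Path) -> list[str]:
    """Return warnings for files that exceed the size guidance.

    Note: the SKILL.md count here includes the frontmatter lines,
    which slightly overstates the body length.
    """
    warnings = []
    skill_md = skill_dir / "SKILL.md"
    if skill_md.is_file():
        n = len(skill_md.read_text(encoding="utf-8").splitlines())
        if n >= BODY_LINE_TARGET:
            warnings.append(f"SKILL.md has {n} lines (target: under {BODY_LINE_TARGET})")
    refs = skill_dir / "references"
    if refs.is_dir():
        for ref in sorted(refs.glob("*.md")):
            n = len(ref.read_text(encoding="utf-8").splitlines())
            if n > REFERENCE_TOC_LINES:
                warnings.append(f"{ref.name} has {n} lines; consider adding a table of contents")
    return warnings
```

An empty list means nothing exceeded the guidance; the thresholds are targets, so treat the output as hints rather than failures.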

Validation rules

Pi validates skills and warns about violations:

Name must match parent directory
Name: 1-64 chars, lowercase a-z, 0-9, hyphens only, no leading/trailing/consecutive hyphens
Description: max 1024 chars (skills missing description are not loaded)

When creating a skill, validate against these rules before finalizing.
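The rules above are simple enough to check before writing any files. This is a rough sketch of such a check, not pi's own validator; `validate_skill` and its messages are hypothetical names chosen for illustration.

```python
import re

# Lowercase alphanumeric segments joined by single hyphens.
# This shape rules out leading, trailing, and consecutive hyphens.
NAME_PATTERN = re.compile(r"[a-z0-9]+(?:-[a-z0-9]+)*")
MAX_NAME_CHARS = 64
MAX_DESCRIPTION_CHARS = 1024


def validate_skill(name: str, description: str, parent_dir_name: str) -> list[str]:
    """Return a list of rule violations; an empty list means the skill passes."""
    problems = []
    if not 1 <= len(name) <= MAX_NAME_CHARS:
        problems.append(f"name must be 1-{MAX_NAME_CHARS} chars, got {len(name)}")
    elif not NAME_PATTERN.fullmatch(name):
        problems.append("name must use lowercase a-z, 0-9, and single hyphens between segments")
    if name != parent_dir_name:
        problems.append(f"name {name!r} does not match parent directory {parent_dir_name!r}")
    if not description.strip():
        problems.append("description is missing, so the skill would not be loaded")
    elif len(description) > MAX_DESCRIPTION_CHARS:
        problems.append(
            f"description is {len(description)} chars (max {MAX_DESCRIPTION_CHARS})"
        )
    return problems
```

For example, `validate_skill("my--skill", "Does X.", "my--skill")` reports one problem because of the consecutive hyphens.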

Creating a skill

Capture Intent

Start by understanding the user's intent. The current conversation might already contain a workflow the user wants to capture (e.g., they say "turn this into a skill"). If so, extract answers from the conversation history first — the tools used, the sequence of steps, corrections the user made, input/output formats observed. The user may need to fill the gaps, and should confirm before proceeding to the next step.

What should this skill enable pi to do?
When should this skill trigger? (what user phrases/contexts)
What's the expected output format?
Should we set up test cases to verify the skill works? Skills with objectively verifiable outputs (file transforms, data extraction, code generation, fixed workflow steps) benefit from test cases. Skills with subjective outputs (writing style, art) often don't need them. Suggest the appropriate default based on the skill type, but let the user decide.

Interview and Research

Proactively ask questions about edge cases, input/output formats, example files, success criteria, and dependencies. Wait to write test prompts until you've got this part ironed out.

Write the SKILL.md

Based on the user interview, fill in these components:

name: Skill identifier (must match directory name, lowercase with hyphens)
description: When to trigger, what it does. Be slightly pushy to combat undertriggering.
allowed-tools: Pre-approve tools the skill will use (e.g., Bash read edit write)
the rest of the skill :)

Writing Guide

Defining output formats

## Report structure
ALWAYS use this exact template:
# [Title]
## Executive summary
## Key findings
## Recommendations

Examples pattern

## Commit message format
**Example 1:**
Input: Added user authentication with JWT tokens
Output: feat(auth): implement JWT-based authentication

⚠️ Where to create the skill — ALWAYS ASK FIRST

Before creating any files, always ask the user where they want the skill created. Present the options:

Global — ~/.pi/agent/skills/<skill-name>/ (available in all projects)
Project-local — .pi/skills/<skill-name>/ in the current project directory (available only in this project)
Custom path — any directory the user specifies

Do not assume global. Do not start creating files until the user has confirmed the location.

If the user has an active project (e.g. they're in a git repo), suggest the project-local path as the default since it keeps the skill with the code it relates to.
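Once the user has answered, the three options map to paths as sketched below. The function names and the `scope` strings are hypothetical. Placing a custom-path skill in a `<skill-name>/` subdirectory is an assumption, made so that the directory name still matches the skill name.

```python
from pathlib import Path
from typing import Optional


def suggested_scope(cwd: Path) -> Optional[str]:
    """Suggest 'project' when cwd is inside a git repo; otherwise make no suggestion."""
    for candidate in [cwd, *cwd.parents]:
        if (candidate / ".git").exists():
            return "project"
    return None


def resolve_skill_dir(
    scope: str,
    skill_name: str,
    project_root: Optional[Path] = None,
    custom_path: Optional[Path] = None,
) -> Path:
    """Map the user's confirmed choice to a target directory. Creates nothing."""
    if scope == "global":
        return Path.home() / ".pi" / "agent" / "skills" / skill_name
    if scope == "project":
        if project_root is None:
            raise ValueError("project scope needs project_root")
        return project_root / ".pi" / "skills" / skill_name
    if scope == "custom":
        if custom_path is None:
            raise ValueError("custom scope needs custom_path")
        # Assumption: keep the skill in a directory named after the skill.
        return custom_path / skill_name
    raise ValueError(f"unknown scope: {scope!r}")
```

`suggested_scope` only produces a default to offer; the location still has to be confirmed by the user before any file is written.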

Writing Style

Explain to the model why things are important rather than using heavy-handed MUSTs. Use theory of mind and make the skill general, not narrow to specific examples. Start by writing a draft and then look at it with fresh eyes and improve it.

Test Cases

After writing the skill draft, come up with 2-3 realistic test prompts — the kind of thing a real user would actually say. Share them with the user: "Here are a few test cases I'd like to try. Do these look right, or do you want to add more?"

Save test cases to evals/evals.json in the skill directory:

{
  "skill_name": "example-skill",
  "evals": [
    {
      "id": 1,
      "prompt": "User's task prompt",
      "expected_output": "Description of expected result",
      "files": []
    }
  ]
}

See references/schemas.md for the full schema (including the assertions field, added later).
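A small helper can produce this file from a list of prompts. This sketch covers only the fields shown in the example above; the full schema lives in references/schemas.md, which is not reproduced here, so `write_evals` is an illustration rather than the skill's own tooling.

```python
import json
from pathlib import Path


def write_evals(skill_dir: Path, skill_name: str, cases: list[dict]) -> Path:
    """Write evals/evals.json, numbering cases from 1 as in the example."""
    evals = [
        {
            "id": i,
            "prompt": case["prompt"],
            "expected_output": case["expected_output"],
            "files": list(case.get("files", [])),
        }
        for i, case in enumerate(cases, start=1)
    ]
    path = skill_dir / "evals" / "evals.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    payload = {"skill_name": skill_name, "evals": evals}
    path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return path
```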

Running and evaluating test cases

This section is one continuous sequence — don't stop partway through.

Put results in a dedicated workspace directory: ~/.pi/skill-workspaces/<skill-name>/. This keeps eval data separate from the skill itself and avoids polluting project or global skill directories. Within the workspace, organize results by iteration (iteration-1/, iteration-2/, etc.) and within that, each test case gets a directory (eval-0/, eval-1/, etc.). Don't create all of this upfront — just create directories as you go.

Step 1: Run test cases inline

Since pi doesn't have subagents, run test cases inline — one at a time. For each test case:

Read the skill's SKILL.md
Follow the skill's instructions to accomplish the test prompt
Save outputs to <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/

For baseline runs (comparing against no skill):

Run the same prompt without reading the skill
Save to <workspace>/iteration-<N>/eval-<ID>/without_skill/outputs/

For improvement runs (comparing against old version):

Snapshot the skill before editing (cp -r <skill-path> <workspace>/skill-snapshot/)
Run against the old version, save to old_skill/outputs/

Write an eval_metadata.json for each test case. Give each eval a descriptive name based on what it's testing — use this name for the directory too.

{
  "eval_id": 0,
  "eval_name": "descriptive-name-here",
  "prompt": "The user's task prompt",
  "assertions": []
}
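The directory layout and metadata file described in this step can be sketched as follows. The helper names are hypothetical. Writing eval_metadata.json directly inside the per-eval directory is an assumption, since the text does not state where the file lives, and the eval directory name is passed in so the caller can use either eval-<ID> or a descriptive name.

```python
import json
from pathlib import Path

RUN_CONFIGS = ("with_skill", "without_skill", "old_skill")


def run_outputs_dir(workspace: Path, iteration: int, eval_dir: str, config: str) -> Path:
    """Path for one run's outputs; created only when the run happens."""
    if config not in RUN_CONFIGS:
        raise ValueError(f"config must be one of {RUN_CONFIGS}")
    return workspace / f"iteration-{iteration}" / eval_dir / config / "outputs"


def write_eval_metadata(
    workspace: Path,
    iteration: int,
    eval_dir: str,
    eval_id: int,
    eval_name: str,
    prompt: str,
    assertions: tuple = (),
) -> Path:
    """Write eval_metadata.json for one test case (location is an assumption)."""
    target = workspace / f"iteration-{iteration}" / eval_dir
    target.mkdir(parents=True, exist_ok=True)
    path = target / "eval_metadata.json"
    payload = {
        "eval_id": eval_id,
        "eval_name": eval_name,
        "prompt": prompt,
        "assertions": list(assertions),
    }
    path.write_text(json.dumps(payload, indent=2) + "\n", encoding="utf-8")
    return path
```

The workspace root itself would be `Path.home() / ".pi" / "skill-workspaces" / skill_name`, following the convention stated earlier.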

Step 2: Draft assertions while processing

Draft quantitative assertions for each test case and explain them to the user. Good assertions are objectively verifiable and have descriptive names. Subjective skills (writing style, design quality) are better evaluated qualitatively — don't force assertions onto things that need human judgment.

Update eval_metadata.json files and evals/evals.json with the assertions.

Step 3: Grade and aggregate

Once all runs are done:

Grade each run — evaluate each assertion against the outputs. Save results to grading.json in each run directory. The grading.json expectations array must use text, passed, and evidence fields. For assertions that can be checked programmatically, write and run a script.

Aggregate into benchmark — run from the skill-creator directory:

cd <skill-dir> && uv run scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>

Launch the viewer — generate a standalone HTML file:

cd <skill-dir> && uv run eval-viewer/generate_review.py \
  <workspace>/iteration-<N> \
  --skill-name "my-skill" \
  --benchmark <workspace>/iteration-<N>/benchmark.json \
  --static <workspace>/iteration-<N>/review.html

Then tell the user to open the file in their browser.

For iteration 2+, also pass --previous-workspace <workspace>/iteration-<N-1>.

Tell the user: "I've generated the review file at <path>/review.html. Open it in your browser — there are two tabs. 'Outputs' lets you click through each test case and leave feedback, 'Benchmark' shows the quantitative comparison. When you're done, download the feedback file and let me know."

Note that the download filename includes the timestamp from when the viewer was generated (e.g. feedback_20260430_143025.json). The viewer embeds this timestamp when it is generated, so the filename is deterministic and you will know exactly which file to look for in ~/Downloads/.
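For the grading step earlier in this section, an assertion that can be checked programmatically might look like the sketch below. Only the expectations array and its text, passed, and evidence fields come from the text above; the helper names are hypothetical, and any further fields required by references/schemas.md are not shown.

```python
import json
from pathlib import Path

REQUIRED_FIELDS = {"text", "passed", "evidence"}


def check_file_exists(outputs_dir: Path, filename: str) -> dict:
    """One programmatic assertion: the run produced the named file."""
    passed = (outputs_dir / filename).is_file()
    evidence = f"found {filename}" if passed else f"{filename} not present in outputs"
    return {"text": f"Output includes {filename}", "passed": passed, "evidence": evidence}


def write_grading(run_dir: Path, expectations: list[dict]) -> Path:
    """Save grading.json in a run directory after checking the field names."""
    for item in expectations:
        missing = REQUIRED_FIELDS - item.keys()
        if missing:
            raise ValueError(f"expectation is missing fields: {sorted(missing)}")
    run_dir.mkdir(parents=True, exist_ok=True)
    path = run_dir / "grading.json"
    path.write_text(json.dumps({"expectations": expectations}, indent=2) + "\n", encoding="utf-8")
    return path
```

Recording the evidence string alongside each pass or fail makes it easier to explain a grade to the user later.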

What the user sees in the viewer

The "Outputs" tab shows one test case at a time with the prompt, output, grades, and a feedback textbox. The "Benchmark" tab shows pass rates, timing, and token usage with per-eval breakdowns. When done, they click "Submit All Reviews" which downloads a feedback_<timestamp>.json (timestamp set when the viewer was generated).

Step 4: Read the feedback

When the user tells you they're done, read the downloaded feedback file from ~/Downloads/. The filename follows the pattern feedback_<timestamp>.json where the timestamp was set when the viewer was generated (e.g. feedback_20260430_143025.json). You already know this filename from when you spawned the viewer.

{
  "reviews": [
    {"run_id": "eval-0-with_skill", "feedback": "the chart is missing axis labels", "timestamp": "..."},
    {"run_id": "eval-1-with_skill", "feedback": "", "timestamp": "..."}
  ],
  "status": "complete"
}

Empty feedback means the user thought it was fine. Focus improvements on the test cases with specific complaints.
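Locating the file and separating real complaints from empty entries could be done as in this sketch. It assumes the JSON shape shown above; `feedback_path` and `reviews_with_complaints` are hypothetical helper names.

```python
import json
from pathlib import Path
from typing import Optional


def feedback_path(timestamp: str, downloads: Optional[Path] = None) -> Path:
    """Build the expected download path from the viewer's timestamp."""
    base = downloads if downloads is not None else Path.home() / "Downloads"
    return base / f"feedback_{timestamp}.json"


def reviews_with_complaints(feedback: dict) -> dict[str, str]:
    """Map run_id to feedback text, skipping empty (i.e. 'fine') reviews."""
    return {
        review["run_id"]: review["feedback"].strip()
        for review in feedback.get("reviews", [])
        if review.get("feedback", "").strip()
    }


def load_feedback(path: Path) -> dict:
    """Read the downloaded feedback file."""
    return json.loads(path.read_text(encoding="utf-8"))
```

With the sample above, `reviews_with_complaints` keeps only the eval-0-with_skill entry about missing axis labels.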

Improving the skill

This is the heart of the loop. You've run the test cases, the user has reviewed the results, and now you need to make the skill better.

How to think about improvements

Generalize from the feedback. You're trying to create skills that work across many different prompts, not just the few test cases. Rather than overfitting changes, try branching out and using different metaphors or patterns.
Keep the prompt lean. Remove things that aren't pulling their weight. Read the transcripts, not just final outputs — if the skill makes the model waste time on unproductive things, remove those parts.
Explain the why. Try to explain the reasoning behind everything you ask the model to do. If you find yourself writing ALWAYS or NEVER in all caps, reframe and explain the reasoning instead.
Look for repeated work across test cases. If all test cases result in writing similar helper scripts, that's a signal the skill should bundle that script.

The iteration loop

After improving the skill:

Apply your improvements to the skill
Rerun all test cases into a new iteration-<N+1>/ directory
Generate the review HTML again, passing --previous-workspace pointing at the previous iteration
Wait for the user to review and tell you they're done
Read the new feedback, improve again, repeat

Keep going until:

The user says they're happy
The feedback is all empty (everything looks good)
You're not making meaningful progress

Description Optimization

The description field in SKILL.md frontmatter is the primary mechanism that determines whether pi invokes a skill. After creating or improving a skill, offer to optimize the description for better triggering accuracy.

Step 1: Generate trigger eval queries

Create 20 eval queries — a mix of should-trigger and should-not-trigger. Save as JSON:

[
  {"query": "the user prompt", "should_trigger": true},
  {"query": "another prompt", "should_trigger": false}
]

Queries must be realistic: concrete and specific, with details such as file paths, personal context, column names, and URLs. Vary the lengths, include edge cases, and make some of them casual or with typos.

For should-trigger (8-10): Different phrasings of the same intent, cases where the user doesn't explicitly name the skill, uncommon use cases, and cases where this skill competes with another.

For should-not-trigger (8-10): Near-misses — queries sharing keywords but needing something different. Genuinely tricky cases, not obviously irrelevant ones.
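Before showing the set to the user, a quick balance check can catch a lopsided or repetitive draft. This is a sketch under the numbers given above (20 queries, at least 8 per class); `check_trigger_set` is a hypothetical helper, and the thresholds are guidance rather than hard limits.

```python
from collections import Counter


def check_trigger_set(queries: list[dict], target_total: int = 20, min_per_class: int = 8) -> list[str]:
    """Return problems with the draft eval set; an empty list means it looks balanced."""
    problems = []
    positives = sum(1 for q in queries if q["should_trigger"])
    negatives = len(queries) - positives
    if len(queries) != target_total:
        problems.append(f"expected {target_total} queries, got {len(queries)}")
    if positives < min_per_class:
        problems.append(f"only {positives} should-trigger queries (want at least {min_per_class})")
    if negatives < min_per_class:
        problems.append(f"only {negatives} should-not-trigger queries (want at least {min_per_class})")
    counts = Counter(q["query"].strip().lower() for q in queries)
    duplicates = sorted(text for text, n in counts.items() if n > 1)
    if duplicates:
        problems.append(f"duplicate queries: {duplicates}")
    return problems
```

The check says nothing about realism or trickiness, which still need a human read.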

Step 2: Review with user

Present the eval set using the HTML template:

Read assets/eval_review.html
Generate a timestamp: date +%Y%m%d_%H%M%S (e.g. 20260430_143025)
Replace placeholders: __EVAL_DATA_PLACEHOLDER__, __SKILL_NAME_PLACEHOLDER__, __SKILL_DESCRIPTION_PLACEHOLDER__, __TIMESTAMP_PLACEHOLDER__
Write to /tmp/eval_review_<skill-name>.html and open it
The user edits queries, toggles should-trigger, exports to ~/Downloads/eval_set_<timestamp>.json — you know the exact filename because the timestamp was set at spawn time
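The timestamp and placeholder steps above amount to string substitution. The sketch below assumes the template accepts the eval data as a JSON literal and the other values as plain text; assets/eval_review.html is not shown here, so that is an assumption, as are the function names.

```python
import json
from datetime import datetime
from typing import Optional


def make_timestamp(now: Optional[datetime] = None) -> str:
    """Same format as `date +%Y%m%d_%H%M%S`."""
    return (now or datetime.now()).strftime("%Y%m%d_%H%M%S")


def fill_review_template(
    template: str,
    eval_data: list[dict],
    skill_name: str,
    description: str,
    timestamp: str,
) -> str:
    """Replace the four placeholders named in the steps above."""
    replacements = {
        # Assumption: the template embeds the data as a JSON literal.
        "__EVAL_DATA_PLACEHOLDER__": json.dumps(eval_data),
        "__SKILL_NAME_PLACEHOLDER__": skill_name,
        "__SKILL_DESCRIPTION_PLACEHOLDER__": description,
        "__TIMESTAMP_PLACEHOLDER__": timestamp,
    }
    for placeholder, value in replacements.items():
        template = template.replace(placeholder, value)
    return template
```

Generating the timestamp once and reusing it for both the template and the expected export filename is what makes the filename predictable.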

Step 3: Run the optimization loop

Save the eval set to the workspace, then run:

cd <skill-dir> && uv run scripts/run_loop.py \
  --eval-set <path-to-trigger-eval.json> \
  --skill-path <path-to-skill> \
  --max-iterations 5 \
  --verbose

This handles the full optimization loop: it splits the eval set 60/40 into train and test sets, evaluates the current description (3 runs per query for reliability), calls pi to propose improvements, re-evaluates, and iterates up to 5 times. It outputs best_description, selected by test score.
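To make the split and the selection rule concrete, here is an illustrative sketch. It is not the code in scripts/run_loop.py, which is not shown here: the shuffle, the seed, and the `test_score` key are assumptions, and the real script may split differently (for example, per class).

```python
import random


def split_eval_set(queries: list[dict], train_fraction: float = 0.6, seed: int = 0):
    """Shuffle deterministically, then cut into train and test portions."""
    shuffled = list(queries)
    random.Random(seed).shuffle(shuffled)
    cut = round(len(shuffled) * train_fraction)
    return shuffled[:cut], shuffled[cut:]


def pick_best(candidates: list[dict]) -> dict:
    """Choose the candidate description with the highest held-out test score."""
    return max(candidates, key=lambda c: c["test_score"])
```

Selecting on the test portion rather than the train portion guards against a description that only fits the queries it was tuned on.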

Note: The run_loop and run_eval scripts use pi -p as a subprocess. Omit --model to use pi's default model, or specify one explicitly (e.g., --model sonnet).

How skill triggering works in pi

Skills appear in pi's available_skills list with their name + description. Pi decides whether to load a skill based on that description. Pi only loads skills for tasks it can't easily handle on its own — simple one-step queries may not trigger a skill even if the description matches. Substantive, multi-step, or specialized queries reliably trigger skills.

Step 4: Apply the result

Take best_description from the JSON output and update the skill's SKILL.md frontmatter. Show the user before/after and report the scores.

Packaging

When the skill is complete, package it for distribution:

cd <skill-dir> && uv run scripts/package_skill.py <path/to/skill-folder>

This creates a .skill file (zip archive) that can be shared. The user can install it by extracting to ~/.pi/agent/skills/ or .pi/skills/.
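Since a .skill file is described as a zip archive, its shape can be sketched with the standard library. This is not scripts/package_skill.py; keeping the skill's directory name as the top-level entry is an assumption, chosen so that extracting into a skills directory yields `<skill-name>/SKILL.md`.

```python
import zipfile
from pathlib import Path


def package_skill(skill_dir: Path, out_dir: Path) -> Path:
    """Zip a skill folder into <out_dir>/<skill-name>.skill.

    out_dir should be outside skill_dir so the archive does not include itself.
    """
    out_dir.mkdir(parents=True, exist_ok=True)
    archive = out_dir / f"{skill_dir.name}.skill"
    with zipfile.ZipFile(archive, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(skill_dir.rglob("*")):
            if path.is_file():
                # Store paths relative to the parent so the skill folder is the archive root.
                zf.write(path, arcname=path.relative_to(skill_dir.parent).as_posix())
    return archive
```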

For publishing as a pi package (npm or git), create a package.json with:

{
  "name": "my-skill-package",
  "keywords": ["pi-package"],
  "pi": {
    "skills": ["./skills"]
  }
}

Then users can install with pi install git:github.com/user/repo or pi install npm:@scope/package.

Reference files

The agents/ directory contains instructions for specialized evaluation roles. Read them when needed:

agents/grader.md — How to evaluate assertions against outputs
agents/comparator.md — How to do blind A/B comparison between two outputs
agents/analyzer.md — How to analyze why one version beat another

The references/ directory has additional documentation:

references/schemas.md — JSON structures for evals.json, grading.json, benchmark.json, etc.

The core loop, one more time:

Figure out what the skill is about
Draft or edit the skill
Run test cases
With the user, evaluate the outputs (generate review HTML, run quantitative evals)
Repeat until satisfied
Optionally optimize the description for triggering
Package the final skill
