一键在 Manus 中运行任何 Skill

$pwd:

skill-forge-eval

Name: Skill Forge Eval
Author: AgriciDaniel

// Run evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns executor, grader, comparator, and analyzer sub-agents for parallel evaluation. Generates eval_metadata.json, grading.json, and feedback reports. Use when user says "eval skill", "test skill", "run evals", "evaluate skill", "skill evals", "test skill quality", "run skill tests", or "skill evaluation".

在 Manus 中运行

$ git log --oneline --stat

stars:58

forks:28

updated:2026年3月6日 16:30

SKILL.md

readonly

related-skills.json

同仓库

skill-forge-benchmark.md

from "AgriciDaniel/skill-forge"

Benchmark Claude Code skill performance with variance analysis, tracking pass rate, execution time, and token usage across iterations. Runs multiple trials per eval for statistical reliability, aggregates results into benchmark.json, and generates comparison reports between skill versions. Use when user says "benchmark skill", "measure skill performance", "skill metrics", "compare skill versions", "skill performance", "track skill improvement", "skill regression test", or "skill A/B test".

2026-03-0658

skill-forge.md

from "AgriciDaniel/skill-forge"

Ultimate Claude Code skill creator and architect. Designs, scaffolds, builds, reviews, evolves, and publishes production-grade Claude Code skills following the Agent Skills open standard and 3-layer architecture (directive, orchestration, execution). Handles single-file skills, multi-skill orchestrators with sub-skills and subagents, MCP-enhanced workflows, and full skill ecosystems. Industry detection for skill domain. Triggers on: "create skill", "build skill", "new skill", "skill creator", "skill builder", "skill-forge", "design skill", "scaffold skill", "review skill", "improve skill", "publish skill", "skill architecture", "convert skill", "port skill", "multi-platform", "cross-platform", "eval skill", "test skill", "benchmark skill", "skill evals", "measure skill", "skill performance", "skill A/B test".

2026-03-0658

skill-forge-evolve.md

from "AgriciDaniel/skill-forge"

Improve and iterate on existing Claude Code skills based on usage feedback, test results, or changing requirements. Handles under/over-triggering fixes, instruction refinement, new sub-skill addition, and architecture evolution. Use when user says "improve skill", "fix skill", "skill not triggering", "skill triggers too much", "update skill", or "evolve skill".

2026-03-0658

skill-forge-review.md

from "AgriciDaniel/skill-forge"

Audit and validate existing Claude Code skills for quality, triggering accuracy, structure compliance, and best practices. Scores skills on a 0-100 scale and provides prioritized improvement recommendations. Use when user says "review skill", "audit skill", "check skill", "validate skill", or "skill quality".

2026-03-0658

skill-forge-build.md

from "AgriciDaniel/skill-forge"

Scaffold and build Claude Code skills from plans or descriptions. Generates SKILL.md files, sub-skills, scripts, references, agents, and templates following the Agent Skills standard. Use when user says "build skill", "scaffold skill", "generate skill", "create SKILL.md", or "implement skill".

2026-02-1658

skill-forge-convert.md

from "AgriciDaniel/skill-forge"

Convert Claude Code skills to work on OpenAI Codex, Google Gemini CLI, Google Antigravity, and Cursor. Analyzes platform-specific features, generates target files (openai.yaml, AGENTS.md, GEMINI.md, .mdc rules), adapts frontmatter, converts MCP config, and produces compatibility reports. Use when user says "convert skill", "port skill", "multi-platform", "skill for codex", "skill for gemini", "skill for antigravity", "skill for cursor", "cross-platform skill", "convert to codex", "convert to gemini", "convert to antigravity", or "convert to cursor".

2026-02-1658

package.json

"author": "AgriciDaniel"

"repository": "AgriciDaniel/skill-forge"

打开 GitHub 仓库查看创作者相关仓库

$ install --global

$ download --local

在 Manus 中运行

$ useful --forSOC

软件质量保证分析师与测试员计算机与数学类职业15-1253L4

name

skill-forge-eval

description

Run evaluation pipelines on Claude Code skills to test triggering accuracy, workflow correctness, and output quality. Spawns executor, grader, comparator, and analyzer sub-agents for parallel evaluation. Generates eval_metadata.json, grading.json, and feedback reports. Use when user says "eval skill", "test skill", "run evals", "evaluate skill", "skill evals", "test skill quality", "run skill tests", or "skill evaluation".

Skill Evaluation Pipeline

Run structured evaluations against Claude Code skills to verify triggering, correctness, and quality using a multi-agent pipeline.

Process

Step 1: Define Eval Set

Accept eval definitions from:

Path to eval set JSON: evals/evals.json or user-specified file
Inline prompts: User provides eval queries directly
Auto-generated: Generate from skill description (see Step 1b)

Eval set JSON schema:

{
  "skill_name": "my-skill",
  "skill_path": "./my-skill",
  "evals": [
    {
      "eval_id": 0,
      "eval_name": "descriptive-name",
      "prompt": "The user's task prompt",
      "input_files": [],
      "assertions": [
        {
          "name": "output-has-score",
          "check": "Output contains a numeric score between 0-100",
          "weight": 1.0
        }
      ],
      "should_trigger": true
    }
  ]
}

Step 1b: Auto-Generate Eval Set

If no eval set exists, generate one:

Read the skill's SKILL.md description and instructions
Run python scripts/generate_eval_set.py <skill-path> to produce a starter set
Present the generated set to the user for review and editing
User approves or modifies before proceeding

Step 2: Set Up Workspace

Create the eval workspace outside the skill directory to avoid confusing eval artifacts with skill files. Use a sibling directory or a dedicated location:

eval-workspace/
  iteration-1/
    eval-0/
      eval_metadata.json        # Assertions and config for this eval
      with_skill/
        outputs/                # Skill execution outputs
        timing.json             # Token count + duration
        grading.json            # Assertion results + evidence
      baseline/
        outputs/
        timing.json
        grading.json
    eval-1/
      eval_metadata.json
      with_skill/
        outputs/
        timing.json
        grading.json
      baseline/
        outputs/
        timing.json
        grading.json
    benchmark.json              # Aggregated metrics
    benchmark.md                # Human-readable report

For each eval directory, create eval_metadata.json from the eval set entry:

{
  "eval_id": 0,
  "eval_name": "descriptive-name",
  "prompt": "The user's task prompt",
  "assertions": [...],
  "should_trigger": true
}

Step 3: Execute Eval Runs

For each eval in the set, spawn two parallel runs:

With-skill run (delegate to agents/skill-forge-executor.md):

Execute this task:
- Skill path: <path-to-skill>
- Task: <eval prompt>
- Input files: <eval files if any, or "none">
- Save outputs to: <workspace>/iteration-<N>/eval-<ID>/with_skill/outputs/
- Outputs to save: <what the assertions check>

Baseline run (delegate to agents/skill-forge-executor.md):

For new skills: run without the skill loaded
For improved skills: run with the previous version (snapshot it first)

Save timing data to timing.json in each run directory:

{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}

Step 4: Grade Results

Delegate to agents/skill-forge-grader.md for each completed run:

Grade against assertions defined in eval_metadata.json
Save results to grading.json per run:

{
  "eval_id": 0,
  "run_type": "with_skill",
  "assertions": [
    {
      "name": "output-has-score",
      "passed": true,
      "evidence": "Found score: 87/100 on line 14"
    }
  ],
  "pass_rate": 1.0
}

Step 5: Aggregate and Analyze

Run python scripts/aggregate_benchmark.py <workspace>/iteration-<N> --skill-name <name>
This produces benchmark.json and benchmark.md with:
- Pass rate per eval (with_skill vs baseline)
- Average time and token usage
- Improvement ratio (with_skill / baseline)
Delegate to agents/skill-forge-analyzer.md to:
- Surface patterns that aggregate stats might hide
- Identify consistently failing assertion types
- Flag regressions from previous iterations

Step 6: Present Results

Generate a summary report:

# Eval Report: [skill-name] — Iteration [N]

## Overall
| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | X% | Y% | +Z% |
| Avg Time | Xs | Ys | -Zs |
| Avg Tokens | X | Y | -Z |

## Per-Eval Results
| Eval | With Skill | Baseline | Status |
|------|-----------|----------|--------|
| eval-0 | PASS | FAIL | Improved |
| eval-1 | PASS | PASS | Maintained |

## Patterns & Insights
[From analyzer agent]

## Recommendations
[Specific improvements based on failures]

Step 7: Collect Feedback

Save user feedback to feedback.json:

{
  "reviews": [
    {
      "run_id": "eval-0-with_skill",
      "feedback": "the chart is missing axis labels",
      "timestamp": "2026-03-06T12:00:00Z"
    }
  ],
  "status": "complete"
}

Pass feedback to /skill-forge evolve for the next iteration.

Advanced: Blind Comparison

For rigorous A/B testing between skill versions:

Delegate to agents/skill-forge-comparator.md
Pass two directories: eval-<ID>/with_skill/outputs/ and eval-<ID>/baseline/outputs/
Comparator assigns random labels (Version A / Version B) so it cannot know which is new
Rates each output on assertion criteria from eval_metadata.json
Returns preference scores without knowing which is "new" vs "old"

Error Handling

Executor timeout: If a run exceeds 5 minutes, terminate and mark as "timed_out": true in timing.json
Executor failure: If a run crashes, save the error to error.txt in the run directory and continue with remaining evals
Grading failure: If grading cannot determine pass/fail, mark assertion as "passed": null with evidence explaining why
Missing files: If timing.json or grading.json is missing after a run, flag the eval as incomplete in the report
Partial completion: Always aggregate and report whatever results are available — do not block on one failed eval

Quality Gates

Before marking an eval run as complete:

All evals executed (with_skill + baseline)
Timing data captured for every run
All assertions graded with evidence
Benchmark aggregated with pass rate, time, tokens
Analyzer patterns documented
Results presented to user

skill-forge-eval

同仓库更多 Skills

同仓库更多 Skills

Skill Evaluation Pipeline

Process

Step 1: Define Eval Set

Step 1b: Auto-Generate Eval Set

Step 2: Set Up Workspace

Step 3: Execute Eval Runs

Step 4: Grade Results

Step 5: Aggregate and Analyze

Step 6: Present Results

Step 7: Collect Feedback

Advanced: Blind Comparison

Error Handling

Quality Gates

Skill Evaluation Pipeline

Process

Step 1: Define Eval Set

Step 1b: Auto-Generate Eval Set

Step 2: Set Up Workspace

Step 3: Execute Eval Runs

Step 4: Grade Results

Step 5: Aggregate and Analyze

Step 6: Present Results

Step 7: Collect Feedback

Advanced: Blind Comparison

Error Handling

Quality Gates