Name: Skill Tester
Author: pavel-molyanov

// Test skills end-to-end: design test cases, run with/without skill, grade results, test description triggering accuracy, produce improvement report. Use when: "протестируй скилл", "запусти тесты для скилла", "проверь скилл", "run skill tests", "test this skill", "skill eval", "оцени скилл", "придумай тесты для скилла", "создай сценарии тестирования"


stars:218

forks:66

updated: 2026-03-23 03:32


related-skills.json

Same repository

project-knowledge.md

from "pavel-molyanov/molyanov-ai-dev"

Use when you need information about this project's architecture, tech stack, coding patterns, data model, deployment setup, git workflow, or UX guidelines. Contains comprehensive project documentation including design decisions, technical specifications, and development standards.

2026-03-23 · 218 stars

deploy-pipeline.md

from "pavel-molyanov/molyanov-ai-dev"

Sets up CI/CD pipelines, deployment configuration, and automated deploy workflows. GitHub Actions, platform-specific deploy (Vercel, Railway, Fly.io, AWS, VPS), secrets management in CI. Use when: "подготовь деплой", "настрой автодеплой", "настрой CI/CD", "setup deploy", "configure deployment", "настрой пайплайн"

2026-03-23 · 218 stars

feature-execution.md

from "pavel-molyanov/molyanov-ai-dev"

Orchestrate feature delivery as team lead: spawn agents by wave, manage review cycles (max 3 rounds), commit per wave. Use when: "выполни фичу", "do feature", "execute feature", "запусти фичу", "выполни все задачи", "execute all tasks"

2026-03-23 · 218 stars

methodology.md

from "pavel-molyanov/molyanov-ai-dev"

AI-First development methodology: spec-driven pipeline, project structure, skills/agents ecosystem, quality gates. Use when: "изучи методологию", "изучи глобальную папку", "как работает методология", "how does the methodology work", "explain the workflow" For infrastructure tasks, use infrastructure-setup or deploy-pipeline skills.

2026-03-23 · 218 stars

project-planning.md

from "pavel-molyanov/molyanov-ai-dev"

Plan new projects: adaptive interview, tech decisions, fill all project documentation (project-knowledge) in one session. Use when: "сделай описание проекта", "запиши описание проекта в документацию", "проведи со мной интервью для описания проекта", "заполни документацию проекта", "начни планирование проекта", "давай опишем проект", "plan a new project", "fill project documentation"

2026-03-23 · 218 stars

prompt-master.md

from "pavel-molyanov/molyanov-ai-dev"

Guide for writing effective prompts for LLMs. Use when: "напиши промпт", "улучши промпт", "prompt engineering", "проверь промпт"

2026-03-23 · 218 stars

package.json

"author": "pavel-molyanov"

"repository": "pavel-molyanov/molyanov-ai-dev"


SOC occupation: Software Quality Assurance Analysts and Testers · Computer and Mathematical Occupations · 15-1253 · L4

name

skill-tester

description

Test skills end-to-end: design test cases, run with/without skill, grade results, test description triggering accuracy, produce improvement report. Use when: "протестируй скилл", "запусти тесты для скилла", "проверь скилл", "run skill tests", "test this skill", "skill eval", "оцени скилл", "придумай тесты для скилла", "создай сценарии тестирования"

Skill Tester

Design tests, run them, grade the results, evaluate description triggering, and produce an actionable report, all in one workflow. No separate test-design step is needed.

You are the team lead, test designer, user actor, and analyst, all in one.

Phase 1: Understand & Design

1a. Read the target skill

User provides skill name or path
Read the target skill's SKILL.md + ALL referenced files completely
Map out:
- Skill type: procedural / informational
- Input: what does the skill expect? (user message, task file, structured data)
- Output: what should the skill produce? (files, messages, actions, decisions)
- Phases: list all phases/steps with their checkpoints
- References: list all files the skill tells agents to read
- Decision points: where does the skill branch based on input?
- Dialogue points: where does the skill ask the user questions?

1b. Design test prompts

Design test prompts applying criteria from test-design-guide.md (realistic prompts, assertion design, persona setup):

Propose 2-3 test prompts:
- 1 happy-path: the most common, standard use case
- 1-2 edge cases: where the skill might break or behave unexpectedly
For each prompt, propose assertions — binary, observable checks. Categories:
- [Process]: Did the agent follow the skill's workflow?
- [Outcome]: Is the result correct and complete?
- [Compliance]: Did the agent obey all skill instructions?
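The shape of a test plan from 1b can be sketched as a small data structure. This is only an illustration: the class names `Assertion` and `PromptCase` and the helper `validate_plan` are hypothetical and do not come from test-design-guide.md.

```python
from dataclasses import dataclass, field

# The three assertion categories named in step 1b.
CATEGORIES = {"Process", "Outcome", "Compliance"}


@dataclass
class Assertion:
    category: str  # one of CATEGORIES
    check: str     # a binary, observable statement

    def __post_init__(self):
        if self.category not in CATEGORIES:
            raise ValueError(f"unknown category: {self.category}")


@dataclass
class PromptCase:
    kind: str    # "happy-path" or "edge-case"
    prompt: str  # the task, written as a user would write it
    assertions: list = field(default_factory=list)


def validate_plan(prompts):
    """Return a list of problems; an empty list means the plan matches 1b."""
    problems = []
    if not 2 <= len(prompts) <= 3:
        problems.append("expected 2-3 test prompts")
    if sum(p.kind == "happy-path" for p in prompts) != 1:
        problems.append("expected exactly 1 happy-path prompt")
    for p in prompts:
        if not p.assertions:
            problems.append(f"prompt has no assertions: {p.prompt!r}")
    return problems
```

Running `validate_plan` before presenting the plan in 1d is one way to catch a prompt that was left without assertions.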

1c. Design trigger eval queries

Design trigger eval queries applying patterns from trigger-eval-guide.md:

Generate 15-20 trigger eval queries:
- 8-10 should-trigger: varied phrasings of the skill's intended use
- 8-10 should-not-trigger: near-misses that share keywords but need something different
These will be used in Phase 4 to test description accuracy
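A quick sanity check on the query set might look like the sketch below. The record layout (`query`, `should_trigger`) and the function name `check_trigger_queries` are assumptions for illustration; trigger-eval-guide.md may prescribe a different format.

```python
def check_trigger_queries(queries):
    """Check counts against step 1c.

    queries: list of {"query": str, "should_trigger": bool} dicts (assumed layout).
    Returns a list of problems; empty means the set is balanced as described.
    """
    positives = sum(1 for q in queries if q["should_trigger"])
    negatives = len(queries) - positives
    problems = []
    if not 15 <= len(queries) <= 20:
        problems.append(f"expected 15-20 queries, got {len(queries)}")
    if not 8 <= positives <= 10:
        problems.append(f"expected 8-10 should-trigger queries, got {positives}")
    if not 8 <= negatives <= 10:
        problems.append(f"expected 8-10 should-not-trigger queries, got {negatives}")
    return problems
```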

1d. Confirm with user

Present the full test plan:

Test prompts with assertions
Trigger eval queries
Proposed model for runners
Persona (use default, modify only if user requests)

"Here are the test cases and trigger queries I've prepared. Do these look right, or do you want to adjust anything?"

Wait for confirmation before proceeding.

Checkpoint: User confirmed test plan. All prompts have assertions. Trigger eval queries prepared.

Phase 2: Execute Tests

2a. Setup

Create workspace: ~/.claude/skill-tests/{skill-name}/iteration-{N}/ (N = 1 for first run, increment for re-runs)
Save test plan to workspace:
- evals.json with all prompts and assertions
- trigger-evals.json with all trigger queries
TeamCreate(team_name="skill-test-{skill-name}")
Plan runners: each scenario gets 2 with-skill runners and 1 baseline runner without the skill

Show plan to user: "I'll run {N} scenarios, {M} runners total. Model: {model}. Proceed?"
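The workspace setup in 2a can be sketched as follows. The helper names `next_iteration_dir` and `save_plan` are hypothetical, the JSON layout of the two files is an assumption, and the demo uses a temporary directory in place of `~/.claude/skill-tests`.

```python
import json
import tempfile
from pathlib import Path


def next_iteration_dir(root: Path, skill_name: str) -> Path:
    """Create root/{skill_name}/iteration-{N}, N being one past the highest existing run."""
    skill_dir = root / skill_name
    skill_dir.mkdir(parents=True, exist_ok=True)
    numbers = []
    for path in skill_dir.glob("iteration-*"):
        suffix = path.name[len("iteration-"):]
        if path.is_dir() and suffix.isdigit():
            numbers.append(int(suffix))
    workspace = skill_dir / f"iteration-{max(numbers, default=0) + 1}"
    workspace.mkdir()
    return workspace


def save_plan(workspace: Path, evals: dict, trigger_evals: dict) -> None:
    """Write the confirmed test plan; the JSON structure is an assumption."""
    for name, data in (("evals.json", evals), ("trigger-evals.json", trigger_evals)):
        text = json.dumps(data, indent=2, ensure_ascii=False)
        (workspace / name).write_text(text, encoding="utf-8")


if __name__ == "__main__":
    # A temporary directory stands in for ~/.claude/skill-tests in this demo.
    with tempfile.TemporaryDirectory() as tmp:
        ws = next_iteration_dir(Path(tmp), "demo-skill")
        save_plan(ws, {"prompts": []}, {"queries": []})
        print(ws.name)
```

Writing with `ensure_ascii=False` keeps non-English trigger phrases readable in the saved files.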

2b. Spawn runners

For each scenario, spawn all runners in parallel:

With-skill runners (2 per scenario):

Prompt = scenario's task prompt (natural, as user would write)
Each runner loads the tested skill: Skill(skill="{tested-skill-name}")
Model: as confirmed with user
Use run_in_background: true

Baseline runner (1 per scenario):

Same task prompt, same model
Receives no skill to load
Use run_in_background: true

Save each runner's task_id — needed for grader agents to retrieve transcripts.

Scenarios run sequentially. Runners within a scenario run in parallel.

2c. Interact as user persona

If runners send questions, answer in character per the scenario's persona. Rules:

Stay in character: answer as the user would
Be consistent: same question from different runners → same answer
Answer naturally — without guidance toward any specific behavior
Keep conversation purely about the task itself
Baseline runner may ask different questions (no skill to guide it) — this is expected, answer them too

2d. Capture timing data

When each runner completes, immediately save timing data:

{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}

This is the only opportunity to capture this data — it comes through the task notification and isn't persisted elsewhere. Process each notification as it arrives rather than trying to batch them.

Save to timing.json in each runner's result directory.
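A minimal sketch of persisting one runner's timing data, assuming the notification exposes a token count and a duration in milliseconds. The function name `save_timing` and the way those two values arrive are assumptions; only the three field names come from the example above.

```python
import json
import tempfile
from pathlib import Path


def save_timing(result_dir: Path, total_tokens: int, duration_ms: int) -> dict:
    """Write timing.json into a runner's result directory and return the record."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Seconds are derived from milliseconds, rounded to one decimal place.
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    result_dir.mkdir(parents=True, exist_ok=True)
    (result_dir / "timing.json").write_text(json.dumps(timing, indent=2), encoding="utf-8")
    return timing


if __name__ == "__main__":
    # Demo in a temporary directory; real runs would use the iteration workspace.
    with tempfile.TemporaryDirectory() as tmp:
        print(save_timing(Path(tmp) / "runner-1", 84852, 23332))
```

Calling this once per notification, as each one arrives, matches the advice not to batch them.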

Checkpoint: All runners completed. Timing data captured.

Phase 3: Grade & Analyze

3a. Grade via grader agents

When all runners for a scenario finish, spawn grader agents, one per runner. Transcripts are large, and reading them directly would exhaust the lead agent's context and leave no room for report compilation, so delegate all transcript analysis to the graders.

Each grader receives instructions from grading-guide.md and:

The runner's task_id (grader calls TaskOutput(task_id) for transcript)
The scenario's assertions (copy the criteria list into the prompt)
The skill's SKILL.md path (grader reads it for compliance check)
Whether this is a skill-runner or baseline

Spawn all graders in parallel. Wait for all to return.

3b. Compile results per scenario

Using only grader outputs (not transcripts):

Build results table (assertions × runners)
Cross-runner consistency: where did skill-runners diverge?
- Divergence on a criterion = ambiguous instruction in the skill
Baseline comparison:
- Passed by skill-runners ONLY → skill adds value
- Passed by ALL → criterion too easy or skill doesn't help here
- Failed by ALL → criterion may be unrealistic
- Passed by baseline ONLY → skill might be harmful for this case
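The baseline comparison above amounts to a small decision table. The sketch below is one reading of it: "passed by skill-runners" is taken to mean that every with-skill runner passed, and mixed results are reported as divergence. The function name `classify_assertion` is hypothetical.

```python
def classify_assertion(skill_results, baseline_passed):
    """Interpret one assertion's results for a scenario.

    skill_results: list of bools, one per with-skill runner.
    baseline_passed: bool for the single baseline runner.
    """
    all_skill = all(skill_results)
    any_skill = any(skill_results)
    if any_skill and not all_skill:
        # Skill-runners disagree: treated as an ambiguous instruction.
        return "skill-runners diverged: likely ambiguous instruction"
    if all_skill and not baseline_passed:
        return "skill adds value"
    if all_skill and baseline_passed:
        return "criterion too easy or skill doesn't help here"
    if baseline_passed:
        return "skill might be harmful for this case"
    return "criterion may be unrealistic"
```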

Clean up runners for this scenario before moving to the next one.

3c. Benchmark aggregation

Across all scenarios, compute:

Pass rate per assertion per config (with-skill / baseline)
Timing comparison: tokens and duration per config
Overall skill value: how many assertions improve vs baseline
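Pass rates per assertion and config can be aggregated from flat grading records, as in this sketch. The record format `(assertion, config, passed)` and the function names are assumptions; grader output in practice may be structured differently.

```python
from collections import defaultdict


def pass_rates(records):
    """Compute pass rate per (assertion, config).

    records: iterable of (assertion, config, passed) tuples, where config is
    "with-skill" or "baseline" (assumed labels).
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for assertion, config, passed in records:
        counts[(assertion, config)][0] += int(passed)
        counts[(assertion, config)][1] += 1
    return {key: passed / total for key, (passed, total) in counts.items()}


def improved_assertions(rates):
    """Assertions whose with-skill pass rate beats the baseline pass rate."""
    names = {assertion for assertion, _ in rates}
    return sorted(
        a for a in names
        if rates.get((a, "with-skill"), 0.0) > rates.get((a, "baseline"), 0.0)
    )
```

The length of the list returned by `improved_assertions` is one simple measure of overall skill value.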

3d. Analyst pass

Surface patterns the aggregate stats might hide:

Non-discriminating assertions: pass regardless of whether skill is used. These don't prove the skill helps — consider removing or replacing with harder assertions.
High-variance assertions: one skill-runner passes and the other fails on the same criterion. This usually means the skill's instruction is ambiguous. Identify the specific instruction and quote it.
Time/token tradeoffs: skill adds value but costs 2x tokens? Flag it. The user should know the cost of improvement.
Repeated code in transcripts: if multiple runners independently wrote similar helper scripts, flag this as a candidate for bundling in the skill's scripts/ directory.
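The time/token tradeoff check can be sketched from the saved timing records. The 2x threshold mirrors the example in the list above; the function names and the idea of comparing per-config means are assumptions.

```python
from statistics import mean


def cost_ratio(skill_timings, baseline_timings, key="total_tokens"):
    """Mean cost of with-skill runs divided by mean cost of baseline runs.

    Each timing is a dict shaped like timing.json; key may also be "duration_ms".
    """
    return mean(t[key] for t in skill_timings) / mean(t[key] for t in baseline_timings)


def flag_tradeoff(skill_timings, baseline_timings, threshold=2.0):
    """Return a human-readable flag for each cost dimension at or above the threshold."""
    flags = []
    for key in ("total_tokens", "duration_ms"):
        ratio = cost_ratio(skill_timings, baseline_timings, key)
        if ratio >= threshold:
            flags.append(f"{key}: with-skill runs cost {ratio:.1f}x baseline")
    return flags
```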

Checkpoint: All scenarios graded. Benchmark computed. Analyst observations recorded.

Phase 4: Test Description Triggering

4a. Evaluate trigger accuracy

For each trigger eval query from trigger-evals.json:

Assess whether the skill's current description would cause Claude to invoke the skill for this query
Consider: does the query's intent match the description's keywords and contexts? Would Claude see this as the skill's domain?

Categorize each query:

True positive: should-trigger → would trigger
True negative: should-not-trigger → would not trigger
False negative: should-trigger → would NOT trigger (undertriggering)
False positive: should-not-trigger → would trigger (overtriggering)

4b. Calculate trigger accuracy

Trigger accuracy = (true positives + true negatives) / total queries
False negative rate = false negatives / should-trigger queries
False positive rate = false positives / should-not-trigger queries

False negatives (undertriggering) are the most costly — users won't discover the skill exists. False positives waste time but are less harmful.

4c. Suggest improved description

If trigger accuracy < 85% or false negative rate > 20%:

Analyze which queries fail and why
Draft an improved description that would trigger correctly
Show before/after comparison in the report
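The formulas in 4b and the thresholds in 4c can be combined in one helper, sketched below. The input format (pairs of expected and assessed outcomes) and the name `trigger_metrics` are assumptions.

```python
def trigger_metrics(results):
    """Compute trigger accuracy metrics.

    results: list of (should_trigger, would_trigger) bool pairs, one per query.
    """
    if not results:
        raise ValueError("no trigger eval results")
    tp = sum(1 for s, w in results if s and w)
    tn = sum(1 for s, w in results if not s and not w)
    fn = sum(1 for s, w in results if s and not w)
    fp = sum(1 for s, w in results if not s and w)
    accuracy = (tp + tn) / len(results)
    fn_rate = fn / (tp + fn) if tp + fn else 0.0
    fp_rate = fp / (tn + fp) if tn + fp else 0.0
    return {
        "accuracy": accuracy,
        "false_negative_rate": fn_rate,
        "false_positive_rate": fp_rate,
        # Thresholds from 4c: accuracy below 85% or false negative rate above 20%.
        "needs_new_description": accuracy < 0.85 or fn_rate > 0.20,
    }
```

For example, with 10 should-trigger queries of which 7 would trigger and 10 should-not-trigger queries of which 1 would trigger, accuracy is 16/20 and the false negative rate is 3/10, so an improved description is called for.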

Checkpoint: Trigger accuracy calculated. Description improvement suggested if needed.

Phase 5: Report

Structure the report according to report-template.md.

The report includes:

Results per scenario — assertions × runners table with evidence
Skill compliance — phase-by-phase execution check
Benchmark summary — pass_rate, tokens, time per config
Analyst observations — non-discriminating, high-variance, cost analysis
Description trigger accuracy — accuracy metrics + suggested improvement
Scripts to bundle — if repeated code found across transcripts
Recommendations — priority-ordered specific fixes for skill-master

Save to: ~/.claude/skill-tests/{skill-name}/reports/{timestamp}-report.md

Show report to user: "Here's the test report. Key findings: [summary]. The report is at [path] — you can share it with skill-master to apply fixes."

TeamDelete after report delivery.

Improving the Skill (Iteration)

If the user wants to iterate after receiving the report:

User (or skill-master) applies fixes to the skill
Run skill-tester again → results go to iteration-{N+1}/
Previous iteration results are available for comparison
Report shows delta: what improved, what regressed
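The delta between two iterations can be sketched as a comparison of with-skill pass rates per assertion. The input format and the name `iteration_delta` are assumptions, and assertions that exist in only one iteration are listed separately rather than counted as improved or regressed.

```python
def iteration_delta(previous, current):
    """Compare two iterations.

    previous, current: {assertion: with-skill pass rate} for each iteration.
    """
    shared = previous.keys() & current.keys()
    return {
        "improved": sorted(a for a in shared if current[a] > previous[a]),
        "regressed": sorted(a for a in shared if current[a] < previous[a]),
        "unchanged": sorted(a for a in shared if current[a] == previous[a]),
        # Assertions added or dropped between iterations are not comparable.
        "not_comparable": sorted(previous.keys() ^ current.keys()),
    }
```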

When iterating, keep these principles in mind:

Generalize from feedback: resist fiddly changes targeted at specific test cases. If a skill works only for its test cases, it's useless at scale.
Keep the prompt lean: read transcripts. If the skill makes the model waste time doing unproductive things, remove those parts.
Explain the why: rather than adding rigid ALWAYS/NEVER rules, explain reasoning so the model understands the intent.
