agentv

Updated: April 10, 2026, 15:57

Create, run, and manage AgentV evaluations for AI agents and skills using the AgentV CLI and AgentEvals standard EVAL.yaml format. Use this skill whenever the user wants to write evaluation files for AI agents, run evals with agentv CLI, convert existing test cases to EVAL.yaml, set up eval targets, understand the AgentEvals specification, debug failing evaluations, integrate evals into CI/CD pipelines, or compare agent runs with `agentv compare`. Also use this when the user mentions EVAL.yaml, agentevals, agentv, or wants to evaluate skill quality with a declarative format.

Source

Tyler-R-Kendrick

Tyler-R-Kendrick/copilot-auto-training

SKILL.md

name	agentv
description	Create, run, and manage AgentV evaluations for AI agents and skills using the AgentV CLI and AgentEvals standard EVAL.yaml format. Use this skill whenever the user wants to write evaluation files for AI agents, run evals with agentv CLI, convert existing test cases to EVAL.yaml, set up eval targets, understand the AgentEvals specification, debug failing evaluations, integrate evals into CI/CD pipelines, or compare agent runs with `agentv compare`. Also use this when the user mentions EVAL.yaml, agentevals, agentv, or wants to evaluate skill quality with a declarative format.
license	MIT
compatibility	Requires Node.js 18+ with `agentv` globally installed (`npm install -g agentv`). Supports any agent target — Claude, Codex, Copilot, local CLI scripts, or OpenAI-compatible providers.
metadata	{"author":"Tyler Kendrick","version":"0.1.0"}

AgentV Skill

Use this skill to author, run, and manage evaluations for AI agents and skills using the AgentV CLI and the AgentEvals declarative YAML standard.

Read references/eval-yaml-schema.md for the complete EVAL.yaml schema reference before authoring new eval files. Read references/targets.md for target configuration options.

When to use this skill

The user wants to write EVAL.yaml evaluation files for a skill or agent.
The user wants to run evals with agentv eval.
The user wants to convert existing test cases (e.g., evals.json) to EVAL.yaml format.
The user wants to understand the AgentEvals specification or schema.
The user wants to set up or update .agentv/targets.yaml.
The user wants to run agentv compare to detect regressions across runs.
The user wants to integrate AgentV into a CI/CD pipeline.
The user wants to write LLM judges, rubrics, or code graders for evaluation.

Do not use this skill for general prompt engineering unrelated to evaluations, or for creating and improving agent skill SKILL.md files and their structure.

Quickstart

# Install
npm install -g agentv

# Initialize project
agentv init

# Run an eval
agentv eval evals/my-skill.eval.yaml

# Compare runs
agentv compare .agentv/results/runs/<timestamp>/index.jsonl

# Output formats
agentv eval evals/my.yaml -o report.html    # HTML dashboard
agentv eval evals/my.yaml -o results.xml    # JUnit XML for CI
agentv eval evals/my.yaml -o results.jsonl  # JSONL (default)

Core workflow

Follow this order when creating or updating evals:

1. Understand the skill/agent: what does it do, what inputs does it accept, and what outputs should it produce?
2. Set up the target: configure .agentv/targets.yaml to point to the agent being evaluated.
3. Author the EVAL.yaml: write test cases with inputs, expected outputs, and assertions.
4. Add evaluators: choose deterministic assertions (contains, equals, regex, is-json) or LLM judges for subjective quality.
5. Run and inspect: agentv eval produces JSONL output by default; use --output report.html for a visual dashboard.
6. Iterate: improve the skill based on failing tests and re-run.

Converting evals.json to EVAL.yaml

When migrating an existing evals/evals.json file, map the fields as follows:

evals.json field	EVAL.yaml field
`skill_name`	`name` (top-level)
`evals[].id`	`tests[].id`
`evals[].prompt`	`tests[].input`
`evals[].expected_output`	`tests[].criteria`
`evals[].assertions[]`	`tests[].rubrics[]` or `tests[].assert[]`

String-only assertions in evals.json become rubrics (string format) or llm-grader assertions in EVAL.yaml. Preserve both evals.json and EVAL.yaml — the JSON is used by the internal evaluation framework; the YAML is used by AgentV.
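The field mapping above can be sketched as a small conversion script. This is a minimal illustration that assumes the evals.json layout implied by the table; the function name `convert_evals_json` and the sample data are hypothetical, and the sketch only handles string-only assertions (structured ones, which would map to `tests[].assert[]`, are left out).

```python
import json


def convert_evals_json(doc: dict) -> dict:
    """Map an evals.json document onto the EVAL.yaml field names from the table."""
    tests = []
    for case in doc.get("evals", []):
        test = {"id": case["id"], "input": case["prompt"]}
        if "expected_output" in case:
            test["criteria"] = case["expected_output"]
        # String-only assertions become string-format rubrics.
        rubrics = [a for a in case.get("assertions", []) if isinstance(a, str)]
        if rubrics:
            test["rubrics"] = rubrics
        tests.append(test)
    return {"name": doc["skill_name"], "tests": tests}


# Hypothetical sample input, shaped after the mapping table.
sample = {
    "skill_name": "my-skill",
    "evals": [
        {
            "id": "basic-task",
            "prompt": "Do the thing I asked for",
            "expected_output": "Correctly handles the basic task",
            "assertions": ["The output is complete"],
        }
    ],
}

converted = convert_evals_json(sample)
print(json.dumps(converted, indent=2))
```

The resulting dictionary can then be written out with any YAML library (for example PyYAML's `yaml.safe_dump`) and reviewed by hand before it is committed next to the original evals.json.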

EVAL.yaml minimal example

name: my-skill
description: Evaluates my-skill behavior

execution:
  target: default

tests:
  - id: basic-task
    criteria: Correctly handles the basic task
    input: Do the thing I asked for
    expected_output: The expected result
    rubrics:
      - The output is complete
      - The output is correctly formatted

Evaluator types

Type	Use case
`contains`	Output includes a specific string
`equals`	Exact match
`regex`	Output matches a regular expression
`is-json`	Output is valid JSON
`llm-grader`	Subjective quality via a markdown judge prompt
`code-grader`	Custom logic in Python/TypeScript/shell
`rubric`	Structured criteria with optional weights and score ranges
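To make the four deterministic types concrete, here is a rough Python sketch of what each check amounts to. It is not AgentV's implementation: the function `check` is hypothetical, and details such as whitespace handling or whether `regex` must match the whole output are assumptions.

```python
import json
import re


def check(assertion_type: str, output: str, value: str = "") -> bool:
    """Illustrative versions of the deterministic evaluator types."""
    if assertion_type == "contains":
        return value in output
    if assertion_type == "equals":
        return output == value
    if assertion_type == "regex":
        # Assumption: a match anywhere in the output counts.
        return re.search(value, output) is not None
    if assertion_type == "is-json":
        try:
            json.loads(output)
            return True
        except json.JSONDecodeError:
            return False
    raise ValueError(f"unknown assertion type: {assertion_type}")


response = '{"status": "ok", "count": 3}'
print(check("contains", response, "status"))
print(check("regex", response, r'"count":\s*\d+'))
print(check("is-json", response))
print(check("is-json", "not json"))
```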

For subjective quality checks, llm-grader assertions reference a markdown file with the judge prompt:

assert:
  - type: llm-grader
    prompt: ./graders/correctness.md
    threshold: 0.7    # Per-assertion minimum score (0.0–1.0)

Writing judge prompts — use these template variables inside the markdown file:

{{answer}} — the agent's actual response
{{expected_output}} — the expected output from the test case
{{input}} — the original prompt sent to the agent
{{criteria}} — the test case criteria field
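As a rough picture of how these variables end up in the prompt, the sketch below fills a template with the four values. The helper `render_judge_prompt` is hypothetical, and AgentV's own rendering (escaping, whitespace, unknown variables) may differ.

```python
import re


def render_judge_prompt(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders; unknown names are left untouched."""

    def substitute(match: re.Match) -> str:
        return variables.get(match.group(1), match.group(0))

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = (
    "Task: {{input}}\n"
    "Criteria: {{criteria}}\n"
    "Expected: {{expected_output}}\n"
    "Actual: {{answer}}"
)
rendered = render_judge_prompt(
    template,
    {
        "input": "Do the thing I asked for",
        "criteria": "Correctly handles the basic task",
        "expected_output": "The expected result",
        "answer": "Here is the result",
    },
)
print(rendered)
```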

Threshold vs. --threshold: The threshold field on an llm-grader assertion is per-assertion (minimum score to pass). The --threshold CLI flag sets a minimum overall suite pass rate for CI gating. These are independent.
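The difference between the two thresholds is easier to see with numbers. The sketch below is illustrative only: both helper functions are hypothetical, and treating the comparison as inclusive (`>=`) is an assumption.

```python
def assertion_passes(score: float, threshold: float) -> bool:
    """Per-assertion gate: one grader score against its own threshold."""
    return score >= threshold


def suite_passes(test_results: list, suite_threshold: float) -> bool:
    """Suite gate: fraction of passing tests against the CLI threshold."""
    if not test_results:
        return False
    return sum(test_results) / len(test_results) >= suite_threshold


# Three tests, each graded by one llm-grader assertion with threshold 0.7.
scores = [0.9, 0.75, 0.6]
results = [assertion_passes(s, 0.7) for s in scores]
print(results)

# Two of three tests pass, which is below a suite threshold of 0.8.
print(suite_passes(results, 0.8))
```

In this example, lowering the suite threshold would let the run pass without changing any individual verdict, and raising the per-assertion threshold would change verdicts without touching the suite gate.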

Rubrics

Use rubrics when you need weighted or structured grading criteria:

rubrics:
  - id: completeness
    outcome: Response addresses all parts of the request
    weight: 2
    required: true
  - id: quality
    outcome: Response is clear and well-structured
    weight: 1
    score_ranges:
      0: Unreadable or missing
      5: Partially addressed
      10: Complete and clear
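One plausible reading of `weight` and `required` is a weighted average plus a hard gate on required items. The sketch below works through that reading for the two rubrics above; it is an assumption about the semantics, not AgentV's documented scoring, and it treats each rubric as simply met or not met rather than using `score_ranges`.

```python
def grade_rubrics(rubrics: list, met: dict) -> tuple:
    """Return (weighted score in 0..1, whether all required rubrics were met)."""
    total = sum(r.get("weight", 1) for r in rubrics)
    earned = sum(r.get("weight", 1) for r in rubrics if met.get(r["id"], False))
    required_ok = all(met.get(r["id"], False) for r in rubrics if r.get("required"))
    score = earned / total if total else 0.0
    return score, required_ok


rubrics = [
    {"id": "completeness", "weight": 2, "required": True},
    {"id": "quality", "weight": 1},
]

# Complete but poorly structured: 2 of 3 weight points, required item met.
score_a, ok_a = grade_rubrics(rubrics, {"completeness": True, "quality": False})
# Well structured but incomplete: 1 of 3 weight points, required item missed.
score_b, ok_b = grade_rubrics(rubrics, {"completeness": False, "quality": True})
print(round(score_a, 3), ok_a)
print(round(score_b, 3), ok_b)
```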

Suite-level assertions

Use top-level assert: to apply evaluators to all test cases:

assert:
  - type: is-json          # All responses must be valid JSON
  - type: llm-grader
    prompt: ./graders/quality.md

tests:
  - id: test-1
    input: ...
    assert:
      - type: contains     # Per-test assertion (merged with suite-level)
        value: "status"
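The merge described in the comment above can be pictured as list concatenation. This is a sketch under the assumption that suite-level assertions come first and nothing is deduplicated; the helper name is hypothetical.

```python
def effective_assertions(suite_level: list, per_test: list) -> list:
    """Assumption: suite-level assertions are applied first, then per-test ones."""
    return [*suite_level, *per_test]


suite = [
    {"type": "is-json"},
    {"type": "llm-grader", "prompt": "./graders/quality.md"},
]
test_1 = [{"type": "contains", "value": "status"}]

merged = effective_assertions(suite, test_1)
print([a["type"] for a in merged])
```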

File references

Test inputs and judge prompts can reference files:

tests:
  - id: file-input-test
    input:
      - role: user
        content:
          - type: file
            path: ./fixtures/sample.txt
    criteria: Correctly processes the document

Set up targets

Configure .agentv/targets.yaml to point to your agent. Run agentv init to create a starter file.

# .agentv/targets.yaml
default:
  type: openai
  model: gpt-4o
  api_key: ${OPENAI_API_KEY}   # Use env vars — never hardcode keys

claude:
  type: anthropic
  model: claude-opus-4-5
  api_key: ${ANTHROPIC_API_KEY}

Read references/targets.md for all target types (OpenAI, Anthropic, local CLI, HTTP endpoints).

For CI, set the API key as a repository secret and inject it into the environment:

# GitHub Actions
- name: Run evals
  env:
    OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}
  run: agentv eval evals/*.yaml

# Run with exit code on failures (for CI gates)
agentv eval evals/*.yaml --exit-on-failure

# JUnit output for GitHub Actions / Jenkins
agentv eval evals/*.yaml -o results.xml

# Threshold gate: fail if pass rate drops below 80%
agentv eval evals/*.yaml --threshold 0.8

Comparing runs

# After two runs, compare results
agentv compare .agentv/results/runs/run-1/index.jsonl .agentv/results/runs/run-2/index.jsonl

# Compare latest two runs automatically
agentv compare --last 2

Sample comparison output:

Comparison: run-1 vs run-2
────────────────────────────────────────
Test                  run-1   run-2   Δ
────────────────────────────────────────
detect-off-by-one     pass    pass    =
no-bug-present        fail    pass    ↑
security-vuln         pass    pass    =
────────────────────────────────────────
Pass rate             66.7%   100%    +33.3%

Improvements (↑) and regressions (↓) are highlighted. Use this to verify a skill change improves eval quality without introducing regressions.
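The comparison logic behind a table like the sample can be sketched in a few lines. The record layout of index.jsonl is not described here, so the sketch uses plain dictionaries of test id to pass/fail as stand-ins; `compare_runs` and `pass_rate` are hypothetical helpers, not part of the AgentV CLI.

```python
def compare_runs(baseline: dict, candidate: dict) -> list:
    """Return (test_id, before, after, marker) rows for tests present in both runs."""
    rows = []
    for test_id in sorted(set(baseline) & set(candidate)):
        before, after = baseline[test_id], candidate[test_id]
        if before == after:
            marker = "="
        elif after:
            marker = "↑"  # improvement: fail -> pass
        else:
            marker = "↓"  # regression: pass -> fail
        rows.append((test_id, before, after, marker))
    return rows


def pass_rate(run: dict) -> float:
    """Percentage of passing tests in a run."""
    return 100 * sum(run.values()) / len(run)


run_1 = {"detect-off-by-one": True, "no-bug-present": False, "security-vuln": True}
run_2 = {"detect-off-by-one": True, "no-bug-present": True, "security-vuln": True}

rows = compare_runs(run_1, run_2)
for test_id, before, after, marker in rows:
    print(f"{test_id:<20} {'pass' if before else 'fail'}  {'pass' if after else 'fail'}  {marker}")
print(f"Pass rate: {pass_rate(run_1):.1f}% -> {pass_rate(run_2):.1f}%")
```

A CI job could treat any `↓` row as a failure even when the overall pass rate goes up, which is the regression case the comparison is meant to catch.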

Project structure

project/
├── evals/
│   ├── my-skill.eval.yaml      # AgentV eval definitions
│   └── graders/                # LLM judge markdown prompts
│       └── correctness.md
├── .agentv/
│   ├── targets.yaml            # Agent/model targets
│   └── results/                # Run outputs (JSONL)
└── ...

Skills integration pattern

When evaluating skills in this repository, each skill keeps both formats:

skills/my-skill/
└── evals/
    ├── evals.json       # Internal eval framework format
    └── EVAL.yaml        # AgentV-compatible format

Both files cover the same test cases. evals.json drives the internal skill-creator benchmarking workflow; EVAL.yaml enables agentv eval CLI execution and CI integration.