| name | skill-eval |
| description | Run the agentic evaluation repo for a target skill. Use when
asked to execute repo-defined suites, collect evidence, write per-case
results, and produce a short audit report for the target skill.
Also supports evaluation modes: AB tests, subjective scoring, and
vendor comparisons via evaluation YAML files in the eval repo.
|
Skill Eval
Use this skill only for the evaluator flow.
This skill does not define test truth.
The eval repo defines the targets, suites, cases, assertions, evaluations, rubrics, statuses, and report contract.
This skill is a dynamic black-box evaluator.
Do not replace execution with a static read-through when a fresh-agent run is possible.
File Responsibilities
Use the docs in this order and keep their roles separate:
- agentic-evals/AGENT.md: canonical repo contract for any evaluator agent. Read this first for run outputs, statuses, assertion semantics, isolation rules, evaluation modes, and report shape.
- agentic-evals/docs/session-evidence.md: required local evidence contract for locating child sessions and extracting evidence from sessions/*.jsonl.
- agentic-evals/targets/<target_id>/target.yaml: target-specific contract, including entry skill, roots, default suites, and allowed statuses.
- agentic-evals/targets/<target_id>/cases/<suite_id>/suite.yaml: selected runnable suite definitions.
- agentic-evals/targets/<target_id>/cases/<suite_id>/<case_id>.yaml: per-case prompts, setup, and assertions.
- agentic-evals/evaluations/<mode>/<eval_id>.yaml: evaluation definitions for AB, comparison, or subjective modes.
- agentic-evals/rubrics/<rubric_id>.yaml: scoring dimensions and anchors for rubric-based evaluation.
- skill-eval/SKILL.md: how this evaluator skill acquires the repo, creates isolated workspaces, spawns fresh agents, locates child sessions, validates isolation, and writes the repo-defined artifacts.
Do not duplicate repo contract rules from AGENT.md unless this skill needs an extra operational constraint.
Required Inputs
- optional path to the test repo
- target_id, or permission to use the repo default
- selected suite names, case ids, or permission to use the defaults
- path or revision of the target skill if the user provided one
- optional run mode: single-run or ab-urls
- for ab-urls: variant_a_url and variant_b_url, both GitHub HTTP URLs to the target skill version
If the user does not provide a test repo path, the evaluator must first look for a local
agentic-evals folder in the current workspace and clone the default repo only if that
folder is missing.
Default test repo:
- folder name: agentic-evals
- clone URL: https://github.com/Jiayi-Ye02/agentic-evals.git
For ab-urls, recommended variant URL formats are:
- https://github.com/<org>/<repo>/tree/<ref>/<skill-dir>
- https://github.com/<org>/<repo>/blob/<ref>/<skill-dir>/SKILL.md
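As a rough illustration of the two recommended formats, the sketch below splits a variant URL into its parts. It is not the repo's parse_github_skill_url.py: the function name parse_skill_url and the returned keys are hypothetical, and it assumes `<ref>` is a single path segment, since a ref that contains a slash is ambiguous in these URL shapes.

```python
import re
from typing import Optional

# Hypothetical pattern for the two recommended URL shapes.
# Assumption: <ref> is a single path segment (no "/" inside the ref).
_SKILL_URL = re.compile(
    r"^https://github\.com/(?P<org>[^/]+)/(?P<repo>[^/]+)/"
    r"(?P<kind>tree|blob)/(?P<ref>[^/]+)/(?P<path>.+)$"
)


def parse_skill_url(url: str) -> Optional[dict]:
    """Return org, repo, ref, and skill_dir, or None if the URL does not match."""
    match = _SKILL_URL.match(url.strip())
    if match is None:
        return None
    path = match.group("path").rstrip("/")
    if match.group("kind") == "blob":
        # blob URLs point at <skill-dir>/SKILL.md; strip the file name.
        if not path.endswith("/SKILL.md"):
            return None
        path = path[: -len("/SKILL.md")]
    return {
        "org": match.group("org"),
        "repo": match.group("repo"),
        "ref": match.group("ref"),
        "skill_dir": path,
    }
```

A URL that matches neither shape yields None, which an evaluator could treat as a reason to ask the user for a corrected URL before preparing any variant workspace.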
Runtime Support
skill-eval supports Codex CLI runtimes that can create real child sessions and persist
local session evidence.
Codex CLI support has been validated locally with codex-cli 0.118.0 where:
- multi_agent is enabled
- a parent Codex session can create a child session successfully
- the child session is recorded in ~/.codex/sessions/...jsonl
- the parent-child edge is recorded in ~/.codex/state_5.sqlite thread_spawn_edges
Operational constraints for Codex CLI runs:
- case execution still must happen through spawn_agent with fork_context: false
- do not replace case execution with codex exec
- codex exec is acceptable only as an environment smoke test for child-session creation
- ~/.codex must be writable so the session store and state database can update normally
- authentication and network access must allow a normal live Codex session to complete
Non-Negotiables
- Before any test evaluation, check whether a local agentic-evals folder already exists.
- If the test repo folder does not exist, clone https://github.com/Jiayi-Ye02/agentic-evals.git before doing anything else.
- Read agentic-evals/AGENT.md before running any case.
- Resolve target_id before selecting cases. Use the user-provided target_id when available. Otherwise, use the repo default target.
- Read agentic-evals/targets/<target_id>/target.yaml before selecting cases.
- Read the selected suite files and case files before executing cases.
- Create one brand-new isolated workspace for every case attempt under a temp parent directory. Never execute a case in the user's main workspace.
- Execute each case by running a fresh Codex sub-agent on the case prompt with spawn_agent and fork_context: false.
- Do not use codex exec, terminal wrappers, or any other fallback executor for case execution.
- If spawn_agent is unavailable or agent creation fails, stop the evaluation immediately and report the failure reason instead of continuing.
- If the runtime cannot write normal Codex session artifacts under ~/.codex, stop and report an environment block instead of continuing with degraded evidence.
- After each successful spawn_agent, immediately report the sub-agent nickname in the main thread so the user can find and open it in the Codex app. If no nickname is available, report the agent id.
- Send the case input.user_prompt to the fresh agent verbatim. Do not paraphrase the user request.
- Do not leak the case title, assertions, expected route, intended answer, or your prior judgment into the fresh-agent prompt.
- The fresh sub-agent is the execution subject, not the judge. Do not ask it to grade the case, interpret the assertions, or decide pass or fail.
- Do not ask the fresh sub-agent to self-report TRACE_FILES_READ, TRACE_COMMANDS_EXECUTED, or any other evaluator-facing execution log.
- Judge from accepted child session evidence, not from fresh-agent self-reporting.
- Do not invent pass or fail rules outside the repo.
- Do not mark pass from a generic self-report alone.
- Do not mark pass from a static source review alone when a fresh-agent run was available.
- Treat any attempt that observably reads or executes outside its case workspace as invalid evidence. Do not judge the case from that attempt.
- If a case cannot be judged reliably, mark it blocked.
- On clone failure, report the error and stop. Do not silently continue without the test repo.
- In ab-urls mode, do not mutate or swap the user's local target skill tree. Prepare one isolated source workspace per variant URL.
- In ab-urls mode, compare only the same selected case set across A and B.
Workflow
Step 1: Acquire the test repo
Resolve the test repo path in this order:
- If the user provided a repo path, use it.
- Otherwise, look for a local folder named agentic-evals in the current workspace.
- If that folder does not exist, run:
git clone --depth 1 https://github.com/Jiayi-Ye02/agentic-evals.git
Do not continue until the repo is present locally. If the clone fails, report the error and stop.
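The resolution order above can be sketched as a small planning function. This is only an illustration: plan_repo_acquisition and its return shape are hypothetical, and passing an explicit destination folder to git clone is a choice made here, not something the repo contract requires.

```python
from pathlib import Path
from typing import List, Optional, Tuple

DEFAULT_FOLDER = "agentic-evals"
DEFAULT_CLONE_URL = "https://github.com/Jiayi-Ye02/agentic-evals.git"


def plan_repo_acquisition(
    workspace: Path, user_repo_path: Optional[Path] = None
) -> Tuple[Path, Optional[List[str]]]:
    """Return (repo_path, clone_command); clone_command is None when no clone is needed."""
    # 1. A user-provided path always wins.
    if user_repo_path is not None:
        return user_repo_path, None
    # 2. Otherwise reuse a local agentic-evals folder in the current workspace.
    local = workspace / DEFAULT_FOLDER
    if local.is_dir():
        return local, None
    # 3. Only clone when that folder is missing.
    return local, ["git", "clone", "--depth", "1", DEFAULT_CLONE_URL, str(local)]
```

When a clone command is returned, the caller could run it with subprocess.run(cmd, check=True) and treat a raised CalledProcessError as the clone failure that stops the evaluation.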
Step 2: Load the repo contract
Read:
- agentic-evals/AGENT.md
- agentic-evals/docs/session-evidence.md
- agentic-evals/targets/<target_id>/target.yaml
- each selected suite file
- each case file referenced by those suites, or the selected case file
AGENT.md defines the repo contract.
This skill executes that contract.
Step 3: Create the run directory
Create the run directory and files exactly as required by agentic-evals/AGENT.md.
At minimum, the run must contain:
runs/<run_id>/
├── manifest.json
├── case-artifacts/
├── transcript.md
├── case-results/
└── report.md
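A minimal sketch of creating that skeleton, assuming nothing beyond the layout shown above. The function name create_run_skeleton is hypothetical, and the files are left as empty placeholders because AGENT.md, not this sketch, defines their contents.

```python
from pathlib import Path


def create_run_skeleton(runs_root: Path, run_id: str) -> Path:
    """Create runs/<run_id>/ with the minimum directories and placeholder files."""
    run_dir = runs_root / run_id
    # Directories that hold per-case evidence and per-case results.
    for sub in ("case-artifacts", "case-results"):
        (run_dir / sub).mkdir(parents=True, exist_ok=True)
    # Empty placeholders; the real contents follow the AGENT.md contract.
    for name in ("manifest.json", "transcript.md", "report.md"):
        (run_dir / name).touch()
    return run_dir
```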
When writing manifest.json, include any environment notes this skill discovers while setting up isolated workspaces or locating accepted child sessions.
For ab-urls, initialize the parent run with:
python3 skill-eval/scripts/init_ab_run.py "<source_workspace>" "<target_id>" "<variant_a_url>" "<variant_b_url>" [--suite-id <suite_id>] [--case-id <case_id>]
This creates:
- parent runs/<ab_run_id>/manifest.json
- variants/A/run/
- variants/B/run/
- optional variants/A/source-manifest.json
- optional variants/B/source-manifest.json
Each variants/<label>/run/ directory must later satisfy the normal single-run artifact contract.
Step 3A: Resolve A/B variant sources
In ab-urls mode:
- Parse each GitHub URL with skill-eval/scripts/parse_github_skill_url.py.
- Prepare one isolated source workspace per variant with skill-eval/scripts/prepare_variant_source_workspace.py.
- Record the prepared source workspace and normalized URL interpretation in variants/<label>/source-manifest.json.
Each prepared source workspace must contain:
<prepared-source>/
├── agentic-evals/
└── .agents/
    └── skills/
        └── <target_id>/
The local agentic-evals/ repo stays the source of truth.
Only the target skill directory varies between A and B.
Step 4: Create a fresh case workspace for every attempt
Before executing a case, create a temp parent directory and then create a brand-new
workspace for that case attempt.
Use the helper script:
bash skill-eval/scripts/create_case_workspace.sh "<source_workspace>" "<case_workspace_root>" "<case_id>" --target "<target_id>"
The script returns the absolute path to the new attempt workspace.
<source_workspace> must be the shared workspace root that contains sibling
agentic-evals/ and .agents/ directories.
By default, the script should copy only the target skill materials needed for execution:
- the target entry_skill
- the target roots
- any explicit extra relative paths passed as additional arguments when a case needs local fixtures
The case workspace must not include repo evaluation materials such as targets/, docs/, runs/, or the evaluator skill itself unless a case explicitly requires them.
Rules:
- Run this once before the first attempt of every case.
- Run it again before every retry of the same case. Retries must not reuse the prior attempt workspace.
- Apply case setup only inside the returned workspace.
- Treat the returned workspace as a minimal target-skill sandbox, not a full clone of the eval repo.
- Resolve repo-defined files from <source_workspace>/agentic-evals/ and target skill files from <source_workspace>/.agents/.
- Record the parent temp directory as case_workspace_root in manifest.json.
- Record the exact attempt workspace used for judgment as workspace_root in case-results/<case_id>.json.
In ab-urls mode, do this separately inside each variant run using that variant's prepared source workspace.
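For evaluators that drive the helper from Python, the invocation above can be assembled as an argument list. This is a sketch under assumptions: build_workspace_command is a hypothetical name, and placing the extra fixture paths after --target is a guess, since the text only says they are passed as additional arguments.

```python
from typing import List, Sequence


def build_workspace_command(
    source_workspace: str,
    case_workspace_root: str,
    case_id: str,
    target_id: str,
    extra_paths: Sequence[str] = (),
) -> List[str]:
    """Build the argv for one create_case_workspace.sh call (one call per attempt)."""
    return [
        "bash",
        "skill-eval/scripts/create_case_workspace.sh",
        source_workspace,
        case_workspace_root,
        case_id,
        "--target",
        target_id,
        # Assumption: extra relative fixture paths go last.
        *extra_paths,
    ]
```

Building a new command and running it before every attempt, including retries, keeps each attempt in its own workspace. If the script prints the new workspace path on standard output, which is an assumption here, it could be captured with subprocess.run(cmd, check=True, capture_output=True, text=True).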
Step 5: Execute each case dynamically
For every case:
- Create a fresh isolated workspace for that case attempt under case_workspace_root.
- Apply the case setup as far as the environment allows, but only inside that attempt workspace.
- Start a fresh sub-agent with spawn_agent and fork_context: false.
- Give the fresh agent only the task-local context it needs:
  - workspace root
  - the case input.user_prompt
  - a requirement to answer naturally as if serving the user
- Capture the returned agent metadata when available, such as the agent id or nickname.
- Do not tell the fresh agent which files it is expected to read.
- Do not tell the fresh agent what the correct answer should be.
- Wait for the fresh agent to finish.
- Locate the accepted child session JSONL from the local Codex session store. Preferred signals:
  - child session_meta.payload.source.subagent.thread_spawn.parent_thread_id
  - child start time relative to the case attempt
  - returned nickname or agent id when available
  - ~/.codex/state_5.sqlite thread_spawn_edges as a locator or tie-breaker
- If a single accepted child session cannot be identified, mark the case blocked.
- Copy the accepted child session to case-artifacts/<case_id>/accepted-session.jsonl.
- Extract the accepted final answer and save it to case-artifacts/<case_id>/final-answer.txt.
- Validate observed isolation before judging:
  - treat per-tool workdir values, resolved read and write paths, and command-derived cwd outputs such as pwd as the authoritative isolation signals
  - for Codex spawn_agent, treat child session_meta.cwd as advisory only, because spawned child-thread metadata may inherit the parent workspace cwd
  - observed per-tool workdir values must be inside the attempt workspace
  - observed read and write paths must be inside the attempt workspace
  - if a command such as pwd prints a cwd, that observed cwd must be inside the attempt workspace
  - if the accepted session evidence shows an observed tool workdir, resolved path, or command-derived cwd outside the attempt workspace, invalidate that attempt, append the mismatch to transcript.md, create a brand-new attempt workspace, and rerun the case once
  - if no reliable tool workdir, resolved path, or command-derived cwd can be observed, mark the case blocked
  - if the accepted session cannot support reliable isolation after the retry, mark the case blocked
- Render transcript.md directly from the accepted child session evidence in event order.
- Judge each assertion in the main evaluator from the accepted session evidence and accepted final answer, using the rules in AGENT.md.
- Write case-results/<case_id>.json and report.md exactly in the shapes required by AGENT.md.
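The isolation check in the list above reduces to a path-containment test over the observed workdirs, paths, and cwd values. The sketch below is one way to express it. The function names and the verdict strings are hypothetical, and resolving relative paths against the attempt workspace is an assumption.

```python
from pathlib import Path
from typing import Iterable, List


def paths_outside_workspace(workspace: str, observed: Iterable[str]) -> List[str]:
    """Return the observed paths that resolve outside the attempt workspace."""
    root = Path(workspace).resolve()
    violations = []
    for raw in observed:
        candidate = Path(raw)
        if not candidate.is_absolute():
            # Assumption: relative paths are relative to the attempt workspace.
            candidate = root / candidate
        resolved = candidate.resolve()  # also collapses ".." segments
        if resolved != root and root not in resolved.parents:
            violations.append(raw)
    return violations


def isolation_verdict(workspace: str, observed: Iterable[str]) -> str:
    """Map observed signals to 'ok', 'invalid' (rerun once), or 'blocked'."""
    signals = list(observed)
    if not signals:
        return "blocked"  # no reliable signal was observed
    if paths_outside_workspace(workspace, signals):
        return "invalid"  # invalidate the attempt and rerun in a new workspace
    return "ok"
```

Comparing resolved path components, not string prefixes, avoids accepting a sibling directory whose name merely starts with the workspace name.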
In ab-urls mode:
- Run the full normal case flow for variant A under variants/A/run/.
- Run the full normal case flow for variant B under variants/B/run/.
- Do not compare A and B until both variant runs have written their case-results/.
- Render the parent comparison artifacts with:
python3 skill-eval/scripts/render_ab_report.py "<ab_run_dir>" --target-id "<target_id>" --label-a "A" --label-b "B" --variant-a-url "<variant_a_url>" --variant-b-url "<variant_b_url>"
The parent comparison report is derived only from the two variant runs.
It does not replace the variant-level case judgments.
Step 5A: Fresh-agent prompt template
Use a prompt equivalent to this shape:
You are a fresh Codex agent running in the workspace <workspace>.
Task: answer this user request naturally, using the local workspace as needed:
"<case input.user_prompt>"
Requirements:
- Work as a normal Codex agent would for a real user request.
- Use the target skill docs if relevant.
- Treat `<workspace>` as your only workspace for this task.
- Start from `<workspace>` and keep all file reads, writes, and shell commands inside it.
- If something you need is missing inside `<workspace>`, say so from that workspace instead of reaching outside it.
- Do not mention that you are being evaluated.
- Give the exact answer you would send to the user.
Keep the prompt minimal.
Do not include the case assertions in the fresh-agent prompt.
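One way to fill the template, sketched under the assumption that the evaluator builds the prompt in Python. build_fresh_agent_prompt is a hypothetical helper. It takes only the workspace and the user prompt, so assertions and case titles cannot reach the prompt through it.

```python
PROMPT_TEMPLATE = """You are a fresh Codex agent running in the workspace {workspace}.
Task: answer this user request naturally, using the local workspace as needed:
"{user_prompt}"
Requirements:
- Work as a normal Codex agent would for a real user request.
- Use the target skill docs if relevant.
- Treat `{workspace}` as your only workspace for this task.
- Start from `{workspace}` and keep all file reads, writes, and shell commands inside it.
- If something you need is missing inside `{workspace}`, say so from that workspace instead of reaching outside it.
- Do not mention that you are being evaluated.
- Give the exact answer you would send to the user.
"""


def build_fresh_agent_prompt(workspace: str, user_prompt: str) -> str:
    """Fill the template; user_prompt is inserted verbatim, never paraphrased."""
    # str.format substitutes the values as-is, so braces or quotes inside
    # user_prompt are not reinterpreted.
    return PROMPT_TEMPLATE.format(workspace=workspace, user_prompt=user_prompt)
```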
Step 5B: Environment mismatch handling
Case setup is part of the contract.
Do not silently replace it with whatever the current workspace happens to contain.
If the environment does not match the case setup:
- Record the mismatch in manifest.json and transcript.md
- Judge only the assertions that remain reliable
- Mark an assertion blocked when the mismatch prevents reliable judgment
- Propagate the case to blocked unless a required assertion already failed independently
- Do not repair the mismatch by reading from the user's main workspace or any path outside the case workspace
Examples:
- case says docs_index_present: true, but the real workspace is missing references/docs.txt
- case setup would require mutating protected files that the evaluator cannot safely write
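The propagation rule for mismatches can be sketched as follows. The status strings, the required flag, and the dictionary shape are assumptions made for illustration. The statuses and assertion semantics that actually apply are the ones defined in AGENT.md and the target's target.yaml.

```python
from typing import Dict, List


def propagate_case_status(assertions: List[Dict]) -> str:
    """Derive a case status from per-assertion results under a setup mismatch.

    Each assertion is a dict with a "status" of "pass", "fail", or "blocked"
    and an optional "required" flag (default True). Hypothetical schema.
    """
    required = [a for a in assertions if a.get("required", True)]
    # A required assertion that failed independently decides the case.
    if any(a["status"] == "fail" for a in required):
        return "fail"
    # Otherwise any blocked assertion propagates to the whole case.
    if any(a["status"] == "blocked" for a in assertions):
        return "blocked"
    return "pass"
```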
Step 5C: Local evidence prerequisites and failure handling
The evaluator depends on local Codex evidence sources.
Required behavior:
- the accepted child session exists under ~/.codex/sessions/
- the evaluator can read that child session after completion
- the session includes enough detail to judge observed commands, consultation, ordering, and the final answer
Helpful but optional:
- ~/.codex/state_5.sqlite to locate and disambiguate child threads
If any required source is missing:
- mark the case blocked with blocked_reason: "environment" when the local evidence source is unavailable
- mark the case blocked with blocked_reason: "insufficient-evidence" when only partial or coarse session data exists
- do not fall back to fresh-agent self-reporting as substitute evidence
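A minimal sketch of locating candidate child sessions in the local session store. It assumes each line of a session file is a JSON object and that the spawning thread id sits at payload.source.subagent.thread_spawn.parent_thread_id on one of those records, following the field path named in Step 5. The function names are hypothetical, and agentic-evals/docs/session-evidence.md remains the authority on the real record shape.

```python
import json
from pathlib import Path
from typing import Any, List


def dig(obj: Any, *keys: str) -> Any:
    """Walk nested dicts safely; return None as soon as a level is not a dict."""
    for key in keys:
        if not isinstance(obj, dict):
            return None
        obj = obj.get(key)
    return obj


def find_child_sessions(sessions_root: Path, parent_thread_id: str) -> List[Path]:
    """Return session files that name parent_thread_id as their spawning thread."""
    matches = []
    for path in sorted(sessions_root.rglob("*.jsonl")):
        with path.open(encoding="utf-8") as handle:
            for line in handle:
                try:
                    record = json.loads(line)
                except json.JSONDecodeError:
                    continue  # skip blank or malformed lines
                parent = dig(record, "payload", "source", "subagent",
                             "thread_spawn", "parent_thread_id")
                if parent == parent_thread_id:
                    matches.append(path)
                    break
    return matches
```

A result with exactly one path is a candidate for the accepted session. Zero or several matches means the other signals (start time, nickname or agent id, thread_spawn_edges) are needed, and if they cannot single out one session the case is blocked.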
Evidence Rules
Apply the repo evidence policy from agentic-evals/AGENT.md.
Evaluation Mode Workflow
When the user requests an evaluation (AB test, comparison, or subjective scoring),
follow this workflow instead of the standard single-target run.
Step E1: Detect evaluation mode
If the user provides an eval_id, or asks for an AB test / comparison / subjective evaluation:
- Read agentic-evals/evaluations/<mode>/<eval_id>.yaml.
- Read the referenced rubric from agentic-evals/rubrics/.
- Resolve the target(s) and case set from the evaluation YAML.
If no evaluation YAML is provided, fall back to the standard single-target workflow.
Step E2: Prepare variants
For each variant defined in the evaluation:
- ab mode: check out or copy the skill at the specified skill_ref into the case workspace.
  - If skill_ref is a git tag/branch, clone the skill repo at that ref.
  - If skill_ref is a local path, copy it.
- comparison mode: each variant has its own target_id and skill_path. Prepare each variant's skill files independently.
- subjective mode: single variant, use the target's current skill files.
Step E3: Execute cases per variant
For each variant, for each case:
- Create a fresh isolated workspace (same as standard workflow).
- Copy the variant's skill files into the workspace.
- Spawn a fresh sub-agent with the case prompt.
- Collect evidence (same as standard workflow).
- Judge pass/fail assertions (same as standard workflow).
- Score each rubric dimension from the accepted session evidence.
Write variant-specific artifacts to runs/<run_id>/variants/<variant_label>/.
Step E4: Score rubric dimensions
For each case and each rubric dimension:
- Read the dimension description and anchors from the rubric.
- Review the accepted session evidence and final answer.
- Assign a score from the dimension's scale.
- If evidence is insufficient, record null and explain in rubric_notes.
Scoring rules:
- Score each dimension independently.
- Use anchors for calibration: a score of 3 means "matches the anchor for 3."
- For AB and comparison modes, score each variant independently before comparing.
- Rubric scores are independent of pass/fail assertions.
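A sketch of how a single dimension score might be recorded so that a null score always carries an explanation. Only rubric_notes comes from the text above. The function name make_rubric_entry, the other keys, and the (low, high) scale tuple are hypothetical, and the rubric YAML defines the real scale.

```python
from typing import Dict, Optional, Tuple


def make_rubric_entry(
    dimension: str,
    score: Optional[int],
    scale: Tuple[int, int],
    rubric_notes: str = "",
) -> Dict:
    """Build one score record; None means the evidence was insufficient."""
    low, high = scale
    if score is None:
        # A null score is allowed only together with an explanation.
        if not rubric_notes:
            raise ValueError("a null score needs an explanation in rubric_notes")
    elif not low <= score <= high:
        raise ValueError(f"score {score} is outside the scale {low}..{high}")
    return {"dimension": dimension, "score": score, "rubric_notes": rubric_notes}
```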
Step E5: Write evaluation report
Write eval-report.md with exactly these sections:
- Evaluation Summary: mode, variants, case count, rubric used
- Scoring Table: each case × each dimension, columns per variant
- Head-to-Head (AB and comparison only): win/loss/tie per dimension
- Detailed Findings: per-case narrative of notable differences
- Recommendations: at most 3 actionable items pointing to real files
For subjective mode, omit the Head-to-Head section.
Also write the standard report.md for backward compatibility.
For multi-variant modes, report.md should summarize the overall evaluation
and link to eval-report.md for details.
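The Head-to-Head tally can be sketched as below. The text does not say how a win is decided, so this assumes a per-case comparison of the two variants' scores on each dimension, skipping any pair where either score is null. The function name and the data shapes are hypothetical.

```python
from typing import Dict, Optional

# case_id -> dimension -> score, or None when evidence was insufficient
Scores = Dict[str, Dict[str, Optional[int]]]


def head_to_head(a: Scores, b: Scores) -> Dict[str, Dict[str, int]]:
    """Count per-dimension wins for variant A, wins for variant B, and ties."""
    tally: Dict[str, Dict[str, int]] = {}
    # Only cases present in both variants are comparable.
    for case_id in sorted(set(a) & set(b)):
        for dim, score_a in a[case_id].items():
            score_b = b[case_id].get(dim)
            if score_a is None or score_b is None:
                continue  # null scores are excluded from the tally
            row = tally.setdefault(dim, {"a_wins": 0, "b_wins": 0, "ties": 0})
            if score_a > score_b:
                row["a_wins"] += 1
            elif score_b > score_a:
                row["b_wins"] += 1
            else:
                row["ties"] += 1
    return tally
```

Because each variant is scored independently first, the tally is a pure function of the two finished score tables and never feeds back into either variant's scores.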
Evaluation Manifest
manifest.json must additionally record:
- eval_id
- eval_mode: one of ab, comparison, subjective
- variant_labels: array of variant labels
- rubric_id
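As an illustration, the extra fields could be assembled and validated before being merged into manifest.json. The helper name evaluation_manifest_fields is hypothetical, and the sketch covers only the four fields listed above. The rest of the manifest follows AGENT.md.

```python
from typing import Dict, List

ALLOWED_EVAL_MODES = ("ab", "comparison", "subjective")


def evaluation_manifest_fields(
    eval_id: str, eval_mode: str, variant_labels: List[str], rubric_id: str
) -> Dict:
    """Return the evaluation-specific manifest fields, rejecting unknown modes."""
    if eval_mode not in ALLOWED_EVAL_MODES:
        raise ValueError(f"eval_mode must be one of {ALLOWED_EVAL_MODES}")
    return {
        "eval_id": eval_id,
        "eval_mode": eval_mode,
        "variant_labels": list(variant_labels),
        "rubric_id": rubric_id,
    }
```

The returned dictionary can be merged into the existing manifest with manifest.update(...) and written out with json.dump.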
AB Mode: Skill Version Preparation
For AB tests, the two variants are different versions of the same skill.
The evaluator must ensure:
- Both variants use the exact same case set.
- Each variant's workspace contains only that variant's skill files.
- The evaluator does not leak variant A's evidence into variant B's judgment.
- If a skill_ref cannot be resolved (e.g., git tag not found), mark all cases for that variant as blocked with blocked_reason: "environment".
Comparison Mode: Cross-Target Cases
For comparison tests, variants may reference different targets.
The shared_cases field specifies which target's cases to use.
The evaluator must:
- Load cases from shared_cases.from_target.
- For each variant, prepare the variant's own skill in the workspace.
- Run the same case prompts against each variant's skill.
- If a variant's target does not have the referenced skill files, mark its cases as blocked.