eval-guide
| name | eval-guide |
| description | Guide for running statistically meaningful agent-tty evals with trials, parallelism, and A/B comparison. Covers non-determinism baseline, recommended sample sizes, and result interpretation. |
Use this guide when you are trying to answer "did this skill or prompt change actually help?" for agent-tty evals.
The short version: do not trust a single run. This eval stack now supports multi-trial sampling, parallel execution, trial aggregation, and paired baseline comparison because the underlying model behavior is noisy enough that one pass/fail result is not decision-grade.
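To see why one pass/fail result is not decision-grade, it can help to put an interval around an observed pass rate. The sketch below uses the Wilson score interval, a standard textbook formula chosen here for illustration; it is not something this eval stack is known to compute, and the function name is hypothetical.

```typescript
// Illustrative only: a 95% Wilson score interval for an observed pass rate.
// `wilsonInterval` is a hypothetical helper, not part of the eval stack.
function wilsonInterval(
  passes: number,
  trials: number,
  z = 1.96, // normal quantile for a ~95% interval
): { low: number; high: number } {
  if (trials <= 0 || passes < 0 || passes > trials) {
    throw new RangeError("need 0 <= passes <= trials and trials > 0");
  }
  const p = passes / trials;
  const z2 = z * z;
  const denom = 1 + z2 / trials;
  const center = (p + z2 / (2 * trials)) / denom;
  const half =
    (z * Math.sqrt((p * (1 - p)) / trials + z2 / (4 * trials * trials))) / denom;
  // Clamp to [0, 1] to absorb floating-point overshoot.
  return { low: Math.max(0, center - half), high: Math.min(1, center + half) };
}

// One passing trial is consistent with a true pass rate from about 0.21 up to 1.
console.log(wilsonInterval(1, 1));
// Ten passing trials narrow the lower bound to about 0.72.
console.log(wilsonInterval(10, 10));
```

The interval after a single run is too wide to separate a flaky case from a solid one, which is the practical argument for sampling several trials.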
Always set --trials for real prompt or skill experiments.
Recommended trial counts:
- --trials 5 to --trials 10
- --trials 3
- --trials 2 to --trials 3

Use concurrency to keep those sample sizes affordable:

- --concurrency 4 for real-provider runs.
- --concurrency 1 only when you explicitly want fully serial behavior.

When --trials is greater than 1, reports automatically include per-case Trial Aggregation in report.md and report.json.
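As a rough picture of what per-case aggregation over trials involves, here is a minimal sketch. The `TrialResult` and `CaseAggregate` shapes and the chosen statistics are assumptions made for illustration; the actual fields in report.json may differ.

```typescript
// Hypothetical shapes for illustration; not the real report.json schema.
interface TrialResult {
  caseId: string;
  passed: boolean;
  score: number;
}

interface CaseAggregate {
  caseId: string;
  trials: number;
  passRate: number;
  meanScore: number;
  minScore: number;
  maxScore: number;
}

// Group trial results by case and summarize each group.
function aggregateTrials(results: TrialResult[]): CaseAggregate[] {
  const byCase = new Map<string, TrialResult[]>();
  for (const result of results) {
    const bucket = byCase.get(result.caseId) ?? [];
    bucket.push(result);
    byCase.set(result.caseId, bucket);
  }
  return Array.from(byCase.entries()).map(([caseId, trials]) => {
    const scores = trials.map((t) => t.score);
    return {
      caseId,
      trials: trials.length,
      passRate: trials.filter((t) => t.passed).length / trials.length,
      meanScore: scores.reduce((sum, s) => sum + s, 0) / scores.length,
      minScore: Math.min(...scores),
      maxScore: Math.max(...scores),
    };
  });
}
```

A case that passes two of three trials shows up with a pass rate near 0.67 instead of a single misleading pass or fail.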
Use a paired baseline comparison whenever you want to know whether a change helped. The comparison report uses paired bootstrap confidence intervals and paired win/loss/tie counts, so it is much more reliable than eyeballing two single runs.
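For intuition, a paired bootstrap over matched scores can be sketched as below. This is a generic percentile bootstrap under stated assumptions (scores already matched by case and trial, 2000 resamples, a seeded PRNG for reproducibility); the function names and data shape are hypothetical and the real comparison code may differ in its details.

```typescript
// Hypothetical input: scores matched position-by-position between two runs.
type PairedScores = { baseline: number[]; candidate: number[] };

// Small seeded PRNG (mulberry32 style) so the sketch is reproducible.
function seededRandom(seed: number): () => number {
  let a = seed >>> 0;
  return () => {
    a = (a + 0x6d2b79f5) >>> 0;
    let t = a;
    t = Math.imul(t ^ (t >>> 15), t | 1);
    t ^= t + Math.imul(t ^ (t >>> 7), t | 61);
    return ((t ^ (t >>> 14)) >>> 0) / 4294967296;
  };
}

function pairedBootstrap(pairs: PairedScores, resamples = 2000, seed = 1) {
  const n = pairs.baseline.length;
  if (n === 0 || n !== pairs.candidate.length) {
    throw new Error("need equal-length, non-empty paired scores");
  }
  // Work on per-pair differences so shared per-case difficulty cancels out.
  const diffs = pairs.candidate.map((c, i) => c - pairs.baseline[i]);
  const rand = seededRandom(seed);
  const means: number[] = [];
  for (let r = 0; r < resamples; r++) {
    let sum = 0;
    for (let i = 0; i < n; i++) sum += diffs[Math.floor(rand() * n)];
    means.push(sum / n);
  }
  means.sort((x, y) => x - y);
  const quantile = (q: number) =>
    means[Math.min(means.length - 1, Math.floor(q * means.length))];
  return {
    delta: diffs.reduce((s, d) => s + d, 0) / n,
    ciLow: quantile(0.025), // percentile 95% interval
    ciHigh: quantile(0.975),
    wins: diffs.filter((d) => d > 0).length,
    losses: diffs.filter((d) => d < 0).length,
    ties: diffs.filter((d) => d === 0).length,
  };
}
```

Resampling differences instead of raw scores is what makes the comparison paired: each baseline result is only ever compared with the candidate result for the same case and trial.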
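1. Run the baseline and keep its report.json path.
2. Run the candidate with --compare-baseline <baseline-report-path>.
3. Read each verdict: improved, regressed, or inconclusive.

Practical reading rules:

- Treat a change as real only when the confidence interval excludes 0 and the effect is practically large enough to matter.
- The practical-effect thresholds are a 0.05 score delta and, for overall pass rate, a 0.05 absolute pass-rate delta.

A reliable prompt-lane A/B loop looks like this: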
BASELINE_JSON=$(npx tsx evals/run.ts \
--provider codex \
--model gpt-5.4 \
--lane prompt \
--condition self-load \
--trials 5 \
--concurrency 4 \
--output evals/reports/prompt-baseline \
--json | jq -r '.jsonReportPath')
# edit the skill or prompt
npx tsx evals/run.ts \
--provider codex \
--model gpt-5.4 \
--lane prompt \
--condition self-load \
--trials 5 \
--concurrency 4 \
--output evals/reports/prompt-candidate \
--compare-baseline "$BASELINE_JSON" \
--json
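The practical reading rules can be expressed as a small classifier. This is a sketch under the assumption that a verdict requires both a confidence interval that excludes 0 and an absolute delta of at least 0.05; the `Comparison` shape and `classify` function are hypothetical, not the report's actual schema or logic.

```typescript
type Verdict = "improved" | "regressed" | "inconclusive";

// Hypothetical summary of one paired comparison.
interface Comparison {
  delta: number; // candidate minus baseline
  ciLow: number;
  ciHigh: number;
}

// Assumed rule: the interval must exclude 0 AND the effect must reach minEffect.
function classify(c: Comparison, minEffect = 0.05): Verdict {
  const excludesZero = c.ciLow > 0 || c.ciHigh < 0;
  if (!excludesZero || Math.abs(c.delta) < minEffect) return "inconclusive";
  return c.delta > 0 ? "improved" : "regressed";
}

// A statistically clear but tiny effect still reads as inconclusive.
console.log(classify({ delta: 0.02, ciLow: 0.01, ciHigh: 0.03 }));
```

Requiring both conditions keeps noise and trivially small shifts out of the improved and regressed buckets.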
- inconclusive is the default, healthy outcome when nothing meaningful changed. In our same-skill sanity check, 23 of 24 cases were inconclusive and the paired win/loss/tie total was 14W / 15L / 43T.
- Base decisions on --trials plus --compare-baseline.

- --concurrency 1 remains the default and preserves serial behavior.
- --concurrency 4-20 is a reasonable operating range when provider limits and budget allow.
- […] 1.
- […] finally blocks, so parallel runs do not share session state.

Use serial mode only when debugging; use parallel mode when sampling.
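The idea of bounded parallelism with per-run cleanup in finally blocks can be sketched as follows. Everything here is illustrative: `runWithConcurrency`, `runTrial`, and the in-memory `liveSessions` set are hypothetical stand-ins, not the eval runner's actual implementation.

```typescript
// Run tasks with at most `limit` in flight; results keep task order.
async function runWithConcurrency<T>(
  tasks: Array<() => Promise<T>>,
  limit: number,
): Promise<T[]> {
  const results: T[] = new Array(tasks.length);
  let next = 0;
  async function worker(): Promise<void> {
    while (next < tasks.length) {
      const index = next++; // claim a task; safe because JS is single-threaded
      results[index] = await tasks[index]();
    }
  }
  const workerCount = Math.max(1, Math.min(limit, tasks.length));
  await Promise.all(Array.from({ length: workerCount }, () => worker()));
  return results;
}

// Stand-in for session bookkeeping; a real run would create a terminal session.
const liveSessions = new Set<string>();
let peakSessions = 0;

async function runTrial(trial: number): Promise<string> {
  const sessionId = `session-${trial}`; // private to this trial
  liveSessions.add(sessionId);
  peakSessions = Math.max(peakSessions, liveSessions.size);
  try {
    await new Promise<void>((resolve) => setTimeout(resolve, 5)); // simulated work
    return `trial ${trial} ok`;
  } finally {
    liveSessions.delete(sessionId); // cleanup runs on success and on failure
  }
}
```

With a limit of 1 this degenerates to serial execution, which matches the debugging use case; raising the limit only changes how many trials are in flight, not how each one cleans up.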
npx tsx evals/run.ts \
--provider claude \
--model claude-opus-4-6 \
--lane prompt \
--condition self-load \
--trials 5 \
--concurrency 4 \
--output evals/reports/prompt-self-load
npx tsx evals/run.ts \
--provider codex \
--model gpt-5.4 \
--lane execution \
--case hello-prompt \
--case resize-demo \
--trials 3 \
--concurrency 4 \
--output evals/reports/execution-smoke
npx tsx evals/run.ts \
--provider claude \
--model claude-opus-4-6 \
--lane dogfood \
--case exploratory-qa \
--case evidence-completeness \
--trials 2 \
--concurrency 4 \
--output evals/reports/dogfood-sample
BASELINE_JSON=$(npx tsx evals/run.ts \
--provider codex \
--model gpt-5.4 \
--lane all \
--condition all \
--trials 3 \
--concurrency 4 \
--output evals/reports/baseline \
--json | jq -r '.jsonReportPath')
npx tsx evals/run.ts \
--provider codex \
--model gpt-5.4 \
--lane all \
--condition all \
--trials 3 \
--concurrency 4 \
--output evals/reports/candidate \
--compare-baseline "$BASELINE_JSON" \
--json
npx tsx evals/run.ts --provider stub --lane prompt --trials 3 --concurrency 4
Use stub to validate wiring, not to judge whether a real-provider prompt or skill change helped.
Checklist:
- Prefer evals/authoring/* (promptCase(), executionCase(), dogfoodCase()) over hand-assembled schema objects.
- […] rawWorkflowCheck(), rawVerifier(), rawArtifactRequirement(), and rawReportRequirement().
- Run npm run test.

Need lifecycle events plus local progress and a machine-readable trace?
npx tsx evals/run.ts \
--provider stub \
--lane execution \
--reporter jsonl \
--reporter-output evals/reports/execution-events.jsonl \
--progress
When you omit --reporter, the default final reporter still writes report.json and report.md.
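If you want to post-process a JSONL trace, a schema-agnostic reader is a reasonable starting point. The sketch below assumes only that the file holds one JSON object per line; it makes no assumption about event field names, and `parseJsonl` is a hypothetical helper.

```typescript
interface JsonlParseResult {
  events: Record<string, unknown>[];
  malformed: number[]; // 1-based line numbers that did not hold a JSON object
}

// Parse JSONL text, skipping blank lines and recording lines that fail.
function parseJsonl(text: string): JsonlParseResult {
  const events: Record<string, unknown>[] = [];
  const malformed: number[] = [];
  text.split(/\r?\n/).forEach((line, index) => {
    if (line.trim() === "") return;
    try {
      const value: unknown = JSON.parse(line);
      if (value !== null && typeof value === "object" && !Array.isArray(value)) {
        events.push(value as Record<string, unknown>);
      } else {
        malformed.push(index + 1);
      }
    } catch {
      malformed.push(index + 1);
    }
  });
  return { events, malformed };
}
```

In practice you would pass it the file contents, for example the result of `readFileSync` from `node:fs` on the path given to --reporter-output, and then inspect whichever fields the events actually carry.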
Add .workspace('agent-tty-smoke') to an executionCase() or dogfoodCase() when the case needs preset bootstrap/env/template setup. Register custom presets with registerPreset() in a module that loads before runEvalCli().
Use --snapshot-update first, then --snapshot-check --snapshot-threshold 20 against the same --snapshot-dir when you want regression signals over time. Snapshot regressions are warnings only.