with one click
agent-eval
Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
Menu
Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics
| name | agent-eval |
| description | Head-to-head comparison of coding agents (Claude Code, Aider, Codex, etc.) on custom tasks with pass rate, cost, time, and consistency metrics |
| metadata | {"origin":"ECC"} |
| tools | Read, Write, Edit, Bash, Grep, Glob |
A lightweight CLI tool for comparing coding agents head-to-head on reproducible tasks. Every "which coding agent is best?" comparison runs on vibes โ this tool systematizes it.
Note: Install agent-eval from its repository after reviewing the source.
Define tasks declaratively. Each task specifies what to do, which files to touch, and how to judge success:
name: add-retry-logic
description: Add exponential backoff retry to the HTTP client
repo: ./my-project
files:
- src/http_client.py
prompt: |
Add retry logic with exponential backoff to all HTTP requests.
Max 3 retries. Initial delay 1s, max delay 30s.
judge:
- type: pytest
command: pytest tests/test_http_client.py -v
- type: grep
pattern: "exponential_backoff|retry"
files: src/http_client.py
commit: "abc1234" # pin to specific commit for reproducibility
Each agent run gets its own git worktree โ no Docker required. This provides reproducibility isolation so agents cannot interfere with each other or corrupt the base repo.
| Metric | What It Measures |
|---|---|
| Pass rate | Did the agent produce code that passes the judge? |
| Cost | API spend per task (when available) |
| Time | Wall-clock seconds to completion |
| Consistency | Pass rate across repeated runs (e.g., 3/3 = 100%) |
Create a tasks/ directory with YAML files, one per task:
mkdir tasks
# Write task definitions (see template above)
Execute agents against your tasks:
agent-eval run --task tasks/add-retry-logic.yaml --agent claude-code --agent aider --runs 3
Each run:
Generate a comparison report:
agent-eval report --format table
Task: add-retry-logic (3 runs each)
โโโโโโโโโโโโโโโโฌโโโโโโโโโโโโฌโโโโโโโโโฌโโโโโโโโโฌโโโโโโโโโโโโโโ
โ Agent โ Pass Rate โ Cost โ Time โ Consistency โ
โโโโโโโโโโโโโโโโผโโโโโโโโโโโโผโโโโโโโโโผโโโโโโโโโผโโโโโโโโโโโโโโค
โ claude-code โ 3/3 โ $0.12 โ 45s โ 100% โ
โ aider โ 2/3 โ $0.08 โ 38s โ 67% โ
โโโโโโโโโโโโโโโโดโโโโโโโโโโโโดโโโโโโโโโดโโโโโโโโโดโโโโโโโโโโโโโโ
judge:
- type: pytest
command: pytest tests/ -v
- type: command
command: npm run build
judge:
- type: grep
pattern: "class.*Retry"
files: src/**/*.py
judge:
- type: llm
prompt: |
Does this implementation correctly handle exponential backoff?
Check for: max retries, increasing delays, jitter.
Create reproducible, cross-platform (macOS/Linux) development environments with Flox, a declarative Nix-based environment manager. Use when setting up project toolchains for any language, installing system-level dependencies (compilers, databases, native libs like openssl/BLAS), pinning exact package versions for a team, running local services (PostgreSQL, Redis, Kafka), onboarding developers with one command, or solving 'works on my machine' problems โ including agent/vibe-coding setups that need project-scoped tools without sudo. Also use when the user mentions .flox/, manifest.toml, flox activate, or FloxHub.
Commercial-grade Python installer expert for Windows: Nuitka extreme compilation, dist slimming, DLL footprint analysis, and Inno Setup packaging to ship the smallest, fastest installers. Use only for advanced packaging/optimization (minimal size, fast startup), not basic script-to-exe conversion. ไธญๆ่งฆๅ๏ผNuitka ๆ้ไผๅใPython ๅไธๆๅ ใๆ้็ผ่ฏ Pythonใdist ็ฆ่บซใDLL ๅๆใๆๅฐๅฎ่ฃ ๅ ใๆๅฟซๅฏๅจใๅไธ็บงๆๅ ้ฃๆ ผ
Use when a brand needs to discover or articulate its identity through structured multi-session interviews. Covers purpose, positioning, audience, personality, voice, narrative, and founder-brand tension across 8 modules using laddering, 5 Whys, and projective techniques. Produces a resumable session with disk-persisted state and a master brandbook (90_SYNTHESIS.md).
Use when a brand needs to discover or articulate its identity through structured multi-session interviews. Covers purpose, positioning, audience, personality, voice, narrative, and founder-brand tension across 8 modules using laddering, 5 Whys, and projective techniques. Produces a resumable session with disk-persisted state and a master brandbook (90_SYNTHESIS.md).
Use this skill to automate visual testing and UI interaction verification using browser automation after deploying features.
Visualize whether skills, rules, and agent definitions are actually followed โ auto-generates scenarios at 3 prompt strictness levels, runs agents, classifies behavioral sequences, and reports compliance rates with full tool call timelines