| name | agentbench-eval |
| description | AgentBench evaluation harness for claudemem. Covers pre-indexed repos, experiment conditions, running benchmarks, analyzing results, and managing index archives. Use when working on eval infrastructure, running experiments, or interpreting benchmark results. |
AgentBench Evaluation Skill
Run and manage claudemem evaluation experiments against the eth-sri/agentbench benchmark (138 instances, 12 Python repos).
Repo Location
The agentbench repo is a sibling of the claudemem repo:
../agentbench/
All paths below are relative to the agentbench repo root unless noted otherwise.
Data Layout
data/
├── eval-repos/ # 12 cloned repos with pre-built .claudemem/ indexes
│   └── {slug}/.claudemem/ # AST, symbols, vectors, enrichment per repo
├── archives/ # Immutable index snapshots (tar.gz)
│   ├── indexes-20260304-deepseek.tar.gz # v2 full: 12 repos (1.2GB)
│   └── indexes-20260304-deepseek-11of12.tar.gz # v2: 11 repos enriched (830MB)
├── eval-cache/ # Runtime sentinel cache
└── eval-generated/ # Generated CLAUDE.md files
Index Specs (v2, 2026-03-04)
- Enrichment model: deepseek/deepseek-v3.2 via OpenRouter
- Embedding model: qwen/qwen3-embedding-8b via OpenRouter
- Total: 12 repos, ~39K symbols, ~36K enrichment docs, ~1.9GB indexes
- Cost: ~$0.25 total via OpenRouter
Repo Inventory
| Slug | Files | Symbols | Docs | Size |
|---|---|---|---|---|
| ansible_ansible | 1758 | 4,597 | 7,214 | 176M |
| getzep_graphiti | 115 | 434 | 635 | 43M |
| huggingface_smolagents | 70 | 475 | 621 | 20M |
| huggingface_transformers | 3082 | 0 | 0 | ~640M |
| jlowin_fastmcp | 417 | 1,914 | 3,348 | 92M |
| openai_openai-agents-python | 477 | 3,504 | 3,620 | 98M |
| opshin_opshin | 125 | 714 | 855 | 24M |
| pdm-project_pdm | 215 | 1,183 | 1,480 | 34M |
| qodo-ai_pr-agent | 114 | 328 | 835 | 21M |
| tinygrad_tinygrad | 884 | 20,018 | 6,345 | 400M |
| vibrantlabsai_ragas | 401 | 1,357 | 2,514 | 57M |
| wagtail_wagtail | 2270 | 4,434 | 8,419 | 261M |
Note: huggingface_transformers has 0 symbols/docs (tree-sitter WASM errors on metaprogramming). Vector search still works.
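To confirm that every repo in the inventory actually carries an index on disk, a check along these lines may help. It is a minimal sketch: the function name `find_missing_indexes` is hypothetical, and it only tests whether `{slug}/.claudemem/` exists as a directory, not whether the index inside is complete or enriched.

```python
from pathlib import Path


def find_missing_indexes(eval_repos: Path) -> list[str]:
    """Return slugs under eval_repos that have no .claudemem/ directory.

    Hypothetical helper: it checks directory presence only and says
    nothing about whether the index inside is complete or enriched.
    """
    missing = []
    for repo_dir in sorted(eval_repos.iterdir()):
        if not repo_dir.is_dir():
            continue  # skip stray files next to the repo directories
        if not (repo_dir / ".claudemem").is_dir():
            missing.append(repo_dir.name)
    return missing


if __name__ == "__main__":
    root = Path("data/eval-repos")  # relative to the agentbench repo root
    if root.is_dir():
        for slug in find_missing_indexes(root):
            print(f"no index: {slug}")
```

Run from the agentbench repo root, the script prints one line per repo that lacks an index directory and prints nothing when all are present.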
Experiment Conditions
| Condition | Type | Workers | What it does |
|---|---|---|---|
| no_plan | Baseline | 2 | Raw Claude Code, no AGENTS.md |
| claudemem_full | Per-instance | 2 | claudemem map+search → AGENTS.md per task |
| dc_planner | Cross-instance | 1 | Dynamic Cheatsheet: learns across tasks |
| ace_planner | Cross-instance | 1 | ACE reflector+curator playbook: learns across tasks |
Instance Filter (24 instances, 2 per repo)
Hardcoded in scripts/agentbench/run_harness/run_condition.py.
Common Tasks
Run an Experiment
cd scripts/agentbench/run_harness
python run_condition.py <condition>
Never pass the filter directly in the shell: the pipe character | gets escaped by zsh.
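One reason a Python launcher sidesteps the escaping problem is that arguments passed as a list never go through a shell. The sketch below illustrates that idea only. It is not the actual contents of run_condition.py, the filter string is a made-up placeholder, and the child process is a stand-in that echoes its argument back.

```python
import subprocess
import sys

# Hypothetical filter value; the real filter is hardcoded in run_condition.py.
EXAMPLE_FILTER = "repo_a__1|repo_b__2"


def echo_through_child(arg: str) -> str:
    """Pass arg to a child Python process as a list element and return
    what the child received. No shell is involved, so | is not special."""
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", arg],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.rstrip("\n")


if __name__ == "__main__":
    # The child sees the filter byte-for-byte, pipe included.
    print(echo_through_child(EXAMPLE_FILTER))
```

Because the argument list is handed to the child process directly, no quoting or escaping layer sits between the caller and the callee.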
Restore Indexes (New Machine)
./scripts/agentbench/run_harness/restore_indexes.sh
./scripts/agentbench/run_harness/restore_indexes.sh \
--archive data/archives/indexes-20260304-deepseek.tar.gz
Re-Index a Single Repo
CLAUDEMEM_LLM=or/deepseek/deepseek-v3.2 claudemem index --force data/eval-repos/{slug}
claudemem index --no-llm data/eval-repos/{slug}
Create Archive Snapshot
cd data/eval-repos
tar czf ../archives/indexes-$(date +%Y%m%d)-deepseek.tar.gz */.claudemem/
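After creating a snapshot it can be worth checking which repos actually ended up in it. The following is a minimal sketch using the standard `tarfile` module. It assumes the archive was built from inside data/eval-repos, so member paths start with the slug, and `slugs_in_archive` is a hypothetical name.

```python
import tarfile
from pathlib import Path, PurePosixPath


def slugs_in_archive(archive: Path) -> list[str]:
    """List repo slugs that have a .claudemem/ entry in a tar.gz snapshot.

    Assumes member paths look like '{slug}/.claudemem/...', which is what
    running tar from inside data/eval-repos produces.
    """
    slugs = set()
    with tarfile.open(archive, "r:gz") as tar:
        for name in tar.getnames():
            parts = PurePosixPath(name).parts
            if len(parts) >= 2 and parts[1] == ".claudemem":
                slugs.add(parts[0])
    return sorted(slugs)
```

Comparing the returned list against the repo inventory shows at a glance whether a snapshot is a full or partial one.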
Check Experiment Results
ls scripts/agentbench/run_harness/output/agentbench/eth-sri_agentbench/{condition}/
cd scripts/agentbench/run_harness
python evaluate.py --condition <condition> --run_id <N>
python analyze.py
DC/ACE Training Data
- DC cheatsheets: plans/dynamic_cheatsheet/{model}/cheatsheet_{repo}.txt
- ACE playbooks: plans/ace_playbook/{model}/playbook_{repo}.json
- History: *_history/ subdirectories track evolution across instances
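The path templates above can be expanded with a small helper. This is an illustrative sketch: the function names are hypothetical, and it assumes that `{model}` is the short model key and `{repo}` is the repo slug, which the source does not state explicitly.

```python
from pathlib import Path


def cheatsheet_path(model: str, repo: str, plans_dir: Path = Path("plans")) -> Path:
    """Build the DC cheatsheet path from the documented template."""
    return plans_dir / "dynamic_cheatsheet" / model / f"cheatsheet_{repo}.txt"


def playbook_path(model: str, repo: str, plans_dir: Path = Path("plans")) -> Path:
    """Build the ACE playbook path from the documented template."""
    return plans_dir / "ace_playbook" / model / f"playbook_{repo}.json"


if __name__ == "__main__":
    # Placeholder values: a short model key and a slug from the inventory.
    print(cheatsheet_path("sonnet-4-5", "pdm-project_pdm").as_posix())
    print(playbook_path("sonnet-4-5", "pdm-project_pdm").as_posix())
```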
Key Gotchas
- Always use run_condition.py; passing filters through the shell breaks because of how | is escaped
- DC/ACE are sequential (workers=1): they learn across instances, so they can't be parallelized
- Index cache is 3-level: in-process set → index.db file → sentinel in data/eval-cache/
- Model keys are short names: sonnet-4-5, not claude-sonnet-4-5-20250929
- generate.py uses fire.Fire(main), so both positional and --flag=val args work
- --no-llm indexes only provide map (PageRank symbols); full enrichment enables semantic search
- All paths are repo-relative, with no ~/.claudemem/ dependencies
- Enrichment model: deepseek/deepseek-v3.2 was selected from 76 benchmark runs (composite score 0.886)
Architecture (Key Files)
| File | Purpose |
|---|---|
| src/agentbench/planners/claudemem_planner.py | ClaudememPlanner: indexes repo, generates AGENTS.md |
| src/agentbench/planners/ace/ace.py | ACE planner: reflector+curator playbook |
| src/agentbench/planners/evo_reproducer/evo_reproducer.py | DC/EvoReproducer planner: dynamic cheatsheet |
| scripts/agentbench/run_harness/generate.py | Main harness entry point |
| scripts/agentbench/run_harness/run_condition.py | Launch helper (handles filter escaping) |
| scripts/agentbench/run_harness/restore_indexes.sh | Archive restore script |
| scripts/agentbench/run_harness/evaluate.py | Result evaluator |
| scripts/agentbench/run_harness/analyze.py | Cross-condition analyzer |