benchmarkoor
// Run benchmarkoor performance benchmarks against a locally-built Erigon binary and produce per-test MGas/s comparison tables. Covers image build, dataset reset, run invocation, result parsing, and before/after comparisons.
| name | benchmarkoor |
| description | Run benchmarkoor performance benchmarks against a locally-built Erigon binary and produce per-test MGas/s comparison tables. Covers image build, dataset reset, run invocation, result parsing, and before/after comparisons. |
| allowed-tools | Bash, Read, Write, Edit, Glob, Monitor |
Benchmarkoor (ethpandaops/benchmarkoor) drives an execution client through the Engine API and measures engine_newPayloadV<N> throughput in MGas/s per test (the V<N> payload version depends on the dataset's hardfork — V5 for Amsterdam, V4 for Prague, etc.). This skill teaches you to run it against a locally-built Erigon, parse results, and compare two runs.
The skill assumes the host already has the benchmarkoor binary, a config YAML, a snapshot/working datadir pair, and Docker installed — it focuses on the agent workflow, not first-time setup.
(All $VAR placeholders used below — $BENCH_DIR, $CONFIG, $INSTANCE_ID, etc. — are defined in the next section, "Adapt to the user's environment first".)
When something in this skill is ambiguous, or for newer/unfamiliar config options, consult these upstream sources before guessing:
If you change anything in $CONFIG that isn't already in this skill, cross-check it against the docs or examples links above before running.
Before running anything, identify the host-specific values. Don't hard-code these — different hosts and different datasets use different names. Ask the user if anything below is ambiguous.
| Placeholder | What it is | How to discover |
|---|---|---|
| $BENCH_DIR | benchmarkoor host root (contains the binary, config, snapshot dirs) | The binary usually isn't in $PATH. Find it with find ~ /opt /srv -maxdepth 4 -name benchmarkoor -type f -executable 2>/dev/null and take the parent dir, or ask the user. Common pattern: ~/<dataset-name>/ |
| $CONFIG | YAML config file passed to benchmarkoor run --config <…> | ls $BENCH_DIR/*.yaml; there may be more than one (one per network/dataset). Ask which to use. |
| $INSTANCE_ID | The instances[].id to benchmark | grep -E '^\s*-\s+id:' $CONFIG |
| $IMAGE_TAG | Docker image tag the Erigon instance references | Look at instances[].image for the chosen $INSTANCE_ID (e.g. erigon-local:traced) |
| $SNAPSHOT_DIR | Read-only pristine dataset; never mutated | Look at client.datadirs.erigon.source_dir in the config, or the PRISTINE=... line in the reset script |
| $WORKING_DIR | Writable copy that benchmarkoor mutates | The --datadir that benchmarkoor's Docker mount uses; see HYBRID=... in the reset script or the source_dir for method: direct datadirs |
| $RESET_SCRIPT | Script that re-syncs $SNAPSHOT_DIR → $WORKING_DIR | Typically reset-hybrid.sh or reset-<dataset>.sh in $BENCH_DIR. May not exist if the snapshot/working dirs are the same (then state simply persists across runs). |
Concrete example for one host (perf-devnet-3):
$BENCH_DIR = /home/erigon/perf-devnet-3-erigon-snapshot
$CONFIG = benchmarkoor.interop.bal.yaml
$INSTANCE_ID = erigon-bal-full
$IMAGE_TAG = erigon-local:traced
$SNAPSHOT_DIR = /home/erigon/perf-devnet-3-erigon-snapshot/erigon_snapshot
$WORKING_DIR = /home/erigon/perf-devnet-3-erigon-snapshot/erigon_hybrid
$RESET_SCRIPT = reset-hybrid.sh
Don't assume these names elsewhere — discover them per session.
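As a rough Python counterpart to the grep in the table above, the sketch below lists candidate instance ids from a config file's text. The function name list_instance_ids and the sample fragment (including the second instance id) are hypothetical; it matches "- id:" list items by regex rather than parsing YAML, so it assumes each id sits on the same line as its dash.

```python
import re

# Same intent as: grep -E '^\s*-\s+id:' $CONFIG
ID_LINE = re.compile(r'^[ \t]*-[ \t]+id:[ \t]*(\S+)', re.MULTILINE)

def list_instance_ids(config_text):
    """Return instance ids found on '- id: <value>' lines, in file order."""
    return [m.group(1).strip('"\'') for m in ID_LINE.finditer(config_text)]

# Hypothetical config fragment, for illustration only.
sample = """
instances:
  - id: erigon-bal-full
    client: erigon
  - id: geth-baseline
    client: geth
"""
print(list_instance_ids(sample))
```

If more than one id comes back, ask the user which one to benchmark rather than picking the first.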
Benchmarkoor's source tree (pkg/datadir/) exposes five client.datadirs.<client>.method values: copy, overlayfs, fuse-overlayfs, zfs, direct. Only the first four are listed in upstream's docs/configuration.md; direct is present in the code (pkg/datadir/direct.go) but documented in its source as "not suitable for normal benchmarking … intended for inspection / resume workflows." Choose based on host filesystem and how clean you need isolation between runs.
Hybrid (method: direct + external reset script): what we use today
Strictly speaking, this is method: direct pointing at a pre-prepared writable copy (erigon_hybrid/) of a read-only pristine snapshot (erigon_snapshot/). A user-owned script (reset-hybrid.sh) rsyncs snapshot → working copy between runs, and the read-only snapshots/ subdir is bind-mounted from the pristine source to avoid duplicating immutable segment files.
Config side: method: direct pointing at $WORKING_DIR. Reset side: external script before each run.
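Because method: direct writes straight into source_dir, it can help to check mechanically that source_dir is not the pristine snapshot before a run. The sketch below is one way to do that; the function name is_safe_direct_source is hypothetical, and it only compares resolved paths (it does not read the YAML or detect bind mounts).

```python
import os

def is_safe_direct_source(source_dir, snapshot_dir):
    """True if source_dir is neither the snapshot itself nor a path inside it."""
    src = os.path.realpath(source_dir)
    snap = os.path.realpath(snapshot_dir)
    # commonpath == snap means src is the snapshot or lives underneath it.
    return os.path.commonpath([src, snap]) != snap

# Paths taken from the perf-devnet-3 example in this skill.
root = '/home/erigon/perf-devnet-3-erigon-snapshot'
print(is_safe_direct_source(f'{root}/erigon_hybrid', f'{root}/erigon_snapshot'))
print(is_safe_direct_source(f'{root}/erigon_snapshot', f'{root}/erigon_snapshot'))
```

Treat a False result as a reason to stop and ask the user before running anything.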
method: overlayfs: doesn't work for our Erigon dataset
Native Linux overlayfs with $SNAPSHOT_DIR as the read-only lower layer and a /tmp/benchmarkoor-overlay-* upper. The mount itself is instant. The problem is that MDBX's open path touches enough chaindata pages during recovery/steady-sync that the kernel copies up substantial parts of the file to the upper layer, which takes minutes. Benchmarkoor's RPC-readiness probe (observed at ~2 minutes; check pkg/runner/ for the current value) trips, and Erigon gets killed mid-open with "Got interrupt, shutting down...".
Past evidence in our results dir: runs 1779001673_b1eeb60b_… and 1779002095_ae096a8b_… (May 17) — both timed out around the Opening Database step.
method: zfs: promising once the host has ZFS
ZFS snapshot + clone provides copy-on-write isolation that should avoid the overlayfs copy-up problem because COW is the native operation, not a degraded fallback.
Requirements: $SNAPSHOT_DIR must live on a ZFS dataset, and the run needs root or appropriate ZFS delegations.
method: direct (raw): not for benchmarking, and dangerous if pointed at the snapshot
Mounts source_dir directly into the container with no isolation. Whatever Erigon writes persists in source_dir. From benchmarkoor's own code comment: "not suitable for normal benchmarking … intended for inspection / resume workflows."
If source_dir is the pristine snapshot, it will be irreversibly mutated. The snapshot is ~2 TB on this host, and re-downloading it takes many hours. There is no automatic backup. Double-check client.datadirs.erigon.source_dir in the YAML before running with method: direct; if it points at $SNAPSHOT_DIR, change it or abort.
The setup we use today applies method: direct correctly because it points at a disposable working copy ($WORKING_DIR), not the pristine snapshot. That's the safe pattern.
method: copy and method: fuse-overlayfs
Two more methods exist but weren't explored:
- copy: parallel file copy of source_dir to a fresh working dir each run. Universal but slow for large datadirs.
- fuse-overlayfs: userspace overlayfs; documented as ~3× slower than native overlayfs. Fallback when native isn't available.

$BENCH_DIR/                    # benchmarkoor host root
├── benchmarkoor               # binary (root-owned, requires sudo)
├── $CONFIG                    # main config (one of possibly several)
├── $RESET_SCRIPT              # resets datadir (sudo-only); may not exist
├── $SNAPSHOT_DIR/             # read-only pristine dataset (never touched)
├── $WORKING_DIR/              # writable copy that benchmarkoor mutates
└── results/runs/              # per-run output dirs
    ├── index.json             # generated run index
    └── <unix_ts>_<short_hash>_<instance-id>/   # one dir per run
# The Erigon clone (containing build/bin/erigon) typically lives elsewhere on
# the host — sibling of $BENCH_DIR or a separate workspace — and is referenced
# via cp/COPY in Step 1. Don't assume it's under $BENCH_DIR.
- Erigon binary: $BENCH_DIR/erigon/build/bin/erigon (or wherever the user's clone lives). If missing, build with make erigon from the erigon repo.
- Docker image: $IMAGE_TAG exists. Confirm with sudo -n docker images | grep <image-name>.
- benchmarkoor binary: $BENCH_DIR/benchmarkoor. Requires sudo -n to invoke (controls docker, cpuset, cpufreq).
- Datadirs: $SNAPSHOT_DIR (untouched) and $WORKING_DIR (working copy). $RESET_SCRIPT rsyncs the former into the latter.
- No competing node: check pgrep -af "build/bin/erigon" and stop any local node that has $WORKING_DIR as its --datadir (many setups have a stop.sh next to the datadir; otherwise pkill -f "datadir.*$WORKING_DIR").
- No stale containers: if benchmarkoor-<oldhash>-* containers are still up, run sudo -n ./benchmarkoor cleanup (or sudo -n docker rm -f <container>) before starting.

If you've changed Erigon source since the last image was built, you need a fresh image. A full Dockerfile build is slow; use a quick overlay instead:
mkdir -p /tmp/erigon-img-overlay
cp "$BENCH_DIR/erigon/build/bin/erigon" /tmp/erigon-img-overlay/erigon
# Unquoted EOF on purpose — $IMAGE_TAG must expand into the heredoc.
cat > /tmp/erigon-img-overlay/Dockerfile <<EOF
FROM $IMAGE_TAG
COPY --chown=erigon:erigon erigon /usr/local/bin/erigon
EOF
cd /tmp/erigon-img-overlay && sudo -n docker build -t "$IMAGE_TAG" .
This re-tags $IMAGE_TAG in under a second. The original Dockerfile multi-stage build is only needed if base layers (OS, deps) changed.
If you've never built the base image, fall back to:
cd "$BENCH_DIR/erigon" && sudo -n docker build -t "$IMAGE_TAG" .
Caveat: the stock Erigon Dockerfile may not reproduce the original $IMAGE_TAG's build flags (e.g. tracing builds use extra args). If the original image was built with non-default args, ask the user how it was first built before falling back.
$RESET_SCRIPT (typically) rsyncs $SNAPSHOT_DIR/ → $WORKING_DIR/ (excluding the read-only snapshots/ bind mount). Requires sudo because containers leave root-owned files behind.
sudo -n bash "$BENCH_DIR/$RESET_SCRIPT"
Always run this before every benchmark run. Skipping it means leftover state from the previous run pollutes results.
If no reset script exists for this dataset, the user has chosen a setup where state persists across runs — in that case ask whether they want a manual rsync from $SNAPSHOT_DIR to $WORKING_DIR before starting, or to deliberately run on the prior state.
Open $BENCH_DIR/$CONFIG. Don't construct YAML from scratch — read the existing file and edit it in place; the snippet below shows the knobs you'll likely touch, not the full schema. (Full schema in the upstream examples/configuration/ reference.) Key knobs:
runner:
benchmark:
tests:
# The filter regex picks which tests run. Edit alternations to add/remove
# tests; edit the size suffix (e.g. 120M / 60M) to change gas budgets per test.
filter: 'regex:__test_(<test_name_1>|<test_name_2>|...)\[.*benchmark_<size>M\]'
client:
config:
resource_limits:
# Prefer explicit `cpuset:` over `cpuset_count:` for reproducibility.
# Pin to exactly 6 distinct physical cores to match ethpandaops's upstream
# reference runs (which also use 6) so results are comparable.
# See "CPU pinning" notes below for how to pick the actual ids per host.
cpuset: [<6 logical CPU ids, one per distinct physical core>]
# cpuset_count: 6 # alternative: random 6 CPUs each run — adds variance
cpu_freq: "3600MHz"
cpu_turboboost: false
cpu_freq_governor: performance
memory: "32g"
instances:
- id: <instance-id>
client: erigon
image: <image-tag> # must match the image tag from Step 1
pull_policy: never # critical — local image only
extra_args:
# fork overrides for the snapshot's chain state — values are dataset-specific
cpuset: (explicit logical-CPU list) and cpuset_count: (random N CPUs each run) are mutually exclusive. Prefer cpuset: — it's deterministic across runs and lets you pick topology-aware values. cpuset_count picks N logical CPUs at random each run, which on SMT hosts produces different physical-core counts each time (the random selection often double-books some physical cores via their SMT siblings), baking noise into A/B comparisons.
Pin to exactly 6 logical CPUs, one per distinct physical core, avoiding SMT sibling pairs. The "6" matches what ethpandaops's upstream reference runs use (e.g. cpuset: [6,7,8,9,10,11] at https://benchmarkoor.core.ethpandaops.io/runs/), so any A/B against published reference numbers is core-count-comparable. If the host has more than 6 physical cores, leave the extras unpinned so the docker daemon, benchmarkoor itself, and the host kernel don't compete with the bench workload. If the host has fewer than 6 physical cores, that's a deeper problem — note it and ask the user.
Discover the host's topology:
lscpu | grep -E "^CPU|^Thread|^Core|^Socket|Model name"
for c in $(seq 0 $(($(nproc)-1))); do
printf 'cpu%s: core=%s siblings=%s\n' \
"$c" \
"$(cat /sys/devices/system/cpu/cpu$c/topology/core_id)" \
"$(cat /sys/devices/system/cpu/cpu$c/topology/thread_siblings_list)"
done
Read off 6 logical CPUs whose core_ids are distinct (i.e. skip SMT siblings). On a typical Linux topology where logical CPUs 0..N-1 are physical and N..2N-1 are SMT siblings of cores 0..N-1, pick any 6 from the lower half.
The literal cpuset numbers in the reference run (6,7,8,9,10,11) are specific to that host — don't copy them; replicate the intent (deterministic + physical-only + count=6).
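The selection rule above (one logical CPU per physical core, six in total) can be sketched in Python. This is a minimal illustration, not part of benchmarkoor: the function names and the sample topology are hypothetical, and the input is assumed to be the thread_siblings_list strings printed by the sysfs loop above, keyed by logical CPU id.

```python
def parse_cpu_list(text):
    """Expand a sysfs CPU list such as '0,12' or '0-1' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return sorted(cpus)

def pick_physical_cpus(siblings_by_cpu, count=6):
    """Pick `count` logical CPUs, at most one per SMT sibling group."""
    chosen, seen_groups = [], set()
    for cpu in sorted(siblings_by_cpu):
        group = tuple(parse_cpu_list(siblings_by_cpu[cpu]))
        if group in seen_groups:
            continue  # an SMT sibling of a core that is already pinned
        seen_groups.add(group)
        chosen.append(cpu)
        if len(chosen) == count:
            return chosen
    raise ValueError(f'only {len(chosen)} physical cores available, need {count}')

# Hypothetical 8-core/16-thread host: cpu N and cpu N+8 share a core.
topology = {c: f'{c % 8},{c % 8 + 8}' for c in range(16)}
print(pick_physical_cpus(topology))
```

The sketch simply takes the lowest-numbered CPU of each sibling group; on a real host you may prefer a different subset, as long as no two chosen ids share a physical core.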
To know how many tests will actually run, dry-run the filter against the extracted test fixtures. Strip the regex: prefix from the YAML filter value and feed the rest to grep -E:
# e.g. for filter 'regex:__test_(blake2f_benchmark|ecrecover)\[.*benchmark_120M\]'
ls "$BENCH_DIR"/.cache/opcode-archive-extract-*/eest_bal/testing/ 2>/dev/null \
| grep -cE '__test_(blake2f_benchmark|ecrecover)\[.*benchmark_120M\]'
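The same dry-run can be sketched in Python when the fixture names are already in a list. The function name count_matching_tests and the sample file names are hypothetical, and Python's re module is used as a stand-in for grep -E; this mirrors the shell check above and is not a claim about how benchmarkoor itself evaluates the filter.

```python
import re

def count_matching_tests(filter_value, test_names):
    """Count names matched by a 'regex:<pattern>' filter value (search semantics)."""
    prefix = 'regex:'
    if not filter_value.startswith(prefix):
        raise ValueError('expected a filter value starting with "regex:"')
    pattern = re.compile(filter_value[len(prefix):])
    return sum(1 for name in test_names if pattern.search(name))

# Hypothetical fixture names, for illustration only.
names = [
    'test_a.py__test_blake2f_benchmark[rounds_1-benchmark_120M].txt',
    'test_b.py__test_ecrecover[benchmark_120M].txt',
    'test_b.py__test_ecrecover[benchmark_60M].txt',
    'test_c.py__test_keccak[benchmark_120M].txt',
]
flt = r'regex:__test_(blake2f_benchmark|ecrecover)\[.*benchmark_120M\]'
print(count_matching_tests(flt, names))
```

A count of zero here is the same signal as a run that "completes" instantly with 0 tests.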
cd "$BENCH_DIR" && sudo -n ./benchmarkoor run \
--config "$CONFIG" \
--limit-instance-id "$INSTANCE_ID" \
2>&1 | tee /tmp/benchmarkoor.log
Notes:
- benchmarkoor run --help shows both --limit-instance-id (specific instance ids; what we use) and --limit-instance-client (any instance for a given client name). They coexist; pick the one matching how you've keyed your instances. Without either flag, benchmarkoor runs every instance in the config.
- Total runtime scales with <number of tests matched by the filter>. Pre-test orchestration (container start, gas-bump, funding) dominates over the actual test on this setup, so expect tens of seconds per test even for tiny test payloads.
- Watch progress with tail -F /tmp/benchmarkoor.log | grep -E "index=[0-9]+/|Error|FAIL|panic", or use the Monitor tool with the same filter to get notifications.
- sudo -n docker ps shows the active container (benchmarkoor-<runid>-<instance>-<index>).

Once the run finishes, list the newest run dirs:
ls -t "$BENCH_DIR"/results/runs/ | head -3
Most recent dir matches <unix_timestamp>_<short_hash>_<instance-id>/. Inside, each test has its own dir:
$BENCH_DIR/results/runs/<run-id>/
├── config.json # snapshot of the YAML
├── result.json # aggregated run-level stats
├── benchmarkoor.log
├── test_<name>.py__test_<func>[<params>].txt/
│ ├── setup.result-aggregated.json
│ ├── setup.result-details.json
│ ├── test.result-aggregated.json # ← per-test MGas/s lives here
│ └── test.result-details.json
└── ... # one dir per test that ran
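To pick the newest run for one instance programmatically, the directory naming scheme above can be parsed as follows. This is a sketch under assumptions: the function names and sample directory names are hypothetical, and it assumes the short hash contains no underscore.

```python
import re

# <unix_timestamp>_<short_hash>_<instance-id>
RUN_DIR = re.compile(r'^(\d+)_([^_]+)_(.+)$')

def parse_run_dir(name):
    """Split a run dir name into its parts, or return None if it doesn't fit."""
    m = RUN_DIR.match(name.rstrip('/'))
    if not m:
        return None
    return {'ts': int(m.group(1)), 'hash': m.group(2), 'instance_id': m.group(3)}

def latest_run(names, instance_id):
    """Return the newest run dir name for instance_id, or None if there is none."""
    best = None
    for name in names:
        info = parse_run_dir(name)
        if info and info['instance_id'] == instance_id:
            if best is None or info['ts'] > best[0]:
                best = (info['ts'], name)
    return best[1] if best else None

# Hypothetical listing of results/runs/, for illustration only.
listing = [
    '1700000000_aaaaaaaa_erigon-bal-full',
    '1700000500_bbbbbbbb_erigon-bal-full',
    '1700000900_cccccccc_other-client',
    'index.json',
]
print(latest_run(listing, 'erigon-bal-full'))
```

Sorting on the embedded timestamp rather than on mtime keeps the choice stable if the dirs were copied or touched later.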
To confirm the run completed all matched tests, check result.json's tests_total / tests_passed (or grep Run result written ... tests_count=<N> in benchmarkoor.log).
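The completeness check can also be expressed in Python once result.json is loaded. A minimal sketch, assuming tests_total and tests_passed are top-level keys in result.json (the skill names the fields but not their nesting, so inspect the file first); run_is_complete and the sample values are hypothetical.

```python
import json

def run_is_complete(result, expected_tests=None):
    """True if every test passed and, optionally, the total matches the filter dry-run."""
    total = result.get('tests_total')
    passed = result.get('tests_passed')
    if total is None or passed is None:
        return False  # fields missing: treat as incomplete and inspect by hand
    if expected_tests is not None and total != expected_tests:
        return False
    return total == passed

# Hypothetical result.json contents, for illustration only.
ok = json.loads('{"tests_total": 12, "tests_passed": 12}')
partial = json.loads('{"tests_total": 12, "tests_passed": 11}')
print(run_is_complete(ok, expected_tests=12), run_is_complete(partial))
```

Passing the count from the filter dry-run as expected_tests catches runs that finished cleanly but on a smaller test set than intended.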
The MGas/s value for each test is at:
.method_stats.mgas_s.engine_newPayloadV<N>.last
The V<N> suffix depends on the dataset's hardfork (V5 for Amsterdam, V4 for Prague, V3 for Cancun, etc. — matching line at top of skill). Don't guess the key — inspect one test.result-aggregated.json first, e.g.:
jq -r '.method_stats.mgas_s | keys[]' \
"$BENCH_DIR"/results/runs/<run-id>/test_*/test.result-aggregated.json | sort -u | head
A .last value (instead of .mean) is fine because each test runs exactly one such call against its payload. If you're A/B-comparing runs across hardforks, the keys differ — the comparator below will show n/a for any mismatched key.
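Before comparing two runs, it can be worth checking that both report the same engine_newPayloadV<N> key. The sketch below does that on two already-loaded test.result-aggregated.json documents; the function names and the sample numbers are hypothetical, and the JSON shape follows the path given above.

```python
def newpayload_keys(aggregated):
    """Return the engine_newPayloadV* keys under .method_stats.mgas_s, sorted."""
    stats = aggregated.get('method_stats', {}).get('mgas_s', {})
    return sorted(k for k in stats if k.startswith('engine_newPayloadV'))

def same_payload_version(before_doc, after_doc):
    """True if both documents expose the same, non-empty set of newPayload keys."""
    before_keys = newpayload_keys(before_doc)
    return bool(before_keys) and before_keys == newpayload_keys(after_doc)

# Hypothetical documents, for illustration only.
v4 = {'method_stats': {'mgas_s': {'engine_newPayloadV4': {'last': 123.4}}}}
v5 = {'method_stats': {'mgas_s': {'engine_newPayloadV5': {'last': 130.0}}}}
print(same_payload_version(v4, v4), same_payload_version(v4, v5))
```

A False result usually means the two runs used datasets on different hardforks, which should be flagged to the user rather than silently compared.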
Use a short Python script that reads two run dirs and prints a per-test speedup table sorted by ratio. It handles runs with different test counts and naming patterns:
# Substitute <bench-dir>, <old-run-dirname>, <new-run-dirname> below before running.
# Python does NOT expand shell variables; use literal paths or os.environ.
# Requires Python 3.9+ (str.removesuffix).
import json, glob, os, re

RUNS_DIR = '<bench-dir>/results/runs'  # or: os.environ['BENCH_DIR'] + '/results/runs'
before_id = '<old-run-dirname>'
after_id = '<new-run-dirname>'

def shorten(name):
    """Collapse a long test dir name into the test name plus its key parameters."""
    m = re.search(r'test_(\w+)\.py__(test_\w+)\[(.+)\]', name)
    if not m:
        return name
    tname, params = m.group(2), m.group(3)
    parts = [tname]
    for label, pat in [('', r'opcode_(\w+)'), ('mod', r'mod_bits_(\d+)'),
                       ('rounds', r'rounds_(\d+)'), ('', r'benchmark_(\d+M)')]:
        mm = re.search(pat, params)
        if mm:
            parts.append(f'{label}{mm.group(1)}')
    return '-'.join(parts)

def mgas(run_id):
    """Read MGas/s from each test in a run. Auto-detects the engine_newPayloadV<N> key."""
    out = {}
    for f in sorted(glob.glob(os.path.join(RUNS_DIR, run_id, 'test_*/test.result-aggregated.json'))):
        with open(f) as fp:
            d = json.load(fp)
        name = os.path.basename(os.path.dirname(f)).removesuffix('.txt')
        stats = d.get('method_stats', {}).get('mgas_s', {})
        # Pick the first engine_newPayloadV* entry; same hardfork means same key across tests.
        key = next((k for k in stats if k.startswith('engine_newPayloadV')), None)
        if key and 'last' in stats[key]:
            out[name] = stats[key]['last']
    return out

b, a = mgas(before_id), mgas(after_id)
rows = [(shorten(n), b.get(n), a.get(n)) for n in sorted(set(b) | set(a))]
rows = [(n, bv, av,
         (av / bv) if (bv is not None and av is not None and bv > 0) else None)
        for n, bv, av in rows]
# Largest speedup first; rows without a ratio (test missing in one run) sink to the bottom.
rows.sort(key=lambda r: r[3] if r[3] is not None else 0, reverse=True)

def fmt_num(x): return f'{x:>14.1f}' if x is not None else f'{"n/a":>14}'
def fmt_sp(s): return f'{s:>8.2f}x' if s is not None else f'{"n/a":>9}'

print(f'{"Test":<55} {"Before":>14} {"After":>14} {"Speedup":>9}')
for n, bv, av, sp in rows:
    print(f'{n:<55} {fmt_num(bv)} {fmt_num(av)} {fmt_sp(sp)}')

# Average only over tests present in BOTH runs, so the two means are comparable.
common = sorted(set(b) & set(a))
sb, sa = sum(b[n] for n in common), sum(a[n] for n in common)
if common and sb > 0:
    print(f'\navg over {len(common)} common tests: '
          f'{sb/len(common):.1f} → {sa/len(common):.1f} ({sa/sb:.2f}x)')
Render as a markdown table for inclusion in the response; write to /tmp/benchmark_comparison.md if the user wants to copy-paste. Important: the average is over the test set actually present in both runs. If the filter changed between runs, the averages aren't directly comparable — call that out explicitly.
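One way to render the comparison as a markdown table is sketched below. The function name to_markdown and the sample rows are hypothetical; it expects (name, before, after, speedup) tuples in the shape the comparison script builds, and it assumes test names contain no "|" characters.

```python
def to_markdown(rows):
    """Render (name, before, after, speedup) rows as a markdown table; None becomes n/a."""
    def num(x):
        return f'{x:.1f}' if x is not None else 'n/a'
    def ratio(x):
        return f'{x:.2f}x' if x is not None else 'n/a'
    lines = ['| Test | Before (MGas/s) | After (MGas/s) | Speedup |',
             '|---|---:|---:|---:|']
    for name, before, after, speedup in rows:
        lines.append(f'| {name} | {num(before)} | {num(after)} | {ratio(speedup)} |')
    return '\n'.join(lines)

# Hypothetical rows, for illustration only.
sample_rows = [
    ('test_foo-120M', 100.0, 150.0, 1.5),
    ('test_bar-120M', 80.0, None, None),
]
table = to_markdown(sample_rows)
print(table)
# To save it for copy-paste:
# with open('/tmp/benchmark_comparison.md', 'w') as fp:
#     fp.write(table + '\n')
```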
| Symptom | Cause | Fix |
|---|---|---|
| Failed to set turbo boost warning | CPU governor not user-controllable | Harmless; ignore. |
| HEAD failed; using cached file | GitHub Actions artifact HEAD requires auth | Harmless if cache is present at $BENCH_DIR/.cache/. |
| Container stopped for recreate count = N-1 (not N) at end | Last container's "stopped" log fires after the suite completion log | Verify with Run result written ... tests_count=<N> in the log. |
| Big variance between identical runs | CPU governor not pinned, or other heavy workload | Always set cpu_freq_governor: performance; don't run other CPU-heavy tasks (full make test-all, syncing nodes) simultaneously. |
| Image-tag mismatch (the right IMAGE_TAG not used) | Docker cached an older layer | Rebuild the image (Step 1) explicitly; confirm docker images shows a recent CREATED time. |
| Old benchmarkoor-<hash>-* container lingering | Previous run aborted before cleanup | sudo -n docker rm -f <container> + sudo -n ./benchmarkoor cleanup. |
| Run "completes" instantly with 0 tests | Wrong cwd, or filter regex matches 0 tests | Confirm cwd is $BENCH_DIR; sanity-check the filter against $BENCH_DIR/.cache/opcode-archive-extract-*/eest_bal/testing/*.txt. |
- Don't hard-code erigon_snapshot / erigon_hybrid / benchmarkoor.interop.bal.yaml / erigon-local:traced / erigon-bal-full; they vary between datasets and hosts.
- Pass --limit-instance-id when comparing just one client; otherwise the run also exercises every other client in the config (geth/besu/reth/nethermind/…), which adds tens of minutes and clutters results/runs/.
- The image tag in the YAML must match a locally built image. With pull_policy: never, benchmarkoor will refuse to pull, so a missing local image fails immediately rather than silently downloading a stale upstream one.
- results/runs/index.json is regenerated by generate-index-file; don't hand-edit. After a successful run it's auto-generated when generate_results_index: true is set in the config. If it's stale or missing (e.g. an aborted previous run, or comparing against an older directory the index doesn't list), regenerate explicitly: sudo -n ./benchmarkoor generate-index-file --config "$CONFIG".
- Suite hashes (results/suites/): two runs with the same filter regex have the same suite hash, so comparing them is direct. A different filter ⇒ different suite hash ⇒ different test set, and shortened-name matching is the only sane cross-suite comparison; flag this to the user.