| name | benchmarkoor |
| description | Run benchmarkoor performance benchmarks against a locally-built Erigon binary and produce per-test MGas/s comparison tables. Covers image build, dataset reset, run invocation, result parsing, and before/after comparisons. |
| allowed-tools | Bash, Read, Write, Edit, Glob, Monitor |
Benchmarkoor: per-test throughput benchmarks
Benchmarkoor (ethpandaops/benchmarkoor) drives an execution client through the Engine API and measures engine_newPayloadV<N> throughput in MGas/s per test (the V<N> payload version depends on the dataset's hardfork: V5 for Amsterdam, V4 for Prague, etc.). This skill teaches you to run it against a locally-built Erigon, parse results, and compare two runs.
The skill assumes the host already has the benchmarkoor binary, a config YAML, a snapshot/working datadir pair, and Docker installed; it focuses on the agent workflow, not first-time setup.
References
(All $VAR placeholders used below, such as $BENCH_DIR, $CONFIG, and $INSTANCE_ID, are defined in the next section, "Adapt to the user's environment first".)
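When something in this skill is ambiguous, or for newer/unfamiliar config options, consult the upstream sources in the ethpandaops/benchmarkoor repository (docs/configuration.md and the examples/configuration/ directory) before guessing.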
If you change anything in $CONFIG that isn't already in this skill, cross-check it against the docs or examples links above before running.
Adapt to the user's environment first
Before running anything, identify the host-specific values. Don't hard-code these; different hosts and different datasets use different names. Ask the user if anything below is ambiguous.
| Placeholder | What it is | How to discover |
|---|---|---|
| $BENCH_DIR | benchmarkoor host root (contains the binary, config, snapshot dirs) | The binary usually isn't in $PATH. Find it with find ~ /opt /srv -maxdepth 4 -name benchmarkoor -type f -executable 2>/dev/null and take the parent dir, or ask the user. Common pattern: ~/<dataset-name>/ |
| $CONFIG | YAML config file passed to benchmarkoor run --config <…> | ls $BENCH_DIR/*.yaml; there may be more than one (one per network/dataset). Ask which to use. |
| $INSTANCE_ID | The instances[].id to benchmark | grep -E '^\s*-\s+id:' $CONFIG |
| $IMAGE_TAG | Docker image tag the Erigon instance references | Look at instances[].image for the chosen $INSTANCE_ID (e.g. erigon-local:traced) |
| $SNAPSHOT_DIR | Read-only pristine dataset; never mutated | Look at client.datadirs.erigon.source_dir in the config, or the PRISTINE=... line in the reset script |
| $WORKING_DIR | Writable copy that benchmarkoor mutates | The --datadir benchmarkoor's docker mount uses; see HYBRID=... in the reset script or the source_dir for method: direct datadirs |
| $RESET_SCRIPT | Script that re-syncs $SNAPSHOT_DIR → $WORKING_DIR | Typically reset-hybrid.sh or reset-<dataset>.sh in $BENCH_DIR. May not exist if the snapshot/working dirs are the same (then state simply persists across runs). |
Concrete example for one host (perf-devnet-3):
$BENCH_DIR = /home/erigon/perf-devnet-3-erigon-snapshot
$CONFIG = benchmarkoor.interop.bal.yaml
$INSTANCE_ID = erigon-bal-full
$IMAGE_TAG = erigon-local:traced
$SNAPSHOT_DIR = /home/erigon/perf-devnet-3-erigon-snapshot/erigon_snapshot
$WORKING_DIR = /home/erigon/perf-devnet-3-erigon-snapshot/erigon_hybrid
$RESET_SCRIPT = reset-hybrid.sh
Don't assume these names elsewhere; discover them per session.
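As a sketch of the $INSTANCE_ID discovery step, the grep from the table can be mirrored in Python without a YAML parser. The function name and the sample config fragment (including the geth-latest id) are illustrative assumptions; like the grep, it matches any list item that starts with an id: key, not only entries under instances.

```python
import re

# Mirrors: grep -E '^\s*-\s+id:' $CONFIG
ID_LINE = re.compile(r'^\s*-\s+id:\s*(.+?)\s*$')

def list_instance_ids(config_text):
    """Return the value of every '- id: ...' list item, in file order."""
    ids = []
    for line in config_text.splitlines():
        m = ID_LINE.match(line)
        if m:
            # Drop optional YAML quotes around the value.
            ids.append(m.group(1).strip('"\''))
    return ids

# Hypothetical config fragment; a real file has many more keys.
sample = """
instances:
  - id: erigon-bal-full
    client: erigon
  - id: "geth-latest"
    client: geth
"""
print(list_instance_ids(sample))
```

If more than one id comes back, ask the user which one to benchmark rather than picking the first.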
Datadir setup approaches
Benchmarkoor's source tree (pkg/datadir/) exposes five client.datadirs.<client>.method values: copy, overlayfs, fuse-overlayfs, zfs, direct. Only the first four are listed in upstream's docs/configuration.md; direct is present in the code (pkg/datadir/direct.go) but documented in its source as "not suitable for normal benchmarking ... intended for inspection / resume workflows." Choose based on host filesystem and how clean you need isolation between runs.
1. "Hybrid" (= method: direct + external reset script) - what we use today
Strictly speaking this is method: direct pointing at a pre-prepared writable copy (erigon_hybrid/) of a read-only pristine snapshot (erigon_snapshot/). A user-owned script (reset-hybrid.sh) rsyncs snapshot → working copy between runs, and the read-only snapshots/ subdir is bind-mounted from the pristine source to avoid duplicating immutable segment files.
- Pros: works on any filesystem; no kernel/ZFS features needed; no per-test copy-up cost during execution; full reset is one rsync (~tens of seconds).
- Cons: user-managed, not built into benchmarkoor; requires sudo for the reset (containers leave root-owned files in the working dir).
- Use when: the host has no ZFS, and you want fast, repeatable, isolated runs without paying overlayfs copy-up at runtime.
Config side: method: direct pointing at $WORKING_DIR. Reset side: external script before each run.
2. method: overlayfs - doesn't work for our Erigon dataset
Native Linux overlayfs with $SNAPSHOT_DIR as the read-only lower layer and a /tmp/benchmarkoor-overlay-* upper. Mount itself is instant. The problem is that MDBX's open path touches enough chaindata pages during recovery/steady-sync that the kernel copies up substantial parts of the file to the upper layer, taking minutes. Benchmarkoor's RPC-readiness probe (observed at ~2 minutes; check pkg/runner/ for the current value) trips, and Erigon gets killed mid-open with Got interrupt, shutting down....
Past evidence in our results dir: runs 1779001673_b1eeb60b_… and 1779002095_ae096a8b_… (May 17). Both timed out around the Opening Database step.
- Use when: working with clients whose datadirs don't trigger heavy copy-up (smaller / less-touched files). For Erigon with the current dataset size, skip it.
- Possible workaround if you must: increase whatever timeout benchmarkoor exposes for RPC readiness; we didn't pursue this.
3. method: zfs - promising once the host has ZFS
ZFS snapshot + clone provides copy-on-write isolation that should avoid the overlayfs copy-up problem because COW is the native operation, not a degraded fallback.
- Requires: $SNAPSHOT_DIR must live on a ZFS dataset; root or appropriate ZFS delegations.
- What benchmarkoor does: snapshots the source dataset, clones the snapshot to a working dataset, mounts that into the container; cleans up the clone after the run.
- Not yet tested here: the current host's root FS is ext4, so we never exercised it. When migrating to a ZFS host, this is the path to try first; it should subsume the "hybrid" workflow with no external reset script needed.
4. method: direct (raw) - not for benchmarking, and dangerous if pointed at the snapshot
Mounts source_dir directly into the container with no isolation. Whatever Erigon writes persists in source_dir. From benchmarkoor's own code comment: "not suitable for normal benchmarking ... intended for inspection / resume workflows."
- ⚠️ If source_dir is the pristine snapshot, it will be irreversibly mutated. The snapshot is ~2 TB on this host, and re-downloading it takes many hours. There is no automatic backup. Double-check client.datadirs.erigon.source_dir in the YAML before running with method: direct; if it points at $SNAPSHOT_DIR, change it or abort.
- Use only when: you specifically want to inspect / iterate on the chain state left behind by a prior run, or you're debugging.
- The "hybrid" approach above uses method: direct correctly because it points at a disposable working copy ($WORKING_DIR), not the pristine snapshot. That's the safe pattern.
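The source_dir double-check described above can be sketched as a small guard. This is an illustrative helper, not part of benchmarkoor: the function names are assumptions, and it expects the caller to pass in the method and source_dir values already read from the YAML.

```python
import os

def points_at_snapshot(source_dir, snapshot_dir):
    """True if source_dir is the snapshot itself or lives inside it (symlinks resolved)."""
    src = os.path.realpath(source_dir)
    snap = os.path.realpath(snapshot_dir)
    # commonpath compares whole path components, so erigon_snapshot2 is
    # not mistaken for a child of erigon_snapshot.
    return os.path.commonpath([src, snap]) == snap

def check_direct_datadir(method, source_dir, snapshot_dir):
    """Raise before a run that would let the client write into the pristine snapshot."""
    if method == "direct" and points_at_snapshot(source_dir, snapshot_dir):
        raise RuntimeError(
            f"method: direct would mutate the pristine snapshot at {snapshot_dir}")

# Values from the perf-devnet-3 example host.
snapshot = "/home/erigon/perf-devnet-3-erigon-snapshot/erigon_snapshot"
working = "/home/erigon/perf-devnet-3-erigon-snapshot/erigon_hybrid"
check_direct_datadir("direct", working, snapshot)  # passes silently
```

Resolving symlinks first matters because a working-dir path that is really a symlink into the snapshot would otherwise look safe.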
(Bonus) method: copy and method: fuse-overlayfs
Two more methods exist but weren't explored:
- copy: parallel file copy of source_dir to a fresh working dir each run. Universal but slow for large datadirs.
- fuse-overlayfs: userspace overlayfs; documented as ~3× slower than native overlayfs. Fallback when native isn't available.
Layout convention (typical)
$BENCH_DIR/                        # benchmarkoor host root
├── benchmarkoor                   # binary (root-owned, requires sudo)
├── $CONFIG                        # main config (one of possibly several)
├── $RESET_SCRIPT                  # resets datadir (sudo-only); may not exist
├── $SNAPSHOT_DIR/                 # read-only pristine dataset (never touched)
├── $WORKING_DIR/                  # writable copy that benchmarkoor mutates
└── results/runs/                  # per-run output dirs
    ├── index.json                 # generated run index
    └── <unix_ts>_<short_hash>_<instance-id>/   # one dir per run
# The Erigon clone (containing build/bin/erigon) typically lives elsewhere on
# the host (a sibling of $BENCH_DIR or a separate workspace) and is referenced
# via cp/COPY in Step 1. Don't assume it's under $BENCH_DIR.
Prerequisites (verify before starting)
- Erigon binary built at $BENCH_DIR/erigon/build/bin/erigon (or wherever the user's clone lives). If missing, build with make erigon from the erigon repo.
- Docker image $IMAGE_TAG exists. Confirm with sudo -n docker images | grep <image-name>.
- Benchmarkoor binary at $BENCH_DIR/benchmarkoor. Requires sudo -n to invoke (controls docker, cpuset, cpufreq).
- Dataset snapshot at $SNAPSHOT_DIR (untouched) and $WORKING_DIR (working copy). $RESET_SCRIPT rsyncs the former into the latter.
- No conflicting Erigon process using the benchmark datadir. Check with pgrep -af "build/bin/erigon" and stop any local node that has $WORKING_DIR as its --datadir (many setups have a stop.sh next to the datadir; otherwise pkill -f "datadir.*$WORKING_DIR").
- No stale benchmarkoor containers; they don't always get cleaned up on aborted runs. If you see benchmarkoor-<oldhash>-* containers still up, run sudo -n ./benchmarkoor cleanup (or sudo -n docker rm -f <container>) before starting.
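The stale-container check in the last bullet can be sketched in Python. The helper names are assumptions; the only facts relied on are the benchmarkoor- name prefix described in this skill and Docker's ps -a --format '{{.Names}}' output of one container name per line. It treats every matching container as a leftover, so only use it when no benchmark run is in progress.

```python
import subprocess

def docker_container_names():
    """List all container names (running or stopped) via the Docker CLI."""
    out = subprocess.run(
        ["sudo", "-n", "docker", "ps", "-a", "--format", "{{.Names}}"],
        check=True, capture_output=True, text=True,
    ).stdout
    return out.split()

def stale_benchmarkoor_containers(container_names):
    """Return the names that carry the benchmarkoor- prefix, sorted."""
    return sorted(n for n in container_names if n.startswith("benchmarkoor-"))

# Made-up names for illustration; on a real host pass docker_container_names().
names = ["benchmarkoor-abc123-erigon-bal-full-7", "grafana", "benchmarkoor-abc123-erigon-bal-full-8"]
print(stale_benchmarkoor_containers(names))
```

An empty list means this prerequisite is satisfied; otherwise clean up as described in the bullet before starting.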
Workflow
Step 1 - Rebuild Docker image with the new binary
If you've changed Erigon source since the last image was built, you need a fresh image. A full Dockerfile build is slow; use a quick overlay instead:
mkdir -p /tmp/erigon-img-overlay
cp "$BENCH_DIR/erigon/build/bin/erigon" /tmp/erigon-img-overlay/erigon
cat > /tmp/erigon-img-overlay/Dockerfile <<EOF
FROM $IMAGE_TAG
COPY --chown=erigon:erigon erigon /usr/local/bin/erigon
EOF
cd /tmp/erigon-img-overlay && sudo -n docker build -t "$IMAGE_TAG" .
This re-tags $IMAGE_TAG in under a second. The original Dockerfile multi-stage build is only needed if base layers (OS, deps) changed.
If you've never built the base image, fall back to:
cd "$BENCH_DIR/erigon" && sudo -n docker build -t "$IMAGE_TAG" .
Caveat: the stock Erigon Dockerfile may not reproduce the original $IMAGE_TAG's build flags (e.g. tracing builds use extra args). If the original image was built with non-default args, ask the user how it was first built before falling back.
Step 2 - Reset the working datadir
$RESET_SCRIPT (typically) rsyncs $SNAPSHOT_DIR/ → $WORKING_DIR/ (excluding the read-only snapshots/ bind mount). Requires sudo because containers leave root-owned files behind.
sudo -n bash "$BENCH_DIR/$RESET_SCRIPT"
Always run this before every benchmark run. Skipping it means leftover state from the previous run pollutes results.
If no reset script exists for this dataset, the user has chosen a setup where state persists across runs. In that case, ask whether they want a manual rsync from $SNAPSHOT_DIR to $WORKING_DIR before starting, or to deliberately run on the prior state.
Step 3 - Inspect/edit the config
Open $BENCH_DIR/$CONFIG. Don't construct YAML from scratch; read the existing file and edit it in place. The snippet below shows the knobs you'll likely touch, not the full schema. (Full schema in the upstream examples/configuration/ reference.) Key knobs:
runner:
  benchmark:
    tests:
      filter: 'regex:__test_(<test_name_1>|<test_name_2>|...)\[.*benchmark_<size>M\]'
  client:
    config:
      resource_limits:
        cpuset: [<6 logical CPU ids, one per distinct physical core>]
        cpu_freq: "3600MHz"
        cpu_turboboost: false
        cpu_freq_governor: performance
        memory: "32g"
  instances:
    - id: <instance-id>
      client: erigon
      image: <image-tag>
      pull_policy: never
      extra_args:
CPU pinning
cpuset: (explicit logical-CPU list) and cpuset_count: (random N CPUs each run) are mutually exclusive. Prefer cpuset: because it's deterministic across runs and lets you pick topology-aware values. cpuset_count picks N logical CPUs at random each run, which on SMT hosts produces different physical-core counts each time (the random selection often double-books some physical cores via their SMT siblings), baking noise into A/B comparisons.
Pin to exactly 6 logical CPUs, one per distinct physical core, avoiding SMT sibling pairs. The "6" matches what ethpandaops's upstream reference runs use (e.g. cpuset: [6,7,8,9,10,11] at https://benchmarkoor.core.ethpandaops.io/runs/), so any A/B against published reference numbers is core-count-comparable. If the host has more than 6 physical cores, leave the extras unpinned so the docker daemon, benchmarkoor itself, and the host kernel don't compete with the bench workload. If the host has fewer than 6 physical cores, that's a deeper problem; note it and ask the user.
Discover the host's topology:
lscpu | grep -E "^CPU|^Thread|^Core|^Socket|Model name"
for c in $(seq 0 $(($(nproc)-1))); do
  printf 'cpu%s: core=%s siblings=%s\n' \
    "$c" \
    "$(cat /sys/devices/system/cpu/cpu$c/topology/core_id)" \
    "$(cat /sys/devices/system/cpu/cpu$c/topology/thread_siblings_list)"
done
Read off 6 logical CPUs whose core_ids are distinct (i.e. skip SMT siblings). On a typical Linux topology where logical CPUs 0..N-1 are physical and N..2N-1 are SMT siblings of cores 0..N-1, pick any 6 from the lower half.
The literal cpuset numbers in the reference run (6,7,8,9,10,11) are specific to that host. Don't copy them; replicate the intent (deterministic + physical-only + count=6).
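The selection rule above (one logical CPU per SMT sibling group) can be sketched in Python. This is a minimal illustration with assumed function names; it groups by thread_siblings_list rather than core_id so that repeated core ids across sockets are not conflated, and it simply takes the lowest-numbered CPU of each group.

```python
import glob
import re

def parse_cpu_list(text):
    """Expand a sysfs CPU list such as '0,8' or '0-1' into a sorted list of ints."""
    cpus = set()
    for part in text.strip().split(','):
        if '-' in part:
            lo, hi = part.split('-')
            cpus.update(range(int(lo), int(hi) + 1))
        elif part:
            cpus.add(int(part))
    return sorted(cpus)

def read_thread_siblings():
    """Read thread_siblings_list for every logical CPU from sysfs (Linux only)."""
    result = {}
    pattern = '/sys/devices/system/cpu/cpu[0-9]*/topology/thread_siblings_list'
    for path in glob.glob(pattern):
        cpu = int(re.search(r'/cpu(\d+)/topology/', path).group(1))
        with open(path) as fp:
            result[cpu] = fp.read().strip()
    return result

def pick_physical_cpus(siblings_by_cpu, count=6):
    """Pick `count` logical CPUs, at most one per SMT sibling group."""
    chosen, seen_groups = [], set()
    for cpu in sorted(siblings_by_cpu):
        group = tuple(parse_cpu_list(siblings_by_cpu[cpu]))
        if group in seen_groups:
            continue  # an SMT sibling of a CPU we already took
        seen_groups.add(group)
        chosen.append(cpu)
        if len(chosen) == count:
            return chosen
    raise ValueError(f"only {len(chosen)} physical cores available, need {count}")

# Synthetic 8-core/16-thread host where cpu N and cpu N+8 are siblings.
topology = {c: f'{c % 8},{c % 8 + 8}' for c in range(16)}
print(pick_physical_cpus(topology))
```

On a real host, call pick_physical_cpus(read_thread_siblings()) and paste the result into cpuset:. The ValueError corresponds to the "fewer than 6 physical cores" case, which should be raised with the user rather than worked around.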
Sanity-check the filter
To know how many tests will actually run, dry-run the filter against the extracted test fixtures. Strip the regex: prefix from the YAML filter value and feed the rest to grep -E:
ls "$BENCH_DIR"/.cache/opcode-archive-extract-*/eest_bal/testing/ 2>/dev/null \
| grep -cE '__test_(blake2f_benchmark|ecrecover)\[.*benchmark_120M\]'
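The same dry run can be sketched in Python, which avoids hand-stripping the regex: prefix. The function name and fixture names below are made up for illustration. Python's re and grep -E agree on a pattern like this one, but they are different regex dialects (and may differ from whatever engine benchmarkoor uses), so treat the count as a sanity check rather than a guarantee.

```python
import re

def count_filter_matches(filter_value, fixture_names):
    """Count fixture names matched by a tests.filter value of the form 'regex:<pattern>'."""
    pattern = filter_value.removeprefix('regex:')  # Python 3.9+
    rx = re.compile(pattern)
    return sum(1 for name in fixture_names if rx.search(name))

# Hypothetical fixture file names; on a real host use os.listdir() on the
# extracted testing/ directory instead.
fixtures = [
    'test_a.py__test_blake2f_benchmark[fork_X-rounds_1-benchmark_120M].txt',
    'test_b.py__test_ecrecover[fork_X-benchmark_120M].txt',
    'test_b.py__test_ecrecover[fork_X-benchmark_60M].txt',
    'test_c.py__test_modexp[fork_X-benchmark_120M].txt',
]
flt = r'regex:__test_(blake2f_benchmark|ecrecover)\[.*benchmark_120M\]'
print(count_filter_matches(flt, fixtures))
```

A count of zero here predicts the "run completes instantly with 0 tests" failure mode before any container is started.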
Step 4 - Run benchmarkoor
cd "$BENCH_DIR" && sudo -n ./benchmarkoor run \
--config "$CONFIG" \
--limit-instance-id "$INSTANCE_ID" \
2>&1 | tee /tmp/benchmarkoor.log
Notes:
- benchmarkoor run --help shows both --limit-instance-id (specific instance ids; what we use) and --limit-instance-client (any instance for a given client name). They coexist; pick the one matching how you've keyed your instances. Without either flag, benchmarkoor runs every instance in the config.
- Each test runs as its own freshly-recreated container; the suite wall-clock scales linearly with <number of tests matched by the filter>. Pre-test orchestration (container start, gas-bump, funding) dominates over the actual test on this setup, so expect tens of seconds per test even for tiny test payloads.
- Run in background and watch progress: either tail+grep with tail -F /tmp/benchmarkoor.log | grep -E "index=[0-9]+/|Error|FAIL|panic", or use the Monitor tool with the same filter to get notifications.
- While running, sudo -n docker ps shows the active container (benchmarkoor-<runid>-<instance>-<index>).
Step 5 - Locate results
ls -t "$BENCH_DIR"/results/runs/ | head -3
Most recent dir matches <unix_timestamp>_<short_hash>_<instance-id>/. Inside, each test has its own dir:
$BENCH_DIR/results/runs/<run-id>/
├── config.json                          # snapshot of the YAML
├── result.json                          # aggregated run-level stats
├── benchmarkoor.log
├── test_<name>.py__test_<func>[<params>].txt/
│   ├── setup.result-aggregated.json
│   ├── setup.result-details.json
│   ├── test.result-aggregated.json      # ← per-test MGas/s lives here
│   └── test.result-details.json
└── ...                                  # one dir per test that ran
To confirm the run completed all matched tests, check result.json's tests_total / tests_passed (or grep Run result written ... tests_count=<N> in benchmarkoor.log).
The MGas/s value for each test is at:
.method_stats.mgas_s.engine_newPayloadV<N>.last
The V<N> suffix depends on the dataset's hardfork (V5 for Amsterdam, V4 for Prague, V3 for Cancun, etc., as noted at the top of this skill). Don't guess the key; inspect one test.result-aggregated.json first, e.g.:
jq -r '.method_stats.mgas_s | keys[]' \
"$BENCH_DIR"/results/runs/<run-id>/test_*/test.result-aggregated.json | sort -u | head
A .last value (instead of .mean) is fine because each test runs exactly one such call against its payload. If you're A/B-comparing runs across hardforks, the keys differ; the comparator below auto-detects the engine_newPayloadV<N> key per test and shows n/a only where a test has no such key or is missing from one of the runs.
Step 6 - Build a comparison table
Use a short Python script that reads two run dirs and produces a per-test speedup table sorted by ratio. It handles runs with different test counts (tests missing from one run show n/a) and shortens the common test-name patterns:
import json, glob, os, re
RUNS_DIR = '<bench-dir>/results/runs'
before_id = '<old-run-dirname>'
after_id = '<new-run-dirname>'
def shorten(name):
    """Collapse a long test dir name into '<test_func>-<key params>'."""
    m = re.search(r'test_(\w+)\.py__(test_\w+)\[(.+)\]', name)
    if not m: return name
    tname, params = m.group(2), m.group(3)
    parts = [tname]
    for label, pat in [('', r'opcode_(\w+)'), ('mod', r'mod_bits_(\d+)'),
                       ('rounds', r'rounds_(\d+)'), ('', r'benchmark_(\d+M)')]:
        mm = re.search(pat, params)
        if mm: parts.append(f'{label}{mm.group(1)}')
    return '-'.join(parts)
def mgas(run_id):
    """Read MGas/s from each test in a run. Auto-detects the engine_newPayloadV<N> key."""
    out = {}
    for f in sorted(glob.glob(os.path.join(RUNS_DIR, run_id, 'test_*/test.result-aggregated.json'))):
        with open(f) as fp: d = json.load(fp)
        name = os.path.basename(os.path.dirname(f)).removesuffix('.txt')
        stats = d.get('method_stats', {}).get('mgas_s', {})
        key = next((k for k in stats if k.startswith('engine_newPayloadV')), None)
        if key and 'last' in stats[key]:
            out[name] = stats[key]['last']
    return out
b, a = mgas(before_id), mgas(after_id)
# One row per test seen in either run; a missing side stays None and prints as n/a.
rows = [(shorten(n), b.get(n), a.get(n)) for n in sorted(set(b)|set(a))]
rows = [(n, bv, av,
         (av / bv) if (bv is not None and av is not None and bv > 0) else None)
        for n, bv, av in rows]
# Largest speedup first; rows without a ratio sort to the bottom.
rows.sort(key=lambda r: r[3] if r[3] is not None else 0, reverse=True)
def fmt_num(x): return f'{x:>14.1f}' if x is not None else f'{"n/a":>14}'
def fmt_sp(s): return f'{s:>8.2f}x' if s is not None else f'{"n/a":>9}'
print(f'{"Test":<55} {"Before":>14} {"After":>14} {"Speedup":>9}')
for n, bv, av, sp in rows:
    print(f'{n:<55} {fmt_num(bv)} {fmt_num(av)} {fmt_sp(sp)}')
# Average only over tests present in both runs, so the two means are comparable.
common = sorted(set(b) & set(a))
if common:
    b_avg = sum(b[n] for n in common) / len(common)
    a_avg = sum(a[n] for n in common) / len(common)
    ratio = f'{a_avg / b_avg:.2f}x' if b_avg > 0 else 'n/a'
    print(f'\navg over {len(common)} common tests: {b_avg:.1f} -> {a_avg:.1f} ({ratio})')
Render as a markdown table for inclusion in the response; write to /tmp/benchmark_comparison.md if the user wants to copy-paste. Important: the average is computed over the tests present in both runs. If the filter changed between runs, the averages aren't directly comparable; call that out explicitly.
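The markdown rendering mentioned above could look like the sketch below. It assumes rows shaped like the comparison script's (name, before, after, speedup) tuples; the function name, column titles, and sample numbers are illustrative, not real results.

```python
def to_markdown(rows):
    """Render (name, before, after, speedup) rows as a markdown table; None becomes n/a."""
    def num(x):
        return f'{x:.1f}' if x is not None else 'n/a'
    def ratio(x):
        return f'{x:.2f}x' if x is not None else 'n/a'
    lines = ['| Test | Before (MGas/s) | After (MGas/s) | Speedup |',
             '|---|---:|---:|---:|']
    for name, before, after, speedup in rows:
        lines.append(f'| {name} | {num(before)} | {num(after)} | {ratio(speedup)} |')
    return '\n'.join(lines)

# Made-up numbers purely to show the output shape.
sample_rows = [
    ('test_ecrecover-120M', 80.0, 120.0, 1.5),
    ('test_blake2f_benchmark-rounds1-120M', 200.0, None, None),
]
table = to_markdown(sample_rows)
print(table)
# To save it for copy-paste:
# with open('/tmp/benchmark_comparison.md', 'w') as fp: fp.write(table + '\n')
```

Right-aligning the numeric columns keeps the table readable when MGas/s values span several orders of magnitude.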
Failure modes & gotchas
| Symptom | Cause | Fix |
|---|---|---|
| Failed to set turbo boost warning | CPU governor not user-controllable | Harmless; ignore. |
| HEAD failed; using cached file | GitHub Actions artifact HEAD requires auth | Harmless if cache is present at $BENCH_DIR/.cache/. |
| Container stopped for recreate count = N-1 (not N) at end | Last container's "stopped" log fires after the suite completion log | Verify with Run result written ... tests_count=<N> in the log. |
| Big variance between identical runs | CPU governor not pinned, or other heavy workload | Always set cpu_freq_governor: performance; don't run other CPU-heavy tasks (full make test-all, syncing nodes) simultaneously. |
| Image-tag mismatch (the right $IMAGE_TAG not used) | Docker cached an older layer | Rebuild the image (Step 1) explicitly; confirm docker images shows a recent CREATED time. |
| Old benchmarkoor-<hash>-* container lingering | Previous run aborted before cleanup | sudo -n docker rm -f <container> + sudo -n ./benchmarkoor cleanup. |
| Run "completes" instantly with 0 tests | Wrong cwd, or filter regex matches 0 tests | Confirm cwd is $BENCH_DIR; sanity-check the filter against $BENCH_DIR/.cache/opcode-archive-extract-*/eest_bal/testing/*.txt. |
A few "what to remember" rules
- Discover the host-specific names per session. Don't hard-code erigon_snapshot, erigon_hybrid, benchmarkoor.interop.bal.yaml, erigon-local:traced, or erigon-bal-full; they vary between datasets and hosts.
- Always run the reset script (if it exists) before each run. State leakage between runs is real and silently skews numbers.
- Always pass --limit-instance-id when comparing just one client; otherwise the run also exercises every other client in the config (geth/besu/reth/nethermind/…), which adds tens of minutes and clutters results/runs/.
- The image tag in the YAML must exist locally because the instance sets pull_policy: never. Benchmarkoor will refuse to pull, so a missing local image fails immediately rather than silently downloading a stale upstream one.
- results/runs/index.json is regenerated by generate-index-file; don't hand-edit. After a successful run it's auto-generated when generate_results_index: true in the config. If it's stale or missing (e.g. an aborted previous run, or comparing against an older directory the index doesn't list), regenerate explicitly: sudo -n ./benchmarkoor generate-index-file --config "$CONFIG".
- A run dir is keyed by suite-hash (under results/suites/). Two runs with the same filter regex have the same suite hash, so comparing them is direct. A different filter → different suite hash → different test set, and shortened-name matching is the only sane cross-suite comparison; flag this to the user.
- For PR-style before/after testing: stash the change, rebuild image (Step 1), reset working dir (Step 2), run baseline; unstash, rebuild image, reset, run again. Compare the two newest run dirs. Don't compare a fresh run against an old result captured before unrelated config changes; that introduces too many variables.
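Picking "the two newest run dirs" for one instance can be sketched from the <unix_ts>_<short_hash>_<instance-id> naming convention. The function name and the directory names in the demo are made up; the sketch also assumes the short hash contains no underscore, which is worth confirming against a real results/runs/ listing.

```python
import re

# <unix_ts>_<short_hash>_<instance-id>; the instance id may itself contain underscores.
RUN_DIR = re.compile(r'^(\d+)_([^_]+)_(.+)$')

def newest_runs(dirnames, instance_id, n=2):
    """Return the n most recent run dir names for instance_id, oldest first."""
    runs = []
    for name in dirnames:
        m = RUN_DIR.match(name)
        if m and m.group(3) == instance_id:
            runs.append((int(m.group(1)), name))
    runs.sort()  # ascending by timestamp parsed from the name
    return [name for _, name in runs[-n:]]

# Made-up listing, as os.listdir() on results/runs/ might return it.
listing = [
    'index.json',
    '1700000300_cccc3333_erigon-bal-full',
    '1700000100_aaaa1111_erigon-bal-full',
    '1700000200_bbbb2222_some-other-instance',
    '1700000250_dddd4444_erigon-bal-full',
]
before_id, after_id = newest_runs(listing, 'erigon-bal-full')
print(before_id, after_id)
```

The returned pair maps onto before_id and after_id in the Step 6 script, provided the baseline really was the run made immediately before the changed one.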