| name | gas-benchmark |
| description | Build a diag Docker image, run gas-benchmarks repricing workflow, and analyze results including dotTrace XML reports. Use when asked to "run benchmarks", "trigger gas benchmarks", "benchmark this branch", "profile block processing", or "analyze benchmark run". Supports analyze-only mode for CI integration. |
| allowed-tools | ["Bash(gh run *)","Bash(gh workflow run *)","Bash(gh release *)","Bash(gh api repos/NethermindEth/*)","Bash(git branch *)","Bash(git log *)","Bash(git status *)","Bash(cd *)","Bash(ls *)","Bash(mkdir *)","Bash(find *)","Bash(unzip *)","Bash(cat *)","Bash(tar *)","Bash(wc *)","Bash(sleep *)","Bash(until *)","Bash(date *)","Bash(bash *)","Bash(sed *)","Bash(grep *)","Bash(awk *)","Bash(sort *)","Bash(head *)","Bash(tail *)","Read","Grep","Glob"] |
| argument-hint | [--branch NAME] [--image NAME] [--filter PATTERN] [--network NETWORK] [--fork FORK] [--dottrace] [--analyze-run RUN_ID] [--compare RUN_ID] |
Gas Benchmark Pipeline
End-to-end pipeline: build diag Docker image, trigger gas-benchmarks repricing workflow, wait for completion, and analyze results (logs, timings, dotTrace XML).
Interactive mode (no arguments)
When called without arguments (/gas-benchmark), do NOT proceed with defaults. Instead, interactively gather the required information:
- Show available releases and ask the user to pick one:
gh api repos/NethermindEth/gas-benchmarks/releases?per_page=15 \
--jq '.[] | "- `" + .tag_name + "` " + (if .draft then "(draft)" else "" end) + " — " + .name'
Ask: "Which release should I use for test data?"
- Ask for the image: "Which Nethermind Docker image? (e.g., nethermindeth/nethermind:bal-devnet-6) Or should I build one from a branch?"
- Ask for the network: "Which network? (perf-devnet-3, jochemnet, mainnet)"
- Ask for the filter, helping the user discover available tests first:
a. After the release and network are known, list available test categories:
gh release download <tag> --repo NethermindEth/gas-benchmarks \
--pattern "generated-tests-stateful-<network>.tar.gz" -D /tmp/gb-tests --clobber
tar tzf /tmp/gb-tests/generated-tests-stateful-<network>.tar.gz \
| sed 's|.*/||' | sed 's/\.txt$//' | sed 's/\[.*//' | sort -u | grep -v "^$\|funding\|gas-bump"
b. Show the user a categorized list of available tests. Example output:
Available test categories for perf-devnet-3:
- test_account_access (CALL, STATICCALL, BALANCE, EXTCODE... variants)
- test_sload_bloated (large-state SLOAD scenarios)
- test_sstore_bloated (large-state SSTORE scenarios)
- test_storage_sload_same_key_benchmark
c. Ask: "Which tests do you want to run? You can:
- Pick a category name (e.g., sstore_bloated)
- Describe what you're interested in (e.g., 'storage write scenarios with existing slots')
- Leave empty to run all tests"
d. If the user gives a natural-language description, map it to the right filter pattern by inspecting the test parameter names in the archive (e.g., existing_slots_True, write_new_value_False, CacheStrategy.NO_CACHE).
e. To show the user the full parameter space for a category:
tar tzf /tmp/gb-tests/generated-tests-stateful-<network>.tar.gz \
| grep "setup/.*<category>" | sed 's|.*/setup/||; s/\.txt$//' | head -20
- Ask about dotTrace: "Do you want dotTrace profiling? (requires building a diag image, adds ~2min to build)"
Then proceed with the resolved values.
When called WITH arguments, parse them and proceed directly — only ask if something essential is missing or ambiguous. If --filter contains a natural-language description (not a test name pattern), resolve it using the test discovery steps above.
Argument parsing
Parse $ARGUMENTS for these flags:
| Flag | Default | Description |
|---|---|---|
| --branch | current git branch | Nethermind branch to build the Docker image from |
| --image | (built from branch) | Skip Docker build; use this pre-built image directly |
| --filter | (none) | Test filter pattern passed to repricing workflow |
| --network | perf-devnet-3 | Network name (perf-devnet-3, jochemnet, mainnet) |
| --fork | amsterdam | Fork name (amsterdam, osaka) |
| --dottrace | (ask user) | Enable dotTrace profiling: builds diag image, passes diagnostics flags |
| --gas-size | 100M | Gas size filter (appended as benchmark_<size> to filter) |
| --no-restart | (false) | Disable restart-before-testing for stateful tests (restart is on by default) |
| --release | (discovered) | Override release tag; skips interactive selection |
| --gas-benchmarks-ref | (discovered) | Override gas-benchmarks branch; skips discovery |
| --analyze-run | (none) | Skip Phases 0–3; go straight to Phase 4 analysis on an existing run. Value is a run ID or URL (e.g., --analyze-run 25725558942 or --analyze-run https://github.com/NethermindEth/gas-benchmarks/actions/runs/25725558942) |
| --compare | (none) | Compare two runs. Value is a second run ID or URL. Requires --analyze-run for the first run. Downloads dotTrace XMLs from both and runs compare. |
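A minimal sketch of how the flag table above could be parsed in shell. The function name parse_args and the variable names (BRANCH, GAS_SIZE, and so on) are hypothetical; only the flags and defaults come from the table. An empty BRANCH is meant to signal "use the current git branch".

```shell
#!/usr/bin/env bash
# Sketch only: parse_args and its output variables are illustrative names.
parse_args() {
  BRANCH="" IMAGE="" FILTER="" NETWORK="perf-devnet-3" FORK="amsterdam"
  DOTTRACE="ask" GAS_SIZE="100M" RESTART="true"
  RELEASE="" GB_REF="" ANALYZE_RUN="" COMPARE=""
  while [ "$#" -gt 0 ]; do
    case "$1" in
      --dottrace)   DOTTRACE="yes"; shift ;;
      --no-restart) RESTART="false"; shift ;;
      --branch|--image|--filter|--network|--fork|--gas-size|--release|--gas-benchmarks-ref|--analyze-run|--compare)
        # Every flag in this group takes exactly one value.
        [ "$#" -ge 2 ] || { echo "missing value for $1" >&2; return 1; }
        case "$1" in
          --branch)             BRANCH="$2" ;;
          --image)              IMAGE="$2" ;;
          --filter)             FILTER="$2" ;;
          --network)            NETWORK="$2" ;;
          --fork)               FORK="$2" ;;
          --gas-size)           GAS_SIZE="$2" ;;
          --release)            RELEASE="$2" ;;
          --gas-benchmarks-ref) GB_REF="$2" ;;
          --analyze-run)        ANALYZE_RUN="$2" ;;
          --compare)            COMPARE="$2" ;;
        esac
        shift 2 ;;
      *) echo "unknown argument: $1" >&2; return 1 ;;
    esac
  done
  # --compare only makes sense together with --analyze-run.
  if [ -n "$COMPARE" ] && [ -z "$ANALYZE_RUN" ]; then
    echo "--compare requires --analyze-run" >&2
    return 1
  fi
}
```

For example, `parse_args --filter sstore_bloated --gas-size 200M` leaves NETWORK and FORK at their defaults and sets FILTER and GAS_SIZE.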
Analyze-only mode (--analyze-run)
When --analyze-run <RUN_ID> is provided, skip Phases 0–3 entirely and jump to Phase 4 analysis. This is the primary mode for CI integration — a CI pipeline triggers the workflow itself and then invokes /gas-benchmark --analyze-run <RUN_ID> to get the analysis.
- Extract run ID from the value (strip URL prefix if given).
- Fetch run metadata:
gh run view <run-id> --repo NethermindEth/gas-benchmarks --json status,conclusion,jobs
- If the run is still in progress, poll until complete (same as Phase 3).
- Proceed to Phase 4 with the run ID.
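The run ID extraction in the first step can be sketched as below. The helper name extract_run_id is hypothetical; it accepts either a bare numeric ID or a GitHub Actions run URL and rejects anything else.

```shell
#!/usr/bin/env bash
# Sketch only: extract_run_id is an illustrative helper name.
extract_run_id() {
  local value="$1"
  value="${value%%\?*}"   # drop any query string
  value="${value%/}"      # drop a trailing slash
  case "$value" in
    */runs/*)
      value="${value##*/runs/}"  # keep what follows the last /runs/
      value="${value%%/*}"       # drop suffixes such as /job/<id>
      ;;
  esac
  case "$value" in
    ''|*[!0-9]*) echo "invalid run id: $1" >&2; return 1 ;;
  esac
  printf '%s\n' "$value"
}
```

Both `extract_run_id 25725558942` and `extract_run_id https://github.com/NethermindEth/gas-benchmarks/actions/runs/25725558942` print the same numeric ID.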
When --compare <RUN_ID_B> is also provided:
- Analyze both runs independently (Phase 4a–4c for each).
- Download dotTrace XMLs from both runs.
- Run bash scripts/dottrace-report.sh compare <a.xml> <b.xml> 20 for hotspot comparison.
- In the final report, show side-by-side timing tables and delta percentages.
Phase 0 — Discover gas-benchmarks branch and release
Step 0a: Resolve the release
If --release was provided, use it. Otherwise (in non-interactive mode), discover the latest release for the fork:
gh api repos/NethermindEth/gas-benchmarks/releases?per_page=15 \
--jq '[.[] | select(.tag_name | startswith("<fork>"))] | first | .tag_name'
Verify the release has data for the requested network:
gh release view <tag> --repo NethermindEth/gas-benchmarks --json assets --jq '.assets[].name'
Look for generated-tests-stateful-<network>.tar.gz.
Step 0b: Find the gas-benchmarks branch
If --gas-benchmarks-ref was provided, use it. Otherwise, extract from the release notes:
- The release body contains a **Branch:** field that records which gas-benchmarks branch generated the test data. Parse it:
gh release view <tag> --repo NethermindEth/gas-benchmarks --json body --jq '.body' \
| grep -oP '(?<=\*\*Branch:\*\* ).*' | tr -d '`' | xargs
On Windows/Git Bash where grep -P may not work:
gh release view <tag> --repo NethermindEth/gas-benchmarks --json body --jq '.body' \
| grep "Branch:" | sed 's/.*Branch:\*\* *//; s/`//g' | xargs
- If the release notes don't contain a branch field, fall back to listing branches:
gh api repos/NethermindEth/gas-benchmarks/branches?per_page=100 \
--jq '.[].name' | grep -E "devnets/bal|stateful-generator"
Ask the user which branch to use.
- Verify the workflow exists on the chosen branch:
gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> --jq '.name' 2>/dev/null
Step 0c: Discover workflow inputs
Read the workflow YAML on the chosen branch to learn which inputs it supports:
gh api repos/NethermindEth/gas-benchmarks/contents/.github/workflows/repricing-nethermind.yml?ref=<branch> \
--jq '.content' | base64 -d
Note which of these inputs exist: release_tag, genesis_file, runner, diagnostics_mode, diagnostics_xml. Only pass flags the workflow declares.
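One rough way to check whether the workflow declares a given input is a line-based match on the decoded YAML. This is a sketch under the assumption that each input appears as a bare `name:` key on its own indented line; it does not parse YAML and would also match an identically named key elsewhere in the file. The helper name workflow_has_input is hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: naive key lookup, not a YAML parser.
workflow_has_input() {
  # $1 = input name; the workflow YAML is read from stdin.
  grep -qE "^[[:space:]]+$1:[[:space:]]*$"
}
```

With the decoded workflow saved to a file, `workflow_has_input diagnostics_mode < workflow.yml` succeeds only when that key is present, which can gate whether the corresponding -f flag is passed.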
Step 0d: Determine genesis file
Map network to genesis filename:
- perf-devnet-3 → generator-amsterdam-perf-devnet-3.json
- jochemnet → generator-amsterdam-jochemnet.json
- mainnet → (no genesis_file flag)
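The mapping above can be written as a small lookup. The function name genesis_file_for is hypothetical; an empty result means the genesis_file flag should be omitted.

```shell
#!/usr/bin/env bash
# Sketch only: genesis_file_for is an illustrative helper name.
genesis_file_for() {
  case "$1" in
    perf-devnet-3) echo "generator-amsterdam-perf-devnet-3.json" ;;
    jochemnet)     echo "generator-amsterdam-jochemnet.json" ;;
    mainnet)       echo "" ;;  # no genesis_file flag for mainnet
    *) echo "unknown network: $1" >&2; return 1 ;;
  esac
}
```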
Step 0e: Confirm with user
Before proceeding, show the resolved configuration:
Release: <tag>
Gas-benchmarks: <branch>
Network: <network>
Image: <image or "will build from <branch>">
Filter: <filter or "none (all tests)">
Gas size: <100M (default) or user-specified>
Restart on test: <yes (default for stateful) / no>
dotTrace: <yes/no>
Ask: "Proceed?"
Phase 1 — Docker image
Skip if --image is provided.
- Determine the Nethermind branch (from --branch or git branch --show-current).
- Determine Dockerfile based on dotTrace:
  - dotTrace enabled → Dockerfile.diag, tag suffix -diag
  - dotTrace disabled → regular Dockerfile, no suffix
- Compute tag: sanitize the branch name (replace / with -) then append suffix:
TAG=$(echo "<branch-name>" | tr '/' '-')
Final tag: <TAG>-diag (if diag) or <TAG> (if regular).
- Capture timestamp, then trigger the docker build:
BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
MSYS_NO_PATHCONV=1 gh workflow run publish-docker.yml \
--ref <branch> \
-f image-name=nethermind \
-f tag=<tag> \
-f dockerfile=<dockerfile> \
-f build-config=release
- Wait ~10s, then find the run ID using the timestamp to avoid race conditions:
gh run list --workflow=publish-docker.yml --limit 5 --json databaseId,createdAt \
--jq '[.[] | select(.createdAt > "<BEFORE>")] | first | .databaseId'
- Poll until complete:
gh run view <run-id> --json status,conclusion
- If build fails, fetch logs and report the error. Stop.
- Final image: nethermindeth/nethermind:<tag>
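The Dockerfile and tag selection above can be sketched as two small helpers. The names image_tag and dockerfile_for are hypothetical, as is the yes/no convention for the dotTrace argument.

```shell
#!/usr/bin/env bash
# Sketch only: helper names and the yes/no argument are illustrative.
image_tag() {
  local branch="$1" dottrace="${2:-no}" tag
  tag=$(printf '%s' "$branch" | tr '/' '-')   # sanitize: / becomes -
  if [ "$dottrace" = "yes" ]; then
    tag="${tag}-diag"
  fi
  printf '%s\n' "$tag"
}

dockerfile_for() {
  if [ "${1:-no}" = "yes" ]; then
    echo "Dockerfile.diag"
  else
    echo "Dockerfile"
  fi
}
```

For a branch named feature/bal-opt with dotTrace enabled, this yields the tag feature-bal-opt-diag and Dockerfile.diag.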
Phase 2 — Trigger repricing workflow
Capture timestamp before triggering: BEFORE=$(date -u +%Y-%m-%dT%H:%M:%SZ)
Build the workflow trigger using only the inputs the workflow accepts (from Step 0c).
Gas size filtering: The filter input in run.sh supports AND logic using the and keyword. Comma-separated patterns use OR (any match), but within each entry " and " requires ALL parts to match.
Always append and benchmark_<gas-size> to the user's filter to restrict to a single gas size. Default gas size is 100M (override with --gas-size). Examples:
- User filter bloated → effective filter sent to workflow: bloated and benchmark_100M
- User filter sstore_bloated → effective filter: sstore_bloated and benchmark_100M
- No user filter → effective filter: benchmark_100M
- User says "all gas sizes for bloated" → effective filter: bloated (no gas size appended)
- User passes --gas-size 200M with filter bloated → effective filter: bloated and benchmark_200M
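The examples above can be reproduced with a short helper. The name effective_filter is hypothetical, and so is the sentinel value "all" used here to express "all gas sizes". A comma-separated user filter is not handled by this sketch.

```shell
#!/usr/bin/env bash
# Sketch only: effective_filter and the "all" sentinel are illustrative.
effective_filter() {
  local user_filter="$1" gas_size="${2:-100M}"
  if [ "$gas_size" = "all" ]; then
    # Caller asked for every gas size: pass the filter through unchanged.
    printf '%s\n' "$user_filter"
  elif [ -z "$user_filter" ]; then
    printf 'benchmark_%s\n' "$gas_size"
  else
    printf '%s and benchmark_%s\n' "$user_filter" "$gas_size"
  fi
}
```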
Restart before testing: For stateful tests (repricings_stateful/), always pass restart_before_testing=true unless --no-restart was specified. This restarts the execution client container before each measured test for clean measurements.
MSYS_NO_PATHCONV=1 gh workflow run repricing-nethermind.yml \
--repo NethermindEth/gas-benchmarks \
--ref <gas-benchmarks-ref> \
-f test="repricings_stateful/<network>" \
-f fork="<fork>" \
-f release_tag="<release>" \
-f genesis_file="<genesis-file>" \
-f filter="<effective-filter-with-gas-size>" \
-f 'runner=["stateful-generator"]' \
-f 'images={"nethermind":"<image>"}' \
-f restart_before_testing="true"
Only add diagnostics flags when --dottrace is set AND image is a diag build:
-f diagnostics_mode="dottrace" \
-f diagnostics_xml="true"
Only add restart_before_testing when the workflow supports it (check Step 0c) and the test is stateful. Omit if --no-restart was specified.
Critical: Do NOT pass diagnostics_mode=dottrace if the image was not built with Dockerfile.diag — the container will crash with exec: dottrace: not found.
Report the run URL to the user immediately after triggering.
Phase 3 — Wait for completion
- Find the run ID using the timestamp captured before triggering (same approach as Phase 1 step 5):
gh run list --repo NethermindEth/gas-benchmarks --workflow=repricing-nethermind.yml \
--limit 5 --json databaseId,createdAt \
--jq '[.[] | select(.createdAt > "<BEFORE>")] | first | .databaseId'
- Poll every 30 seconds: gh run view <run-id> --repo NethermindEth/gas-benchmarks --json status,conclusion
- Timeout after 2 hours (240 polls). If exceeded, report "timed out" and provide the run URL for manual inspection. Stop.
- Report to the user when the run completes with success or failure.
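The polling and timeout steps above might look like the following loop. The function name poll_run and its positional arguments are hypothetical; the defaults of 240 polls at 30-second intervals correspond to the 2-hour limit.

```shell
#!/usr/bin/env bash
# Sketch only: poll_run and its argument order are illustrative.
poll_run() {
  local run_id="$1" max_polls="${2:-240}" interval="${3:-30}" i=0 status
  while [ "$i" -lt "$max_polls" ]; do
    status=$(gh run view "$run_id" --repo NethermindEth/gas-benchmarks \
      --json status --jq '.status')
    if [ "$status" = "completed" ]; then
      return 0
    fi
    i=$((i + 1))
    sleep "$interval"
  done
  echo "timed out waiting for run $run_id" >&2
  return 1
}
```

After poll_run returns 0, the conclusion field can be read with a separate gh run view call to distinguish success from failure.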
Phase 4 — Analyze results
THIS PHASE IS MANDATORY. Always run it in full, even if the workflow reported success. Never skip or abbreviate it. A "success" workflow conclusion does NOT mean the blocks processed correctly — Nethermind exceptions can occur mid-run without failing the workflow.
4a. Exception scan (NEVER SKIP)
Fetch job logs: gh run view --job=<job-id> --repo NethermindEth/gas-benchmarks --log
Strip ANSI escape codes: sed 's/\x1b\[[0-9;]*m//g'
Scan for ALL of these patterns. Report every match with the full log line:
grep -iE "Exception|Invalid Block|InvalidBlock|Rejected invalid" | grep -v "node-exporter\|pip install\|apt-get\|npm warn\|orphan process\|docker-compose\|nuget\.org"
Note: do NOT exclude dotnet — real Nethermind exceptions contain .NET runtime frames.
Any match means the run has issues. Classify:
- HeaderGasUsedMismatch → gas schedule mismatch between image and test data (wrong branch/fork)
- InvalidBlockLevelAccessListHash → BAL pre-state corruption (code bug)
- InvalidBlockLevelAccessListException → address/slot not in BAL (missing BAL entries)
- Rejected invalid block ... reason: block is a part of an invalid chain → cascade from earlier failure
- Any other Exception → report verbatim
Always report the exception summary in the final report, even when there are zero exceptions. Write "Exceptions: none" explicitly.
Confirm shutdown: grep for Nethermind is shut down — if absent, the node crashed or was killed.
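The scan and classification steps in 4a can be combined into one filter that reads a job log on stdin. This is a sketch: the function name classify_exceptions and the bracketed labels are hypothetical, while the sed and grep expressions are the ones given above (they rely on GNU sed and grep).

```shell
#!/usr/bin/env bash
# Sketch only: classify_exceptions and its labels are illustrative.
classify_exceptions() {
  local matches line label
  matches=$(sed 's/\x1b\[[0-9;]*m//g' \
    | grep -iE "Exception|Invalid Block|InvalidBlock|Rejected invalid" \
    | grep -v "node-exporter\|pip install\|apt-get\|npm warn\|orphan process\|docker-compose\|nuget\.org" \
    || true)
  if [ -z "$matches" ]; then
    echo "Exceptions: none"
    return 0
  fi
  # Print every matching line in full, prefixed with its classification.
  printf '%s\n' "$matches" | while IFS= read -r line; do
    case "$line" in
      *HeaderGasUsedMismatch*)                label="gas schedule mismatch" ;;
      *InvalidBlockLevelAccessListHash*)      label="BAL pre-state corruption" ;;
      *InvalidBlockLevelAccessListException*) label="missing BAL entries" ;;
      *"Rejected invalid block"*"invalid chain"*) label="cascade from earlier failure" ;;
      *) label="unclassified" ;;
    esac
    printf '[%s] %s\n' "$label" "$line"
  done
  return 1
}
```

The non-zero return status when matches exist lets a caller treat the run as having issues even if the workflow conclusion was success.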
4b. Timing analysis — use results artifacts (NOT raw logs)
Always use the results artifacts for timing data. Do NOT parse Processed lines from raw logs — with restart-before-testing, block numbers repeat across test cycles, making log-based correlation unreliable.
Step 1 — Download results artifacts:
mkdir -p /tmp/gb-results
gh run download <run-id> --repo NethermindEth/gas-benchmarks \
-n "results-1-nethermind-<cleaned-test-path>" -D /tmp/gb-results
cd /tmp/gb-results && unzip -o *.zip
Step 2 — Extract per-test timings from result files:
Each test produces a nethermind_results_1_<test-name>.txt file containing engine_newPayloadV5 timing (the actual block processing time).
cd /tmp/gb-results/results
ls nethermind_results_1_*.txt | while IFS= read -r f; do
ms=$(grep -A3 "engine_newPayloadV5:" "$f" | grep "Average:" | awk '{print $2}')
name=$(echo "$f" | sed 's/nethermind_results_1_//;s/\.txt$//')
echo "$ms $name"
done | sort -rn
Step 3 — Compute aggregates:
ls nethermind_results_1_*.txt | while IFS= read -r f; do
ms=$(grep -A3 "engine_newPayloadV5:" "$f" | grep "Average:" | awk '{print $2}')
[ -n "$ms" ] && echo "$ms"
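done | sort -n | awk '{sum+=$1; vals[NR]=$1; n=NR} END {
# Input is pre-sorted by sort -n, so the gawk-only asort() is not needed.
if (n == 0) { print "COUNT:0 (no timings found)"; exit }
printf "COUNT:%d AVG:%.1f MEDIAN:%.1f P90:%.1f P95:%.1f MAX:%.1f\n",
n, sum/n, vals[int(n/2+0.5)], vals[int(n*0.9+0.5)], vals[int(n*0.95+0.5)], vals[n]
}'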
4b-compare. Comparing two runs (artifact-based)
When comparing two runs (e.g., PR vs baseline), download both results artifacts to separate directories, then compare per-test timings:
mkdir -p /tmp/gb-pr /tmp/gb-base
gh run download <pr-run-id> --repo NethermindEth/gas-benchmarks \
-n "results-1-nethermind-<cleaned-test-path>" -D /tmp/gb-pr
gh run download <base-run-id> --repo NethermindEth/gas-benchmarks \
-n "results-1-nethermind-<cleaned-test-path>" -D /tmp/gb-base
cd /tmp/gb-pr && unzip -o *.zip
cd /tmp/gb-base && unzip -o *.zip
cd /tmp/gb-pr/results
ls nethermind_results_1_*.txt | while IFS= read -r f; do
pr_ms=$(grep -A3 "engine_newPayloadV5:" "$f" | grep "Average:" | awk '{print $2}')
base_ms=$(grep -A3 "engine_newPayloadV5:" "/tmp/gb-base/results/$f" 2>/dev/null \
| grep "Average:" | awk '{print $2}')
if [ -n "$pr_ms" ] && [ -n "$base_ms" ]; then
short=$(echo "$f" | sed 's/nethermind_results_1_//;s/\.txt$//')
delta=$(awk "BEGIN{printf \"%.1f\", (($pr_ms-$base_ms)/$base_ms)*100}")
echo "$delta|$pr_ms|$base_ms|$short"
fi
done | sort -t'|' -k1 -n | while IFS='|' read -r d p b name; do
printf "%7s%% | PR: %9s ms | Base: %9s ms | %s\n" "$d" "$p" "$b" "$name"
done
Present results as a markdown table sorted by delta, then show aggregates (AVG, MEDIAN, P90, P95, MAX) for both runs with delta percentages.
4c. Block stats
Extract block operation counts:
grep -E "Block.*sload|Block.*sstore" <logs> | grep -v "sstore 10"
Report sload/sstore/create counts for the heaviest test blocks.
4d. Opcode tracing comparison (only when comparing two runs)
When the user asks to compare runs, download the opcode tracing from each release:
gh release download <tag> --repo NethermindEth/gas-benchmarks \
--pattern "opcodes_tracing-stateful-<network>.json" -D /tmp/tracing-<tag> --clobber
Parse the JSON and compare opcode counts for the specific test to confirm the workload is identical.
4e. dotTrace analysis (when --dottrace was enabled)
Always check if the dotTrace XML artifact exists for the run:
gh api repos/NethermindEth/gas-benchmarks/actions/runs/<run-id>/artifacts \
--jq '.artifacts[].name' | grep "dottrace-xml"
If present:
- Download the XML report:
gh run download <run-id> --repo NethermindEth/gas-benchmarks \
-n "repricing-nethermind-dottrace-xml-<run-id>" -D /tmp/dottrace-<run-id>
- Find the report file:
find /tmp/dottrace-<run-id> -name "*.xml" -not -name "*pattern*" -not -name "*conversion*"
- Top hotspots: show the top 20 functions by OwnTime (self-time excluding callees):
bash scripts/dottrace-report.sh top <report.xml> 20
Output columns: Function | OwnTime | TotalTime
- OwnTime = time spent in the function body itself (the hotspot indicator)
- TotalTime = time including all callees
- Sort by OwnTime descending — the top entries are where CPU time is actually spent
- Compare two runs (when baseline available):
bash scripts/dottrace-report.sh compare <baseline.xml> <new.xml> 20
Output shows REGRESSIONS (B slower) and IMPROVEMENTS (B faster) with:
[A] Own / [B] Own = OwnTime in each run
Delta = absolute change (positive = regression)
Change = percentage change
- Interpretation guide:
  - Functions with high OwnTime in Nethermind.State, Nethermind.Trie, Nethermind.Db.Rocks indicate storage/state bottlenecks
  - RocksDbSharp functions indicate disk I/O pressure
  - Nethermind.Evm.VirtualMachine functions indicate EVM execution overhead
  - System/GC functions (System.GC, JIT_New) indicate allocation pressure
  - Compare sstore/sload counts (from 4c) against OwnTime to distinguish I/O-bound vs compute-bound
- Never load full XML into context: files are 50-70MB. Always use scripts/dottrace-report.sh.
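The interpretation guide above can be applied mechanically to function names taken from the top-hotspots output. This sketch assumes each name is fully qualified and begins with its namespace, which may not hold for every entry in a real report; the helper name hotspot_category and the category labels are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: prefix matching on fully qualified names is an assumption.
hotspot_category() {
  case "$1" in
    Nethermind.State*|Nethermind.Trie*|Nethermind.Db.Rocks*) echo "storage/state" ;;
    RocksDbSharp*)                   echo "disk I/O" ;;
    Nethermind.Evm.VirtualMachine*)  echo "EVM execution" ;;
    System.GC*|JIT_New*)             echo "allocation pressure" ;;
    *)                               echo "other" ;;
  esac
}
```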
Phase 5 — Report
Always include the block phase breakdown first:
### Block Phases
| Phase | Block Range | Count | Description |
|-------|------------|-------|-------------|
| Gas bump | 24358001–24363001 | 5001 | Empty gas-limit ramp blocks |
| Setup | 24363002–24363003 | 180 | Pre-state preparation |
| Testing | 24363004–24363183 | 180 | Actual benchmark execution |
If testing block count is 0, display prominently:
⚠️ RELEASE DATA ISSUE: No testing blocks found for filter <filter>. The release <tag> may not contain testing payloads for this filter/network combination. All timings below are setup overhead only — not meaningful for benchmarking.
Then the summary table (timings ONLY from testing blocks):
| Metric | Value |
|--------|-------|
| Branch | ... |
| Image | ... |
| Gas-benchmarks ref | ... |
| Release | ... |
| Gas size | 100M |
| Restart on test | yes/no |
| Run URL | ... |
| Status | success/failure |
| Exceptions | none / list |
| Testing blocks | N |
| AVG processing | X ms |
| MEDIAN | X ms |
| P90 | X ms |
| P95 | X ms |
| MAX | X ms |
Then the top 10 heaviest test blocks table with scenario names.
If comparing against a baseline, include both timings and the delta/speedup percentage.
CI integration
The workflow .github/workflows/gas-benchmark-analysis.yml runs the full gas-benchmark pipeline in CI via Claude Code. It executes ALL phases (build → trigger → wait → analyze) and posts results as a PR comment.
Authorization: Only members of the NethermindEth/core GitHub team can trigger via PR comments (verified via API team membership check).
Trigger 1: PR comment (full run)
Comment on a PR to run the complete pipeline on the PR branch:
@claude-bench # full run, all tests, dotTrace enabled
@claude-bench --filter sstore_bloated # full run with test filter
@claude-bench --no-dottrace # full run without dotTrace
@claude-bench --image nethermindeth/nethermind:my-tag # skip build, use existing image
Trigger 1b: PR comment (analyze-only)
To analyze an already-completed run instead of starting a new one:
@claude-bench --analyze-run 25725558942
@claude-bench --analyze-run 25725558942 --compare 25700000000
Trigger 2: Manual dispatch
gh workflow run gas-benchmark-analysis.yml \
-f branch=my-feature-branch \
-f filter=sstore_bloated \
-f dottrace=true \
-f pr_number=12345
Trigger 3: Repository dispatch (from gas-benchmarks repo)
gh api repos/NethermindEth/nethermind/dispatches \
-f event_type=gas-benchmark-analysis \
-f 'client_payload={"branch":"my-branch","filter":"bloated","pr_number":"12345"}'
All modes run this skill with the appropriate flags and post results as a PR comment (if a PR number is available) or to the workflow step summary.
Filter reference
How to discover available tests
List test categories from a release archive:
gh release download <tag> --repo NethermindEth/gas-benchmarks \
--pattern "generated-tests-stateful-<network>.tar.gz" -D /tmp/gb-tests --clobber
tar tzf /tmp/gb-tests/generated-tests-stateful-<network>.tar.gz \
| sed 's|.*/||' | sed 's/\.txt$//' | sed 's/\[.*//' | sort -u | grep -v "^$\|funding\|gas-bump"
How to explore parameters for a test category
tar tzf /tmp/gb-tests/generated-tests-stateful-<network>.tar.gz \
| grep "setup/.*<category>" | sed 's|.*/setup/||; s/\.txt$//' | head -20
Filter patterns
The filter input is a substring match against the test fixture filenames. Examples:
- sstore_bloated: all sstore_bloated variants
- sload_bloated: all sload_bloated variants
- account_access: all account access tests
- existing_slots_True: only tests with pre-existing storage slots
- NO_CACHE: only tests with no caching strategy
- (empty): run all tests
Note: The gas-size constraint (benchmark_100M by default) is always appended automatically. You do not need to include it in the filter manually.
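To predict locally which fixture names a filter would select, the matching rules described in this document (substring match, comma for OR, " and " for AND within an entry) can be approximated as below. This is an illustration of those stated semantics, not the code in run.sh; the function name matches_filter is hypothetical, and whitespace around commas is not trimmed.

```shell
#!/usr/bin/env bash
# Sketch only: approximates the documented filter semantics.
matches_filter() {
  local name="$1" filter="$2" entry rest part ok
  [ -z "$filter" ] && return 0          # empty filter selects everything
  while [ -n "$filter" ]; do
    entry="${filter%%,*}"               # next comma-separated entry (OR)
    if [ "$entry" = "$filter" ]; then filter=""; else filter="${filter#*,}"; fi
    ok=1
    rest="$entry"
    while [ -n "$rest" ]; do
      part="${rest%% and *}"            # next " and "-separated part (AND)
      if [ "$part" = "$rest" ]; then rest=""; else rest="${rest#* and }"; fi
      case "$name" in
        *"$part"*) ;;                   # substring present
        *) ok=0 ;;
      esac
    done
    [ "$ok" -eq 1 ] && return 0
  done
  return 1
}
```

Piping the test names from the tar listing through a loop that calls matches_filter gives a preview of what a given filter would run.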