| name | benchmark |
| description | Benchmark runtime performance on hardware. If the current branch has commits ahead of upstream/main or uncommitted changes, compares against the fork point (merge-base). Otherwise benchmarks current state only. Use when the user asks to benchmark, measure performance, or compare latency. |
Benchmark Workflow
Benchmark runtime performance on Ascend hardware. Automatically detects whether to run a single benchmark or a comparison.
Modes
| Condition | Mode | What happens |
|---|
| 0 commits ahead AND no uncommitted changes | Single | Benchmark current state, report Elapsed + Orch times |
| >= 1 commits ahead OR uncommitted changes | Compare | Benchmark merge-base (worktree) AND current workspace, show comparison table |
Input
Optional benchmark arguments forwarded to tools/benchmark_rounds.sh:
/benchmark
/benchmark -d 4 -n 50
/benchmark -d 4 -d 6
Extra arguments (-n, -r, etc.) are forwarded to tools/benchmark_rounds.sh.
Device arguments (-d)
The -d flag specifies NPU device IDs.
Hard rule: one benchmark process per device at any time. Never run two benchmark processes on the same -d device concurrently — not two runtimes, not baseline + current, nothing. This prevents resource contention and ensures stable measurements.
-d count | Compare mode behavior |
|---|
One device (-d 4) | Sequential: baseline first, then current, both on the same device. Multiple runtimes also run serially on that device. |
Two devices (-d 4 -d 6) | Parallel per-runtime: for each runtime, baseline on first device and current on second device can run in parallel (different devices). Multiple runtimes still run serially — finish one runtime on both devices before starting the next. |
| Zero (not specified) | Auto-detect idle devices (see Step 2) |
Defaults (when not specified): use benchmark_rounds.sh defaults (device 0, 100 rounds, a2a3, tensormap_and_ringbuffer).
Runtime Selection
tools/benchmark_rounds.sh supports -r <runtime>:
tensormap_and_ringbuffer (default)
The example list is defined at the top of the script (TMR_EXAMPLE_CASES).
Step 1: Detect Mode
git fetch upstream main --quiet
COMMITS_AHEAD=$(git rev-list HEAD --not upstream/main --count 2>/dev/null || echo "0")
HAS_CHANGES=$(git status --porcelain)
if [ "$COMMITS_AHEAD" -eq 0 ] && [ -z "$HAS_CHANGES" ]; then
MODE="single"
else
MODE="compare"
MERGE_BASE=$(git merge-base upstream/main HEAD)
fi
Step 2: Device Detection
If -d was provided in args, skip detection and use the user-specified device(s).
Otherwise, detect idle NPU devices (HBM-Usage = 0):
npu-smi info
Pick devices with HBM-Usage = 0. Find the longest consecutive sub-range (at most 4). If no idle device is found, prompt user to specify a device ID.
Step 3: Pin PTO-ISA
Extract pinned commit from .github/workflows/ci.yml:
PTO_ISA_COMMIT=$(grep -oP '(?<=--pto-isa-commit )\w+' .github/workflows/ci.yml | head -1)
Append -c $PTO_ISA_COMMIT to benchmark args so the underlying python test_*.py invocation picks it up.
Step 4: Prepare — Compute Absolute Paths
The Bash tool resets its working directory to the project root on every call. Relative paths like cd worktree && ... are fragile and easy to forget. Compute absolute paths once, then use them everywhere.
PROJECT_ROOT="$(pwd)"
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
WORKTREE_ABS="${PROJECT_ROOT}/tmp/worktree_baseline_${TIMESTAMP}"
mkdir -p "${PROJECT_ROOT}/tmp"
Store PROJECT_ROOT and WORKTREE_ABS as shell variables in every Bash call that needs them (the Bash tool does not persist variables across calls). Use this pattern:
WORKTREE_ABS="/home/user/simpler/tmp/worktree_baseline_20260331_102302"
"${WORKTREE_ABS}/tools/benchmark_rounds.sh" -d 2 ...
Do NOT use cd + relative ./tools/... — this is the #1 source of silent errors (running the wrong workspace).
Step 5: Run Benchmarks
Single Mode
./tools/benchmark_rounds.sh $BENCH_ARGS -r "$RUNTIME" 2>&1 | tee "tmp/benchmark_${TIMESTAMP}.txt"
Compare Mode
Use a git worktree for the baseline so the current workspace is never disturbed.
CRITICAL: Worktree needs its own build environment
The worktree is a fresh checkout at merge-base — it has no pre-built runtime binaries and no compiled nanobind extension. Two things must be built:
- Runtime
.so binaries (build/lib/) — loaded via ctypes by bindings.py
- Nanobind
_task_interface extension — compiled C++ Python bindings
Pure Python files under simpler_setup/ (e.g. scene_test.py, kernel_compiler.py) are resolved via sys.path from the worktree when an editable install is active there, so they correctly come from the worktree. But _task_interface.*.so is installed into site-packages by pip install -e . and is shared system-wide. Without isolation, the worktree would use the main workspace's nanobind extension — which may have incompatible API changes.
Solution: always create a venv in the worktree (~26s overhead). This builds both the nanobind extension AND runtime binaries, fully isolating the baseline.
5a. Create worktree, venv, and build
Inline the absolute worktree path (copy-paste the value, do not rely on shell variables persisting):
git worktree add "$WORKTREE_ABS" "$MERGE_BASE" --quiet
python3 -m venv "${WORKTREE_ABS}/.venv" --system-site-packages
"${WORKTREE_ABS}/.venv/bin/pip" install -e "${WORKTREE_ABS}" -q 2>&1 | tail -3
This gives the worktree its own _task_interface.*.so in .venv/lib/python3.*/site-packages/, completely independent from the main workspace.
5b. Run baseline
Activate the venv so benchmark_rounds.sh (which calls python3) picks up the worktree's nanobind extension and Python bindings:
cd "$WORKTREE_ABS" && source .venv/bin/activate && pwd && ./tools/benchmark_rounds.sh -d $BASELINE_DEVICE -c $PTO_ISA_COMMIT -r "$RUNTIME" \
2>&1 | tee "${PROJECT_ROOT}/tmp/benchmark_baseline_${TIMESTAMP}_${RUNTIME}.txt"
Always include pwd && after cd to verify you are in the correct directory. If pwd does not print the worktree path, something went wrong — do not proceed.
5c. Run current
./tools/benchmark_rounds.sh -d $CURRENT_DEVICE -c $PTO_ISA_COMMIT -r "$RUNTIME" \
2>&1 | tee "tmp/benchmark_current_${TIMESTAMP}_${RUNTIME}.txt"
5d. Cleanup
git worktree remove "$WORKTREE_ABS" --force
If git worktree remove fails (e.g., cwd was inside the deleted worktree), use:
git -C "$PROJECT_ROOT" worktree remove "$WORKTREE_ABS" --force
Parallel execution (two devices)
When two devices are available, run baseline and current for the same runtime in parallel on separate devices. The venv ensures the worktree has its own nanobind extension, so both workspaces are fully independent.
for RUNTIME in "${RUNTIMES_TO_BENCH[@]}"; do
(cd "$WORKTREE_ABS" && source .venv/bin/activate && pwd && ./tools/benchmark_rounds.sh -d $DEVICE_BASELINE -r "$RUNTIME" ...) &
./tools/benchmark_rounds.sh -d $DEVICE_CURRENT -r "$RUNTIME" ... &
wait
done
Never launch the next runtime until the current one finishes on all devices.
Sequential execution (one device)
cd "$WORKTREE_ABS" && source .venv/bin/activate && pwd && ./tools/benchmark_rounds.sh -d $DEVICE -c $PTO_ISA_COMMIT -r "$RUNTIME" \
2>&1 | tee "${PROJECT_ROOT}/tmp/benchmark_baseline_${TIMESTAMP}_${RUNTIME}.txt"
./tools/benchmark_rounds.sh -d $DEVICE -c $PTO_ISA_COMMIT -r "$RUNTIME" \
2>&1 | tee "tmp/benchmark_current_${TIMESTAMP}_${RUNTIME}.txt"
git -C "$PROJECT_ROOT" worktree remove "$WORKTREE_ABS" --force
Step 6: Report Results
Parse all five <Metric> Trimmed Avg: lines per example (Host, Device, Total, Sched, Orch) from benchmark output.
| Metric | Source | What it captures |
|---|
| Host | RunTiming.host_wall_us | steady_clock around dispatch (Python overhead included) |
| Device | RunTiming.device_wall_us | AICPU mailbox orch_start → orch_end |
| Total | device log | full span across sched_* / orch_* events |
| Sched | device log | sched_start → sched_end |
| Orch | device log | orch_start → orch_end |
Single Mode
Benchmark at: <short SHA>
Args: -d 4 -n 100
Example Host (us) Device (us) Total (us) Sched (us) Orch (us)
------------------------------- --------- ----------- ---------- ---------- ---------
alternating_matmul_add 480000.0 9050.0 1235.5 1235.4 820.3
benchmark_bgemm 370000.0 7100.0 892.1 892.0 650.2
...
Compare Mode
Show comparison table per metric (one row per metric per example), grouped by runtime. Total is the headline metric used in the overall summary; the other four are sub-rows for context:
Merge-base: <short SHA> → HEAD: <short SHA> (+ uncommitted)
Args: -d 4 -n 100
Device: baseline=4, current=4 (or baseline=4, current=6)
### tensormap_and_ringbuffer
Example Base (us) HEAD (us) Delta (us) Change (%)
--------------------------- --------- --------- ---------- ----------
alternating_matmul_add 1240.1 1235.5 -4.6 -0.37%
(host) 480000.0 470000.0 -10000.0 -2.08%
(device) 9000.0 8800.0 -200.0 -2.22%
(sched) 1240.0 1235.4 -4.6 -0.37%
(orch) 830.0 820.3 -9.7 -1.17%
benchmark_bgemm 890.3 892.1 +1.8 +0.20%
(host) 370000.0 370500.0 +500.0 +0.14%
(device) 7100.0 7080.0 -20.0 -0.28%
(sched) 890.2 892.0 +1.8 +0.20%
(orch) 650.0 650.2 +0.2 +0.03%
...
Overall: X of Y examples improved, Z regressed (based on Total)
If baseline and current ran on different devices, add a note:
Note: Baseline and current ran on different NPU devices (4 vs 6). Results within ±2% may reflect device-to-device variance rather than code changes. For definitive comparison, re-run on the same device with /benchmark -d <single_device>.
Interpretation:
| Change (%) | Assessment |
|---|
| < -2% | Notable improvement |
| -2% to +2% | Within noise margin |
| > +2% | Potential regression — flag for review |
If any example shows > 5% regression, highlight it explicitly.
Error Handling
| Error | Action |
|---|
No idle device and no -d specified | Prompt user to specify device ID |
| Benchmark script fails | Report which examples failed; continue with remaining |
| No timing data | Warn: "No timing markers — ensure PTO2_PROFILING is enabled" |
| All examples fail | Check: did you run pip install -e . in the worktree venv? |
| Worktree creation fails | Fall back to stash/checkout approach or report error |
Pre-built runtime binaries not found | The venv pip install -e . should have built these; re-run it |
ModuleNotFoundError: _task_interface | Venv not activated; add source .venv/bin/activate && before the command |
Checklist