Run any Skill in Manus with one click

benchmark

Stars24

Forks62

UpdatedMay 21, 2026 at 02:55

Benchmark runtime performance on hardware. If the current branch has commits ahead of upstream/main or uncommitted changes, compares against the fork point (merge-base). Otherwise benchmarks current state only. Use when the user asks to benchmark, measure performance, or compare latency.

Installation

Install with Codex or Claude Copy this prompt, paste it into Codex, Claude, or another assistant, and let it review the skill page and install it for you.

Run Skill in Manus

Source

hw-native-sys

hw-native-sys/simpler

View GitHub Repository View Creator Repositories

Download

Run Skill in Manus

Related occupationsSOC

Based on SOC occupation classification

Software Quality Assurance Analysts and TestersComputer and Mathematical Occupations·SOC 15-1253

SKILL.md

readonly

Benchmark Workflow

Benchmark runtime performance on Ascend hardware. Automatically detects whether to run a single benchmark or a comparison.

Modes

Condition	Mode	What happens
0 commits ahead AND no uncommitted changes	Single	Benchmark current state, report Elapsed + Orch times
>= 1 commits ahead OR uncommitted changes	Compare	Benchmark merge-base (worktree) AND current workspace, show comparison table

Input

Optional benchmark arguments forwarded to tools/benchmark_rounds.sh:

/benchmark
/benchmark -d 4 -n 50
/benchmark -d 4 -d 6

Extra arguments (-n, -r, etc.) are forwarded to tools/benchmark_rounds.sh.

Device arguments (`-d`)

The -d flag specifies NPU device IDs.

Hard rule: one benchmark process per device at any time. Never run two benchmark processes on the same -d device concurrently — not two runtimes, not baseline + current, nothing. This prevents resource contention and ensures stable measurements.

`-d` count	Compare mode behavior
One device (`-d 4`)	Sequential: baseline first, then current, both on the same device. Multiple runtimes also run serially on that device.
Two devices (`-d 4 -d 6`)	Parallel per-runtime: for each runtime, baseline on first device and current on second device can run in parallel (different devices). Multiple runtimes still run serially — finish one runtime on both devices before starting the next.
Zero (not specified)	Auto-detect idle devices (see Step 2)

Defaults (when not specified): use benchmark_rounds.sh defaults (device 0, 100 rounds, a2a3, tensormap_and_ringbuffer).

Runtime Selection

tools/benchmark_rounds.sh supports -r <runtime>:

tensormap_and_ringbuffer (default)

The example list is defined at the top of the script (TMR_EXAMPLE_CASES).

Step 1: Detect Mode

git fetch upstream main --quiet
COMMITS_AHEAD=$(git rev-list HEAD --not upstream/main --count 2>/dev/null || echo "0")
HAS_CHANGES=$(git status --porcelain)

if [ "$COMMITS_AHEAD" -eq 0 ] && [ -z "$HAS_CHANGES" ]; then
  MODE="single"
else
  MODE="compare"
  MERGE_BASE=$(git merge-base upstream/main HEAD)
fi

Step 2: Device Detection

If -d was provided in args, skip detection and use the user-specified device(s).

Otherwise, detect idle NPU devices (HBM-Usage = 0):

npu-smi info

Pick devices with HBM-Usage = 0. Find the longest consecutive sub-range (at most 4). If no idle device is found, prompt user to specify a device ID.

Step 3: Pin PTO-ISA

Extract pinned commit from .github/workflows/ci.yml:

PTO_ISA_COMMIT=$(grep -oP '(?<=--pto-isa-commit )\w+' .github/workflows/ci.yml | head -1)

Append -c $PTO_ISA_COMMIT to benchmark args so the underlying python test_*.py invocation picks it up.

Step 4: Prepare — Compute Absolute Paths

The Bash tool resets its working directory to the project root on every call. Relative paths like cd worktree && ... are fragile and easy to forget. Compute absolute paths once, then use them everywhere.

PROJECT_ROOT="$(pwd)"                    # e.g. /home/user/simpler
TIMESTAMP=$(date +%Y%m%d_%H%M%S)
WORKTREE_ABS="${PROJECT_ROOT}/tmp/worktree_baseline_${TIMESTAMP}"
mkdir -p "${PROJECT_ROOT}/tmp"

Store PROJECT_ROOT and WORKTREE_ABS as shell variables in every Bash call that needs them (the Bash tool does not persist variables across calls). Use this pattern:

# Correct — self-contained, uses absolute path
WORKTREE_ABS="/home/user/simpler/tmp/worktree_baseline_20260331_102302"
"${WORKTREE_ABS}/tools/benchmark_rounds.sh" -d 2 ...

Do NOT use cd + relative ./tools/... — this is the #1 source of silent errors (running the wrong workspace).

Step 5: Run Benchmarks

Single Mode

./tools/benchmark_rounds.sh $BENCH_ARGS -r "$RUNTIME" 2>&1 | tee "tmp/benchmark_${TIMESTAMP}.txt"

Compare Mode

Use a git worktree for the baseline so the current workspace is never disturbed.

CRITICAL: Worktree needs its own build environment

The worktree is a fresh checkout at merge-base — it has no pre-built runtime binaries and no compiled nanobind extension. Two things must be built:

Runtime .so binaries (build/lib/) — loaded via ctypes by bindings.py
Nanobind _task_interface extension — compiled C++ Python bindings

Pure Python files under simpler_setup/ (e.g. scene_test.py, kernel_compiler.py) are resolved via sys.path from the worktree when an editable install is active there, so they correctly come from the worktree. But _task_interface.*.so is installed into site-packages by pip install -e . and is shared system-wide. Without isolation, the worktree would use the main workspace's nanobind extension — which may have incompatible API changes.

Solution: always create a venv in the worktree (~26s overhead). This builds both the nanobind extension AND runtime binaries, fully isolating the baseline.

5a. Create worktree, venv, and build

Inline the absolute worktree path (copy-paste the value, do not rely on shell variables persisting):

# Create worktree
git worktree add "$WORKTREE_ABS" "$MERGE_BASE" --quiet

# Create venv with system site-packages (for torch, numpy, etc.)
python3 -m venv "${WORKTREE_ABS}/.venv" --system-site-packages

# Install into venv — builds nanobind extension + runtime binaries
"${WORKTREE_ABS}/.venv/bin/pip" install -e "${WORKTREE_ABS}" -q 2>&1 | tail -3

This gives the worktree its own _task_interface.*.so in .venv/lib/python3.*/site-packages/, completely independent from the main workspace.

5b. Run baseline

Activate the venv so benchmark_rounds.sh (which calls python3) picks up the worktree's nanobind extension and Python bindings:

# WORKTREE_ABS must be the literal absolute path (e.g. /home/user/simpler/tmp/worktree_baseline_20260331)
cd "$WORKTREE_ABS" && source .venv/bin/activate && pwd && ./tools/benchmark_rounds.sh -d $BASELINE_DEVICE -c $PTO_ISA_COMMIT -r "$RUNTIME" \
  2>&1 | tee "${PROJECT_ROOT}/tmp/benchmark_baseline_${TIMESTAMP}_${RUNTIME}.txt"

Always include pwd && after cd to verify you are in the correct directory. If pwd does not print the worktree path, something went wrong — do not proceed.

5c. Run current

# Runs from the main workspace (Bash tool default cwd)
./tools/benchmark_rounds.sh -d $CURRENT_DEVICE -c $PTO_ISA_COMMIT -r "$RUNTIME" \
  2>&1 | tee "tmp/benchmark_current_${TIMESTAMP}_${RUNTIME}.txt"

5d. Cleanup

git worktree remove "$WORKTREE_ABS" --force

If git worktree remove fails (e.g., cwd was inside the deleted worktree), use:

git -C "$PROJECT_ROOT" worktree remove "$WORKTREE_ABS" --force

Parallel execution (two devices)

When two devices are available, run baseline and current for the same runtime in parallel on separate devices. The venv ensures the worktree has its own nanobind extension, so both workspaces are fully independent.

# For each runtime (serially):
for RUNTIME in "${RUNTIMES_TO_BENCH[@]}"; do
  # Baseline on device A (from worktree with venv), current on device B (from main) — parallel
  (cd "$WORKTREE_ABS" && source .venv/bin/activate && pwd && ./tools/benchmark_rounds.sh -d $DEVICE_BASELINE -r "$RUNTIME" ...) &
  ./tools/benchmark_rounds.sh -d $DEVICE_CURRENT -r "$RUNTIME" ... &
  wait  # Both finish before starting next runtime
done

Never launch the next runtime until the current one finishes on all devices.

Sequential execution (one device)

# 1. Worktree + venv already created in step 5a

# 2. For each runtime (serially — one device, one process at a time):
#    Baseline first (from worktree with venv activated)
cd "$WORKTREE_ABS" && source .venv/bin/activate && pwd && ./tools/benchmark_rounds.sh -d $DEVICE -c $PTO_ISA_COMMIT -r "$RUNTIME" \
  2>&1 | tee "${PROJECT_ROOT}/tmp/benchmark_baseline_${TIMESTAMP}_${RUNTIME}.txt"

#    Then current (from main workspace — default cwd, no venv)
./tools/benchmark_rounds.sh -d $DEVICE -c $PTO_ISA_COMMIT -r "$RUNTIME" \
  2>&1 | tee "tmp/benchmark_current_${TIMESTAMP}_${RUNTIME}.txt"

# 3. Cleanup
git -C "$PROJECT_ROOT" worktree remove "$WORKTREE_ABS" --force

Step 6: Report Results

Parse all five <Metric> Trimmed Avg: lines per example (Host, Device, Total, Sched, Orch) from benchmark output.

Metric	Source	What it captures
Host	`RunTiming.host_wall_us`	steady_clock around dispatch (Python overhead included)
Device	`RunTiming.device_wall_us`	AICPU mailbox `orch_start` → `orch_end`
Total	device log	full span across `sched_` / `orch_` events
Sched	device log	`sched_start` → `sched_end`
Orch	device log	`orch_start` → `orch_end`

Single Mode

Benchmark at: <short SHA>
Args: -d 4 -n 100

Example                          Host (us)   Device (us)   Total (us)   Sched (us)   Orch (us)
-------------------------------  ---------   -----------   ----------   ----------   ---------
alternating_matmul_add           480000.0        9050.0       1235.5       1235.4       820.3
benchmark_bgemm                  370000.0        7100.0        892.1        892.0       650.2
...

Compare Mode

Show comparison table per metric (one row per metric per example), grouped by runtime. Total is the headline metric used in the overall summary; the other four are sub-rows for context:

Merge-base: <short SHA>  →  HEAD: <short SHA> (+ uncommitted)
Args: -d 4 -n 100
Device: baseline=4, current=4  (or baseline=4, current=6)

### tensormap_and_ringbuffer

Example                      Base (us)   HEAD (us)   Delta (us)   Change (%)
---------------------------  ---------   ---------   ----------   ----------
alternating_matmul_add         1240.1      1235.5        -4.6       -0.37%
  (host)                     480000.0    470000.0    -10000.0       -2.08%
  (device)                     9000.0      8800.0       -200.0       -2.22%
  (sched)                      1240.0      1235.4        -4.6       -0.37%
  (orch)                        830.0       820.3        -9.7       -1.17%
benchmark_bgemm                 890.3       892.1        +1.8       +0.20%
  (host)                     370000.0    370500.0      +500.0       +0.14%
  (device)                     7100.0      7080.0       -20.0       -0.28%
  (sched)                       890.2       892.0        +1.8       +0.20%
  (orch)                        650.0       650.2        +0.2       +0.03%
...

Overall: X of Y examples improved, Z regressed   (based on Total)

If baseline and current ran on different devices, add a note:

Note: Baseline and current ran on different NPU devices (4 vs 6). Results within ±2% may reflect device-to-device variance rather than code changes. For definitive comparison, re-run on the same device with /benchmark -d <single_device>.

Interpretation:

Change (%)	Assessment
< -2%	Notable improvement
-2% to +2%	Within noise margin
> +2%	Potential regression — flag for review

If any example shows > 5% regression, highlight it explicitly.

Error Handling

Error	Action
No idle device and no `-d` specified	Prompt user to specify device ID
Benchmark script fails	Report which examples failed; continue with remaining
No timing data	Warn: "No timing markers — ensure `PTO2_PROFILING` is enabled"
All examples fail	Check: did you run `pip install -e .` in the worktree venv?
Worktree creation fails	Fall back to stash/checkout approach or report error
`Pre-built runtime binaries not found`	The venv `pip install -e .` should have built these; re-run it
`ModuleNotFoundError: _task_interface`	Venv not activated; add `source .venv/bin/activate &&` before the command

name	benchmark
description	Benchmark runtime performance on hardware. If the current branch has commits ahead of upstream/main or uncommitted changes, compares against the fork point (merge-base). Otherwise benchmarks current state only. Use when the user asks to benchmark, measure performance, or compare latency.

benchmark

More from this repository

Benchmark Workflow

Modes

Input

Device arguments (-d)

Runtime Selection

Step 1: Detect Mode

Step 2: Device Detection

Step 3: Pin PTO-ISA

Step 4: Prepare — Compute Absolute Paths

Step 5: Run Benchmarks

Single Mode

Compare Mode

CRITICAL: Worktree needs its own build environment

5a. Create worktree, venv, and build

5b. Run baseline

5c. Run current

5d. Cleanup

Parallel execution (two devices)

Sequential execution (one device)

Step 6: Report Results

Single Mode

Compare Mode

Error Handling

Checklist

Benchmark Workflow

Modes

Input

Device arguments (-d)

Runtime Selection

Step 1: Detect Mode

Step 2: Device Detection

Step 3: Pin PTO-ISA

Step 4: Prepare — Compute Absolute Paths

Step 5: Run Benchmarks

Single Mode

Compare Mode

CRITICAL: Worktree needs its own build environment

5a. Create worktree, venv, and build

5b. Run baseline

5c. Run current

5d. Cleanup

Parallel execution (two devices)

Sequential execution (one device)

Step 6: Report Results

Single Mode

Compare Mode

Error Handling

Checklist

More from this repository

Device arguments (`-d`)

Device arguments (`-d`)