| name | perf-topdown |
| description | Use when you need to classify why code is slow (front-end vs back-end vs speculation), when hunting branch misprediction sites, after /bench-compare or /perf-regression finds a regression needing root cause, or when building an isolated hot-loop harness. Cross-arch TMA and branch tracing. |
| user-invocable | true |
Top-Down Profiling — Classify, Trace, Fix
Structured workflow for diagnosing why code is slow using hardware performance
counters. Works on both x86-64 (Intel/AMD) and AArch64 (ARM Neoverse/Cortex).
The recipe: build harness → classify bottleneck → trace branches → apply fix.
When to Use
- Need to classify why code is slow (front-end vs back-end vs speculation vs retiring)
- Need branch-level traces to find misprediction sites
- Need an isolated hot-loop harness for stable profiling
- After
/bench-compare or /perf-regression finds a regression and you need root cause
- Want the structured classify→trace→fix workflow instead of ad-hoc counter exploration
When NOT to Use
- For deep ARM cache/TLB/lock drill-down →
/linux-perf-profile (Modes 3-7)
- For assembly-level codegen optimization →
/asm-forge
- For Criterion before/after measurement →
/bench-compare
- For analyzing existing profiling data →
/rust-perf-triage
Prerequisites — Build & Stability
Build flags (both architectures)
RUSTFLAGS='-C opt-level=3 -C target-cpu=native -C force-frame-pointers=yes -C debuginfo=2' \
cargo build --release --bin <target>
-C force-frame-pointers=yes — enables lightweight call-graph capture alongside LBR
-C debuginfo=2 — maps samples back to Rust source lines without affecting optimization
Stability setup (prioritized by impact)
Apply these before profiling. Each reduces measurement variance:
sudo cpupower frequency-set --governor performance
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
taskset -c 2 ./target/release/binary
setarch $(uname -m) -R ./target/release/binary
echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
Graviton note: AWS Graviton processors run at fixed frequency with no
turbo/boost control. Step 2 is unnecessary; steps 1, 3-6 still apply.
Mode 1: Harness Scaffold
Use when no existing Criterion benchmark covers the suspected hot path.
If a benchmark already exists, use /bench-compare instead.
Create src/bin/tiny_hot.rs:
use std::hint::black_box;
use std::time::{Duration, Instant};
use std::thread;
fn hot_loop(iter: usize) -> u64 {
let mut acc: u64 = 0;
for i in 0..iter {
acc = acc.wrapping_add((i as u64).rotate_left(13) ^ 0x9E3779B97F4A7C15);
}
acc
}
fn main() {
let start = Instant::now();
let target = Duration::from_secs(60);
let mut reps = 0u64;
let mut sink = 0u64;
while start.elapsed() < target {
sink ^= hot_loop(black_box(2_000_000));
reps += 1;
}
eprintln!("done reps={reps} sink={sink}");
thread::sleep(Duration::from_millis(10));
}
Build and run:
RUSTFLAGS='-C opt-level=3 -C target-cpu=native -C force-frame-pointers=yes -C debuginfo=2' \
cargo build --release --bin tiny_hot
taskset -c 2 ./target/release/tiny_hot
Mode 2: Quick CPI & Cache Sanity
Fast triage to determine if the workload is compute-bound, memory-bound, or
branch-bound. Run 5 repetitions for statistical stability.
x86-64:
sudo perf stat -r 5 -e cycles,instructions,branch-misses,cache-misses,LLC-loads,icache_misses \
taskset -c 2 ./target/release/tiny_hot
AArch64:
sudo perf stat -r 5 -e cpu_cycles,inst_retired,br_mis_pred_retired,l1d_cache_refill,l2d_cache_refill,l1i_cache_refill \
taskset -c 2 ./target/release/tiny_hot
Derived metrics and decision tree
| Metric | Formula | Healthy | Investigate |
|---|
| CPI | cycles / instructions | < 1.0 | > 1.0 |
| Branch miss rate | branch-misses / branches | < 2% | > 5% |
| L1d miss rate | l1d misses / l1d accesses | < 5% | > 10% |
| ICache miss rate | icache_misses / instructions | < 0.1% | > 1% |
Decision tree:
- CPI > 1.0 → proceed to Mode 3 (Top-Down Classification)
- High
branch-misses → suspect predictor/layout, proceed to Mode 4 (Branch Trace)
- High
icache_misses → suspect code size, see references/tma-diagnosis-actions.md FE-Bound section
- High
cache-misses or LLC-loads → memory/locality issue, see /linux-perf-profile Mode 3 (ARM) or BE-Bound remediation
Cross-ref rust-perf-triage/references/profiling-tools.md for detailed counter interpretation.
Mode 3: Top-Down Classification
Classify the bottleneck into four categories: Retiring, Bad Speculation,
Frontend Bound, Backend Bound. The commands differ by architecture and vendor.
x86-64 Intel
sudo perf stat --topdown -a -- taskset -c 2 ./target/release/tiny_hot
sudo perf stat --topdown --td-level 2 -- taskset -c 2 ./target/release/tiny_hot
The --topdown flag requires system-wide mode (-a) on pre-Ice Lake CPUs.
Ice Lake and later allow per-thread topdown collection.
Alternative (deeper drill-down via JSON metrics, kernel 6.1+):
perf stat -M TopdownL1 -- taskset -c 2 ./target/release/tiny_hot
perf stat -M tma_backend_bound_group -- taskset -c 2 ./target/release/tiny_hot
perf stat -M tma_core_bound_group -- taskset -c 2 ./target/release/tiny_hot
x86-64 AMD (Zen 4+, kernel 6.2+)
AMD does NOT support --topdown. Use -M PipelineL1 instead:
perf stat -M PipelineL1 -- taskset -c 2 ./target/release/tiny_hot
perf stat -M PipelineL2 -- taskset -c 2 ./target/release/tiny_hot
perf stat -M frontend_bound_group -- taskset -c 2 ./target/release/tiny_hot
AMD pipeline widths: Zen 4 = 6-wide, Zen 5 = 8-wide. Raw slot counts are
not comparable across Intel (4-wide) and AMD.
AArch64 (ARM Neoverse)
ARM has no --topdown flag. Use manual event-based topdown:
Neoverse N2/V2+ (slot-based topdown, kernel metric support):
perf stat -M frontend_bound,backend_bound,bad_speculation,retiring \
-- taskset -c 2 ./target/release/tiny_hot
perf stat -r 5 -e cpu_cycles,inst_retired,stall_slot_frontend,stall_slot_backend,stall_slot,op_retired,op_spec,br_mis_pred \
-- taskset -c 2 ./target/release/tiny_hot
Neoverse N1/V1 (cycle-based only, no slot events):
perf stat -r 5 -e cpu_cycles,inst_retired,stall_frontend,stall_backend,stall_backend_mem,br_mis_pred_retired \
-- taskset -c 2 ./target/release/tiny_hot
See linux-perf-profile Mode 1 for the full ARM topdown derived metrics table.
Unified interpretation
| Category | Threshold | Meaning | Next step |
|---|
| Retiring high | > 80% | Near peak efficiency | Mode 2 for µop reduction; see references/tma-diagnosis-actions.md |
| Bad Speculation high | > 15% | Branch mispredictions | Mode 4 for branch traces |
| Frontend Bound high | > 20% | ICache / decode stalls | references/tma-diagnosis-actions.md FE section |
| Backend Bound high | > 40% | Memory / execution ports | references/tma-diagnosis-actions.md BE section |
See references/tma-diagnosis-actions.md for the complete diagnosis-to-action mapping.
Mode 4: Branch Trace Recording & Decoding
After Mode 3 identifies Bad Speculation or you see high branch-misses in Mode 2,
record branch traces to find the exact misprediction sites.
x86-64 (Intel LBR, Skylake+ = 32 entries)
sudo perf record -o perf.data -c 100000 -b -e cycles:u \
-- taskset -c 2 ./target/release/tiny_hot
perf report --sort symbol_from,symbol_to,mispredict --stdio
perf script -F ip,sym,brstack | rustfilt | head -200
addr2line -e ./target/release/tiny_hot 0x<ADDRESS>
objdump -dr --no-show-raw-insn ./target/release/tiny_hot | rustfilt | less
perf annotate --symbol=<function_name> --stdio
Branch type filtering (narrow capture to specific branch types):
perf record -j cond,u ./binary
perf record -j any_call,any_ret,u ./binary
perf record -j ind_call,u ./binary
brstack output format: FROM/TO/M_or_P/INTX/ABORT/CYCLES/TYPE/SPEC
M = mispredicted, P = predicted
- CYCLES = elapsed cycles since previous recorded branch
x86-64 (AMD Zen 4+ LbrExtV2, kernel 6.1+)
Same perf commands as Intel LBR. AMD Zen 4 supports hardware branch filtering
and misprediction flags. Zen 3 BRS is limited (16 entries, no filtering, no
prediction info) — use Zen 4+ for serious LBR work.
AArch64 (ARM SPE branch sampling)
ARM SPE provides statistical branch sampling. For branch misprediction profiling:
sudo perf record -e arm_spe/branch_filter=1,event_filter=0x80/ \
-- taskset -c 2 ./target/release/tiny_hot
sudo perf record -e arm_spe/branch_filter=1/ \
-- taskset -c 2 ./target/release/tiny_hot
perf report --stdio --percent-limit=1.0
perf script
SPE vs LBR: SPE is a statistical sampler (like Intel PEBS) — it samples
individual operations with rich metadata (addresses, latency, cache level).
It does NOT provide a continuous branch history like LBR. ARM BRBE (ARMv9.2,
FEAT_BRBE) is the true LBR equivalent but is only available on the newest cores.
Fallback (no SPE):
perf record -e br_mis_pred_retired -c 1000 -g --call-graph dwarf \
-- taskset -c 2 ./target/release/tiny_hot
perf report --stdio --percent-limit=1.0
SPE-supported cores: Neoverse N1/N2/V1/V2/V3, Cortex-X1/X2/X3/X4/X925,
Cortex-A715/A720/A725, Ampere1A. Covers AWS Graviton 2/3/4.
See references/branch-trace-cookbook.md for detailed decode procedures.
Event Groups for Copy-Paste
x86-64 Intel
# TMA Level 1 raw events
topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound
# LBR recording
cycles:u with -b flag, or br_inst_retired.any:u
# ICache pressure
icache_64b.iftag_miss,frontend_retired.l1i_miss,frontend_retired.l2_miss
x86-64 AMD (Zen 4+)
Use -M PipelineL1 and -M PipelineL2 metric groups (AMD events have
vendor-specific names accessed through the JSON metric system).
AArch64
# Topdown (N1/V1 cycle-based)
cpu_cycles,inst_retired,stall_frontend,stall_backend,stall_backend_mem
# Topdown (N2/V2+ slot-based)
cpu_cycles,inst_retired,stall_slot_frontend,stall_slot_backend,stall_slot,op_retired,op_spec
# SPE branch misprediction
arm_spe/branch_filter=1,event_filter=0x80/
# SPE all branches with timestamps
arm_spe/branch_filter=1,ts_enable=1/
Cross-ref linux-perf-profile for ARM cache/TLB/branch event sets.
Cross-ref rust-perf-triage/scripts/perf_counters.sh for generic presets.
Output Format
Report findings using this structure:
## Top-Down Profile: [target / scenario]
### Environment
- CPU: [model], [arch] (x86-64 / AArch64)
- Build: `RUSTFLAGS="-C target-cpu=native -C debuginfo=2 -C force-frame-pointers=yes"` release
- Stability: governor=performance, turbo=off, pinned to core N
### TMA Level 1
| Category | Value | Assessment |
|----------|-------|------------|
| Retiring | X.X% | [ok/high — near peak] |
| Bad Speculation | X.X% | [ok/high — branch misses] |
| Frontend Bound | X.X% | [ok/high — icache/decode] |
| Backend Bound | X.X% | [ok/high — memory/ports] |
### Branch Trace Hotspots (if Mode 4 was used)
| Rank | From → To | Mispredict Rate | Type | Likely Cause |
|------|-----------|-----------------|------|--------------|
| 1 | func_a+0x42 → func_b | 23% | COND | unpredictable match arm |
### Diagnosis & Remediation
[TMA category] is the bottleneck.
**Root cause**: [specific code pattern tied to PMU data]
**Fix applied**: [source-level change with rationale]
### Validation Plan
[Commands to re-run after applying fixes to confirm improvement]
Caveats
- Intel
--topdown: Requires kernel 4.8+. System-wide (-a) required pre-Ice Lake. Disable NMI watchdog.
- AMD: No
--topdown flag. Use -M PipelineL1 (kernel 6.2+ for Zen 4, ~6.9+ for Zen 5).
- LBR depth: 32 entries (Skylake+), 16 (Haswell). AMD Zen 4 depth is CPUID-enumerated. Zen 3 BRS is NOT true LBR.
- ARM SPE: Available on Neoverse N1+ / Cortex-X1+. Not a branch history buffer — it's a statistical sampler.
- ARM BRBE: True LBR equivalent, but requires ARMv9.2 (newest cores only).
- ARM topdown: N1/V1 = cycle-based only. N2/V2+ = full slot-based topdown.
- Multiplexing: At Level 2+, perf may time-share counters. Output shows
(XX.XX%) duty cycle. Keep groups small.
- Metric naming: Stabilized in kernel 6.1. Older kernels may use inconsistent names.
- Graviton: Fixed frequency (no turbo control needed). ~3 simultaneous HW counters in VMs — expect multiplexing.
- Cloud/VMs: LBR often disabled. Use BOLT instrumentation mode or SPE where available.
Related Skills
/linux-perf-profile — Deep ARM counter drill-down (cache/TLB/lock, Modes 3-7)
/asm-forge — Assembly-level follow-up after identifying hot functions
/bench-compare — Criterion before/after measurement
/perf-regression — Full regression workflow with acceptance criteria
/rust-perf-triage — Post-hoc analysis of collected perf data