Rank hot paths by CPU, memory, I/O, and contention; hand the optimization skill a scored target list. Use when: profile, flamegraph, hotspot, bottleneck, p95/p99, IOPS, fsync, "why is this slow".
Profiling Software Performance
One Rule: Ranked evidence before any optimization. No hotspot list → no change.
Table of Contents
One Rule · The Loop · Pre-Flight Checklist · Quick Triggers · Build Flags · Scenario Fingerprint · Samplers · Metrics · Instrumentation · OS Tuning · Hotspot Table · Cheat Sheet · Anti-Patterns · Hand-off · Reference Index
This skill produces the input that extreme-software-optimization consumes: a ranked hotspot table, a baseline fingerprint, and a hypothesis ledger. It does not change code to make things faster — that's the next skill. Measurement-only instrumentation is allowed when needed to attribute cost; keep it behind an env flag, document it as profiling-only, and do not mix it with optimization changes. Stop at the hand-off.
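A minimal sketch of measurement-only instrumentation kept behind an env flag, as described above. The flag name `PERF_PROFILE`, the `span` helper, and the record fields are illustrative assumptions; the skill itself only names the `perf.profile.*` log prefix.

```python
import json
import os
import sys
import time
from contextlib import contextmanager

# Hypothetical flag name; the skill only says "behind an env flag".
PROFILING_ENABLED = os.environ.get("PERF_PROFILE") == "1"


@contextmanager
def span(name, enabled=None, sink=sys.stderr):
    """Time a block and emit one structured perf.profile.* record when enabled."""
    if enabled is None:
        enabled = PROFILING_ENABLED
    if not enabled:
        yield  # profiling off: no timing, no output
        return
    start = time.perf_counter_ns()
    try:
        yield
    finally:
        elapsed_ns = time.perf_counter_ns() - start
        # Assumed record shape: event name plus elapsed nanoseconds.
        record = {"event": f"perf.profile.{name}", "elapsed_ns": elapsed_ns}
        sink.write(json.dumps(record) + "\n")
```

Wrapping a suspect region in `with span("parse"):` leaves behavior unchanged when the flag is unset, which keeps profiling-only code separable from any later optimization change.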
The Loop
1. DEFINE → scenario, metric, budget, golden output — declare what "fast" means
2. ENVIRONMENT → fingerprint host; ASK before tuning kernel/governor
3. BASELINE → hyperfine / criterion / bench_baseline.sh → p50/p95/p99/p99.9, ops/sec, RSS
4. INSTRUMENT → spans, histograms, sentinel frames, structured `perf.profile.*` logs
5. PROFILE → CPU + alloc + I/O + off-CPU + contention samplers
6. INTERPRET → ranked hotspots + scaling law + hypothesis ledger
7. HAND OFF → extreme-software-optimization with Opportunity Matrix inputs
Each phase emits a concrete artifact under tests/artifacts/perf/ (or equivalent). No artifact → phase isn't done.
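As a rough illustration of the BASELINE artifact, the sketch below reduces a list of per-run latencies to a mean plus tail percentiles. It uses the nearest-rank definition, which is an assumption; hyperfine, criterion, or `bench_baseline.sh` may interpolate differently. The function names are hypothetical.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% of values <= it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(p * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Reduce raw run times (milliseconds) to the numbers a baseline should report."""
    return {
        "runs": len(samples_ms),
        "mean": sum(samples_ms) / len(samples_ms),
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": percentile(samples_ms, 99),
    }


# 19 fast runs and one slow one: the mean barely moves, the p99 exposes the tail.
baseline = summarize([10.0] * 19 + [200.0])
```

With only 20 runs the p99 is simply the slowest run, so treat tail figures from small baselines as indicative rather than precise.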
Pre-Flight Checklist
Run through this before touching any profiler — the most common "bad profile" root cause is skipping one of these.
Scenario is written down — name, inputs, expected output, success metric (p95/throughput/RSS). "It's slow" is not a scenario.
Build profile is profilable — debug info on, frame pointers on, strip=false (see Build Flags below).
Fingerprint captured — scripts/env_fingerprint.sh > fingerprint.json (CPU, governor, kernel, toolchain, FS). Without this, the number isn't comparable to anything.
Baseline recorded — ≥ 20 runs via scripts/bench_baseline.sh; variance within envelope (check with scripts/variance_envelope.py).
Same-host discipline — no other heavy jobs, no cross-host comparisons, same power profile.
One lever per run — don't change code and profiler settings together; you lose attribution.
Kernel tuning asked-and-approved — scripts/profile_init.sh prints the plan and prompts before applying.
If any box is empty, stop and resolve it first. Profiling under a broken precondition wastes the whole session.
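One way to read "variance within envelope" is a cap on the coefficient of variation across baseline runs. The sketch below assumes that reading and a 5% cap; it is not the contents of `scripts/variance_envelope.py`, which may use a different statistic or threshold.

```python
import statistics


def within_envelope(samples, max_cv=0.05):
    """Return (ok, cv) where cv = sample stdev / mean of the baseline runs.

    max_cv=0.05 is an assumed envelope, not a value taken from the skill.
    """
    if len(samples) < 2:
        raise ValueError("need at least two runs")
    mean = statistics.fmean(samples)
    if mean <= 0:
        raise ValueError("mean must be positive")
    cv = statistics.stdev(samples) / mean
    return cv <= max_cv, cv


steady_ok, steady_cv = within_envelope([100.0, 101.0, 99.0, 100.0] * 5)
noisy_ok, noisy_cv = within_envelope([100.0, 150.0, 60.0, 120.0] * 5)
```

A failing check usually points back at the checklist itself: a busy host, a changed governor, or a cold cache on some runs.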
Quick Triggers
User says → First move
"why is this slow?" → Define scenario + capture baseline before touching code
"optimize X" → Refuse to optimize until a profile identifies X as top-5
"add a flamegraph" → Check debug symbols + frame pointers are on, then samply / cargo flamegraph / py-spy
"it uses too much memory" → Pick peak RSS or heap high-watermark; they're different measurements
"disk is the bottleneck" / "IOPS" / "fsync is slow" → Prove it first (vmstat wa, /usr/bin/time -v %CPU, off-CPU flame), then iostat -xm 1, biolatency-bpfcc, fio; see IO-AND-TRADEOFFS.md
"too many small files" / "btrfs is fragmented" → df -i, filefrag, btrfs filesystem usage, then the small-file + btrfs playbooks in IO-AND-TRADEOFFS.md
"could we cache this in RAM?" → Use the RAM-for-speed tradeoff table (headroom × hit rate × invalidation story) in IO-AND-TRADEOFFS.md
"it's slower in prod" → Reach for continuous profiling (Pyroscope / Parca) or an on-demand pprof endpoint
"no numbers, just feels slow" → Build a minimal repro scenario first; profiling is noise without a scenario
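The memory trigger above distinguishes peak RSS from heap high-watermark. A small sketch of the difference for a Python process, using only the standard library on Linux or macOS: `ru_maxrss` is what the OS saw for the whole process, while `tracemalloc` sees only allocations made through Python's allocator. The helper names are hypothetical.

```python
import resource
import sys
import tracemalloc


def peak_rss_bytes():
    """Process-wide peak resident set size as reported by the OS."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is in kilobytes on Linux and in bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


def heap_high_watermark(fn):
    """Peak bytes allocated through Python's allocator while fn() runs."""
    tracemalloc.start()
    try:
        fn()
        _current, peak = tracemalloc.get_traced_memory()
    finally:
        tracemalloc.stop()
    return peak


heap_peak = heap_high_watermark(lambda: bytearray(8_000_000))
rss_peak = peak_rss_bytes()
```

The two numbers answer different questions: RSS includes the interpreter, shared libraries, and native buffers, so it can be high while the traced heap is small, and a budget should name which one it constrains.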
Build Flags That Unbreak Profilers
Release builds strip the symbols profilers need. Always add a profiling (or release-perf) profile — never profile the size-optimized release binary.
Rust (Cargo.toml):
[profile.release-perf]        # or "profiling"
inherits = "release"
opt-level = 3
lto = "thin"
codegen-units = 1
debug = "line-tables-only"    # cheap, preserves frames; use `true` for deepest unwinds
strip = false
These kernel knobs visibly affect numbers but require sudo and change global state. Always present the list and ask "apply these?" before running. Revert commands are in OS-TUNING.md.
Minimum set for Linux profiling accuracy:
sudo sysctl -w kernel.perf_event_paranoid=-1    # or 1 for user-space profiling
sudo sysctl -w kernel.kptr_restrict=0           # resolve kernel symbols in stacks
sudo sysctl -w kernel.nmi_watchdog=0            # frees a PMU counter
sudo cpupower frequency-set -g performance      # no P-state jitter
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
taskset -c 2,3 ./bin                            # pin to isolated cores (needs isolcpus= boot flag)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches    # cold-cache runs only
macOS: SIP cripples dtrace; prefer samply, xctrace record, sample <pid>, spindump. Full matrix + restore commands in OS-TUNING.md.
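The ask-before-applying rule for the Linux sysctl knobs can be sketched as a plan builder that reads the current values, pairs every change with its revert command, and only proceeds on an explicit yes. This is an illustrative sketch, not `scripts/profile_init.sh`; the function names and prompt wording are assumptions, and nothing here executes the commands.

```python
from pathlib import Path

# Sysctl knobs from the minimum set above: (name, value wanted for profiling).
WANTED = [
    ("kernel.perf_event_paranoid", "-1"),
    ("kernel.kptr_restrict", "0"),
    ("kernel.nmi_watchdog", "0"),
]


def build_plan(proc_sys=Path("/proc/sys")):
    """Read current values and pair each apply command with its revert command."""
    plan = []
    for name, wanted in WANTED:
        path = proc_sys / name.replace(".", "/")
        if not path.exists():
            continue  # knob not present on this kernel
        current = path.read_text().strip()
        if current == wanted:
            continue  # already set; nothing to change or revert
        plan.append({
            "apply": f"sudo sysctl -w {name}={wanted}",
            "revert": f"sudo sysctl -w {name}={current}",
        })
    return plan


def confirm(plan, ask=input):
    """Print the plan and return True only on an explicit 'y'."""
    for step in plan:
        print(f"{step['apply']}    # revert: {step['revert']}")
    return bool(plan) and ask("apply these? [y/N] ").strip().lower() == "y"
```

Saving the printed plan next to the run's fingerprint keeps the revert commands with the numbers they influenced.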
"Benchmark truthfulness" — normalize API usage before trusting a head-to-head result
Report only time ./bin → Missing tails; user wire-time (p99) usually ≠ mean
Tune kernel without asking → Global state change; get approval, record revert
Optimize before hand-off → This skill stops at the hotspot table; hand to extreme-software-optimization
Hand-off
When the hotspot table + hypothesis ledger + baseline fingerprint are in place, stop.
Say to the user:
Baseline captured. Top 5 hotspots ranked with evidence in tests/artifacts/perf/<run-id>/. Ready for extreme-software-optimization to score targets (Impact × Confidence / Effort ≥ 2.0) and apply one lever at a time.
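For orientation, the scoring rule quoted in the hand-off message (Impact × Confidence / Effort ≥ 2.0) works out as below. The formula and threshold come from the message; the `Target` type, the numeric scales, and the example hotspot names are assumptions, and the actual scoring belongs to extreme-software-optimization rather than this skill.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    impact: float      # assumed scale, e.g. 1 (minor) to 5 (dominant)
    confidence: float  # assumed scale, e.g. 1 (guess) to 5 (measured)
    effort: float      # assumed scale, e.g. 1 (trivial) to 5 (major rewrite)

    @property
    def score(self):
        return self.impact * self.confidence / self.effort


def rank_targets(targets, threshold=2.0):
    """Keep targets scoring at or above the threshold, highest score first."""
    keep = [t for t in targets if t.score >= threshold]
    return sorted(keep, key=lambda t: t.score, reverse=True)


# Hypothetical hotspots, for illustration only.
ranked = rank_targets([
    Target("log_format", impact=2, confidence=2, effort=4),   # 1.0, dropped
    Target("alloc_churn", impact=3, confidence=4, effort=3),  # 4.0
    Target("json_parse", impact=5, confidence=4, effort=2),   # 10.0
])
```

Confidence is where the hotspot table and hypothesis ledger feed in: a target backed by sampler evidence earns a higher confidence than one backed by intuition.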