Rank hot paths by CPU, memory, I/O, and contention; hand the optimization skill a scored target list. Use when: profile, flamegraph, hotspot, bottleneck, p95/p99, IOPS, fsync, "why is this slow".
Profiling Software Performance
One Rule: Ranked evidence before any optimization. No hotspot list → no change.
Table of Contents
One Rule · The Loop · Pre-Flight Checklist · Quick Triggers · Build Flags · Scenario Fingerprint · Samplers · Metrics · Instrumentation · OS Tuning · Hotspot Table · Cheat Sheet · Anti-Patterns · Hand-off · Reference Index
This skill produces the input that extreme-software-optimization consumes: a ranked hotspot table, a baseline fingerprint, and a hypothesis ledger. It does not change code to make things faster — that's the next skill. Measurement-only instrumentation is allowed when needed to attribute cost; keep it behind an env flag, document it as profiling-only, and do not mix it with optimization changes. Stop at the hand-off.
The Loop
1. DEFINE → scenario, metric, budget, golden output — declare what "fast" means
2. ENVIRONMENT → fingerprint host; ASK before tuning kernel/governor
3. BASELINE → hyperfine / criterion / bench_baseline.sh → p50/p95/p99/p99.9, ops/sec, RSS
4. INSTRUMENT → spans, histograms, sentinel frames, structured `perf.profile.*` logs
5. PROFILE → CPU + alloc + I/O + off-CPU + contention samplers
6. INTERPRET → ranked hotspots + scaling law + hypothesis ledger
7. HAND OFF → extreme-software-optimization with Opportunity Matrix inputs
Each phase emits a concrete artifact under tests/artifacts/perf/ (or equivalent). No artifact → phase isn't done.
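To make the BASELINE artifact concrete, here is a minimal sketch of a tail-aware summary. It assumes raw per-run timings in milliseconds and a nearest-rank percentile definition; `summarize`, the field names, and the sample timings are hypothetical, and `bench_baseline.sh` may compute its numbers differently.

```python
import json
import math


def percentile(samples, q):
    """Nearest-rank percentile: smallest sample with at least q% of the data at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(q * len(ordered) / 100))
    return ordered[rank - 1]


def summarize(samples_ms):
    """Hypothetical baseline summary: tail percentiles next to the mean."""
    return {
        "runs": len(samples_ms),
        "mean_ms": sum(samples_ms) / len(samples_ms),
        "p50_ms": percentile(samples_ms, 50),
        "p95_ms": percentile(samples_ms, 95),
        "p99_ms": percentile(samples_ms, 99),
    }


# Made-up timings: 19 fast runs and one slow outlier.
timings_ms = [10.0] * 19 + [200.0]
summary = summarize(timings_ms)
print(json.dumps(summary, indent=2))
```

With these made-up numbers the mean lands between the fast runs and the outlier, while p99 reports the outlier itself. That gap is why the loop asks for percentiles rather than a single average.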
Pre-Flight Checklist
Run through this before touching any profiler — the most common "bad profile" root cause is skipping one of these.
Scenario is written down — name, inputs, expected output, success metric (p95/throughput/RSS). "It's slow" is not a scenario.
Build profile is profilable — debug info on, frame pointers on, strip=false (see Build Flags below).
Fingerprint captured — scripts/env_fingerprint.sh > fingerprint.json (CPU, governor, kernel, toolchain, FS). Without this, the number isn't comparable to anything.
Baseline recorded — ≥ 20 runs via scripts/bench_baseline.sh; variance within envelope (check with scripts/variance_envelope.py).
Same-host discipline — no other heavy jobs, no cross-host comparisons, same power profile.
One lever per run — don't change code and profiler settings together; you lose attribution.
Kernel tuning asked-and-approved — scripts/profile_init.sh prints the plan and prompts before applying.
If any box is empty, stop and resolve it first. Profiling under a broken precondition wastes the whole session.
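The "variance within envelope" check can be sketched as a coefficient-of-variation test. This is an assumption about what such a check might look like, not the contents of `scripts/variance_envelope.py`; the `max_cv` threshold of 5% and the function name are hypothetical.

```python
import statistics


def within_envelope(samples, max_cv=0.05, min_runs=20):
    """Return (ok, cv). ok is False when there are too few runs or the spread is too wide.

    max_cv is a hypothetical threshold, not a value taken from the skill.
    """
    if len(samples) < min_runs:
        return False, None
    mean = statistics.fmean(samples)
    if mean == 0:
        return False, None
    cv = statistics.stdev(samples) / mean  # sample standard deviation over mean
    return cv <= max_cv, cv


# Made-up timings: a steady run set and a bimodal one.
steady = [100.0 + (i % 3) for i in range(20)]
noisy = [100.0 if i % 2 else 150.0 for i in range(20)]

for label, runs in (("steady", steady), ("noisy", noisy)):
    ok, cv = within_envelope(runs)
    print(f"{label}: ok={ok} cv={cv:.3f}")
```

A bimodal set like `noisy` usually points at a broken precondition (another job on the host, governor changes) rather than at the code under test, so it should block the baseline instead of being averaged away.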
Quick Triggers
| User says | First move |
| --- | --- |
| "why is this slow?" | Define scenario + capture baseline before touching code |
| "optimize X" | Refuse to optimize until a profile identifies X as top-5 |
| "add a flamegraph" | Check debug symbols + frame pointers are on, then `samply` / `cargo flamegraph` / `py-spy` |
| "it uses too much memory" | Pick peak RSS or heap high-watermark; they're different measurements |
| "disk is the bottleneck" / "IOPS" / "fsync is slow" | Prove it first (`vmstat` wa, `/usr/bin/time -v` %CPU, off-CPU flame), then `iostat -xm 1`, `biolatency-bpfcc`, `fio`; see IO-AND-TRADEOFFS.md |
| "too many small files" / "btrfs is fragmented" | `df -i`, `filefrag`, `btrfs filesystem usage`, then the small-file + btrfs playbooks in IO-AND-TRADEOFFS.md |
| "could we cache this in RAM?" | Use the RAM-for-speed tradeoff table (headroom × hit rate × invalidation story) in IO-AND-TRADEOFFS.md |
| "it's slower in prod" | Reach for continuous profiling (Pyroscope / Parca) or an on-demand pprof endpoint |
| "no numbers, just feels slow" | Build a minimal repro scenario first; profiling is noise without a scenario |
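Peak RSS and heap high-watermark can be read side by side to see that they measure different things. The sketch below is a Unix-only illustration for a Python process, using the standard `resource` and `tracemalloc` modules; the 8 MiB buffer is an arbitrary example allocation.

```python
import resource
import sys
import tracemalloc


def peak_rss_bytes():
    """Peak resident set size of this process, as reported by the OS."""
    peak = resource.getrusage(resource.RUSAGE_SELF).ru_maxrss
    # ru_maxrss is kilobytes on Linux but bytes on macOS.
    return peak if sys.platform == "darwin" else peak * 1024


tracemalloc.start()
buffer = bytearray(8 * 1024 * 1024)  # 8 MiB allocated through Python's allocator
_, heap_peak = tracemalloc.get_traced_memory()
tracemalloc.stop()

rss_peak = peak_rss_bytes()
print(f"heap high-watermark (tracemalloc): {heap_peak / 2**20:.1f} MiB")
print(f"peak RSS (getrusage):              {rss_peak / 2**20:.1f} MiB")
```

The heap figure covers only allocations traced while `tracemalloc` was running, whereas peak RSS also includes the interpreter, loaded libraries, and anything resident before tracing started. Write down which of the two the scenario's budget refers to.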
Build Flags That Unbreak Profilers
Release builds strip the symbols profilers need. Always add a profiling (or release-perf) profile — never profile the size-optimized release binary.
Rust (Cargo.toml):
[profile.release-perf]        # or "profiling"
inherits = "release"
opt-level = 3
lto = "thin"
codegen-units = 1
debug = "line-tables-only"    # cheap, preserves frames; use `true` for deepest unwinds
strip = false
OS Tuning
These kernel knobs visibly affect numbers but require sudo and change global state. Always present the list and ask "apply these?" before running. Revert commands are in OS-TUNING.md.
Minimum set for Linux profiling accuracy:
sudo sysctl -w kernel.perf_event_paranoid=-1    # or 1 for user-space profiling
sudo sysctl -w kernel.kptr_restrict=0           # resolve kernel symbols in stacks
sudo sysctl -w kernel.nmi_watchdog=0            # frees a PMU counter
sudo cpupower frequency-set -g performance      # no P-state jitter
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
taskset -c 2,3 ./bin                            # pin to isolated cores (needs isolcpus= boot flag)
sync && echo 3 | sudo tee /proc/sys/vm/drop_caches    # cold-cache runs only
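One way to honor "ask first, record revert" is to snapshot the current sysctl values before proposing any change. The sketch below assumes the three sysctl keys map to files under `/proc/sys` with dots replaced by slashes (true for these keys on Linux). The helper names are hypothetical, and this is not the behavior of `scripts/profile_init.sh`, which is not shown here.

```python
from pathlib import Path

# Keys from the sysctl list in this section; the helpers below are hypothetical.
SYSCTL_KEYS = [
    "kernel.perf_event_paranoid",
    "kernel.kptr_restrict",
    "kernel.nmi_watchdog",
]


def snapshot_sysctls(keys, root="/proc/sys"):
    """Read current values so they can be restored later; None if a key is absent."""
    values = {}
    for key in keys:
        path = Path(root, *key.split("."))
        try:
            values[key] = path.read_text().strip()
        except OSError:
            values[key] = None
    return values


def revert_commands(snapshot):
    """Build the sysctl commands that would restore the snapshot."""
    return [f"sudo sysctl -w {k}={v}" for k, v in snapshot.items() if v is not None]


def approved(plan, ask=input):
    """Show the plan and require an explicit 'y' before anything is applied."""
    print("Planned changes:")
    for line in plan:
        print("  " + line)
    return ask("apply these? [y/N] ").strip().lower() == "y"
```

Nothing here applies a change. The intent is that the revert list is saved alongside the run's artifacts before any `sysctl -w` is executed, and that a bare Enter at the prompt counts as "no".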
macOS: SIP cripples dtrace; prefer samply, xctrace record, sample <pid>, spindump. Full matrix + restore commands in OS-TUNING.md.
"Benchmark truthfulness" — normalize API usage before trusting a head-to-head result
| Anti-pattern | Why it fails |
| --- | --- |
| Report only `time ./bin` | Missing tails; user wire-time (p99) usually ≠ mean |
| Tune kernel without asking | Global state change; get approval, record revert |
| Optimize before hand-off | This skill stops at the hotspot table; hand to extreme-software-optimization |
Hand-off
When the hotspot table + hypothesis ledger + baseline fingerprint are in place, stop.
Say to the user:
Baseline captured. Top 5 hotspots ranked with evidence in tests/artifacts/perf/<run-id>/. Ready for extreme-software-optimization to score targets (Impact × Confidence / Effort ≥ 2.0) and apply one lever at a time.
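For orientation, the scoring rule quoted in the hand-off message can be worked through as follows. This is a sketch that reads the formula as (Impact × Confidence) / Effort; the rating scales belong to extreme-software-optimization and are not defined here, so the `Target` type, the target names, and every number below are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Target:
    name: str
    impact: float      # hypothetical scale
    confidence: float  # hypothetical scale
    effort: float      # hypothetical scale; assumed to be positive

    @property
    def score(self):
        return self.impact * self.confidence / self.effort


def rank_targets(targets, threshold=2.0):
    """Keep targets whose score meets the threshold, highest score first."""
    kept = [t for t in targets if t.score >= threshold]
    return sorted(kept, key=lambda t: t.score, reverse=True)


candidates = [
    Target("parse_request", impact=4, confidence=0.9, effort=1),
    Target("serialize_json", impact=3, confidence=0.5, effort=1),
    Target("fsync_batching", impact=5, confidence=0.8, effort=2),
]
for t in rank_targets(candidates):
    print(f"{t.name}: {t.score:.2f}")
```

In this made-up list, `serialize_json` scores 1.5 and drops out, while `fsync_batching` sits exactly on the 2.0 threshold and stays in. Low-confidence hotspots are the ones that most need more profiling evidence before they are handed off.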