---
name: performance-profiling
description: "Performance: golden signals, p50/p95/p99, flame graphs, load testing. Triggers: performance, slow, latency, p99, flame graph, bottleneck, memory leak."
effort: medium
allowed-tools: Read, Grep
user-invocable: false
---
# Performance Profiling Skill
## Optimization Golden Rule

> "Don't optimize without a baseline."

Always measure -> change -> measure.
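A minimal sketch of that loop using a load generator, assuming `wrk` is installed and `http://localhost:8080/api` stands in for your endpoint:

```bash
# 1. Measure: capture a baseline (--latency prints the 50/75/90/99th percentiles)
wrk -t4 -c64 -d30s --latency http://localhost:8080/api > baseline.txt

# 2. Change: apply exactly one optimization, then redeploy

# 3. Measure again under the identical load shape and compare
wrk -t4 -c64 -d30s --latency http://localhost:8080/api > after.txt
diff baseline.txt after.txt
```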
## Critical Metrics (The Four Golden Signals)
- Latency: Time to serve a request (p50, p95, p99).
- Traffic: Demand on your system (req/sec).
- Errors: Rate of requests that fail.
- Saturation: How "full" your service is (CPU/Memory usage).
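To see why the percentiles matter, here is a quick nearest-rank sketch that pulls p50/p95/p99 out of a plain file of per-request latencies (one value per line; `latencies.txt` is a hypothetical input):

```bash
# Sort the samples, then index into them at the 50th/95th/99th positions
sort -n latencies.txt | awk '{a[NR]=$1}
END {
  print "p50:", a[int(NR*0.50)]
  print "p95:", a[int(NR*0.95)]
  print "p99:", a[int(NR*0.99)]
}'
```

A comfortable p50 can coexist with a p99 that is an order of magnitude worse, which is exactly what averages hide.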
## Profiling Tools & Techniques
### Python

- py-spy: sampling profiler that attaches to a running process and emits flame graphs.
- cProfile / pstats: built-in deterministic profiler for offline runs.
- yappi: in-process profiler with thread and async support.
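A quick deterministic-profiler sketch using only the standard library (`my_script.py` is a placeholder):

```bash
# Profile a full run and dump the stats to a file
python -m cProfile -o out.prof my_script.py

# Print the 20 most expensive functions by cumulative time
python -c "import pstats; pstats.Stats('out.prof').sort_stats('cumtime').print_stats(20)"
```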
### Node.js

- node --prof: built-in V8 sampling profiler (see the Gotchas below on worker threads).
- clinic flame (Clinic.js): flame graphs with worker and child processes handled for you.
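A sketch of both approaches (`app.js` is a placeholder entry point):

```bash
# Built-in V8 sampling profiler: writes an isolate-*.log per process
node --prof app.js
node --prof-process isolate-0x*.log > profile.txt   # process one log at a time

# Or let clinic handle the worker/child split and render a flame graph
npx clinic flame -- node app.js
```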
### Database (SQL)

- EXPLAIN (ANALYZE, BUFFERS): per-node timing and buffer hits on Postgres.
- pg_stat_statements: aggregate timing of the queries your service actually runs.
- Slow query log: MySQL's equivalent first stop.
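A safe way to time a write query on Postgres, using the rollback trick from the Gotchas section (database, table, and predicate are hypothetical):

```bash
psql mydb <<'SQL'
BEGIN;
EXPLAIN (ANALYZE, BUFFERS)
UPDATE orders SET status = 'shipped' WHERE id = 42;
ROLLBACK;  -- ANALYZE really executed the UPDATE; roll it back
SQL
```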
### Frontend (Browser)
- Lighthouse: Core Web Vitals (LCP, CLS, INP).
- Chrome Performance Tab: Main thread blocking time.
- Network Waterfall: Time to First Byte (TTFB).
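A scripted way to pull Core Web Vitals without opening DevTools, assuming Node and a local Chrome are available (the URL is a placeholder):

```bash
# Run only the performance category and keep the raw report for diffing
npx lighthouse https://example.com \
  --only-categories=performance \
  --output=json --output-path=./report.json
```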
## Optimization Hierarchy

1. Database/IO (indexing, caching, batching) - biggest gains
2. Algorithm (O(n²) -> O(n log n))
3. Memory (allocation churn, GC pressure)
4. Micro-optimization (loop unrolling, etc.) - smallest gains
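Illustrating step 1, a hedged before/after sketch on Postgres (database, table, and column names are hypothetical):

```bash
psql mydb <<'SQL'
-- Baseline: confirm the slow plan (expect a Seq Scan)
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;
-- The fix: an index on the filtered column, built without blocking writes
CREATE INDEX CONCURRENTLY idx_orders_customer_id ON orders (customer_id);
-- Re-measure: the plan should now show an Index Scan
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 42;
SQL
```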
## Common Rationalizations
| Excuse | Why It's Wrong |
| --- | --- |
| "It feels slow, let me optimize this function" | Feelings aren't data; profile first, then optimize the actual bottleneck |
| "We should optimize everything" | Premature optimization is the root of all evil; focus on the critical path |
| "Caching will fix it" | Caching masks problems and adds complexity; fix the root cause first |
| "It's fast enough in dev" | Dev has one user; production has thousands, plus cold caches |
| "We'll optimize later" | Performance debt compounds; a 100ms regression every sprint adds up to whole seconds within a year |
## Example

```bash
# Sample the running service for 30 seconds and write a flame graph SVG
py-spy record -o profile.svg --duration 30 --pid "$(pgrep -f my-service)"

# Live, top-style view of the hottest functions in the same process
py-spy top --pid "$(pgrep -f my-service)"
```
Then, per the optimization hierarchy, start with DB/IO fixes (indexing, batching, caching the right layer) before touching algorithm-level changes.
## Rules
- MUST capture a baseline measurement before proposing any change
- NEVER optimize code without profiler data pointing at it as the bottleneck
- CRITICAL: report p95/p99, not just p50; averages hide real user pain
- MANDATORY: follow the hierarchy; DB/IO before algorithm before micro-optimization
## Gotchas
- py-spy needs CAP_SYS_PTRACE on Linux, and on macOS it must run as root (SIP further restricts attaching to system-protected binaries). Containerized services usually run without ptrace privileges; profiling requires `--cap-add=SYS_PTRACE` on the container or an in-process alternative (cProfile, yappi). For example:
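  A minimal sketch for the container case (image name is a placeholder):

  ```bash
  # Grant ptrace at start so py-spy can attach inside the container
  docker run --cap-add=SYS_PTRACE my-service:latest
  ```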
- Production hosts frequently set `/proc/sys/kernel/perf_event_paranoid` to 2 or higher, which disables user-space perf events. Tools that rely on perf (perf, bcc, bpftrace) silently produce empty output; check the value first:
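  The check, plus a temporary relaxation (needs root; resets on reboot):

  ```bash
  cat /proc/sys/kernel/perf_event_paranoid       # 2 or higher: perf events restricted
  sudo sysctl -w kernel.perf_event_paranoid=1    # relax until the next reboot
  ```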
- Node.js `--prof` output gets interleaved across worker threads and child processes. A single isolate-*.log mixes samples from multiple isolates unless each worker writes its own; filter by PID or use clinic flame, which handles the split.
- Chrome DevTools samples at ~1 kHz, so operations faster than ~1 ms vanish. For microbenchmarks, prefer `performance.now()` with manual markers, not the Performance tab.
- EXPLAIN ANALYZE on Postgres executes the query, including INSERT/UPDATE/DELETE. Wrap write queries in a transaction that you roll back: `BEGIN; EXPLAIN (ANALYZE, BUFFERS) ... ; ROLLBACK;` in a single round trip (see the SQL sketch under Profiling Tools above).
## When NOT to Use
- For correctness bugs (wrong output): use /debug
- For frontend render bugs without timing data: measure with DevTools first
- For infrastructure capacity planning: use load testing, not profiling
- For generic code quality: use /analyze