| name | profiling-guide |
| description | Performance profiling methodologies, interpreting profiler output, and benchmarking techniques.
Activate when: profiling application performance, reading flame graphs, benchmarking code,
identifying bottlenecks, measuring CPU usage, analyzing memory allocation, I/O profiling.
|
Performance Profiling Guide
Profiling Methodology
1. Define the goal:
- What is slow? (specific endpoint, operation, user workflow)
- How slow is it? (current measurement)
- How fast should it be? (target)
2. Measure first:
- Profile BEFORE optimizing (avoid optimizing the wrong thing)
- Measure end-to-end, then narrow down (a timing harness is sketched after this list)
- Use realistic data and load patterns
3. Identify the bottleneck:
- Is it CPU-bound? (computation, parsing, serialization)
- Is it I/O-bound? (disk, network, database)
- Is it memory-bound? (allocation pressure, GC pauses, swapping)
4. Optimize the bottleneck:
- Fix the biggest bottleneck first (Amdahl's Law)
- Measure again after each change
- Stop when the target is met
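A minimal end-to-end harness for the "measure first" step, before reaching for a profiler (run_workload is a stand-in for the operation under investigation):
import statistics
import time

def run_workload():
    sum(i * i for i in range(1_000_000))  # stand-in workload

samples = []
for _ in range(10):
    start = time.perf_counter()
    run_workload()
    samples.append(time.perf_counter() - start)

print(f"median={statistics.median(samples) * 1000:.1f} ms, "
      f"max={max(samples) * 1000:.1f} ms")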
CPU Profiling
Python: cProfile
import cProfile
import pstats

profiler = cProfile.Profile()
profiler.enable()
run_workload()  # the code under test goes between enable() and disable()
profiler.disable()

stats = pstats.Stats(profiler)
stats.sort_stats("cumulative")
stats.print_stats(20)
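The same profile can be captured without touching the code, via cProfile's module interface:
python -m cProfile -s cumulative my_script.py    # print stats sorted by cumulative time
python -m cProfile -o profile.out my_script.py   # save for later analysis
python -m pstats profile.out                     # interactive browser for a saved profile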
Interpreting cProfile Output
Column meanings:
ncalls   → Number of times the function was called
tottime  → Time spent IN this function (excluding subfunctions)
percall  → tottime / ncalls (first percall column)
cumtime  → Time spent in this function INCLUDING subfunctions
percall  → cumtime / ncalls (second percall column)
filename:lineno(function) → File, line number, and function name
What to look for:
High tottime → This function itself is slow (optimize its code)
High cumtime → This function or something it calls is slow
High ncalls → Function called too many times (N+1 query, tight loop)
tottime ≈ cumtime → Leaf function (doesn't call other slow things)
tottime << cumtime → Time is spent in subfunctions (dig deeper)
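Both sort orders are worth checking; pstats can re-sort the same stats object in place:
stats.sort_stats("tottime").print_stats(10)     # slow leaf functions (own time)
stats.sort_stats("cumulative").print_stats(10)  # slow call trees (own time + callees)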
Python: py-spy (Sampling Profiler)
py-spy top --pid <PID>                               # live, top-like view of a running process
py-spy record -o profile.svg --pid <PID>             # attach to a process and record a flame graph
py-spy record -o profile.svg -- python my_script.py  # profile a script from launch
py-spy advantages:
- Low overhead on the profiled application (sampling, not tracing)
- Can attach to running processes
- Generates flame graphs directly
- No code modifications required
- Works with C extensions
Line-Level Profiling (line_profiler)
Requires the third-party line_profiler package: decorate the target function with @profile, then run kernprof -l -v my_script.py.
@profile
def slow_function(data):
    result = []
    for item in data:
        processed = transform(item)
        result.append(processed)
    return result
Output shows time per line:
Line #  Hits   Time     Per Hit  % Time  Line Contents
5       1000   500.0    0.5      2.0     for item in data:
6       1000   20000.0  20.0     80.0    processed = transform(item)
7       1000   4500.0   4.5      18.0    result.append(processed)
Interpretation:
- Line 6 uses 80% of the time → optimize transform()
- Line 7 uses 18% → replace the append loop with a list comprehension (sketched below)
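A sketch of that rewrite, keeping the hypothetical transform() from the example above:
def slow_function(data):
    # One list comprehension instead of repeated .append() calls;
    # this removes the per-iteration method lookup and append overhead.
    return [transform(item) for item in data]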
Memory Profiling
Python: tracemalloc
import tracemalloc

tracemalloc.start()
run_workload()  # the code whose allocations you want to attribute
snapshot = tracemalloc.take_snapshot()

top_stats = snapshot.statistics("lineno")
print("Top 10 memory allocations:")
for stat in top_stats[:10]:
    print(stat)
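To localize a leak rather than one-off allocations, diff two snapshots taken around the suspect code (compare_to is part of the tracemalloc API):
before = tracemalloc.take_snapshot()
run_workload()  # suspect code
after = tracemalloc.take_snapshot()

for stat in after.compare_to(before, "lineno")[:10]:
    print(stat)  # size and count deltas per source line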
Python: memory_profiler
from memory_profiler import profile

@profile
def memory_heavy_function():
    data = [i ** 2 for i in range(1000000)]
    filtered = [x for x in data if x % 2 == 0]
    return filtered
Run with python -m memory_profiler my_script.py to print per-line memory usage.
Memory Profiling Patterns
What to look for:
Growing memory over time → Memory leak (objects not freed)
Spike then release → Normal allocation pattern
High baseline → Large data structures kept alive unnecessarily
GC pauses → Too many short-lived objects
Common causes of high memory:
- Loading entire files into memory (use streaming/generators; sketched below)
- Unbounded caches (use an LRU cache with a max size; sketched below)
- Circular references preventing garbage collection
- String concatenation in loops (creates many intermediate strings)
- Large query results not paginated
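A minimal sketch of those two fixes, assuming a line-oriented input file and a pure, repeat-heavy lookup function:
import functools

def read_records(path):
    # Generator: holds one line in memory at a time instead of the whole file
    with open(path) as f:
        for line in f:
            yield line.rstrip("\n")

@functools.lru_cache(maxsize=4096)  # bounded: least-recently-used entries are evicted
def lookup(key):
    return expensive_computation(key)  # hypothetical pure function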
I/O Profiling
Database Query Profiling
# SQLAlchemy: log every statement the engine executes
import logging
logging.getLogger("sqlalchemy.engine").setLevel(logging.INFO)

# Django: inspect queries from the current request (requires DEBUG=True)
from django.db import connection
print(f"Queries executed: {len(connection.queries)}")
for query in connection.queries:
    print(f"  {query['time']}s: {query['sql'][:100]}")
What to look for:
N+1 queries:
  1 query for a list of users
  + N queries for each user's profile
  → Use eager loading (joinedload, prefetch_related); see the sketch below
Slow queries:
  → Check EXPLAIN ANALYZE output
  → Look for sequential scans on large tables
  → Add indexes on filtered/joined columns
Too many queries:
  → Batch operations (bulk insert, bulk update)
  → Use caching for frequently accessed data
  → Denormalize read-heavy data
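A Django-flavored sketch of the N+1 fix, assuming hypothetical Author and Book models with a reverse relation named books:
# N+1: one query for the authors, then one more query per author
for author in Author.objects.all():
    titles = [book.title for book in author.books.all()]

# Eager: two queries total; author.books.all() is served from the prefetch cache
for author in Author.objects.prefetch_related("books"):
    titles = [book.title for book in author.books.all()]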
Network I/O Profiling
Diagnosis tools:
curl -w "@curl-timing.txt" <url>   # detailed timing breakdown (format file sketched below)
time python my_script.py           # wall clock time vs CPU time
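One possible curl-timing.txt, built from curl's --write-out variables, which map onto the breakdown below:
      DNS lookup: %{time_namelookup}s\n
     TCP connect: %{time_connect}s\n
   TLS handshake: %{time_appconnect}s\n
      First byte: %{time_starttransfer}s\n
           Total: %{time_total}s\n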
Timing breakdown:
DNS lookup: → Cache DNS, use connection pooling
TCP connect: → Use keep-alive connections
TLS handshake: → Use session resumption
Time to first byte: → Server processing time
Transfer time: → Compression, payload size
If wall_time >> cpu_time:
→ Application is I/O bound (waiting on network, disk, or database)
→ Optimize by reducing I/O, adding concurrency, or caching
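A quick way to check this split from inside Python (time.process_time counts CPU time only, so the gap to wall time is time spent waiting):
import time

wall0, cpu0 = time.perf_counter(), time.process_time()
run_workload()  # the operation under investigation
wall = time.perf_counter() - wall0
cpu = time.process_time() - cpu0
print(f"wall={wall:.2f}s cpu={cpu:.2f}s")  # wall >> cpu suggests I/O-bound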
Flame Graphs
Reading Flame Graphs
Anatomy:
- X-axis: Proportion of time (NOT chronological)
- Y-axis: Call stack depth (bottom = entry point, top = leaf functions)
- Width of bar: Percentage of total time in that function
- Color: Usually random or grouped by module (no special meaning)
How to read:
1. Look at the widest bars at the TOP → most time-consuming leaf functions
2. Look for "plateaus" → functions that take time themselves (not just calling others)
3. Ignore narrow spikes → they're fast, don't optimize them
4. Look for unexpected wide bars → functions you didn't expect to be slow
Common patterns:
Wide plateau at top → Single function is the bottleneck
Many narrow columns → Death by a thousand cuts (many small costs)
Deep, narrow tower → Deep recursion or call chain
Wide bar in the middle → Framework/library overhead
Generating Flame Graphs
py-spy record -o flame.svg -- python my_script.py      # Python
node --prof my_script.js                               # Node.js: collect a V8 profile
node --prof-process isolate-*.log > processed.txt      # Node.js: post-process the log
perf record -g -- ./my_program                         # Linux: sample with call stacks
perf script | stackcollapse-perf.pl | flamegraph.pl > flame.svg   # FlameGraph scripts
Benchmarking Techniques
Microbenchmarking
import timeit
result = timeit.timeit(
    stmt="sorted(data)",
    setup="data = list(range(1000, 0, -1))",
    number=10000,
)
print(f"Average: {result / 10000 * 1000:.3f} ms")
time_a = timeit.timeit("'-'.join(str(i) for i in range(100))", number=10000)
time_b = timeit.timeit("'-'.join(map(str, range(100)))", number=10000)
print(f"Generator: {time_a:.3f}s, Map: {time_b:.3f}s")
Benchmarking Best Practices
1. Warm up:
- Run the code several times before measuring
- JIT compilers, caches, and connection pools need warming
2. Multiple iterations:
- Run many iterations and report median (not mean)
- Report percentiles: P50, P95, P99
- Show standard deviation to assess consistency
3. Realistic conditions:
- Use production-like data sizes
- Include realistic concurrency levels
- Account for cold start vs warm state
4. Isolate variables:
- Change one thing at a time
- Compare apples to apples (same data, same machine)
- Control for background processes and system load
5. Statistical significance:
- Small differences (<5%) may be noise
- Run enough iterations to be confident
- Use statistical tests if making claims (t-test; sketched after this list)
6. Avoid common pitfalls:
- Dead code elimination (compiler removes unused results)
- Constant folding (compiler pre-computes constant expressions)
- Caching effects (second run is faster due to warm cache)
- GC pauses (include GC time in measurements)
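For point 5, a quick significance check (SciPy is a third-party dependency; the latency samples are made up for illustration):
from scipy import stats

baseline = [101.2, 99.8, 100.5, 102.1, 100.9]   # hypothetical latencies (ms)
candidate = [97.4, 96.9, 98.2, 97.8, 97.1]
result = stats.ttest_ind(baseline, candidate)
print(f"t={result.statistic:.2f}, p={result.pvalue:.4f}")  # p < 0.05 suggests a real difference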
Load Testing
Tools:
wrk: wrk -t12 -c400 -d30s http://localhost:8080/api
hey: hey -n 10000 -c 100 http://localhost:8080/api
k6: k6 run load_test.js
locust: Python-based, scriptable load testing
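A minimal locust script (save as locustfile.py, run with locust -f locustfile.py); the /api path mirrors the examples above:
from locust import HttpUser, task, between

class ApiUser(HttpUser):
    wait_time = between(1, 3)  # each simulated user pauses 1-3 s between requests

    @task
    def hit_api(self):
        self.client.get("/api")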
Key metrics:
Throughput: Requests per second (RPS)
Latency: P50, P95, P99 response times
Error rate: Percentage of failed requests
Saturation: CPU, memory, connection utilization
Load test patterns:
Baseline: Single user, measure per-request latency
Ramp-up: Gradually increase load until errors appear
Stress: Exceed expected capacity, observe degradation
Soak: Sustained load over hours (find memory leaks)
Spike: Sudden burst of traffic (test auto-scaling)
Profiling Decision Tree
Is it slow?
├── Measure end-to-end response time
├── Is most time in CPU?
│   ├── Yes → CPU profiler (cProfile, py-spy)
│   │   ├── Single hot function → Optimize algorithm
│   │   └── Many small costs → Look for N+1 patterns
│   └── No → I/O profiler
│       ├── Database → Query profiler (EXPLAIN ANALYZE)
│       ├── Network → Connection profiling (keep-alive, pooling)
│       └── Disk → I/O stats (iostat, strace)
├── Is memory growing?
│   ├── Yes → Memory profiler (tracemalloc, memory_profiler)
│   │   ├── Leak → Find unreleased references
│   │   └── High usage → Reduce data in memory (streaming, pagination)
│   └── No → Check GC pressure (many allocations/deallocations)
└── Is it intermittently slow?
    ├── GC pauses → Reduce allocation rate
    ├── Lock contention → Reduce lock scope
    └── Resource exhaustion → Check connection pools, file descriptors