| name | linux-perf |
| description | Linux perf performance analysis expert covering the complete workflow from environment setup through data collection, analysis, and bottleneck diagnosis. Expert in CPU profiling, call graph analysis, syscall tracing, and performance optimization. Use this skill whenever profiling C/C++/Rust/Go applications, diagnosing CPU bottlenecks, analyzing syscall overhead, or optimizing hot paths. IMPORTANT: Check permissions first. User-space profiling works without sudo, but kernel symbols require sudo to modify kptr_restrict and perf_event_paranoid. Focus on user-space optimization rather than kernel.
|
Linux Perf Performance Analysis Skill
This skill provides expert guidance for using the Linux perf tool to profile and
optimize application performance. It focuses on practical, non-sudo workflows that
work in development and CI/CD environments.
Quick Reference Commands
Environment Setup
User-space profiling works without sudo (after one-time capability setup):
perf record -p <PID> - Record user-space samples
perf report -i perf.data --dsos <app> - Analyze user-space code
Kernel symbols require sudo (a security mechanism that cannot be bypassed):
- Need sudo to modify: kptr_restrict and perf_event_paranoid
- Without sudo: kernel frames show as [unknown] or hex addresses
cat /proc/sys/kernel/perf_event_paranoid
cat /proc/sys/kernel/kptr_restrict
sudo setcap "cap_perfmon,cap_sys_admin,cap_sys_ptrace+ep" /usr/bin/perf
sudo sysctl -w kernel.perf_event_paranoid=-1 kernel.kptr_restrict=0
echo "kernel.perf_event_paranoid = -1" | sudo tee -a /etc/sysctl.d/99-perf.conf
echo "kernel.kptr_restrict = 0" | sudo tee -a /etc/sysctl.d/99-perf.conf
Data Collection
perf record -p <PID> -g -e cycles -o perf.data -- sleep 30
perf record -p <PID> -g -F 99 -o perf.data -- sleep 30
perf record -p <PID> -g -e 'cpu-cycles' -e 'instructions' -o perf.data -- sleep 30
Data Analysis
perf report -i perf.data -g graph
perf report --stdio -i perf.data -g none --sort symbol | head -50
perf report --stdio -i perf.data -g graph --max-stack 20 | head -200
perf report --stdio -i perf.data --dsos <app_name> -g graph
perf report --stdio -i perf.data --symbol-filter 'tcp_sendmsg' -g graph
Data Export
perf script -i perf.data > perf_script.txt
perf report --stdio -i perf.data --show-total-period --show-nr-samples
Complete Workflow Phases
Phase 1: Environment Preparation
1.1 Permission Check
cat /proc/sys/kernel/perf_event_paranoid
cat /proc/sys/kernel/kptr_restrict
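The number printed for perf_event_paranoid decides what an unprivileged user may profile. The sketch below shows one way to turn it into a readable message, following the kernel's documented meaning of levels -1 to 2. The function name describe_paranoid is hypothetical and not part of perf, and the branch for levels of 3 and above assumes a distribution that patches in stricter levels.

```shell
# describe_paranoid LEVEL: explain what an unprivileged user can do at a
# given perf_event_paranoid level. Hypothetical helper, not part of perf.
describe_paranoid() {
    level=$1
    if [ "$level" -le -1 ]; then
        echo "no restrictions: all events available to all users"
    elif [ "$level" -eq 0 ]; then
        echo "raw tracepoints need privileges; system-wide CPU events and kernel profiling allowed"
    elif [ "$level" -eq 1 ]; then
        echo "system-wide CPU events need privileges; per-process user and kernel profiling allowed"
    elif [ "$level" -eq 2 ]; then
        echo "kernel profiling needs privileges; user-space profiling of your own processes allowed"
    else
        # Assumption: levels above 2 only exist on distributions that add them.
        echo "unprivileged perf use disabled (distribution-specific level)"
    fi
}

# Read the live value when the file is available.
if [ -r /proc/sys/kernel/perf_event_paranoid ]; then
    describe_paranoid "$(cat /proc/sys/kernel/perf_event_paranoid)"
fi
```

At level 2 the user-space-only workflow in 1.3 is the one that applies.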
1.2 Full Kernel Access Setup (Requires sudo - one time only)
Run this once to enable full kernel symbol support:
sudo setcap "cap_perfmon,cap_sys_admin,cap_sys_ptrace+ep" /usr/bin/perf
sudo sysctl -w kernel.perf_event_paranoid=-1 kernel.kptr_restrict=0
echo "kernel.perf_event_paranoid = -1" | sudo tee /etc/sysctl.d/99-perf.conf > /dev/null
echo "kernel.kptr_restrict = 0" | sudo tee -a /etc/sysctl.d/99-perf.conf > /dev/null
1.3 Without sudo - User-space Only Mode
If you cannot use sudo, you can still profile user-space code:
perf record -p <PID> -g -o perf.data -- sleep 30
perf report --stdio -i perf.data --dsos yourapp -g graph
Trade-off: Kernel shows as [unknown] - but user-space profiling still works!
1.4 Compile with Debug Symbols
cmake -DCMAKE_BUILD_TYPE=RelWithDebInfo ..
gcc -g -O2 -o app app.c
CARGO_PROFILE_RELEASE_DEBUG=true cargo build --release
1.5 Performance Build Rules
- Always test without ASAN (2-5x overhead)
- Use RelWithDebInfo or -g (not -O0, not stripped)
- Verify symbols: nm <app> | grep <function> or objdump -t <app> | grep <function>
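The "not stripped" rule can also be checked from the text that the file command prints for an ELF binary. This is a minimal sketch; is_stripped is a hypothetical helper name, and it only inspects the description string it is given.

```shell
# is_stripped FILE_OUTPUT: succeed (exit 0) if the text printed by `file`
# reports a stripped binary. "not stripped" is tested first because it
# also contains the word "stripped".
is_stripped() {
    case $1 in
        *"not stripped"*) return 1 ;;
        *stripped*)       return 0 ;;
        *)                return 1 ;;
    esac
}

# Example use against a real binary (path is a placeholder):
#   if is_stripped "$(file ./app)"; then echo "rebuild with -g"; fi
```

Checking the description first is cheaper than grepping nm output for every function of interest, though nm remains the way to confirm that one specific symbol is present.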
Phase 2: Data Collection
2.1 Basic Recording
pidof <app_name>
ps aux | grep <app_name>
perf record -p <PID> -g -o perf.data -- sleep 30
2.2 Recording During Load Test
perf record -p <PID> -g -o perf.data -- sleep 30 &
redis-benchmark -h 127.0.0.1 -p 6379 -t set,get -n 500000 -c 500 -q
wait
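The record, load, wait pattern above can be wrapped in a small function so that the load generator's exit status is not lost. This is a sketch under assumptions: profile_under_load is a hypothetical name, and PERF is a hypothetical override for the perf binary that defaults to perf.

```shell
# profile_under_load PID OUTPUT SECONDS LOAD_CMD [ARGS...]
# Start perf record in the background, run the load generator in the
# foreground, then wait for perf to finish writing OUTPUT.
profile_under_load() {
    pul_pid=$1 pul_out=$2 pul_secs=$3
    shift 3
    "${PERF:-perf}" record -p "$pul_pid" -g -o "$pul_out" -- sleep "$pul_secs" &
    pul_job=$!
    "$@"                     # load generator, e.g. redis-benchmark ...
    pul_status=$?
    wait "$pul_job"          # let perf finish writing before analysis
    return "$pul_status"     # report the load generator's status
}

# Example:
#   profile_under_load "$(pidof <app_name>)" perf.data 30 \
#       redis-benchmark -h 127.0.0.1 -p 6379 -t set,get -n 500000 -c 500 -q
```

The recording window should be at least as long as the load run, otherwise part of the load is not sampled.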
2.3 Advanced Recording Options
perf record -p <PID> -g -e cycles,instructions,cache-misses -o perf.data -- sleep 30
perf record -p <PID> -g -F 99 -o perf.data -- sleep 30
perf record -p $(pidof <app_name>) -g -o perf.data -- sleep 30
2.4 Key Parameters
| Parameter | Meaning | Typical Value |
|---|---|---|
| -p <PID> | Target process ID | $(pidof app) |
| -g | Record call graph | Always use |
| -e cycles | Sampling event | cycles, instructions, cache-misses |
| -F 99 | Sampling frequency (Hz) | 99 (prod), 997 (dev) |
| -o perf.data | Output file | perf.data |
| -- sleep N | Duration | 10-30 seconds |
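Frequency and duration together bound how many samples a recording can contain, which is why very short recordings are noisy. A rough sketch of the arithmetic, assuming the target keeps the given number of CPUs busy for the whole interval (expected_samples is a hypothetical helper):

```shell
# expected_samples FREQ_HZ SECONDS BUSY_CPUS
# Approximate upper bound: one sample per tick on each CPU that runs the
# target for the entire recording.
expected_samples() {
    echo $(( $1 * $2 * $3 ))
}

expected_samples 99 30 1     # prints 2970
expected_samples 997 10 4    # prints 39880
```

A mostly idle process collects far fewer samples than this bound, since it is only sampled while it is running.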
Phase 3: Data Analysis
3.1 Interactive Analysis (Best First Step)
perf report -i perf.data -g graph
3.2 Flat Profile (Top Hotspots)
perf report --stdio -i perf.data -g none --sort symbol | head -50
perf report --stdio -i perf.data -g none -n --sort symbol | head -50
3.3 Call Graph Analysis
perf report --stdio -i perf.data -g graph --max-stack 20 | head -200
perf report --stdio -i perf.data -g graph --symbol-filter 'tcp_sendmsg'
3.4 Module-Level Analysis
perf report --stdio -i perf.data --sort dso -g none
perf report --stdio -i perf.data --dsos <app_name>,libc.so.6 -g graph
perf report --stdio -i perf.data --dsos '[kernel]' -g graph
3.5 Statistical Reports
perf report --stdio -i perf.data --show-total-period --show-nr-samples
perf report --stdio -i perf.data --sort dso,symbol -g none
Phase 4: Bottleneck Diagnosis
4.1 Common Bottleneck Patterns
| Symptom | Likely Cause | Optimization Focus |
|---|---|---|
| High syscall % / entry_SYSCALL_64 | Frequent system calls | Reduce syscall count, batch operations |
| High futex_wait / futex_wake | Lock contention | Reduce lock scope, use lock-free algorithms |
| High tcp_sendmsg / __send | Small packets, frequent sends | Batching, buffering, zero-copy |
| High epoll_wait | I/O event processing | Event aggregation, async I/O |
| High memmove / memcpy | Memory copies | Reduce copies, use views/references |
| High operator new / malloc | Frequent allocation | Object pools, allocators, reuse |
| High schedule / __schedule | Context switches | Reduce thread count, affinity pinning |
| High strcmp / strlen | String operations | Use string views, pre-hash |
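The table can be applied mechanically to a symbol name taken from a report. The sketch below encodes only the rows above; classify_symbol is a hypothetical helper, and the trailing wildcards (which also match suffixed variants of a name) are an assumption about how symbols appear in practice.

```shell
# classify_symbol NAME: map a hot symbol to the "Likely Cause" column of
# the bottleneck table. Unknown names fall through to a default message.
classify_symbol() {
    case $1 in
        entry_SYSCALL_64*)        echo "Frequent system calls" ;;
        futex_wait*|futex_wake*)  echo "Lock contention" ;;
        tcp_sendmsg*|__send*)     echo "Small packets, frequent sends" ;;
        epoll_wait*)              echo "I/O event processing" ;;
        memmove*|memcpy*)         echo "Memory copies" ;;
        "operator new"*|malloc*)  echo "Frequent allocation" ;;
        schedule|__schedule)      echo "Context switches" ;;
        strcmp*|strlen*)          echo "String operations" ;;
        *)                        echo "No table entry" ;;
    esac
}

classify_symbol futex_wait    # prints: Lock contention
```

A match is only a starting hypothesis; the call graph still has to show which user-space caller is responsible.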
4.2 User vs Kernel Analysis
Rule: User space is the "cause", kernel space is the "effect"
perf report --stdio -i perf.data --sort dso -g none | head -20
perf report --stdio -i perf.data -g graph --max-stack 30 | grep -A 5 -B 5 'kernel'
4.3 Hotspot Investigation Workflow
perf report --stdio -i perf.data -g none --sort symbol | head -10
perf report --stdio -i perf.data -g graph --symbol-filter '<hotspot>'
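The first step of this workflow relies on head, which also counts the '#' header lines that perf report --stdio prints. A minimal sketch of a filter that returns only data rows; top_hotspots is a hypothetical helper, and it assumes header lines start with '#'.

```shell
# top_hotspots N: read `perf report --stdio` text on stdin and print the
# first N data rows, skipping '#' header lines and blank lines.
top_hotspots() {
    awk -v n="$1" '/^#/ || NF == 0 { next } { print; if (++c >= n) exit }'
}

# Example (--no-children sorts by self overhead instead of accumulated cost):
#   perf report --stdio -i perf.data --no-children -g none --sort symbol | top_hotspots 10
```

Each symbol it returns can then be passed to --symbol-filter for the call graph step.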
Phase 5: Comparative Analysis
5.1 ASAN vs Non-ASAN Comparison
perf record -p <ASAN_PID> -g -o perf-asan.data -- sleep 15
perf record -p <NORMAL_PID> -g -o perf-normal.data -- sleep 15
perf diff perf-asan.data perf-normal.data
5.2 Load Comparison
for c in 100 500 1000; do
perf record -p $PID -g -o perf-$c.data -- sleep 10 &
redis-benchmark -c $c -t set,get -n 100000 -q
wait
done
perf diff perf-100.data perf-500.data perf-1000.data
5.3 Optimization Validation
perf record -p $PID_BEFORE -g -o before.data -- sleep 15
perf report --stdio -i before.data -g none > before.txt
perf record -p $PID_AFTER -g -o after.data -- sleep 15
perf report --stdio -i after.data -g none > after.txt
diff before.txt after.txt
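A plain diff of two reports is noisy because almost every percentage shifts. To track one function across the two files, its overhead can be pulled out directly. This is a sketch with assumptions: overhead_of is a hypothetical helper, the symbol is the last whitespace-separated field of its row (so names containing spaces, such as some C++ signatures, will not match), and dictFind is only an example symbol.

```shell
# overhead_of SYMBOL: read a flat `perf report --stdio` listing on stdin and
# print the first percentage column of the row whose last field is SYMBOL.
# Prints nothing if the symbol does not appear.
overhead_of() {
    awk -v sym="$1" '!/^#/ && $NF == sym { sub(/%$/, "", $1); print $1; exit }'
}

# Example:
#   before=$(overhead_of dictFind < before.txt)
#   after=$(overhead_of dictFind < after.txt)
#   echo "dictFind: ${before}% -> ${after}%"
```

Percentages are relative to each recording, so a drop here should be confirmed against an absolute measure such as benchmark throughput.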
Common Mistakes
Permission-Related Mistakes
- Running perf without setup → Always check /proc/sys/kernel/perf_event_paranoid first
- Using sudo for every command → Use setcap one-time setup instead
- Forgetting to chown perf.data → If you must use sudo, run sudo chown $(whoami):$(whoami) perf.data
Data Collection Mistakes
- Sampling with ASAN enabled → ASAN adds 2-5x overhead, distorts results
- Using -O0 builds → Optimization changes hotspots, profile optimized code
- Stripped binaries → No symbols = no function names, useless data
- Too short recordings (< 5s) → Insufficient samples, statistical noise
- Recording idle processes → Record during actual load, not idle time
Analysis Mistakes
- Looking at Children% instead of Self% → Self% = CPU spent in the function itself, Children% = Self% plus callee cost
- Blaming kernel for slowness → Kernel overhead is symptom, optimize user-space callers
- Optimizing without measuring → Profile first, then optimize, then verify
- Focusing on micro-optimizations → Look for algorithmic/structural changes first
Analysis Best Practices
1. Always Use Self% for Prioritization
perf report --stdio -i perf.data -g none -n --sort symbol | head -20
2. Follow the Call Chain
perf report --stdio -i perf.data -g graph --max-stack 20 | less
3. Focus on Your Code
perf report --stdio -i perf.data --dsos <your_app> -g graph
perf report --stdio -i perf.data --dsos='![kernel],[vdso]' -g graph
4. Validate with Multiple Runs
for i in 1 2 3; do
perf record -p $PID -g -o run-$i.data -- sleep 10 &
load_test
wait
done
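Once several runs exist, the same metric from each run can be summarized to see how much it varies. A minimal sketch, assuming one number per line on stdin (for example the Self% of one symbol in each run); summarize_runs is a hypothetical helper.

```shell
# summarize_runs: read one number per line on stdin and print the mean,
# minimum and maximum. Prints nothing for empty input.
summarize_runs() {
    awk 'NR == 1 { min = max = $1 + 0 }
         { v = $1 + 0; sum += v; if (v < min) min = v; if (v > max) max = v }
         END { if (NR) printf "mean=%.2f min=%.2f max=%.2f\n", sum / NR, min, max }'
}

printf '10\n12\n14\n' | summarize_runs    # prints: mean=12.00 min=10.00 max=14.00
```

If the spread between minimum and maximum is larger than the improvement being claimed, more or longer runs are needed.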
Performance Tuning Checklist
Use this checklist when optimizing based on perf results:
Advanced Analysis Scenarios
CPU Cache Analysis
perf record -p $PID -g -e cache-misses,cache-references -o cache.data -- sleep 15
perf report --stdio -i cache.data -g none --sort symbol | head -20
perf stat -e cache-misses,cache-references -p $PID -- sleep 10
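The two counters reported by perf stat are easier to compare across runs as a ratio. A small sketch of that arithmetic; cache_miss_pct is a hypothetical helper and the counts are typed in by hand from the perf stat output.

```shell
# cache_miss_pct MISSES REFERENCES: cache-misses as a percentage of
# cache-references. Prints nothing if REFERENCES is zero.
cache_miss_pct() {
    awk -v m="$1" -v r="$2" 'BEGIN { if (r > 0) printf "%.2f\n", 100 * m / r }'
}

cache_miss_pct 250 1000    # prints 25.00
```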
Instruction Efficiency
perf record -p $PID -g -e cycles,instructions -o ipc.data -- sleep 15
perf report --stdio -i ipc.data -g none
perf stat -e cycles,instructions -p $PID -- sleep 10
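Instruction efficiency is usually expressed as instructions per cycle (IPC), the instruction count divided by the cycle count. perf stat normally prints this ratio itself when both events are counted; the sketch below shows the same calculation for counts gathered elsewhere. The name ipc is a hypothetical helper.

```shell
# ipc INSTRUCTIONS CYCLES: instructions retired per CPU cycle.
# Prints nothing if CYCLES is zero.
ipc() {
    awk -v i="$1" -v c="$2" 'BEGIN { if (c > 0) printf "%.2f\n", i / c }'
}

ipc 3000000 2000000    # prints 1.50
```

A falling IPC at similar instruction counts suggests the code is stalling more, which is a reason to look at the cache analysis above.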
Context Switch Analysis
perf record -p $PID -g -e context-switches -o ctx.data -- sleep 15
perf report --stdio -i ctx.data -g none
Integration with Other Tools
Flame Graphs
sudo apt-get install linux-tools-common linux-tools-generic
git clone https://github.com/brendangregg/FlameGraph
perf script -i perf.data | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flamegraph.svg
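A small wrapper can make the dependency on the cloned scripts explicit and fail early when they are missing. This is a sketch under assumptions: make_flamegraph is a hypothetical name, and FLAMEGRAPH_DIR is a hypothetical variable pointing at the cloned repository (defaulting to ./FlameGraph).

```shell
# make_flamegraph PERF_DATA OUTPUT_SVG
# Convert a perf recording into a flame graph using the FlameGraph scripts.
make_flamegraph() {
    fg_dir=${FLAMEGRAPH_DIR:-./FlameGraph}
    if [ ! -x "$fg_dir/stackcollapse-perf.pl" ] || [ ! -x "$fg_dir/flamegraph.pl" ]; then
        echo "FlameGraph scripts not found in $fg_dir" >&2
        return 1
    fi
    perf script -i "$1" | "$fg_dir/stackcollapse-perf.pl" | "$fg_dir/flamegraph.pl" > "$2"
}

# Example:
#   make_flamegraph perf.data flamegraph.svg
```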
perf + GDB
perf report --stdio -i perf.data -g graph | grep <function>
perf annotate -i perf.data <function>
perf + pprof
perf script -i perf.data > perf.script
FAQ
Q: Why do I need sudo for perf?
A: Linux has two security mechanisms:
- perf_event_paranoid: Controls event access (kernel default: 2, restricted; some distributions ship a stricter value)
- kptr_restrict: Controls kernel symbol visibility (often 1 by default, restricted)
setcap lets the perf binary pass capability checks without sudo, but kernel symbols still need these settings modified.
Q: Can I profile without debug symbols?
A: Yes, but you'll only see addresses, not function names. Always compile with -g or RelWithDebInfo.
Q: Why does ASAN slow down my app so much?
A: ASAN adds runtime checks that add 2-5x overhead. Profile non-ASAN builds for accurate performance data.
Q: Should I look at Self% or Children%?
A: Always Self% - it's the actual CPU time spent in that function. Children% includes callee costs.
Q: Kernel shows as [unknown] or hex addresses, why?
A: This is the expected behavior without sudo. You have two options:
- No sudo: Filter to user-space only (--dsos <app>) - good for app-specific optimization
- With sudo: sudo sysctl -w kernel.perf_event_paranoid=-1 kernel.kptr_restrict=0 - shows kernel symbols
Q: My app is 80% in kernel, what do I do?
A: Don't optimize the kernel directly. Filter to user-space (--dsos <app>), trace back to find which user-space function is making the syscalls, then optimize that caller instead.
Q: How long should I record?
A: 10-30 seconds for development, 60+ seconds for production-like loads. Longer is better for statistical significance.
Q: Can I record multiple processes?
A: Yes: perf record -p <PID1>,<PID2>,<PID3> -g -o perf.data -- sleep 30
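When the PIDs come from a tool that prints them separated by spaces, such as pidof, they need to be joined with commas first. A minimal sketch; join_pids is a hypothetical helper.

```shell
# join_pids PID...: print the arguments as one comma-separated list,
# the form that `perf record -p` expects for several processes.
join_pids() {
    old_ifs=$IFS
    IFS=,
    printf '%s\n' "$*"     # "$*" joins arguments with the first IFS character
    IFS=$old_ifs
}

join_pids 101 202 303    # prints 101,202,303

# Example:
#   perf record -p "$(join_pids $(pidof <app_name>))" -g -o perf.data -- sleep 30
```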
Q: What if perf is not installed?
A: Install with: sudo apt-get install linux-tools-common linux-tools-$(uname -r) (Ubuntu/Debian)
Troubleshooting
perf: Permission denied
cat /proc/sys/kernel/perf_event_paranoid
sudo setcap "cap_perfmon,cap_sys_admin,cap_sys_ptrace+ep" /usr/bin/perf
perf: No symbols found
file <app>
nm <app> | grep <function>
perf: Operation not permitted
getcap /usr/bin/perf
sudo setcap "cap_perfmon,cap_sys_admin,cap_sys_ptrace+ep" /usr/bin/perf
perf.data: Permission denied
sudo chown $(whoami):$(whoami) perf.data
Summary: The 5-Step Workflow
- Setup: Check permissions, use setcap, compile with -g
- Record: perf record -p $PID -g -o perf.data -- sleep 30
- Analyze: perf report -i perf.data -g graph (interactive first)
- Diagnose: Focus on Self%, trace to user-space caller
- Validate: Re-profile after optimization to confirm improvement