This guide covers essential profiling tools and techniques for analyzing C++ performance.
flowchart TD
A[Identify Performance Issue] --> B[Profile with perf]
B --> C[Generate FlameGraph]
C --> D[Identify Hot Functions]
D --> E{CPU or Memory bound?}
E -->|CPU| F{Vectorizable?}
E -->|Memory| G{High cache misses?}
F -->|Yes| H[SIMD Optimization]
F -->|No| I[Algorithm Change]
G -->|Yes| J[Data Layout SOA]
G -->|No| K[Prefetching]
H --> L[Implement Fix]
I --> L
J --> L
K --> L
L --> M[Re-run Benchmark]
M --> N{Faster?}
N -->|Yes| O[Document & Commit]
N -->|No| P[Try Different Approach]
P --> B
style A fill:#ff6b6b
style O fill:#6bcb77
style N fill:#ffd93d
Performance optimization follows a simple cycle:
- Measure - Profile to find bottlenecks
- Analyze - Understand the root cause
- Optimize - Apply targeted improvements
- Verify - Measure again to confirm improvement
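
The measure and verify steps of this cycle can be sketched with a small timing helper. This is only a minimal sketch: `time_ns` and `speedup` are hypothetical names introduced here for illustration, not part of any tool covered in this guide, and a single wall-clock reading is far noisier than a proper benchmark run.

```cpp
#include <chrono>
#include <cstdint>

// Hypothetical helper: wall-clock duration of one call to `work`, in nanoseconds.
template <typename F>
std::int64_t time_ns(F&& work) {
    const auto start = std::chrono::steady_clock::now();
    work();
    const auto stop = std::chrono::steady_clock::now();
    return std::chrono::duration_cast<std::chrono::nanoseconds>(stop - start).count();
}

// Hypothetical helper: ratio of baseline time to optimized time.
// A value above 1.0 means the change made the code faster.
inline double speedup(double baseline_ns, double optimized_ns) {
    return baseline_ns / optimized_ns;
}
```

Timing the same workload before and after a change and passing both readings to `speedup` gives a single number to record when verifying an optimization.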
perf is the standard Linux profiling tool.
sequenceDiagram
participant Dev as Developer
participant Perf as perf
participant App as Application
participant FG as FlameGraph
Dev->>Perf: perf record -g ./benchmark
Perf->>App: Execute with sampling
App-->>Perf: Profile data (perf.data)
Dev->>Perf: perf script
Perf-->>Dev: Call stacks
Dev->>FG: stackcollapse + flamegraph.pl
FG-->>Dev: flamegraph.svg
Dev->>Dev: Identify hotspots
# Ubuntu/Debian
sudo apt-get install linux-tools-common linux-tools-generic
# Fedora
sudo dnf install perf

# Record CPU samples
perf record -g ./your_benchmark
# View report
perf report
# Show annotated source
perf annotate

# CPU cycles breakdown
perf stat ./your_benchmark
# Cache miss analysis
perf stat -e cache-references,cache-misses,L1-dcache-load-misses ./your_benchmark
# Branch prediction
perf stat -e branches,branch-misses ./your_benchmark
# Record with call graph (dwarf for C++)
perf record -g --call-graph dwarf ./your_benchmark

FlameGraphs provide intuitive visualization of where time is spent.
# Generate FlameGraph for a benchmark
./tools/performance/generate_flamegraph.sh ./build/release/examples/02-memory-cache/bench/aos_soa_bench
# View the result
firefox flamegraph.svg

# Clone FlameGraph tools (if not already done)
git clone https://github.com/brendangregg/FlameGraph.git
# Record with perf
perf record -F 99 -g ./your_benchmark
# Generate FlameGraph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flamegraph.svg

- Width = Time spent (wider = more time)
- Height = Call stack depth
- Color = Random (no meaning)
- Top = Currently executing function
- Bottom = Entry point (main)
Look for:
- Wide plateaus (hot functions)
- Deep stacks (excessive call depth)
- Unexpected functions taking time
Valgrind provides detailed memory and cache analysis.
# Run cache simulation
valgrind --tool=cachegrind ./your_benchmark
# View results
cg_annotate cachegrind.out.*

Output shows:
- I1 cache misses (instruction cache)
- D1 cache misses (L1 data cache)
- LL cache misses (last-level cache)
# Run call graph profiling
valgrind --tool=callgrind ./your_benchmark
# View with KCachegrind (GUI)
kcachegrind callgrind.out.*

VTune provides the most detailed analysis on Intel CPUs.
Download from Intel oneAPI.
# Hotspots analysis
vtune -collect hotspots ./your_benchmark
# Memory access analysis
vtune -collect memory-access ./your_benchmark
# Microarchitecture analysis
vtune -collect uarch-exploration ./your_benchmark
# View results
vtune-gui

- Start with perf stat for overview
- Use perf record + FlameGraph to find hot functions
- Use perf annotate to see hot instructions
- Check vectorization with compiler reports
# Check if code is vectorized
g++ -O3 -march=native -fopt-info-vec-optimized your_code.cpp

- Check cache misses with perf stat
- Use Cachegrind for detailed cache analysis
- Look for:
  - High L1 miss rate (> 5%)
  - High LLC miss rate (> 1%)
  - Poor spatial locality
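
The miss rates above are ratios of two counters, so they have to be computed from the raw numbers that perf stat prints. A minimal sketch, with hypothetical function names and the guide's 5% rule of thumb hard-coded as an assumption:

```cpp
#include <cstdint>

// Miss rate in percent from two counter values copied out of perf stat output,
// e.g. L1-dcache-load-misses and L1-dcache-loads.
// Returns 0.0 when no loads were counted, to avoid dividing by zero.
inline double miss_rate_percent(std::uint64_t misses, std::uint64_t loads) {
    if (loads == 0) return 0.0;
    return 100.0 * static_cast<double>(misses) / static_cast<double>(loads);
}

// Applies the 5% L1 threshold from the list above. It is a rule of thumb,
// not a hard limit; what counts as "high" depends on the workload.
inline bool l1_miss_rate_is_high(std::uint64_t misses, std::uint64_t loads) {
    return miss_rate_percent(misses, loads) > 5.0;
}
```

For instance, an illustrative run with 80 misses per 1000 loads gives an 8% miss rate, which this check would flag.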
# Quick cache check
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./your_benchmark

- Check for false sharing
- Analyze lock contention
- Verify thread scaling
# Check for cache line bouncing (a coarse false sharing indicator; perf c2c reports contention per cache line)
perf stat -e cache-misses ./your_benchmark
# Run with different thread counts
OMP_NUM_THREADS=1 ./your_benchmark
OMP_NUM_THREADS=2 ./your_benchmark
OMP_NUM_THREADS=4 ./your_benchmark

Symptoms:
- High L1/L2/L3 miss rates
- Memory bandwidth saturation
Solutions:
- Improve data locality (SOA layout)
- Use prefetching
- Reduce working set size
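
To make the SOA suggestion concrete, the sketch below contrasts the two layouts for a loop that reads a single field. The particle types and function names are hypothetical and chosen only for illustration.

```cpp
#include <vector>

// Array of structures (AoS): x, y, z, and mass of one particle are adjacent,
// so a loop that reads only x still pulls y, z, and mass into the cache.
struct ParticleAoS {
    float x, y, z, mass;
};

// Structure of arrays (SoA): each field is stored contiguously,
// so a loop over x touches only the bytes it needs.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

inline float sum_x_aos(const std::vector<ParticleAoS>& particles) {
    float sum = 0.0f;
    for (const ParticleAoS& p : particles) sum += p.x;
    return sum;
}

inline float sum_x_soa(const ParticlesSoA& particles) {
    float sum = 0.0f;
    for (float x : particles.x) sum += x;
    return sum;
}
```

Both functions compute the same result. With this 16-byte struct, the AoS loop uses only a quarter of every cache line it loads, while the SoA loop uses all of it. Whether that shows up as a measurable speedup depends on the working set size and should be confirmed with the cache counters.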
Symptoms:
- High branch-misses count
- Unpredictable control flow
Solutions:
- Use branchless code
- Sort data to improve prediction
- Use CMOV instructions
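
As an illustration of the branchless suggestion, the sketch below counts elements above a threshold in two ways. The function names are hypothetical. An optimizing compiler may already turn the first version into the second, so the generated assembly is worth checking with perf annotate before rewriting anything by hand.

```cpp
#include <cstddef>
#include <vector>

// Branchy version: the `if` is hard to predict when the data is unsorted.
inline std::size_t count_above_branchy(const std::vector<int>& data, int threshold) {
    std::size_t count = 0;
    for (int v : data) {
        if (v > threshold) ++count;
    }
    return count;
}

// Branchless version: the comparison result (0 or 1) is added directly,
// so the loop body contains no data-dependent jump.
inline std::size_t count_above_branchless(const std::vector<int>& data, int threshold) {
    std::size_t count = 0;
    for (int v : data) {
        count += static_cast<std::size_t>(v > threshold);
    }
    return count;
}
```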
Symptoms:
- Poor multi-threaded scaling
- High cache-to-cache transfers
Solutions:
- Pad data to cache line boundaries
- Use thread-local storage
- Reduce shared state
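
Padding to cache line boundaries can be sketched as follows. The 64-byte line size is an assumption that holds on most x86-64 parts but not everywhere (C++17 also defines `std::hardware_destructive_interference_size` in `<new>`, though not every toolchain provides it), and the type and function names are hypothetical.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; common on x86-64 but not universal.
constexpr std::size_t kCacheLineSize = 64;

// Each counter is aligned to, and therefore padded out to, a full cache line,
// so counters owned by different threads never share a line.
struct alignas(kCacheLineSize) PaddedCounter {
    std::atomic<std::uint64_t> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLineSize, "one counter per cache line");

// Byte distance between neighbouring counters in an array.
template <std::size_t N>
std::size_t counter_stride(const std::array<PaddedCounter, N>& counters) {
    static_assert(N >= 2, "need two counters to measure a stride");
    return static_cast<std::size_t>(
        reinterpret_cast<const char*>(&counters[1]) -
        reinterpret_cast<const char*>(&counters[0]));
}
```

With one `PaddedCounter` per thread, an increment by one thread no longer invalidates the line that holds another thread's counter. The cost is memory: each 8-byte counter now occupies 64 bytes.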
Symptoms:
- Scalar code in hot loops
- No SIMD instructions in assembly
Solutions:
- Align data
- Use restrict-qualified pointers (__restrict in GCC, Clang, and MSVC; standard C++ has no restrict keyword)
- Simplify loop structure
- Use explicit SIMD intrinsics
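
A minimal sketch of a loop written with these points in mind is shown below. It relies on the `__restrict` compiler extension rather than standard C++, and the function name `saxpy` is just the conventional label for this operation. Whether the loop is actually vectorized still has to be confirmed with the compiler's vectorization report.

```cpp
#include <cstddef>

// Computes y[i] += a * x[i].
// __restrict is a compiler extension (GCC, Clang, MSVC) that promises x and y
// do not overlap, which removes one common obstacle to auto-vectorization.
inline void saxpy(std::size_t n, float a,
                  const float* __restrict x, float* __restrict y) {
    // A plain counted loop with no early exit and no function call is the
    // shape auto-vectorizers handle most reliably.
    for (std::size_t i = 0; i < n; ++i) {
        y[i] += a * x[i];
    }
}
```

Calling `saxpy` with overlapping arrays breaks the promise made by `__restrict` and is undefined behavior, so the qualifier belongs only where the caller can guarantee distinct buffers.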
// Prevent dead code elimination
benchmark::DoNotOptimize(result);
// Force memory writes to be visible
benchmark::ClobberMemory();

// Run a few iterations before measuring
for (int i = 0; i < warmup_iterations; ++i) {
    do_work();
}

# Disable CPU frequency scaling
sudo cpupower frequency-set --governor performance
# Pin to specific CPU
taskset -c 0 ./your_benchmark
# Disable ASLR for reproducibility
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

- Run multiple iterations
- Report mean, median, and standard deviation
- Use Google Benchmark's built-in statistics
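
For hand-rolled timing loops that do not go through Google Benchmark, the three statistics can be computed directly. The sketch below is an illustration only: `Summary` and `summarize` are hypothetical names, and it uses the sample standard deviation (dividing by n - 1), which is one of several reasonable choices.

```cpp
#include <algorithm>
#include <cmath>
#include <cstddef>
#include <vector>

struct Summary {
    double mean;
    double median;
    double stddev;  // sample standard deviation (divides by n - 1)
};

// Summarizes timing samples. The vector is taken by value because it is sorted.
// Returns all zeros for an empty input; stddev is 0 for a single sample.
inline Summary summarize(std::vector<double> samples) {
    const std::size_t n = samples.size();
    if (n == 0) return {0.0, 0.0, 0.0};

    std::sort(samples.begin(), samples.end());
    const double median = (n % 2 == 1)
        ? samples[n / 2]
        : 0.5 * (samples[n / 2 - 1] + samples[n / 2]);

    double sum = 0.0;
    for (double s : samples) sum += s;
    const double mean = sum / static_cast<double>(n);

    double squared = 0.0;
    for (double s : samples) squared += (s - mean) * (s - mean);
    const double stddev =
        (n > 1) ? std::sqrt(squared / static_cast<double>(n - 1)) : 0.0;

    return {mean, median, stddev};
}
```

A median well below the mean usually points to a few slow outlier runs, which is a hint to check the machine setup before trusting the numbers.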
# Run with statistics
./your_benchmark --benchmark_repetitions=10 --benchmark_report_aggregates_only=true

| Task | Tool | Command |
|---|---|---|
| CPU hotspots | perf | perf record -g ./bench && perf report |
| Cache misses | perf | perf stat -e cache-misses ./bench |
| Visual profile | FlameGraph | ./tools/performance/generate_flamegraph.sh ./bench |
| Detailed cache | Valgrind | valgrind --tool=cachegrind ./bench |
| Call graph | Valgrind | valgrind --tool=callgrind ./bench |
| Vectorization | GCC | -fopt-info-vec-optimized |
| Vectorization | Clang | -Rpass=loop-vectorize |
- Learning Path - Follow the curriculum
- Best Practices - Optimization patterns
- Troubleshooting - Common issues