Profiling Guide

This guide covers essential profiling tools and techniques for analyzing C++ performance.

Optimization Workflow

flowchart TD
    A[Identify Performance Issue] --> B[Profile with perf]
    B --> C[Generate FlameGraph]
    C --> D[Identify Hot Functions]
    D --> E{CPU or Memory bound?}
    
    E -->|CPU| F{Vectorizable?}
    E -->|Memory| G{High cache misses?}
    
    F -->|Yes| H[SIMD Optimization]
    F -->|No| I[Algorithm Change]
    
    G -->|Yes| J[Data Layout SOA]
    G -->|No| K[Prefetching]
    
    H --> L[Implement Fix]
    I --> L
    J --> L
    K --> L
    
    L --> M[Re-run Benchmark]
    M --> N{Faster?}
    N -->|Yes| O[Document & Commit]
    N -->|No| P[Try Different Approach]
    P --> B
    
    style A fill:#ff6b6b
    style O fill:#6bcb77
    style N fill:#ffd93d

Overview

Performance optimization follows a simple cycle:

Measure - Profile to find bottlenecks
Analyze - Understand the root cause
Optimize - Apply targeted improvements
Verify - Measure again to confirm improvement

Tools

perf (Linux)

perf is the standard Linux profiling tool.

Profiling Workflow with perf

sequenceDiagram
    participant Dev as Developer
    participant Perf as perf
    participant App as Application
    participant FG as FlameGraph
    
    Dev->>Perf: perf record -g ./benchmark
    Perf->>App: Execute with sampling
    App-->>Perf: Profile data (perf.data)
    Dev->>Perf: perf script
    Perf-->>Dev: Call stacks
    Dev->>FG: stackcollapse + flamegraph.pl
    FG-->>Dev: flamegraph.svg
    Dev->>Dev: Identify hotspots

Installation

# Ubuntu/Debian (the kernel-specific package must match the running kernel)
sudo apt-get install linux-tools-common linux-tools-generic linux-tools-$(uname -r)

# Fedora
sudo dnf install perf

Basic Usage

# Record CPU samples
perf record -g ./your_benchmark

# View report
perf report

# Show annotated disassembly (interleaved with source when built with -g)
perf annotate

Useful Commands

# Summary counters (cycles, instructions, IPC, branches, context switches)
perf stat ./your_benchmark

# Cache miss analysis
perf stat -e cache-references,cache-misses,L1-dcache-load-misses ./your_benchmark

# Branch prediction
perf stat -e branches,branch-misses ./your_benchmark

# Record with call graph (dwarf unwinding for C++; --call-graph already implies -g)
perf record --call-graph dwarf ./your_benchmark

FlameGraph

FlameGraphs provide intuitive visualization of where time is spent.

Using the Project Script

# Generate FlameGraph for a benchmark
./tools/performance/generate_flamegraph.sh ./build/release/examples/02-memory-cache/bench/aos_soa_bench

# View the result
firefox flamegraph.svg

Manual Generation

# Clone FlameGraph tools (if not already done)
git clone https://github.com/brendangregg/FlameGraph.git

# Record with perf
perf record -F 99 -g ./your_benchmark

# Generate FlameGraph
perf script | ./FlameGraph/stackcollapse-perf.pl | ./FlameGraph/flamegraph.pl > flamegraph.svg

Reading FlameGraphs

Width = Share of samples in which the function is on the stack (wider = more time)
Height = Call stack depth
Color = Random warm hues by default (no meaning)
Top = Function executing on the CPU when the sample was taken
Bottom = Entry point (main)

Look for:

Wide plateaus (hot functions)
Deep stacks (excessive call depth)
Unexpected functions taking time
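
To make the "width = samples" rule concrete, here is a minimal sketch that tallies self samples per leaf function from folded stacks, the intermediate text that stackcollapse-perf.pl emits (frames separated by semicolons from root to leaf, then a space and a sample count). The function name self_samples is hypothetical and not part of the FlameGraph tools, and the sketch assumes well-formed numeric counts.

```cpp
#include <cassert>
#include <map>
#include <sstream>
#include <string>

// Tally "self" samples per leaf frame from folded-stack text, i.e. lines such
// as "main;parse;tokenize 42". Hypothetical helper for illustration only.
std::map<std::string, long> self_samples(const std::string& folded) {
    std::map<std::string, long> totals;
    std::istringstream in(folded);
    std::string line;
    while (std::getline(in, line)) {
        // The count follows the LAST space: C++ frame names may contain spaces.
        const auto space = line.rfind(' ');
        if (space == std::string::npos || space + 1 == line.size()) continue;
        const std::string stack = line.substr(0, space);
        const long count = std::stol(line.substr(space + 1));
        assert(count >= 0);  // sample counts are never negative
        const auto semi = stack.rfind(';');
        const std::string leaf =
            (semi == std::string::npos) ? stack : stack.substr(semi + 1);
        totals[leaf] += count;  // the leaf is the frame drawn at the top
    }
    return totals;
}
```

A leaf with a large total is what shows up as a wide plateau at the top of the graph.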

Valgrind

Valgrind provides detailed memory and cache analysis.

Cachegrind (Cache Simulation)

# Run cache simulation
valgrind --tool=cachegrind ./your_benchmark

# View results (after several runs, name one output file instead of using the glob)
cg_annotate cachegrind.out.*

Output shows:

I1 cache misses (instruction cache)
D1 cache misses (L1 data cache)
LL cache misses (last-level cache)

Callgrind (Call Graph Profiling)

# Run call graph profiling
valgrind --tool=callgrind ./your_benchmark

# View with KCachegrind (GUI)
kcachegrind callgrind.out.*

Intel VTune (Advanced)

VTune provides the most detailed analysis on Intel CPUs.

Installation

Download from Intel oneAPI.

Basic Usage

# Hotspots analysis
vtune -collect hotspots ./your_benchmark

# Memory access analysis
vtune -collect memory-access ./your_benchmark

# Microarchitecture analysis
vtune -collect uarch-exploration ./your_benchmark

# View results
vtune-gui

Profiling Strategies

CPU-Bound Code

Start with perf stat for overview
Use perf record + FlameGraph to find hot functions
Use perf annotate to see hot instructions
Check vectorization with compiler reports

# Check if code is vectorized
g++ -O3 -march=native -fopt-info-vec-optimized your_code.cpp

Memory-Bound Code

Check cache misses with perf stat
Use Cachegrind for detailed cache analysis
Look for (rough rules of thumb; acceptable rates depend on the workload):
- High L1 miss rate (> 5%)
- High LLC miss rate (> 1%)
- Poor spatial locality

# Quick cache check
perf stat -e L1-dcache-load-misses,L1-dcache-loads ./your_benchmark
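
Poor spatial locality is often a matter of traversal order. As a minimal sketch (the function names are hypothetical and not taken from the project), the two functions below sum the same row-major matrix. The first walks consecutive addresses, while the second jumps by a full row on every step, which would be expected to raise the L1 miss rate once the matrix outgrows the cache.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sum a rows x cols matrix stored row-major in a flat vector.
// The inner loop touches consecutive elements: good spatial locality.
double sum_row_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r)
        for (std::size_t c = 0; c < cols; ++c)
            total += m[r * cols + c];
    return total;
}

// Same data, but the inner loop strides by `cols` elements, so each access
// may land on a different cache line: poor spatial locality.
double sum_column_order(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    assert(m.size() == rows * cols);
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c)
        for (std::size_t r = 0; r < rows; ++r)
            total += m[r * cols + c];
    return total;
}
```

Running each variant under the perf stat command above and dividing L1-dcache-load-misses by L1-dcache-loads gives the miss rate to compare.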

Multi-threaded Code

Check for false sharing
Analyze lock contention
Verify thread scaling

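# Coarse check: cache misses that grow with thread count can indicate false sharing
perf stat -e cache-misses ./your_benchmark
# Confirm with perf c2c, which reports contended cache lines
perf c2c record ./your_benchmark && perf c2c report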

# Run with different thread counts
OMP_NUM_THREADS=1 ./your_benchmark
OMP_NUM_THREADS=2 ./your_benchmark
OMP_NUM_THREADS=4 ./your_benchmark
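
Thread scaling is usually judged from the wall-clock times of such runs. A minimal sketch of the arithmetic, with hypothetical names: speedup is the single-thread time divided by the N-thread time, and efficiency is the speedup divided by N.

```cpp
#include <cassert>

struct Scaling {
    double speedup;     // t1 / tn; ideal value equals the thread count
    double efficiency;  // speedup / threads; ideal value is 1.0
};

// Hypothetical helper: derive scaling figures from two measured wall times.
Scaling scaling(double seconds_1_thread, double seconds_n_threads, int threads) {
    assert(seconds_1_thread > 0.0 && seconds_n_threads > 0.0 && threads > 0);
    const double speedup = seconds_1_thread / seconds_n_threads;
    return Scaling{speedup, speedup / threads};
}
```

For example, 8 s on one thread and 4 s on four threads gives a speedup of 2 and an efficiency of 0.5, which would suggest contention or a serial bottleneck worth profiling.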

Common Performance Issues

1. Cache Misses

Symptoms:

High L1/L2/L3 miss rates
Memory bandwidth saturation

Solutions:

Improve data locality (SOA layout)
Use prefetching
Reduce working set size
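
As an illustration of the SOA layout idea, the sketch below contrasts an array of structures with a structure of arrays. The type and function names are hypothetical and are not the ones used in the project's aos_soa_bench. When a loop reads only one field, the SOA version streams through a dense array instead of pulling unused fields into the cache.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Array of Structures: all fields of one particle sit next to each other.
struct ParticleAoS { float x, y, z, mass; };

// Structure of Arrays: each field is contiguous across all particles.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// Reads 4 of every 16 bytes fetched; the rest of each cache line is wasted.
float total_mass_aos(const std::vector<ParticleAoS>& ps) {
    float total = 0.0f;
    for (const ParticleAoS& p : ps) total += p.mass;
    return total;
}

// Reads a dense float array; every fetched byte is used.
float total_mass_soa(const ParticlesSoA& ps) {
    assert(ps.x.size() == ps.mass.size());  // all field arrays share one length
    float total = 0.0f;
    for (float m : ps.mass) total += m;
    return total;
}

// Convert between layouts so both loops can run on the same data.
ParticlesSoA to_soa(const std::vector<ParticleAoS>& ps) {
    ParticlesSoA out;
    for (const ParticleAoS& p : ps) {
        out.x.push_back(p.x);
        out.y.push_back(p.y);
        out.z.push_back(p.z);
        out.mass.push_back(p.mass);
    }
    return out;
}
```

Whether SOA wins depends on the access pattern: code that always touches every field of one element may do just as well with AOS.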

2. Branch Mispredictions

Symptoms:

High branch-misses count
Unpredictable control flow

Solutions:

Use branchless code
Sort data to improve prediction
Write simple selects that the compiler can lower to conditional moves (CMOV)
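
A minimal sketch of the branchless idea, using hypothetical function names: the second variant turns the comparison into a 0 or 1 multiplier, so the loop body has no data-dependent branch. Compilers may already produce branch-free code for the first variant, so it is worth confirming with perf stat that branch-misses actually drops.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// Branchy: on unsorted data the `if` is hard for the predictor to guess.
std::int64_t sum_above_branchy(const std::vector<int>& v, int threshold) {
    std::int64_t sum = 0;
    for (int x : v) {
        if (x >= threshold) sum += x;
    }
    return sum;
}

// Branchless: the comparison yields 0 or 1, and the multiply selects x or 0.
std::int64_t sum_above_branchless(const std::vector<int>& v, int threshold) {
    std::int64_t sum = 0;
    for (int x : v) {
        sum += static_cast<std::int64_t>(x) * (x >= threshold);
    }
    return sum;
}

// Both variants must agree; check this before comparing their timings.
void check_equivalent(const std::vector<int>& v, int threshold) {
    assert(sum_above_branchy(v, threshold) == sum_above_branchless(v, threshold));
}
```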

3. False Sharing

Symptoms:

Poor multi-threaded scaling
High cache-to-cache transfers

Solutions:

Pad data to cache line boundaries
Use thread-local storage
Reduce shared state
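
Padding to cache line boundaries can be sketched as follows. The 64-byte line size is an assumption that holds for most x86-64 parts but not for every CPU, and the names are hypothetical. Each per-thread counter is aligned and padded so that no two counters share a line.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <cstdint>

// Assumed cache line size; verify for the target CPU.
constexpr std::size_t kCacheLine = 64;

// alignas both aligns the object and rounds its size up to a full line,
// so neighbouring array elements never share a cache line.
struct alignas(kCacheLine) PaddedCounter {
    std::atomic<long> value{0};
};

static_assert(sizeof(PaddedCounter) == kCacheLine,
              "each counter should occupy exactly one cache line");

// True when two objects start on different cache lines.
bool on_different_lines(const void* a, const void* b) {
    assert(a != nullptr && b != nullptr);
    auto line = [](const void* p) {
        return reinterpret_cast<std::uintptr_t>(p) / kCacheLine;
    };
    return line(a) != line(b);
}
```

With an array such as `PaddedCounter counters[4]`, giving each thread its own element means the threads no longer invalidate each other's lines. Storing over-aligned types in standard containers relies on C++17 aligned allocation.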

4. Vectorization Failures

Symptoms:

Scalar code in hot loops
No SIMD instructions in assembly

Solutions:

Align data
Use restrict-qualified pointers (__restrict__ in GCC/Clang; C++ has no standard restrict keyword)
Simplify loop structure
Use explicit SIMD intrinsics
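
As a sketch of a loop shaped for auto-vectorization (names are hypothetical), the kernel below uses a simple counted loop and the `__restrict__` extension of GCC and Clang (MSVC spells it `__restrict`) to promise the compiler that the two arrays do not overlap.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// y[i] = a * x[i] + y[i]. The restrict qualifiers assert that x and y do not
// alias, which removes one common reason compilers refuse to vectorize.
void saxpy_kernel(float a, const float* __restrict__ x,
                  float* __restrict__ y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Convenience wrapper; the caller must pass two distinct vectors.
void saxpy(float a, const std::vector<float>& x, std::vector<float>& y) {
    assert(x.size() == y.size());
    saxpy_kernel(a, x.data(), y.data(), x.size());
}
```

Compiling with the -fopt-info-vec-optimized flag shown earlier should report whether the loop was vectorized. Passing overlapping arrays to a restrict-qualified function is undefined behavior.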

Benchmark Best Practices

Avoid Measurement Errors

// Prevent dead code elimination
benchmark::DoNotOptimize(result);

// Act as a compiler barrier so pending writes to memory are not optimized away
benchmark::ClobberMemory();

Warm Up Caches

// Run a few iterations before measuring
for (int i = 0; i < warmup_iterations; ++i) {
    do_work();
}
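
For code measured outside a benchmark framework, the warm-up idea can be sketched with the standard library alone. The helper name and signature are hypothetical: it runs the workload a few times unmeasured, then times the remaining iterations with a monotonic clock.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Run `work` unmeasured `warmup_iterations` times (to warm caches and branch
// predictors), then return the mean nanoseconds per measured iteration.
double time_with_warmup(const std::function<void()>& work,
                        int warmup_iterations, int measured_iterations) {
    assert(measured_iterations > 0);
    for (int i = 0; i < warmup_iterations; ++i) {
        work();
    }
    const auto start = std::chrono::steady_clock::now();
    for (int i = 0; i < measured_iterations; ++i) {
        work();
    }
    const auto stop = std::chrono::steady_clock::now();
    const std::chrono::duration<double, std::nano> elapsed = stop - start;
    return elapsed.count() / measured_iterations;
}
```

The workload still needs a visible side effect, otherwise the compiler may remove it. The call through std::function also adds a small overhead, so this is best suited to workloads that run for microseconds or longer.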

Control Environment

# Set the CPU frequency governor to performance (reduces frequency-scaling noise)
sudo cpupower frequency-set --governor performance

# Pin to specific CPU
taskset -c 0 ./your_benchmark

# Disable ASLR for reproducibility (system-wide; write 2 to restore the usual default)
echo 0 | sudo tee /proc/sys/kernel/randomize_va_space

Statistical Significance

Run multiple iterations
Report mean, median, and standard deviation
Use Google Benchmark's built-in statistics

# Run with statistics
./your_benchmark --benchmark_repetitions=10 --benchmark_report_aggregates_only=true
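
When timings are collected by hand rather than through Google Benchmark, the same three statistics can be computed directly. A minimal sketch with hypothetical names, using the sample standard deviation (division by n - 1):

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <numeric>
#include <vector>

struct Summary {
    double mean;
    double median;
    double stddev;  // sample standard deviation; 0 for a single sample
};

// Summarize repeated timing measurements (taken by value so it can sort).
Summary summarize(std::vector<double> samples) {
    assert(!samples.empty());
    const double n = static_cast<double>(samples.size());
    const double mean = std::accumulate(samples.begin(), samples.end(), 0.0) / n;

    std::sort(samples.begin(), samples.end());
    const std::size_t mid = samples.size() / 2;
    const double median = (samples.size() % 2 == 0)
                              ? (samples[mid - 1] + samples[mid]) / 2.0
                              : samples[mid];

    double squared_deviations = 0.0;
    for (double s : samples) squared_deviations += (s - mean) * (s - mean);
    const double stddev =
        samples.size() > 1 ? std::sqrt(squared_deviations / (n - 1.0)) : 0.0;

    return Summary{mean, median, stddev};
}
```

A median well below the mean usually points to a few slow outliers, for example from interrupts or frequency changes, which is why reporting both is useful.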

Quick Reference

Task	Tool	Command
CPU hotspots	perf	`perf record -g ./bench && perf report`
Cache misses	perf	`perf stat -e cache-misses ./bench`
Visual profile	FlameGraph	`./tools/performance/generate_flamegraph.sh ./bench`
Detailed cache	Valgrind	`valgrind --tool=cachegrind ./bench`
Call graph	Valgrind	`valgrind --tool=callgrind ./bench`
Vectorization	GCC	`-fopt-info-vec-optimized`
Vectorization	Clang	`-Rpass=loop-vectorize`
