This guide provides a recommended order for studying the HPC optimization examples, organized from beginner to advanced topics.
Understanding the memory hierarchy is fundamental to performance optimization:
graph TB
subgraph "CPU Memory Hierarchy"
REG[CPU Registers<br/>~1 cycle<br/>~512 bytes]:::success
L1[L1 Cache<br/>~4 cycles<br/>32-64 KB]:::process
L2[L2 Cache<br/>~12 cycles<br/>256-512 KB]:::decision
L3[L3 Cache<br/>~40 cycles<br/>8-32 MB]:::warning
RAM[Main Memory<br/>~200 cycles<br/>GBs]:::error
end
REG --> L1 --> L2 --> L3 --> RAM
classDef success fill:var(--wp-diagram-success),stroke:var(--wp-diagram-success-stroke),color:var(--wp-diagram-success-text)
classDef process fill:var(--wp-diagram-process),stroke:var(--wp-diagram-process-stroke),color:var(--wp-diagram-process-text)
classDef decision fill:var(--wp-diagram-decision),stroke:var(--wp-diagram-decision-stroke),color:var(--wp-diagram-decision-text)
classDef warning fill:var(--wp-diagram-warning),stroke:var(--wp-diagram-warning-stroke),color:var(--wp-diagram-warning-text)
classDef error fill:var(--wp-diagram-error),stroke:var(--wp-diagram-error-stroke),color:var(--wp-diagram-error-text)
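Key Insight: Each level is roughly 3-5x slower than the one above it, so an access that misses every cache and reaches main memory costs about 200x a register access. Optimizations that improve cache utilization often yield the largest gains.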
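
To make the cache-utilization point concrete, here is a minimal sketch (not taken from the repository's examples) that sums the same row-major matrix in two traversal orders. The function names are hypothetical. Both return the same result; only the memory access pattern differs, and on large matrices the strided version typically incurs far more cache misses.

```cpp
#include <cstddef>
#include <vector>

// Sum a rows x cols matrix stored in row-major order, walking memory sequentially.
double sum_row_major(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t r = 0; r < rows; ++r) {
        for (std::size_t c = 0; c < cols; ++c) {
            total += m[r * cols + c];  // consecutive accesses are 1 element apart
        }
    }
    return total;
}

// Same sum, but walking column by column over the same row-major storage.
double sum_column_major(const std::vector<double>& m, std::size_t rows, std::size_t cols) {
    double total = 0.0;
    for (std::size_t c = 0; c < cols; ++c) {
        for (std::size_t r = 0; r < rows; ++r) {
            total += m[r * cols + c];  // consecutive accesses are `cols` elements apart
        }
    }
    return total;
}
```

Timing the two functions on a matrix much larger than the L3 cache is a quick way to observe the hierarchy's effect on your own machine.
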
flowchart LR
A[Week 1<br/>Build System]:::start
B[Week 2<br/>Memory Basics]:::process
C[Week 3<br/>Modern C++]:::success
D[Week 4<br/>SIMD]:::action
E[Week 5<br/>Concurrency]:::warning
F[Week 6<br/>Profiling]:::complete
A --> B --> C --> D --> E --> F
classDef start fill:var(--wp-diagram-start),stroke:var(--wp-diagram-start-stroke),color:var(--wp-diagram-start-text)
classDef process fill:var(--wp-diagram-process),stroke:var(--wp-diagram-process-stroke),color:var(--wp-diagram-process-text)
classDef success fill:var(--wp-diagram-success),stroke:var(--wp-diagram-success-stroke),color:var(--wp-diagram-success-text)
classDef action fill:var(--wp-diagram-action),stroke:var(--wp-diagram-action-stroke),color:var(--wp-diagram-action-text)
classDef warning fill:var(--wp-diagram-warning),stroke:var(--wp-diagram-warning-stroke),color:var(--wp-diagram-warning-text)
classDef complete fill:var(--wp-diagram-complete),stroke:var(--wp-diagram-complete-stroke),color:var(--wp-diagram-complete-text)
Before starting, ensure you have:
- Basic C++ knowledge (classes, templates, STL)
- Familiarity with command-line tools
- Understanding of basic computer architecture concepts
See Prerequisites for details.
1.1 Modern CMake (examples/01-cmake-modern)
Start here to understand the project structure and build system.
Topics:
- Why target-based CMake is better than directory-based
- Using `target_include_directories` vs `include_directories`
- FetchContent for dependency management
- CMake presets for reproducible builds
Exercises:
- Build the project using different presets
- Add a new example module using the template
- Compare the anti-pattern and best-practice CMakeLists.txt files
2.1 Data Layout - AOS vs SOA (examples/02-memory-cache)
Understanding data layout is fundamental to cache optimization.
Key Concepts:
- Cache lines and spatial locality
- Array of Structures vs Structure of Arrays
- When to use each layout
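
The two layouts can be sketched as follows. This is an illustrative example, not the code in examples/02-memory-cache; the `ParticleAoS` and `ParticlesSoA` types are hypothetical. When a loop reads only one field, the SoA version touches only that field's contiguous array, while the AoS version also pulls the unused fields into cache.

```cpp
#include <cstddef>
#include <vector>

// Array of Structures: all fields of one particle sit next to each other.
struct ParticleAoS {
    float x, y, z, mass;
};

// Structure of Arrays: each field is stored contiguously across all particles.
struct ParticlesSoA {
    std::vector<float> x, y, z, mass;
};

// Reads only `mass`, but every cache line fetched also carries x, y, and z.
float total_mass_aos(const std::vector<ParticleAoS>& particles) {
    float total = 0.0f;
    for (const ParticleAoS& p : particles) {
        total += p.mass;
    }
    return total;
}

// Reads only the `mass` array, so every byte fetched is useful.
float total_mass_soa(const ParticlesSoA& particles) {
    float total = 0.0f;
    for (float m : particles.mass) {
        total += m;
    }
    return total;
}
```

AoS tends to suit code that uses all fields of one element together; SoA tends to suit loops that sweep one or two fields across many elements.
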
Benchmark:
./build/release/examples/02-memory-cache/bench/aos_soa_bench

Learn how alignment affects SIMD performance.
Key Concepts:
- `alignas` specifier
- Aligned memory allocation
- SIMD alignment requirements
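
A minimal sketch of these concepts, assuming a C++17 compiler (for aligned `operator new`) and a 32-byte target, which matches the width of one AVX register. The `Vec8f` type and the helper function names are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <new>

// alignas raises the type's alignment; every Vec8f starts on a 32-byte boundary.
struct alignas(32) Vec8f {
    float lanes[8];
};

// True if p is aligned to `alignment` bytes (alignment must be a power of two).
inline bool is_aligned(const void* p, std::size_t alignment) {
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Allocate `count` floats on a 32-byte boundary using C++17 aligned operator new.
inline float* allocate_aligned_floats(std::size_t count) {
    return static_cast<float*>(::operator new(count * sizeof(float), std::align_val_t{32}));
}

// Memory from aligned operator new must be released with the matching aligned delete.
inline void free_aligned_floats(float* p) {
    ::operator delete(p, std::align_val_t{32});
}
```

Aligned load and store instructions require that their addresses meet the instruction's alignment; checking with a helper like `is_aligned` in debug builds is one way to catch violations early.
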
Critical for multi-threaded performance.
Key Concepts:
- Cache line contention
- `alignas(64)` for cache line padding
- Detecting false sharing with perf
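
The padding technique can be sketched as below, assuming 64-byte cache lines (common on x86-64, but not universal). The struct names are hypothetical. In the packed version, two threads that each update their own counter still contend for the same cache line; in the padded version each counter occupies a line of its own.

```cpp
#include <atomic>
#include <cstddef>

// Assumption: 64-byte cache lines. Adjust for your target CPU.
constexpr std::size_t kCacheLine = 64;

// Both counters fit in one cache line, so independent writers interfere.
struct PackedCounters {
    std::atomic<long> a{0};
    std::atomic<long> b{0};
};

// alignas pads and aligns the counter so it owns a whole cache line.
struct PaddedCounter {
    alignas(kCacheLine) std::atomic<long> value{0};
};

// Each counter now lives on a separate cache line.
struct PaddedCounters {
    PaddedCounter a;
    PaddedCounter b;
};
```

The cost is memory: the padded pair takes 128 bytes instead of 16 on a typical 64-bit Linux target, so padding is usually reserved for data that different threads write frequently.
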
Advanced memory optimization technique.
Key Concepts:
- `__builtin_prefetch` usage
- Prefetch distance tuning
- When prefetching helps (and when it doesn't)
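
As a rough sketch of software prefetching, the function below sums values through an index array, an access pattern the hardware prefetcher generally cannot predict. `__builtin_prefetch` is a GCC/Clang builtin; the function name and the distance of 8 iterations are assumptions for illustration, and the right distance has to be found by measurement.

```cpp
#include <cstddef>
#include <vector>

// Sum values[indices[i]] for all i, prefetching a few iterations ahead.
long gather_sum(const std::vector<long>& values, const std::vector<std::size_t>& indices) {
    constexpr std::size_t kDistance = 8;  // hypothetical starting point; tune per workload
    long total = 0;
    const std::size_t n = indices.size();
    for (std::size_t i = 0; i < n; ++i) {
#if defined(__GNUC__) || defined(__clang__)
        if (i + kDistance < n) {
            // Arguments: address, 0 = prefetch for read, 1 = low temporal locality.
            __builtin_prefetch(&values[indices[i + kDistance]], 0, 1);
        }
#endif
        total += values[indices[i]];
    }
    return total;
}
```

For sequential scans the hardware prefetcher usually does this job already, so an explicit prefetch there tends to add instructions without reducing misses.
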
3.1 Compile-Time Computation (examples/03-modern-cpp)
Move computation from runtime to compile time.
Key Concepts:
- `constexpr` functions and variables
- `consteval` for guaranteed compile-time evaluation
- Template metaprogramming basics
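
A small sketch of compile-time computation with `constexpr`, written so it also compiles before C++20. The names are hypothetical. In C++20, declaring the function `consteval` instead would additionally make any call that cannot be evaluated at compile time a compile error.

```cpp
#include <cstddef>

// Usable at compile time when the argument is a constant expression.
constexpr unsigned long long factorial(unsigned n) {
    return n <= 1 ? 1ULL : n * factorial(n - 1);
}

// Checked by the compiler: a wrong value here is a build error, not a runtime bug.
static_assert(factorial(10) == 3628800ULL, "factorial(10) is computed at compile time");

// A constexpr variable can size an array, so no runtime computation is needed.
constexpr std::size_t kTableSize = factorial(4);  // 24
int lookup_table[kTableSize] = {};
```
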
Avoid unnecessary copies.
Key Concepts:
- Rvalue references
- Move constructors and assignment
- `std::move` usage
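
One way to see the difference between copying and moving is to count constructor calls. The `Tracked` type below is a hypothetical illustration (it uses C++17 inline static members), not a class from the examples.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

struct Tracked {
    std::vector<int> data;
    static inline int copies = 0;  // incremented by the copy constructor
    static inline int moves = 0;   // incremented by the move constructor

    explicit Tracked(std::size_t n) : data(n) {}

    // Copy: allocates a new buffer and duplicates every element.
    Tracked(const Tracked& other) : data(other.data) { ++copies; }

    // Move: takes over the source's buffer; no element is copied.
    Tracked(Tracked&& other) noexcept : data(std::move(other.data)) { ++moves; }
};
```

Given `Tracked a(1000);`, the statement `Tracked b = a;` invokes the copy constructor, while `Tracked c = std::move(a);` invokes the move constructor and leaves `a` in a valid but unspecified state. Marking the move constructor `noexcept` matters in practice: `std::vector` moves elements during reallocation only if doing so cannot throw.
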
Optimize container usage.
Key Concepts:
- `reserve()` vs automatic growth
- Allocation counting
- Capacity vs size
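
A simple way to count reallocations, sketched here under the assumption that a change in `capacity()` signals a new allocation. The function name is hypothetical, and the example in the repository may count allocations differently (for instance with a custom allocator).

```cpp
#include <cstddef>
#include <vector>

// Returns how many times the vector's capacity changed while pushing n elements.
std::size_t count_reallocations(std::size_t n, bool reserve_first) {
    std::vector<int> v;
    if (reserve_first) {
        v.reserve(n);  // one up-front allocation sized for all n elements
    }
    std::size_t reallocations = 0;
    std::size_t last_capacity = v.capacity();
    for (std::size_t i = 0; i < n; ++i) {
        v.push_back(static_cast<int>(i));
        if (v.capacity() != last_capacity) {  // growth means allocate, move, free
            ++reallocations;
            last_capacity = v.capacity();
        }
    }
    return reallocations;
}
```

With `reserve_first` set, the count is zero. Without it, the count depends on the standard library's growth factor, so the exact number varies between implementations.
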
Modern iteration patterns.
Key Concepts:
- Range adaptors and views
- Lazy evaluation
- Performance comparison with raw loops
4.1 Auto-Vectorization (examples/04-simd-vectorization)
Let the compiler do the work.
Key Concepts:
- Vectorization-friendly code patterns
- Compiler vectorization reports
- Common vectorization blockers
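
The contrast between a vectorization-friendly loop and a common blocker can be sketched as follows. These functions are illustrative rather than taken from examples/04-simd-vectorization, and `__restrict` is a compiler extension (supported by GCC, Clang, and MSVC) rather than standard C++.

```cpp
#include <cstddef>

// Friendly: unit stride, no branches, and iterations are independent.
// __restrict promises the compiler that x and y do not overlap.
void saxpy(float a, const float* __restrict x, float* __restrict y, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i) {
        y[i] = a * x[i] + y[i];
    }
}

// Blocker: each iteration reads the result of the previous one
// (a loop-carried dependency), so the straightforward loop cannot be
// split into independent SIMD lanes.
void prefix_sum(float* data, std::size_t n) {
    for (std::size_t i = 1; i < n; ++i) {
        data[i] += data[i - 1];
    }
}
```

Whether a given compiler actually vectorizes `saxpy` depends on the optimization level and target flags, which is what the vectorization reports are for.
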
Compiler flags:
# GCC vectorization report
-fopt-info-vec-optimized
# Clang vectorization report
-Rpass=loop-vectorize

Repository workflow:
cmake --preset=release -DHPC_VECTORIZE_REPORT=ON
cmake --build build/release --target auto_vectorize

`HPC_VECTORIZE_REPORT` enables the same compiler-specific diagnostics for the
example target while keeping the default preset list stable. For sanitizer-led
verification after SIMD changes, see
Validation & Sanitizers.
Manual vectorization for maximum control.
Key Concepts:
- SSE, AVX2, AVX-512 instruction sets
- Intrinsic functions
- Data alignment for SIMD
Readable SIMD code.
Key Concepts:
- Abstracting intrinsics
- Scalar fallback implementations
- Type-safe SIMD operations
- Runtime dispatch for mixed CPU fleets
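
A minimal sketch of an intrinsics wrapper with a scalar fallback. The `Float4` type and `add_arrays` function are hypothetical and far smaller than a real wrapper library. Selection here happens at compile time through the preprocessor; runtime dispatch, which picks an implementation after querying the CPU, is not shown.

```cpp
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#include <emmintrin.h>
#define SKETCH_HAS_SSE2 1
#endif

// Four floats with a type-safe interface; callers never touch intrinsics.
struct Float4 {
#ifdef SKETCH_HAS_SSE2
    __m128 v;
    static Float4 load(const float* p) { return {_mm_loadu_ps(p)}; }  // unaligned load
    void store(float* p) const { _mm_storeu_ps(p, v); }               // unaligned store
    friend Float4 operator+(Float4 a, Float4 b) { return {_mm_add_ps(a.v, b.v)}; }
#else
    // Scalar fallback with the same interface for targets without SSE2.
    float v[4];
    static Float4 load(const float* p) {
        Float4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = p[i];
        return r;
    }
    void store(float* p) const {
        for (int i = 0; i < 4; ++i) p[i] = v[i];
    }
    friend Float4 operator+(Float4 a, Float4 b) {
        Float4 r;
        for (int i = 0; i < 4; ++i) r.v[i] = a.v[i] + b.v[i];
        return r;
    }
#endif
};

// out[i] = a[i] + b[i], four lanes at a time, then a scalar loop for the tail.
void add_arrays(const float* a, const float* b, float* out, std::size_t n) {
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        (Float4::load(a + i) + Float4::load(b + i)).store(out + i);
    }
    for (; i < n; ++i) {
        out[i] = a[i] + b[i];
    }
}
```

The unaligned load and store variants are used so the sketch works on any pointer; a wrapper aimed at peak performance would typically also offer aligned variants.
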
5.1 Atomic Operations (examples/05-concurrency)
Foundation of lock-free programming.
Key Concepts:
- `std::atomic` basics
- Memory orderings (relaxed, acquire, release, seq_cst)
- When to use each ordering
Practical lock-free data structure.
Key Concepts:
- SPSC queue design
- Memory ordering in practice
- Correctness verification
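
The design can be sketched as a bounded ring buffer with one index owned by each side. This is a simplified illustration assuming exactly one producer thread and one consumer thread, C++17 (for `std::optional`), and 64-byte cache lines; the class name is hypothetical and the repository's queue may differ.

```cpp
#include <array>
#include <atomic>
#include <cstddef>
#include <optional>

// Single-producer, single-consumer queue. Holds at most Capacity - 1 items,
// because one slot stays empty to distinguish "full" from "empty".
template <typename T, std::size_t Capacity>
class SpscQueue {
public:
    // Called only by the producer thread.
    bool try_push(const T& value) {
        const std::size_t tail = tail_.load(std::memory_order_relaxed);  // own index
        const std::size_t next = (tail + 1) % Capacity;
        if (next == head_.load(std::memory_order_acquire)) {
            return false;  // full
        }
        buffer_[tail] = value;
        // Release: the element write above becomes visible before the new tail.
        tail_.store(next, std::memory_order_release);
        return true;
    }

    // Called only by the consumer thread.
    std::optional<T> try_pop() {
        const std::size_t head = head_.load(std::memory_order_relaxed);  // own index
        if (head == tail_.load(std::memory_order_acquire)) {
            return std::nullopt;  // empty
        }
        T value = buffer_[head];
        // Release: the slot is handed back only after the element was read.
        head_.store((head + 1) % Capacity, std::memory_order_release);
        return value;
    }

private:
    std::array<T, Capacity> buffer_{};
    alignas(64) std::atomic<std::size_t> head_{0};  // written by the consumer
    alignas(64) std::atomic<std::size_t> tail_{0};  // written by the producer
};
```

Each side loads its own index with relaxed ordering because no other thread writes it, and pairs an acquire load of the other side's index with that side's release store. A single-threaded test only checks the ring logic; verifying the ordering needs multi-threaded stress runs, ideally under ThreadSanitizer.
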
Simple parallelization.
Key Concepts:
- `#pragma omp parallel for`
- Reductions
- Thread scaling
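
A reduction can be sketched in a few lines. The function name is hypothetical. With OpenMP enabled (for example `-fopenmp` on GCC or Clang), each thread accumulates a private partial sum and OpenMP combines them at the end; without it, the pragma is ignored and the loop runs serially with the same result for this input.

```cpp
#include <cstddef>
#include <vector>

// Sum all elements, splitting the iterations across threads when OpenMP is enabled.
double parallel_sum(const std::vector<double>& data) {
    double total = 0.0;
    const long n = static_cast<long>(data.size());
    // reduction(+ : total) gives each thread its own copy of `total`
    // and adds the copies together after the loop.
    #pragma omp parallel for reduction(+ : total)
    for (long i = 0; i < n; ++i) {
        total += data[i];
    }
    return total;
}
```

Without the reduction clause, every thread would update the shared `total` concurrently, which is a data race. Note that floating-point results may differ slightly between thread counts, because the order of additions changes.
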
Learn to measure accurately.
Topics:
- Google Benchmark usage
- `DoNotOptimize` and `ClobberMemory`
- Parameterized benchmarks
Find performance bottlenecks.
Tools:
- `perf` for CPU profiling
- FlameGraph visualization
- Cache miss analysis
See Profiling Guide for detailed instructions.
| Week | Topics |
|---|---|
| 1 | Phase 1 + Phase 2.1-2.2 |
| 2 | Phase 2.3-2.4 + Phase 3.1-3.2 |
| 3 | Phase 3.3-3.4 + Phase 4.1 |
| 4 | Phase 4.2-4.3 |
| 5 | Phase 5.1-5.2 |
| 6 | Phase 5.3 + Phase 6 |
After completing this learning path:
- Profile your own code to find bottlenecks
- Apply relevant optimizations
- Measure the improvement
- Contribute new examples to this project!
- Best Practices - Industry-tested patterns
- API Reference - Utility functions
- FAQ - Common questions