SIMD Wrapper API

A C++ wrapper around SIMD intrinsics that provides a clean, portable interface for vectorized operations.

Overview

Header: examples/04-simd-vectorization/include/simd_wrapper.hpp

Namespace: hpc::simd

SIMD Level Detection

The library automatically detects available SIMD instruction sets at compile time:

Macro	Instruction Set	Width
`HPC_HAS_SSE2`	SSE2	128-bit (4 floats)
`HPC_HAS_AVX`	AVX	256-bit (8 floats)
`HPC_HAS_AVX2`	AVX2	256-bit (8 floats)
`HPC_HAS_AVX512`	AVX-512	512-bit (16 floats)
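
As a rough illustration of what compile-time detection involves, the sketch below derives a float lane count from the macros that GCC and Clang predefine for each instruction set (`__SSE2__`, `__AVX__`, `__AVX2__`, `__AVX512F__`). The `SKETCH_` macros and the `sketch_float_width` constant are hypothetical names; the actual header may define its `HPC_HAS_*` macros differently, and treating plain AVX as 8 lanes is an assumption.

```cpp
#include <cstddef>

// Hypothetical stand-ins for the HPC_HAS_* macros. The real header's
// detection logic is not shown in this document and may differ.
#if defined(__AVX512F__)
#  define SKETCH_HAS_AVX512 1
#endif
#if defined(__AVX2__)
#  define SKETCH_HAS_AVX2 1
#endif
#if defined(__AVX__)
#  define SKETCH_HAS_AVX 1
#endif
#if defined(__SSE2__)
#  define SKETCH_HAS_SSE2 1
#endif

// Pick the widest level the compiler was allowed to target.
#if defined(SKETCH_HAS_AVX512)
constexpr std::size_t sketch_float_width = 16;  // 512 bits / 32-bit float
#elif defined(SKETCH_HAS_AVX2) || defined(SKETCH_HAS_AVX)
constexpr std::size_t sketch_float_width = 8;   // 256 bits / 32-bit float
#else
constexpr std::size_t sketch_float_width = 4;   // SSE2 or scalar fallback
#endif

// Register width in bits implied by the lane count.
constexpr std::size_t sketch_register_bits =
    sketch_float_width * sizeof(float) * 8;
```

These macros reflect the compiler's target flags (for example `-mavx2` or `-march=native`), not the CPU the binary eventually runs on.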

SimdVec Class

Template Parameters

template<typename T, size_t Width>
class SimdVec;

T - Element type (currently only float is specialized)
Width - Number of elements per vector (4, 8, or 16)
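
The declaration above leaves the primary template undefined and relies on one specialization per element type and width. A minimal sketch of that pattern, using a hypothetical `SketchVec` backed by a plain array instead of intrinsics, might look like the following. Only a few members are modeled, and nothing here is taken from the actual header.

```cpp
#include <array>
#include <cassert>
#include <cstddef>

// Primary template: declared but never defined, so unsupported
// combinations such as SketchVec<int, 3> fail to compile when used.
template <typename T, std::size_t Width>
class SketchVec;

// Hypothetical specialization for four floats, backed by std::array
// rather than a SIMD register.
template <>
class SketchVec<float, 4> {
public:
    SketchVec() : lanes_{} {}                            // zero-initialized
    explicit SketchVec(float val) { lanes_.fill(val); }  // broadcast
    explicit SketchVec(const float* ptr) {               // unaligned load
        for (std::size_t i = 0; i < 4; ++i) lanes_[i] = ptr[i];
    }

    void store(float* ptr) const {
        for (std::size_t i = 0; i < 4; ++i) ptr[i] = lanes_[i];
    }

    float operator[](std::size_t i) const {
        assert(i < 4 && "lane index out of range");
        return lanes_[i];
    }

    SketchVec operator+(const SketchVec& other) const {
        SketchVec out;
        for (std::size_t i = 0; i < 4; ++i) {
            out.lanes_[i] = lanes_[i] + other.lanes_[i];
        }
        return out;
    }

private:
    std::array<float, 4> lanes_;
};
```

A scalar model like this can also serve as a reference when unit-testing an intrinsic-based specialization, since both should agree lane by lane.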

Common Interface

All SIMD vector types share this interface:

Construction

// Default constructor - zero initialized
SimdVec();

// Broadcast a single value to all lanes
explicit SimdVec(float val);

// Load from unaligned memory
SimdVec(const float* ptr);

// Load from aligned memory (static method)
static SimdVec load_aligned(const float* ptr);

Storage

// Store to unaligned memory
void store(float* ptr) const;

// Store to aligned memory
void store_aligned(float* ptr) const;

Element Access

// Get element at index (slow, for debugging)
float operator[](size_t i) const;

Arithmetic Operators

SimdVec operator+(const SimdVec& other) const;
SimdVec operator-(const SimdVec& other) const;
SimdVec operator*(const SimdVec& other) const;
SimdVec operator/(const SimdVec& other) const;

SimdVec& operator+=(const SimdVec& other);
SimdVec& operator-=(const SimdVec& other);
SimdVec& operator*=(const SimdVec& other);

Mathematical Operations

// Sum all lanes into a single value
float horizontal_sum() const;

// Fused multiply-add: a * b + c
static SimdVec fmadd(const SimdVec& a, const SimdVec& b, const SimdVec& c);

// Element-wise square root
SimdVec sqrt() const;

// Element-wise minimum
SimdVec min(const SimdVec& other) const;

// Element-wise maximum
SimdVec max(const SimdVec& other) const;
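
`horizontal_sum()` has no single-instruction equivalent in SSE2, so it is typically built from shuffles and adds. The sketch below shows one common reduction for four floats. `horizontal_sum4` is a hypothetical free function, not the library's implementation, and it falls back to scalar code when SSE2 is not enabled.

```cpp
#include <cassert>
#include <cstddef>
#if defined(__SSE2__)
#include <emmintrin.h>
#endif

// Sum four consecutive floats starting at p (no alignment required).
float horizontal_sum4(const float* p) {
    assert(p != nullptr);
#if defined(__SSE2__)
    __m128 v = _mm_loadu_ps(p);                 // [v0, v1, v2, v3]
    __m128 hi = _mm_movehl_ps(v, v);            // [v2, v3, v2, v3]
    __m128 sum2 = _mm_add_ps(v, hi);            // lane0 = v0+v2, lane1 = v1+v3
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2, _MM_SHUFFLE(1, 1, 1, 1));
    __m128 sum1 = _mm_add_ss(sum2, lane1);      // lane0 = (v0+v2) + (v1+v3)
    return _mm_cvtss_f32(sum1);                 // extract lane 0
#else
    // Same association order as the SSE path.
    return (p[0] + p[2]) + (p[1] + p[3]);
#endif
}
```

The lanes are added pairwise rather than left to right, so the result can differ in the last bits from a sequential scalar sum.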

Type Aliases

FloatVec

using FloatVec = SimdVec<float, WIDTH>;  // WIDTH depends on available SIMD

The default SIMD vector type. It automatically selects the widest available instruction set.

Available SIMD	FloatVec Width
AVX-512	16 floats
AVX2	8 floats
SSE2	4 floats
None	4 floats (scalar fallback)

FLOAT_VEC_WIDTH

constexpr size_t FLOAT_VEC_WIDTH;  // 4, 8, or 16

Number of floats in the default FloatVec type.

High-Level Operations

add_arrays_wrapped

void add_arrays_wrapped(const float* a, const float* b, float* c, size_t n);

Add two arrays element-wise: c[i] = a[i] + b[i]

Example:

float a[1024], b[1024], c[1024];
// ... initialize a and b ...

hpc::simd::add_arrays_wrapped(a, b, c, 1024);

dot_product_wrapped

float dot_product_wrapped(const float* a, const float* b, size_t n);

Compute dot product: sum(a[i] * b[i])

Example:

float a[1024], b[1024];
// ... initialize ...

float result = hpc::simd::dot_product_wrapped(a, b, 1024);
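
A vectorized dot product accumulates partial sums per lane, so it adds the products in a different order than a plain loop, and the two results can differ in the last few bits. The sketch below models this with a hypothetical `dot_lanes` helper (four scalar accumulators standing in for SIMD lanes) and a relative-tolerance comparison that may be useful when testing `dot_product_wrapped` against a scalar reference. The tolerance value is an arbitrary choice for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

// Plain left-to-right reference.
float dot_scalar(const float* a, const float* b, std::size_t n) {
    float sum = 0.0f;
    for (std::size_t i = 0; i < n; ++i) sum += a[i] * b[i];
    return sum;
}

// Hypothetical model of a 4-lane SIMD dot product: each lane keeps its
// own partial sum, and the lanes are combined only at the end.
float dot_lanes(const float* a, const float* b, std::size_t n) {
    assert(a != nullptr && b != nullptr);
    float lane[4] = {0.0f, 0.0f, 0.0f, 0.0f};
    std::size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        for (std::size_t k = 0; k < 4; ++k) lane[k] += a[i + k] * b[i + k];
    }
    float sum = (lane[0] + lane[1]) + (lane[2] + lane[3]);
    for (; i < n; ++i) sum += a[i] * b[i];  // scalar remainder
    return sum;
}

// Relative comparison with an absolute floor of rel_tol for small values.
bool nearly_equal(float x, float y, float rel_tol = 1e-5f) {
    float scale = std::max(std::fabs(x), std::fabs(y));
    return std::fabs(x - y) <= rel_tol * std::max(scale, 1.0f);
}
```

Comparing with `nearly_equal` rather than `==` keeps such a test from failing on harmless rounding differences.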

scale_array_wrapped

void scale_array_wrapped(float* arr, float scalar, size_t n);

Scale array by scalar: arr[i] *= scalar

clamp_array_wrapped

void clamp_array_wrapped(float* arr, float min_val, float max_val, size_t n);

Clamp array values to range: arr[i] = clamp(arr[i], min_val, max_val)
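
Clamping is commonly expressed as a `max` followed by a `min`, which maps directly onto the `min()` and `max()` methods listed earlier. The scalar sketch below states those semantics. `clamp_array_ref` is a hypothetical reference helper, and how the wrapped version treats NaN inputs or `min_val > max_val` is not specified in this document.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Scalar reference: raise values below min_val, then lower values
// above max_val. Assumes min_val <= max_val.
void clamp_array_ref(float* arr, float min_val, float max_val, std::size_t n) {
    assert(min_val <= max_val && "empty clamp range");
    for (std::size_t i = 0; i < n; ++i) {
        arr[i] = std::min(std::max(arr[i], min_val), max_val);
    }
}
```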

Usage Examples

Basic Vector Operations

#include "simd_wrapper.hpp"

using namespace hpc::simd;

void process_arrays(const float* a, const float* b, float* result, size_t n) {
    size_t i = 0;
    
    // Process in SIMD-width chunks
    for (; i + FLOAT_VEC_WIDTH <= n; i += FLOAT_VEC_WIDTH) {
        FloatVec va(&a[i]);
        FloatVec vb(&b[i]);
        
        // result = a * 2 + b
        FloatVec scaled = va * FloatVec(2.0f);
        FloatVec vr = scaled + vb;
        
        vr.store(&result[i]);
    }
    
    // Handle remaining elements
    for (; i < n; ++i) {
        result[i] = a[i] * 2.0f + b[i];
    }
}

Using Fused Multiply-Add

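#include "simd_wrapper.hpp"

using namespace hpc::simd;

float compute_weighted_sum(const float* values, const float* weights,
                           float bias, size_t n) {
    FloatVec sum;  // zero-initialized accumulator
    size_t i = 0;
    
    for (; i + FLOAT_VEC_WIDTH <= n; i += FLOAT_VEC_WIDTH) {
        FloatVec v(&values[i]);
        FloatVec w(&weights[i]);
        sum = FloatVec::fmadd(v, w, sum);  // sum += v * w
    }
    
    // Add the bias once, after the reduction; a bias broadcast into the
    // accumulator would be counted FLOAT_VEC_WIDTH times.
    float result = bias + sum.horizontal_sum();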
    
    // Handle remainder
    for (; i < n; ++i) {
        result += values[i] * weights[i];
    }
    
    return result;
}
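
To see how a bias term interacts with lane-wise accumulation, the sketch below models an accumulator of `W` lanes with plain floats. `lane_model_sum` is a hypothetical helper: it seeds every lane with `seed`, accumulates products lane by lane, and then reduces, so a nonzero seed appears `W` times in the result.

```cpp
#include <cassert>
#include <cstddef>

// Scalar model of a W-lane accumulator. Every lane starts at `seed`.
template <std::size_t W>
float lane_model_sum(const float* v, const float* w, std::size_t n, float seed) {
    assert(v != nullptr && w != nullptr);
    float lane[W];
    for (std::size_t k = 0; k < W; ++k) lane[k] = seed;

    std::size_t i = 0;
    for (; i + W <= n; i += W) {
        for (std::size_t k = 0; k < W; ++k) lane[k] += v[i + k] * w[i + k];
    }

    float total = 0.0f;
    for (std::size_t k = 0; k < W; ++k) total += lane[k];  // horizontal sum
    for (; i < n; ++i) total += v[i] * w[i];               // scalar remainder
    return total;
}
```

With eight products of 1.0 and four lanes, a seed of 0.5 yields 8 + 4 × 0.5, whereas a zero seed plus a bias added after the reduction yields 8 + 0.5.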

Aligned Memory for Best Performance

#include "memory_utils.hpp"
#include "simd_wrapper.hpp"

using namespace hpc::simd;

void aligned_operations() {
    // Allocate aligned memory
    auto a = hpc::memory::make_aligned<float>(1024, 64);
    auto b = hpc::memory::make_aligned<float>(1024, 64);
    auto c = hpc::memory::make_aligned<float>(1024, 64);
    
    // ... initialize ...
    
    size_t i = 0;
    // 1024 is a multiple of every supported width, so no remainder loop is needed
    for (; i + FLOAT_VEC_WIDTH <= 1024; i += FLOAT_VEC_WIDTH) {
        // Use aligned loads for better performance
        FloatVec va = FloatVec::load_aligned(&a[i]);
        FloatVec vb = FloatVec::load_aligned(&b[i]);
        FloatVec vc = va + vb;
        vc.store_aligned(&c[i]);
    }
}

Performance Considerations

Alignment

For best performance:

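Use load_aligned() and store_aligned() only when the pointer is aligned to the vector size (16 bytes for SSE2, 32 bytes for AVX and AVX2, 64 bytes for AVX-512); 64-byte alignment satisfies every supported width
Aligned accesses never straddle a cache line, which avoids the split-access penalty that unaligned loads and stores can incur on some architectures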
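
Whether a pointer qualifies for the aligned path can be checked at run time. The sketch below uses only the C++17 standard library. `make_aligned_floats` and `is_aligned` are hypothetical helpers, not the `hpc::memory::make_aligned` function from the earlier example, whose signature and return type are not documented here.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <memory>
#include <new>

// Deleter that releases memory obtained from the aligned operator new.
struct AlignedDeleter {
    std::size_t alignment = alignof(float);
    void operator()(float* p) const noexcept {
        ::operator delete(p, static_cast<std::align_val_t>(alignment));
    }
};

using AlignedFloats = std::unique_ptr<float[], AlignedDeleter>;

// Allocate `count` uninitialized floats on an `alignment`-byte boundary.
AlignedFloats make_aligned_floats(std::size_t count, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0 &&
           "alignment must be a power of two");
    void* raw = ::operator new(count * sizeof(float),
                               static_cast<std::align_val_t>(alignment));
    return AlignedFloats(static_cast<float*>(raw), AlignedDeleter{alignment});
}

// True when p sits on an `alignment`-byte boundary.
bool is_aligned(const void* p, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(p) % alignment == 0;
}
```

A caller could use `is_aligned` to choose between `load_aligned()` and the unaligned constructor when the origin of a buffer is unknown.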

Remainder Handling

Always handle elements that don't fit in a full SIMD vector:

size_t i = 0;
for (; i + FLOAT_VEC_WIDTH <= n; i += FLOAT_VEC_WIDTH) {
    // SIMD loop
}
for (; i < n; ++i) {
    // Scalar remainder
}
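
The loop condition `i + FLOAT_VEC_WIDTH <= n` matters because `size_t` is unsigned: the seemingly equivalent `i <= n - FLOAT_VEC_WIDTH` wraps around when `n` is smaller than the vector width. The sketch below counts how the two loops divide the work. `split_work` and `WorkSplit` are hypothetical names, and the width is passed as a parameter so the example does not depend on the header.

```cpp
#include <cassert>
#include <cstddef>

struct WorkSplit {
    std::size_t simd_iterations;    // full vectors processed
    std::size_t scalar_iterations;  // leftover elements
};

// Mirror the two-loop pattern and count the iterations of each loop.
WorkSplit split_work(std::size_t n, std::size_t width) {
    assert(width > 0);
    WorkSplit split{0, 0};
    std::size_t i = 0;
    for (; i + width <= n; i += width) ++split.simd_iterations;
    for (; i < n; ++i) ++split.scalar_iterations;
    return split;
}
```

For `n = 3` and a width of 8, the SIMD loop runs zero times and the scalar loop handles all three elements.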

Memory Bandwidth

SIMD is most beneficial when:

Data is already in cache (loops limited by memory bandwidth gain little from wider vectors)
Operations are compute-intensive
Data access is sequential
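
One way to estimate whether a loop is compute-intensive is its arithmetic intensity, the number of floating-point operations per byte moved. The worked example below assumes every operand is streamed from memory exactly once. `arithmetic_intensity` is a hypothetical helper, and the per-element counts are estimates read off the documented formulas rather than measurements.

```cpp
#include <cstddef>

// Floating-point operations per byte of memory traffic.
constexpr double arithmetic_intensity(std::size_t flops_per_element,
                                      std::size_t bytes_per_element) {
    return static_cast<double>(flops_per_element) /
           static_cast<double>(bytes_per_element);
}

// c[i] = a[i] + b[i]: one add, two float loads and one float store.
constexpr double add_arrays_intensity =
    arithmetic_intensity(1, 3 * sizeof(float));

// sum += a[i] * b[i]: one multiply and one add, two float loads.
constexpr double dot_product_intensity =
    arithmetic_intensity(2, 2 * sizeof(float));
```

Under these assumptions the dot product performs three times as much arithmetic per byte as the array addition, so it is the more likely of the two to benefit from wider vectors once the data no longer fits in cache.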

Quick Reference

Operation	Method	SIMD Equivalent
Add	`a + b`	`_mm_add_ps`
Subtract	`a - b`	`_mm_sub_ps`
Multiply	`a * b`	`_mm_mul_ps`
Divide	`a / b`	`_mm_div_ps`
FMA	`fmadd(a,b,c)`	`_mm_fmadd_ps`
Sqrt	`a.sqrt()`	`_mm_sqrt_ps`
Horizontal sum	`a.horizontal_sum()`	Manual reduction
Min	`a.min(b)`	`_mm_min_ps`
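Max	`a.max(b)`	`_mm_max_ps`

The intrinsics listed are the 128-bit forms. The 256-bit and 512-bit equivalents use the `_mm256_` and `_mm512_` prefixes, and `_mm_fmadd_ps` requires the FMA extension in addition to SSE.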
