SIMD Wrapper API

C++ wrapper for SIMD intrinsics providing a clean, portable interface for vectorized operations.


Overview

Header: examples/04-simd-vectorization/include/simd_wrapper.hpp

Namespace: hpc::simd


SIMD Level Detection

The library automatically detects available SIMD instruction sets at compile time:

Macro            Instruction Set   Width
HPC_HAS_SSE2     SSE2              128-bit (4 floats)
HPC_HAS_AVX      AVX               256-bit (8 floats)
HPC_HAS_AVX2     AVX2              256-bit (8 floats)
HPC_HAS_AVX512   AVX-512           512-bit (16 floats)
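
A minimal sketch of how compile-time detection of this kind is commonly done, assuming it keys off the macros the compiler predefines for the target (`__SSE2__`, `__AVX__`, `__AVX2__`, `__AVX512F__`). The `SKETCH_HAS_*` macros and `simd_level_name()` are hypothetical names used for illustration; the real definitions live in simd_wrapper.hpp and may differ.

```cpp
#include <cassert>

// Hypothetical stand-ins for the HPC_HAS_* macros. They are derived from
// macros that the compiler predefines according to the target flags
// (for example -mavx2 on GCC/Clang or /arch:AVX2 on MSVC).
#if defined(__AVX512F__)
#  define SKETCH_HAS_AVX512 1
#endif
#if defined(__AVX2__)
#  define SKETCH_HAS_AVX2 1
#endif
#if defined(__AVX__)
#  define SKETCH_HAS_AVX 1
#endif
// MSVC does not define __SSE2__; SSE2 is always available on x64 targets.
#if defined(__SSE2__) || defined(_M_X64)
#  define SKETCH_HAS_SSE2 1
#endif

// Name of the widest level enabled for this translation unit.
inline const char* simd_level_name() {
#if defined(SKETCH_HAS_AVX512)
    return "AVX-512";
#elif defined(SKETCH_HAS_AVX2)
    return "AVX2";
#elif defined(SKETCH_HAS_AVX)
    return "AVX";
#elif defined(SKETCH_HAS_SSE2)
    return "SSE2";
#else
    return "scalar";
#endif
}
```

Because the check happens at compile time, the result reflects the compiler flags used for the build, not the CPU the binary later runs on.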

SimdVec Class

Template Parameters

template<typename T, size_t Width>
class SimdVec;
  • T - Element type (only float is currently specialized)
  • Width - Number of elements (4, 8, or 16)

Common Interface

All SIMD vector types share this interface:

Construction

// Default constructor - zero initialized
SimdVec();

// Broadcast a single value to all lanes
explicit SimdVec(float val);

// Load from unaligned memory
SimdVec(const float* ptr);

// Load from aligned memory (static method)
static SimdVec load_aligned(const float* ptr);

Storage

// Store to unaligned memory
void store(float* ptr) const;

// Store to aligned memory
void store_aligned(float* ptr) const;

Element Access

// Get element at index (slow, for debugging)
float operator[](size_t i) const;

Arithmetic Operators

SimdVec operator+(const SimdVec& other) const;
SimdVec operator-(const SimdVec& other) const;
SimdVec operator*(const SimdVec& other) const;
SimdVec operator/(const SimdVec& other) const;

SimdVec& operator+=(const SimdVec& other);
SimdVec& operator-=(const SimdVec& other);
SimdVec& operator*=(const SimdVec& other);

Mathematical Operations

// Sum all lanes into a single value
float horizontal_sum() const;

// Fused multiply-add: a * b + c
static SimdVec fmadd(const SimdVec& a, const SimdVec& b, const SimdVec& c);

// Element-wise square root
SimdVec sqrt() const;

// Element-wise minimum
SimdVec min(const SimdVec& other) const;

// Element-wise maximum
SimdVec max(const SimdVec& other) const;
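
To make the semantics of this interface concrete, here is a scalar sketch of a 4-wide float specialization. It is not the library's implementation: the `sketch` namespace, the `FloatVec4` alias, and the private `zip` helper are assumptions made for illustration, and only a subset of the members is shown.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>

namespace sketch {

template <typename T, std::size_t Width>
class SimdVec;  // primary template; specialized per element type and width

// Hypothetical scalar stand-in for the float/4 specialization.
template <>
class SimdVec<float, 4> {
public:
    SimdVec() : lanes_{0.0f, 0.0f, 0.0f, 0.0f} {}                 // zero initialized
    explicit SimdVec(float val) : lanes_{val, val, val, val} {}   // broadcast
    SimdVec(const float* ptr) { std::copy(ptr, ptr + 4, lanes_); }  // unaligned load

    void store(float* ptr) const { std::copy(lanes_, lanes_ + 4, ptr); }

    float operator[](std::size_t i) const { assert(i < 4); return lanes_[i]; }

    SimdVec operator+(const SimdVec& o) const {
        return zip(o, [](float x, float y) { return x + y; });
    }
    SimdVec operator*(const SimdVec& o) const {
        return zip(o, [](float x, float y) { return x * y; });
    }
    SimdVec min(const SimdVec& o) const {
        return zip(o, [](float x, float y) { return std::min(x, y); });
    }
    SimdVec max(const SimdVec& o) const {
        return zip(o, [](float x, float y) { return std::max(x, y); });
    }

    // Pairwise reduction of the four lanes into one value.
    float horizontal_sum() const {
        return (lanes_[0] + lanes_[1]) + (lanes_[2] + lanes_[3]);
    }

    // a * b + c per lane, rounded once (std::fma).
    static SimdVec fmadd(const SimdVec& a, const SimdVec& b, const SimdVec& c) {
        SimdVec r;
        for (std::size_t i = 0; i < 4; ++i) {
            r.lanes_[i] = std::fma(a.lanes_[i], b.lanes_[i], c.lanes_[i]);
        }
        return r;
    }

private:
    // Apply a binary function lane by lane.
    template <typename F>
    SimdVec zip(const SimdVec& o, F f) const {
        SimdVec r;
        for (std::size_t i = 0; i < 4; ++i) r.lanes_[i] = f(lanes_[i], o.lanes_[i]);
        return r;
    }

    float lanes_[4];
};

using FloatVec4 = SimdVec<float, 4>;

}  // namespace sketch
```

A scalar model like this can also serve as a reference when testing a vectorized specialization lane by lane.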

Type Aliases

FloatVec

using FloatVec = SimdVec<float, WIDTH>;  // WIDTH depends on available SIMD

The default SIMD vector type. It resolves to the widest vector supported by the instruction sets detected at compile time.

Available SIMD   FloatVec Width
AVX-512          16 floats
AVX2             8 floats
SSE2             4 floats
None             4 floats (scalar fallback)

FLOAT_VEC_WIDTH

constexpr size_t FLOAT_VEC_WIDTH;  // 4, 8, or 16

Number of floats in the default FloatVec type.
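
The following sketch shows one way a width constant like this could be chosen and then used to find where the full-vector part of an array ends. `kSketchFloatWidth` and `vector_end()` are hypothetical names, and the `#if` chain simply mirrors the FloatVec width table under the assumption that selection follows compiler-predefined macros.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical mirror of the FloatVec width table.
#if defined(__AVX512F__)
constexpr std::size_t kSketchFloatWidth = 16;
#elif defined(__AVX2__)
constexpr std::size_t kSketchFloatWidth = 8;
#else
constexpr std::size_t kSketchFloatWidth = 4;  // SSE2 or scalar fallback
#endif

// Index one past the last element that is covered by full vectors.
// Elements in [vector_end(n), n) must be handled by a scalar loop.
constexpr std::size_t vector_end(std::size_t n,
                                 std::size_t width = kSketchFloatWidth) {
    return n - (n % width);
}

// 1024 is a multiple of 4, 8, and 16, so there is never a remainder.
static_assert(vector_end(1024) == 1024, "1024 divides evenly by every width");
static_assert(vector_end(19, 8) == 16, "19 elements leave a remainder of 3");
```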


High-Level Operations

add_arrays_wrapped

void add_arrays_wrapped(const float* a, const float* b, float* c, size_t n);

Add two arrays element-wise: c[i] = a[i] + b[i]

Example:

float a[1024], b[1024], c[1024];
// ... initialize a and b ...

hpc::simd::add_arrays_wrapped(a, b, c, 1024);

dot_product_wrapped

float dot_product_wrapped(const float* a, const float* b, size_t n);

Compute dot product: sum(a[i] * b[i])

Example:

float a[1024], b[1024];
// ... initialize ...

float result = hpc::simd::dot_product_wrapped(a, b, 1024);

scale_array_wrapped

void scale_array_wrapped(float* arr, float scalar, size_t n);

Scale array by scalar: arr[i] *= scalar
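
No example is given for this function, so here is a scalar reference that implements the documented contract (arr[i] *= scalar). `scale_array_reference` is a hypothetical helper, not part of the library; it is mainly useful for checking the output of the SIMD version.

```cpp
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference with the same contract as documented for
// scale_array_wrapped: every element is multiplied by `scalar` in place.
void scale_array_reference(float* arr, float scalar, std::size_t n) {
    assert(arr != nullptr || n == 0);
    for (std::size_t i = 0; i < n; ++i) {
        arr[i] *= scalar;
    }
}
```

Going by the signature above, the corresponding library call would be hpc::simd::scale_array_wrapped(data, 0.5f, 1024) for a 1024-element array named data.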


clamp_array_wrapped

void clamp_array_wrapped(float* arr, float min_val, float max_val, size_t n);

Clamp array values to range: arr[i] = clamp(arr[i], min_val, max_val)
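
As a sketch of the expected behavior, the scalar reference below clamps each element with a max followed by a min. `clamp_array_reference` is a hypothetical helper, and the precondition min_val <= max_val is an assumption; the source does not say how an inverted range is handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Hypothetical scalar reference for the documented contract:
// arr[i] = clamp(arr[i], min_val, max_val).
void clamp_array_reference(float* arr, float min_val, float max_val,
                           std::size_t n) {
    assert(min_val <= max_val);  // assumed precondition, not stated in the source
    for (std::size_t i = 0; i < n; ++i) {
        // Raise values below the range, then lower values above it.
        arr[i] = std::min(std::max(arr[i], min_val), max_val);
    }
}
```

The same max-then-min structure maps directly onto the max() and min() members of SimdVec, with min_val and max_val broadcast to all lanes.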


Usage Examples

Basic Vector Operations

#include "simd_wrapper.hpp"

using namespace hpc::simd;

void process_arrays(const float* a, const float* b, float* result, size_t n) {
    size_t i = 0;
    
    // Process in SIMD-width chunks
    for (; i + FLOAT_VEC_WIDTH <= n; i += FLOAT_VEC_WIDTH) {
        FloatVec va(&a[i]);
        FloatVec vb(&b[i]);
        
        // result = a * 2 + b
        FloatVec scaled = va * FloatVec(2.0f);
        FloatVec vr = scaled + vb;
        
        vr.store(&result[i]);
    }
    
    // Handle remaining elements
    for (; i < n; ++i) {
        result[i] = a[i] * 2.0f + b[i];
    }
}

Using Fused Multiply-Add

float compute_weighted_sum(const float* values, const float* weights, 
                           float bias, size_t n) {
    // Start from zero: broadcasting bias into every lane would add it
    // FLOAT_VEC_WIDTH times once the lanes are summed.
    FloatVec sum;
    size_t i = 0;
    
    for (; i + FLOAT_VEC_WIDTH <= n; i += FLOAT_VEC_WIDTH) {
        FloatVec v(&values[i]);
        FloatVec w(&weights[i]);
        sum = FloatVec::fmadd(v, w, sum);  // sum += v * w
    }
    
    // Add the bias exactly once, after the horizontal reduction
    float result = bias + sum.horizontal_sum();
    
    // Handle remainder
    for (; i < n; ++i) {
        result += values[i] * weights[i];
    }
    
    return result;
}

Aligned Memory for Best Performance

#include "memory_utils.hpp"
#include "simd_wrapper.hpp"

void aligned_operations() {
    // Allocate aligned memory
    auto a = hpc::memory::make_aligned<float>(1024, 64);
    auto b = hpc::memory::make_aligned<float>(1024, 64);
    auto c = hpc::memory::make_aligned<float>(1024, 64);
    
    // ... initialize ...
    
    // 1024 is a multiple of every supported width, so no remainder loop is needed
    size_t i = 0;
    for (; i + FLOAT_VEC_WIDTH <= 1024; i += FLOAT_VEC_WIDTH) {
        // Use aligned loads for better performance
        FloatVec va = FloatVec::load_aligned(&a[i]);
        FloatVec vb = FloatVec::load_aligned(&b[i]);
        FloatVec vc = va + vb;
        vc.store_aligned(&c[i]);
    }
}

Performance Considerations

Alignment

For best performance:

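  • Use load_aligned() and store_aligned() only when the address is aligned to the vector size: 16 bytes for SSE2, 32 bytes for AVX/AVX2, 64 bytes for AVX-512. 64-byte alignment satisfies all three.
  • An aligned access never straddles a cache line; an unaligned access that does can cost extra cycles on some architectures.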
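
A small sketch of how alignment can be checked before choosing the aligned path. `is_aligned()` and `required_alignment()` are hypothetical helpers, not part of the library, and the assumption is that an aligned load of N floats needs an address that is a multiple of N * sizeof(float) bytes.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if the address is a multiple of `alignment` bytes.
inline bool is_aligned(const void* ptr, std::size_t alignment) {
    return reinterpret_cast<std::uintptr_t>(ptr) % alignment == 0;
}

// Bytes of alignment assumed for an aligned load of `width` floats:
// 4 floats -> 16 bytes, 8 floats -> 32 bytes, 16 floats -> 64 bytes.
constexpr std::size_t required_alignment(std::size_t width) {
    return width * sizeof(float);
}
```

With a buffer declared as alignas(64) float buf[32], the address buf passes the check for every width, buf + 4 passes only for the 16-byte case, and buf + 1 passes for none of them.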

Remainder Handling

Always handle elements that don't fit in a full SIMD vector:

size_t i = 0;
for (; i + FLOAT_VEC_WIDTH <= n; i += FLOAT_VEC_WIDTH) {
    // SIMD loop
}
for (; i < n; ++i) {
    // Scalar remainder
}

Memory Bandwidth

SIMD is most beneficial when:

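  • The working set fits in cache (memory-bound loops are limited by bandwidth rather than arithmetic, so they gain less)
  • Each loaded element is used in several arithmetic operations
  • Data is accessed sequentially, so whole vectors can be loaded at once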
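
One rough way to judge whether a loop is memory-bound is its arithmetic intensity, the number of floating-point operations per byte moved. The sketch below works this out for the two array operations documented above; `arithmetic_intensity()` and the constants are hypothetical names, and the byte counts ignore cache effects such as write-allocate traffic.

```cpp
#include <cassert>
#include <cstddef>

// FLOPs per byte of memory traffic for a streaming loop.
constexpr double arithmetic_intensity(double flops_per_elem,
                                      std::size_t bytes_per_elem) {
    return flops_per_elem / static_cast<double>(bytes_per_elem);
}

// c[i] = a[i] + b[i]: one add; reads a[i] and b[i], writes c[i].
constexpr double kAddArraysIntensity =
    arithmetic_intensity(1.0, 3 * sizeof(float));

// sum += a[i] * b[i]: one multiply and one add; reads a[i] and b[i].
constexpr double kDotProductIntensity =
    arithmetic_intensity(2.0, 2 * sizeof(float));
```

With 4-byte floats this gives about 0.083 FLOPs per byte for the array add and 0.25 for the dot product. Both are low, so for arrays much larger than the cache the speedup from wider vectors is likely to be limited by memory bandwidth.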

Quick Reference

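Operation        Method               SSE Intrinsic (128-bit)
Add              a + b                _mm_add_ps
Subtract         a - b                _mm_sub_ps
Multiply         a * b                _mm_mul_ps
Divide           a / b                _mm_div_ps
FMA              fmadd(a,b,c)         _mm_fmadd_ps (requires FMA)
Sqrt             a.sqrt()             _mm_sqrt_ps
Horizontal sum   a.horizontal_sum()   Manual reduction
Min              a.min(b)             _mm_min_ps
Max              a.max(b)             _mm_max_ps

The intrinsics listed are the 128-bit forms. The 256-bit and 512-bit equivalents use the _mm256_ and _mm512_ prefixes.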
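
The horizontal sum has no single intrinsic, which is why the table lists it as a manual reduction. Below is one common way to write that reduction for four floats, offered as a sketch and not as the library's implementation. `horizontal_sum4()` and `SKETCH_USE_SSE` are hypothetical names, and a scalar path is included for targets without SSE.

```cpp
#include <cassert>
#include <cstddef>

#if defined(__SSE2__) || defined(_M_X64)
#  include <immintrin.h>
#  define SKETCH_USE_SSE 1
#endif

// Sum four consecutive floats starting at p.
inline float horizontal_sum4(const float* p) {
    assert(p != nullptr);
#if defined(SKETCH_USE_SSE)
    __m128 v     = _mm_loadu_ps(p);              // [a, b, c, d]
    __m128 hi    = _mm_movehl_ps(v, v);          // [c, d, c, d]
    __m128 sum2  = _mm_add_ps(v, hi);            // [a+c, b+d, ., .]
    __m128 lane1 = _mm_shuffle_ps(sum2, sum2,
                                  _MM_SHUFFLE(1, 1, 1, 1));  // [b+d, ...]
    __m128 total = _mm_add_ss(sum2, lane1);      // lane 0 = (a+c) + (b+d)
    return _mm_cvtss_f32(total);                 // extract lane 0
#else
    // Same pairing as the SSE path so both produce identical rounding.
    return (p[0] + p[2]) + (p[1] + p[3]);
#endif
}
```

The reduction takes log2(4) = 2 add steps. Wider vectors follow the same pattern: fold the upper half onto the lower half until one lane remains.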

See Also