This small library enables acceleration of bulk calls of certain math functions using SIMD instructions. Currently supported operations are exp, exp2, exp10, log, sigmoid and tanh. The library is designed with extensibility in mind. Optimized helper functions are found in fastops/core/FastIntrinsics.h and you are welcome to contribute your own.
Supported architectures:
- x86 (SSE/AVX/AVX2): runtime CPU dispatch selects the best available instruction set. Pre-AVX fallback uses the fmath library.
- AArch64 (NEON/SVE): runtime dispatch selects SVE when available (any vector width), otherwise falls back to 128-bit NEON with native FMA. SVE benefits even at 128-bit width thanks to predicated tail handling that eliminates scalar remainder loops.
fastops/fastops.h header provides the interface for the best version of each function. All functions are approximate, yet quite precise. Accuracy of each operation is detailed below along with operation description. All implementation architectures share the same polynomial coefficients and evaluation scheme, so accuracy is consistent across platforms.
The core implementation (fastops/core/FastIntrinsics.h) contains versions for fixed-size arrays that produce completely unrolled code for uncompromising performance. These may slow down compilation and thus are currently not exposed via the high-level dispatched interfaces. Be careful when using these versions: long fixed-size arrays may lead to extreme code bloat. The regular versions perform on par with them if your arrays are larger than 512 bytes.
A quote from Mikhail Parakhin, the Yandex CTO and creator of the library: "In its spirit the library is aimed to aid vectorization of any compute. Just code up the performer class similar to the existing ones and let it fly. Current performers include not only compute, but also memset/memcopy/memmove operations that are always inlined and so work much faster for short arrays. On short strings the gain may be as high as 2x."
The library requires a C++ compiler with C++17 support. The library itself only depends on fmath, which is a single-header library. The supporting code (the eval and benchmark programs) depends on the TCLAP option-parsing library. These dependencies are placed into contrib/libs inside the root fastops directory.
The library is built using cmake and the build process is simple and straightforward:
- $ mkdir build install
- $ cd build
- $ cmake ../
- $ make
- $ make install
Two tools are provided along with the library:
- tools/eval - lets one check the accuracy of operations under different conditions.
- tools/benchmark - compares performance of optimized SIMD versions (AVX/AVX2 on x86, NEON/SVE on AArch64) with baseline implementations.
Use `--help` for the set of supported options.
On x86, please note that running these tools on pre-AVX hardware makes little sense.
- tools/benchmark will refuse to run on such hardware and won't call AVX2 functions on AVX-only hardware.
- tools/eval allows selecting the instruction set via the command line and does not perform any checks. It will simply crash if run on incompatible hardware.
On AArch64, both tools use the SVE or NEON implementation via the same runtime dispatch.
The dispatched interfaces are available via the fastops/fastops.h header file. There are single- and double-precision versions of each operation. Template parameters include speed/accuracy and alignment controls.
- The speed/accuracy parameter is a bool letting you choose the faster or the more precise version of the algorithm.
- The alignment control selects whether the output array is treated as aligned (32-byte for x86 AVX). On AArch64 all NEON/SVE loads and stores are naturally unaligned, so the alignment flag has no effect. On x86, aligned SIMD operations on unaligned data may crash.
All the library functionality is directly available via the fastops/core/FastIntrinsics.h header, but then you must take care of hardware compatibility yourself. On x86, a tiny AVX and AVX2 hardware detection utility is available via fastops/core/avx_id.h. On AArch64, NEON is always present; SVE availability is detected at runtime via HaveSve() in fastops/sve/ops_sve.h.
Below we use the following terms:
- x - the input value
- EPS - relative error: EPS = abs(approx - real) / (abs(real) + 1e-100)
Computes the exponential function: exp(x) = e^x.
template <bool I_Exact=false, bool I_OutAligned=false>
void Exp(const float* from, size_t size, float* to);
template <bool I_Exact=false, bool I_OutAligned=false>
void Exp(const double* from, size_t size, double* to);
- float, inexact:
  - x < -87: accuracy degrades sharply; exp(x) <= 1.0001 * true_exp(x), usually significantly less. This is due to saturation of the single-precision range in the inexact version. If denormals are disabled, true_exp() exhibits the same behavior.
  - x >= -87: EPS <= 7.21e-06
- float, exact:
  - x < -87: in most cases the result is accurate. The corner cases arise only from different rounding directions of true_exp() and our implementation. The results may differ by up to 2x, but this is acceptable for denormals: the results are quite approximate in any case, these are just slightly different approximations.
  - x >= -87: EPS <= 4e-06
- double, inexact:
  - x < -708.39: exp(x) <= 1.0001 * true_exp(x), usually significantly less.
  - x >= -708.39: EPS <= 3.5e-06
- double, exact:
  - Entire range: EPS <= 2.3e-9
Compute base-2 exponent function: exp2(x) = 2^x. Internally this is a direct call to the Pow2V kernel with no input scaling, so it is slightly faster than Exp.
template <bool I_Exact=false, bool I_OutAligned=false>
void Exp2(const float* from, size_t size, float* to);
template <bool I_Exact=false, bool I_OutAligned=false>
void Exp2(const double* from, size_t size, double* to);
Same polynomial evaluation as Exp, so accuracy characteristics are identical — only the input saturation boundaries differ.
- float, inexact:
  - x < -125: accuracy degrades sharply due to saturation of the single-precision range.
  - x >= -125: EPS <= 7.21e-06
- float, exact:
  - x < -126: corner cases near denormals, same as `Exp`.
  - x >= -126: EPS <= 4e-06
- double, inexact:
  - x < -1020: accuracy degrades sharply.
  - x >= -1020: EPS <= 3.5e-06
- double, exact:
  - Entire range: EPS <= 2.3e-9
Compute base-10 exponent function: exp10(x) = 10^x. Internally this is Pow2V(x * log2(10)), sharing the same kernel as Exp.
template <bool I_Exact=false, bool I_OutAligned=false>
void Exp10(const float* from, size_t size, float* to);
template <bool I_Exact=false, bool I_OutAligned=false>
void Exp10(const double* from, size_t size, double* to);
Same polynomial evaluation as Exp, but the extra multiply by log2(10) adds a small additional error.
- float, inexact:
  - x < -37.5: accuracy degrades sharply due to saturation of the single-precision range.
  - x >= -37.5: EPS <= 8e-06
- float, exact:
  - x < -38: corner cases near denormals, same as `Exp`.
  - x >= -38: EPS <= 5e-06
- double, inexact:
  - x < -307: accuracy degrades sharply.
  - x >= -307: EPS <= 4e-06
- double, exact:
  - Entire range: EPS <= 3e-9
Computes the natural logarithm function.
template <bool I_Exact=false, bool I_OutAligned=false>
void Log(const float* from, size_t size, float* to);
template <bool I_Exact=false, bool I_OutAligned=false>
void Log(const double* from, size_t size, double* to);
- float, inexact:
  - x < 1.17613e-38: the result almost stops decreasing around the value of -88, while the actual log function keeps decreasing. This causes the result to become significantly greater than the actual value. If denormal support is disabled, the function returns -inf, the same as the precise one.
  - x >= 1.17613e-38: EPS <= 1e-5
- float, exact:
  - Entire range: EPS <= 4e-7
- double, inexact:
  - x < 2.99279772e-308: the result almost stops decreasing around the value of -708, while the actual log function keeps decreasing. This causes the result to become significantly greater than the actual value.
  - x >= 2.99279772e-308: EPS <= 1e-5
- double, exact:
  - Entire range: EPS <= 2e-7
Computes sigmoid function: sigm(x) = 1.0 / (1.0 + exp(-x)).
template <bool I_Exact=false, bool I_OutAligned=false>
void Sigmoid(const float* from, size_t size, float* to);
template <bool I_Exact=false, bool I_OutAligned=false>
void Sigmoid(const double* from, size_t size, double* to);
- float, inexact:
  - Entire range: EPS <= 8e-6
- float, exact:
  - Entire range: EPS <= 4.5e-6
- double, inexact:
  - Entire range: EPS <= 4e-6
- double, exact:
  - Entire range: EPS <= 1e-12
Computes the hyperbolic tangent (tanh) function.
template <bool I_Exact=false, bool I_OutAligned=false>
void Tanh(const float* from, size_t size, float* to);
template <bool I_Exact=false, bool I_OutAligned=false>
void Tanh(const double* from, size_t size, double* to);
Due to the behavior of tanh around 0, the relative error there is unstable. The computational algorithm is close to sigmoid, so this instability is related to error measurement rather than to function computation. For this reason, absolute error may give a more adequate indication of accuracy than relative error in some cases.
- float, inexact:
  - [-1, 1]: maximal absolute error is around 1e-06
  - Outside [-1, 1]: EPS < 1.1e-06
- float, exact:
  - [-1, 1]: maximal absolute error is around 1e-06
  - Outside [-1, 1]: EPS < 2.5e-07
- double, inexact:
  - Outside [-1, 1]: EPS < 1e-06
  - [-1, -0.1] V [0.1, 1]: EPS < 1.6e-05
  - [-0.1, -0.01] V [0.01, 0.1]: EPS < 1.2e-04
  - [-0.01, 0.01]: EPS <= 3e-04, or maximal absolute error 1e-06
- double, exact:
  - Entire range: EPS <= 1e-11