
ffi: add V8 fast-call path #63140

Draft
bengl wants to merge 4 commits into nodejs:main from bengl:bengl/ffi-fastcalls

Conversation

@bengl
Member

@bengl bengl commented May 6, 2026

HEADS UP: This likely isn't done yet. It's mainly here to compare approaches with @ShogunPanda.


Add a parallel dispatch path that uses V8 fast API calls instead of libffi for eligible native calls. At DynamicLibrary.getFunction time, generate a per-function JIT'd trampoline that strips V8's receiver argument and tail-calls the target. Signatures with callbacks, unsupported argument types, or more register-passed args than the platform ABI permits transparently fall back to libffi.
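
For context, this is roughly what a stock V8 fast API registration looks like under current V8 (a minimal illustration, not code from this PR). The fast callback receives the JS receiver as its first argument, which is exactly what the per-function trampoline has to strip before tail-calling an arbitrary dlsym'd target:

  #include <v8.h>
  #include <v8-fast-api-calls.h>

  // Fast path: called directly from optimized code. V8 passes the JS receiver
  // as the first argument, so a plain C function such as getpid() cannot be
  // registered as-is -- hence the receiver-stripping trampoline.
  static int32_t FastAdd(v8::Local<v8::Object> receiver, int32_t a, int32_t b) {
    return a + b;
  }

  // Slow path: used by unoptimized code and as the deopt fallback.
  static void SlowAdd(const v8::FunctionCallbackInfo<v8::Value>& info) {
    v8::Local<v8::Context> ctx = info.GetIsolate()->GetCurrentContext();
    int32_t a = info[0]->Int32Value(ctx).FromMaybe(0);
    int32_t b = info[1]->Int32Value(ctx).FromMaybe(0);
    info.GetReturnValue().Set(a + b);
  }

  v8::Local<v8::FunctionTemplate> MakeAddTemplate(v8::Isolate* isolate) {
    static const v8::CFunction kFastAdd = v8::CFunction::Make(FastAdd);
    return v8::FunctionTemplate::New(
        isolate, SlowAdd, v8::Local<v8::Value>(), v8::Local<v8::Signature>(), 2,
        v8::ConstructorBehavior::kThrow, v8::SideEffectType::kHasSideEffect,
        &kFastAdd);
  }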

Stub emitters cover Linux/macOS/FreeBSD on x86_64 and AArch64, Windows on x86_64 and AArch64, and Linux on AArch32. JIT memory is allocated per isolate via direct mmap with MAP_JIT on macOS and W^X enforcement elsewhere. The JS wrapper validates each argument per declared type, mirroring the libffi slow callback so the contract is identical across both paths and across V8 optimization tiers.
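
The JIT allocation follows the usual W^X pattern. Below is a minimal, self-contained sketch of that pattern on POSIX (illustrative only, not the PR's jit_memory.cc); the bytes are x86_64 code that returns 42, and on macOS arm64 the mapping would additionally need MAP_JIT as noted above:

  #include <sys/mman.h>
  #include <cstdint>
  #include <cstdio>
  #include <cstring>

  int main() {
    const size_t size = 4096;
    // 1. Map a writable, non-executable page.
    void* page = mmap(nullptr, size, PROT_READ | PROT_WRITE,
                      MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (page == MAP_FAILED) return 1;

    // 2. Emit machine code into it: x86_64 "mov eax, 42; ret".
    const uint8_t code[] = {0xB8, 0x2A, 0x00, 0x00, 0x00, 0xC3};
    memcpy(page, code, sizeof(code));

    // 3. Flip to read+execute (W^X: never writable and executable at once).
    if (mprotect(page, size, PROT_READ | PROT_EXEC) != 0) return 1;

    auto fn = reinterpret_cast<int (*)()>(page);
    printf("%d\n", fn());  // 42
    munmap(page, size);
    return 0;
  }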

The path is gated behind --experimental-ffi and can be disabled at build time with --without-ffi-fastcall. The previous shared-buffer JS fast path is removed, replaced by this fast-call path.


Some benchmarks:

                                   confidence improvement accuracy (*)     (**)    (***)
  ffi/add-f64.js n=10000000               ***    746.34 %      ±46.27%  ±64.21%  ±89.28%
  ffi/add-i32.js n=10000000               ***    807.83 %      ±44.39%  ±61.59%  ±85.62%
  ffi/getpid.js n=10000000                ***   1173.51 %     ±156.05% ±216.56% ±301.12%
  ffi/many-args.js n=10000000             ***    481.97 %       ±6.47%   ±8.96%  ±12.42%
  ffi/pointer-bigint.js n=10000000        ***    901.32 %      ±61.08%  ±84.75% ±117.82%
  ffi/sum-buffer.js n=10000000             **      9.51 %       ±5.38%   ±7.44%  ±10.28%
  

@nodejs-github-bot
Collaborator

Review requested:

  • @nodejs/ffi
  • @nodejs/gyp
  • @nodejs/startup
  • @nodejs/tsc

@nodejs-github-bot nodejs-github-bot added lib / src Issues and PRs related to general changes in the lib or src directory. needs-ci PRs that need a full CI run. labels May 6, 2026
@bengl bengl force-pushed the bengl/ffi-fastcalls branch from a8f73a3 to 8c767ff on May 6, 2026 04:12
@ronag
Member

ronag commented May 6, 2026

With this, how does ffi compare to napi? Do we have any guidelines on when to use which?

@ShogunPanda
Contributor

@ronag On my PR (which yields similar results to Bryan's) we're beating it by like 100x.

@ronag
Member

ronag commented May 6, 2026

So ffi is 100x faster than napi?

@ShogunPanda
Contributor

So ffi is 100x faster than napi?

Sorry, I thought you meant https://github.com/node-ffi-napi/node-ffi-napi.

We probably have to measure that once we agree on the direction.

@dmazzella

So ffi is 100x faster than napi?

Sorry, I thought you meant https://github.com/node-ffi-napi/node-ffi-napi.

We probably have to measure that once we agree on the direction.

node-ffi-napi isn't a good benchmark for performance, but koffi is. Version 3 performs even better than version 2; it's still in beta (but can already be used for testing).

@addaleax addaleax added ffi Issues and PRs related to experimental Foreign Function Interface support. labels May 6, 2026
@bengl bengl force-pushed the bengl/ffi-fastcalls branch 2 times, most recently from ff4f871 to 50fcbb2 on May 6, 2026 18:40
bengl added 4 commits May 6, 2026 14:43
Add a parallel dispatch path that uses V8 fast API calls instead of
libffi for eligible native calls. At DynamicLibrary.getFunction time,
generate a per-function JIT'd trampoline that strips V8's receiver
argument and tail-calls the target. Signatures with callbacks,
unsupported argument types, or more register-passed args than the
platform ABI permits transparently fall back to libffi.

Stub emitters cover Linux/macOS/FreeBSD on x86_64 and AArch64, Windows
on x86_64 and AArch64, and Linux on AArch32. JIT memory is allocated per
isolate via direct mmap with MAP_JIT on macOS and W^X enforcement
elsewhere. The JS wrapper validates each argument per declared type,
mirroring the libffi slow callback so the contract is identical across
both paths and across V8 optimization tiers.

The path is gated behind --experimental-ffi and can be disabled at
build time with --without-ffi-fastcall. The previous shared-buffer JS
fast path is removed, replaced by this fast-call path.

Signed-off-by: Bryan English <bryan@bryanenglish.com>
Adds CFunctionInfo::HasReceiver = kNo. When set, V8's TurboFan and
Turboshaft fast-call lowering omits the JS receiver from the C call
— the C function pointer is invoked with user args only, no receiver
in the first parameter register.

For Node's FFI fast-call path this means the receiver-strip JIT stub
is no longer needed: dlsym'd target functions can be registered
directly with V8 as the C address. Eliminates ~7 instructions per
call (AArch64) plus V8's own receiver-into-arg0 setup. Yields +3-24%
over the prior validators+stub path on FFI microbenchmarks (largest
gains on many-args and pointer-bigint), and beats the pre-fix
silent-truncation thin wrapper on every numeric and pointer benchmark
while preserving strict validation.

The change is gated by an enum on CFunctionInfo (default kYes,
backward-compatible). Existing fast-call users (DOM bindings, V8
internals) are unaffected. Patches in deps/v8 cover the API header,
the constructor, the GetFastApiCallTarget overload-matching, and the
js-call-reducer input-layout setup; the simplified-lowering and
turboshaft graph-builder loops already iterate from input 0 over
ArgumentCount() inputs and pick up the new layout automatically.

Signed-off-by: Bryan English <bryan@bryanenglish.com>
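
Assuming the HasReceiver=kNo patch described above (it is the proposal in this commit, not an existing V8 API), registering a dlsym'd symbol directly as a fast-call target could look roughly like the sketch below; the library and symbol names are made up for illustration:

  #include <dlfcn.h>
  #include <v8-fast-api-calls.h>

  // Sketch: expose int add(int32_t, int32_t) from a hypothetical libadd.so as
  // a V8 fast API target without any trampoline. Error handling omitted.
  v8::CFunction MakeFastTarget() {
    void* handle = dlopen("libadd.so", RTLD_NOW);
    void* target = dlsym(handle, "add");

    // Argument/return types as declared by the user at getFunction() time.
    static const v8::CTypeInfo kArgs[] = {
        v8::CTypeInfo(v8::CTypeInfo::Type::kInt32),
        v8::CTypeInfo(v8::CTypeInfo::Type::kInt32),
    };
    static const v8::CTypeInfo kReturn(v8::CTypeInfo::Type::kInt32);

    // With the proposed HasReceiver::kNo, V8 would omit the JS receiver from
    // the C call, so the argument list holds only the user arguments and the
    // dlsym'd address is the fast-call target as-is. (Stock V8 still expects
    // the receiver in slot 0.)
    static const v8::CFunctionInfo kInfo(
        kReturn, 2, kArgs /*, CFunctionInfo::HasReceiver::kNo -- proposed */);

    return v8::CFunction(target, &kInfo);
  }
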
With v8 fast-call patched to support HasReceiver=kNo (previous commit),
the dlsym'd target function pointer is registered directly with v8 as
the C function address — no JIT'd receiver-strip trampoline is needed.
This removes the entire stub-emitter and JIT-memory infrastructure,
along with the platform-specific argument-cap logic (v8's own 8-arg
fast-call cap takes over) and the boot-time self-test that verified
JIT pages were executable.

Deleted: src/ffi/fastcall/jit_memory.{h,cc}, stub_emitter.h, the four
per-platform stub_emitter_*.cc files (aarch64, arm, x64_sysv, x64_win),
test/cctest/test_ffi_fastcall_{emitter,jit}.cc, and the related node.gyp
/ config.gypi entries. The CFunctionInfoBundle drops its arg_classes
and result_class members (they only existed to feed the stub emitters).
IsFastCallEligible loses the per-platform GP/FP/Win64 caps and the
AArch32 i64-arg rejection — v8 now handles those uniformly. FastCallState
drops stub_entry/stub_alloc_size; its destructor no longer needs
to free JIT pages. Net change: ~2000 lines deleted across src/ and
test/cctest/.

Functional behavior is unchanged: all 15 FFI tests and the FFI cctests
still pass. The benchmark gain over the prior shipping wrapper is
+2-24% on AArch64 macOS (largest on many-args at +24% — saved register-
shifts in the deleted stub).

Signed-off-by: Bryan English <bryan@bryanenglish.com>
@bengl bengl force-pushed the bengl/ffi-fastcalls branch from 50fcbb2 to 1fd74c1 on May 6, 2026 18:43
@bengl
Member Author

bengl commented May 6, 2026

Okay, so, new benchmark:

                                   confidence improvement accuracy (*)     (**)    (***)
  ffi/add-f64.js n=10000000               ***   2828.52 %     ±200.38% ±278.11% ±386.81%
  ffi/add-i32.js n=10000000               ***   2696.98 %     ±205.33% ±284.98% ±396.37%
  ffi/getpid.js n=10000000                ***   1353.74 %       ±7.47%  ±10.09%  ±13.43%
  ffi/many-args.js n=10000000             ***   5098.46 %      ±20.96%  ±29.08%  ±40.44%
  ffi/pointer-bigint.js n=10000000        ***   1040.00 %       ±4.04%   ±5.46%   ±7.26%
  ffi/sum-buffer.js n=10000000            ***     14.96 %       ±5.37%   ±7.42%  ±10.27%

The main reasons this improved are:

  1. I minimized the impact of the wrapper functions.
  2. I took @addaleax's suggestion on @ShogunPanda's PR to see if we can get some wins from not having to do the extra trampoline on the native side. It makes things simpler (apart from the V8 change) and effectively turns V8 fast API calls into an FFI implementation in their own right.

The net result is that there's not a whole lot extra going on that V8 wouldn't have to do regardless, so I'd guess we're approaching limits here.

I'll hold off on making the V8 upstream patch until we see @ShogunPanda's latest approach and decide which way we want to go. I also want to put together a benchmark repo that tests this against existing userland FFI implementations, against NAPI (with and without the C++ wrapper lib), and against non-NAPI addons.
