ffi: add V8 fast-call path #63140
Conversation
With this, how does ffi compare to napi? Do we have any guidelines on when to use which?
@ronag On my PR (which yields similar results as Bryan's) we're beating it by like 100x. |
So ffi is 100x faster than napi?
Sorry, I thought you meant https://github.com/node-ffi-napi/node-ffi-napi. We probably have to measure that once we agree on the direction. |
node-ffi-napi isn't a good benchmark for performance, but koffi is. Version 3 performs even better than version 2; it's still in beta (but can already be used for testing). |
ff4f871 to 50fcbb2
Add a parallel dispatch path that uses V8 fast API calls instead of libffi for eligible native calls. At DynamicLibrary.getFunction time, generate a per-function JIT'd trampoline that strips V8's receiver argument and tail-calls the target. Signatures with callbacks, unsupported argument types, or more register-passed args than the platform ABI permits transparently fall back to libffi.

Stub emitters cover Linux/macOS/FreeBSD on x86_64 and AArch64, Windows on x86_64 and AArch64, and Linux on AArch32. JIT memory is allocated per isolate via direct mmap with MAP_JIT on macOS and W^X enforcement elsewhere. The JS wrapper validates each argument per declared type, mirroring the libffi slow callback so the contract is identical across both paths and across V8 optimization tiers.

The path is gated behind --experimental-ffi and can be disabled at build time with --without-ffi-fastcall. The previous shared-buffer JS fast path is removed, replaced by this fast-call path.

Signed-off-by: Bryan English <bryan@bryanenglish.com>
Adds CFunctionInfo::HasReceiver = kNo. When set, V8's TurboFan and Turboshaft fast-call lowering omits the JS receiver from the C call — the C function pointer is invoked with user args only, no receiver in the first parameter register.

For Node's FFI fast-call path this means the receiver-strip JIT stub is no longer needed: dlsym'd target functions can be registered directly with V8 as the C address. Eliminates ~7 instructions per call (AArch64) plus V8's own receiver-into-arg0 setup.

Yields +3-24% over the prior validators+stub path on FFI microbenchmarks (largest gains on many-args and pointer-bigint), and beats the pre-fix silent-truncation thin wrapper on every numeric and pointer benchmark while preserving strict validation.

The change is gated by an enum on CFunctionInfo (default kYes, backward-compatible). Existing fast-call users (DOM bindings, V8 internals) are unaffected.

Patches in deps/v8 cover the API header, the constructor, the GetFastApiCallTarget overload-matching, and the js-call-reducer input-layout setup; the simplified-lowering and turboshaft graph-builder loops already iterate from input 0 over ArgumentCount() inputs and pick up the new layout automatically.

Signed-off-by: Bryan English <bryan@bryanenglish.com>
With v8 fast-call patched to support HasReceiver=kNo (previous commit),
the dlsym'd target function pointer is registered directly with v8 as
the C function address — no JIT'd receiver-strip trampoline is needed.
This removes the entire stub-emitter and JIT-memory infrastructure,
along with the platform-specific argument-cap logic (v8's own 8-arg
fast-call cap takes over) and the boot-time self-test that verified
JIT pages were executable.
Deleted: src/ffi/fastcall/jit_memory.{h,cc}, stub_emitter.h, the four
per-platform stub_emitter_*.cc files (aarch64, arm, x64_sysv, x64_win),
test/cctest/test_ffi_fastcall_{emitter,jit}.cc, and the related node.gyp
/ config.gypi entries. The CFunctionInfoBundle drops its arg_classes
and result_class members (they only existed to feed the stub emitters).
IsFastCallEligible loses the per-platform GP/FP/Win64 caps and the
AArch32 i64-arg rejection — v8 now handles those uniformly. FastCallState
drops stub_entry/stub_alloc_size; its destructor no longer needs
to free JIT pages. Net change: ~2000 lines deleted across src/ and
test/cctest/.
Functional behavior is unchanged: all 15 FFI tests and the FFI cctests
still pass. The benchmark gain over the prior shipping wrapper is
+2-24% on AArch64 macOS (largest on many-args at +24%, from the
argument-register shifts the deleted stub no longer performs).
Signed-off-by: Bryan English <bryan@bryanenglish.com>
Okay, so, new benchmark: The main reasons this improved are:
The net result is that there's not a whole lot extra going on that V8 wouldn't have to do regardless, so I'd guess we're approaching the limits here. I'll hold off on making the V8 upstream patch until we also see @ShogunPanda's latest approach and decide which way we want to go. I also want to put together a benchmark repo that tests this against existing userland FFI implementations, against NAPI (with and without the C++ wrapper lib), and against non-NAPI addons.
HEADS UP: This likely isn't done yet. It's mainly here to compare approaches with @ShogunPanda.
Some benchmarks: