# ThreadContext Benchmark Report — 2026-03-21 (updated 2026-04-07)

## Objective

[…] allocated on the native heap via `ProfiledThread`.

## Environment

### 2026-04-07 (current)

| Property    | Value                       |
| ----------- | --------------------------- |
| OS          | macOS Darwin 25.3.0 (arm64) |
| JDK         | Temurin 25.0.2              |
| JMH         | 1.37                        |
| Forks       | 1                           |
| Warmup      | 2 iterations x 1 s          |
| Measurement | 3 iterations x 2 s          |
| Mode        | Average time (ns/op)        |

### 2026-03-21 (baseline)

| Property    | Value                       |
| ----------- | --------------------------- |
| OS          | macOS Darwin 25.3.0 (arm64) |
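
The harness settings above map directly onto JMH annotations. A minimal sketch mirroring the 2026-04-07 configuration; the benchmark class name comes from the results below, but the method bodies are elided placeholders, and using `@Threads(4)` for the `_4t` variants is an assumption:

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Configuration sketch only; values mirror the 2026-04-07 run.
@BenchmarkMode(Mode.AverageTime)          // Mode: average time
@OutputTimeUnit(TimeUnit.NANOSECONDS)     // reported as ns/op
@Fork(1)                                  // Forks: 1
@Warmup(iterations = 2, time = 1)         // 2 iterations x 1 s
@Measurement(iterations = 3, time = 2)    // 3 iterations x 2 s
@State(Scope.Thread)
public class ThreadContextBenchmark {
    @Benchmark
    public void setContextFull() { /* body elided */ }

    @Benchmark
    @Threads(4)                           // presumed shape of the _4t variant
    public void setContextFull_4t() { /* body elided */ }
}
```

The baseline run differed only in `@Fork(2)` and the JDK vendor (Zulu 25).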

## Results

### 2026-04-07 — Current (Temurin 25, 1 fork)

```
Benchmark                                  Mode  Cnt   Score   Error  Units
ThreadContextBenchmark.clearContext        avgt    3   4.279 ± 0.236  ns/op
ThreadContextBenchmark.getSpanId           avgt    3   1.046 ± 0.054  ns/op
ThreadContextBenchmark.setAttrCacheHit     avgt    3   9.512 ± 0.213  ns/op
ThreadContextBenchmark.setContextFull      avgt    3   5.993 ± 0.094  ns/op
ThreadContextBenchmark.setContextFull_2t   avgt    3   6.360 ± 0.680  ns/op
ThreadContextBenchmark.setContextFull_4t   avgt    3   6.162 ± 1.297  ns/op
ThreadContextBenchmark.spanLifecycle       avgt    3  16.345 ± 0.935  ns/op
ThreadContextBenchmark.spanLifecycle_2t    avgt    3  14.475 ± 1.104  ns/op
ThreadContextBenchmark.spanLifecycle_4t    avgt    3  15.233 ± 0.302  ns/op
```

### 2026-03-21 — Baseline (Zulu 25, 2 forks)

```
Benchmark                                  Mode  Cnt   Score   Error  Units
...
ThreadContextBenchmark.spanLifecycle_2t    avgt    9  30.674 ± 0.093  ns/op
ThreadContextBenchmark.spanLifecycle_4t    avgt    9  32.203 ± 0.309  ns/op
```

### Single-Threaded Breakdown (2026-04-07)

| Benchmark         | ns/op  | Error   | vs Baseline | Description                                            |
| ----------------- | ------ | ------- | ----------- | ------------------------------------------------------ |
| `clearContext`    | 4.279  | ± 0.236 | -14.6%      | Zero-fill all four context fields                      |
| `getSpanId`       | 1.046  | ± 0.054 | N/A (new)   | Direct ByteBuffer read of span ID                      |
| `setContextFull`  | 5.993  | ± 0.094 | **-46.0%**  | Write localRootSpanId, spanId, traceIdHigh, traceIdLow |
| `setAttrCacheHit` | 9.512  | ± 0.213 | -11.2%      | Set string attribute (dictionary cache hit)            |
| `spanLifecycle`   | 16.345 | ± 0.935 | **-46.3%**  | `setContextFull` + `setAttrCacheHit` combined          |

`getSpanId` at **1.046 ns** confirms the direct ByteBuffer read compiles down to a
single memory load — effectively free compared to the removed `getContext()` path
(71.6 ns, dominated by `long[]` allocation across the JNI boundary).

`setContextFull` improved from 11.1 ns to 6.0 ns (-46%), and `spanLifecycle` from
30.4 ns to 16.3 ns (-46%). As in the baseline, `spanLifecycle` remains roughly the
sum of its parts: `setContextFull` (6.0 ns) + `setAttrCacheHit` (9.5 ns) plus ~0.8 ns
of loop and call overhead. The gains reflect OTEP record write optimizations
accumulated since the baseline: a redundant `full_fence` removed, a tightened
detach/attach window, and direct sidecar writes replacing dictionary lookups for LRS.
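
The shape of the hot operations can be sketched with plain `java.nio` calls. This is an illustrative sketch only: the 32-byte layout, the field offsets, and the class name are assumptions, not the actual ThreadContext source.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of a per-thread direct-buffer context record; offsets and the
// class name are assumed for illustration.
final class ContextRecordSketch {
    private static final int LOCAL_ROOT_SPAN_ID = 0;  // byte offsets into the record
    private static final int SPAN_ID            = 8;
    private static final int TRACE_ID_HIGH      = 16;
    private static final int TRACE_ID_LOW       = 24;

    // One 32-byte record per thread, read and written without a JNI transition.
    private final ByteBuffer buf =
            ByteBuffer.allocateDirect(32).order(ByteOrder.nativeOrder());

    // setContextFull: four absolute long stores.
    void setContextFull(long localRootSpanId, long spanId, long traceIdHigh, long traceIdLow) {
        buf.putLong(LOCAL_ROOT_SPAN_ID, localRootSpanId);
        buf.putLong(SPAN_ID, spanId);
        buf.putLong(TRACE_ID_HIGH, traceIdHigh);
        buf.putLong(TRACE_ID_LOW, traceIdLow);
    }

    // getSpanId: a single absolute load; no long[] allocation, no native call.
    long getSpanId() {
        return buf.getLong(SPAN_ID);
    }

    // clearContext: zero-fill all four fields.
    void clearContext() {
        setContextFull(0L, 0L, 0L, 0L);
    }
}
```

Absolute (indexed) `putLong`/`getLong` avoids position bookkeeping, which is one reason the read side can collapse to a single load after JIT compilation.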

### False Sharing Analysis (2026-04-07)

| Benchmark        | 1 thread | 2 threads | 4 threads | 1t to 2t | 1t to 4t |
| ---------------- | -------- | --------- | --------- | -------- | -------- |
| `setContextFull` | 5.993    | 6.360     | 6.162     | +6.1%    | +2.8%    |
| `spanLifecycle`  | 16.345   | 14.475    | 15.233    | -11.4%\* | -6.8%\*  |

\* The `spanLifecycle` multi-thread results are below the single-thread score.
With 1 fork and 3 measurement iterations, the noise floor is ~1 ns, so the
inversion is within measurement noise, not a real effect. Re-run with 2+ forks
to confirm.

**At 4 threads (`setContextFull`):** a +2.8% increase, well below the
false-sharing threshold (typically a 2-10x degradation). This is consistent with
L2/L3 cache pressure and MOESI coherency traffic, with no same-cache-line
contention.

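
For reference, the standard mitigation when per-thread records *do* share a line is to pad each record out to the line size. A hypothetical sketch, assuming 64-byte lines; the measured records need no such padding because each `ProfiledThread` is a separate native-heap allocation:

```java
// Hypothetical mitigation only, not part of the measured implementation.
final class PaddedContextRecord {
    static final int CACHE_LINE_BYTES = 64;  // typical arm64/x86-64 line size
    // 4 x 8 bytes of payload...
    long localRootSpanId, spanId, traceIdHigh, traceIdLow;
    // ...plus 4 x 8 bytes of padding, so two adjacent records cannot
    // occupy the same cache line.
    long pad0, pad1, pad2, pad3;
}
```

Note that the JVM may reorder fields, so field padding is best-effort; the robust tool on HotSpot is the JDK-internal `@Contended` annotation (which requires unlocking `-XX:-RestrictContended`).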
### Verdict

**No false sharing detected.** Consistent with the 2026-03-21 result. Per-thread
`OtelThreadContextRecord` structs embedded in `ProfiledThread` are sufficiently
separated in memory. Throughput scales near-linearly from 1 to 4 threads.

## JNI vs ByteBuffer Implementation Comparison

The "JNI" column was captured on Java 25 (Temurin 25.0.2), where […].
Both are standard OpenJDK 25 distributions with no profiling-relevant
divergences; the vendor difference does not materially affect the comparison.

| Benchmark         | JNI (ns/op) | ByteBuffer 2026-03-21 | ByteBuffer 2026-04-07 | JNI → current |
| ----------------- | ----------: | --------------------: | --------------------: | ------------: |
| `clearContext`    |         6.5 |                 5.011 |                 4.279 |          1.5x |
| `setContextFull`  |        20.0 |                11.104 |                 5.993 |          3.3x |
| `setAttrCacheHit` |       114.8 |                10.707 |                 9.512 |         12.1x |
| `spanLifecycle`   |       146.3 |                30.430 |                16.345 |          9.0x |

The span lifecycle hot path improved by **9.0x** from JNI (146 ns) to the current
ByteBuffer implementation (16.3 ns). The dominant contributor is `setContextAttribute`,
which improved by **12.1x** (115 ns → 9.5 ns). The additional gains since the
2026-03-21 baseline reflect OTEP record write optimizations (the removed
`full_fence`, the tightened detach/attach window, and direct sidecar writes).

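
The 12.1x gap on the attribute path is easiest to see in code: a dictionary cache hit is one map lookup plus one small buffer store, with no string marshalling across a native boundary. A minimal sketch; the dictionary shape, slot offsets, and names are assumptions, since the real attribute encoding is not shown in this report:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.util.HashMap;
import java.util.Map;

// Illustrative dictionary-cache sketch, not the real implementation.
final class AttrDictionarySketch {
    private final Map<String, Integer> dictionary = new HashMap<>();
    private final ByteBuffer record =
            ByteBuffer.allocateDirect(64).order(ByteOrder.nativeOrder());
    private int nextId = 1;

    // Cache hit: one HashMap lookup plus one 4-byte store into the record.
    // Cache miss: register the string once, then store its id.
    void setAttr(int slotOffset, String value) {
        Integer id = dictionary.get(value);
        if (id == null) {
            id = nextId++;
            dictionary.put(value, id);
        }
        record.putInt(slotOffset, id);
    }

    int slotId(int slotOffset) {
        return record.getInt(slotOffset);
    }
}
```

The JNI path paid for a full native transition and string conversion on every call; here repeated values reduce to an integer id write.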
### Why the JNI Path Was Slow

[…] the cache-hit cost of ~11 ns, not the registration cost.

Based on these results, the per-span instrumentation cost on the context
hot path is:

| Operation            | Cost (2026-03-21) | Cost (2026-04-07) |
| -------------------- | ----------------: | ----------------: |
| Span start (set IDs) |            ~11 ns |             ~6 ns |
| Set one attribute    |            ~11 ns |            ~10 ns |
| Full span lifecycle  |            ~30 ns |            ~16 ns |
| Span end (clear)     |             ~5 ns |             ~4 ns |
| Read span ID         |    ~72 ns (alloc) |    ~1 ns (direct) |
| **Total per-span**   |        **~35 ns** |        **~20 ns** |

At 20 ns per span, a single thread can sustain ~50 million span
transitions per second before the context layer becomes a bottleneck.
For a typical web application processing 10K-100K requests/second, this
represents less than 0.002% of available CPU time.
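
The headroom figure is simply the reciprocal of the per-span cost:

```java
// Headroom derivation: a thread spending ~20 ns per span can sustain
// 1e9 / 20 = 5e7 span transitions per second (1e9 / 35 ≈ 2.86e7 at baseline).
final class Headroom {
    static long sustainedSpansPerSecond(long costPerSpanNs) {
        return 1_000_000_000L / costPerSpanNs;
    }
}
```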

## Recommendations

1. **No action needed on false sharing.** The current `ProfiledThread`
   allocation strategy (one heap allocation per thread) provides
   sufficient cache-line isolation, confirmed by both the 2026-03-21 and
   2026-04-07 runs.

2. **`getContext` allocation (resolved).** The original `getContext` method
   allocated a `long[]` per call (~71.6 ns). Resolved by replacing `getContext()`
   with direct `DirectByteBuffer` reads in `getSpanId()` (1.0 ns) and
   `getRootSpanId()`, eliminating the JNI call and the array allocation entirely.

3. **Re-run multi-threaded variants with 2+ forks.** The 2026-04-07 run used
   1 fork with 3 iterations; `setContextFull_4t` has a ±21% error bar and the
   `spanLifecycle` multi-thread results show a noise-floor inversion. A 2-fork
   run would produce publication-quality error bars.

4. **Higher thread counts.** To stress-test on server hardware, add `@Threads(8)`
   and `@Threads(16)` variants and run on a machine with sufficient physical
   cores. On a laptop, 4 threads already saturate the performance core complex.

5. **Linux `perf stat` validation.** For definitive cache-miss evidence, run on
   Linux with `perf stat -e L1-dcache-load-misses,LLC-load-misses` and compare
   miss rates across thread counts.