Commit e9d5901 (parent 3f0a732)

docs: update benchmark report with 2026-04-07 results

1 file changed: doc/performance/reports/thread-context-benchmark-2026-03-21.md
97 additions, 71 deletions
@@ -1,4 +1,4 @@
-# ThreadContext Benchmark Report — 2026-03-21
+# ThreadContext Benchmark Report — 2026-03-21 (updated 2026-04-07)
 
 ## Objective
 
@@ -8,6 +8,20 @@ allocated on the native heap via `ProfiledThread`.
 
 ## Environment
 
+### 2026-04-07 (current)
+
+| Property | Value |
+|----------------|---------------------------------|
+| OS | macOS Darwin 25.3.0 (arm64) |
+| JDK | Temurin 25.0.2 |
+| JMH | 1.37 |
+| Forks | 1 |
+| Warmup | 2 iterations x 1 s |
+| Measurement | 3 iterations x 2 s |
+| Mode | Average time (ns/op) |
+
+### 2026-03-21 (baseline)
+
 | Property | Value |
 |----------------|---------------------------------|
 | OS | macOS Darwin 25.3.0 (arm64) |
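[Editor's note: the JMH settings above correspond to a harness of roughly the following shape. This is a sketch under assumptions: the annotation values mirror the environment table and the benchmark names mirror the report's results, but the benchmark bodies are placeholders, not the project's actual source.]

```java
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

// Hypothetical skeleton matching the reported settings:
// 2 warmup iterations x 1 s, 3 measurement iterations x 2 s, 1 fork, avgt in ns/op.
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.NANOSECONDS)
@Warmup(iterations = 2, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 3, time = 2, timeUnit = TimeUnit.SECONDS)
@Fork(1)
@State(Scope.Thread)
public class ThreadContextBenchmark {

    @Benchmark
    public void setContextFull() {
        // placeholder: write localRootSpanId, spanId, traceIdHigh, traceIdLow
    }

    // Multi-threaded variants differ only in the concurrent thread count.
    @Benchmark
    @Threads(4)
    public void setContextFull_4t() {
        // placeholder: same body, run on 4 concurrent threads
    }
}
```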
@@ -37,7 +51,22 @@ of 1.
 
 ## Results
 
-### Raw JMH Output
+### 2026-04-07 — Current (Temurin 25, 1 fork)
+
+```
+Benchmark                                  Mode  Cnt   Score   Error  Units
+ThreadContextBenchmark.clearContext        avgt    3   4.279 ± 0.236  ns/op
+ThreadContextBenchmark.getSpanId           avgt    3   1.046 ± 0.054  ns/op
+ThreadContextBenchmark.setAttrCacheHit     avgt    3   9.512 ± 0.213  ns/op
+ThreadContextBenchmark.setContextFull      avgt    3   5.993 ± 0.094  ns/op
+ThreadContextBenchmark.setContextFull_2t   avgt    3   6.360 ± 0.680  ns/op
+ThreadContextBenchmark.setContextFull_4t   avgt    3   6.162 ± 1.297  ns/op
+ThreadContextBenchmark.spanLifecycle       avgt    3  16.345 ± 0.935  ns/op
+ThreadContextBenchmark.spanLifecycle_2t    avgt    3  14.475 ± 1.104  ns/op
+ThreadContextBenchmark.spanLifecycle_4t    avgt    3  15.233 ± 0.302  ns/op
+```
+
+### 2026-03-21 — Baseline (Zulu 25, 2 forks)
 
 ```
 Benchmark                                  Mode  Cnt   Score   Error  Units
@@ -51,55 +80,45 @@ ThreadContextBenchmark.spanLifecycle_2t avgt 9 30.674 ± 0.093 ns/op
 ThreadContextBenchmark.spanLifecycle_4t    avgt    9  32.203 ± 0.309  ns/op
 ```
 
-### Single-Threaded Breakdown
-
-| Benchmark | ns/op | Error | Description |
-|------------------|--------|---------|------------------------------------------------|
-| `clearContext` | 5.011 | ± 0.039 | Zero-fill all four context fields |
-| `setContextFull` | 11.104 | ± 0.338 | Write localRootSpanId, spanId, traceIdHigh, traceIdLow (4-arg put) |
-| `setAttrCacheHit`| 10.707 | ± 0.077 | Set string attribute (dictionary cache hit) |
-| `spanLifecycle` | 30.430 | ± 0.129 | `setContextFull` + `setAttrCacheHit` combined |
+### Single-Threaded Breakdown (2026-04-07)
 
-`spanLifecycle` at ~30.4 ns is consistent with the sum of its parts:
-`setContextFull` (~11.1 ns) + `setAttrCacheHit` (~10.7 ns) + loop and
-call overhead (~8.6 ns). All error bars are under 2% of the score,
-indicating stable measurements.
+| Benchmark | ns/op | Error | vs Baseline | Description |
+|------------------|--------|---------|-------------|------------------------------------------------|
+| `clearContext` | 4.279 | ± 0.236 | -14.6% | Zero-fill all four context fields |
+| `getSpanId` | 1.046 | ± 0.054 | N/A (new) | Direct ByteBuffer read of span ID |
+| `setContextFull` | 5.993 | ± 0.094 | **-46.0%** | Write localRootSpanId, spanId, traceIdHigh, traceIdLow |
+| `setAttrCacheHit`| 9.512 | ± 0.213 | -11.2% | Set string attribute (dictionary cache hit) |
+| `spanLifecycle` | 16.345 | ± 0.935 | **-46.3%** | `setContextFull` + `setAttrCacheHit` combined |
 
-Note: the original `getContext` benchmark (71.6 ns, dominated by `long[]` allocation)
-was removed when `getContext()` was replaced by direct `DirectByteBuffer` reads in
-`getSpanId()` and `getRootSpanId()`, eliminating the JNI call and the array allocation.
-The `getSpanId` benchmark was added as a replacement but was not run at the time of
-this report.
+`getSpanId` at **1.046 ns** confirms the direct ByteBuffer read compiles to a
+single memory load — effectively free compared to the removed `getContext()` path
+(71.6 ns, dominated by `long[]` JNI allocation).
 
-### False Sharing Analysis
+`setContextFull` improved from 11.1 ns to 6.0 ns (-46%), and `spanLifecycle` from
+30.4 ns to 16.3 ns (-46%). This reflects OTEP record write optimizations accumulated
+since the baseline (removed redundant `full_fence`, tightened detach/attach window,
+direct sidecar writes replacing dictionary lookups for LRS).
 
-Each thread's `Context` struct is allocated inside its own
-`ProfiledThread` on the native heap. If these structs happen to share a
-cache line (64 bytes on arm64), concurrent writes from different threads
-would cause cache-line bouncing and measurable latency increases.
+### False Sharing Analysis (2026-04-07)
 
 | Benchmark | 1 thread | 2 threads | 4 threads | 1t to 2t | 1t to 4t |
 |------------------|----------|-----------|-----------|----------|----------|
-| `setContextFull` | 11.104 | 11.081 | 11.741 | -0.2% | +5.7% |
-| `spanLifecycle` | 30.430 | 30.674 | 32.203 | +0.8% | +5.8% |
-
-**At 2 threads:** Both benchmarks are within error margins of the
-single-threaded score. No contention detected.
+| `setContextFull` | 5.993 | 6.360 | 6.162 | +6.1% | +2.8% |
+| `spanLifecycle` | 16.345 | 14.475 | 15.233 | -11.5%\* | -6.8%\* |
 
-**At 4 threads:** Both show a ~5.7-5.8% increase. This is well below
-the 2-10x degradation characteristic of true false sharing. The modest
-increase is consistent with:
+\* The `spanLifecycle` multi-thread results are below the single-thread score.
+With 1 fork and 3 measurement iterations, noise floor is ~1 ns. The inversion is
+within measurement noise, not a real effect. Re-run with 2+ forks to confirm.
 
-- L2 cache pressure from 4 active working sets.
-- OS thread scheduling overhead on a shared core complex.
-- Minor cross-core coherency traffic (MOESI protocol) that does not
-  indicate same-cache-line contention.
+**At 4 threads (`setContextFull`):** +2.8% increase. Well below false-sharing
+threshold (typically 2-10x). Consistent with L2/L3 cache pressure and MOESI
+coherency traffic with no same-cache-line contention.
 
 ### Verdict
 
-**No false sharing detected.** Per-thread `Context` structs allocated
-via `ProfiledThread` are sufficiently separated in memory. Throughput
-scales near-linearly from 1 to 4 threads.
+**No false sharing detected.** Consistent with the 2026-03-21 result. Per-thread
+`OtelThreadContextRecord` structs embedded in `ProfiledThread` are sufficiently
+separated in memory. Throughput scales near-linearly from 1 to 4 threads.
 
 ## JNI vs ByteBuffer Implementation Comparison
 
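[Editor's note: the false-sharing verdict above hinges on per-thread records never sharing a 64-byte cache line. A minimal sketch of the layout arithmetic, assuming a simple pad-to-cache-line scheme; the class and padding scheme are hypothetical illustrations, the real allocator is `ProfiledThread`.]

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Illustrates the property the false-sharing analysis checks for: whether two
// threads' context records can land on the same 64-byte cache line. Padding
// each 32-byte record out to the cache-line size guarantees isolation.
final class PaddedContexts {
    static final int CACHE_LINE = 64;               // bytes on arm64 (and most x86-64)
    static final int RECORD = 4 * Long.BYTES;       // 32 bytes: four long fields
    static final int STRIDE = CACHE_LINE;           // pad 32 -> 64

    private final ByteBuffer buf;

    PaddedContexts(int threads) {
        buf = ByteBuffer.allocateDirect(threads * STRIDE).order(ByteOrder.nativeOrder());
    }

    // Base offset of thread t's record; with STRIDE == CACHE_LINE,
    // no two records ever overlap a cache line.
    int base(int t) { return t * STRIDE; }

    void setSpanId(int t, long spanId) { buf.putLong(base(t) + 8, spanId); }

    long getSpanId(int t) { return buf.getLong(base(t) + 8); }

    // True iff records a and b occupy disjoint cache lines.
    boolean isolated(int a, int b) {
        return base(a) / CACHE_LINE != base(b) / CACHE_LINE;
    }
}
```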
@@ -118,16 +137,18 @@ The "JNI" column was captured on Java 25 (Temurin 25.0.2) where
 Both are standard OpenJDK 25 distributions with no profiling-relevant
 divergences; the vendor difference does not materially affect the comparison.
 
-| Benchmark | JNI (ns/op) | ByteBuffer (ns/op) | Speedup |
-|------------------|------------:|-------------------:|--------:|
-| `clearContext` | 6.5 | 5.011 | 1.3x |
-| `setContextFull` | 20.0 | 11.104 | 1.8x |
-| `setAttrCacheHit`| 114.8 | 10.707 | 10.7x |
-| `spanLifecycle` | 146.3 | 30.430 | 4.8x |
+| Benchmark | JNI (ns/op) | ByteBuffer 2026-03-21 | ByteBuffer 2026-04-07 | JNI→current |
+|------------------|------------:|----------------------:|----------------------:|------------:|
+| `clearContext` | 6.5 | 5.011 | 4.279 | 1.5x |
+| `setContextFull` | 20.0 | 11.104 | 5.993 | 3.3x |
+| `setAttrCacheHit`| 114.8 | 10.707 | 9.512 | 12.1x |
+| `spanLifecycle` | 146.3 | 30.430 | 16.345 | 9.0x |
 
-The span lifecycle hot path improved by **4.8x** (146 ns to 30 ns).
-The dominant contributor is `setContextAttribute`, which improved by
-**10.7x** (115 ns to 11 ns).
+The span lifecycle hot path improved by **9.0x** from JNI (146 ns) to the current
+ByteBuffer implementation (16.3 ns). The dominant contributor is `setContextAttribute`,
+which improved by **12.1x** (115 ns → 9.5 ns). Additional gains since the 2026-03-21
+baseline reflect OTEP record write optimizations (removed `full_fence`, tightened
+detach/attach, direct sidecar writes).
 
 ### Why the JNI Path Was Slow
 
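[Editor's note: the JNI-vs-ByteBuffer comparison above rests on the ByteBuffer path being plain memory stores and loads with no JNI transition. A minimal plain-Java sketch of such a record follows; the offsets, record size, and class name are illustrative assumptions, since the real record lives on the native heap inside `ProfiledThread`.]

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Sketch of a direct-ByteBuffer context record with the four fields the
// report names. Absolute puts/gets at fixed offsets compile down to plain
// memory stores/loads, which is why getSpanId measures at ~1 ns.
final class ContextRecordSketch {
    // Field offsets in bytes (assumed layout, for illustration only).
    private static final int LOCAL_ROOT_SPAN_ID = 0;
    private static final int SPAN_ID            = 8;
    private static final int TRACE_ID_HIGH      = 16;
    private static final int TRACE_ID_LOW       = 24;

    private final ByteBuffer buf =
        ByteBuffer.allocateDirect(32).order(ByteOrder.nativeOrder());

    // "setContextFull": four plain long stores, no JNI transition.
    void setContextFull(long localRootSpanId, long spanId,
                        long traceIdHigh, long traceIdLow) {
        buf.putLong(LOCAL_ROOT_SPAN_ID, localRootSpanId);
        buf.putLong(SPAN_ID, spanId);
        buf.putLong(TRACE_ID_HIGH, traceIdHigh);
        buf.putLong(TRACE_ID_LOW, traceIdLow);
    }

    // "getSpanId": a single memory load, no long[] allocation.
    long getSpanId() {
        return buf.getLong(SPAN_ID);
    }

    // "clearContext": zero-fill all four fields.
    void clearContext() {
        setContextFull(0L, 0L, 0L, 0L);
    }
}
```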
@@ -253,36 +274,41 @@ the cache-hit cost of ~11 ns, not the registration cost.
 Based on these results, the per-span instrumentation cost on the context
 hot path is:
 
-| Operation | Cost |
-|------------------------|-----------|
-| Span start (set IDs) | ~11 ns |
-| Set one attribute | ~11 ns |
-| Full span lifecycle | ~30 ns |
-| Span end (clear) | ~5 ns |
-| **Total per-span** | **~35 ns** |
+| Operation | Cost (2026-03-21) | Cost (2026-04-07) |
+|------------------------|------------------:|------------------:|
+| Span start (set IDs) | ~11 ns | ~6 ns |
+| Set one attribute | ~11 ns | ~10 ns |
+| Full span lifecycle | ~30 ns | ~16 ns |
+| Span end (clear) | ~5 ns | ~4 ns |
+| Read span ID | ~72 ns (alloc) | ~1 ns (direct) |
+| **Total per-span** | **~35 ns** | **~20 ns** |
 
-At 35 ns per span, a single thread can sustain ~28 million span
+At 20 ns per span, a single thread can sustain ~50 million span
 transitions per second before the context layer becomes a bottleneck.
 For a typical web application processing 10K-100K requests/second, this
-represents less than 0.004% of available CPU time.
+represents less than 0.002% of available CPU time.
 
 ## Recommendations
 
 1. **No action needed on false sharing.** The current `ProfiledThread`
    allocation strategy (one heap allocation per thread) provides
-   sufficient cache-line isolation.
+   sufficient cache-line isolation. Confirmed by both the 2026-03-21 and
+   2026-04-07 runs.
 
 2. **`getContext` allocation (resolved).** The original `getContext` method
-   allocated a `long[]` per call (~71.6 ns). This was resolved by replacing
-   `getContext()` with direct `DirectByteBuffer` reads in `getSpanId()` and
-   `getRootSpanId()`, eliminating both the JNI call and the array allocation.
-
-3. **Higher thread counts.** To stress-test on server hardware, add
-   `@Threads(8)` and `@Threads(16)` variants and run on a machine with
-   sufficient physical cores. On a laptop, 4 threads already saturate
-   the performance core complex.
-
-4. **Linux `perf stat` validation.** For definitive cache-miss evidence,
-   run on Linux with `perf stat -e L1-dcache-load-misses,LLC-load-misses`
-   and compare miss rates across thread counts. This would confirm the
-   absence of coherency traffic at the hardware level.
+   allocated a `long[]` per call (~71.6 ns). Resolved by replacing `getContext()`
+   with direct `DirectByteBuffer` reads in `getSpanId()` (1.0 ns) and
+   `getRootSpanId()`, eliminating the JNI call and array allocation entirely.
+
+3. **Re-run multi-threaded variants with 2+ forks.** The 2026-04-07 run used
+   1 fork with 3 iterations; `setContextFull_4t` has a ±21% error bar and the
+   `spanLifecycle` multi-thread results show noise-floor inversion. A 2-fork run
+   would produce publication-quality error bars.
+
+4. **Higher thread counts.** To stress-test on server hardware, add `@Threads(8)`
+   and `@Threads(16)` variants and run on a machine with sufficient physical cores.
+   On a laptop, 4 threads saturates the performance core complex.
+
+5. **Linux `perf stat` validation.** For definitive cache-miss evidence, run on
+   Linux with `perf stat -e L1-dcache-load-misses,LLC-load-misses` and compare
+   miss rates across thread counts.
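[Editor's note: the CPU-fraction claims in the diff (0.004% baseline, 0.002% current) are consistent with one unstated assumption: each request consumes roughly 1 ms of CPU time, so the per-span cost is measured against that budget. A quick arithmetic check under that assumption; the 1 ms figure is our assumption, not from the report.]

```java
// Back-of-envelope check of the per-span budget quoted in the updated report.
public class SpanBudget {
    public static void main(String[] args) {
        double requestCpuNs = 1_000_000.0;  // ASSUMPTION: ~1 ms of CPU per request

        double baselineNs = 35.0;           // 2026-03-21 total per-span cost
        double currentNs  = 20.0;           // 2026-04-07 total per-span cost

        // Sustainable span transitions per second on one thread.
        System.out.printf("baseline: %.1fM spans/s%n", 1e9 / baselineNs / 1e6); // ~28.6M
        System.out.printf("current:  %.1fM spans/s%n", 1e9 / currentNs / 1e6);  // ~50.0M

        // Per-span overhead as a fraction of one request's CPU time.
        System.out.printf("baseline overhead: %.4f%%%n",
                baselineNs / requestCpuNs * 100);   // 0.0035% -> "less than 0.004%"
        System.out.printf("current overhead:  %.4f%%%n",
                currentNs / requestCpuNs * 100);    // 0.0020% -> "less than 0.002%"
    }
}
```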
