
Commit 10987e4

fix: address ivoanjo-ghost review findings

- ThreadContextBenchmark: replace removed getContext() with getSpanId()
- thread.cpp: use __atomic_store_n(RELEASE) for _otel_ctx_initialized
- thread.h: document naturally-aligned u64 atomicity assumption
- ThreadContext: document clearContextDirect valid=0 OTEP reader trade-off
- context_api.h: move LOCAL_ROOT_SPAN_ATTR_INDEX to context_api.cpp (file-local)
- perf report: remove stale getContext results, note replacement by direct buffer reads

1 parent 742ca57 · commit 10987e4

7 files changed: 26 additions & 16 deletions

ddprof-lib/src/main/cpp/context_api.cpp (3 additions & 0 deletions)

```diff
@@ -23,6 +23,9 @@
 #include "thread.h"
 #include <cstring>
 
+// Reserved attribute index for local root span ID in OTEL attrs_data.
+// Only used within this translation unit; not part of the public ContextApi header.
+static const uint8_t LOCAL_ROOT_SPAN_ATTR_INDEX = 0;
 
 /**
  * Initialize context TLS for the current thread on first use.
```

ddprof-lib/src/main/cpp/context_api.h (0 additions & 2 deletions)

```diff
@@ -78,8 +78,6 @@ class ContextApi {
    */
   static void registerAttributeKeys(const char** keys, int count);
 
-  // Reserved attribute index for local root span ID in OTEL attrs_data.
-  static const uint8_t LOCAL_ROOT_SPAN_ATTR_INDEX = 0;
 };
 
 #endif /* _CONTEXT_API_H */
```

ddprof-lib/src/main/cpp/thread.cpp (3 additions & 1 deletion)

```diff
@@ -110,7 +110,9 @@ void ProfiledThread::releaseFromBuffer() {
   // short-circuit before reading partially-zeroed data during the memset below.
   // (The valid flag is zeroed by memset too, but _otel_ctx_initialized guards
   // the isContextInitialized() check which runs before any record access.)
-  _otel_ctx_initialized = false;
+  // Use __ATOMIC_RELEASE so the compiler cannot reorder this store after the
+  // memset on ARM with aggressive optimizations.
+  __atomic_store_n(&_otel_ctx_initialized, false, __ATOMIC_RELEASE);
   clearOtelSidecar();
   memset(&_otel_ctx_record, 0, sizeof(_otel_ctx_record));
 
```

ddprof-lib/src/main/cpp/thread.h (3 additions & 0 deletions)

```diff
@@ -129,6 +129,9 @@ class ProfiledThread : public ThreadLocalData {
     // between detach() and attach() in Java, ContextApi::get returns valid=0 with
     // root_span_id=0; writing that would clobber the value Java just stored.
     if (context_valid) {
+      // Plain store is safe: naturally-aligned u64 stores/loads are atomic on
+      // x86-64 and aarch64 (the only supported targets). The Java writer uses
+      // sidecarBuffer.putLong() which is a single aligned 8-byte store.
       _otel_local_root_span_id = root_span_id;
     }
     _recording_epoch = recording_epoch;
```

ddprof-lib/src/main/java/com/datadoghq/profiler/ThreadContext.java (6 additions & 0 deletions)

```diff
@@ -293,6 +293,12 @@ private void setContextDirect(long localRootSpanId, long spanId, long trHi, long
      * attrs_data_size is not reset here; the next non-zero setContext call will reset it
      * before attach(). This is safe because valid remains 0 after clear, so no reader will
      * observe the stale attrs_data_size.
+     *
+     * <p>External OTEP readers see valid=0 and skip the record — they cannot distinguish
+     * "cleared" from "being mutated". This is intentional: without also resetting
+     * attrs_data_size here, publishing valid=1 with a stale attrs_data_size would expose
+     * a partially-valid record. The cleared state is effectively invisible to external
+     * readers until the next non-zero setContext call publishes it.
      */
     private void clearContextDirect() {
         recordBuffer.putLong(traceIdOffset, 0);
```

ddprof-stresstest/src/jmh/java/com/datadoghq/profiler/stresstest/scenarios/throughput/ThreadContextBenchmark.java (2 additions & 2 deletions)

```diff
@@ -126,8 +126,8 @@ public void spanLifecycle_4t(ThreadState ts) {
     }
 
     @Benchmark
-    public long[] getContext(ThreadState ts) {
-        return ts.ctx.getContext();
+    public long getSpanId(ThreadState ts) {
+        return ts.ctx.getSpanId();
     }
 
     @Benchmark
```

doc/performance/reports/thread-context-benchmark-2026-03-21.md (9 additions & 11 deletions)

````diff
@@ -42,7 +42,6 @@ of 1.
 ```
 Benchmark                                 Mode  Cnt   Score   Error  Units
 ThreadContextBenchmark.clearContext       avgt    9   5.011 ± 0.039  ns/op
-ThreadContextBenchmark.getContext         avgt    9  71.557 ± 1.162  ns/op
 ThreadContextBenchmark.setAttrCacheHit    avgt    9  10.707 ± 0.077  ns/op
 ThreadContextBenchmark.setContextFull     avgt    9  11.104 ± 0.338  ns/op
 ThreadContextBenchmark.setContextFull_2t  avgt    9  11.081 ± 0.105  ns/op
@@ -60,16 +59,17 @@ ThreadContextBenchmark.spanLifecycle_4t   avgt    9  32.203 ± 0.309  ns/op
 | `setContextFull` | 11.104 | ± 0.338 | Write localRootSpanId, spanId, traceIdHigh, traceIdLow (4-arg put) |
 | `setAttrCacheHit`| 10.707 | ± 0.077 | Set string attribute (dictionary cache hit) |
 | `spanLifecycle`  | 30.430 | ± 0.129 | `setContextFull` + `setAttrCacheHit` combined |
-| `getContext`     | 71.557 | ± 1.162 | Read context into a `long[]` (allocates) |
 
 `spanLifecycle` at ~30.4 ns is consistent with the sum of its parts:
 `setContextFull` (~11.1 ns) + `setAttrCacheHit` (~10.7 ns) + loop and
 call overhead (~8.6 ns). All error bars are under 2% of the score,
 indicating stable measurements.
 
-`getContext` is the most expensive operation at ~71.6 ns, dominated by
-the `long[]` allocation on the return path. This is a read-side
-operation, not on the span start/end hot path.
+Note: the original `getContext` benchmark (71.6 ns, dominated by `long[]` allocation)
+was removed when `getContext()` was replaced by direct `DirectByteBuffer` reads in
+`getSpanId()` and `getRootSpanId()`, eliminating the JNI call and the array allocation.
+The `getSpanId` benchmark was added as a replacement but was not run at the time of
+this report.
 
 ### False Sharing Analysis
 
@@ -124,7 +124,6 @@ divergences; the vendor difference does not materially affect the comparison.
 | `setContextFull` | 20.0  | 11.104 | 1.8x  |
 | `setAttrCacheHit`| 114.8 | 10.707 | 10.7x |
 | `spanLifecycle`  | 146.3 | 30.430 | 4.8x  |
-| `getContext`     | 68.8  | 71.557 | 0.96x |
 
 The span lifecycle hot path improved by **4.8x** (146 ns to 30 ns).
 The dominant contributor is `setContextAttribute`, which improved by
@@ -273,11 +272,10 @@ represents less than 0.004% of available CPU time.
    allocation strategy (one heap allocation per thread) provides
    sufficient cache-line isolation.
 
-2. **`getContext` allocation.** The `long[]` allocation in `getContext`
-   (~71.6 ns) could be eliminated by passing a reusable buffer, but
-   this is a read-side operation not on the span start/end critical
-   path. Only worth optimizing if profiling shows GC pressure from
-   high-frequency context reads.
+2. **`getContext` allocation (resolved).** The original `getContext` method
+   allocated a `long[]` per call (~71.6 ns). This was resolved by replacing
+   `getContext()` with direct `DirectByteBuffer` reads in `getSpanId()` and
+   `getRootSpanId()`, eliminating both the JNI call and the array allocation.
 
 3. **Higher thread counts.** To stress-test on server hardware, add
    `@Threads(8)` and `@Threads(16)` variants and run on a machine with
````
