Skip to content

Defer MetricKey construction to the aggregator thread#11381

Open
dougqh wants to merge 3 commits into
dougqh/conflating-metrics-producer-winsfrom
dougqh/conflating-metrics-background-work
Open

Defer MetricKey construction to the aggregator thread#11381
dougqh wants to merge 3 commits into
dougqh/conflating-metrics-producer-winsfrom
dougqh/conflating-metrics-background-work

Conversation

@dougqh
Copy link
Copy Markdown
Contributor

@dougqh dougqh commented May 15, 2026

What Does This Do

Moves the per-span MetricKey construction, cache lookups, and aggregation off the producer thread into the existing aggregator thread, replacing the Batch-based conflation pipeline with a thin per-span SpanSnapshot posted to the inbox.

Motivation

Incremental step towards using a lighter weight structure for metrics.

In the subsequent PR, I intend to switch to a simplified hash table that isn't thread-safe.

The simplified hashtable uses custom entries that that will allow us to avoid the MetricKey construction on look-up,
but given that the simple hashtable isn't thread-safe we need to move the work to the consumer thread first.

Additional Notes

Stacked on top of #11380 -- review that first; the merge base of this PR is dougqh/conflating-metrics-producer-wins, not master. The diff shown here is only the work that's new beyond that PR.

What the producer does now (per span)

  • filter (shouldComputeMetric, resource-ignored, longRunning)
  • collect tag values into a SpanSnapshot (one allocation per span)
  • inbox.offer(snapshot) + return error flag for forceKeep

What moved off the producer

  • MetricKey construction and its hash computation
  • SERVICE_NAMES.computeIfAbsent (UTF8 encoding of service name)
  • SPAN_KINDS.computeIfAbsent (UTF8 encoding of span.kind)
  • PEER_TAGS_CACHE lookups (peer-tag name+value UTF8 encoding)
  • pending / keys ConcurrentHashMap operations
  • Batch pooling, atomic ops, contributeTo

Removed entirely

  • Batch.java -- the aggregator's existing LRUCache<MetricKey, AggregateMetric> IS the conflation point now
  • pending ConcurrentHashMap<MetricKey, Batch>
  • keys ConcurrentHashMap<MetricKey, MetricKey> (canonical dedup)
  • batchPool MessagePassingQueue<Batch>
  • CommonKeyCleaner's keys.keySet() tracking; AggregateExpiry now just reports LRU drops to health metrics

Added

  • SpanSnapshot: immutable value carrying the raw MetricKey inputs + a tagAndDuration long (duration OR-ed with ERROR_TAG / TOP_LEVEL_TAG).
  • AggregateMetric.recordOneDuration(long) -- single-hit equivalent of the existing recordDurations(int, AtomicLongArray).
  • Peer-tag values flow through the snapshot as a flattened String[] of [name0, value0, name1, value1, ...]; the aggregator encodes them through PEER_TAGS_CACHE on its own thread.
  • HealthMetrics.onStatsInboxFull() + a TracerHealthMetrics counter reported as stats.dropped_aggregates{reason:inbox_full} -- parallel to the existing reason:lru_eviction. Without conflation the producer can lose snapshots when the bounded MPSC queue is full; this makes that visible without silencing it.

Benchmark results (2 forks × 5 iter × 15s)

ConflatingMetricsAggregatorDDSpanBenchmark:

avgt (µs/op) CI (99.9%)
prior commit (stacked base) 6.343 ± 0.115 [6.228, 6.458]
this PR 2.506 ± 0.044 [2.462, 2.550]

~60% faster on the production DDSpan path. The SimpleSpan bench shows ~53% faster as well.

Caveat on the bench numbers

Without conflation, the producer pushes 1 inbox item per span instead of ~1 per 64. At the JMH bench's synthetic rate (effectively ~20M snapshots/sec from the producer) the consumer can't keep up and inbox.offer silently drops -- the new onStatsInboxFull counter would fire constantly. The headline numbers measure producer publish() latency only; consumer throughput at realistic span rates is a follow-up to validate. Tuning maxPending matters more in this design.

Real fixes for capacity (out of scope for this PR):

  • Bump maxPending default; the conflating design used 2048 slots × ~64 conflation = ~131K effective capacity, the new design has 2048 slots flat.
  • Producer-side mini-batching (TLAB-style accumulator per thread) to recover some compression.

Test plan

  • ./gradlew :dd-trace-core:test --tests 'datadog.trace.common.metrics.*' passes
  • ./gradlew :dd-trace-core:test --tests 'datadog.trace.core.monitor.*' passes
  • ./gradlew :dd-trace-core:compileJava :dd-trace-core:compileTestGroovy :dd-trace-core:compileJmhJava :dd-trace-core:compileTraceAgentTestGroovy all green
  • ./gradlew spotlessCheck clean
  • CI muzzle / integration suites
  • Validate stats.dropped_aggregates{reason:inbox_full} reports as expected under a synthetic high-load run (not in the JMH bench)

🤖 Generated with Claude Code

dougqh and others added 2 commits May 15, 2026 13:50
Replace the producer-side conflation pipeline with a thin per-span SpanSnapshot
posted to the existing aggregator thread. The aggregator now builds the
MetricKey, does the SERVICE_NAMES / SPAN_KINDS / PEER_TAGS_CACHE lookups, and
updates the AggregateMetric directly -- all off the producer's hot path.

What the producer does now, per span:

  - filter (shouldComputeMetric, resource-ignored, longRunning)
  - collect tag values into a SpanSnapshot (1 allocation per span)
  - inbox.offer(snapshot) + return error flag for forceKeep

What moved off the producer:

  - MetricKey construction and its hash computation
  - SERVICE_NAMES.computeIfAbsent (UTF8 encoding of service name)
  - SPAN_KINDS.computeIfAbsent (UTF8 encoding of span.kind)
  - PEER_TAGS_CACHE lookups (peer-tag name+value UTF8 encoding)
  - pending/keys ConcurrentHashMap operations
  - Batch pooling, batch atomic ops, batch contributeTo

Removed entirely:

  - Batch.java -- the conflation primitive is no longer needed; the
    aggregator's existing LRUCache<MetricKey, AggregateMetric> IS the
    conflation point now.
  - pending ConcurrentHashMap<MetricKey, Batch>
  - keys ConcurrentHashMap<MetricKey, MetricKey> (canonical dedup)
  - batchPool MessagePassingQueue<Batch>
  - The CommonKeyCleaner role of tracking keys.keySet() on LRU eviction --
    AggregateExpiry now just reports drops to healthMetrics.

Added:

  - SpanSnapshot: immutable value carrying the raw MetricKey inputs + a
    tagAndDuration long (duration | ERROR_TAG | TOP_LEVEL_TAG).
  - AggregateMetric.recordOneDuration(long tagAndDuration) -- the single-hit
    equivalent of the existing recordDurations(int, AtomicLongArray).
  - Peer-tag values flow through the snapshot as a flattened String[] of
    [name0, value0, name1, value1, ...]; the aggregator encodes them through
    PEER_TAGS_CACHE on its own thread.

Benchmark results (2 forks x 5 iter x 15s):

  ConflatingMetricsAggregatorDDSpanBenchmark
    prior commit  6.343 +- 0.115 us/op
    this commit   2.506 +- 0.044 us/op  (~60% faster)

  ConflatingMetricsAggregatorBenchmark (SimpleSpan)
    prior commit  6.585 +- 0.049 us/op
    this commit   3.116 +- 0.032 us/op  (~53% faster)

Caveat on the benchmark: without conflation, the producer pushes 1 inbox
item per span instead of ~1 per 64. At the benchmark's synthetic rate the
consumer can't keep up and inbox.offer silently drops. The numbers measure
producer publish() latency only; consumer throughput at realistic span rates
is a follow-up to validate. Tuning maxPending matters more in this design.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
With the per-span SpanSnapshot inbox path, the producer can lose snapshots
when the bounded MPSC queue is full -- silently, since inbox.offer() returns
a boolean we previously ignored. The conflating-Batch design used to absorb
~64x more producer pressure per inbox slot, so this is a new failure mode
worth surfacing.

Wire it through the existing HealthMetrics path:

- HealthMetrics.onStatsInboxFull() (no-op default).
- TracerHealthMetrics gets a statsInboxFull LongAdder and a new reason tag
  reason:inbox_full reported under the same stats.dropped_aggregates metric
  used for LRU evictions. Two LongAdders, two tagged time series.
- ConflatingMetricsAggregator.publish increments the counter when
  inbox.offer(snapshot) returns false.

This doesn't fix the drop -- tuning maxPending and/or building producer-side
batching are the actual fixes. But it makes the failure visible in the same
place ops already watches.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@dougqh dougqh added type: enhancement Enhancements and improvements comp: core Tracer core tag: performance Performance related changes tag: no release notes Changes to exclude from release notes comp: metrics Metrics tag: ai generated Largely based on code generated by an AI or LLM labels May 15, 2026
@dougqh dougqh marked this pull request as ready for review May 18, 2026 15:37
@dougqh dougqh requested a review from a team as a code owner May 18, 2026 15:37
@dougqh dougqh requested a review from mhlidd May 18, 2026 15:37
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 950499c767

ℹ️ About Codex in GitHub

Codex has been enabled to automatically review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

When you sign up for Codex through ChatGPT, Codex can also answer questions or update the PR, like "@codex address that feedback".

Comment on lines +514 to +518
reportIfChanged(
target.statsd,
"stats.dropped_aggregates",
target.statsInboxFull,
REASON_INBOX_FULL_TAG);
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Badge Resize health metric history for inbox-full counter

When statsInboxFull is nonzero this added 52nd reportIfChanged call indexes previousCounts[++countIndex], but previousCounts is still sized for the previous 51 counters. As a result the new reason:inbox_full metric is never emitted and every flush that reaches this call logs the resize warning instead; increase the array size alongside the new counter.

Useful? React with 👍 / 👎.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

comp: core Tracer core comp: metrics Metrics tag: ai generated Largely based on code generated by an AI or LLM tag: no release notes Changes to exclude from release notes tag: performance Performance related changes type: enhancement Enhancements and improvements

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants