
[otap-df-telemetry] add duration metric helper for processors #2211

Merged
jmacd merged 45 commits into open-telemetry:main from
jmacd:jmacdonald/processor-duration-histogram
Mar 17, 2026

Conversation

@jmacd (Contributor) commented Mar 5, 2026

Add a node.processor metric set with process.success.duration and process.failed.duration Mmsc instruments for measuring the wall-clock duration of the work done in a process() call. A closure is used to prevent the inclusion of async-await points in the measurement.

The metric is registered via the node telemetry context.

This is intended to be gated by MetricLevel >= Normal.

Fixes #2210.
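The "closure prevents async-await points" idea can be sketched in a few lines. This is a minimal stand-in, not the actual otap-df-telemetry API: `DurationRecorder` here plays the role of the real `Mmsc` instrument (which also tracks min/max), and only sum and count are kept. Because `timed()` takes a plain synchronous `FnOnce`, the compiler makes it impossible for an `.await` suspension to land inside the measured span.

```rust
use std::time::Instant;

/// Hypothetical stand-in for the `Mmsc` duration instrument:
/// accumulates observed durations in nanoseconds.
#[derive(Default)]
struct DurationRecorder {
    sum_nanos: u64,
    count: u64,
}

impl DurationRecorder {
    /// Time a synchronous closure. `f` is an `FnOnce`, not a future,
    /// so `.await` cannot appear inside it; the measurement can never
    /// include async suspension time.
    fn timed<T>(&mut self, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        self.sum_nanos += start.elapsed().as_nanos() as u64;
        self.count += 1;
        out
    }
}
```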

jmacd and others added 5 commits March 4, 2026 22:38
Add a node.processor metric set with a process.duration Mmsc instrument
that automatically measures the wall-clock duration of every processor's
process() call. The timing is performed at the engine wrapper level in
ProcessorWrapper::start(), so all processors are instrumented without
requiring changes to the Processor trait or individual implementations.

The metric is registered via the node telemetry context during
prepare_runtime() and reported on each CollectTelemetry interval
(1-second periodic) as well as at final shutdown.

This is intended to be gated by MetricLevel >= Normal once open-telemetry#2169 merges.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace engine-level automatic process() timing with an opt-in
approach where processors time their own compute sections using
the new Mmsc::timed() scoped API.

Processors opt in by:
- Adding a process_duration: Mmsc field to their metric set
- Wrapping their compute section in timed()
- Reporting via the existing CollectTelemetry flow

Instrumented processors: filter, attributes, KQL recordset,
and condense attributes. Processors like retry and fanout that
don't do meaningful compute are not instrumented.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nto jmacdonald/processor-duration-histogram
Check effect_handler.node_interests() for CONSUMER_METRICS before
calling timed(). At Basic or None level, the Instant::now() overhead
is skipped entirely. Aligns with the MetricLevel policy from PR open-telemetry#2169.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move process_duration: Mmsc from each processor's individual metric
set into a shared ProcessDuration type in the engine crate. This
encapsulates the metric set, the MetricLevel gating (via Interests),
and reporting into a single reusable wrapper.

Each call site is now a single line:
  self.process_duration.timed(interests, || compute())

Deleted per-processor metrics files that only contained
process_duration (KQL, condense). Removed process_duration from
filter and attributes metric sets that have other counters.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
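The one-line call-site shape described above can be sketched as follows, with hypothetical names (`CONSUMER_METRICS` as a plain bitflag, a simplified `ProcessDuration` holding only sum and count). The point of the gate is that at Basic or None level even `Instant::now()` is skipped:

```rust
use std::time::Instant;

/// Hypothetical interest bitflag standing in for the engine's Interests type.
const CONSUMER_METRICS: u32 = 0b1;

/// Simplified stand-in for the shared ProcessDuration wrapper.
#[derive(Default)]
struct ProcessDuration {
    sum_nanos: u64,
    count: u64,
}

impl ProcessDuration {
    /// Run `f`, recording its wall-clock duration only when some
    /// consumer has registered interest in metrics.
    fn timed<T>(&mut self, interests: u32, f: impl FnOnce() -> T) -> T {
        if interests & CONSUMER_METRICS == 0 {
            // Basic/None level: skip even the Instant::now() overhead.
            return f();
        }
        let start = Instant::now();
        let out = f();
        self.sum_nanos += start.elapsed().as_nanos() as u64;
        self.count += 1;
        out
    }
}
```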
@github-actions github-actions bot added the rust Pull requests that update Rust code label Mar 5, 2026
codecov bot commented Mar 5, 2026

Codecov Report

❌ Patch coverage is 87.16216% with 38 lines in your changes missing coverage. Please review.
✅ Project coverage is 87.71%. Comparing base (87aaf03) to head (6deafe3).
⚠️ Report is 3 commits behind head on main.

Additional details and impacted files
@@           Coverage Diff            @@
##             main    #2211    +/-   ##
========================================
  Coverage   87.71%   87.71%            
========================================
  Files         575      577     +2     
  Lines      197290   198107   +817     
========================================
+ Hits       173044   173775   +731     
- Misses      23720    23806    +86     
  Partials      526      526            
Components Coverage Δ
otap-dataflow 89.74% <87.16%> (-0.01%) ⬇️
query_abstraction 80.61% <ø> (ø)
query_engine 90.61% <ø> (ø)
syslog_cef_receivers ∅ <ø> (∅)
otel-arrow-go 52.44% <ø> (ø)
quiver 91.91% <ø> (ø)

jmacd and others added 4 commits March 5, 2026 19:22
Replace the closure-based timed() approach and mutable-reference guard
with a lightweight TimingGuard that captures the start instant without
borrowing the ProcessDuration. This avoids borrow checker conflicts
when the processor needs to call methods on &self/&mut self between
start and stop.

API: ProcessDuration::start(interests) -> TimingGuard
     timing.stop(&mut self.process_duration)

All four processors (filter, attributes, KQL recordset, condense)
now use this pattern with minimal diff against upstream - no code
re-indentation or method extraction required.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
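The guard pattern in this commit can be sketched like so (names and types are simplified stand-ins, not the real engine API). The key property is that `TimingGuard` owns only the start `Instant` and never borrows the `ProcessDuration`, so between `start` and `stop` the processor is free to take `&self` or `&mut self` borrows:

```rust
use std::time::Instant;

/// Guard holding only the start instant; it borrows nothing,
/// avoiding borrow-checker conflicts with the processor's fields.
struct TimingGuard {
    start: Option<Instant>,
}

/// Simplified stand-in for the duration metric wrapper.
#[derive(Default)]
struct ProcessDuration {
    sum_nanos: u64,
    count: u64,
}

impl ProcessDuration {
    /// Capture the start instant only when metrics are enabled.
    fn start(metrics_enabled: bool) -> TimingGuard {
        TimingGuard { start: metrics_enabled.then(Instant::now) }
    }

    /// Consume the guard and record the elapsed time, if any was captured.
    fn stop(&mut self, guard: TimingGuard) {
        if let Some(start) = guard.start {
            self.sum_nanos += start.elapsed().as_nanos() as u64;
            self.count += 1;
        }
    }
}
```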
Introduce a Timer type that encapsulates Instant and the
elapsed-to-nanos conversion. Mmsc::timed() and TimingGuard now
both delegate to Timer, eliminating raw Instant usage from callers.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove elapsed_nanos() from Timer. Instead, Mmsc::record_timer()
consumes the Timer directly and handles the elapsed-to-nanos
conversion internally. TimingGuard::stop() now delegates straight
to record_timer().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Enables Normal+ level metrics (including process_duration) so the
example config demonstrates the full metrics picture out of the box.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jmacd jmacd marked this pull request as ready for review March 5, 2026 20:28
@jmacd jmacd requested a review from a team as a code owner March 5, 2026 20:28
Comment thread rust/otap-dataflow/crates/engine/src/process_duration.rs Outdated
Comment thread rust/otap-dataflow/crates/engine/src/process_duration.rs Outdated
Comment thread rust/otap-dataflow/crates/engine/src/process_duration.rs Outdated
Comment thread rust/otap-dataflow/crates/otap/src/filter_processor.rs Outdated
jmacd (Contributor, Author) commented Mar 12, 2026

e.g.,

# HELP process_duration_min Wall-clock duration of the processor compute section, in nanoseconds.
# TYPE process_duration_min gauge
process_duration_min{set="processor.process.duration",node_id="debug",node_urn="urn:otel:processor:debug",node_type="processor",pipeline_id="main",pipeline_group_id="default",core_id="19",numa_node_id="0",process_instance_id="AGOOAMAILN73HA4PQSJ3WPDKSA",host_id="JoshCorpSurfaceLaptop",container_id="",custom="{node_sequence=2, pipeline_name=MyCustomPipelineName}"} 4139070 1773287780962
# HELP process_duration_max Wall-clock duration of the processor compute section, in nanoseconds.
# TYPE process_duration_max gauge
process_duration_max{set="processor.process.duration",node_id="debug",node_urn="urn:otel:processor:debug",node_type="processor",pipeline_id="main",pipeline_group_id="default",core_id="19",numa_node_id="0",process_instance_id="AGOOAMAILN73HA4PQSJ3WPDKSA",host_id="JoshCorpSurfaceLaptop",container_id="",custom="{node_sequence=2, pipeline_name=MyCustomPipelineName}"} 4409694 1773287780962
# HELP process_duration_sum Wall-clock duration of the processor compute section, in nanoseconds.
# TYPE process_duration_sum counter
process_duration_sum{set="processor.process.duration",node_id="debug",node_urn="urn:otel:processor:debug",node_type="processor",pipeline_id="main",pipeline_group_id="default",core_id="19",numa_node_id="0",process_instance_id="AGOOAMAILN73HA4PQSJ3WPDKSA",host_id="JoshCorpSurfaceLaptop",container_id="",custom="{node_sequence=2, pipeline_name=MyCustomPipelineName}"} 8548764 1773287780962
# HELP process_duration_count Wall-clock duration of the processor compute section, in nanoseconds.
# TYPE process_duration_count counter
process_duration_count{set="processor.process.duration",node_id="debug",node_urn="urn:otel:processor:debug",node_type="processor",pipeline_id="main",pipeline_group_id="default",core_id="19",numa_node_id="0",process_instance_id="AGOOAMAILN73HA4PQSJ3WPDKSA",host_id="JoshCorpSurfaceLaptop",container_id="",custom="{node_sequence=2, pipeline_name=MyCustomPipelineName}"} 2 1773287780962

jmacd and others added 2 commits March 16, 2026 19:15
…nto jmacdonald/processor-duration-histogram

Revert to Cell<Mmsc> with &self on timed() so the closure can
borrow sibling fields on the processor.  This structurally prevents
timing from spanning .await points (reviewer feedback).

- Remove start_timer() and record_elapsed() — all 5 processors now
  use timed() exclusively
- Debug processor: timed() wraps sync proto decode only
- Filter/KQL/attributes/condense: timed() wraps sync compute work

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
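A sketch of the `Cell`-based `&self` approach this commit reverts to (simplified stand-in types, not the actual API): because `timed()` takes `&self` and the instrument lives in a `Cell`, the closure can freely borrow sibling data mutably, while the synchronous `FnOnce` bound still structurally excludes `.await` points from the span:

```rust
use std::cell::Cell;
use std::time::Instant;

/// Simplified stand-in for the Mmsc instrument; Copy so it fits in a Cell.
#[derive(Default, Clone, Copy)]
struct Mmsc {
    sum_nanos: u64,
    count: u64,
}

#[derive(Default)]
struct ComputeDuration {
    inner: Cell<Mmsc>,
}

impl ComputeDuration {
    /// `&self` receiver: the caller keeps full freedom to borrow other
    /// fields of the processor inside the closure.
    fn timed<T>(&self, f: impl FnOnce() -> T) -> T {
        let start = Instant::now();
        let out = f();
        // Cell get/set round-trip: zero runtime borrow-check cost.
        let mut m = self.inner.get();
        m.sum_nanos += start.elapsed().as_nanos() as u64;
        m.count += 1;
        self.inner.set(m);
        out
    }
}
```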
Comment thread rust/otap-dataflow/crates/engine/src/process_duration.rs Outdated
jmacd and others added 3 commits March 16, 2026 21:10
All processor call sites now use effect_handler.timed(&self.compute_duration, || { ... })
instead of self.compute_duration.timed(effect_handler.node_interests(), || { ... }).

The EffectHandler already knows its interests, so processors no longer
need to thread them through manually.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Constrain timed() to FnOnce() -> Result<T, E> and route elapsed time
into separate compute.duration.success and compute.duration.failed
instruments based on the closure outcome. This lets operators
distinguish compute time from error-path time in duration metrics.

All callers already return Result, so no call-site changes needed.

Add end-to-end test verifying both metric names appear in the
telemetry registry after report().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
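The success/failure routing can be sketched as below, again with hypothetical stand-in types. Constraining the closure to return `Result<T, E>` lets the wrapper pick the instrument after the fact, so call sites stay unchanged:

```rust
use std::cell::Cell;
use std::time::Instant;

/// Simplified stand-in holding separate success/failed duration instruments.
#[derive(Default)]
struct ComputeDuration {
    success_nanos: Cell<u64>,
    success_count: Cell<u64>,
    failed_nanos: Cell<u64>,
    failed_count: Cell<u64>,
}

impl ComputeDuration {
    /// Time the closure and route the elapsed nanoseconds into the
    /// success or failed instrument based on the Result outcome.
    fn timed<T, E>(&self, f: impl FnOnce() -> Result<T, E>) -> Result<T, E> {
        let start = Instant::now();
        let out = f();
        let elapsed = start.elapsed().as_nanos() as u64;
        let (nanos, count) = if out.is_ok() {
            (&self.success_nanos, &self.success_count)
        } else {
            (&self.failed_nanos, &self.failed_count)
        };
        nanos.set(nanos.get() + elapsed);
        count.set(count.get() + 1);
        out
    }
}
```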
RefCell supports non-Copy types (e.g. Box<Histogram>, [u64; N] bucket
arrays) that Cell cannot handle. The borrow is non-reentrant by
construction since timed() runs a sync closure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment thread rust/otap-dataflow/crates/engine/src/process_duration.rs Outdated
Comment on lines +161 to +162
self.metrics.log_signals_consumed.add(log_signals_consumed);
self.metrics.log_signals_filtered.add(log_signals_filtered);
Contributor commented:
nit: The counter updates (self.metrics.log_signals_consumed.add(...), self.metrics.span_signals_consumed.add(...), etc.) are inside the timed() closure, so they're included in the compute duration measurement. These are cheap add() calls so the impact is minimal, but for purity the counters could be updated after the closure returns (similar to how the attributes processor updates its counters outside timed()).

jmacd (Contributor, Author) replied:
I agree, fixed. And also-- I sort of don't care whether metric updates are included in compute time (a) because the distortions will be small, (b) because compute is compute. 😀

Extract the repeated TelemetryRegistryHandle/ControllerContext/
PipelineContext boilerplate into a shared helper in
crates/engine/src/testing/mod.rs. Use it in process_duration tests
and condense_attributes_processor test helper.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Comment on lines +46 to +47
/// closure. `RefCell` is used instead of `Cell` to support future
/// instrument types that are not `Copy` (e.g. histogram buckets).
Contributor commented:
nit: Compared to Cell which has zero runtime borrow-check cost, RefCell does come with a runtime borrow-check cost for every borrow_mut call. I understand that Cell might not help in future when we move to Histograms, but I guess that switch could have been made when we actually needed it.

jmacd (Contributor, Author) replied:
Replaced with Cell for now!

Comment thread rust/otap-dataflow/configs/fake-kql-debug-noop.yaml Outdated
jmacd and others added 2 commits March 16, 2026 16:20
Co-authored-by: Utkarsh Umesan Pillai <66651184+utpilla@users.noreply.github.com>
}
})?;
let req =
effect_handler.timed(&self.compute_duration, || {
Member commented:
This timing only covers protobuf decode. Most of the debug processor’s actual work happens afterward in process_log / process_metric / process_trace, so this metric will underreport the real processing cost.

jmacd (Contributor, Author) replied:
I've gone back and forth on this! The blocking calls to the console will appear as compute if we're not careful, but I guess the work in formatting can be measured. Not sure if it's worth it. I mainly added this in the debug processor to test the code path because it's used in many test configs.

Member replied:
Yeah it's just an observation, not blocking specifically for debug processor.

jmacd and others added 2 commits March 17, 2026 00:06
…ters outside timed()

Address PR review feedback:
- Replace RefCell<Mmsc> with Cell<Mmsc> for zero runtime borrow-check
  overhead, since Mmsc is Copy.
- Move filter processor counter .add() calls outside the timed() closure
  so only actual compute work is measured.
- Fix pre-existing test bug with swapped metric field name order.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nto jmacdonald/processor-duration-histogram
jmacd and others added 4 commits March 17, 2026 08:10
…jmacd/otel-arrow into jmacdonald/processor-duration-histogram
The method only uses &self.pipeline immutably, so &mut self was
unnecessary and caused a borrow conflict with &self.compute_duration
inside the timed() closure.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@jmacd jmacd enabled auto-merge March 17, 2026 18:25
@jmacd jmacd added this pull request to the merge queue Mar 17, 2026
Merged via the queue into open-telemetry:main with commit f71e43b Mar 17, 2026
67 of 68 checks passed
@jmacd jmacd deleted the jmacdonald/processor-duration-histogram branch March 17, 2026 19:14
cijothomas pushed a commit to cijothomas/otel-arrow that referenced this pull request Mar 17, 2026
…elemetry#2211)

Add a `node.processor` metric set with a `process.success.duration` and
`process.failed.duration` Mmsc instrument for measuring the wall-clock
duration of the work done in a process() call. A closure is used to
prevent inclusion of async-await points in the measurement.

The metric is registered via the node telemetry context.

This is intended to be gated by MetricLevel >= Normal

Fixes open-telemetry#2210.

---------

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Co-authored-by: Utkarsh Umesan Pillai <66651184+utpilla@users.noreply.github.com>

Labels

rust Pull requests that update Rust code

Projects

Status: Done

Development

Successfully merging this pull request may close these issues.

[otap-df-telemetry] Processor duration measurement helper

4 participants