[otap-df-telemetry] add duration metric helper for processors#2211
[otap-df-telemetry] add duration metric helper for processors#2211jmacd merged 45 commits intoopen-telemetry:mainfrom
Conversation
Add a node.processor metric set with a process.duration Mmsc instrument that automatically measures the wall-clock duration of every processor's process() call. The timing is performed at the engine wrapper level in ProcessorWrapper::start(), so all processors are instrumented without requiring changes to the Processor trait or individual implementations. The metric is registered via the node telemetry context during prepare_runtime() and reported on each CollectTelemetry interval (1-second periodic) as well as at final shutdown. This is intended to be gated by MetricLevel >= Normal once open-telemetry#2169 merges. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Replace engine-level automatic process() timing with an opt-in approach where processors time their own compute sections using the new Mmsc::timed() scoped API. Processors opt in by: - Adding a process_duration: Mmsc field to their metric set - Wrapping their compute section in timed() - Reporting via the existing CollectTelemetry flow Instrumented processors: filter, attributes, KQL recordset, and condense attributes. Processors like retry and fanout that don't do meaningful compute are not instrumented. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nto jmacdonald/processor-duration-histogram
Check effect_handler.node_interests() for CONSUMER_METRICS before calling timed(). At Basic or None level, the Instant::now() overhead is skipped entirely. Aligns with the MetricLevel policy from PR open-telemetry#2169. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Move process_duration: Mmsc from each processor's individual metric set into a shared ProcessDuration type in the engine crate. This encapsulates the metric set, the MetricLevel gating (via Interests), and reporting into a single reusable wrapper. Each call site is now a single line: self.process_duration.timed(interests, || compute()) Deleted per-processor metrics files that only contained process_duration (KQL, condense). Removed process_duration from filter and attributes metric sets that have other counters. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## main #2211 +/- ##
========================================
Coverage 87.71% 87.71%
========================================
Files 575 577 +2
Lines 197290 198107 +817
========================================
+ Hits 173044 173775 +731
- Misses 23720 23806 +86
Partials 526 526
🚀 New features to boost your workflow:
|
Replace the closure-based timed() approach and mutable-reference guard
with a lightweight TimingGuard that captures the start instant without
borrowing the ProcessDuration. This avoids borrow checker conflicts
when the processor needs to call methods on &self/&mut self between
start and stop.
API: ProcessDuration::start(interests) -> TimingGuard
timing.stop(&mut self.process_duration)
All four processors (filter, attributes, KQL recordset, condense)
now use this pattern with minimal diff against upstream - no code
re-indentation or method extraction required.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Introduce a Timer type that encapsulates Instant and the elapsed-to-nanos conversion. Mmsc::timed() and TimingGuard now both delegate to Timer, eliminating raw Instant usage from callers. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Remove elapsed_nanos() from Timer. Instead, Mmsc::record_timer() consumes the Timer directly and handles the elapsed-to-nanos conversion internally. TimingGuard::stop() now delegates straight to record_timer(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Enables Normal+ level metrics (including process_duration) so the example config demonstrates the full metrics picture out of the box. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
|
e.g., |
…nto jmacdonald/processor-duration-histogram # Please enter a commit message to explain why this merge is necessary, # especially if it merges an updated upstream into a topic branch. # # Lines starting with '#' will be ignored, and an empty message aborts # the commit.
Revert to Cell<Mmsc> with &self on timed() so the closure can borrow sibling fields on the processor. This structurally prevents timing from spanning .await points (reviewer feedback). - Remove start_timer() and record_elapsed() — all 5 processors now use timed() exclusively - Debug processor: timed() wraps sync proto decode only - Filter/KQL/attributes/condense: timed() wraps sync compute work Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
All processor call sites now use effect_handler.timed(&self.compute_duration, || { ... })
instead of self.compute_duration.timed(effect_handler.node_interests(), || { ... }).
The EffectHandler already knows its interests, so processors no longer
need to thread them through manually.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Constrain timed() to FnOnce() -> Result<T, E> and route elapsed time into separate compute.duration.success and compute.duration.failed instruments based on the closure outcome. This lets operators distinguish compute time from error-path time in duration metrics. All callers already return Result, so no call-site changes needed. Add end-to-end test verifying both metric names appear in the telemetry registry after report(). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
RefCell supports non-Copy types (e.g. Box<Histogram>, [u64; N] bucket arrays) that Cell cannot handle. The borrow is non-reentrant by construction since timed() runs a sync closure. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| self.metrics.log_signals_consumed.add(log_signals_consumed); | ||
| self.metrics.log_signals_filtered.add(log_signals_filtered); |
There was a problem hiding this comment.
nit: The counter updates (self.metrics.log_signals_consumed.add(...), self.metrics.span_signals_consumed.add(...), etc.) are inside the timed() closure, so they're included in the compute duration measurement. These are cheap add() calls so the impact is minimal, but for purity the counters could be updated after the closure returns (similar to how the attributes processor updates its counters outside timed()).
There was a problem hiding this comment.
I agree, fixed. And also-- I sort of don't care whether metric updates are included in compute time (a) because the distortions will be small, (b) because compute is compute. 😀
Extract the repeated TelemetryRegistryHandle/ControllerContext/ PipelineContext boilerplate into a shared helper in crates/engine/src/testing/mod.rs. Use it in process_duration tests and condense_attributes_processor test helper. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
| /// closure. `RefCell` is used instead of `Cell` to support future | ||
| /// instrument types that are not `Copy` (e.g. histogram buckets). |
There was a problem hiding this comment.
nit: Compared to Cell which has zero runtime borrow-check cost, RefCell does come with a runtime borrow-check cost for every borrow_mut call. I understand that Cell might not help in future when we move to Histograms, but I guess that switch could have been made when we actually needed it.
There was a problem hiding this comment.
Replaced with Cell for now!
Co-authored-by: Utkarsh Umesan Pillai <66651184+utpilla@users.noreply.github.com>
| } | ||
| })?; | ||
| let req = | ||
| effect_handler.timed(&self.compute_duration, || { |
There was a problem hiding this comment.
This timing only covers protobuf decode. Most of the debug processor’s actual work happens afterward in process_log / process_metric / process_trace, so this metric will underreport the real processing cost.
There was a problem hiding this comment.
I've gone back and forth on this! The blocking calls to the console will appear as compute if we're not careful, but I guess the work in formatting can be measured. Not sure if it's worth it. I mainly added this in the debug procsesor to test the code path because it's used in many test configs.
There was a problem hiding this comment.
Yeah it's just an observation, not blocking specifically for debug processor.
…ters outside timed() Address PR review feedback: - Replace RefCell<Mmsc> with Cell<Mmsc> for zero runtime borrow-check overhead, since Mmsc is Copy. - Move filter processor counter .add() calls outside the timed() closure so only actual compute work is measured. - Fix pre-existing test bug with swapped metric field name order. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nto jmacdonald/processor-duration-histogram
…jmacd/otel-arrow into jmacdonald/processor-duration-histogram
…donald/processor-duration-histogram
The method only uses &self.pipeline immutably, so &mut self was unnecessary and caused a borrow conflict with &self.compute_duration inside the timed() closure. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
f71e43b
…elemetry#2211) Add a `node.processor` metric set with a `process.success.duration` and `process.failed.duration` Mmsc instrument for measuring the wall-clock duration of the work done in a process() call. A closure is used to prevent inclusion of async-await points in the measurement. The metric is registered via the node telemetry context. This is intended to be gated by MetricLevel >= Normal Fixes open-telemetry#2210. --------- Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com> Co-authored-by: Utkarsh Umesan Pillai <66651184+utpilla@users.noreply.github.com>
Add a
node.processormetric set with aprocess.success.durationandprocess.failed.durationMmsc instrument for measuring the wall-clock duration of the work done in a process() call. A closure is used to prevent inclusion of async-await points in the measurement.The metric is registered via the node telemetry context.
This is intended to be gated by MetricLevel >= Normal
Fixes #2210.