[SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions #77

Open
kathiehuang wants to merge 51 commits into main from kathie.huang/add-cpu-enhanced-metrics

Conversation

@kathiehuang
Contributor

@kathiehuang kathiehuang commented Feb 13, 2026

What does this PR do?

Adds CPU usage enhanced metric for Linux Azure Functions.

  • Adds a new datadog-metrics-collector crate that reads CPU metrics every second and submits them to the Datadog backend every 10 seconds when DD_ENHANCED_METRICS_ENABLED=true (default on)
    • This creates an OS-agnostic CpuMetricsCollector struct and CpuStatsReader trait. Currently, CPU metrics are collected only on Linux; Windows support will be added in a future PR
  • Emits one new metric:
    • azure.functions.enhanced.cpu.usage - container-level CPU consumption rate in nanocores, sourced from cpuacct.usage

Additional Notes

  • Tags attached to all CPU metrics:
    • Azure resource metadata from libdd-common:
      • resource_group
      • subscription_id
      • name
    • Metadata from other environment variables:
      • region
      • plan_tier
      • service
      • env
      • version
      • serverless_compat_version
  • Sets up CpuStats struct to store the computed CPU total
  • Adds DD_ENHANCED_METRICS to EnabledMetricsComponents to determine which metrics components should be enabled
    • If DD_USE_DOGSTATSD is true (default true), start the DogStatsD listener
    • In Azure functions, the instance metrics collector always starts
    • If DD_ENHANCED_METRICS is true (default true in Azure Functions), start the CPU metrics collector
  • Metrics are submitted as distribution metrics because not every metric carries a tag that uniquely identifies the instance that sent it
  • If the collector cannot read the cgroup files, it skips submitting enhanced metrics for that interval and logs the failure
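The skip-on-failure behavior in the last bullet could look roughly like the sketch below. It is an illustration, not the PR's code: the helper names `parse_cpuacct_usage` and `read_cpuacct_usage` are hypothetical, and it assumes the cgroup v1 `cpuacct.usage` file contains a single decimal count of nanoseconds.

```rust
use std::fs;
use std::path::Path;

/// Hypothetical helper: parse the contents of a cgroup v1 `cpuacct.usage`
/// file (assumed to be one decimal nanosecond counter, possibly followed by
/// a newline). Returns `None` when the contents are not a valid `u64`.
pub fn parse_cpuacct_usage(contents: &str) -> Option<u64> {
    contents.trim().parse::<u64>().ok()
}

/// Hypothetical helper: read and parse the file at `path`. Any I/O or parse
/// failure yields `None`, which the caller can treat as "skip this interval".
pub fn read_cpuacct_usage(path: &Path) -> Option<u64> {
    let contents = fs::read_to_string(path).ok()?;
    parse_cpuacct_usage(&contents)
}
```

A caller would log and return early on `None` rather than submitting a metric for that interval.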

Motivation

https://datadoghq.atlassian.net/browse/SVLS-8351

Describe how to test/QA your changes

Build with serverless-compat-self-monitoring.

This was deployed with the serverless-compat-self-monitoring pipeline across all runtimes and hosting plans. All Linux hosting plans were tested to verify that metrics are submitted correctly, and a Windows function was tested to verify that enhanced metrics are disabled in Windows environments.

Testing against Azure Monitor is documented in the internal doc Enhanced Metrics in the Serverless Compatibility Layer. An explanation of what the cgroup contains can be found in Calculating CPU Enhanced Metrics in Windows and Linux Azure Functions.

  • Verified CPU usage increases with a CPU-intensive function
  • Verified CPU limit increases when the hosting plan is scaled up
  • Verified enhanced metrics still send when DD_USE_DOGSTATSD is off
  • Verified custom metrics still send when DD_ENHANCED_METRICS_ENABLED is off

References: datadog-agent cgroup collection and calculation logic

@kathiehuang kathiehuang changed the title [SVLS-8351] Add CPU Enhanced Metrics [SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions Mar 6, 2026
@kathiehuang kathiehuang force-pushed the kathie.huang/add-cpu-enhanced-metrics branch 2 times, most recently from 7010e35 to 5953d68 Compare March 6, 2026 20:38
@DataDog DataDog deleted a comment from chatgpt-codex-connector Bot Mar 7, 2026
@kathiehuang
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6a55dc810

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread crates/datadog-serverless-compat/src/main.rs Outdated
@kathiehuang
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dfe28a3a43

self.last_usage_ns = current_usage_ns;

// Divide nanoseconds delta by collection interval to get usage rate in nanocores
let usage_rate_nc = delta_ns / self.collection_interval_secs as f64;

P1 Badge Use real elapsed time for CPU usage rate

CpuMetricsCollector::collect_and_submit computes usage_rate_nc by dividing the CPU delta by a fixed configured interval. However, collection is driven from a tokio::select! loop whose flush branch awaits network I/O (metrics_flusher.flush().await), so the real gap between samples can be longer or shorter than 3s under timeouts, retries, or scheduler delay. This skews the reported nanocore rate and can produce false spikes or drops in production dashboards.

Contributor Author

Addressed this in c6a55dc - tracked the last collection time and the current time to get a more precise time interval to divide by!
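For illustration, dividing by measured elapsed time rather than the configured interval could look like the sketch below. This is not the code from c6a55dc: `UsageRateTracker` and its fields are hypothetical names, and the sketch assumes the usage counter is a monotonically increasing nanosecond total.

```rust
use std::time::Instant;

/// Hypothetical tracker (not the PR's `CpuMetricsCollector`): remembers the
/// previous CPU usage counter and the instant at which it was sampled.
pub struct UsageRateTracker {
    last_usage_ns: u64,
    last_sample_at: Instant,
}

impl UsageRateTracker {
    pub fn new(initial_usage_ns: u64, now: Instant) -> Self {
        Self { last_usage_ns: initial_usage_ns, last_sample_at: now }
    }

    /// Returns the usage rate in nanocores (CPU nanoseconds consumed per
    /// wall-clock second), or `None` when no meaningful rate exists: the
    /// counter went backwards (e.g. a reset) or no time has elapsed.
    pub fn sample(&mut self, current_usage_ns: u64, now: Instant) -> Option<f64> {
        let elapsed_secs = now
            .saturating_duration_since(self.last_sample_at)
            .as_secs_f64();
        let delta_ns = current_usage_ns.checked_sub(self.last_usage_ns);

        // Always advance the baseline so the next sample measures a fresh window.
        self.last_usage_ns = current_usage_ns;
        self.last_sample_at = now;

        match delta_ns {
            Some(delta) if elapsed_secs > 0.0 => Some(delta as f64 / elapsed_secs),
            _ => None,
        }
    }
}
```

With this shape, a sample that arrives late is divided by the longer real window, so a delayed tick does not show up as a spike.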

.await;
if let Some(ref windows_pipe_name) = dd_dogstatsd_windows_pipe_name {
info!("dogstatsd-pipe: starting to listen on pipe {windows_pipe_name}");
let needs_aggregator = dd_use_dogstatsd || dd_enhanced_metrics;

P2 Badge Gate aggregator startup to Azure enhanced-metrics path

The new needs_aggregator condition enables the aggregator whenever DD_ENHANCED_METRICS_ENABLED is true (the default), even in non-Azure environments where cpu_collector is never created. Deployments with DD_USE_DOGSTATSD=false therefore still start and flush an unused aggregator, and can log DD_API_KEY not set errors even though no enhanced metrics are emitted. This should be gated by env_type == EnvironmentType::AzureFunction.

Contributor Author

Fixed in 36bba17 by defaulting DD_ENHANCED_METRICS to true only when running in an Azure Function.
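A minimal sketch of an environment-dependent default follows. It is an assumption about the shape of the fix, not the code in 36bba17: `resolve_enhanced_metrics` is a hypothetical name, the `Other` variant stands in for whatever non-Azure variants the real `EnvironmentType` has, and the rule that only the string "true" enables the flag is also assumed.

```rust
/// Stand-in for the repository's `EnvironmentType`. Only `AzureFunction`
/// appears in the PR; `Other` is a hypothetical placeholder.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum EnvironmentType {
    AzureFunction,
    Other,
}

/// Hypothetical resolver: an explicitly set value is honored, while an unset
/// variable defaults to `true` only in Azure Functions.
pub fn resolve_enhanced_metrics(raw: Option<&str>, env_type: EnvironmentType) -> bool {
    match raw {
        // Assumption: anything other than "true" (case-insensitive) disables it.
        Some(value) => value.trim().eq_ignore_ascii_case("true"),
        None => env_type == EnvironmentType::AzureFunction,
    }
}
```

The raw value could come from `std::env::var("DD_ENHANCED_METRICS_ENABLED").ok().as_deref()`, using the variable name given in the PR description.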

@kathiehuang
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 058ef533f8

Comment on lines +214 to +217
let mut cpu_collector = if dd_enhanced_metrics && env_type == EnvironmentType::AzureFunction {
aggregator_handle.as_ref().map(|handle| {
let tags = build_cpu_metrics_tags();
CpuMetricsCollector::new(handle.clone(), tags)

P2 Badge Skip CPU collector when flusher setup has failed

This block creates CpuMetricsCollector whenever Azure enhanced metrics are enabled, even if start_aggregator failed to build a flusher (DD_API_KEY missing, bad site/proxy) and returned metrics_flusher = None. In that scenario there is no flush path, but the collector still inserts two new timestamped distributions every 3s, so contexts accumulate until overflow and metrics are dropped while consuming resources; enhanced collection should be gated on a usable flusher (or disabled when flushing is unavailable).

Contributor Author

Addressed in c626c03 to only create CpuMetricsCollector if the metrics flusher was successfully created.

Comment on lines +233 to +235
if let Some(metrics_flusher) = metrics_flusher.as_ref() {
debug!("Flushing dogstatsd metrics");
metrics_flusher.flush().await;

P2 Badge Decouple CPU sampling from metrics flush awaits

The CPU collection timer and metrics_flusher.flush().await share the same tokio::select! loop, so a slow/timeouting flush blocks the task and prevents CPU collection from running at the intended 3-second cadence. When flushes stall (for example during network degradation), sampling becomes delayed and irregular, which directly reduces CPU usage metric fidelity; CPU collection should run in its own task or otherwise avoid being blocked by flush I/O.

Contributor Author

Addressed in 071d2f0 to spawn the metrics flusher as an independent task from the collector so that a slow flush will not block CPU collection.

DOGSTATSD_TIMEOUT_DURATION is 5 seconds

@kathiehuang
Contributor Author

@codex review

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 071d2f0488

Comment on lines +240 to +242
tokio::spawn(async move {
    metrics_flusher.flush().await;
});

P1 Badge Serialize metrics flushes to prevent overlapping upload tasks

This change spawns a new flush() task on every interval tick, so if a flush takes longer than 10s (for example during network slowness or outages), multiple flushes run concurrently and keep accumulating over time. Because Flusher::flush performs network I/O with retries, this can happen in production and causes unnecessary task/socket/memory pressure; the previous logic awaited each flush and guaranteed only one in-flight flush at a time.

Contributor Author

The client has a 5-second timeout with RetryStrategy::LinearBackoff(3, 1), so this would require three consecutive full timeouts. I'm happy to add this, but I'm not sure if it's needed
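If serializing flushes did turn out to be worthwhile, one lightweight option is an in-flight flag that makes a tick skip spawning while an earlier flush is still running. The sketch below is only an illustration of that idea using the standard library; `FlushGate` and `FlushPermit` are hypothetical names and nothing like them appears in the PR.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Hypothetical gate allowing at most one flush to be in flight at a time.
#[derive(Clone, Default)]
pub struct FlushGate {
    in_flight: Arc<AtomicBool>,
}

/// Held for the duration of one flush; dropping it reopens the gate,
/// including when the flush task panics or is cancelled.
pub struct FlushPermit {
    in_flight: Arc<AtomicBool>,
}

impl FlushGate {
    pub fn new() -> Self {
        Self::default()
    }

    /// Returns a permit if no flush is running, otherwise `None`.
    pub fn try_begin(&self) -> Option<FlushPermit> {
        self.in_flight
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .ok()
            .map(|_| FlushPermit { in_flight: Arc::clone(&self.in_flight) })
    }
}

impl Drop for FlushPermit {
    fn drop(&mut self) {
        // Mark the flush as finished so the next tick may start a new one.
        self.in_flight.store(false, Ordering::Release);
    }
}
```

On each flush tick, the loop would call `try_begin()`, skip the tick on `None`, and otherwise move the permit into the spawned task so it is released when the flush ends.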

@kathiehuang kathiehuang marked this pull request as ready for review March 9, 2026 17:26
@kathiehuang kathiehuang requested review from a team as code owners March 9, 2026 17:26
@kathiehuang kathiehuang requested review from Lewis-E, duncanpharvey and lym953 and removed request for a team March 9, 2026 17:26
Contributor

@duncanista duncanista left a comment

I'd suggest using features for OS-specific business logic.

I'd also suggest checking how ADP is doing agent checks in Rust; this sounds like an agent check for a very specific use case.

@@ -39,6 +42,7 @@ use dogstatsd::{
use dogstatsd::metric::{SortedTags, EMPTY_TAGS};
use tokio_util::sync::CancellationToken;

const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
Contributor

Suggested change
- const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
+ const CPU_METRICS_COLLECTION_INTERVAL_SECONDS: u64 = 3;

Contributor Author

Updated in aae174a

Comment thread crates/datadog-metrics-collector/src/cpu.rs Outdated
Comment thread crates/datadog-metrics-collector/src/linux.rs Outdated
Comment thread crates/datadog-metrics-collector/src/azure_cpu.rs
Comment thread crates/datadog-metrics-collector/src/linux.rs Outdated
@Lewis-E
Contributor

Lewis-E commented Mar 9, 2026

So, ~6 debug logs every 3 seconds? Do these go to Datadog & cost money?

Comment thread crates/datadog-serverless-compat/src/main.rs Outdated
Collaborator

@duncanpharvey duncanpharvey left a comment

Excellent work! I added a few suggestions to consider

Comment thread crates/datadog-trace-agent/Cargo.toml Outdated
Comment on lines +36 to +37
dogstatsd = { path = "../dogstatsd", default-features = true }
num_cpus = "1.16"
Collaborator

Are these dependencies needed in datadog-trace-agent?

Contributor Author

Oh good catch - this was accidentally left over from before I moved the metrics collector into its own crate. Fixed in 2ad5f24!

let (metrics_flusher, aggregator_handle) = if needs_aggregator {
debug!("Creating metrics flusher and aggregator");

let (flusher, handle) =
Collaborator

I think a comment here to note why the aggregator is started separately from the dogstatsd listener would be helpful - just enough to be clear that there are different configuration options that require this (dogstatsd enabled/disabled, enhanced metrics enabled/disabled).

Maybe a unit test as well to assert that all of these combinations are covered?

Contributor Author

Good point! I added a comment in 60cdecf

It seems like it'll be a little hard to make a meaningful unit test since the aggregator/dogstatsd startup logic has side effects that would make it hard to test in isolation? Maybe I could refactor the startup decision into a struct that describes what to start to separate the decision from execution?

struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}
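To make the proposal above concrete, here is a sketch of the table-driven test it would enable. The struct and function are copied from the comment with `Debug`/`PartialEq` derives added so results can be compared; the test module name and the expected-value table are assumptions derived from the `||` logic shown, not code from the PR.

```rust
#[derive(Debug, PartialEq, Eq)]
struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        // The aggregator is needed if either metrics source is enabled.
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}

#[cfg(test)]
mod agent_config_tests {
    use super::*;

    #[test]
    fn covers_every_flag_combination() {
        // (dd_use_dogstatsd, dd_enhanced_metrics, expected start_aggregator)
        let cases = [
            (false, false, false),
            (true, false, true),
            (false, true, true),
            (true, true, true),
        ];
        for (dogstatsd, enhanced, expect_aggregator) in cases {
            let config = resolve_agent_config(dogstatsd, enhanced);
            assert_eq!(config.start_aggregator, expect_aggregator);
            assert_eq!(config.start_dogstatsd, dogstatsd);
            assert_eq!(config.start_enhanced_metrics, enhanced);
        }
    }
}
```

Because the decision is a pure function of two booleans, all four combinations can be checked without starting the aggregator or the DogStatsD listener.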

@@ -312,3 +361,42 @@ fn build_metrics_client(
}
Ok(builder.build()?)
}

fn build_cpu_metrics_tags() -> Option<SortedTags> {
Collaborator

Would this method make more sense to live in the datadog-metrics-collector crate and be used internally within the crate?

Contributor Author

That makes sense! I moved it in f867f6f

@kathiehuang
Contributor Author

kathiehuang commented Mar 11, 2026

> Also suggest checking how ADP is doing agent checks in rust, this sounds like an agent check for a very specific use case

I looked into how ADP does agent checks in Rust - it looks like it's still experimental and may be too high-level for this use case. I will make a Jira ticket for the backlog though so we can come back to this in the future and see if anything in the way they do checks is applicable!

https://datadoghq.atlassian.net/browse/SVLS-8699

Comment thread crates/datadog-serverless-compat/src/main.rs
kathiehuang added 26 commits May 5, 2026 17:37
…llector from initializing and logging that metrics collection is being skipped
@kathiehuang kathiehuang force-pushed the kathie.huang/add-cpu-enhanced-metrics branch from b84dc84 to a79098f Compare May 5, 2026 21:38
@kathiehuang kathiehuang marked this pull request as ready for review May 5, 2026 21:39
@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a79098f64b

Comment thread crates/datadog-metrics-collector/src/azure_cpu.rs
@kathiehuang kathiehuang requested a review from duncanpharvey May 6, 2026 16:51

6 participants