[SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions by kathiehuang · Pull Request #77 · DataDog/serverless-components

kathiehuang · 2026-02-13T20:19:14Z

What does this PR do?

Adds CPU usage enhanced metric for Linux Azure Functions.

Adds a new datadog-metrics-collector crate that reads CPU metrics every second and submits them to the Datadog backend every 10 seconds when DD_ENHANCED_METRICS_ENABLED=true (default on).
- Creates an OS-agnostic CpuMetricsCollector struct and CpuStatsReader trait. Only Linux CPU metrics are collected for now; Windows CPU metrics will be completed in a future PR.
Emits one new metric:
- azure.functions.enhanced.cpu.usage - container-level CPU consumption rate in nanocores, sourced from cpuacct.usage
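
To illustrate the cpuacct.usage source mentioned above: in cgroup v1 that file holds a single integer, the cumulative CPU time consumed in nanoseconds. The following is a minimal sketch of reading it, not the code from this PR. The function names are hypothetical, and the caller is assumed to pass whatever path the container exposes.

```rust
use std::fs;
use std::path::Path;

/// Parses the contents of a cgroup v1 `cpuacct.usage` file, which holds a
/// single integer: cumulative CPU time in nanoseconds.
/// (Hypothetical helper, not taken from the PR.)
fn parse_cpuacct_usage(contents: &str) -> Option<u64> {
    contents.trim().parse::<u64>().ok()
}

/// Reads and parses the file at `path`. Returns `None` on any I/O or parse
/// failure so the caller can skip the interval and log, rather than panic.
fn read_cpuacct_usage(path: &Path) -> Option<u64> {
    let contents = fs::read_to_string(path).ok()?;
    parse_cpuacct_usage(&contents)
}
```

Returning `Option` keeps the failure path explicit, which matches the behavior described in the notes below: an unreadable file means no metric for that interval.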

Additional Notes

Tags attached to all CPU metrics:
- Azure resource metadata from libdd-common:
  - resource_group
  - subscription_id
  - name
- Metadata from other environment variables:
  - region
  - plan_tier
  - service
  - env
  - version
  - serverless_compat_version
Sets up a CpuStats struct to store the computed CPU total.
Adds DD_ENHANCED_METRICS_ENABLED to EnabledMetricsComponents to determine which metrics components should be enabled:
- If DD_USE_DOGSTATSD is true (default true), start the DogStatsD listener.
- In Azure Functions, the instance metrics collector always starts.
- If DD_ENHANCED_METRICS_ENABLED is true (default true in Azure Functions), start the CPU metrics collector.
Metrics are submitted as distribution metrics because not all metrics carry a tag that uniquely identifies the instance they are sent from.
If the collector cannot read the cgroup files, it submits no enhanced metrics for that interval and logs the failure.
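
As a rough illustration of how the tag set listed above could be assembled, here is a sketch that turns optional key/value pairs into a sorted `key:value` list and drops anything missing or blank. It is an assumption-laden stand-in: the real PR uses SortedTags from the dogstatsd crate and azure_tags::build_enhanced_metrics_tags, whose APIs are not shown in this conversation, so `build_tags` and its plain-string output are hypothetical.

```rust
/// Builds a sorted, comma-separated `key:value` tag string, skipping any
/// tag whose value is missing or blank. Returns `None` if no tag survives.
/// (Hypothetical helper; the PR itself uses the dogstatsd crate's SortedTags.)
fn build_tags<'a, I>(pairs: I) -> Option<String>
where
    I: IntoIterator<Item = (&'a str, Option<&'a str>)>,
{
    let mut tags: Vec<String> = pairs
        .into_iter()
        .filter_map(|(key, value)| {
            // A missing value (None) or a whitespace-only value yields no tag.
            let value = value?.trim();
            if value.is_empty() {
                None
            } else {
                Some(format!("{key}:{value}"))
            }
        })
        .collect();

    if tags.is_empty() {
        return None;
    }
    tags.sort();
    Some(tags.join(","))
}
```

A caller would feed it pairs such as `("region", std::env::var("REGION_NAME").ok().as_deref())`, where the environment variable name is only an example and not confirmed by the PR.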

Motivation

https://datadoghq.atlassian.net/browse/SVLS-8351

Describe how to test/QA your changes

Build with serverless-compat-self-monitoring.

This was deployed with the serverless-compat-self-monitoring pipeline across all runtimes and hosting plans. All Linux hosting plans were tested to verify that metrics submit correctly, and a Windows function was tested to verify that enhanced metrics are disabled in Windows environments.

Testing that compares these metrics against Azure Monitor is documented in the internal doc Enhanced Metrics in the Serverless Compatibility Layer. An explanation of what the cgroup files contain can be found in Calculating CPU Enhanced Metrics in Windows and Linux Azure Functions.

Verified CPU usage increases with a CPU-intensive function
Verified CPU limit increases when the hosting plan is scaled up
Verified enhanced metrics still send when DD_USE_DOGSTATSD is off
Verified custom metrics still send when DD_ENHANCED_METRICS_ENABLED is off

References: datadog-agent cgroup collection and calculation logic

kathiehuang · 2026-03-07T01:04:19Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: c6a55dc810

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

kathiehuang · 2026-03-07T01:23:51Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: dfe28a3a43

chatgpt-codex-connector · 2026-03-07T01:29:11Z

+            self.last_usage_ns = current_usage_ns;
+
+            // Divide nanoseconds delta by collection interval to get usage rate in nanocores
+            let usage_rate_nc = delta_ns / self.collection_interval_secs as f64;


Use real elapsed time for CPU usage rate

CpuMetricsCollector::collect_and_submit computes usage_rate_nc by dividing the CPU delta by a fixed configured interval, but collection is driven from a tokio::select! loop where the flush branch awaits network I/O (metrics_flusher.flush().await), so real sampling gaps can be longer/shorter than 3s under timeout/retry or scheduler delay; this skews the reported nanocore rate and can produce false spikes/drops in production dashboards.

Addressed this in c6a55dc - tracked the last collection time and the current time to get a more precise time interval to divide by!
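
For readers following this thread, the idea of dividing by measured elapsed time can be sketched as below. This is not the code from c6a55dc. `UsageTracker` is a hypothetical name, and timestamps are passed in as a `Duration` since an arbitrary monotonic origin (for example `Instant::elapsed()` on a start time) so the logic is easy to test. CPU nanoseconds consumed per wall-clock second are numerically equal to nanocores, since one core is 10^9 nanocores.

```rust
use std::time::Duration;

/// Tracks the previous sample so each new sample can be turned into a rate.
/// (Hypothetical type for illustration only.)
struct UsageTracker {
    /// (cumulative CPU time in ns, monotonic timestamp) of the last sample.
    last: Option<(u64, Duration)>,
}

impl UsageTracker {
    fn new() -> Self {
        Self { last: None }
    }

    /// Records a sample and returns the usage rate in nanocores.
    /// Returns `None` for the first sample, when no time has elapsed,
    /// or when the cumulative counter went backwards.
    fn sample(&mut self, usage_ns: u64, now: Duration) -> Option<f64> {
        // Always store the new sample, then work from the previous one.
        let (prev_ns, prev_time) = self.last.replace((usage_ns, now))?;

        let elapsed_secs = now.checked_sub(prev_time)?.as_secs_f64();
        if elapsed_secs <= 0.0 {
            return None;
        }

        let delta_ns = usage_ns.checked_sub(prev_ns)?;
        Some(delta_ns as f64 / elapsed_secs)
    }
}
```

For example, 1.5 billion nanoseconds of CPU time consumed over a measured 3 seconds gives 500,000,000 nanocores (half a core), regardless of what the configured interval was.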

chatgpt-codex-connector · 2026-03-07T01:29:11Z

-        .await;
-        if let Some(ref windows_pipe_name) = dd_dogstatsd_windows_pipe_name {
-            info!("dogstatsd-pipe: starting to listen on pipe {windows_pipe_name}");
+    let needs_aggregator = dd_use_dogstatsd || dd_enhanced_metrics;


Gate aggregator startup to Azure enhanced-metrics path

The new needs_aggregator condition enables the aggregator whenever DD_ENHANCED_METRICS_ENABLED is true (default), even in non-Azure environments where cpu_collector is never created, so deployments with DD_USE_DOGSTATSD=false still start/flush an unused aggregator and can log DD_API_KEY not set errors despite no enhanced metrics being emitted; this should be gated by env_type == EnvironmentType::AzureFunction.

Fixed in 36bba17 by defaulting DD_ENHANCED_METRICS_ENABLED to true only when running in an Azure Function.

kathiehuang · 2026-03-07T01:45:55Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 058ef533f8

chatgpt-codex-connector · 2026-03-07T01:52:46Z

+    let mut cpu_collector = if dd_enhanced_metrics && env_type == EnvironmentType::AzureFunction {
+        aggregator_handle.as_ref().map(|handle| {
+            let tags = build_cpu_metrics_tags();
+            CpuMetricsCollector::new(handle.clone(), tags)


Skip CPU collector when flusher setup has failed

This block creates CpuMetricsCollector whenever Azure enhanced metrics are enabled, even if start_aggregator failed to build a flusher (DD_API_KEY missing, bad site/proxy) and returned metrics_flusher = None. In that scenario there is no flush path, but the collector still inserts two new timestamped distributions every 3s, so contexts accumulate until overflow and metrics are dropped while consuming resources; enhanced collection should be gated on a usable flusher (or disabled when flushing is unavailable).

Addressed in c626c03 to only create CpuMetricsCollector if the metrics flusher was successfully created

chatgpt-codex-connector · 2026-03-07T01:52:46Z

+                if let Some(metrics_flusher) = metrics_flusher.as_ref() {
+                    debug!("Flushing dogstatsd metrics");
+                    metrics_flusher.flush().await;


Decouple CPU sampling from metrics flush awaits

The CPU collection timer and metrics_flusher.flush().await share the same tokio::select! loop, so a slow/timeouting flush blocks the task and prevents CPU collection from running at the intended 3-second cadence. When flushes stall (for example during network degradation), sampling becomes delayed and irregular, which directly reduces CPU usage metric fidelity; CPU collection should run in its own task or otherwise avoid being blocked by flush I/O.

Addressed in 071d2f0 to spawn the metrics flusher as an independent task from the collector so that a slow flush will not block CPU collection.

DOGSTATSD_TIMEOUT_DURATION is 5 seconds

kathiehuang · 2026-03-09T14:29:22Z

@codex review

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 071d2f0488

chatgpt-codex-connector · 2026-03-09T14:35:21Z

+                    tokio::spawn(async move {
+                        metrics_flusher.flush().await;
+                    });


Serialize metrics flushes to prevent overlapping upload tasks

This change spawns a new flush() task on every interval tick, so if a flush takes longer than 10s (for example during network slowness or outages), multiple flushes run concurrently and keep accumulating over time. Because Flusher::flush performs network I/O with retries, this can happen in production and causes unnecessary task/socket/memory pressure; the previous logic awaited each flush and guaranteed only one in-flight flush at a time.

The client has a 5-second timeout with RetryStrategy::LinearBackoff(3, 1), so this would require three consecutive full timeouts. I'm happy to add this, but I'm not sure if it's needed
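
If overlapping flushes ever did become a concern, one lightweight option (a sketch, not something this PR implements) is an in-flight flag that lets a tick skip spawning when the previous flush is still running. `FlushGate` and `FlushPermit` are hypothetical names, and the sketch uses only the standard library.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Allows at most one flush to be in flight at a time.
/// (Hypothetical type for illustration only.)
#[derive(Clone, Default)]
struct FlushGate {
    in_flight: Arc<AtomicBool>,
}

/// Held for the duration of a flush; releases the gate when dropped,
/// including on early return or panic unwind.
struct FlushPermit {
    in_flight: Arc<AtomicBool>,
}

impl FlushGate {
    /// Returns a permit if no flush is currently running, otherwise `None`.
    fn try_acquire(&self) -> Option<FlushPermit> {
        self.in_flight
            .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
            .ok()
            .map(|_| FlushPermit {
                in_flight: Arc::clone(&self.in_flight),
            })
    }
}

impl Drop for FlushPermit {
    fn drop(&mut self) {
        self.in_flight.store(false, Ordering::Release);
    }
}
```

On each flush tick the loop would call `gate.try_acquire()` and only `tokio::spawn` the flush when it returns `Some(permit)`, moving the permit into the spawned task so it is dropped when the flush finishes. A skipped tick simply leaves the data in the aggregator for the next flush.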

duncanista

I'd suggest using features for OS-specific business logic

Also suggest checking how ADP is doing agent checks in Rust, this sounds like an agent check for a very specific use case

duncanista · 2026-03-09T17:29:56Z

@@ -39,6 +42,7 @@ use dogstatsd::{
 use dogstatsd::metric::{SortedTags, EMPTY_TAGS};
 use tokio_util::sync::CancellationToken;

+const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;


Suggested change

-const CPU_METRICS_COLLECTION_INTERVAL: u64 = 3;
+const CPU_METRICS_COLLECTION_INTERVAL_SECONDS: u64 = 3;

Updated in aae174a

Lewis-E · 2026-03-09T20:30:18Z

So, ~6 debug logs every 3 seconds? Do these go to Datadog & cost money?

duncanpharvey

Excellent work! I added a few suggestions to consider

duncanpharvey · 2026-03-10T21:27:41Z

+dogstatsd = { path = "../dogstatsd", default-features = true }
+num_cpus = "1.16"


Are these dependencies needed in datadog-trace-agent?

Oh good catch - this was accidentally left over from before I moved the metrics collector into its own crate. Fixed in 2ad5f24!

duncanpharvey · 2026-03-10T21:35:11Z

+    let (metrics_flusher, aggregator_handle) = if needs_aggregator {
+        debug!("Creating metrics flusher and aggregator");
+
+        let (flusher, handle) =


I think a comment here to note why the aggregator is started separately from the dogstatsd listener would be helpful - just enough to be clear that there are different configuration options that require this (dogstatsd enabled/disabled, enhanced metrics enabled/disabled).

Maybe a unit test as well to assert that all of these combinations are covered?

Good point! I added a comment in 60cdecf

A meaningful unit test looks difficult to write because the aggregator/dogstatsd startup logic has side effects that make it hard to test in isolation. Maybe I could refactor the startup decision into a struct that describes what to start, separating the decision from its execution?

struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}
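
With the decision separated out like that, covering every flag combination becomes a table-driven check. The sketch below repeats the proposed struct so it is self-contained. The `Debug`/`PartialEq` derives and the helpers `expected_decisions` and `assert_all_combinations` are additions for illustration. The Azure environment gating discussed elsewhere in this PR is not modeled here.

```rust
#[derive(Debug, PartialEq)]
struct AgentConfig {
    start_aggregator: bool,
    start_dogstatsd: bool,
    start_enhanced_metrics: bool,
}

fn resolve_agent_config(dd_use_dogstatsd: bool, dd_enhanced_metrics: bool) -> AgentConfig {
    AgentConfig {
        // The aggregator is shared: either consumer needs it running.
        start_aggregator: dd_use_dogstatsd || dd_enhanced_metrics,
        start_dogstatsd: dd_use_dogstatsd,
        start_enhanced_metrics: dd_enhanced_metrics,
    }
}

/// Every (dd_use_dogstatsd, dd_enhanced_metrics) pair with its expected decision.
/// (Hypothetical helper for the test table.)
fn expected_decisions() -> [((bool, bool), AgentConfig); 4] {
    let cfg = |agg, dsd, enh| AgentConfig {
        start_aggregator: agg,
        start_dogstatsd: dsd,
        start_enhanced_metrics: enh,
    };
    [
        ((false, false), cfg(false, false, false)),
        ((true, false), cfg(true, true, false)),
        ((false, true), cfg(true, false, true)),
        ((true, true), cfg(true, true, true)),
    ]
}

/// Body of a unit test: checks each combination against the table.
fn assert_all_combinations() {
    for ((dogstatsd, enhanced), expected) in expected_decisions() {
        assert_eq!(resolve_agent_config(dogstatsd, enhanced), expected);
    }
}
```

Calling `assert_all_combinations()` from a `#[test]` function would give the coverage asked for above without starting any listener or aggregator.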

duncanpharvey · 2026-03-10T21:37:42Z

@@ -312,3 +361,42 @@ fn build_metrics_client(
    }
    Ok(builder.build()?)
 }
+
+fn build_cpu_metrics_tags() -> Option<SortedTags> {


Would this method make more sense to live in the datadog-metrics-collector crate and be used internally within the crate?

That makes sense! I moved it in f867f6f

kathiehuang · 2026-03-11T16:38:05Z

> Also suggest checking how ADP is doing agent checks in rust, this sounds like an agent check for a very specific use case

I looked into how ADP does agent checks in Rust - it looks like it's still experimental and may be too high-level for this use case. I will make a Jira ticket for the backlog though so we can come back to this in the future and see if anything in the way they do checks is applicable!

https://datadoghq.atlassian.net/browse/SVLS-8699

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: a79098f64b

kathiehuang changed the title ~~[SVLS-8351] Add CPU Enhanced Metrics~~ [SVLS-8351] Add CPU Enhanced Metrics in Linux Azure Functions Mar 6, 2026

kathiehuang force-pushed the kathie.huang/add-cpu-enhanced-metrics branch 2 times, most recently from 7010e35 to 5953d68 Compare March 6, 2026 20:38

DataDog deleted a comment from chatgpt-codex-connector Bot Mar 7, 2026

chatgpt-codex-connector Bot reviewed Mar 7, 2026

Comment thread crates/datadog-serverless-compat/src/main.rs Outdated

chatgpt-codex-connector Bot reviewed Mar 7, 2026

chatgpt-codex-connector Bot reviewed Mar 9, 2026

kathiehuang marked this pull request as ready for review March 9, 2026 17:26

kathiehuang requested review from a team as code owners March 9, 2026 17:26

kathiehuang requested review from Lewis-E, duncanpharvey and lym953 and removed request for a team March 9, 2026 17:26

duncanista reviewed Mar 9, 2026

Lewis-E reviewed Mar 9, 2026

Comment thread crates/datadog-metrics-collector/src/cpu.rs Outdated

Lewis-E reviewed Mar 9, 2026

Comment thread crates/datadog-metrics-collector/src/linux.rs Outdated

Lewis-E reviewed Mar 9, 2026

Comment thread crates/datadog-metrics-collector/src/azure_cpu.rs

Lewis-E reviewed Mar 9, 2026

Comment thread crates/datadog-metrics-collector/src/linux.rs Outdated

Lewis-E reviewed Mar 10, 2026

Comment thread crates/datadog-serverless-compat/src/main.rs Outdated

duncanpharvey reviewed Mar 10, 2026

Lewis-E reviewed Mar 19, 2026

Comment thread crates/datadog-serverless-compat/src/main.rs

kathiehuang added 26 commits May 5, 2026 17:37

Create windows-enhanced-metrics feature for Windows-specific logic

a9bb5a6

Add unit to collection interval variable

e325a0b

Make last_usage_ns an Option and keep CPU total as u64 until f64 is needed

0469761

Change collection interval to 1 for precision and remove unneeded logs

f01ca9f

Move tag building logic from datadog-serverless-compat to datadog-metrics-collector

8208718

Remove unused dependencies from datadog-trace-agent

7e3f98d

Turn off DD_ENHANCED_METRICS in Windows for now to prevent metrics collector from initializing and logging that metrics collection is being skipped

8022a6e

Handle malformed cpuset.cpus file

4824e0e

Skip collection when elapsed_secs is less than or equal to 0

c8533cb

Update comments to clarify that Windows is not supported yet

aaabbe1

Add unit test for metric classification

5dafcb4

Log when scheduler quota can't be parsed

e2c26bc

nit: address clippy warning

9323283

Remove resource_id tag and add check for invalid CPU set range

6128427

Move log flusher into flush_interval.tick()

6a06cb2

Clean up

94e3082

Use azure_tags::build_enhanced_metrics_tags

95b3af6

Add DD_ENHANCED_METRICS_ENABLED to EnabledMetricsComponents

8c357ad

Gate windows-enhanced-metrics feature explicitly by OS

f482900

Update license

21bce7f

Fix Windows CI checks

96a4111

Remove CPU limit metric

1b0ed5d

Remove CgroupStats struct

2d3b0e7

Rename Azure-specific files

a0d46a4

Update comments

7d3b51e

Guard CPU collection loop if cpu_collector is None

a79098f

kathiehuang force-pushed the kathie.huang/add-cpu-enhanced-metrics branch from b84dc84 to a79098f Compare May 5, 2026 21:38

kathiehuang marked this pull request as ready for review May 5, 2026 21:39

chatgpt-codex-connector Bot reviewed May 5, 2026

Comment thread crates/datadog-metrics-collector/src/azure_cpu.rs

kathiehuang requested a review from duncanpharvey May 6, 2026 16:51