feat(metrics): emit workflow execution and per-block metrics to CloudWatch#4931
feat(metrics): emit workflow execution and per-block metrics to CloudWatch#4931TheodoreSpeaks wants to merge 4 commits into
Conversation
|
The latest updates on your projects. Learn more about Vercel for GitHub. |
|
@greptile review |
PR SummaryMedium Risk Overview
The stale-execution cleanup cron only force-fails rows still Reviewed by Cursor Bugbot for commit 83bfcbd. Bugbot is set up for automated code reviews on this repo. Configure here. |
Greptile SummaryThis PR generalises
Confidence Score: 5/5Safe to merge; all metric calls are fire-and-forget with errors caught and dropped, so no request-path impact. The new metric paths are fully additive to existing logic and cannot break execution correctness. The completionMetricEmitted guard and the cron conditional-update pattern correctly prevent double-counting. Test coverage for the emitter and all LoggingSession emission paths is solid. The only finding is an omitted optional field in a telemetry call. apps/sim/app/api/cron/cleanup-stale-executions/route.ts — the already-computed totalDurationMs is not forwarded to recordExecutionCompleted. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[Workflow execution starts] --> B[LoggingSession.start / safeStart]
B --> C[recordExecutionStarted Sim/Workflow]
C --> D{Execution path}
D --> E[BlockExecutor.execute]
E --> F{Block outcome}
F -->|success| G[recordBlockExecuted success=true + BlockDuration]
F -->|error| H[recordBlockExecuted success=false + BlockDuration]
D --> I{Session completion}
I -->|complete| J[recordExecutionCompleted status=success + ExecutionDuration]
I -->|completeWithError| K[recordExecutionCompleted status=failed + ExecutionDuration]
I -->|completeWithCancellation| L[recordExecutionCompleted status=cancelled + ExecutionDuration]
I -->|completeWithPause| M[recordExecutionPaused]
I -->|cost-only fallback| N[recordExecutionCompleted status=failed/cancelled/success]
I -->|markAsFailed| O[recordExecutionCompleted status=failed no durationMs]
M -->|later markAsFailed| O
P[cleanup-stale-executions cron] -->|crashed workers only| Q[recordExecutionCompleted status=failed no durationMs]
J & K & L & N & O & Q --> R[(CloudWatch Sim/Workflow)]
G & H --> R
Reviews (3): Last reviewed commit: "improvement(metrics): one-shot guard on ..." | Re-trigger Greptile |
| const ENVIRONMENT = | ||
| process.env.OTEL_DEPLOYMENT_ENVIRONMENT || | ||
| process.env.DEPLOYMENT_ENVIRONMENT || | ||
| process.env.GRAFANA_DEPLOYMENT_ENVIRONMENT || | ||
| process.env.NODE_ENV || | ||
| 'development' |
There was a problem hiding this comment.
GRAFANA_DEPLOYMENT_ENVIRONMENT ordering may not fix staging
GRAFANA_DEPLOYMENT_ENVIRONMENT is checked only after DEPLOYMENT_ENVIRONMENT. If staging workers already have DEPLOYMENT_ENVIRONMENT set to any value — even 'production' (common when a shared env template is used) — the Grafana variable is silently shadowed and the fix has no effect. If staging workers consistently have GRAFANA_DEPLOYMENT_ENVIRONMENT but not DEPLOYMENT_ENVIRONMENT, the current order is fine; but if the opposite is possible, GRAFANA_DEPLOYMENT_ENVIRONMENT should move earlier in the chain, or the comment should document the expected env layout explicitly.
Note: If this suggestion doesn't match your team's coding style, reply to this and let me know. I'll remember it for next time!
There was a problem hiding this comment.
Verified the premise does not hold: the infra CDK stacks set neither OTEL_DEPLOYMENT_ENVIRONMENT nor DEPLOYMENT_ENVIRONMENT anywhere (grepped application-stack and constructs), and CloudWatch currently shows only Environment=production series — which is exactly the bug. The ordering is intentional: the OTEL_/DEPLOYMENT_ vars stay as explicit overrides, with the Grafana label (already set per-env for trigger.dev workers) as the next-best signal. The authoritative fix — DEPLOYMENT_ENVIRONMENT: config.environment on the ECS env maps — lands in the infra repo separately.
Greptile SummaryThis PR generalizes the existing
Confidence Score: 4/5Safe to merge; all terminal completion paths are guarded against double-counting and the CloudWatch flush machinery is unmodified. The two findings are confined to metric accuracy, not execution correctness. The dedup flag (completionMetricEmitted), the regex cardinality guard on Operation, and the trigger type safety (notNull DB column) are all correct. The safeStart double-emit risk is latent rather than present — nothing currently triggers it — but the pattern is fragile enough to note. The cost-only fallback's || 0 duration passes a zero ExecutionDuration data point whenever totalDurationMs is absent, which will skew CloudWatch percentile math for that path. The two metric-emission sites in logging-session.ts — around safeStart (line 796) and the cost-only fallback duration (line 1080) — are worth a second look before the staging rollout produces misleading percentile data. Important Files Changed
Sequence DiagramsequenceDiagram
participant Caller
participant LoggingSession
participant BlockExecutor
participant workflowMetrics
participant Buffer
participant CloudWatch
Caller->>LoggingSession: start() / safeStart()
LoggingSession->>workflowMetrics: "recordExecutionStarted({trigger})"
workflowMetrics->>Buffer: enqueue(Sim/Workflow, ExecutionStarted)
loop Per block
BlockExecutor->>BlockExecutor: execute block
alt Success
BlockExecutor->>workflowMetrics: "recordBlockExecuted({blockType, operation?, success:true, durationMs})"
else Error
BlockExecutor->>workflowMetrics: "recordBlockExecuted({blockType, operation?, success:false, durationMs})"
end
workflowMetrics->>Buffer: enqueue(Sim/Workflow, BlockExecuted + BlockDuration)
end
alt Normal completion
LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status, durationMs})"
else Pause
LoggingSession->>workflowMetrics: "recordExecutionPaused({trigger})"
else markAsFailed
LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status:failed}) [guarded by completionMetricEmitted]"
else Crashed worker (cron)
Note over LoggingSession,workflowMetrics: cleanup-stale-executions route
LoggingSession->>workflowMetrics: "recordExecutionCompleted({trigger, status:failed})"
end
workflowMetrics->>Buffer: enqueue(Sim/Workflow, ExecutionCompleted + ExecutionDuration?)
Note over Buffer,CloudWatch: Every 5s / threshold / SIGTERM
Buffer->>CloudWatch: PutMetricData per namespace (Sim/Workflow, Sim/HostedKey)
Reviews (2): Last reviewed commit: "feat(metrics): emit workflow execution a..." | Re-trigger Greptile |
…etric on actual row transition
|
@greptile review |
…d omit unknown durations
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 83bfcbd. Configure here.

Summary
lib/monitoring/metrics.tsinto a namespace-aware CloudWatch emitter (per-namespacePutMetricDatabatching, same buffer/flush/exit-drain machinery);hostedKeyMetricsAPI unchangedSim/Workflownamespace:ExecutionStarted,ExecutionCompleted(Trigger/Status dims),ExecutionDuration,ExecutionPaused,BlockExecuted(BlockType/Operation/Success dims),BlockDurationLoggingSession(all completion paths incl. cost-only fallback andmarkAsFailed), guarded against double-counting; pause is tracked separately from terminal completionsBlockExecutorat blockLog finalization (success + error paths);Operationdim is regex-guarded against dynamic values exploding cardinalitycleanup-stale-executionscron now counts crashed-worker executions as failed completions (only place those can be recorded)GRAFANA_DEPLOYMENT_ENVIRONMENTto the Environment dimension fallback chain so staging workers stop tagging metrics asproductionType of Change
Testing
Unit tests for the emitter (namespace grouping, buffer drain, dimension shapes, failure drop) and LoggingSession metric emission (start/success/failed/cancelled/paused, dedup across fallback/markAsFailed).
bun run lint:check,tsc --noEmit, andcheck:api-validation:strictpass.Checklist