feat(observability): step-aware rewind semantics for the progress event stream #252

@krokoko

Component

API or orchestration, Agent (Python runtime), Scripts / CLI

Describe the feature

Add step-aware rewind semantics to the append-only progress event stream. When a durable orchestrator step or an agent pipeline phase is retried after a crash or transient failure, consumers (bgagent watch, status, notification channels) should be able to detect the replay and discard or supersede the stale events from the abandoned attempt, instead of rendering duplicated or contradictory progress.

Today the stream is append-only and forward-cursored:

  • TaskEventsTable is append-only; event_id is a ULID (time-sortable) SK (progress_writer.py, ORCHESTRATOR.md).
  • GET /v1/tasks/{id}/events supports an after=<event_id> forward ULID cursor (get-task-events.ts).
  • watch polls forward with afterEventId and adaptive cadence (cli/src/commands/watch.ts).

Because the orchestrator is a Lambda Durable Function that replays from the last checkpoint after a crash (ORCHESTRATOR.md §durable execution), and idempotent steps may re-run, a retried step/phase can emit a second set of progress events for work the consumer already saw. With a pure forward cursor and no replay marker, watch shows both attempts: e.g. "repo_setup_complete" twice, or turn/tool events from an abandoned attempt interleaved with the live one. This is a correctness/UX gap in the streaming surface, not a data-loss issue (the audit log intentionally keeps everything).
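
To make the gap concrete, here is a minimal sketch of a forward-cursored consumer reading a stream that contains a replayed step. The in-memory list, the shortened event_id values, and the helper names events_after and render_timeline are illustrative assumptions, not code from get-task-events.ts or watch.ts.

```python
from typing import Optional

# Hypothetical stand-in for TaskEventsTable rows, sorted by event_id.
# The ids are shortened placeholders, not real ULIDs; only their
# lexicographic order matters for the cursor.
EVENTS = [
    {"event_id": "01A", "type": "repo_setup_started"},
    {"event_id": "01B", "type": "repo_setup_complete"},
    # Orchestrator crashes here and replays the step from its checkpoint.
    {"event_id": "01C", "type": "repo_setup_started"},
    {"event_id": "01D", "type": "repo_setup_complete"},
]


def events_after(events: list[dict], after: Optional[str]) -> list[dict]:
    """Forward cursor: return events whose event_id sorts after the cursor."""
    if after is None:
        return list(events)
    return [e for e in events if e["event_id"] > after]


def render_timeline(events: list[dict]) -> list[str]:
    """Naive consumer: render every event it receives, in order."""
    return [e["type"] for e in events]


# First poll happens before the crash, second poll after the replay.
first_poll = events_after(EVENTS[:2], None)
second_poll = events_after(EVENTS, first_poll[-1]["event_id"])
timeline = render_timeline(first_poll + second_poll)
```

The cursor behaves correctly (the second poll returns only new rows), yet the rendered timeline contains repo_setup_complete twice because nothing in the stream says the second pair replaces the first.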

Use case

  • Trustworthy live progress: A user watching bgagent watch after an orchestrator crash sees a single coherent timeline, not duplicated milestones from the replayed step.
  • Accurate notifications: Slack/Linear/GitHub progress comments don't post contradictory "now doing X" updates from a superseded attempt.
  • Auditability preserved: The append-only audit log still retains every event (including superseded ones); rewind is a consumer-side presentation contract, not a destructive delete.

Proposed solution

Phase 0 — Attempt/epoch stamping

  • Stamp each progress event with an attempt/epoch identifier tied to the durable step execution (e.g. attempt_epoch derived from the durable execution attempt or a monotonic per-step counter) plus the logical step/phase name. Add both as optional metadata so existing readers are unaffected.
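
A minimal sketch of the stamping, assuming events carry a free-form metadata dict. The ProgressEvent class, the stamp and epoch_of helpers, and the choice to treat unstamped events as epoch 0 are assumptions for illustration; they are not the existing progress_writer.py API.

```python
from dataclasses import dataclass, field
from typing import Any


@dataclass
class ProgressEvent:
    """Hypothetical event shape; the real schema may differ."""
    event_id: str
    event_type: str
    metadata: dict[str, Any] = field(default_factory=dict)


def stamp(event: ProgressEvent, step: str, attempt_epoch: int) -> ProgressEvent:
    """Return a copy of the event with step and attempt_epoch added to metadata."""
    merged = {**event.metadata, "step": step, "attempt_epoch": attempt_epoch}
    return ProgressEvent(event.event_id, event.event_type, merged)


def epoch_of(event: ProgressEvent) -> int:
    """Read the epoch; unstamped (legacy) events default to 0 (an assumption)."""
    return int(event.metadata.get("attempt_epoch", 0))
```

Because the new keys live in optional metadata, a reader that never looks at them sees the same event_id and event_type as before.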

Phase 1 — Rewind marker

  • When a step/phase begins a new attempt, write a stream_rewind control event carrying { step, superseded_after_event_id, new_epoch }. The marker tells consumers that events for step with an epoch earlier than new_epoch are superseded.
  • Producers: progress_writer.py (runner + pipeline writers) emit the rewind marker at phase entry on retry; orchestrator emits it on durable-step retry.
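
One way a producer could build the marker at phase entry is sketched below. The function name build_stream_rewind and the FIRST_EPOCH constant are hypothetical, and since this issue does not specify how superseded_after_event_id is chosen, the sketch takes it as a caller-supplied value.

```python
from typing import Optional

# Assumption: the first attempt of a step has epoch 1; retries count up from there.
FIRST_EPOCH = 1


def build_stream_rewind(
    step: str,
    new_epoch: int,
    superseded_after_event_id: Optional[str],
) -> Optional[dict]:
    """Return a stream_rewind control event for a retry, or None on the first attempt."""
    if new_epoch <= FIRST_EPOCH:
        return None  # nothing to supersede on the first attempt
    return {
        "type": "stream_rewind",
        "step": step,
        "superseded_after_event_id": superseded_after_event_id,
        "new_epoch": new_epoch,
    }
```

Returning None on the first attempt keeps the stream unchanged for tasks that never retry, so the marker only appears when there is something to supersede.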

Phase 2 — Consumer honoring

  • cli/src/commands/watch.ts and the status formatter honor stream_rewind: when seen, drop already-rendered events for step with epoch < new_epoch (or visually mark them superseded). Forward after cursor semantics are unchanged.
  • The get-task-events.ts API stays append-only; optionally add a collapse=superseded query mode that filters superseded events server-side for consumers that don't implement client-side rewind.
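
The consumer-side rule can be sketched as a pure function over the event list. It is written in Python for consistency with the other sketches here, although the real consumers are TypeScript. The event shape (step and attempt_epoch under metadata, marker fields at the top level), the function name collapse_superseded, and the handling of late events from an abandoned attempt are assumptions. The same filter could plausibly back a server-side collapse=superseded mode.

```python
def _step_and_epoch(event: dict) -> tuple:
    """Read step and epoch from optional metadata; unstamped events yield (None, 0)."""
    meta = event.get("metadata", {})
    return meta.get("step"), meta.get("attempt_epoch", 0)


def collapse_superseded(events: list[dict]) -> list[dict]:
    """Return the visible timeline without mutating the input list."""
    visible: list[dict] = []
    live_epoch: dict[str, int] = {}  # latest epoch announced per step
    for event in events:
        if event.get("type") == "stream_rewind":
            step, new_epoch = event["step"], event["new_epoch"]
            live_epoch[step] = new_epoch
            # Drop already-rendered events for this step from earlier epochs.
            visible = [
                e for e in visible
                if not (_step_and_epoch(e)[0] == step
                        and _step_and_epoch(e)[1] < new_epoch)
            ]
            continue  # control events are not rendered
        step, epoch = _step_and_epoch(event)
        if step is not None and epoch < live_epoch.get(step, 0):
            continue  # late event from an abandoned attempt
        visible.append(event)
    return visible


def _ev(event_id: str, type_: str, epoch: int) -> dict:
    meta = {"step": "repo_setup", "attempt_epoch": epoch}
    return {"event_id": event_id, "type": type_, "metadata": meta}


SAMPLE = [
    _ev("01A", "repo_setup_started", 1),
    _ev("01B", "repo_setup_complete", 1),
    # superseded_after_event_id is a placeholder; this sketch does not interpret it.
    {"event_id": "01C", "type": "stream_rewind", "step": "repo_setup",
     "superseded_after_event_id": "00Z", "new_epoch": 2},
    _ev("01D", "tool_call", 1),  # straggler from the abandoned attempt
    _ev("01E", "repo_setup_started", 2),
    _ev("01F", "repo_setup_complete", 2),
    {"event_id": "01G", "type": "task_heartbeat"},  # unstamped legacy event
]
visible_ids = [e["event_id"] for e in collapse_superseded(SAMPLE)]
```

Events without a step are never filtered, which matches the requirement that consumers and producers unaware of the new fields keep working, and the input list is left intact, mirroring the append-only audit log.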

Phase 3 — Notifications

  • Notification emitters treat a stream_rewind as a signal to edit/replace the prior progress message for that step rather than append a contradictory one (where the channel supports edits).

Out of scope (explicit)

  • Deleting or mutating events in TaskEventsTable — the audit log remains complete and append-only.
  • Changing the durable-execution retry/idempotency model itself (ORCHESTRATOR.md). This issue makes its replays legible to consumers; it does not alter replay behavior.

Acceptance criteria

  • Progress events carry an attempt_epoch + logical step/phase in metadata; documented in design docs (source + Starlight sync).
  • A stream_rewind control event is emitted on durable-step / pipeline-phase retry with { step, superseded_after_event_id, new_epoch }.
  • progress_writer.py producers emit epoch + rewind markers; covered by agent/tests/test_progress_writer.py.
  • bgagent watch and status discard/supersede stale events on stream_rewind; coherent timeline verified by cli/test.
  • GET /v1/tasks/{id}/events remains append-only and backward-compatible; optional collapse=superseded mode (if implemented) covered in cdk/test/handlers.
  • No events are deleted from TaskEventsTable; audit completeness preserved (asserted by test).
  • Forward after cursor behavior unchanged for existing consumers that ignore the new fields.

Labels: agent-runtime, enhancement, observability, orchestration