Component
API or orchestration, Agent (Python runtime), Scripts / CLI
Describe the feature
Add step-aware rewind semantics to the append-only progress event stream so that when a durable orchestrator step or an agent pipeline phase is retried after a crash/transient failure, consumers (bgagent watch, status, notification channels) can detect the replay and discard or supersede the now-stale events from the abandoned attempt — instead of rendering duplicated or contradictory progress.
Today the stream is append-only and forward-cursored:
TaskEventsTable is append-only; event_id is a ULID (time-sortable) SK (progress_writer.py, ORCHESTRATOR.md).
GET /v1/tasks/{id}/events supports an after=<event_id> forward ULID cursor (get-task-events.ts).
watch polls forward with afterEventId and adaptive cadence (cli/src/commands/watch.ts).
Because the orchestrator is a Lambda Durable Function that replays from the last checkpoint after a crash (ORCHESTRATOR.md §durable execution), and idempotent steps may re-run, a retried step/phase can emit a second set of progress events for work the consumer already saw. With a pure forward cursor and no replay marker, watch shows both attempts: e.g. "repo_setup_complete" twice, or turn/tool events from an abandoned attempt interleaved with the live one. This is a correctness/UX gap in the streaming surface, not a data-loss issue (the audit log intentionally keeps everything).
Use case
- Trustworthy live progress: A user watching
bgagent watch after an orchestrator crash sees a single coherent timeline, not duplicated milestones from the replayed step.
- Accurate notifications: Slack/Linear/GitHub progress comments don't post contradictory "now doing X" updates from a superseded attempt.
- Auditability preserved: The append-only audit log still retains every event (including superseded ones); rewind is a consumer-side presentation contract, not a destructive delete.
Proposed solution
Phase 0 — Attempt/epoch stamping
- Stamp each progress event with an attempt/epoch identifier tied to the durable step execution (e.g.
attempt_epoch derived from the durable execution attempt or a monotonic per-step counter) plus the logical step/phase name. Add as optional metadata so existing readers are unaffected.
Phase 1 — Rewind marker
- When a step/phase begins a new attempt, write a
stream_rewind control event carrying { step, superseded_after_event_id, new_epoch }. This marks "events for step with an earlier epoch than new_epoch are superseded."
- Producers:
progress_writer.py (runner + pipeline writers) emit the rewind marker at phase entry on retry; orchestrator emits it on durable-step retry.
Phase 2 — Consumer honoring
cli/src/commands/watch.ts and the status formatter honor stream_rewind: when seen, drop already-rendered events for step with epoch < new_epoch (or visually mark them superseded). Forward after cursor semantics are unchanged.
- The
get-task-events.ts API stays append-only; optionally add a collapse=superseded query mode that filters superseded events server-side for consumers that don't implement client-side rewind.
Phase 3 — Notifications
- Notification emitters treat a
stream_rewind as a signal to edit/replace the prior progress message for that step rather than append a contradictory one (where the channel supports edits).
Out of scope (explicit)
- Deleting or mutating events in
TaskEventsTable — the audit log remains complete and append-only.
- Changing the durable-execution retry/idempotency model itself (
ORCHESTRATOR.md) — this issue makes its replays legible to consumers, it does not alter replay behavior.
Acceptance criteria
Component
API or orchestration, Agent (Python runtime), Scripts / CLI
Describe the feature
Add step-aware rewind semantics to the append-only progress event stream so that when a durable orchestrator step or an agent pipeline phase is retried after a crash/transient failure, consumers (
bgagent watch,status, notification channels) can detect the replay and discard or supersede the now-stale events from the abandoned attempt — instead of rendering duplicated or contradictory progress.Today the stream is append-only and forward-cursored:
TaskEventsTableis append-only;event_idis a ULID (time-sortable) SK (progress_writer.py,ORCHESTRATOR.md).GET /v1/tasks/{id}/eventssupports anafter=<event_id>forward ULID cursor (get-task-events.ts).watchpolls forward withafterEventIdand adaptive cadence (cli/src/commands/watch.ts).Because the orchestrator is a Lambda Durable Function that replays from the last checkpoint after a crash (
ORCHESTRATOR.md§durable execution), and idempotent steps may re-run, a retried step/phase can emit a second set of progress events for work the consumer already saw. With a pure forward cursor and no replay marker,watchshows both attempts: e.g. "repo_setup_complete" twice, or turn/tool events from an abandoned attempt interleaved with the live one. This is a correctness/UX gap in the streaming surface, not a data-loss issue (the audit log intentionally keeps everything).Use case
bgagent watchafter an orchestrator crash sees a single coherent timeline, not duplicated milestones from the replayed step.Proposed solution
Phase 0 — Attempt/epoch stamping
attempt_epochderived from the durable execution attempt or a monotonic per-step counter) plus the logicalstep/phasename. Add as optional metadata so existing readers are unaffected.Phase 1 — Rewind marker
stream_rewindcontrol event carrying{ step, superseded_after_event_id, new_epoch }. This marks "events forstepwith an earlier epoch thannew_epochare superseded."progress_writer.py(runner + pipeline writers) emit the rewind marker at phase entry on retry; orchestrator emits it on durable-step retry.Phase 2 — Consumer honoring
cli/src/commands/watch.tsand thestatusformatter honorstream_rewind: when seen, drop already-rendered events forstepwith epoch <new_epoch(or visually mark them superseded). Forwardaftercursor semantics are unchanged.get-task-events.tsAPI stays append-only; optionally add acollapse=supersededquery mode that filters superseded events server-side for consumers that don't implement client-side rewind.Phase 3 — Notifications
stream_rewindas a signal to edit/replace the prior progress message for that step rather than append a contradictory one (where the channel supports edits).Out of scope (explicit)
TaskEventsTable— the audit log remains complete and append-only.ORCHESTRATOR.md) — this issue makes its replays legible to consumers, it does not alter replay behavior.Acceptance criteria
attempt_epoch+ logicalstep/phasein metadata; documented in design docs (source + Starlight sync).stream_rewindcontrol event is emitted on durable-step / pipeline-phase retry with{ step, superseded_after_event_id, new_epoch }.progress_writer.pyproducers emit epoch + rewind markers; covered byagent/tests/test_progress_writer.py.bgagent watchandstatusdiscard/supersede stale events onstream_rewind; coherent timeline verified bycli/test.GET /v1/tasks/{id}/eventsremains append-only and backward-compatible; optionalcollapse=supersededmode (if implemented) covered incdk/test/handlers.TaskEventsTable; audit completeness preserved (asserted by test).aftercursor behavior unchanged for existing consumers that ignore the new fields.