feat(observability): step-aware rewind semantics for the progress event stream

### Component

API or orchestration, Agent (Python runtime), Scripts / CLI

### Describe the feature

Add **step-aware rewind semantics** to the append-only progress event stream so that when a durable orchestrator step or an agent pipeline phase is retried after a crash/transient failure, consumers (`bgagent watch`, `status`, notification channels) can detect the replay and **discard or supersede the now-stale events** from the abandoned attempt — instead of rendering duplicated or contradictory progress.

Today the stream is append-only and forward-cursored:
- `TaskEventsTable` is append-only; `event_id` is a ULID (time-sortable) SK (`progress_writer.py`, `ORCHESTRATOR.md`).
- `GET /v1/tasks/{id}/events` supports an `after=<event_id>` forward ULID cursor (`get-task-events.ts`).
- `watch` polls forward with `afterEventId` and adaptive cadence (`cli/src/commands/watch.ts`).

Because the orchestrator is a Lambda Durable Function that **replays from the last checkpoint** after a crash (`ORCHESTRATOR.md` §durable execution), and idempotent steps may re-run, a retried step/phase can emit a *second* set of progress events for work the consumer already saw. With a pure forward cursor and no replay marker, `watch` shows both attempts: e.g. "repo_setup_complete" twice, or turn/tool events from an abandoned attempt interleaved with the live one. This is a **correctness/UX gap in the streaming surface**, not a data-loss issue (the audit log intentionally keeps everything).

### Use case

- **Trustworthy live progress:** A user watching `bgagent watch` after an orchestrator crash sees a single coherent timeline, not duplicated milestones from the replayed step.
- **Accurate notifications:** Slack/Linear/GitHub progress comments don't post contradictory "now doing X" updates from a superseded attempt.
- **Auditability preserved:** The append-only audit log still retains every event (including superseded ones); rewind is a **consumer-side presentation contract**, not a destructive delete.

### Proposed solution

**Phase 0 — Attempt/epoch stamping**
- Stamp each progress event with an **attempt/epoch identifier** tied to the durable step execution (e.g. `attempt_epoch` derived from the durable execution attempt or a monotonic per-step counter) plus the logical `step`/`phase` name. Add as optional metadata so existing readers are unaffected.

**Phase 1 — Rewind marker**
- When a step/phase begins a new attempt, write a `stream_rewind` control event carrying `{ step, superseded_after_event_id, new_epoch }`. This marks "events for `step` with an earlier epoch than `new_epoch` are superseded."
- Producers: `progress_writer.py` (runner + pipeline writers) emit the rewind marker at phase entry on retry; orchestrator emits it on durable-step retry.

**Phase 2 — Consumer honoring**
- `cli/src/commands/watch.ts` and the `status` formatter honor `stream_rewind`: when seen, drop already-rendered events for `step` with epoch < `new_epoch` (or visually mark them superseded). Forward `after` cursor semantics are unchanged.
- The `get-task-events.ts` API stays append-only; optionally add a `collapse=superseded` query mode that filters superseded events server-side for consumers that don't implement client-side rewind.

**Phase 3 — Notifications**
- Notification emitters treat a `stream_rewind` as a signal to edit/replace the prior progress message for that step rather than append a contradictory one (where the channel supports edits).

**Out of scope (explicit)**
- Deleting or mutating events in `TaskEventsTable` — the audit log remains complete and append-only.
- Changing the durable-execution retry/idempotency model itself (`ORCHESTRATOR.md`) — this issue makes its replays *legible to consumers*, it does not alter replay behavior.

### Acceptance criteria

- [ ] Progress events carry an `attempt_epoch` + logical `step`/`phase` in metadata; documented in design docs (source + Starlight sync).
- [ ] A `stream_rewind` control event is emitted on durable-step / pipeline-phase retry with `{ step, superseded_after_event_id, new_epoch }`.
- [ ] `progress_writer.py` producers emit epoch + rewind markers; covered by `agent/tests/test_progress_writer.py`.
- [ ] `bgagent watch` and `status` discard/supersede stale events on `stream_rewind`; coherent timeline verified by `cli/test`.
- [ ] `GET /v1/tasks/{id}/events` remains append-only and backward-compatible; optional `collapse=superseded` mode (if implemented) covered in `cdk/test/handlers`.
- [ ] No events are deleted from `TaskEventsTable`; audit completeness preserved (asserted by test).
- [ ] Forward `after` cursor behavior unchanged for existing consumers that ignore the new fields.


Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat(observability): step-aware rewind semantics for the progress event stream #252

Component

Describe the feature

Use case

Proposed solution

Acceptance criteria

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

feat(observability): step-aware rewind semantics for the progress event stream #252

Description

Component

Describe the feature

Use case

Proposed solution

Acceptance criteria

Metadata

Metadata

Assignees

Labels

Type

Fields

Projects

Milestone

Relationships

Development

Issue actions