Skip to content

refactor(run-control): unify target resolution and SSE handling#2731

Open
dgageot wants to merge 2 commits intodocker:mainfrom
dgageot:board/588d847ac6fc2a26
Open

refactor(run-control): unify target resolution and SSE handling#2731
dgageot wants to merge 2 commits intodocker:mainfrom
dgageot:board/588d847ac6fc2a26

Conversation

@dgageot
Copy link
Copy Markdown
Member

@dgageot dgageot commented May 9, 2026

Continues the work merged in #2714 by simplifying and hardening the run
control plane (run --listen, send, watch, proto, serve mcp --attach).

What this changes

  • One target resolver everywhere. New runregistry.Find(target) is the
    single entry point used by send, watch, proto, and serve mcp --attach. It resolves an empty target to the latest live run, a numeric
    target by PID, an http(s)://host:port target by address, and anything
    else as a (partial) session id. Exact session-id matches always win over
    substring matches, and ambiguous substring matches return a clear error
    instead of silently picking one. Replaces the per-command resolveTarget
    and the previous PID-only behaviour of --to/--attach.

  • Shared SSE helper. cmd/root/sse.go collapses the duplicated SSE
    client code from watch.go and proto.go into openEventStream +
    readEventStream. Payloads are surfaced as json.RawMessage so consumers
    can either forward the bytes verbatim or decode once.

    • proto now forwards the raw event bytes instead of doing a redundant
      unmarshal/marshal round-trip.
    • Error responses are read with io.LimitReader(resp.Body, 4 KiB) so a
      misbehaving server can't push an unbounded payload into client memory.
    • readEventStream returns ctx.Err() on cancellation; watch maps
      context.Canceled back to a clean exit so Ctrl+C doesn't print a
      spurious error.
  • No double responses in proto. dispatchProto reports whether it
    already produced a typed reply (e.g. transcript); the generic ack is
    skipped in that case.

  • Atomic registry writes. runregistry.Write now writes through a
    sibling temp file + Rename so concurrent readers never see torn JSON.
    Latest is simplified with slices.MaxFunc.

  • Cross-platform event hooks. --on-event shell commands now go through
    shellpath.DetectShell() (PowerShell on Windows, $SHELL on Unix)
    instead of hard-coded sh -c. Captured stdout/stderr is logged on
    failure, and the use of context.Background() is documented (hooks
    intentionally outlive app shutdown so the last event still flushes).

Tests

  • runregistry covers Find for pid / addr / session id, ambiguity, the
    exact-match-vs-substring regression, and the empty-registry case.
  • cmd/root/sse_test.go covers happy path, error-body cap, and ctx
    cancellation.
  • Existing tests updated for the new dispatchProto signature.

task lint and task test are green.

@dgageot dgageot requested a review from a team as a code owner May 9, 2026 12:55
Copy link
Copy Markdown

@docker-agent docker-agent left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Assessment: 🟡 NEEDS ATTENTION

Two medium-severity bugs found in the new code. No high-severity issues.

Comment thread pkg/runregistry/registry.go Outdated
return r, nil
}
}
return Record{}, fmt.Errorf("no live run with pid %d", pid)
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] ErrNoRun sentinel not returned for "target not found" — docstring/implementation mismatch

The doc comment on ErrNoRun states:

"ErrNoRun is returned when no live run can be found that satisfies the caller's request (empty registry, or no record matches the target)."

But the implementation only returns ErrNoRun in the two empty-registry paths (lines 160 and 170). When the registry is non-empty and the specific target is not found, all three miss-paths (pid, addr, session-id) return plain fmt.Errorf(...) values that are not wrapped with ErrNoRun.

// pid miss — plain error, not ErrNoRun
return Record{}, fmt.Errorf("no live run with pid %d", pid)

// addr miss — plain error, not ErrNoRun
return Record{}, fmt.Errorf("no live run at %s", target)

// session miss — plain error, not ErrNoRun
return Record{}, fmt.Errorf("no live run matches %q ...", target)

Any caller that does errors.Is(err, runregistry.ErrNoRun) to detect "no matching run" will silently miss these three cases. Given the docstring promises the sentinel for all "not found" outcomes, these paths should either wrap with fmt.Errorf("…: %w", ErrNoRun) or the docstring should be corrected to document the current (narrower) behaviour.

Comment thread cmd/root/run_event_hooks.go Outdated
cmd.Stdin = bytes.NewReader(payload)
if err := cmd.Run(); err != nil {
slog.Warn("on-event hook failed", "command", command, "error", err)
out, err := cmd.CombinedOutput()
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[MEDIUM] CombinedOutput() is incompatible with "tail-f-style listener" hooks — goroutine leak

cmd.CombinedOutput() blocks until the subprocess exits and its stdout/stderr reaches EOF. A "tail-f-style listener" (the example in the comment directly above this line) never exits, so this call would block indefinitely.

Because the caller spawns a fresh goroutine per event (go runEventHook(...)), every event that fires while a long-running hook is active leaks another goroutine. CombinedOutput() also buffers all subprocess output in memory, so an infinite-output hook also causes unbounded heap growth.

The comment explicitly documents the intent that such hooks should be allowed to run, but the implementation contradicts that intent. Consider using cmd.Start() + non-blocking reads (or simply discarding output for "fire-and-forget" hooks) so the goroutine can finish promptly without waiting for process exit.

@dgageot dgageot force-pushed the board/588d847ac6fc2a26 branch from 90e088b to ad7b51c Compare May 9, 2026 13:44
@dgageot
Copy link
Copy Markdown
Member Author

dgageot commented May 9, 2026

Addressed review feedback:

  • runregistry.ErrNoRun — wrapped the three "target not found" errors (pid miss, addr miss, session-id miss) with fmt.Errorf("...: %w", ErrNoRun) so callers using errors.Is(err, ErrNoRun) actually catch them. The docstring now matches the implementation. Tests in TestFind were extended with assert.ErrorIs for each miss path.
  • runEventHook — replaced cmd.CombinedOutput() with a boundedBuffer capped at 4 KiB (matches the SSE error-body cap). A chatty or long-running hook can no longer push unbounded data into the agent's heap, while we still keep enough output for the failure log. The misleading "tail-f-style" wording in the comment is gone — each invocation receives a single event on stdin and exits, and the comment now says exactly that. Added TestBoundedBuffer_CapsAtMaxHookOutput.

task lint and task test are green.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants