fix: add retry limits and context recovery hint for error retries (#12087) #12090

Draft
roomote-v0[bot] wants to merge 1 commit into main from fix/error-retry-context-recovery

roomote-v0 bot commented Apr 10, 2026

Related GitHub Issue

Closes: #12087

Description

This PR attempts to address issue #12087, in which provider errors during auto-retry cause the model to lose track of the current task and hallucinate about previously completed work.

Root cause analysis: After a provider error + auto-retry, the apiConversationHistory is correctly preserved. However, weaker models (e.g., glm-5 via OpenAI Compatible) can get confused by the conversation structure after an error discontinuity, especially when the current task prompt is embedded several messages back as a tool_result on attempt_completion. The model latches onto the strong "Task Completed" signal from the prior task instead of the new request. Auto-approved operations (file read/write) then allow this hallucination to cascade.

Changes:

  1. Add MAX_STREAM_RETRIES (5) limit: Both mid-stream and first-chunk error auto-retries now have a cap. Previously there was no limit -- only exponential backoff up to 10 minutes. When the limit is reached, the error is surfaced to the user via the manual retry prompt.

  2. Add context recovery hint on mid-stream error retries: When retrying after a mid-stream error, a brief instruction is prepended to the user content reminding the model to continue with the most recent request and not repeat completed work. This helps weaker models re-orient after the error discontinuity.

  3. Graceful fallthrough for first-chunk errors: When autoApprovalEnabled is true and max retries are exhausted, the code falls through to the manual retry prompt instead of continuing indefinitely.
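The retry cap described in changes 1 and 3 can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: only `MAX_STREAM_RETRIES` and its value of 5 come from the description above. The `decideRetry` function, the `RetryDecision` type, and the backoff base of 5 seconds are hypothetical, and the sketch assumes auto-retry only happens while auto-approval is enabled.

```typescript
// Named in the PR description; everything else in this block is hypothetical.
const MAX_STREAM_RETRIES = 5

// Assumed backoff parameters. The description only states that backoff
// previously grew up to 10 minutes.
const BASE_DELAY_SECONDS = 5
const MAX_DELAY_SECONDS = 600

type RetryDecision =
  | { kind: "auto-retry"; delaySeconds: number }
  | { kind: "ask-user" }

// retryAttempt counts the auto-retries already made for the current request.
function decideRetry(retryAttempt: number, autoApprovalEnabled: boolean): RetryDecision {
  // Without auto-approval, or once the cap is reached, surface the error
  // to the user through the manual retry prompt instead of looping.
  if (!autoApprovalEnabled || retryAttempt >= MAX_STREAM_RETRIES) {
    return { kind: "ask-user" }
  }
  // Exponential backoff, clamped to the assumed 10-minute ceiling.
  const delaySeconds = Math.min(BASE_DELAY_SECONDS * 2 ** retryAttempt, MAX_DELAY_SECONDS)
  return { kind: "auto-retry", delaySeconds }
}
```

Under these assumptions, attempts 0 through 4 are retried automatically and the sixth failure falls through to the manual prompt. A caller would reset its attempt counter to 0 after the user chooses to retry manually, matching the "retry counter reset" case listed in the test procedure.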

Test Procedure

  • New test file src/core/task/__tests__/error-retry-limits.spec.ts with 9 tests covering:
    • Retry limit enforcement for both mid-stream and first-chunk errors
    • Context recovery hint prepending and structure
    • Retry counter reset after manual retry
    • Stack item structure validation
  • All existing tests pass: Task.spec.ts (38 passed, 4 skipped -- pre-existing skips)
  • Run: cd src && npx vitest run core/task/__tests__/error-retry-limits.spec.ts

Pre-Submission Checklist

  • Issue Linked: This PR is linked to an approved GitHub Issue.
  • Scope: Changes are focused on the linked issue.
  • Self-Review: Performed a thorough self-review.
  • Testing: New tests added to cover the changes.
  • Documentation Impact: No documentation updates required.
  • Contribution Guidelines: Read and agree to the Contributor Guidelines.

Documentation Updates

  • No documentation updates are required.

Additional Notes

Feedback and guidance are welcome. The context recovery hint text can be tuned if needed -- the current wording is designed to be clear and directive without being too verbose.
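For reviewers who want to experiment with the wording, the hint prepending can be pictured along these lines. This is a sketch under stated assumptions: the `ContentBlock` shape, the `withContextRecoveryHint` helper, and the hint text itself are placeholders for illustration and are not copied from the PR.

```typescript
// Hypothetical content-block shapes, loosely modeled on chat-API message content.
type TextBlock = { type: "text"; text: string }
type ToolResultBlock = { type: "tool_result"; tool_use_id: string; content: string }
type ContentBlock = TextBlock | ToolResultBlock

// Placeholder wording; the actual hint text in the PR may differ.
const CONTEXT_RECOVERY_HINT =
  "[The previous request failed with a provider error. Continue with the most recent user request and do not repeat work that is already complete.]"

// Returns a new array with the hint as the first block; the input is not mutated.
function withContextRecoveryHint(userContent: ContentBlock[]): ContentBlock[] {
  const first = userContent[0]
  // Design choice of this sketch: do not stack hints across repeated retries.
  if (first !== undefined && first.type === "text" && first.text === CONTEXT_RECOVERY_HINT) {
    return userContent
  }
  return [{ type: "text", text: CONTEXT_RECOVERY_HINT }, ...userContent]
}
```

Keeping the hint as its own leading text block leaves the original user content untouched, so changing the wording only means editing one constant.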

…2087)

- Add MAX_STREAM_RETRIES (5) limit for both mid-stream and first-chunk
  error auto-retries. Previously there was no cap, allowing indefinite
  retry loops with only exponential backoff.

- When mid-stream error retries are exhausted, present the error to the
  user via api_req_failed ask instead of continuing silently.

- When first-chunk error retries are exhausted with autoApprovalEnabled,
  fall through to the manual retry prompt instead of continuing.

- Add a context recovery hint to the user content on mid-stream error
  retries. This helps weaker models re-orient after a provider error
  instead of hallucinating about previously completed tasks.

- Add tests for retry limit enforcement and context recovery hint.

Development

Successfully merging this pull request may close these issues.

[BUG] Context loss on Provider Error causes agent to forget the latest prompt and hallucinate previous task