fix: Handling of site connection issues during outage #3482

Open
rippyboii wants to merge 102 commits into python-discord:main from rippyboii:main

Conversation

@rippyboii

Summary

This PR improves bot startup reliability and moderator visibility when extensions/cogs fail to load.

  • Aggregate extension + cog load failures during startup and report them as a single alert in #mod-log.
  • Add retry + exponential backoff for cogs that depend on external sites/APIs (e.g., temporary 5xx/429/timeouts), with clear mod notifications on final failure.
  • Add unit tests to validate retry behavior, error classification, and startup reporting.

During setup_hook, extensions are loaded concurrently. When an extension/cog fails due to a transient outage (rate limits, 5xx, timeouts), failures can either:

  • stop startup unexpectedly, or
  • fail noisily and in fragments, making it hard to see what broke and why.

This change standardizes both resilience (retry when appropriate) and visibility (one clean startup report + targeted alerts).

Changes

1) Startup failure aggregation (single #mod-log alert)

  • Added utils/startup_reporting.py to format a standardized startup failure message.
  • Updated bot.py to:
    • collect extension + cog load failures (import/setup/add_cog)
    • wait for all load tasks to complete
    • send one aggregated alert summarizing all failures
  • Reporting is defensive: it does not crash if the log channel is unavailable.
  • Startup continues for non-critical failures.
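
A minimal sketch of the aggregation idea described above, not the code in this PR: every name here (`load_all`, `format_startup_failures`, `report`, and the `send` callable standing in for posting to the log channel) is hypothetical. It shows concurrent loading where failures are collected rather than raised, one combined message, and a reporting step that swallows its own errors.

```python
import asyncio


def format_startup_failures(failures: dict[str, BaseException]) -> str:
    """Build one summary message from a mapping of extension name -> error."""
    lines = [f"Startup: {len(failures)} extension(s)/cog(s) failed to load"]
    for name, error in sorted(failures.items()):
        lines.append(f"- {name}: {type(error).__name__}: {error}")
    return "\n".join(lines)


async def load_all(loaders: dict) -> dict[str, BaseException]:
    """Run every loader concurrently and collect failures instead of raising."""
    names = list(loaders)
    # return_exceptions=True makes gather wait for all tasks and hand back
    # exceptions as results, so one failing loader cannot abort the others.
    results = await asyncio.gather(
        *(loaders[name]() for name in names),
        return_exceptions=True,
    )
    return {
        name: result
        for name, result in zip(names, results)
        if isinstance(result, BaseException)
    }


async def report(send, failures: dict[str, BaseException]) -> bool:
    """Send the aggregated alert; a reporting error must not crash startup."""
    if not failures:
        return False
    try:
        await send(format_startup_failures(failures))
    except Exception:
        # Assumption: the log channel may be unavailable during an outage.
        return False
    return True
```

Here `loaders` maps an extension name to a zero-argument coroutine function; in the real bot this role would be played by whatever loads each extension during `setup_hook`.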

2) Retry + backoff for external/API-dependent cogs

Implemented retry logic with exponential backoff and explicit “retriable vs non-retriable” classification, plus moderator notifications when retries are exhausted.

Covered cogs include:

  • Filtering: 3 attempts with backoff (1s, 2s, 4s); retries on HTTP 429, HTTP 5xx, TimeoutError, OSError; final failure logs + alerts #mod-alerts.
  • Reminders: retry count is configured via URLs.connect_max_retries; warnings are logged to Sentry; final failure posts to #mod-log.
  • PythonNews: retries on 408, 429, 5xx, TimeoutError, OSError; on max retries logs + alerts #mod-alerts and re-raises to stop startup.
  • Superstarify cogs: added retry + notification and corresponding tests.
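
The retry pattern described above can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: `HTTPStatusError`, `is_retriable`, and `retry_with_backoff` are hypothetical names, and the status set follows the PythonNews bullet (408, 429, 5xx). The sketch sleeps only between attempts, so three attempts wait 1s and then 2s.

```python
import asyncio


class HTTPStatusError(Exception):
    """Hypothetical stand-in for an API client's response error."""

    def __init__(self, status: int):
        super().__init__(f"HTTP {status}")
        self.status = status


def is_retriable(error: BaseException) -> bool:
    """Treat 408, 429, 5xx, timeouts and OS-level errors as transient."""
    if isinstance(error, HTTPStatusError):
        return error.status in (408, 429) or 500 <= error.status < 600
    return isinstance(error, (TimeoutError, OSError))


async def retry_with_backoff(
    operation,
    *,
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep=asyncio.sleep,
):
    """Await `operation()` until it succeeds, doubling the delay each time."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await operation()
        except Exception as error:
            # Non-retriable errors and the final attempt propagate to the
            # caller, which is where a moderator alert would be sent.
            if not is_retriable(error) or attempt == max_attempts:
                raise
            await sleep(base_delay * 2 ** (attempt - 1))
```

Passing `sleep` as a parameter is a design choice for testability: a test can substitute a recording coroutine and avoid real delays.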

Tests / Verification

  • Added unit tests covering:
    • retry-then-success
    • max-retries then alert + failure behavior
    • non-retriable errors
    • retry classification logic
    • aggregated startup failure reporting for faulty extensions/cogs
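
As an illustration of the first three test cases listed above, here is a self-contained sketch using `unittest.IsolatedAsyncioTestCase` and `AsyncMock`. The helper under test, `load_with_retry`, is hypothetical and deliberately simplified (no delay, only `TimeoutError` counts as retriable); the PR's actual tests target the real cogs.

```python
import unittest
from unittest.mock import AsyncMock


async def load_with_retry(fetch, alert, max_attempts=3):
    """Hypothetical helper: retry `fetch`, alert once when attempts run out."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await fetch()
        except TimeoutError:
            if attempt == max_attempts:
                await alert("cog failed to load")
                raise


class RetryTests(unittest.IsolatedAsyncioTestCase):
    async def test_retry_then_success(self):
        # First call times out, second call returns data.
        fetch = AsyncMock(side_effect=[TimeoutError(), "data"])
        alert = AsyncMock()
        self.assertEqual(await load_with_retry(fetch, alert), "data")
        self.assertEqual(fetch.await_count, 2)
        alert.assert_not_awaited()

    async def test_max_retries_then_alert(self):
        # Every call times out, so the helper alerts and re-raises.
        fetch = AsyncMock(side_effect=TimeoutError())
        alert = AsyncMock()
        with self.assertRaises(TimeoutError):
            await load_with_retry(fetch, alert)
        self.assertEqual(fetch.await_count, 3)
        alert.assert_awaited_once()

    async def test_non_retriable_error_is_not_retried(self):
        # A ValueError is not transient: one call, no alert from the helper.
        fetch = AsyncMock(side_effect=ValueError("bad config"))
        alert = AsyncMock()
        with self.assertRaises(ValueError):
            await load_with_retry(fetch, alert)
        self.assertEqual(fetch.await_count, 1)
        alert.assert_not_awaited()
```

Saved as a module, these cases can be run with `python -m unittest`.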

Suggested checks:

  1. uv run task test
  2. Run the bot and simulate a faulty extension/cog load to confirm a single aggregated #mod-log startup alert.

Moderator Alert in Discord:

[Image: mod_alert]

Closes #2918

rippyboii and others added 30 commits February 25, 2026 15:50
Alerts the moderators through a discord error message if the loading of the Reminders Cog has failed.
Adds retry logic with time buff to `Filtering.cog_load()`
Changed cog_load() function to retry connecting to api if it fails initially with an exponential delay and limited max attempts.
Add test cases for retrying cog loads and skeleton for new functions
Implement skeleton functions with code for retrying fetch, alerting mods, and checking if retryable
Add unit test for on_member_update and unit test to check _alert_mods_if_loading_failed is being called
a-runebou and others added 28 commits March 2, 2026 20:20
fix: remove uncalled method (Closes #36)
refactor: rename variables (Closes #38)
refactor: simplify merging of lines (Closes #39)
refactor: remove explicit context helper function (Closes #45)
refactor: remove dataclass label (Closes #47)
@rippyboii rippyboii requested a review from mbaruh as a code owner March 24, 2026 14:09
@rippyboii

rippyboii commented Mar 24, 2026

Hello @jb3,

I have opened another PR to keep things clean. The earlier PR referenced some commits that included reports, which were part of our assignment but not of this project. This new PR doesn't include any commits referencing the reports or anything else that's unnecessary to this project.

All the suggestions from the previous PR have been applied in this PR too.

Apologies for the delayed update from our side.

Development

Successfully merging this pull request may close these issues.

Handling of site connection issues during outage.

5 participants