fix: Handling of site connection issues during outage#3470
fix: Handling of site connection issues during outage#3470rippyboii wants to merge 0 commit intopython-discord:mainfrom
Conversation
bot/exts/filtering/filtering.py
Outdated
|
|
||
| return isinstance(error, (TimeoutError, OSError)) | ||
|
|
||
| async def _alert_mods_filter_load_failure(self, error: Exception, attempts: int) -> None: |
There was a problem hiding this comment.
This method is never actually getting called.
There was a problem hiding this comment.
Hi, Thank you for mentioning it.
Initially we alerted mods directly via the individual cogs call. However, later when we refactored to centralize the process during the startup period, we missed to refactor it.
This unused method is removed
bot/exts/filtering/filtering.py
Outdated
| await mod_alerts_channel.send( | ||
| ":warning: Filtering failed to load filter lists during startup " | ||
| f"after {attempts} attempt(s). Error: `{error_details}`" | ||
| ) |
There was a problem hiding this comment.
When there is any failure in startup process, including the cog load. Now it is all collected directly by StartupFailureReporter in startup_reporting.py, and it mentions the moderator when reporting the error
bot/exts/info/python_news.py
Outdated
| MAX_ATTEMPTS = constants.URLs.connect_max_retries | ||
| INITIAL_BACKOFF_SECONDS = 1 |
There was a problem hiding this comment.
These should be added to constants and instead of being redefined as constants at the top of the file usages should just use the imported class (e.g. do from bot.constants import URLs and then use URLs.connect_max_retries as an argument)
There was a problem hiding this comment.
Thank you for your suggestion,
We have fixed this issue!
bot/exts/filtering/filtering.py
Outdated
| FILTER_LOAD_MAX_ATTEMPTS = constants.URLs.connect_max_retries | ||
| INITIAL_BACKOFF_SECONDS = 1 |
There was a problem hiding this comment.
Move both to constants and reference the constants.
There was a problem hiding this comment.
Thank you for the suggestion!
We have fixed this issue!
bot/exts/filtering/filtering.py
Outdated
| @staticmethod | ||
| def _retryable_filter_load_error(error: Exception) -> bool: | ||
| """Return whether loading filter lists failed due to some temporary error, thus retrying could help.""" | ||
| if isinstance(error, ResponseCodeError): | ||
| return error.status in (408, 429) or error.status >= 500 | ||
|
|
||
| return isinstance(error, (TimeoutError, OSError)) |
There was a problem hiding this comment.
This should be made a generic utility and moved to a separate file for all backoff-compatible cogs to use.
There was a problem hiding this comment.
Hi, Thank you for your suggestion.
We have now added new file in utility for this retry mechanism and is reflected across the code
bot/utils/startup_reporting.py
Outdated
| if TYPE_CHECKING: | ||
| from bot.bot import Bot | ||
|
|
||
| @dataclass(frozen=True) |
There was a problem hiding this comment.
Hi,
This issue has been fixed!
bot/utils/startup_reporting.py
Outdated
| for k in keys: | ||
| e = failures[k] |
There was a problem hiding this comment.
Hi, thank you for mentioning it,
we have fixed this one too
bot/utils/startup_reporting.py
Outdated
|
|
||
| return textwrap.dedent(f""" | ||
| Failed items: | ||
| {chr(10).join(lines)} |
There was a problem hiding this comment.
Not the way to do this in so many ways, you can just do "\n".join(lines)
Can do something like:
await ctx.send(
"Failed items:\n" +
"\n".join(lines)
)
bot/bot.py
Outdated
| _current_extension: contextvars.ContextVar[str | None] = contextvars.ContextVar( | ||
| "current_extension", default=None | ||
| ) |
There was a problem hiding this comment.
I'm not exactly sure why you need a context var to implement this, it seems like unnecessary overcomplication and just makes things harder to follow.
There was a problem hiding this comment.
Hi, thank you for mentioning it
We have fixed this issue, bot.py no longer uses contextvars or _content_extensions at all.
| def test_retryable_python_news_load_error(self): | ||
| """`_retryable_site_load_error` should classify temporary failures as retryable.""" | ||
| test_cases = ( | ||
| (ResponseCodeError(MagicMock(status=408)), True), | ||
| (ResponseCodeError(MagicMock(status=429)), True), | ||
| (ResponseCodeError(MagicMock(status=500)), True), | ||
| (ResponseCodeError(MagicMock(status=503)), True), | ||
| (ResponseCodeError(MagicMock(status=400)), False), | ||
| (ResponseCodeError(MagicMock(status=404)), False), | ||
| (TimeoutError("timeout"), True), | ||
| (OSError("os error"), True), | ||
| (AttributeError("attr"), False), | ||
| (ValueError("value"), False), | ||
| ) |
There was a problem hiding this comment.
These tests should all be moved to a general utility (like the retriable error method in each of the cogs as mentioned).
There was a problem hiding this comment.
Hi Thank you for mentioning this,
I have moved this one to it's respective test file to follow the structure, following from the utility in main.
|
Hi @jb3 , Thank you so much for bringing out the suggestions. We have followed your guidance and implemented the refactor to most of it. Could you please review it once again in your free time? Thank you so much |
Summary
This PR improves bot startup reliability and moderator visibility when extensions/cogs fail to load.
#mod-log.During
setup_hook, extensions are loaded concurrently. When an extension/cog fails due to a transient outage (rate limits, 5xx, timeouts), failures can either:This change standardizes both resilience (retry when appropriate) and visibility (one clean startup report + targeted alerts).
Changes
1) Startup failure aggregation (single
#mod-logalert)utils/startup_reporting.pyto format a standardized startup failure message.bot.pyto:2) Retry + backoff for external/API-dependent cogs
Implemented retry logic with exponential backoff and explicit “retriable vs non-retriable” classification, plus moderator notifications when retries are exhausted.
Covered cogs include:
HTTP 429,HTTP 5xx,TimeoutError,OSError; final failure logs + alerts#mod-alerts.URLs.connect_max_retries; warns are logged to Sentry; final failure posts to#mod-log.408,429,5xx,TimeoutError,OSError; on max retries logs + alertsmod_alertsand re-raises to stop startup.Tests / Verification
Suggested checks:
uv run task test#mod-logstartup alert.Moderator Alert in Discord:
Closes #2918