perf(db): add query timing, statement timeouts, and replica routing for usage reads #1105
Merged
jeanduplessis merged 5 commits into main on Mar 16, 2026
Phased plan to reduce connection-pool saturation: instrument hot queries, enforce scoped statement timeouts, bound unbounded reads with period selectors, and add covering indexes on the fact table. Rollups are deferred until measurement justifies them.
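Bounding an unbounded read with a period selector comes down to always deriving a lower time bound before the aggregation runs. A minimal sketch, assuming a hypothetical `UsagePeriod` union with made-up period values and hypothetical `periodStart()`/`parsePeriod()` helpers (none of these names come from the PR or the plan); the returned `Date` would feed a lower-bound predicate on the fact table's timestamp column.

```typescript
// Hypothetical period selector; the values and names are illustrative only.
type UsagePeriod = '1h' | '24h' | '7d' | '30d';

const HOUR_MS = 60 * 60 * 1000;

const PERIOD_MS: Record<UsagePeriod, number> = {
  '1h': HOUR_MS,
  '24h': 24 * HOUR_MS,
  '7d': 7 * 24 * HOUR_MS,
  '30d': 30 * 24 * HOUR_MS,
};

/**
 * Inclusive lower bound for a usage read, so every aggregation scans a
 * bounded time range instead of the whole fact table.
 */
function periodStart(period: UsagePeriod, now: Date = new Date()): Date {
  return new Date(now.getTime() - PERIOD_MS[period]);
}

/**
 * Narrows untrusted input (e.g. a query-string value) to a known period.
 * hasOwnProperty avoids matching inherited keys such as "toString".
 */
function parsePeriod(value: string | undefined, fallback: UsagePeriod = '30d'): UsagePeriod {
  return value !== undefined && Object.prototype.hasOwnProperty.call(PERIOD_MS, value)
    ? (value as UsagePeriod)
    : fallback;
}
```

Falling back to a default period rather than to "no filter" means a missing or malformed selector can never turn into a full-table aggregation.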
jrf0110 (Contributor) reviewed on Mar 15, 2026
Another thing we could consider: the abuse service currently tracks all usage in Cloudflare's Analytics Engine (backed by ClickHouse), and it has no problem with this load. We're already dual-writing usage there, so perhaps we can offload the aggregate usage stats to that dataset.
…aggregations to replica

Profile usage, user/org autocomplete, Kilo Pass billing reads, org summary, and org usage detail queries are all pure SELECT/aggregation with no read-then-write dependencies. Routing them to `readDb` alongside the admin queries removes significantly more load from the primary's connection pool. Excludes `getAIAdoptionTimeseries` (not a `microdollar_usage` hotspot).
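The routing rule in these commits reduces to one question per read: must the response reflect writes the caller made moments earlier? A minimal sketch under that assumption; `DbHandle`, `UsageRead`, and `pickUsageDb()` are hypothetical, and only `db`, `readDb`, and `getCurrentPeriodUsageUsd` are names taken from the PR.

```typescript
// Stand-in for a Drizzle instance; only which server it points at matters here.
interface DbHandle {
  role: 'primary' | 'replica';
}

// Hypothetical descriptor for a usage read (not an API from the PR).
interface UsageRead {
  label: string;
  // True when the response must reflect writes made moments earlier,
  // as with getCurrentPeriodUsageUsd driving subscription state.
  needsReadYourWrites: boolean;
}

/**
 * Pure SELECT/aggregation reads go to the replica to relieve the primary's
 * connection pool. Reads that must observe fresh writes stay on the primary,
 * since replica lag could otherwise surface stale billing state.
 */
function pickUsageDb(read: UsageRead, db: DbHandle, readDb: DbHandle): DbHandle {
  return read.needsReadYourWrites ? db : readDb;
}
```

Recording the exception on the read itself, rather than only in which handle a file happens to import, is one way to keep the read-your-writes cases visible in review.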
…or microdollar_usage reads

Implement Phase 1 of the db-perf plan:

- Add `timedUsageQuery()` helper that wraps usage queries in a transaction with `SET LOCAL statement_timeout` (5s interactive, 20s admin) and structured JSON timing logs (route, label, scope, period, duration, rows)
- Route all read-only `microdollar_usage` aggregation queries to `readDb` (replica) to reduce primary connection-pool saturation
- Kilo Pass `getCurrentPeriodUsageUsd` stays on primary (`db`) since it drives the subscription-state response shown immediately after writes
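A minimal sketch of the wrap-time-and-log pattern this commit describes. It is not the PR's `timedUsageQuery()`: the `SqlTx`/`SqlDb` interfaces are hypothetical stand-ins for Drizzle (only a `transaction()` wrapper and raw statement execution are assumed), and the log field names loosely mirror the commit message. The transaction matters because `SET LOCAL` lasts only until that transaction ends.

```typescript
// Hypothetical minimal interfaces standing in for Drizzle's db and transaction
// objects; only a transaction() wrapper and raw statement execution are assumed.
interface SqlTx {
  execute(statement: string): Promise<unknown>;
}

interface SqlDb {
  transaction<T>(fn: (tx: SqlTx) => Promise<T>): Promise<T>;
}

type TimeoutScope = 'interactive' | 'admin';

// Budgets from the PR description: 5s interactive, 20s admin.
// SET does not accept bind parameters, so only these fixed numbers are interpolated.
const TIMEOUT_MS: Record<TimeoutScope, number> = { interactive: 5000, admin: 20000 };

interface UsageQueryMeta {
  route: string;
  label: string;
  scope: TimeoutScope;
  period?: string;
}

async function timedUsageQuerySketch<T extends unknown[]>(
  db: SqlDb,
  meta: UsageQueryMeta,
  run: (tx: SqlTx) => Promise<T>,
  log: (line: string) => void = (line) => console.log(line),
): Promise<T> {
  const started = Date.now();
  let rows: number | null = null;
  let ok = false;
  try {
    const result = await db.transaction(async (tx) => {
      // Lasts only until this transaction commits or rolls back, so the pooled
      // connection goes back with its default timeout.
      await tx.execute(`SET LOCAL statement_timeout = ${TIMEOUT_MS[meta.scope]}`);
      return run(tx);
    });
    rows = result.length;
    ok = true;
    return result;
  } finally {
    // One JSON line per query, on success and on failure (e.g. a cancelled statement).
    log(JSON.stringify({ ...meta, durationMs: Date.now() - started, rows, ok }));
  }
}
```

A plain `SET` would instead change the session default and leak the 5s or 20s budget into whatever query next borrows that pooled connection.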
Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (10 files)

Reviewed by gpt-5.4-20260305 · 805,972 tokens
iscekic approved these changes on Mar 16, 2026
Summary
Addresses connection-pool saturation caused by slow `microdollar_usage` aggregation queries (10-22s, causing pool-waiting spikes of 17k-28k during peak hours).

This PR implements Phase 1 of the performance plan documented in `plans/db-perf-improvements.md`:

- `timedUsageQuery()` helper in `src/lib/usage-query.ts` wraps all hot usage queries in a transaction with structured JSON logging (route, label, scope, period, duration, row count) for before/after measurement.
- `SET LOCAL statement_timeout` is enforced per query: 5s for interactive reads (dashboard, billing, autocomplete) and 20s for admin reads (abuse stats). Runaway queries are cancelled instead of holding connections for 10-22s.
- `microdollar_usage` aggregation queries moved from `db` (primary) to `readDb` (replica), except `getCurrentPeriodUsageUsd`, which stays on primary because it drives the subscription-state response shown immediately after usage writes.

Affected endpoints: profile usage, user autocomplete metrics, Kilo Pass billing reads, org 30-day summary, org usage details (time series, daily breakdown, autocomplete), and admin abuse stats (hourly, daily, 1h/24h aggregates).
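When `statement_timeout` fires, PostgreSQL cancels the statement with SQLSTATE `57014` (`query_canceled`) and the message "canceling statement due to statement timeout". A caller that wants to treat a cancelled usage read differently from other failures could check for that. A minimal sketch, assuming the driver exposes the SQLSTATE as `code` (as node-postgres and postgres.js do) and that an ORM layer may wrap driver errors in `cause`; `isStatementTimeout()` is a hypothetical helper, and the PR does not say how timeouts are surfaced to clients.

```typescript
const QUERY_CANCELED = '57014';

// Loose shape of a driver or ORM error; every field is optional and untrusted.
interface PgLikeError {
  code?: unknown;
  message?: unknown;
  cause?: unknown;
}

/**
 * True when the error, or any error in its `cause` chain, is PostgreSQL's
 * statement_timeout cancellation. SQLSTATE 57014 is also raised for manual
 * cancellation (pg_cancel_backend), so the message is checked too; this
 * assumes the server reports messages in English.
 */
function isStatementTimeout(err: unknown, maxDepth = 5): boolean {
  let current: unknown = err;
  for (let depth = 0; depth < maxDepth && current !== null && typeof current === 'object'; depth++) {
    const e = current as PgLikeError;
    if (
      e.code === QUERY_CANCELED &&
      typeof e.message === 'string' &&
      e.message.includes('statement timeout')
    ) {
      return true;
    }
    current = e.cause;
  }
  return false;
}
```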
Verification
- `pnpm typecheck`: passes

Visual Changes
N/A
Reviewer Notes
- `getCurrentPeriodUsageUsd` intentionally stays on primary (`db`); see the inline comment in `kilo-pass-router.ts`. Moving it to the replica would risk showing stale billing state immediately after usage writes.
- The `timedUsageQuery` helper casts `tx as unknown as DbInstance` because Drizzle's transaction type is narrower than the top-level `db` type but supports the same `.select()` API. This is the one `as` cast in the PR; the alternatives (generics over Drizzle's internal transaction types) add complexity without safety benefit.
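The double cast in this note can be illustrated with two toy types: a top-level handle and a transaction that shares `select()` but each has a member the other lacks. When neither type is assignable to the other, TypeScript rejects a direct `as` (error TS2352) and the conversion has to go through `unknown`. A minimal sketch; `DbInstanceLike`, `TxLike`, `countRows`, and `countRowsInTx` are hypothetical stand-ins, not Drizzle's actual types.

```typescript
// Toy stand-ins for the real Drizzle types (hypothetical, for illustration).
interface SelectBuilder {
  from(table: string): Promise<unknown[]>;
}

interface DbInstanceLike {
  select(): SelectBuilder;
  $client: { end(): Promise<void> }; // only on the top-level handle
}

interface TxLike {
  select(): SelectBuilder;
  rollback(): never; // only on the transaction
}

// A helper written against the top-level type...
async function countRows(conn: DbInstanceLike, table: string): Promise<number> {
  const rows = await conn.select().from(table);
  return rows.length;
}

// ...reused inside a transaction. `tx as DbInstanceLike` alone would not
// compile, so the cast goes through `unknown`; this is only sound because
// countRows touches nothing but select().
async function countRowsInTx(tx: TxLike, table: string): Promise<number> {
  return countRows(tx as unknown as DbInstanceLike, table);
}
```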