perf(db): add query timing, statement timeouts, and replica routing for usage reads#1105

Merged
jeanduplessis merged 5 commits into main from db-perf on Mar 16, 2026

Conversation

@jeanduplessis
Contributor

@jeanduplessis jeanduplessis commented Mar 15, 2026

Summary

Addresses connection-pool saturation caused by slow microdollar_usage aggregation queries (10-22s, causing 17k-28k pool waiting spikes during peak hours).

This PR implements Phase 1 of the performance plan documented in plans/db-perf-improvements.md:

  • Query instrumentation (Phase 1a): New timedUsageQuery() helper in src/lib/usage-query.ts wraps all hot usage queries in a transaction with structured JSON logging (route, label, scope, period, duration, row count) for before/after measurement.
  • Statement timeouts (Phase 1b): SET LOCAL statement_timeout enforced per-query — 5s for interactive reads (dashboard, billing, autocomplete), 20s for admin reads (abuse stats). Runaway queries are cancelled instead of holding connections for 10-22s.
  • Replica routing (Phase 1c): All read-only microdollar_usage aggregation queries moved from db (primary) to readDb (replica), except getCurrentPeriodUsageUsd which stays on primary because it drives the subscription-state response shown immediately after usage writes.
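
The wrapper described in Phases 1a and 1b can be sketched roughly as follows. This is an illustrative sketch, not the code in src/lib/usage-query.ts: the `Tx` and `Db` shapes, `timedUsageQuerySketch`, `timeoutStatement`, and `createRecordingDb` are hypothetical stand-ins so the example runs without a database or Drizzle. Only the 5s/20s timeouts and the logged fields (route, label, scope, period, duration, row count) come from the description above.

```typescript
// Hypothetical minimal shapes standing in for the real database types.
type Tx = { execute(statement: string): Promise<unknown> };
type Db = { transaction<T>(fn: (tx: Tx) => Promise<T>): Promise<T> };

type TimeoutClass = "interactive" | "admin";

// 5s for interactive reads, 20s for admin reads (values from the PR summary).
const TIMEOUT_MS: Record<TimeoutClass, number> = { interactive: 5_000, admin: 20_000 };

type QueryMeta = {
  route: string;
  label: string;
  scope: string;
  period: string;
  timeoutClass: TimeoutClass;
};

// SET does not accept bind parameters, so a trusted integer constant is interpolated.
function timeoutStatement(timeoutClass: TimeoutClass): string {
  return `SET LOCAL statement_timeout = ${TIMEOUT_MS[timeoutClass]}`;
}

async function timedUsageQuerySketch<Row>(
  db: Db,
  meta: QueryMeta,
  run: (tx: Tx) => Promise<Row[]>,
  log: (line: string) => void = console.log,
): Promise<Row[]> {
  const { timeoutClass, ...fields } = meta;
  const startedAt = Date.now();
  try {
    const rows = await db.transaction(async (tx) => {
      // SET LOCAL lasts only until the end of this transaction, so the
      // timeout cannot leak to other queries that reuse the connection.
      await tx.execute(timeoutStatement(timeoutClass));
      return run(tx);
    });
    log(JSON.stringify({ ...fields, durationMs: Date.now() - startedAt, rowCount: rows.length }));
    return rows;
  } catch (error) {
    // Log failures (including cancelled queries) before rethrowing.
    log(JSON.stringify({ ...fields, durationMs: Date.now() - startedAt, error: String(error) }));
    throw error;
  }
}

// In-memory stand-in that records executed statements instead of talking to Postgres.
function createRecordingDb(): { db: Db; statements: string[] } {
  const statements: string[] = [];
  const tx: Tx = {
    execute: async (statement: string) => {
      statements.push(statement);
      return undefined;
    },
  };
  const db: Db = {
    transaction: async <T>(fn: (tx: Tx) => Promise<T>): Promise<T> => fn(tx),
  };
  return { db, statements };
}
```

The transaction is what makes SET LOCAL usable here: outside a transaction block the setting would have no lasting effect, and a plain SET would persist on the pooled connection.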

Affected endpoints: profile usage, user autocomplete metrics, Kilo Pass billing reads, org 30-day summary, org usage details (time series, daily breakdown, autocomplete), and admin abuse stats (hourly, daily, 1h/24h aggregates).

Verification

  • pnpm typecheck — passes

Visual Changes

N/A

Reviewer Notes

  • getCurrentPeriodUsageUsd intentionally stays on primary (db) — see inline comment in kilo-pass-router.ts. Moving it to replica would risk showing stale billing state immediately after usage writes.
  • The timedUsageQuery helper casts tx as unknown as DbInstance because Drizzle's transaction type is narrower than the top-level db type but supports the same .select() API. This is the one as cast in the PR — alternatives (generics over Drizzle's internal transaction types) add complexity without safety benefit.
  • Phase 2 (bounded queries / period selectors) and Phase 3 (covering indexes) are planned as follow-up PRs per the plan doc.
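
The routing rule from the first note can be pictured as a small decision function. This is a sketch under stated assumptions: `DbHandle`, `ReadKind`, and `dbForUsageRead` are hypothetical names, and the `db` and `readDb` values below are plain stand-ins for the real primary and replica instances, not the actual exports.

```typescript
// Hypothetical stand-ins for the primary (db) and replica (readDb) handles.
type DbHandle = { readonly role: "primary" | "replica" };
const db: DbHandle = { role: "primary" };
const readDb: DbHandle = { role: "replica" };

// "aggregate": pure SELECT/aggregation with no read-then-write dependency.
// "read-after-write": result is shown immediately after a usage write,
// so replica lag could surface stale state (e.g. getCurrentPeriodUsageUsd).
type ReadKind = "aggregate" | "read-after-write";

function dbForUsageRead(kind: ReadKind): DbHandle {
  return kind === "read-after-write" ? db : readDb;
}
```

Making the read kind an explicit argument keeps the exception visible at the call site, which is the same purpose the inline comment in kilo-pass-router.ts serves.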

Phased plan to reduce connection-pool saturation: instrument hot
queries, enforce scoped statement timeouts, bound unbounded reads
with period selectors, and add covering indexes on the fact table.
Rollups deferred until measurement justifies them.
@jrf0110
Contributor

jrf0110 commented Mar 15, 2026

Another thing we could consider - the abuse service currently tracks all usage in Cloudflare's Analytics Engine (backed by ClickHouse). It has no problem with this load. We're already dual-writing usage there. Perhaps we can offload the aggregate usage stats to that dataset.

…aggregations to replica

Profile usage, user/org autocomplete, Kilo Pass billing reads, org summary,
and org usage detail queries are all pure SELECT/aggregation with no
read-then-write dependencies. Routing them to readDb alongside the admin
queries removes significantly more load from the primary's connection pool.

Excludes getAIAdoptionTimeseries (not a microdollar_usage hotspot).
…or microdollar_usage reads

Implement Phase 1 of the db-perf plan:

- Add timedUsageQuery() helper that wraps usage queries in a transaction
  with SET LOCAL statement_timeout (5s interactive, 20s admin) and
  structured JSON timing logs (route, label, scope, period, duration, rows)
- Route all read-only microdollar_usage aggregation queries to readDb
  (replica) to reduce primary connection pool saturation
- Kilo Pass getCurrentPeriodUsageUsd stays on primary (db) since it
  drives the subscription-state response shown immediately after writes
@jeanduplessis jeanduplessis changed the title from "docs(perf): add microdollar_usage query performance plan" to "perf(db): add query timing, statement timeouts, and replica routing for usage reads" Mar 15, 2026
@jeanduplessis jeanduplessis marked this pull request as ready for review March 16, 2026 07:21
@kilo-code-bot
Contributor

kilo-code-bot bot commented Mar 16, 2026

Code Review Summary

Status: No Issues Found | Recommendation: Merge

Files Reviewed (10 files)
  • plans/db-perf-improvements.md
  • src/app/admin/api/abuse/daily-stats/route.ts
  • src/app/admin/api/abuse/hourly-stats/route.ts
  • src/app/admin/api/abuse/stats/route.ts
  • src/app/api/profile/usage/route.ts
  • src/lib/usage-query.ts
  • src/routers/kilo-pass-router.ts
  • src/routers/organizations/organization-router.ts
  • src/routers/organizations/organization-usage-details-router.ts
  • src/routers/user-router.ts

Reviewed by gpt-5.4-20260305 · 805,972 tokens

@jeanduplessis jeanduplessis requested review from RSO and iscekic March 16, 2026 08:34
@jeanduplessis jeanduplessis merged commit d9a0dc7 into main Mar 16, 2026
18 checks passed
@jeanduplessis jeanduplessis deleted the db-perf branch March 16, 2026 14:26
3 participants