Skip to content

feat(tables): background jobs (delete/export/backfill on trigger.dev) + tenant-scoped query performance#4915

Open
TheodoreSpeaks wants to merge 31 commits into
stagingfrom
improvement/table-row-deletes
Open

feat(tables): background jobs (delete/export/backfill on trigger.dev) + tenant-scoped query performance#4915
TheodoreSpeaks wants to merge 31 commits into
stagingfrom
improvement/table-row-deletes

Conversation

@TheodoreSpeaks

@TheodoreSpeaks TheodoreSpeaks commented Jun 9, 2026

Copy link
Copy Markdown
Collaborator

Summary

Background jobs, async exports, and a systemic performance pass for Tables.

Background row deletes + table_jobs

  • "Select all" deletes (filter + optional deselections) send { filter, excludeRowIds } and delete in keyset-paginated background batches; rows inserted mid-job are spared via a created_at cutoff.
  • New table_jobs table (one row per job, type = import | delete | export | backfill). Concurrency gate: partial-unique index on (table_id) WHERE status='running' AND type <> 'export' — one writing job per table; read-only exports run concurrently.
  • Read-path mask: while a delete job runs, every read (grid, exports, search) hides the doomed rows, so mid-job reads match the eventual result.

Jobs on trigger.dev

  • Import, delete, export, and output-column backfill run as trigger.dev tasks (apps/sim/background/) with an in-process runDetached fallback — jobs survive deploys/restarts.
  • Backfill is hybrid: ≤500 completed runs inline (response returns consistent), above that a background job.

CSV export

  • Hybrid by size: ≤10k rows keep the synchronous stream; larger tables run an async export job → storage → presigned download (auto-downloads for the initiating session).
  • Unified jobs tray (imports + exports) with aggregate state icon, live progress, cancel, and re-download.

Performance (measured on a 1M-row table in a 12M-row shared relation)

Path Before After
1M-row delete ~25 min seconds/page — (table_id, id) index (0231)
1M-row export ~30 min ~1 min — keyset pagination (OFFSET was O(N²))
Grid infinite scroll / drains OFFSET per page (order_key, id) cursor seek
Filtered count (ILIKE) 11–12.7s 0.6s
Sorted view, per page 9.7s + disk spills 0.76s
Cmd+F search 75s 2s
Equality filters / $in 326ms–1.1s 17ms — tenant-scoped GIN (0232)
Unique-constraint check (every write) 3.5s <1s

Root cause for most of these: jsonb predicates are unestimatable, so the planner picked parallel seq scans over the entire shared relation (every tenant's rows). Two mechanisms fix it:

  • withSeqscanOff (lib/table/planner.ts): transaction-scoped SET LOCAL enable_seqscan = off around the queries with unindexable predicates (ILIKE/range/lower()/sort keys).
  • Migration 0232: replaces the cross-tenant GIN on data with btree_gin (table_id, data jsonb_path_ops) — containment lookups resolve tenant-scoped inside the index (and the index is smaller, 529MB vs 694MB on dev).
  • Autovacuum tuned to 2% churn on the shared relation (0231) so one tenant's mass delete doesn't degrade reads for days.

Fit & finish

  • Error toasts surface the real failure (drizzle wraps DB errors; routes now classify on the root cause) — e.g. "Row limit exceeded — this table is capped at 10,000 rows" instead of "Failed to insert row".
  • Export tray icon appears immediately on kickoff; tray restored on the tables list page.
  • Import batches 5k rows (bind-param bounded), delete pages 10k.

Migrations

  • 0231_table_jobs_and_keyset: table_jobs + import-state data migration + (table_id, id) index + autovacuum tuning + import_* column drops.
  • 0232_tenant_scoped_data_gin: btree_gin extension + tenant-scoped containment index (replaces user_table_rows_data_gin_idx).

Test plan

  • bunx vitest run lib/table app/api/table app/api/v1 hooks/queries (395 tests)
  • bun run lint:check, tsc --noEmit, bun run check:api-validation
  • Manual: 1M-row table — select-all delete, async export + download, filtered/sorted scrolls, Cmd+F, unique-column inserts; trigger.dev path with TRIGGER_DEV_ENABLED=true; EXPLAIN ANALYZE before/after for each perf claim.

🤖 Generated with Claude Code

@vercel

vercel Bot commented Jun 9, 2026

Copy link
Copy Markdown

The latest updates on your projects. Learn more about Vercel for GitHub.

1 Skipped Deployment
Project Deployment Actions Updated (UTC)
docs Skipped Skipped Jun 10, 2026 7:15am

Request Review

@cursor

cursor Bot commented Jun 9, 2026

Copy link
Copy Markdown

PR Summary

High Risk
Touches row deletion, job concurrency, and read-path masking for in-flight deletes; large surface area across API, workers, cron, and client cache/SSE behavior where regressions could lose data or leave ghost running jobs.

Overview
Replaces per-table import_* state with a unified table_jobs model (jobStatus, jobId, jobType, …) and wires background import, delete, export, and backfill through shared claim/release APIs, optional Trigger.dev tasks, and a generic /job/cancel route.

Async capabilities: new delete-async (filter + excludeRowIds, created_at cutoff, single write-job slot) and export-async + presigned export/download; cron now fails stale running table jobs and prunes old terminal jobs while deleting orphaned export files from storage.

API & UX: row listing gains after keyset cursors (offset fallback when sorted); workflow run/stop and bulk delete accept filters with upfront validation via tableFilterError. Row/column routes centralize client-facing errors with rootErrorMessage / rowWriteErrorResponse (including friendly row-limit toasts).

Grid & queries: filtered select-all deletes kick off async jobs with optimistic row stripping; selection keeps exclusion sets; action-bar counts and Cmd+A use filter-scoped totals; large exports auto-route async; SSE handles job events (exports auto-download when initiated this session); tray shows imports + exports.

Reviewed by Cursor Bugbot for commit 4954807. Bugbot is set up for automated code reviews on this repo. Configure here.

Comment thread apps/sim/hooks/queries/tables.ts
@greptile-apps

greptile-apps Bot commented Jun 9, 2026

Copy link
Copy Markdown
Contributor

Greptile Summary

This PR adds background job infrastructure for Tables: async filtered row deletes, async CSV exports, and output-column backfill — all running on Trigger.dev (with an in-process runDetached fallback). It also ships a systemic performance pass: keyset/cursor pagination replaces OFFSET, withSeqscanOff keeps JSONB-predicate queries tenant-bounded, and a new tenant-scoped btree_gin composite index replaces the cross-tenant GIN. The table_jobs table replaces import-state columns on user_table_definitions and acts as the concurrency gate for all write jobs via a partial-unique index.

  • Background jobs: delete, export, and backfill workers are keyset-paginated, ownership-gated per page (stale/canceled workers self-stop), and surfaced via SSE; a created_at cutoff in delete jobs spares rows inserted while the job runs, and pendingDeleteMask applies the same predicate to all mid-job reads.
  • Performance: cursor pagination replaces OFFSET for the default-order grid, withSeqscanOff transactions replace seq scans on JSONB predicates (11–75s → 0.6–2s on measured paths), and the new composite GIN trims both index size and scan candidates.
  • Infrastructure: two new migrations (table_jobs + keyset index, tenant-scoped GIN), stale-job janitor extended to cover all job types, and a unified import/export tray in the UI.

Confidence Score: 4/5

Safe to merge with one targeted fix: the Trigger.dev dispatch path in both delete-async and import-async can leave a ghost running job that blocks the table's write slot for up to ~15 minutes on any transient Trigger.dev error.

The ghost-job issue in the Trigger.dev dispatch path is a concrete, observable failure: after a successful markTableJobRunning, if tasks.trigger throws, the table's one-write-job-per-table slot stays occupied until the stale-job janitor runs, blocking all subsequent imports and deletes on that table. Adding a try/catch that calls releaseJobClaim before re-throwing is a small, targeted fix. Everything else — the keyset pagination, ownership-gated workers, planner hints, migration data migration — looks well-considered and tested.

apps/sim/app/api/table/[tableId]/delete-async/route.ts and apps/sim/app/api/table/[tableId]/import-async/route.ts — the Trigger.dev dispatch section in each needs a try/catch to release the job claim on trigger failure.

Important Files Changed

Filename Overview
apps/sim/app/api/table/[tableId]/delete-async/route.ts New route: kicks off background delete jobs. Has a ghost-job risk when the Trigger.dev dispatch fails after the job row has been created — blocks the table's write slot until the stale-job janitor fires.
apps/sim/app/api/table/[tableId]/import-async/route.ts Modified to use markTableJobRunning and optionally dispatch via Trigger.dev — has the same ghost-job risk on trigger dispatch failure as delete-async.
apps/sim/lib/table/service.ts Major refactor: import-state columns removed, all job state moved to table_jobs; adds keyset pagination, pendingDeleteMask, seqscan-off planner hooks, and new background-job service functions. Logic appears sound.
apps/sim/lib/table/delete-runner.ts New background worker for large filtered row deletes; keyset-paginated, ownership-gated, retry-safe. Logic looks correct.
apps/sim/lib/table/export-runner.ts New async export worker; keyset-paginated, ownership-gated, uploads to storage. pendingDeleteMask is re-queried per batch — 200+ extra round-trips for 1M-row tables.
apps/sim/lib/table/planner.ts New helper: wraps a DB transaction with SET LOCAL enable_seqscan = off to force tenant-scoped index plans for JSONB predicates. Simple and well-documented.
packages/db/migrations/0231_table_jobs_and_keyset.sql Creates table_jobs, migrates import state from user_table_definitions, adds keyset index, tunes autovacuum. Data migration logic is sound.
packages/db/migrations/0232_tenant_scoped_data_gin.sql Drops cross-tenant GIN index and creates tenant-scoped btree_gin index. Non-concurrent drop+create is blocking on large tables — acceptable as a one-time migration but worth noting for deployment planning.
apps/sim/hooks/queries/tables.ts Adds cursor keyset pagination, useDeleteTableRowsAsync with exact-key optimistic update, export job polling hooks.
apps/sim/app/api/cron/cleanup-stale-executions/route.ts Updated janitor: marks stale running table_jobs as failed and prunes terminal jobs older than 24h; also deletes orphaned export files. Logic is correct.

Sequence Diagram

sequenceDiagram
    participant Client
    participant Route as delete-async / export-async
    participant DB as table_jobs (DB)
    participant TDev as Trigger.dev
    participant Worker as delete-runner / export-runner
    participant SSE as Table SSE stream

    Client->>Route: POST /delete-async (filter, excludeRowIds)
    Route->>DB: markTableJobRunning (INSERT, partial-unique gate)
    DB-->>Route: "claimed=true / false (409 on false)"
    Route->>TDev: tasks.trigger(payload)
    Note over Route,TDev: If trigger throws here, job row stays running (write slot locked ~15 min)
    TDev->>Worker: run(payload)
    loop Each keyset page
        Worker->>DB: updateJobProgress (ownership check)
        DB-->>Worker: "owns=true / false (stop if false)"
        Worker->>DB: deletePageByIds / uploadFile
        Worker->>SSE: appendTableEvent(progress)
    end
    Worker->>DB: markJobReady / markJobFailed
    Worker->>SSE: appendTableEvent(ready/failed)
    SSE-->>Client: job terminal event (invalidate queries)
Loading

Reviews (4): Last reviewed commit: "Merge remote-tracking branch 'origin/sta..." | Re-trigger Greptile

Comment on lines +815 to +817
const excludeRowIds = rowSel.excluded ? [...rowSel.excluded] : []
const estimatedCount = Math.max(0, tableRowCountRef.current - excludeRowIds.length)
onRequestDeleteAllByFilter({ excludeRowIds, estimatedCount })

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 The estimatedCount uses tableRowCountRef.current, which is the TOTAL table row count (tableData.rowCount), not the count matching the active filter. When a filter is active, the confirmation dialog will say "Delete 10,000 rows matching the current filter" even though only a small subset of those rows match the filter and will actually be deleted. This could alarm users or make them distrust the UI. The filtered count should come from the rows query's metadata (or at minimum the dialog should not show a count when a filter is active).

Suggested change
const excludeRowIds = rowSel.excluded ? [...rowSel.excluded] : []
const estimatedCount = Math.max(0, tableRowCountRef.current - excludeRowIds.length)
onRequestDeleteAllByFilter({ excludeRowIds, estimatedCount })
const excludeRowIds = rowSel.excluded ? [...rowSel.excluded] : []
// When a filter is active, tableRowCountRef.current is the *total* table count, not the
// filtered count — pass undefined so the dialog avoids showing a misleading number.
const estimatedCount = queryOptions?.filter
? undefined
: Math.max(0, tableRowCountRef.current - excludeRowIds.length)
onRequestDeleteAllByFilter({ excludeRowIds, estimatedCount })

Comment thread apps/sim/hooks/queries/tables.ts Outdated
Comment on lines +939 to +956
await queryClient.cancelQueries({ queryKey: tableKeys.rowsRoot(tableId) })
const previousQueries = queryClient.getQueriesData<InfiniteData<TableRowsResponse, number>>({
queryKey: tableKeys.rowsRoot(tableId),
})
const previousDetail = queryClient.getQueryData<TableDefinition>(tableKeys.detail(tableId))
const keep = new Set(excludeRowIds ?? [])
queryClient.setQueriesData<InfiniteData<TableRowsResponse, number>>(
{ queryKey: tableKeys.rowsRoot(tableId), exact: false },
(old) =>
old
? {
...old,
pages: old.pages.map((page) => ({
...page,
rows: page.rows.filter((r) => keep.has(r.id)),
})),
}
: old

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Optimistic update over-clears row caches

setQueriesData with exact: false updates every cached row query for this table, not just the one that matches the current filter. If the user previously loaded the table without a filter (or with a different filter), those cached pages are also cleared. Since onSuccess only invalidates tableKeys.lists() (not row queries), the stale no-op rows stay empty until the SSE terminal event fires — which may be minutes later for a large delete. The fix is to restrict the optimistic update to the exact query key that matches the current filter.

…ed optimistic clear, Cmd+A select-all, hide delete from tray)
…row-deletes

# Conflicts:
#	apps/sim/app/workspace/[workspaceId]/tables/[tableId]/table.tsx
#	apps/sim/app/workspace/[workspaceId]/tables/components/import-csv-dialog/import-csv-dialog.tsx
#	apps/sim/app/workspace/[workspaceId]/tables/tables.tsx
#	apps/sim/lib/table/import-runner.ts
#	apps/sim/lib/table/service.ts
#	packages/db/migrations/meta/0227_snapshot.json
#	packages/db/migrations/meta/_journal.json
Comment thread apps/sim/lib/table/service.ts
Comment thread apps/sim/lib/api/contracts/tables.ts Outdated
Comment thread apps/sim/lib/table/dispatcher.ts Outdated
Comment thread apps/sim/app/api/cron/cleanup-stale-executions/route.ts
Comment thread apps/sim/hooks/queries/tables.ts
Extends the seqscan fix to every remaining jsonb-predicate path, all
measured on a 1M-row table in a 12M-row shared relation:
- sorted page query (ORDER BY data->>'col'): 9.7s/page -> 0.76s, and deep
  pages stop spilling ~130MB sorts to disk
- updateRowsByFilter / deleteRowsByFilter row selection: 14.4s -> bounded
- delete-job worker selectRowIdPage with a filter: 12.6s/page -> bounded
- dispatcher filtered-scope window walk: same shape, same fix

Shared withSeqscanOff helper moves to lib/table/planner.ts (service +
dispatcher both consume it; dispatcher can't import service).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The plain GIN on user_table_rows.data matched @> candidates across every
tenant sharing the relation — a hot value in someone else's table
inflated everyone's equality filters (1.07M candidates fetched for a
33k-row match, lossy bitmap, 1.1s). Replace it with btree_gin
(table_id, data jsonb_path_ops): the tenant intersection happens inside
the index and paths are single hashed entries. Rare-equality probe
326ms -> 17ms with zero wasted candidates; unique-constraint checks and
upsert conflict lookups ride the same index. The new index is smaller
than the one it replaces (529MB vs 694MB on the 12M-row dev relation).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…write)

The unique check runs lower(data->>'col') = $1 LIMIT 1 on every insert
and cell edit touching a unique column. The predicate is unestimatable
and a unique (non-conflicting) value never exits early, so the planner
seq-scanned all 12.3M shared-relation rows per check — 3.5s measured.
Tenant-bound both the single and batch variants; the batch path sets the
flag on the caller's transaction when one is supplied (SET LOCAL dies at
its commit, and the statements that follow are tenant-scoped writes).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Same unestimatable data->>key predicate as the unique checks; an
insert-path upsert has no existing match so the lookup can't exit early
and seq-scans the whole shared relation. The upsert already runs in a
transaction — set the planner flag on it.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment thread apps/sim/lib/table/service.ts
service.ts kept a private DbTransaction alias and two inline
typeof db | DbTransaction unions after planner.ts began exporting the
canonical DbTransaction/DbExecutor — import those instead. From the
/simplify review of the perf series; no behavior change.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 4 total unresolved issues (including 3 from previous reviews).

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit d339c0e. Configure here.

Comment thread apps/sim/lib/table/workflow-columns.ts
@TheodoreSpeaks TheodoreSpeaks changed the title feat(tables): paginated background row-delete jobs via table_jobs feat(tables): background jobs (delete/export/backfill on trigger.dev) + tenant-scoped query performance Jun 10, 2026
@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

@greptile review

The local vi.mock('@sim/db/schema') stubbed only document/knowledgeBase,
but the file's import graph reaches lib/table/service whose module scope
now references tableJobs. The global schema mock already covers all of
it — rely on it per the testing rules instead of re-mocking.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Addresses the open Bugbot/Greptile findings on filtered select-all:
- Filtered runs no longer cancel the whole table: cancelWorkflowGroupRuns
  takes a filter — it stops only dispatches with that exact filter scope
  and only in-flight cells on matching rows (semi-join); whole-table and
  differently-scoped dispatches keep running, their cancelled cells
  skipped via cancelledAt > requestedAt.
- Stop on a filtered select-all sends the filter through cancel-runs
  (contract + route + mutation) instead of a table-wide cancel.
- runColumnBodySchema rejects rowIds + filter together (mirrors
  deleteTableRowsBodySchema).
- Select-all delete clears the selection in onSuccess, not at click, so
  a failed kickoff restores both rows and selection.
- Clipboard copy/cut estimates use the filter-aware total (rowTotal)
  instead of the whole-table rowCount.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@TheodoreSpeaks

Copy link
Copy Markdown
Collaborator Author

@greptile review

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…row-deletes

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Comment on lines +67 to +97
// paths can mask the doomed rows while the job runs (see `pendingDeleteMask`).
const jobId = generateId()
const payload: TableDeleteJobPayload = { filter, excludeRowIds, cutoff: cutoff.toISOString() }
const claimed = await markTableJobRunning(tableId, jobId, 'delete', payload)
if (!claimed) {
return NextResponse.json(
{ error: 'A job is already in progress for this table' },
{ status: 409 }
)
}

if (isTriggerDevEnabled) {
// Trigger.dev runs the delete outside the web container (survives deploys) and retries —
// safe: the keyset + cutoff walk just deletes whatever remains.
const [{ tableDeleteTask }, { tasks }] = await Promise.all([
import('@/background/table-delete'),
import('@trigger.dev/sdk'),
])
await tasks.trigger<typeof tableDeleteTask>(
'table-delete',
{ jobId, tableId, workspaceId, filter, excludeRowIds, cutoff: cutoff.toISOString() },
{ tags: [`tableId:${tableId}`, `jobId:${jobId}`] }
)
} else {
runDetached('table-delete', () =>
runTableDelete({
jobId,
tableId,
workspaceId,
filter,
excludeRowIds,

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Ghost running job on trigger dispatch failure

If tasks.trigger (or its dynamic imports) throws — e.g., the Trigger.dev service is temporarily unreachable — the handler exits with a 500, but the table_jobs row created by markTableJobRunning remains with status='running'. This ghost job holds the table's one-write-job-per-table slot (the partial-unique index) until the stale-job janitor fires, which can take up to STALE_THRESHOLD_MINUTES (~15–20 min). During that window every subsequent import and delete for this table returns 409.

Wrap the dispatch block in a try/catch and release the claim before re-throwing:

try {
  await tasks.trigger<typeof tableDeleteTask>(...)
} catch (err) {
  await releaseJobClaim(tableId, jobId).catch(() => {})
  throw err
}

The same gap exists in apps/sim/app/api/table/[tableId]/import-async/route.ts for the Trigger.dev path added in this PR.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
If tasks.trigger (or its dynamic imports) throws after
markTableJobRunning, the ghost running row held the table's
one-write-job slot until the stale-job janitor fired (~15-20 min of
409s). All four kickoff routes now release the claim and rethrow; the
backfill runner releases and warns (a failed backfill never fails the
schema change). Greptile P1.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant