perf: one-transaction flush + async indexing write path, read-path fixes#28675
Conversation
Flip the packaged deployment default from none to redis. CacheBundle already falls back to NoopCacheProvider when Redis init fails, so deployments without Redis keep booting (degraded). CacheConfig.java's programmatic default stays none, so the integration-test default is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Several repositories override setFields but not setFieldsInBulk, so entity-specific fields were silently null on the list path (0 queries — a correctness bug, not an N+1). Add the bulk path for each and batch the residual per-item fetches: - DirectoryRepository: service/parent/numberOfFiles/numberOfSubDirectories/totalSize on list; batch child sizes via findEntitiesByIds (NON_DELETED). - SpreadsheetRepository: service/directory/worksheets on list; batch getWorksheets. - ServiceEntityRepository: pipelines on list for all 13 *Service types. - ContextMemoryRepository: primaryEntity/relatedEntities on list. - DatabaseRepository: drop redundant per-table setFieldsInternal re-scan in recursive CSV export (columns/tags already bulk-populated). - EntityRepository: route single-entity applyTags through the batched variant. Validated by integration tests (postgres-elasticsearch): Directory 251, Spreadsheet 165, DatabaseService 39, ContextMemory 132, Database 197, Table tag-batch — all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fernet rebuilt the key list (split on ",") and constructed a new SecureRandom on every encrypt/decrypt call — repeated per password field per service write/read. Hoist the parsed List<Key> and a single SecureRandom to init-once fields, rebuilt only on setFernetKey (rotation). Ciphertext format and rotation behavior unchanged. Validated: FernetKeyCacheTest — 3 tests, 0 failures (round-trip, multi-field key reuse, rotation rebuild). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…list (T1) KnowledgePageRepository overrode setFields but not setFieldsInBulk, so the hierarchy listing (GET /v1/contextCenter/pages, /hierarchy) returned every page with parent=null — the Knowledge Center tree rendered flat. Add setFieldsInBulk batching parent (CONTAINS container), relatedEntities, and editors. Tested via KnowledgePageHierarchyIT (not BaseEntityIT — KnowledgePage is parent-scoped and does not fit the generic CRUD harness): creates parent + child pages and asserts the child's parent populates on the list path and nests under the parent in the hierarchy. 2 tests, 0 failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…post-commit (Stage 1) The create/update/patch flush committed 5-7 times via per-DAO autocommit because the @transaction annotations on EntityRepository are decorative (the repo is new-instantiated, not a JDBI SqlObject proxy), so a mid-create failure also left orphan relationship/tag rows. Wrap each flush body in Entity.getJdbi().inTransaction(...) + DeadlockRetry (mirroring the proven cleanup() idiom): 5-7 commits -> 1, and the unit is now all-or-nothing. To keep the held DB connection free of any network round trip, every external side-effect that ran inside the flush is deferred to run post-commit via thread-local collectors: - cache L2 invalidation (addRelationship/deleteRelationship/rename-cascade) recorded, drained post-commit; Guava L1 stays inline. Cache write-through stays SYNC post-commit so GET-by-id/by-name read-your-write is preserved. - tag/entity RDF SPARQL deferred (RdfTagUpdater) and drained post-commit. - domain/data-product lineage-ES and rename-cascade ES reindexes (Glossary/GlossaryTerm/Tag/ Classification/Container/Domain/DataProduct) deferred via SearchRepository, durable retry via SearchIndexRetryQueue on a failed drain. The Redis gate is removed, so the wrap is always engaged regardless of cache provider. Deadlock-replay safety: EntityUpdater base state is snapshotted/restored per attempt; the 6 cascade-bearing subclass updaters reset run-once guards via resetForRetryAttempt; the deferral collectors clear+reopen per attempt so a rolled-back attempt enqueues nothing. In Stage 1 the deferred externals still drain SYNC post-commit (off the held connection); moving them off the request thread is Stage 2. Validated by integration tests on postgres-elasticsearch, cache=none AND cache=redis: OneTransactionFlushAtomicityIT + 1693 cache=none and 683 cache=redis cascade/rename tests. Known follow-up: descendant by-id cache under a deadlock-replayed rename + Redis (narrow, self-healing; non-deadlock real path verified under Redis) — that replay-test is cache=none. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Keep Redis opt-in (CACHE_PROVIDER=redis) rather than the packaged default, so the default deployment and the standard test suite run without an external Redis dependency and continue to work. The Stage-1 one-transaction flush is independent of the cache provider (the Redis gate was removed), so this does not affect it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…read (Stage 2)
Stage 1 committed the write flush in one transaction and deferred the external
side-effects (ES index, RDF, lineage-ES, rename-cascade reindex) to drain
synchronously post-commit. Stage 2 moves those drains OFF the request thread so the
write API returns right after the DB commit + sync cache write-through, never
blocking on ES/RDF/SPARQL.
- OrderedLaneExecutor: entity-id-striped single-consumer lanes (virtual threads),
bounded queue. All async work for one entity id serializes on its own lane in
submission order, so create->update->update (and a postCreate that re-updates the
same doc) index in order. Bulk create slices per-entity onto each member's own lane.
- SearchIndexHandler.isAsync() -> true; the post-commit RDF/lineage/rename-cascade
drains submit onto the same lanes keyed by entity id.
- Request thread never blocks: a full lane sheds the task to the durable outbox
(SearchIndexRetryQueue, a fast local-DB upsert) rather than blocking or running the
slow external inline. Every async failure (index/update/delete/soft-delete, lane
Throwable, overflow, shutdown-drop) is recoverable via the outbox.
- Cache write-through + L2 invalidation stay SYNC post-commit, so GET /entity/{id}
and /name/{fqn} remain immediately read-your-write. Only /search and SPARQL become
eventually consistent.
- Graceful shutdown drains in-flight and flushes queued tasks to the outbox.
Honest residual: under pathological lane overflow/hard-stop, the outbox reindex
rebuilds an entity's own doc + lineage from committed DB, but does not replay RDF
triples or rename-cascade DESCENDANT reindexes -- those overflow recoveries are
best-effort (the normal path self-enqueues per-closure retries).
Validated on postgres-elasticsearch (cache=none): AsyncSearchIndexConsistencyIT
(async visibility + same-entity ordering convergence) 2/2; OneTransactionFlushAtomicityIT
284, GlossaryTermResourceIT 170, TableResourceIT -- all green (atomicity + rename
cascades + read-your-write intact under async). Unit: OrderedLaneExecutorTest,
EntityLifecycleEventDispatcherTest, SearchIndexHandlerTest.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ttribution WritePathBenchmarkIT (inert unless -Dbenchmark.run=true) measures single-thread create/update p50/p99/mean against a fsync-on Postgres (-DdbDurable=true). SDK-only so it runs unchanged against an older service jar for before/after comparison. Each update is attributed (GC pause vs DB checkpoint/commit vs OM-server processing) via a raw-commit probe + GarbageCollectorMXBean + pg checkpoint count. Adds a dbDurable flag to TestSuiteBootstrap (default off). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…optimization # Conflicts: # openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java
…lidation count Addresses two gitar-bot review findings on PR #28675: 1. (Bug) With SearchIndexHandler.isAsync()=true, the ordered lane captured the live EntityInterface by reference. After dispatch the request thread keeps mutating that same instance (REST PII masking, clearFields stripping for the HTTP response, secret masking on service connections), so the lane could index a masked/stripped/partial document. executeHandler(EntityInterface,...) now takes a Consumer<EntityInterface> and, on the async path, deep-copies the committed state via JsonUtils before submitting it to the lane; the sync path binds the live entity. Per-entity lane ordering does not protect against this because the race is on the in-memory POJO, not on lane submission order. 2. (Quality) invalidateCacheForTaggedEntities(String) always returned 0 because deferOrRunSearchBackedInvalidation discarded the inline count. The defer hook now takes an IntSupplier so the inline (non-flush) path returns the live invalidated count; the deferred path still returns 0 (count is only known post-commit and is logged in searchTaggedEntitiesAndInvalidate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
Thanks for the review — both findings are addressed in 18d2d92. Finding 1 (Bug — async indexer captures a mutable entity) Fix: EntityInterface snapshot =
JsonUtils.readValue(JsonUtils.pojoToJson(entity), entity.getClass());
orderedLaneExecutor.submit(
entity.getId(), laneTask(entity, operation, () -> handlerCall.accept(snapshot)));The sync branch still binds the live entity (no copy cost when nothing is async). The Finding 2 (Quality — } else {
result = invalidation.getAsInt();
}The deferred path still returns
|
🟡 Playwright Results — all passed (16 flaky)✅ 4264 passed · ❌ 0 failed · 🟡 16 flaky · ⏭️ 88 skipped
🟡 16 flaky test(s) (passed on retry)
How to debug locally# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip # view trace |
…umn/async search ITs
ColumnInDataAssetTests and AsyncSearchIndexConsistencyIT searched a shared alias
(dataAsset / table_search_index) with query("*") + size(50), then filtered the
50-doc page client-side for the test's own entity. Those aliases span every
tableColumn / table doc in the cluster, so once the full IT suite has indexed
thousands of them the arbitrary page almost never contains this test's entity:
ColumnInDataAssetTests failed deterministically and AsyncSearchIndexConsistencyIT
flaked. In isolation the entity is always in the page, so both passed alone.
The production code is correct. A standalone reproduction confirmed columns keep
the borrowed dataAsset parent alias through a recreate-all reindex, and a load
reproduction (60 noise tables / ~120 columns) showed the match-all page misses the
target while a name/FQN-targeted query finds it. Fix the queries to search FOR the
specific entity (column name / table FQN) so the target ranks first and a single
page reliably contains it. No production change.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Brings in #28676 (policy-aware updateDisplayName + updatingBotDeniedOperation) and refreshes the branch base before addressing PR review findings.
…le no-op Resolves the two blocking findings from the PR #28675 review. 1. updateOwners no longer reverts owners on every bot PUT. The previous guard reverted owners whenever a bot issued a PUT and an owner already existed, silently breaking ingestion ownership re-sync across all entity types. It now keys on the policy-aware updatingBotDeniedOperation(EDIT_OWNERS) check that #28676 introduced for updateDisplayName: a bot the policy DENIES EditOwners still cannot clobber user-curated owners, but the ingestion/default bots (which only carry a DisplayName-Deny, not an Owner-Deny) sync owners again. overrideMetadata=true still force-syncs. 2. bulkDeleteStaleEntities treats an empty seenFqns as a zero-deletion no-op instead of "every entity under the scope is stale". An empty seen-set cannot be distinguished from a connector run that crashed or discovered nothing, so interpreting it as fully-stale would silently delete (and with hardDelete permanently destroy) every entity under a service/database. Mirrors the existing scope-not-found no-op rather than rejecting with 400, keeping the well-formed-but-ambiguous request safe. Tests: - BulkDeleteStaleIT.test_emptySeenFqns_withPopulatedScope_deletesNothing - a populated scope with empty seenFqns deletes nothing (wipes 3 tables without the guard). - BaseEntityIT.test_singleEntityPut_bot_updatesOwnersWhenPolicyAllows - a policy-allowed bot reassigns owners via PUT (reverts to the old owner without the fix). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… 403 test_unauthorizedUser_isForbidden minted a DataConsumer JWT but never created the user, relying on another test in the concurrent suite to seed it. In isolation the server rejects the token with 404 (user not found) instead of 403 (forbidden), and even in the suite the assertion is order-dependent. Ensure the user exists up front via UserTestFactory.getDataConsumer (the established pattern used by PermissionsResourceIT / AppOperationPermissionsIT) so the principal resolves and the DELETE authorization is what actually drives the 403. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ide ES contention A multi-agent review of the remaining intermittent CI failures confirmed the production code is correct: the main-entity index write runs inline in postCreate (outside the deferral scope) and self-enqueues to the retry outbox on failure, the post-commit deferral drain is exception-safe and thread-local-safe, and the column dataAsset alias is a static index alias. The AsyncSearchIndexConsistencyIT and ColumnInDataAssetTests timeouts only flake because, under the full ~14.5k-test suite hitting one shared Elasticsearch node, a live write can be 429'd and land via the retry worker slightly after the 60s window. Widen the poll window to 120s for just these search-visibility assertions; no production change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-text query (fixes OpenSearch 500) The postgres-opensearch CI job failed deterministically on AsyncSearchIndexConsistencyIT and ColumnInDataAssetTests. A prior change searched for the specific entity by free-text querying its (long, many-token) fullyQualifiedName / column name. On the multi-index dataAsset and table_search_index aliases that expands into >1024 boolean clauses and trips OpenSearch's default index.query.bool.max_clause_count (Elasticsearch's default is far higher), so the search returned HTTP 500 (query_shard_exception: maxClauseCount is set to 1024). Awaitility's ignoreExceptions() swallowed the 500 and the poll timed out — passing on Elasticsearch, failing on OpenSearch. Use match-all plus a term filter on a keyword field (fullyQualifiedName for the table, name.keyword for the column) instead: one clause, exact match, no clause explosion, and the target is found regardless of how many docs the suite has indexed (keeping the earlier top-50-window fix). Revert the 60s->120s poll-window bump, which was based on the wrong (env-overload) diagnosis; with the term filter the query succeeds immediately. Verified on OpenSearch (opensearchproject/opensearch:3.4.0): both classes pass in ~2-3s with zero maxClauseCount errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
✅ PR checks passedThe linked issue has a description and all required Shipping project fields set. Thanks! |
|
manerow
left a comment
There was a problem hiding this comment.
Looks good to me. The one-transaction wrap and deferring the external side effects out of the held connection reads cleanly. Approving.
Code Review ✅ Approved 7 resolved / 7 findingsOptimizes API latency by implementing single-transaction flushes and an ordered asynchronous indexing path for write operations. Resolved issues regarding stale entity snapshots, cache invalidation, and regression in bot owner updates. ✅ 7 resolved✅ Quality: invalidateCacheForTaggedEntities now always returns 0
✅ Bug: Async indexer may read entity POJO mutated after dispatch
✅ Bug: Unguarded snapshot serialization can fail API after commit
✅ Performance: Entity re-serialized once per async handler in dispatch loop
✅ Edge Case: Snapshot failure drops non-search async handlers silently
...and 2 more resolved from earlier reviews OptionsDisplay: compact → Showing less information. Comment with these commands to change:
Was this helpful? React with 👍 / 👎 | Gitar |



Describe your changes:
Fixes #28726
Optimizes the OpenMetadata read and write paths for lower API latency. Reads: several repositories overrode
setFieldsbut notsetFieldsInBulk, silently dropping list fields and forcing per-item queries — added bulk field-fetch paths (Directory, Spreadsheet, all*Service, KnowledgePage, ContextMemory), dropped a redundant CSV-export tag rescan, batched single-entity tag inserts, and cached Fernet keys. Writes — Stage 1: the create/update/patch flush committed 5–7 times via per-DAO autocommit (the@Transactionannotations were decorative no-ops) and left orphan rows on mid-flush failure; it is now wrapped in one realEntity.getJdbi().inTransaction(...)+DeadlockRetry, with every external side-effect (cache-L2 invalidation, RDF/SPARQL, lineage-ES, rename-cascade reindex) deferred so the held DB connection makes no network round trip. Writes — Stage 2: those externals + the ES entity index now run asynchronously off the request thread on a new per-entity-orderedOrderedLaneExecutor(durable via the search retry outbox), while cache write-through stays synchronous post-commit soGET /{id}and/name/{fqn}remain read-your-write — only/searchand SPARQL become eventually consistent.Type of change:
High-level design:
EntityRepositorycreate/update/patch flushes ininTransaction+ deadlock-retry (mirroring the existingcleanup()idiom), collapsing 5–7 commits → 1 and making writes atomic. In-flush external calls are captured into thread-local collectors and drained post-commit;EntityUpdaterstate is snapshotted/restored per retry attempt (incl. the 6 cascade-bearing subclass updaters).SearchIndexHandler.isAsync()=true) ontoOrderedLaneExecutor, keyed by entity id so single-consumer lanes preserve per-entity ordering. A full lane sheds toSearchIndexRetryQueue(a fast local-DB upsert) rather than blocking the request thread; bulk create slices per-entity onto each member's own lane.CACHE_PROVIDER=nonedefault).Tests:
Use cases covered
*Servicepipelines, Knowledge Center hierarchy, Context memories).Backend integration tests
openmetadata-integration-tests/.OneTransactionFlushAtomicityIT,AsyncSearchIndexConsistencyIT,DirectoryResourceIT,SpreadsheetResourceIT,DatabaseServiceResourceIT,ContextMemoryIT,DatabaseResourceIT,TableResourceIT,KnowledgePageHierarchyIT,ColumnSearchIndexIT/TestCaseResourceIT(Awaitility polling for async),WritePathBenchmarkIT(inert perf harness).Unit tests
OrderedLaneExecutorTest,EntityLifecycleEventDispatcherTest,SearchIndexHandlerTest,FernetKeyCacheTest.Ingestion integration tests
Playwright (UI) tests
Manual testing performed
mvn verify -Ppostgres-elasticsearch, ~13.5k tests) oncache=noneandcache=redis; all green except pre-existing async-workflow flakiness.fsync=on) Postgres showed ~30–40% median write-latency improvement; the residual p99 tail is DB-commit/checkpoint + GC, not the OM write path.UI screen recording / screenshots:
Not applicable.
Checklist:
Fixes #<issue-number>above.