perf: one-transaction flush + async indexing write path, read-path fixes by harshach · Pull Request #28675 · open-metadata/OpenMetadata

harshach · 2026-06-03T19:05:48Z

Describe your changes:

Optimizes the OpenMetadata read and write paths for lower API latency. Reads: several repositories overrode setFields but not setFieldsInBulk, silently dropping list fields and forcing per-item queries — added bulk field-fetch paths (Directory, Spreadsheet, all *Service, KnowledgePage, ContextMemory), dropped a redundant CSV-export tag rescan, batched single-entity tag inserts, and cached Fernet keys. Writes — Stage 1: the create/update/patch flush committed 5–7 times via per-DAO autocommit (the @Transaction annotations were decorative no-ops) and left orphan rows on mid-flush failure; it is now wrapped in one real Entity.getJdbi().inTransaction(...) + DeadlockRetry, with every external side-effect (cache-L2 invalidation, RDF/SPARQL, lineage-ES, rename-cascade reindex) deferred so the held DB connection makes no network round trip. Writes — Stage 2: those externals + the ES entity index now run asynchronously off the request thread on a new per-entity-ordered OrderedLaneExecutor (durable via the search retry outbox), while cache write-through stays synchronous post-commit so GET /{id} and /name/{fqn} remain read-your-write — only /search and SPARQL become eventually consistent.

Type of change:

Improvement

High-level design:

Stage 1 wraps EntityRepository create/update/patch flushes in inTransaction + deadlock-retry (mirroring the existing cleanup() idiom), collapsing 5–7 commits → 1 and making writes atomic. In-flush external calls are captured into thread-local collectors and drained post-commit; EntityUpdater state is snapshotted/restored per retry attempt (incl. the 6 cascade-bearing subclass updaters).
Stage 2 routes those drains plus the ES entity index (SearchIndexHandler.isAsync()=true) onto OrderedLaneExecutor, keyed by entity id so single-consumer lanes preserve per-entity ordering. A full lane sheds to SearchIndexRetryQueue (a fast local-DB upsert) rather than blocking the request thread; bulk create slices per-entity onto each member's own lane.
Consistency contract: GET-by-id/by-name stay real-time (DB + sync cache); search/RDF/lineage are eventually consistent with durable retry. Redis stays opt-in (CACHE_PROVIDER=none default).
Honest residual: under pathological lane overflow/hard-stop, the outbox rebuilds an entity's doc + lineage from committed DB, but RDF triples and rename-cascade descendant reindexes are best-effort.

Tests:

Use cases covered

Create/update/patch is atomic (mid-flush failure → zero orphan rows) and deadlock-replay-safe, incl. glossary-term rename and data-product domain-change cascades.
Write returns before the doc is searchable; doc converges in search; same-entity create→update converges to the final value (lane ordering).
Previously-dropped list fields now populate (Drive directories/spreadsheets, *Service pipelines, Knowledge Center hierarchy, Context memories).

Backend integration tests

Added/updated in openmetadata-integration-tests/.
OneTransactionFlushAtomicityIT, AsyncSearchIndexConsistencyIT, DirectoryResourceIT, SpreadsheetResourceIT, DatabaseServiceResourceIT, ContextMemoryIT, DatabaseResourceIT, TableResourceIT, KnowledgePageHierarchyIT, ColumnSearchIndexIT/TestCaseResourceIT (Awaitility polling for async), WritePathBenchmarkIT (inert perf harness).

Unit tests

OrderedLaneExecutorTest, EntityLifecycleEventDispatcherTest, SearchIndexHandlerTest, FernetKeyCacheTest.

Ingestion integration tests

Not applicable (no ingestion changes).

Playwright (UI) tests

Not applicable (no UI changes).

Manual testing performed

Ran targeted ITs + the full IT suite (mvn verify -Ppostgres-elasticsearch, ~13.5k tests) on cache=none and cache=redis; all green except pre-existing async-workflow flakiness.
Write-path benchmark on a durable (fsync=on) Postgres showed ~30–40% median write-latency improvement; the residual p99 tail is DB-commit/checkpoint + GC, not the OM write path.

UI screen recording / screenshots:

Not applicable.

Checklist:

I have read the CONTRIBUTING document.
My PR is linked to a GitHub issue via Fixes #<issue-number> above.
I have added tests (unit / integration) and listed them above.
For JSON Schema changes: N/A (no schema changes).

Flip the packaged deployment default from none to redis. CacheBundle already falls back to NoopCacheProvider when Redis init fails, so deployments without Redis keep booting (degraded). CacheConfig.java's programmatic default stays none, so the integration-test default is unchanged. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Several repositories override setFields but not setFieldsInBulk, so entity-specific fields were silently null on the list path (0 queries — a correctness bug, not an N+1). Add the bulk path for each and batch the residual per-item fetches: - DirectoryRepository: service/parent/numberOfFiles/numberOfSubDirectories/totalSize on list; batch child sizes via findEntitiesByIds (NON_DELETED). - SpreadsheetRepository: service/directory/worksheets on list; batch getWorksheets. - ServiceEntityRepository: pipelines on list for all 13 *Service types. - ContextMemoryRepository: primaryEntity/relatedEntities on list. - DatabaseRepository: drop redundant per-table setFieldsInternal re-scan in recursive CSV export (columns/tags already bulk-populated). - EntityRepository: route single-entity applyTags through the batched variant. Validated by integration tests (postgres-elasticsearch): Directory 251, Spreadsheet 165, DatabaseService 39, ContextMemory 132, Database 197, Table tag-batch — all green. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Fernet rebuilt the key list (split on ",") and constructed a new SecureRandom on every encrypt/decrypt call — repeated per password field per service write/read. Hoist the parsed List<Key> and a single SecureRandom to init-once fields, rebuilt only on setFernetKey (rotation). Ciphertext format and rotation behavior unchanged. Validated: FernetKeyCacheTest — 3 tests, 0 failures (round-trip, multi-field key reuse, rotation rebuild). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…list (T1) KnowledgePageRepository overrode setFields but not setFieldsInBulk, so the hierarchy listing (GET /v1/contextCenter/pages, /hierarchy) returned every page with parent=null — the Knowledge Center tree rendered flat. Add setFieldsInBulk batching parent (CONTAINS container), relatedEntities, and editors. Tested via KnowledgePageHierarchyIT (not BaseEntityIT — KnowledgePage is parent-scoped and does not fit the generic CRUD harness): creates parent + child pages and asserts the child's parent populates on the list path and nests under the parent in the hierarchy. 2 tests, 0 failures. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@transaction

…post-commit (Stage 1) The create/update/patch flush committed 5-7 times via per-DAO autocommit because the @transaction annotations on EntityRepository are decorative (the repo is new-instantiated, not a JDBI SqlObject proxy), so a mid-create failure also left orphan relationship/tag rows. Wrap each flush body in Entity.getJdbi().inTransaction(...) + DeadlockRetry (mirroring the proven cleanup() idiom): 5-7 commits -> 1, and the unit is now all-or-nothing. To keep the held DB connection free of any network round trip, every external side-effect that ran inside the flush is deferred to run post-commit via thread-local collectors: - cache L2 invalidation (addRelationship/deleteRelationship/rename-cascade) recorded, drained post-commit; Guava L1 stays inline. Cache write-through stays SYNC post-commit so GET-by-id/by-name read-your-write is preserved. - tag/entity RDF SPARQL deferred (RdfTagUpdater) and drained post-commit. - domain/data-product lineage-ES and rename-cascade ES reindexes (Glossary/GlossaryTerm/Tag/ Classification/Container/Domain/DataProduct) deferred via SearchRepository, durable retry via SearchIndexRetryQueue on a failed drain. The Redis gate is removed, so the wrap is always engaged regardless of cache provider. Deadlock-replay safety: EntityUpdater base state is snapshotted/restored per attempt; the 6 cascade-bearing subclass updaters reset run-once guards via resetForRetryAttempt; the deferral collectors clear+reopen per attempt so a rolled-back attempt enqueues nothing. In Stage 1 the deferred externals still drain SYNC post-commit (off the held connection); moving them off the request thread is Stage 2. Validated by integration tests on postgres-elasticsearch, cache=none AND cache=redis: OneTransactionFlushAtomicityIT + 1693 cache=none and 683 cache=redis cascade/rename tests. Known follow-up: descendant by-id cache under a deadlock-replayed rename + Redis (narrow, self-healing; non-deadlock real path verified under Redis) — that replay-test is cache=none. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Keep Redis opt-in (CACHE_PROVIDER=redis) rather than the packaged default, so the default deployment and the standard test suite run without an external Redis dependency and continue to work. The Stage-1 one-transaction flush is independent of the cache provider (the Redis gate was removed), so this does not affect it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…read (Stage 2) Stage 1 committed the write flush in one transaction and deferred the external side-effects (ES index, RDF, lineage-ES, rename-cascade reindex) to drain synchronously post-commit. Stage 2 moves those drains OFF the request thread so the write API returns right after the DB commit + sync cache write-through, never blocking on ES/RDF/SPARQL. - OrderedLaneExecutor: entity-id-striped single-consumer lanes (virtual threads), bounded queue. All async work for one entity id serializes on its own lane in submission order, so create->update->update (and a postCreate that re-updates the same doc) index in order. Bulk create slices per-entity onto each member's own lane. - SearchIndexHandler.isAsync() -> true; the post-commit RDF/lineage/rename-cascade drains submit onto the same lanes keyed by entity id. - Request thread never blocks: a full lane sheds the task to the durable outbox (SearchIndexRetryQueue, a fast local-DB upsert) rather than blocking or running the slow external inline. Every async failure (index/update/delete/soft-delete, lane Throwable, overflow, shutdown-drop) is recoverable via the outbox. - Cache write-through + L2 invalidation stay SYNC post-commit, so GET /entity/{id} and /name/{fqn} remain immediately read-your-write. Only /search and SPARQL become eventually consistent. - Graceful shutdown drains in-flight and flushes queued tasks to the outbox. Honest residual: under pathological lane overflow/hard-stop, the outbox reindex rebuilds an entity's own doc + lineage from committed DB, but does not replay RDF triples or rename-cascade DESCENDANT reindexes -- those overflow recoveries are best-effort (the normal path self-enqueues per-closure retries). Validated on postgres-elasticsearch (cache=none): AsyncSearchIndexConsistencyIT (async visibility + same-entity ordering convergence) 2/2; OneTransactionFlushAtomicityIT 284, GlossaryTermResourceIT 170, TableResourceIT -- all green (atomicity + rename cascades + read-your-write intact under async). Unit: OrderedLaneExecutorTest, EntityLifecycleEventDispatcherTest, SearchIndexHandlerTest. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ttribution WritePathBenchmarkIT (inert unless -Dbenchmark.run=true) measures single-thread create/update p50/p99/mean against a fsync-on Postgres (-DdbDurable=true). SDK-only so it runs unchanged against an older service jar for before/after comparison. Each update is attributed (GC pause vs DB checkpoint/commit vs OM-server processing) via a raw-commit probe + GarbageCollectorMXBean + pg checkpoint count. Adds a dbDurable flag to TestSuiteBootstrap (default off). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…optimization # Conflicts: # openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java

…lidation count Addresses two gitar-bot review findings on PR #28675: 1. (Bug) With SearchIndexHandler.isAsync()=true, the ordered lane captured the live EntityInterface by reference. After dispatch the request thread keeps mutating that same instance (REST PII masking, clearFields stripping for the HTTP response, secret masking on service connections), so the lane could index a masked/stripped/partial document. executeHandler(EntityInterface,...) now takes a Consumer<EntityInterface> and, on the async path, deep-copies the committed state via JsonUtils before submitting it to the lane; the sync path binds the live entity. Per-entity lane ordering does not protect against this because the race is on the in-memory POJO, not on lane submission order. 2. (Quality) invalidateCacheForTaggedEntities(String) always returned 0 because deferOrRunSearchBackedInvalidation discarded the inline count. The defer hook now takes an IntSupplier so the inline (non-flush) path returns the live invalidated count; the deferred path still returns 0 (count is only known post-commit and is logged in searchTaggedEntitiesAndInvalidate). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

harshach · 2026-06-03T19:31:49Z

Thanks for the review — both findings are addressed in 18d2d92.

Finding 1 (Bug — async indexer captures a mutable entity)
SearchIndexHandler.isAsync() is now true, so the ordered lane reads the entity off the request thread after dispatch — but the request thread keeps mutating that same instance for the HTTP response (PII masking, clearFields stripping, secret masking on service connections). The lane could therefore index a masked/stripped/partial doc. Per-entity lane ordering doesn't help here because the race is on the in-memory POJO, not on lane submission order.

Fix: executeHandler(EntityInterface, …) now takes a Consumer<EntityInterface>, and the async branch deep-copies the committed state before submitting to the lane:

EntityInterface snapshot =
    JsonUtils.readValue(JsonUtils.pojoToJson(entity), entity.getClass());
orderedLaneExecutor.submit(
    entity.getId(), laneTask(entity, operation, () -> handlerCall.accept(snapshot)));

The sync branch still binds the live entity (no copy cost when nothing is async). The EntityReference overload is unaffected (references aren't response-mutated). The six onEntity* call sites were updated from () -> handler.onX(entity, …) to e -> handler.onX(e, …).

Finding 2 (Quality — invalidateCacheForTaggedEntities(String) always returned 0)
deferOrRunSearchBackedInvalidation discarded the inline count. It now takes an IntSupplier, so the inline (non-flush) path returns the live invalidated count:

} else {
  result = invalidation.getAsInt();
}

The deferred path still returns 0 — the work runs post-commit and the count is only known then (it's logged inside searchTaggedEntitiesAndInvalidate); this is documented in the method's Javadoc.

mvn spotless:check and a service-module compile both pass.

github-actions · 2026-06-04T08:18:06Z

🟡 Playwright Results — all passed (16 flaky)

✅ 4264 passed · ❌ 0 failed · 🟡 16 flaky · ⏭️ 88 skipped

Shard	Passed	Flaky	Skipped
🟡 Shard 1	300	1	4
🟡 Shard 2	801	3	9
🟡 Shard 3	802	3	8
🟡 Shard 4	852	3	12
🟡 Shard 5	720	1	47
🟡 Shard 6	789	5	8

🟡 16 flaky test(s) (passed on retry)

Flow/SearchRBAC.spec.ts › User without permission (shard 1, 1 retry)
Features/DataProductRename.spec.ts › should handle multiple consecutive renames and preserve assets (shard 2, 1 retry)
Features/DataQuality/BundleSuiteBulkOperations.spec.ts › Add test case to existing Bundle Suite (shard 2, 1 retry)
Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test unbookmark functionality (shard 3, 1 retry)
Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
Pages/CustomProperties.spec.ts › Time Interval (shard 4, 1 retry)
Pages/CustomProperties.spec.ts › Time Interval (shard 4, 1 retry)
Pages/DomainUIInteractions.spec.ts › Delete domain with subdomains shows warning (shard 4, 1 retry)
Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for apiEndpoint -> mlModel (shard 6, 1 retry)
Pages/Lineage/LineageFilters.spec.ts › Verify Impact Analysis service filter selection (shard 6, 1 retry)
Pages/Lineage/LineageFilters.spec.ts › Verify lineage service type filter selection (shard 6, 2 retries)
Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)

📦 Download artifacts

How to debug locally

# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

…umn/async search ITs ColumnInDataAssetTests and AsyncSearchIndexConsistencyIT searched a shared alias (dataAsset / table_search_index) with query("*") + size(50), then filtered the 50-doc page client-side for the test's own entity. Those aliases span every tableColumn / table doc in the cluster, so once the full IT suite has indexed thousands of them the arbitrary page almost never contains this test's entity: ColumnInDataAssetTests failed deterministically and AsyncSearchIndexConsistencyIT flaked. In isolation the entity is always in the page, so both passed alone. The production code is correct. A standalone reproduction confirmed columns keep the borrowed dataAsset parent alias through a recreate-all reindex, and a load reproduction (60 noise tables / ~120 columns) showed the match-all page misses the target while a name/FQN-targeted query finds it. Fix the queries to search FOR the specific entity (column name / table FQN) so the target ranks first and a single page reliably contains it. No production change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Brings in #28676 (policy-aware updateDisplayName + updatingBotDeniedOperation) and refreshes the branch base before addressing PR review findings.

…le no-op Resolves the two blocking findings from the PR #28675 review. 1. updateOwners no longer reverts owners on every bot PUT. The previous guard reverted owners whenever a bot issued a PUT and an owner already existed, silently breaking ingestion ownership re-sync across all entity types. It now keys on the policy-aware updatingBotDeniedOperation(EDIT_OWNERS) check that #28676 introduced for updateDisplayName: a bot the policy DENIES EditOwners still cannot clobber user-curated owners, but the ingestion/default bots (which only carry a DisplayName-Deny, not an Owner-Deny) sync owners again. overrideMetadata=true still force-syncs. 2. bulkDeleteStaleEntities treats an empty seenFqns as a zero-deletion no-op instead of "every entity under the scope is stale". An empty seen-set cannot be distinguished from a connector run that crashed or discovered nothing, so interpreting it as fully-stale would silently delete (and with hardDelete permanently destroy) every entity under a service/database. Mirrors the existing scope-not-found no-op rather than rejecting with 400, keeping the well-formed-but-ambiguous request safe. Tests: - BulkDeleteStaleIT.test_emptySeenFqns_withPopulatedScope_deletesNothing - a populated scope with empty seenFqns deletes nothing (wipes 3 tables without the guard). - BaseEntityIT.test_singleEntityPut_bot_updatesOwnersWhenPolicyAllows - a policy-allowed bot reassigns owners via PUT (reverts to the old owner without the fix). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… 403 test_unauthorizedUser_isForbidden minted a DataConsumer JWT but never created the user, relying on another test in the concurrent suite to seed it. In isolation the server rejects the token with 404 (user not found) instead of 403 (forbidden), and even in the suite the assertion is order-dependent. Ensure the user exists up front via UserTestFactory.getDataConsumer (the established pattern used by PermissionsResourceIT / AppOperationPermissionsIT) so the principal resolves and the DELETE authorization is what actually drives the 403. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ide ES contention A multi-agent review of the remaining intermittent CI failures confirmed the production code is correct: the main-entity index write runs inline in postCreate (outside the deferral scope) and self-enqueues to the retry outbox on failure, the post-commit deferral drain is exception-safe and thread-local-safe, and the column dataAsset alias is a static index alias. The AsyncSearchIndexConsistencyIT and ColumnInDataAssetTests timeouts only flake because, under the full ~14.5k-test suite hitting one shared Elasticsearch node, a live write can be 429'd and land via the retry worker slightly after the 60s window. Widen the poll window to 120s for just these search-visibility assertions; no production change. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…e-text query (fixes OpenSearch 500) The postgres-opensearch CI job failed deterministically on AsyncSearchIndexConsistencyIT and ColumnInDataAssetTests. A prior change searched for the specific entity by free-text querying its (long, many-token) fullyQualifiedName / column name. On the multi-index dataAsset and table_search_index aliases that expands into >1024 boolean clauses and trips OpenSearch's default index.query.bool.max_clause_count (Elasticsearch's default is far higher), so the search returned HTTP 500 (query_shard_exception: maxClauseCount is set to 1024). Awaitility's ignoreExceptions() swallowed the 500 and the poll timed out — passing on Elasticsearch, failing on OpenSearch. Use match-all plus a term filter on a keyword field (fullyQualifiedName for the table, name.keyword for the column) instead: one clause, exact match, no clause explosion, and the target is found regardless of how many docs the suite has indexed (keeping the earlier top-50-window fix). Revert the 60s->120s poll-window bump, which was based on the wrong (env-overload) diagnosis; with the term filter the query succeeds immediately. Verified on OpenSearch (opensearchproject/opensearch:3.4.0): both classes pass in ~2-3s with zero maxClauseCount errors. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

github-actions · 2026-06-05T00:53:30Z

✅ PR checks passed

The linked issue has a description and all required Shipping project fields set. Thanks!

sonarqubecloud · 2026-06-05T06:29:57Z

Quality Gate passed for 'open-metadata-ingestion'

Issues
0 New issues
0 Accepted issues

Measures
0 Security Hotspots
0.0% Coverage on New Code
0.0% Duplication on New Code

See analysis details on SonarQube Cloud

manerow

Looks good to me. The one-transaction wrap and deferring the external side effects out of the held connection reads cleanly. Approving.

gitar-bot · 2026-06-05T13:53:02Z

Code Review ✅ Approved 7 resolved / 7 findings

Optimizes API latency by implementing single-transaction flushes and an ordered asynchronous indexing path for write operations. Resolved issues regarding stale entity snapshots, cache invalidation, and regression in bot owner updates.

✅ 7 resolved

✅ Quality: invalidateCacheForTaggedEntities now always returns 0

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:3217-3231 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:3282-3294
invalidateCacheForTaggedEntities(String) previously returned the live count of invalidated entities. After the refactor it delegates to deferOrRunSearchBackedInvalidation(Runnable, String), which executes the invalidation as a Runnable and discards its result, hard-coding int result = 0 on both the inline and deferred paths (EntityRepository.java:3217-3244). The actual count produced by searchTaggedEntitiesAndInvalidate is dropped.

Impact is limited because no caller uses the count for control flow — the bulk variant invalidateCacheForTaggedEntities(Collection) only aggregates it for the log line "Invalidated cache for {} entities across {} renamed tag FQNs" (EntityRepository.java:3288-3290), which will now always report 0 (and the if (total > 0) guard means it never logs at all). The inner per-tag log inside searchTaggedEntitiesAndInvalidate still reports the true count, so the regression is purely misleading/absent aggregate logging. Suggested fix: have the inline path return the real count by using a small int-returning helper instead of a Runnable, or document that the return value is no longer meaningful and drop the dead total/logging in the bulk variant.

✅ Bug: Async indexer may read entity POJO mutated after dispatch

📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/handlers/SearchIndexHandler.java:150-157 📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:328-342
SearchIndexHandler.isAsync() was flipped from false to true (SearchIndexHandler.java:150-157), so the search document is now built and written on a background OrderedLaneExecutor lane thread that captures the same EntityInterface instance by reference in the dispatched lambda (EntityLifecycleEventDispatcher.executeHandler -> laneTask). The previous synchronous behavior was deliberately chosen — the removed comment stated "Search indexing must be visible to follow-up operations in the same request flow" and indexing ran inline before the request thread continued.

With async indexing there is now a window where the request thread can continue mutating the very same entity instance after dispatch but before the lane thread reads it to build the doc — e.g. REST-layer PII masking, clearFields/field stripping for the HTTP response, or secret encryption/decryption on service-connection configs. If any such post-dispatch mutation touches the same instance, the async indexer can observe and persist a masked/cleared/partial document into the search index. Per-entity lane ordering does not protect against this because the race is on the in-memory POJO, not on lane submission order.

I was unable to fully confirm a concrete post-dispatch mutation site within this review, so treat this as a potential issue: please verify that the entity object handed to the dispatcher is either deep-copied before async submission or is never mutated by the resource/masking/encryption layers after onEntityCreated/onEntityUpdated is called. If it can be mutated, snapshot (deep-copy) the entity at dispatch time before submitting it to the lane.

✅ Bug: Unguarded snapshot serialization can fail API after commit

📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:331-343
executeHandler now performs JsonUtils.readValue(JsonUtils.pojoToJson(entity), entity.getClass()) synchronously on the request thread, before the task is submitted to the ordered lane. This call is not wrapped in any try/catch:

The async dispatch entry points (onEntityCreated/onEntityUpdated/onEntityDeleted) are invoked from EntityRepository.postCreate/postUpdate/post-delete hooks, which run AFTER the DB transaction has committed and do not catch RuntimeExceptions escaping the dispatcher.

If pojoToJson/readValue throws for any entity (e.g. a field that fails to round-trip, an unexpected subtype, a serialization edge case), the exception propagates to the REST request thread. The client receives a 5xx even though the entity was already committed to the DB — a confusing 'write failed but actually succeeded' outcome.

Worse, this also defeats the durability model this PR carefully builds: because the throw happens before orderedLaneExecutor.submit(...), the enqueueLaneFailureRetry/SearchIndexRetryQueue outbox path never fires, so the search-index update is silently lost rather than retried.

Wrap the snapshot creation in a try/catch and, on failure, route the locator to the durable retry outbox (so the worker rebuilds the doc from committed DB state) instead of letting it break the request and drop the index update.

✅ Performance: Entity re-serialized once per async handler in dispatch loop

📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:326-340 📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:144-148 📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:183-187
executeHandler is called once per applicable handler inside the dispatch loops (onEntityCreated, dispatchBulkCreate, onEntityUpdated, dispatchBulkUpdate, etc.). The snapshot JsonUtils.readValue(JsonUtils.pojoToJson(entity), ...) is therefore performed once per async handler for the same entity. When multiple async handlers apply to an entity type, the same POJO is serialized+deserialized N times on the request thread; for bulk create/update this multiplies to (entities × async handlers) full round-trips on the request path — exactly the path this PR is trying to make faster. Consider computing the snapshot once per entity (e.g. lazily on first async handler) and reusing it across handlers for the same dispatch.

✅ Edge Case: Snapshot failure drops non-search async handlers silently

📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:355-360 📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:386-400
When snapshotOrEnqueueRetry fails to serialize an entity, it returns null and executeHandler skips the lane submit for every async handler (the shared/memoized supplier returns null for all of them). However, the only durable recovery path enqueued is SearchIndexRetryQueue.enqueue(...), which the worker uses to rebuild the search document. There are three async handlers dispatched through this path — SearchIndexHandler, DomainSyncHandler (task/announcement domain sync), and VectorEmbeddingHandler — and the latter two are NOT recovered by the search-index retry outbox. So on a (rare/pathological) post-commit serialization failure, domain-sync and vector-embedding side effects are silently lost with no retry, while the Javadoc only describes the search-doc recovery.

Impact is limited because serialization failures are pathological and the write itself is already committed, so this is minor. Suggested fix: either document this explicitly as a known best-effort residual (matching the RDF/rename-cascade caveats in the PR description), or enqueue the locator into the per-handler recovery mechanisms (not just the search outbox) so domain-sync/vector embeddings also converge.

...and 2 more resolved from earlier reviews

Options

Display: compact → Showing less information.

Comment with these commands to change:

`Compact`
`gitar display:verbose`

_{Was this helpful? React with 👍 / 👎 | Gitar}

harshach and others added 8 commits June 1, 2026 23:16

github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Jun 3, 2026

gitar-bot Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java

gitar-bot Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread ...ice/src/main/java/org/openmetadata/service/events/lifecycle/handlers/SearchIndexHandler.java

Merge remote-tracking branch 'origin/main' into harshach/api-latency-…

b780ab1

…optimization # Conflicts: # openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java

harshach had a problem deploying to test June 3, 2026 19:26 — with GitHub Actions Error

gitar-bot Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread .../src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java

gitar-bot Bot reviewed Jun 3, 2026

View reviewed changes

Comment thread .../src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java Outdated

harshach temporarily deployed to test June 3, 2026 19:43 — with GitHub Actions Inactive

harshach had a problem deploying to test June 3, 2026 19:44 — with GitHub Actions Error

harshach temporarily deployed to test June 4, 2026 05:34 — with GitHub Actions Inactive

harshach temporarily deployed to test June 4, 2026 15:21 — with GitHub Actions Inactive

harshach had a problem deploying to test June 4, 2026 15:22 — with GitHub Actions Error

harshach temporarily deployed to test June 4, 2026 15:22 — with GitHub Actions Inactive

harshach had a problem deploying to test June 4, 2026 15:22 — with GitHub Actions Error

harshach temporarily deployed to test June 4, 2026 15:22 — with GitHub Actions Inactive

harshach had a problem deploying to test June 4, 2026 15:22 — with GitHub Actions Error

harshach and others added 2 commits June 4, 2026 08:41

Merge origin/main into harshach/api-latency-optimization

e002be1

Brings in #28676 (policy-aware updateDisplayName + updatingBotDeniedOperation) and refreshes the branch base before addressing PR review findings.

harshach had a problem deploying to test June 4, 2026 17:09 — with GitHub Actions Error

harshach had a problem deploying to test June 4, 2026 17:10 — with GitHub Actions Error

harshach and others added 2 commits June 4, 2026 10:19

This was referenced Jun 4, 2026

Lost-update race in PolicyConditionUpdater on concurrent tag/glossary renames #28718

Open

Distributed reindex can drop parent aliases (dataAsset/all/table) when promoting via a participant-reconstructed context #28719

Closed

harshach mentioned this pull request Jun 5, 2026

perf: reduce write/read API latency — one-transaction flush + read-path bulk-field fetch #28726

Closed

Merge branch 'main' into harshach/api-latency-optimization

15f4fde

manerow approved these changes Jun 5, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

perf: one-transaction flush + async indexing write path, read-path fixes#28675

perf: one-transaction flush + async indexing write path, read-path fixes#28675
harshach merged 24 commits into
mainfrom
harshach/api-latency-optimization

harshach commented Jun 3, 2026 •

edited

Loading

Uh oh!

Uh oh!

Uh oh!

harshach commented Jun 3, 2026

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 4, 2026 •

edited

Loading

Uh oh!

github-actions Bot commented Jun 5, 2026 •

edited

Loading

Uh oh!

sonarqubecloud Bot commented Jun 5, 2026

Uh oh!

manerow left a comment

Uh oh!

gitar-bot Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

harshach commented Jun 3, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Describe your changes:

Type of change:

High-level design:

Tests:

Use cases covered

Backend integration tests

Unit tests

Ingestion integration tests

Playwright (UI) tests

Manual testing performed

UI screen recording / screenshots:

Checklist:

Uh oh!

Uh oh!

Uh oh!

harshach commented Jun 3, 2026

Uh oh!

Uh oh!

Uh oh!

github-actions Bot commented Jun 4, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

🟡 Playwright Results — all passed (16 flaky)

Uh oh!

github-actions Bot commented Jun 5, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

✅ PR checks passed

Uh oh!

sonarqubecloud Bot commented Jun 5, 2026

Quality Gate passed for 'open-metadata-ingestion'

Uh oh!

manerow left a comment

Choose a reason for hiding this comment

Uh oh!

gitar-bot Bot commented Jun 5, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

harshach commented Jun 3, 2026 •

edited

Loading

github-actions Bot commented Jun 4, 2026 •

edited

Loading

github-actions Bot commented Jun 5, 2026 •

edited

Loading