Skip to content

perf: one-transaction flush + async indexing write path, read-path fixes#28675

Merged
harshach merged 24 commits into
mainfrom
harshach/api-latency-optimization
Jun 5, 2026
Merged

perf: one-transaction flush + async indexing write path, read-path fixes#28675
harshach merged 24 commits into
mainfrom
harshach/api-latency-optimization

Conversation

@harshach
Copy link
Copy Markdown
Collaborator

@harshach harshach commented Jun 3, 2026

Describe your changes:

Fixes #28726

Optimizes the OpenMetadata read and write paths for lower API latency. Reads: several repositories overrode setFields but not setFieldsInBulk, silently dropping list fields and forcing per-item queries — added bulk field-fetch paths (Directory, Spreadsheet, all *Service, KnowledgePage, ContextMemory), dropped a redundant CSV-export tag rescan, batched single-entity tag inserts, and cached Fernet keys. Writes — Stage 1: the create/update/patch flush committed 5–7 times via per-DAO autocommit (the @Transaction annotations were decorative no-ops) and left orphan rows on mid-flush failure; it is now wrapped in one real Entity.getJdbi().inTransaction(...) + DeadlockRetry, with every external side-effect (cache-L2 invalidation, RDF/SPARQL, lineage-ES, rename-cascade reindex) deferred so the held DB connection makes no network round trip. Writes — Stage 2: those externals + the ES entity index now run asynchronously off the request thread on a new per-entity-ordered OrderedLaneExecutor (durable via the search retry outbox), while cache write-through stays synchronous post-commit so GET /{id} and /name/{fqn} remain read-your-write — only /search and SPARQL become eventually consistent.

Type of change:

  • Improvement

High-level design:

  • Stage 1 wraps EntityRepository create/update/patch flushes in inTransaction + deadlock-retry (mirroring the existing cleanup() idiom), collapsing 5–7 commits → 1 and making writes atomic. In-flush external calls are captured into thread-local collectors and drained post-commit; EntityUpdater state is snapshotted/restored per retry attempt (incl. the 6 cascade-bearing subclass updaters).
  • Stage 2 routes those drains plus the ES entity index (SearchIndexHandler.isAsync()=true) onto OrderedLaneExecutor, keyed by entity id so single-consumer lanes preserve per-entity ordering. A full lane sheds to SearchIndexRetryQueue (a fast local-DB upsert) rather than blocking the request thread; bulk create slices per-entity onto each member's own lane.
  • Consistency contract: GET-by-id/by-name stay real-time (DB + sync cache); search/RDF/lineage are eventually consistent with durable retry. Redis stays opt-in (CACHE_PROVIDER=none default).
  • Honest residual: under pathological lane overflow/hard-stop, the outbox rebuilds an entity's doc + lineage from committed DB, but RDF triples and rename-cascade descendant reindexes are best-effort.

Tests:

Use cases covered

  • Create/update/patch is atomic (mid-flush failure → zero orphan rows) and deadlock-replay-safe, incl. glossary-term rename and data-product domain-change cascades.
  • Write returns before the doc is searchable; doc converges in search; same-entity create→update converges to the final value (lane ordering).
  • Previously-dropped list fields now populate (Drive directories/spreadsheets, *Service pipelines, Knowledge Center hierarchy, Context memories).

Backend integration tests

  • Added/updated in openmetadata-integration-tests/.
  • OneTransactionFlushAtomicityIT, AsyncSearchIndexConsistencyIT, DirectoryResourceIT, SpreadsheetResourceIT, DatabaseServiceResourceIT, ContextMemoryIT, DatabaseResourceIT, TableResourceIT, KnowledgePageHierarchyIT, ColumnSearchIndexIT/TestCaseResourceIT (Awaitility polling for async), WritePathBenchmarkIT (inert perf harness).

Unit tests

  • OrderedLaneExecutorTest, EntityLifecycleEventDispatcherTest, SearchIndexHandlerTest, FernetKeyCacheTest.

Ingestion integration tests

  • Not applicable (no ingestion changes).

Playwright (UI) tests

  • Not applicable (no UI changes).

Manual testing performed

  • Ran targeted ITs + the full IT suite (mvn verify -Ppostgres-elasticsearch, ~13.5k tests) on cache=none and cache=redis; all green except pre-existing async-workflow flakiness.
  • Write-path benchmark on a durable (fsync=on) Postgres showed ~30–40% median write-latency improvement; the residual p99 tail is DB-commit/checkpoint + GC, not the OM write path.

UI screen recording / screenshots:

Not applicable.

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR is linked to a GitHub issue via Fixes #<issue-number> above.
  • I have added tests (unit / integration) and listed them above.
  • For JSON Schema changes: N/A (no schema changes).

harshach and others added 8 commits June 1, 2026 23:16
Flip the packaged deployment default from none to redis. CacheBundle already
falls back to NoopCacheProvider when Redis init fails, so deployments without
Redis keep booting (degraded). CacheConfig.java's programmatic default stays
none, so the integration-test default is unchanged.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Several repositories override setFields but not setFieldsInBulk, so entity-specific
fields were silently null on the list path (0 queries — a correctness bug, not an
N+1). Add the bulk path for each and batch the residual per-item fetches:

- DirectoryRepository: service/parent/numberOfFiles/numberOfSubDirectories/totalSize
  on list; batch child sizes via findEntitiesByIds (NON_DELETED).
- SpreadsheetRepository: service/directory/worksheets on list; batch getWorksheets.
- ServiceEntityRepository: pipelines on list for all 13 *Service types.
- ContextMemoryRepository: primaryEntity/relatedEntities on list.
- DatabaseRepository: drop redundant per-table setFieldsInternal re-scan in
  recursive CSV export (columns/tags already bulk-populated).
- EntityRepository: route single-entity applyTags through the batched variant.

Validated by integration tests (postgres-elasticsearch): Directory 251, Spreadsheet
165, DatabaseService 39, ContextMemory 132, Database 197, Table tag-batch — all green.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Fernet rebuilt the key list (split on ",") and constructed a new SecureRandom on
every encrypt/decrypt call — repeated per password field per service write/read.
Hoist the parsed List<Key> and a single SecureRandom to init-once fields, rebuilt
only on setFernetKey (rotation). Ciphertext format and rotation behavior unchanged.

Validated: FernetKeyCacheTest — 3 tests, 0 failures (round-trip, multi-field key
reuse, rotation rebuild).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…list (T1)

KnowledgePageRepository overrode setFields but not setFieldsInBulk, so the
hierarchy listing (GET /v1/contextCenter/pages, /hierarchy) returned every page
with parent=null — the Knowledge Center tree rendered flat. Add setFieldsInBulk
batching parent (CONTAINS container), relatedEntities, and editors.

Tested via KnowledgePageHierarchyIT (not BaseEntityIT — KnowledgePage is
parent-scoped and does not fit the generic CRUD harness): creates parent + child
pages and asserts the child's parent populates on the list path and nests under
the parent in the hierarchy. 2 tests, 0 failures.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…post-commit (Stage 1)

The create/update/patch flush committed 5-7 times via per-DAO autocommit because the
@transaction annotations on EntityRepository are decorative (the repo is new-instantiated,
not a JDBI SqlObject proxy), so a mid-create failure also left orphan relationship/tag rows.

Wrap each flush body in Entity.getJdbi().inTransaction(...) + DeadlockRetry (mirroring the
proven cleanup() idiom): 5-7 commits -> 1, and the unit is now all-or-nothing. To keep the
held DB connection free of any network round trip, every external side-effect that ran inside
the flush is deferred to run post-commit via thread-local collectors:
- cache L2 invalidation (addRelationship/deleteRelationship/rename-cascade) recorded, drained
  post-commit; Guava L1 stays inline. Cache write-through stays SYNC post-commit so
  GET-by-id/by-name read-your-write is preserved.
- tag/entity RDF SPARQL deferred (RdfTagUpdater) and drained post-commit.
- domain/data-product lineage-ES and rename-cascade ES reindexes (Glossary/GlossaryTerm/Tag/
  Classification/Container/Domain/DataProduct) deferred via SearchRepository, durable retry
  via SearchIndexRetryQueue on a failed drain.
The Redis gate is removed, so the wrap is always engaged regardless of cache provider.

Deadlock-replay safety: EntityUpdater base state is snapshotted/restored per attempt; the 6
cascade-bearing subclass updaters reset run-once guards via resetForRetryAttempt; the deferral
collectors clear+reopen per attempt so a rolled-back attempt enqueues nothing.

In Stage 1 the deferred externals still drain SYNC post-commit (off the held connection);
moving them off the request thread is Stage 2.

Validated by integration tests on postgres-elasticsearch, cache=none AND cache=redis:
OneTransactionFlushAtomicityIT + 1693 cache=none and 683 cache=redis cascade/rename tests.

Known follow-up: descendant by-id cache under a deadlock-replayed rename + Redis (narrow,
self-healing; non-deadlock real path verified under Redis) — that replay-test is cache=none.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Keep Redis opt-in (CACHE_PROVIDER=redis) rather than the packaged default, so the
default deployment and the standard test suite run without an external Redis
dependency and continue to work. The Stage-1 one-transaction flush is independent
of the cache provider (the Redis gate was removed), so this does not affect it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…read (Stage 2)

Stage 1 committed the write flush in one transaction and deferred the external
side-effects (ES index, RDF, lineage-ES, rename-cascade reindex) to drain
synchronously post-commit. Stage 2 moves those drains OFF the request thread so the
write API returns right after the DB commit + sync cache write-through, never
blocking on ES/RDF/SPARQL.

- OrderedLaneExecutor: entity-id-striped single-consumer lanes (virtual threads),
  bounded queue. All async work for one entity id serializes on its own lane in
  submission order, so create->update->update (and a postCreate that re-updates the
  same doc) index in order. Bulk create slices per-entity onto each member's own lane.
- SearchIndexHandler.isAsync() -> true; the post-commit RDF/lineage/rename-cascade
  drains submit onto the same lanes keyed by entity id.
- Request thread never blocks: a full lane sheds the task to the durable outbox
  (SearchIndexRetryQueue, a fast local-DB upsert) rather than blocking or running the
  slow external inline. Every async failure (index/update/delete/soft-delete, lane
  Throwable, overflow, shutdown-drop) is recoverable via the outbox.
- Cache write-through + L2 invalidation stay SYNC post-commit, so GET /entity/{id}
  and /name/{fqn} remain immediately read-your-write. Only /search and SPARQL become
  eventually consistent.
- Graceful shutdown drains in-flight and flushes queued tasks to the outbox.

Honest residual: under pathological lane overflow/hard-stop, the outbox reindex
rebuilds an entity's own doc + lineage from committed DB, but does not replay RDF
triples or rename-cascade DESCENDANT reindexes -- those overflow recoveries are
best-effort (the normal path self-enqueues per-closure retries).

Validated on postgres-elasticsearch (cache=none): AsyncSearchIndexConsistencyIT
(async visibility + same-entity ordering convergence) 2/2; OneTransactionFlushAtomicityIT
284, GlossaryTermResourceIT 170, TableResourceIT -- all green (atomicity + rename
cascades + read-your-write intact under async). Unit: OrderedLaneExecutorTest,
EntityLifecycleEventDispatcherTest, SearchIndexHandlerTest.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ttribution

WritePathBenchmarkIT (inert unless -Dbenchmark.run=true) measures single-thread
create/update p50/p99/mean against a fsync-on Postgres (-DdbDurable=true). SDK-only
so it runs unchanged against an older service jar for before/after comparison. Each
update is attributed (GC pause vs DB checkpoint/commit vs OM-server processing) via a
raw-commit probe + GarbageCollectorMXBean + pg checkpoint count. Adds a dbDurable flag
to TestSuiteBootstrap (default off).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added backend safe to test Add this label to run secure Github workflows on PRs labels Jun 3, 2026
…optimization

# Conflicts:
#	openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java
…lidation count

Addresses two gitar-bot review findings on PR #28675:

1. (Bug) With SearchIndexHandler.isAsync()=true, the ordered lane captured the
   live EntityInterface by reference. After dispatch the request thread keeps
   mutating that same instance (REST PII masking, clearFields stripping for the
   HTTP response, secret masking on service connections), so the lane could index
   a masked/stripped/partial document. executeHandler(EntityInterface,...) now
   takes a Consumer<EntityInterface> and, on the async path, deep-copies the
   committed state via JsonUtils before submitting it to the lane; the sync path
   binds the live entity. Per-entity lane ordering does not protect against this
   because the race is on the in-memory POJO, not on lane submission order.

2. (Quality) invalidateCacheForTaggedEntities(String) always returned 0 because
   deferOrRunSearchBackedInvalidation discarded the inline count. The defer hook
   now takes an IntSupplier so the inline (non-flush) path returns the live
   invalidated count; the deferred path still returns 0 (count is only known
   post-commit and is logged in searchTaggedEntitiesAndInvalidate).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@harshach
Copy link
Copy Markdown
Collaborator Author

harshach commented Jun 3, 2026

Thanks for the review — both findings are addressed in 18d2d92.

Finding 1 (Bug — async indexer captures a mutable entity)
SearchIndexHandler.isAsync() is now true, so the ordered lane reads the entity off the request thread after dispatch — but the request thread keeps mutating that same instance for the HTTP response (PII masking, clearFields stripping, secret masking on service connections). The lane could therefore index a masked/stripped/partial doc. Per-entity lane ordering doesn't help here because the race is on the in-memory POJO, not on lane submission order.

Fix: executeHandler(EntityInterface, …) now takes a Consumer<EntityInterface>, and the async branch deep-copies the committed state before submitting to the lane:

EntityInterface snapshot =
    JsonUtils.readValue(JsonUtils.pojoToJson(entity), entity.getClass());
orderedLaneExecutor.submit(
    entity.getId(), laneTask(entity, operation, () -> handlerCall.accept(snapshot)));

The sync branch still binds the live entity (no copy cost when nothing is async). The EntityReference overload is unaffected (references aren't response-mutated). The six onEntity* call sites were updated from () -> handler.onX(entity, …) to e -> handler.onX(e, …).

Finding 2 (Quality — invalidateCacheForTaggedEntities(String) always returned 0)
deferOrRunSearchBackedInvalidation discarded the inline count. It now takes an IntSupplier, so the inline (non-flush) path returns the live invalidated count:

} else {
  result = invalidation.getAsInt();
}

The deferred path still returns 0 — the work runs post-commit and the count is only known then (it's logged inside searchTaggedEntitiesAndInvalidate); this is documented in the method's Javadoc.

mvn spotless:check and a service-module compile both pass.

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 4, 2026

🟡 Playwright Results — all passed (16 flaky)

✅ 4264 passed · ❌ 0 failed · 🟡 16 flaky · ⏭️ 88 skipped

Shard Passed Failed Flaky Skipped
🟡 Shard 1 300 0 1 4
🟡 Shard 2 801 0 3 9
🟡 Shard 3 802 0 3 8
🟡 Shard 4 852 0 3 12
🟡 Shard 5 720 0 1 47
🟡 Shard 6 789 0 5 8
🟡 16 flaky test(s) (passed on retry)
  • Flow/SearchRBAC.spec.ts › User without permission (shard 1, 1 retry)
  • Features/DataProductRename.spec.ts › should handle multiple consecutive renames and preserve assets (shard 2, 1 retry)
  • Features/DataQuality/BundleSuiteBulkOperations.spec.ts › Add test case to existing Bundle Suite (shard 2, 1 retry)
  • Features/Glossary/GlossaryWorkflow.spec.ts › should display correct status badge color and icon (shard 2, 1 retry)
  • Features/KnowledgeCenterList.spec.ts › Knowledge Center List - Test unbookmark functionality (shard 3, 1 retry)
  • Features/RTL.spec.ts › Verify Following widget functionality (shard 3, 1 retry)
  • Features/Table.spec.ts › Table pagination with sorting should works (shard 3, 1 retry)
  • Pages/CustomProperties.spec.ts › Time Interval (shard 4, 1 retry)
  • Pages/CustomProperties.spec.ts › Time Interval (shard 4, 1 retry)
  • Pages/DomainUIInteractions.spec.ts › Delete domain with subdomains shows warning (shard 4, 1 retry)
  • Pages/ExplorePageRightPanel_KnowledgeCenter.spec.ts › Should remove user owner for knowledgeCenter (shard 5, 1 retry)
  • Pages/Lineage/DataAssetLineage.spec.ts › Column lineage for apiEndpoint -> mlModel (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify Impact Analysis service filter selection (shard 6, 1 retry)
  • Pages/Lineage/LineageFilters.spec.ts › Verify lineage service type filter selection (shard 6, 2 retries)
  • Pages/Lineage/LineageRightPanel.spec.ts › Verify custom properties tab IS visible for supported type: searchIndex (shard 6, 1 retry)
  • Pages/Lineage/PlatformLineage.spec.ts › Verify domain platform view (shard 6, 1 retry)

📦 Download artifacts

How to debug locally
# Download playwright-test-results-<shard> artifact and unzip
npx playwright show-trace path/to/trace.zip    # view trace

…umn/async search ITs

ColumnInDataAssetTests and AsyncSearchIndexConsistencyIT searched a shared alias
(dataAsset / table_search_index) with query("*") + size(50), then filtered the
50-doc page client-side for the test's own entity. Those aliases span every
tableColumn / table doc in the cluster, so once the full IT suite has indexed
thousands of them the arbitrary page almost never contains this test's entity:
ColumnInDataAssetTests failed deterministically and AsyncSearchIndexConsistencyIT
flaked. In isolation the entity is always in the page, so both passed alone.

The production code is correct. A standalone reproduction confirmed columns keep
the borrowed dataAsset parent alias through a recreate-all reindex, and a load
reproduction (60 noise tables / ~120 columns) showed the match-all page misses the
target while a name/FQN-targeted query finds it. Fix the queries to search FOR the
specific entity (column name / table FQN) so the target ranks first and a single
page reliably contains it. No production change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
harshach and others added 2 commits June 4, 2026 08:41
Brings in #28676 (policy-aware updateDisplayName + updatingBotDeniedOperation) and
refreshes the branch base before addressing PR review findings.
…le no-op

Resolves the two blocking findings from the PR #28675 review.

1. updateOwners no longer reverts owners on every bot PUT. The previous guard
   reverted owners whenever a bot issued a PUT and an owner already existed,
   silently breaking ingestion ownership re-sync across all entity types. It now
   keys on the policy-aware updatingBotDeniedOperation(EDIT_OWNERS) check that
   #28676 introduced for updateDisplayName: a bot the policy DENIES EditOwners
   still cannot clobber user-curated owners, but the ingestion/default bots
   (which only carry a DisplayName-Deny, not an Owner-Deny) sync owners again.
   overrideMetadata=true still force-syncs.

2. bulkDeleteStaleEntities treats an empty seenFqns as a zero-deletion no-op
   instead of "every entity under the scope is stale". An empty seen-set cannot
   be distinguished from a connector run that crashed or discovered nothing, so
   interpreting it as fully-stale would silently delete (and with hardDelete
   permanently destroy) every entity under a service/database. Mirrors the
   existing scope-not-found no-op rather than rejecting with 400, keeping the
   well-formed-but-ambiguous request safe.

Tests:
- BulkDeleteStaleIT.test_emptySeenFqns_withPopulatedScope_deletesNothing - a
  populated scope with empty seenFqns deletes nothing (wipes 3 tables without
  the guard).
- BaseEntityIT.test_singleEntityPut_bot_updatesOwnersWhenPolicyAllows - a
  policy-allowed bot reassigns owners via PUT (reverts to the old owner without
  the fix).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
harshach and others added 2 commits June 4, 2026 10:19
… 403

test_unauthorizedUser_isForbidden minted a DataConsumer JWT but never created the
user, relying on another test in the concurrent suite to seed it. In isolation the
server rejects the token with 404 (user not found) instead of 403 (forbidden), and
even in the suite the assertion is order-dependent. Ensure the user exists up front
via UserTestFactory.getDataConsumer (the established pattern used by
PermissionsResourceIT / AppOperationPermissionsIT) so the principal resolves and the
DELETE authorization is what actually drives the 403.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ide ES contention

A multi-agent review of the remaining intermittent CI failures confirmed the
production code is correct: the main-entity index write runs inline in postCreate
(outside the deferral scope) and self-enqueues to the retry outbox on failure, the
post-commit deferral drain is exception-safe and thread-local-safe, and the
column dataAsset alias is a static index alias. The AsyncSearchIndexConsistencyIT
and ColumnInDataAssetTests timeouts only flake because, under the full ~14.5k-test
suite hitting one shared Elasticsearch node, a live write can be 429'd and land via
the retry worker slightly after the 60s window. Widen the poll window to 120s for
just these search-visibility assertions; no production change.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…e-text query (fixes OpenSearch 500)

The postgres-opensearch CI job failed deterministically on AsyncSearchIndexConsistencyIT
and ColumnInDataAssetTests. A prior change searched for the specific entity by free-text
querying its (long, many-token) fullyQualifiedName / column name. On the multi-index
dataAsset and table_search_index aliases that expands into >1024 boolean clauses and trips
OpenSearch's default index.query.bool.max_clause_count (Elasticsearch's default is far
higher), so the search returned HTTP 500 (query_shard_exception: maxClauseCount is set to
1024). Awaitility's ignoreExceptions() swallowed the 500 and the poll timed out — passing on
Elasticsearch, failing on OpenSearch.

Use match-all plus a term filter on a keyword field (fullyQualifiedName for the table,
name.keyword for the column) instead: one clause, exact match, no clause explosion, and the
target is found regardless of how many docs the suite has indexed (keeping the earlier
top-50-window fix). Revert the 60s->120s poll-window bump, which was based on the wrong
(env-overload) diagnosis; with the term filter the query succeeds immediately.

Verified on OpenSearch (opensearchproject/opensearch:3.4.0): both classes pass in ~2-3s with
zero maxClauseCount errors.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented Jun 5, 2026

✅ PR checks passed

The linked issue has a description and all required Shipping project fields set. Thanks!

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 5, 2026

Copy link
Copy Markdown
Contributor

@manerow manerow left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me. The one-transaction wrap and deferring the external side effects out of the held connection reads cleanly. Approving.

@gitar-bot
Copy link
Copy Markdown

gitar-bot Bot commented Jun 5, 2026

Code Review ✅ Approved 7 resolved / 7 findings

Optimizes API latency by implementing single-transaction flushes and an ordered asynchronous indexing path for write operations. Resolved issues regarding stale entity snapshots, cache invalidation, and regression in bot owner updates.

✅ 7 resolved
Quality: invalidateCacheForTaggedEntities now always returns 0

📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:3217-3231 📄 openmetadata-service/src/main/java/org/openmetadata/service/jdbi3/EntityRepository.java:3282-3294
invalidateCacheForTaggedEntities(String) previously returned the live count of invalidated entities. After the refactor it delegates to deferOrRunSearchBackedInvalidation(Runnable, String), which executes the invalidation as a Runnable and discards its result, hard-coding int result = 0 on both the inline and deferred paths (EntityRepository.java:3217-3244). The actual count produced by searchTaggedEntitiesAndInvalidate is dropped.

Impact is limited because no caller uses the count for control flow — the bulk variant invalidateCacheForTaggedEntities(Collection) only aggregates it for the log line "Invalidated cache for {} entities across {} renamed tag FQNs" (EntityRepository.java:3288-3290), which will now always report 0 (and the if (total > 0) guard means it never logs at all). The inner per-tag log inside searchTaggedEntitiesAndInvalidate still reports the true count, so the regression is purely misleading/absent aggregate logging. Suggested fix: have the inline path return the real count by using a small int-returning helper instead of a Runnable, or document that the return value is no longer meaningful and drop the dead total/logging in the bulk variant.

Bug: Async indexer may read entity POJO mutated after dispatch

📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/handlers/SearchIndexHandler.java:150-157 📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:328-342
SearchIndexHandler.isAsync() was flipped from false to true (SearchIndexHandler.java:150-157), so the search document is now built and written on a background OrderedLaneExecutor lane thread that captures the same EntityInterface instance by reference in the dispatched lambda (EntityLifecycleEventDispatcher.executeHandler -> laneTask). The previous synchronous behavior was deliberately chosen — the removed comment stated "Search indexing must be visible to follow-up operations in the same request flow" and indexing ran inline before the request thread continued.

With async indexing there is now a window where the request thread can continue mutating the very same entity instance after dispatch but before the lane thread reads it to build the doc — e.g. REST-layer PII masking, clearFields/field stripping for the HTTP response, or secret encryption/decryption on service-connection configs. If any such post-dispatch mutation touches the same instance, the async indexer can observe and persist a masked/cleared/partial document into the search index. Per-entity lane ordering does not protect against this because the race is on the in-memory POJO, not on lane submission order.

I was unable to fully confirm a concrete post-dispatch mutation site within this review, so treat this as a potential issue: please verify that the entity object handed to the dispatcher is either deep-copied before async submission or is never mutated by the resource/masking/encryption layers after onEntityCreated/onEntityUpdated is called. If it can be mutated, snapshot (deep-copy) the entity at dispatch time before submitting it to the lane.

Bug: Unguarded snapshot serialization can fail API after commit

📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:331-343
executeHandler now performs JsonUtils.readValue(JsonUtils.pojoToJson(entity), entity.getClass()) synchronously on the request thread, before the task is submitted to the ordered lane. This call is not wrapped in any try/catch:

  • The async dispatch entry points (onEntityCreated/onEntityUpdated/onEntityDeleted) are invoked from EntityRepository.postCreate/postUpdate/post-delete hooks, which run AFTER the DB transaction has committed and do not catch RuntimeExceptions escaping the dispatcher.
  • If pojoToJson/readValue throws for any entity (e.g. a field that fails to round-trip, an unexpected subtype, a serialization edge case), the exception propagates to the REST request thread. The client receives a 5xx even though the entity was already committed to the DB — a confusing 'write failed but actually succeeded' outcome.
  • Worse, this also defeats the durability model this PR carefully builds: because the throw happens before orderedLaneExecutor.submit(...), the enqueueLaneFailureRetry/SearchIndexRetryQueue outbox path never fires, so the search-index update is silently lost rather than retried.

Wrap the snapshot creation in a try/catch and, on failure, route the locator to the durable retry outbox (so the worker rebuilds the doc from committed DB state) instead of letting it break the request and drop the index update.

Performance: Entity re-serialized once per async handler in dispatch loop

📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:326-340 📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:144-148 📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:183-187
executeHandler is called once per applicable handler inside the dispatch loops (onEntityCreated, dispatchBulkCreate, onEntityUpdated, dispatchBulkUpdate, etc.). The snapshot JsonUtils.readValue(JsonUtils.pojoToJson(entity), ...) is therefore performed once per async handler for the same entity. When multiple async handlers apply to an entity type, the same POJO is serialized+deserialized N times on the request thread; for bulk create/update this multiplies to (entities × async handlers) full round-trips on the request path — exactly the path this PR is trying to make faster. Consider computing the snapshot once per entity (e.g. lazily on first async handler) and reusing it across handlers for the same dispatch.

Edge Case: Snapshot failure drops non-search async handlers silently

📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:355-360 📄 openmetadata-service/src/main/java/org/openmetadata/service/events/lifecycle/EntityLifecycleEventDispatcher.java:386-400
When snapshotOrEnqueueRetry fails to serialize an entity, it returns null and executeHandler skips the lane submit for every async handler (the shared/memoized supplier returns null for all of them). However, the only durable recovery path enqueued is SearchIndexRetryQueue.enqueue(...), which the worker uses to rebuild the search document. There are three async handlers dispatched through this path — SearchIndexHandler, DomainSyncHandler (task/announcement domain sync), and VectorEmbeddingHandler — and the latter two are NOT recovered by the search-index retry outbox. So on a (rare/pathological) post-commit serialization failure, domain-sync and vector-embedding side effects are silently lost with no retry, while the Javadoc only describes the search-doc recovery.

Impact is limited because serialization failures are pathological and the write itself is already committed, so this is minor. Suggested fix: either document this explicitly as a known best-effort residual (matching the RDF/rename-cascade caveats in the PR description), or enqueue the locator into the per-handler recovery mechanisms (not just the search outbox) so domain-sync/vector embeddings also converge.

...and 2 more resolved from earlier reviews

Options

Display: compact → Showing less information.

Comment with these commands to change:

Compact
gitar display:verbose         

Was this helpful? React with 👍 / 👎 | Gitar

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

backend safe to test Add this label to run secure Github workflows on PRs

Projects

None yet

Development

Successfully merging this pull request may close these issues.

perf: reduce write/read API latency — one-transaction flush + read-path bulk-field fetch

2 participants