Conversation

harshach (Collaborator) commented Jan 13, 2026

High connection churn - not connection leaks. With JDBI's onDemand() proxy pattern, connections are released after each DAO method call. But processing 100 events × 5 destinations means 500 individual INSERT statements, each acquiring its own connection, which creates heavy connection pool contention under load.


Changes Made

1. Batch Processing in AbstractEventConsumer

File: openmetadata-service/.../apps/bundles/changeEvent/AbstractEventConsumer.java

Key changes:

  • Added a successfulEvents list to collect events as they are sent
  • Modified publishEvents() to collect events instead of writing one record per event
  • Added a batchRecordSuccessfulEvents() method, called from commit()
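The collect-then-flush flow above can be sketched in miniature. Every name and the connection counting below are illustrative stand-ins, not the actual OpenMetadata implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSketch {
    // Counts how many times a "connection" would be acquired from the pool.
    static int connectionsAcquired = 0;

    // Old behavior: one connection per INSERT via the onDemand() proxy.
    static void writePerEvent(List<String> events) {
        for (String event : events) {
            connectionsAcquired++; // each DAO call acquires and releases a connection
        }
    }

    // New behavior: events are collected, then flushed once in commit().
    static final List<String> successfulEvents = new ArrayList<>();

    static void publishEvent(String event) {
        successfulEvents.add(event); // collect instead of writing immediately
    }

    static void commit() {
        if (!successfulEvents.isEmpty()) {
            connectionsAcquired++; // a single batched INSERT
            successfulEvents.clear();
        }
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        for (int i = 0; i < 500; i++) events.add("event-" + i);

        writePerEvent(events);
        System.out.println("per-event connections: " + connectionsAcquired); // 500

        connectionsAcquired = 0;
        for (String event : events) publishEvent(event);
        commit();
        System.out.println("batched connections: " + connectionsAcquired); // 1
    }
}
```

The per-event path touches the pool once per INSERT, while the batched path touches it once per commit, which is where the ~500 → ~2 reduction in the table comes from.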

Impact
┌───────────────────────────────────────┬────────┬───────┐
│ Metric │ Before │ After │
├───────────────────────────────────────┼────────┼───────┤
│ DB connections per batch (100 events) │ ~500 │ ~2 │
├───────────────────────────────────────┼────────┼───────┤
│ Connection pool contention │ High │ Low │
├───────────────────────────────────────┼────────┼───────┤
│ N+1 queries in diagnostics │ Yes │ No │
└───────────────────────────────────────┴────────┴───────┘

2. Batch DAO Method

File: openmetadata-service/.../jdbi3/CollectionDAO.java

Added batchUpsertSuccessfulChangeEvents() using @ConnectionAwareSqlBatch for MySQL and PostgreSQL support.

3. N+1 Query Fix

File: openmetadata-service/.../events/scheduled/EventSubscriptionScheduler.java

  • Fixed getRelevantUnprocessedEvents() - fetch FilteringRules once before the loop instead of per-event
  • Removed 4 misleading @Transaction annotations (no-ops on non-SqlObject classes)
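The shape of the N+1 fix can be shown with a hypothetical DAO call counter; all names here are invented for illustration, not the real EventSubscriptionScheduler code:

```java
import java.util.ArrayList;
import java.util.List;

public class NPlusOneSketch {
    static int daoCalls = 0;

    // Stand-in for fetching a subscription's FilteringRules from the database.
    static String fetchFilteringRules() {
        daoCalls++;
        return "rules";
    }

    // Before: N+1 - the rules are fetched inside the loop, once per event.
    static void processBefore(List<String> events) {
        for (String event : events) {
            String rules = fetchFilteringRules();
            // ... apply rules to event ...
        }
    }

    // After: the rules are fetched once and reused for every event.
    static void processAfter(List<String> events) {
        String rules = fetchFilteringRules();
        for (String event : events) {
            // ... apply rules to event ...
        }
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>(List.of("a", "b", "c"));
        processBefore(events);
        System.out.println("before: " + daoCalls + " DAO calls"); // 3
        daoCalls = 0;
        processAfter(events);
        System.out.println("after: " + daoCalls + " DAO call"); // 1
    }
}
```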

4. HTTP Timeout Caps

File: openmetadata-service/.../util/SubscriptionUtil.java

Added timeout caps to prevent runaway webhooks:

  • Connect: 5-30 seconds
  • Read: 10-120 seconds
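A minimal sketch of the clamping, assuming simple floor/ceiling semantics for the ranges quoted above; the real SubscriptionUtil logic may differ:

```java
public class TimeoutCaps {
    // Cap values taken from the PR description: connect 5-30s, read 10-120s.
    static long clampConnect(long seconds) {
        return Math.max(5, Math.min(30, seconds));
    }

    static long clampRead(long seconds) {
        return Math.max(10, Math.min(120, seconds));
    }

    public static void main(String[] args) {
        System.out.println(clampConnect(1));  // 5   - too-low values raised to the floor
        System.out.println(clampConnect(60)); // 30  - runaway values capped
        System.out.println(clampRead(300));   // 120
        System.out.println(clampRead(15));    // 15  - in-range values pass through
    }
}
```

Note that a later review finding below flags exactly this behavior: the clamping silently overrides caller-specified values, so logging when a value is changed is worth considering.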

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

harshach requested a review from a team as a code owner, Jan 13, 2026 07:10
github-actions bot added the backend and safe to test labels, Jan 13, 2026
harshach added the To release label (will cherry-pick this PR into the release branch), Jan 13, 2026
manerow previously approved these changes, Jan 14, 2026
gitar-bot commented Jan 15, 2026

🔍 CI failure analysis for c040a23: All CI failures (Python Bitbucket API, 20+ Playwright UI tests, SonarCloud timeouts, Collate builds) are unrelated to this PR's Java backend changes.

CI Failure Summary - All Unrelated to PR

This PR continues to face CI failures across multiple categories, all unrelated to the event subscription batch processing changes:

1. Python Tests - Bitbucket API (Earlier runs)

  • Issue: Persistent 503 Service Unavailable from Bitbucket API
  • Impact: 1 test failure in LookML Bitbucket reader
  • Cause: External infrastructure

2. Playwright UI Tests - Extensive Flakiness (Multiple runs)

  • Shard 2/6: 6 failures (Bulk Edit, Glossary, Impact Analysis)
  • Shard 3/6: 14+ failures (Settings, Explore, Metrics, Alerts, Panels, Forms)
  • Shard 3/6 (retry): Additional failures with element visibility issues
  • Root Cause: Flaky UI tests - timing-sensitive with toBeVisible(), toBeEnabled(), toBeAttached() failures
  • Impact: 20+ UI test failures across multiple shards

3. Maven SonarCloud CI (Earlier runs)

  • Issue: 3.5-hour runtime with workflow test timeouts
  • Cause: Same workflow timeout issues as PostgreSQL CI

4. Maven Collate CI (Persistent)

  • Issue: 2 consecutive failures in "Trigger Collate build & wait"
  • Cause: External Collate repository build failures
  • Impact: Blocking despite being unrelated to PR changes

PR Changes (All Approved)

This PR modifies only Java backend code:

  • AbstractEventConsumer.java - ✅ All review feedback addressed
  • CollectionDAO.java
  • SubscriptionUtil.java
  • EventSubscriptionScheduler.java

Does NOT modify: Python code, UI/frontend, Collate code, workflow definitions, or any code related to the failing tests.


Conclusion

All CI failures stem from:

  1. External API outages (Bitbucket)
  2. Flaky UI tests (20+ Playwright failures)
  3. Workflow test timeouts (SonarCloud)
  4. External dependency builds (Collate)

None are caused by this PR's event subscription optimizations.

Code Review ✅ Approved (5 of 5 findings resolved)

Solid performance improvement for EventSubscription batching with comprehensive tests. The implementation correctly addresses connection pool contention.

Resolved findings (5 of 5):
Edge Case: Redundant empty check in batchRecordSuccessfulEvents()

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/changeEvent/AbstractEventConsumer.java
The batchRecordSuccessfulEvents() method has an early return check for successfulEvents.isEmpty() at line 302, but this method is only called from commit() which already checks !successfulEvents.isEmpty() before invoking it.

While the redundant check doesn't cause any bugs (it's defensive programming), it adds unnecessary code complexity. Either remove the check from batchRecordSuccessfulEvents() since the caller guarantees the list is non-empty, or document that the method is designed to be called from multiple contexts where the empty check is needed.

This is a minor code quality issue and doesn't affect functionality.

Edge Case: Exception in batchRecordSuccessfulEvents prevents offset update

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/changeEvent/AbstractEventConsumer.java
In the commit() method, if batchRecordSuccessfulEvents() throws an exception (re-thrown after logging), the subsequent offset update and DAO upsert operations won't execute. While events were already sent to destinations, the offset won't advance, potentially causing:

  1. Duplicate event processing: On next job execution, the same events may be processed again since the offset wasn't saved
  2. Duplicate HTTP calls: This could result in duplicate webhook calls to external destinations

The current behavior logs the error and re-throws, which seems intentional to signal failure. However, consider whether the offset should still be updated to prevent reprocessing events that were already successfully sent to destinations, or implement idempotency at the destination level.

Suggested options:

  1. Wrap the batch write in a try-catch and proceed with offset update even on failure (accepting loss of success records but avoiding duplicate sends)
  2. Add a comment explaining the current failure semantics are intentional (events will be retried)
  3. Consider using a transaction that encompasses both the batch write and offset update

Bug: Missing error handling in batchRecordSuccessfulEvents

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/changeEvent/AbstractEventConsumer.java
The batchRecordSuccessfulEvents() method calls the DAO's batch insert without any try-catch. If the database operation fails (e.g., connection timeout, constraint violation, database unavailable), the exception will propagate up and successfulEvents.clear() in commit() won't be reached, but the events will still be lost because they were already processed as "successful" from the HTTP perspective.

Impact: If the batch insert fails after successfully sending alerts, those successful events won't be recorded. On retry, the events may be re-sent to destinations (duplicates) or never recorded properly, breaking the event tracking consistency.

Suggested fix: Wrap the batch insert in try-catch and implement a retry mechanism or fail-safe:

try {
  Entity.getCollectionDAO()
      .eventSubscriptionDAO()
      .batchUpsertSuccessfulChangeEvents(...);
} catch (Exception e) {
  LOG.error("Failed to batch record {} successful events for subscription {}", 
      successfulEvents.size(), subscriptionId, e);
  // Consider: retry logic, or at minimum log for operational visibility
  throw e; // or handle gracefully depending on requirements
}

Edge Case: Timeout capping silently changes caller-specified values

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/SubscriptionUtil.java
The timeout capping logic silently modifies the caller's requested timeout values without any logging or indication. If a caller explicitly requests a 1-second connect timeout (perhaps for a fast-fail requirement) or a 300-second read timeout (for a known slow endpoint), their values will be silently changed to 5 or 120 seconds respectively.

Impact: Callers may not realize their timeout configurations are being overridden, leading to unexpected behavior. A fast-fail scenario might wait too long (5s vs 1s), or a legitimately slow webhook might fail prematurely.

Suggested fix: Add debug/warn logging when values are clamped:

if (connectTimeout < 5 || connectTimeout > 30) {
  LOG.debug("Connect timeout {} clamped to {} (valid range: 5-30s)", 
      connectTimeout, effectiveConnectTimeout);
}

Alternatively, document these limits in the API/configuration where timeouts are specified.

Edge Case: Thread safety issue with shared successfulEvents list

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/changeEvent/AbstractEventConsumer.java
The successfulEvents list is a shared instance field (private final List<ChangeEvent> successfulEvents = new ArrayList<>()), but ArrayList is not thread-safe. If publishEvents() is called concurrently from multiple threads, or if there's any concurrent access pattern, this could lead to data corruption, lost events, or ConcurrentModificationException.

Impact: Under concurrent load, successful events may be lost or duplicated, or the system could throw runtime exceptions.

Suggested fix: Either:

  1. Use a thread-safe collection: Collections.synchronizedList(new ArrayList<>()) or CopyOnWriteArrayList
  2. Use explicit synchronization when accessing the list
  3. If AbstractEventConsumer instances are guaranteed to be single-threaded (one instance per job), add a comment documenting this assumption

Given this is a job-based scheduler context, verify whether consumer instances are reused across concurrent job executions.
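Option 1 above can be sketched as follows. The class and the counts are illustrative, and this does not claim to mirror how AbstractEventConsumer is actually instantiated:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SafeListSketch {
    // Runs two writer threads against a synchronized wrapper and returns the final size.
    static int concurrentAddCount(int perThread) {
        // Individual add() calls are synchronized; compound operations still need a lock.
        List<String> successfulEvents = Collections.synchronizedList(new ArrayList<>());
        Runnable writer = () -> {
            for (int i = 0; i < perThread; i++) successfulEvents.add("event");
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        return successfulEvents.size();
    }

    public static void main(String[] args) {
        // With a plain ArrayList this count could be wrong or an exception thrown;
        // the synchronized wrapper keeps every add visible.
        System.out.println(concurrentAddCount(1000)); // 2000
    }
}
```

As the final review note observes, the actual code documented option 3: the scheduler's @DisallowConcurrentExecution guarantees single-threaded access per job, so the plain ArrayList is safe there.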

What Works Well

The batch processing approach effectively reduces database connections from ~500 to ~2 per batch. The N+1 query fix for FilteringRules is a valuable optimization. Thread-safety is properly documented with the reliance on @DisallowConcurrentExecution. Error handling in commit() is well-designed - it catches exceptions but still advances the offset to prevent duplicate HTTP calls.



harshach merged commit faf38d4 into main, Jan 15, 2026 (27 of 32 checks passed)
harshach deleted the event_subscription_db_connections branch, January 15, 2026 22:41
github-actions bot (Contributor) commented:

Failed to cherry-pick changes to the 1.11.6 branch.
Please cherry-pick the changes manually.
You can find more details here.

gitar-bot mentioned this pull request, Jan 16, 2026