Conversation

harshach (Collaborator) commented Jan 13, 2026

High connection churn - not connection leaks. With JDBI's onDemand() proxy pattern, connections are released after each DAO method call. But processing 100 events × 5 destinations means 500 individual INSERT statements, each acquiring its own connection, which creates heavy connection pool contention under load.


Changes Made

1. Batch Processing in AbstractEventConsumer

File: openmetadata-service/.../apps/bundles/changeEvent/AbstractEventConsumer.java

Key changes:

  • Added a successfulEvents list to collect events as they are sent
  • Modified publishEvents() to collect events instead of writing one record per event
  • Added a batchRecordSuccessfulEvents() method, called from commit()
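The collect-then-flush flow above can be sketched in miniature. Every name and the connection counting below are illustrative stand-ins, not the actual OpenMetadata implementation:

```java
import java.util.ArrayList;
import java.util.List;

public class BatchingSketch {
    // Counts how many times a "connection" would be acquired from the pool.
    static int connectionsAcquired = 0;

    // Old behavior: one connection per INSERT via the onDemand() proxy.
    static void writePerEvent(List<String> events) {
        for (String event : events) {
            connectionsAcquired++; // each DAO call acquires and releases a connection
        }
    }

    // New behavior: events are collected, then flushed once in commit().
    static final List<String> successfulEvents = new ArrayList<>();

    static void publishEvent(String event) {
        successfulEvents.add(event); // collect instead of writing immediately
    }

    static void commit() {
        if (!successfulEvents.isEmpty()) {
            connectionsAcquired++; // a single batched INSERT
            successfulEvents.clear();
        }
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>();
        for (int i = 0; i < 500; i++) events.add("event-" + i);

        writePerEvent(events);
        System.out.println("per-event connections: " + connectionsAcquired); // 500

        connectionsAcquired = 0;
        for (String event : events) publishEvent(event);
        commit();
        System.out.println("batched connections: " + connectionsAcquired); // 1
    }
}
```

The per-event path touches the pool once per INSERT, while the batched path touches it once per commit, which is where the ~500 → ~2 reduction in the table comes from.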

Impact
┌───────────────────────────────────────┬────────┬───────┐
│ Metric │ Before │ After │
├───────────────────────────────────────┼────────┼───────┤
│ DB connections per batch (100 events) │ ~500 │ ~2 │
├───────────────────────────────────────┼────────┼───────┤
│ Connection pool contention │ High │ Low │
├───────────────────────────────────────┼────────┼───────┤
│ N+1 queries in diagnostics │ Yes │ No │
└───────────────────────────────────────┴────────┴───────┘

2. Batch DAO Method

File: openmetadata-service/.../jdbi3/CollectionDAO.java

Added batchUpsertSuccessfulChangeEvents() using @ConnectionAwareSqlBatch for MySQL and PostgreSQL support.

3. N+1 Query Fix

File: openmetadata-service/.../events/scheduled/EventSubscriptionScheduler.java

  • Fixed getRelevantUnprocessedEvents() - fetch FilteringRules once before the loop instead of per-event
  • Removed 4 misleading @Transaction annotations (no-ops on non-SqlObject classes)
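The shape of the N+1 fix can be shown with a hypothetical DAO call counter; all names here are invented for illustration, not the real EventSubscriptionScheduler code:

```java
import java.util.ArrayList;
import java.util.List;

public class NPlusOneSketch {
    static int daoCalls = 0;

    // Stand-in for fetching a subscription's FilteringRules from the database.
    static String fetchFilteringRules() {
        daoCalls++;
        return "rules";
    }

    // Before: N+1 - the rules are fetched inside the loop, once per event.
    static void processBefore(List<String> events) {
        for (String event : events) {
            String rules = fetchFilteringRules();
            // ... apply rules to event ...
        }
    }

    // After: the rules are fetched once and reused for every event.
    static void processAfter(List<String> events) {
        String rules = fetchFilteringRules();
        for (String event : events) {
            // ... apply rules to event ...
        }
    }

    public static void main(String[] args) {
        List<String> events = new ArrayList<>(List.of("a", "b", "c"));
        processBefore(events);
        System.out.println("before: " + daoCalls + " DAO calls"); // 3
        daoCalls = 0;
        processAfter(events);
        System.out.println("after: " + daoCalls + " DAO call"); // 1
    }
}
```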

4. HTTP Timeout Caps

File: openmetadata-service/.../util/SubscriptionUtil.java

Added timeout caps to prevent runaway webhooks:

  • Connect: 5-30 seconds
  • Read: 10-120 seconds
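A minimal sketch of the clamping, assuming simple floor/ceiling semantics for the ranges quoted above; the real SubscriptionUtil logic may differ:

```java
public class TimeoutCaps {
    // Cap values taken from the PR description: connect 5-30s, read 10-120s.
    static long clampConnect(long seconds) {
        return Math.max(5, Math.min(30, seconds));
    }

    static long clampRead(long seconds) {
        return Math.max(10, Math.min(120, seconds));
    }

    public static void main(String[] args) {
        System.out.println(clampConnect(1));  // 5   - too-low values raised to the floor
        System.out.println(clampConnect(60)); // 30  - runaway values capped
        System.out.println(clampRead(300));   // 120
        System.out.println(clampRead(15));    // 15  - in-range values pass through
    }
}
```

Note that a later review finding below flags exactly this behavior: the clamping silently overrides caller-specified values, so logging when a value is changed is worth considering.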

Checklist:

  • I have read the CONTRIBUTING document.
  • My PR title is Fixes <issue-number>: <short explanation>
  • I have commented on my code, particularly in hard-to-understand areas.
  • For JSON Schema changes: I updated the migration scripts or explained why it is not needed.

harshach requested a review from a team as a code owner, Jan 13, 2026 07:10
github-actions bot added the backend and safe to test labels, Jan 13, 2026
harshach added the To release label (will cherry-pick this PR into the release branch), Jan 13, 2026
manerow previously approved these changes, Jan 14, 2026
gitar-bot commented Jan 15, 2026

🔍 CI failure analysis for c040a23: All CI failures (Python Bitbucket API, 20+ Playwright UI tests, SonarCloud timeouts, Collate builds) are unrelated to this PR's Java backend changes.

CI Failure Summary - All Unrelated to PR

This PR continues to face CI failures across multiple categories, all unrelated to the event subscription batch processing changes:

1. Python Tests - Bitbucket API (Earlier runs)

  • Issue: Persistent 503 Service Unavailable from Bitbucket API
  • Impact: 1 test failure in LookML Bitbucket reader
  • Cause: External infrastructure

2. Playwright UI Tests - Extensive Flakiness (Multiple runs)

  • Shard 2/6: 6 failures (Bulk Edit, Glossary, Impact Analysis)
  • Shard 3/6: 14+ failures (Settings, Explore, Metrics, Alerts, Panels, Forms)
  • Shard 3/6 (retry): Additional failures with element visibility issues
  • Root Cause: Flaky UI tests - timing-sensitive with toBeVisible(), toBeEnabled(), toBeAttached() failures
  • Impact: 20+ UI test failures across multiple shards

3. Maven SonarCloud CI (Earlier runs)

  • Issue: 3.5-hour runtime with workflow test timeouts
  • Cause: Same workflow timeout issues as PostgreSQL CI

4. Maven Collate CI (Persistent)

  • Issue: 2 consecutive failures in "Trigger Collate build & wait"
  • Cause: External Collate repository build failures
  • Impact: Blocking despite being unrelated to PR changes

PR Changes (All Approved)

This PR modifies only Java backend code:

  • AbstractEventConsumer.java - ✅ All review feedback addressed
  • CollectionDAO.java
  • SubscriptionUtil.java
  • EventSubscriptionScheduler.java

Does NOT modify: Python code, UI/frontend, Collate code, workflow definitions, or any code related to the failing tests.


Conclusion

All CI failures stem from:

  1. External API outages (Bitbucket)
  2. Flaky UI tests (20+ Playwright failures)
  3. Workflow test timeouts (SonarCloud)
  4. External dependency builds (Collate)

None are caused by this PR's event subscription optimizations.

Code Review ✅ Approved (5 of 5 findings resolved)

Solid performance improvement for EventSubscription batching with comprehensive tests. The implementation correctly addresses connection pool contention.

Resolved findings (5 of 5):
Edge Case: Redundant empty check in batchRecordSuccessfulEvents()

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/changeEvent/AbstractEventConsumer.java
The batchRecordSuccessfulEvents() method has an early return check for successfulEvents.isEmpty() at line 302, but this method is only called from commit() which already checks !successfulEvents.isEmpty() before invoking it.

While the redundant check doesn't cause any bugs (it's defensive programming), it adds unnecessary code complexity. Either remove the check from batchRecordSuccessfulEvents() since the caller guarantees the list is non-empty, or document that the method is designed to be called from multiple contexts where the empty check is needed.

This is a minor code quality issue and doesn't affect functionality.

Edge Case: Exception in batchRecordSuccessfulEvents prevents offset update

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/changeEvent/AbstractEventConsumer.java
In the commit() method, if batchRecordSuccessfulEvents() throws an exception (re-thrown after logging), the subsequent offset update and DAO upsert operations won't execute. While events were already sent to destinations, the offset won't advance, potentially causing:

  1. Duplicate event processing: On next job execution, the same events may be processed again since the offset wasn't saved
  2. Duplicate HTTP calls: This could result in duplicate webhook calls to external destinations

The current behavior logs the error and re-throws, which seems intentional to signal failure. However, consider whether the offset should still be updated to prevent reprocessing events that were already successfully sent to destinations, or implement idempotency at the destination level.

Suggested options:

  1. Wrap the batch write in a try-catch and proceed with offset update even on failure (accepting loss of success records but avoiding duplicate sends)
  2. Add a comment explaining the current failure semantics are intentional (events will be retried)
  3. Consider using a transaction that encompasses both the batch write and offset update

Bug: Missing error handling in batchRecordSuccessfulEvents

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/changeEvent/AbstractEventConsumer.java
The batchRecordSuccessfulEvents() method calls the DAO's batch insert without any try-catch. If the database operation fails (e.g., connection timeout, constraint violation, database unavailable), the exception will propagate up and successfulEvents.clear() in commit() won't be reached, but the events will still be lost because they were already processed as "successful" from the HTTP perspective.

Impact: If the batch insert fails after successfully sending alerts, those successful events won't be recorded. On retry, the events may be re-sent to destinations (duplicates) or never recorded properly, breaking the event tracking consistency.

Suggested fix: Wrap the batch insert in try-catch and implement a retry mechanism or fail-safe:

try {
  Entity.getCollectionDAO()
      .eventSubscriptionDAO()
      .batchUpsertSuccessfulChangeEvents(...);
} catch (Exception e) {
  LOG.error("Failed to batch record {} successful events for subscription {}", 
      successfulEvents.size(), subscriptionId, e);
  // Consider: retry logic, or at minimum log for operational visibility
  throw e; // or handle gracefully depending on requirements
}

Edge Case: Timeout capping silently changes caller-specified values

📄 openmetadata-service/src/main/java/org/openmetadata/service/util/SubscriptionUtil.java
The timeout capping logic silently modifies the caller's requested timeout values without any logging or indication. If a caller explicitly requests a 1-second connect timeout (perhaps for a fast-fail requirement) or a 300-second read timeout (for a known slow endpoint), their values will be silently changed to 5 or 120 seconds respectively.

Impact: Callers may not realize their timeout configurations are being overridden, leading to unexpected behavior. A fast-fail scenario might wait too long (5s vs 1s), or a legitimately slow webhook might fail prematurely.

Suggested fix: Add debug/warn logging when values are clamped:

if (connectTimeout < 5 || connectTimeout > 30) {
  LOG.debug("Connect timeout {} clamped to {} (valid range: 5-30s)", 
      connectTimeout, effectiveConnectTimeout);
}

Alternatively, document these limits in the API/configuration where timeouts are specified.

Edge Case: Thread safety issue with shared successfulEvents list

📄 openmetadata-service/src/main/java/org/openmetadata/service/apps/bundles/changeEvent/AbstractEventConsumer.java
The successfulEvents list is a shared instance field (private final List<ChangeEvent> successfulEvents = new ArrayList<>()), but ArrayList is not thread-safe. If publishEvents() is called concurrently from multiple threads, or if there's any concurrent access pattern, this could lead to data corruption, lost events, or ConcurrentModificationException.

Impact: Under concurrent load, successful events may be lost or duplicated, or the system could throw runtime exceptions.

Suggested fix: Either:

  1. Use a thread-safe collection: Collections.synchronizedList(new ArrayList<>()) or CopyOnWriteArrayList
  2. Use explicit synchronization when accessing the list
  3. If AbstractEventConsumer instances are guaranteed to be single-threaded (one instance per job), add a comment documenting this assumption

Given this is a job-based scheduler context, verify whether consumer instances are reused across concurrent job executions.
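Option 1 above can be sketched as follows. The class and the counts are illustrative, and this does not claim to mirror how AbstractEventConsumer is actually instantiated:

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

public class SafeListSketch {
    // Runs two writer threads against a synchronized wrapper and returns the final size.
    static int concurrentAddCount(int perThread) {
        // Individual add() calls are synchronized; compound operations still need a lock.
        List<String> successfulEvents = Collections.synchronizedList(new ArrayList<>());
        Runnable writer = () -> {
            for (int i = 0; i < perThread; i++) successfulEvents.add("event");
        };
        Thread t1 = new Thread(writer);
        Thread t2 = new Thread(writer);
        t1.start();
        t2.start();
        try {
            t1.join();
            t2.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return -1;
        }
        return successfulEvents.size();
    }

    public static void main(String[] args) {
        // With a plain ArrayList this count could be wrong or an exception thrown;
        // the synchronized wrapper keeps every add visible.
        System.out.println(concurrentAddCount(1000)); // 2000
    }
}
```

As the final review note observes, the actual code documented option 3: the scheduler's @DisallowConcurrentExecution guarantees single-threaded access per job, so the plain ArrayList is safe there.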

What Works Well

The batch processing approach effectively reduces database connections from ~500 to ~2 per batch. The N+1 query fix for FilteringRules is a valuable optimization. Thread-safety is properly documented with the reliance on @DisallowConcurrentExecution. Error handling in commit() is well-designed - it catches exceptions but still advances the offset to prevent duplicate HTTP calls.



harshach merged commit faf38d4 into main, Jan 15, 2026 (27 of 32 checks passed)
harshach deleted the event_subscription_db_connections branch, January 15, 2026 22:41
github-actions bot (Contributor) commented:

Failed to cherry-pick changes to the 1.11.6 branch.
Please cherry-pick the changes manually.
You can find more details here.

gitar-bot mentioned this pull request, Jan 16, 2026