[refactor](be) Split paused query handling logic by mrhhsg · Pull Request #62485 · apache/doris

mrhhsg · 2026-04-14T07:45:33Z

Issue Number: None

Related PR: None

Problem Summary: Refactor WorkloadGroupMgr::handle_paused_queries by splitting cleanup, revoke-resume, per-workload-group processing, and memory-release paths into focused helpers while preserving the existing paused query behavior.

None

Test: Unit Test
- ./run-be-ut.sh --run --filter=WorkloadGroupManagerTest.* -j${DORIS_PARALLELISM}
Behavior changed: Yes
Does this need documentation: No

What problem does this PR solve?

Issue Number: close #xxx

Related PR: #xxx

Problem Summary:

Release note

None

Check List (For Author)

Test
- Regression test
- Unit Test
- Manual test (add detailed scripts or steps below)
- No need to test or manual test. Explain why:
  - This is a refactor/code format and no logic has been changed.
  - Previous test can cover this change.
  - No code files have been changed.
  - Other reason
Behavior changed:
- No.
- Yes.
Does this need documentation?
- No.
- Yes.

Check List (For Reviewer who merge this PR)

Confirm the release note
Confirm test cases
Confirm document
Add branch pick label

Issue Number: None Related PR: None Problem Summary: Refactor WorkloadGroupMgr::handle_paused_queries by splitting cleanup, revoke-resume, per-workload-group processing, and memory-release paths into focused helpers while preserving the existing paused query behavior. None - Test: Unit Test - `./run-be-ut.sh --run --filter=WorkloadGroupManagerTest.* -j${DORIS_PARALLELISM}` - Behavior changed: No - Does this need documentation: No

mrhhsg · 2026-04-14T07:45:47Z

/review

hello-stephen · 2026-04-14T07:45:50Z

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

What problem was fixed (it's best to include specific error reporting information). How it was fixed.
Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
What features were added. Why was this function added?
Which code was refactored and why was this part of the code refactored?
Which functions were optimized and what is the difference before and after the optimization?

Copilot

Pull request overview

Refactors BE WorkloadGroupMgr::handle_paused_queries() by splitting it into smaller helper routines for cleanup, revoke/resume handling, and per-workload-group processing, and adds a unit test ensuring empty workload groups aren’t inserted into the paused-query map.

Changes:

Split handle_paused_queries() into focused helper methods (cleanup_*, resume_*, process_*, etc.) and introduced supporting types/aliases.
Adjusted paused-query lifecycle management (cleanup/prune, revoke-resume flow, per-WG iteration) to work without pre-populating empty WG entries.
Added a BE unit test asserting handle_paused_queries() does not create empty per-WG paused-query entries.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
be/test/runtime/workload_group/workload_group_manager_test.cpp	Adds a test asserting empty workload groups are ignored by paused-query handling.
be/src/runtime/workload_group/workload_group_manager.h	Adds helper declarations/types for the refactor; updates paused-query container type aliasing.
be/src/runtime/workload_group/workload_group_manager.cpp	Implements the refactor into helper functions; adds new timeout-driven cancellation paths (not purely a refactor).

Comments suppressed due to low confidence (1)

be/src/runtime/workload_group/workload_group_manager.cpp:342

This comment refers to “Phase 4”, but handle_paused_queries() is now split into Phase 1–3 helpers. Please update/remove the Phase 4 reference so the code comments reflect the current structure.

                } else {
                    // Cancelled long ago — no longer releasing memory. Remove it now to
                    // avoid unnecessary processing in Phase 4.
                    LOG(INFO) << "Query: " << query_it->query_id()

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Copilot · 2026-04-14T07:50:15Z

be/src/runtime/workload_group/workload_group_manager.h

 #include <stdint.h>

+#include <map>
+#include <set>
 #include <shared_mutex>
 #include <unordered_map>


PausedQuery uses std::chrono types/functions, but this header doesn't include <chrono>. Please include <chrono> here to keep the header self-contained and avoid relying on transitive includes.

Copilot · 2026-04-14T07:50:15Z

be/src/runtime/workload_group/workload_group_manager.h

+    // handle_paused_queries() is the top-level entry called periodically. It runs
+    // four phases in order:
+    //   Phase 1 (cleanup):  cleanup_paused_queries_()
+    //   Phase 2 (revoke):   resume_paused_queries_after_revoke_()
+    //   Phase 3 (process):  process_paused_queries_()
+    //


The comment says handle_paused_queries() runs “four phases”, but only three phases are listed (Phase 1–3) and referenced below. Please update the comment to match the actual phases to avoid confusion during future maintenance.

Copilot · 2026-04-14T07:50:16Z

be/src/runtime/workload_group/workload_group_manager.cpp

+            if (time_in_queue > config::spill_in_paused_queue_timeout_ms) {
+                // Timed out waiting for WG memory to drop. Cancel the query.
+                Status error_status = Status::MemoryLimitExceeded(
+                        "Query {} paused due to WG memory exceeded with no revocable memory, "
+                        "timed out after {}ms waiting for memory recovery.",
+                        query_id, time_in_queue);
+                requestor->task_controller()->cancel(error_status);
+                return true;
+            }


This introduces a new behavior: queries paused due to WORKLOAD_GROUP_MEMORY_EXCEEDED (and still under their own limit) will now be cancelled after spill_in_paused_queue_timeout_ms. That is not a pure refactor and conflicts with the PR description (“Behavior changed: No”). Please either revert this timeout-cancel behavior or update the PR description and add/adjust unit tests covering the timeout cancellation path.

Copilot · 2026-04-14T07:50:16Z

be/src/runtime/workload_group/workload_group_manager.cpp

+    // Fallback: if we have waited too long and cannot revoke from anywhere,
+    // cancel the query or disable reserve memory to let it proceed.
+    if (query_it->elapsed_time() > config::spill_in_paused_queue_timeout_ms) {
+        // Cannot spill (no revocable memory), cannot revoke from other WGs,
+        // and process memory is still exceeded. Cancel the query to protect the system.
+        return release_query_memory_(queries_list, query_it, resource_ctx, true);
+    }


This adds a new timeout-triggered cancellation path for PROCESS_MEMORY_EXCEEDED when no revoke/spill succeeds. This is behavior change vs “refactor only” and should be called out explicitly (PR description + release note if needed) and covered by a unit test to ensure we don’t start cancelling long-waiting queries unexpectedly.

github-actions

One blocking regression found.

be/src/runtime/workload_group/workload_group_manager.cpp: the new timeout branch in handle_single_query_() changes WORKLOAD_GROUP_MEMORY_EXCEEDED handling for the memory_usage <= limit case from indefinite waiting to forced cancellation after spill_in_paused_queue_timeout_ms (60s by default). That is not a no-op refactor: these are exactly the queries that are still within their own query mem limit and are only waiting for workload-group pressure to ease. Before this patch they stayed paused until the WG recovered or another path adjusted limits; after this patch they fail permanently even though no per-query invariant was violated. The existing tests added in this PR only cover the new happy-path refactor branches and do not protect against this behavior change.

Critical checkpoint conclusions:

Goal of the task: Refactor paused-query handling into smaller helpers while preserving behavior. Conclusion: not satisfied, because the new timeout cancellation changes WG-memory-exceeded semantics.
Is the modification small, clear, and focused: mostly yes structurally, but no behaviorally because a refactor PR now contains a user-visible policy change.
Concurrency: _paused_queries_lock still serializes paused-query list access and the refactor reduces lock nesting by removing the _group_mutex prepopulation block. I did not find a new lock-order issue in the refactored helpers.
Lifecycle management: weak ResourceContext cleanup remains correct; no new static-init or ownership issue found.
Configuration items: no new config added. Existing spill_in_paused_queue_timeout_ms is now applied to a new code path, which changes runtime behavior without corresponding justification or compatibility discussion.
Incompatible changes: no ABI/storage compatibility issue found.
Parallel code paths: QUERY / PROCESS / WG paused-reason paths were split cleanly, but the WG path no longer matches the previous behavior.
Special conditional checks: the new timeout check lacks proof that cancellation is correct for queries still under their own limit.
Test coverage: insufficient for the behavior change. There is no test asserting that WORKLOAD_GROUP_MEMORY_EXCEEDED queries under their own limit continue waiting instead of being cancelled after timeout.
Test result modifications: only one small test was added in the actual PR diff; it does not cover the new timeout branch.
Observability: existing logging is adequate for this path.
Transaction/persistence and FE/BE variable passing: not applicable here.
Data-write / transactionality: not applicable here.
Performance: the helper split is fine; no blocking performance issue found.
Other issues: none beyond the blocking regression above.

Please keep this PR behavior-preserving by removing the timeout cancellation from the WORKLOAD_GROUP_MEMORY_EXCEEDED && memory_usage <= limit branch, or justify it as an intentional behavior change and add targeted coverage for that new policy.

github-actions · 2026-04-14T07:52:14Z

be/src/runtime/workload_group/workload_group_manager.cpp

            return true;
        } else if (memory_usage <= limit) {
+            if (time_in_queue > config::spill_in_paused_queue_timeout_ms) {
+                // Timed out waiting for WG memory to drop. Cancel the query.


This is no longer a pure refactor. In the WORKLOAD_GROUP_MEMORY_EXCEEDED path, the memory_usage <= limit branch used to keep the query paused until WG pressure eased; after this addition it cancels the query after spill_in_paused_queue_timeout_ms (60s by default). That changes user-visible behavior for queries that are still within their own query limit and only waiting on workload-group pressure.

Concrete path: handle_workload_group_memory_exceeded_() routes slot_memory_policy == NONE here via release_query_memory_() -> handle_single_query_(). When there is no revocable memory and wg->check_mem_used(size_to_reserve, ...) still reports exceed_high_watermark, the old code returned false and kept waiting. Now the same query is cancelled once time_in_queue crosses the timeout.

If the intention is to preserve behavior, this timeout branch should be removed. If the intention is to change policy, the PR title/description and tests need to be updated to cover and justify the cancellation semantics explicitly.

mrhhsg · 2026-04-14T07:55:35Z

run buildall

hello-stephen · 2026-04-14T10:08:15Z

BE Regression && UT Coverage Report

Increment line coverage 87.05% (168/193) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	73.61% (27377/37193)
Line Coverage	57.24% (295392/516101)
Region Coverage	54.31% (245375/451770)
Branch Coverage	56.07% (106532/189999)

mrhhsg requested a review from Copilot April 14, 2026 07:45

Copilot started reviewing on behalf of mrhhsg April 14, 2026 07:46 View session

Copilot AI reviewed Apr 14, 2026

View reviewed changes

github-actions bot requested changes Apr 14, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[refactor](be) Split paused query handling logic#62485

[refactor](be) Split paused query handling logic#62485
mrhhsg wants to merge 1 commit intoapache:masterfrom
mrhhsg:refactor_spill

mrhhsg commented Apr 14, 2026 •

edited

Loading

Uh oh!

mrhhsg commented Apr 14, 2026

Uh oh!

hello-stephen commented Apr 14, 2026

Uh oh!

Copilot AI left a comment

Uh oh!

Copilot AI Apr 14, 2026

Uh oh!

Copilot AI Apr 14, 2026

Uh oh!

Copilot AI Apr 14, 2026

Uh oh!

Copilot AI Apr 14, 2026

Uh oh!

github-actions bot left a comment

Uh oh!

github-actions bot Apr 14, 2026

Uh oh!

mrhhsg commented Apr 14, 2026

Uh oh!

hello-stephen commented Apr 14, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

mrhhsg commented Apr 14, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

What problem does this PR solve?

Release note

Check List (For Author)

Check List (For Reviewer who merge this PR)

Uh oh!

mrhhsg commented Apr 14, 2026

Uh oh!

hello-stephen commented Apr 14, 2026

Uh oh!

Copilot AI left a comment

Choose a reason for hiding this comment

Pull request overview

Reviewed changes

Uh oh!

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

Copilot AI Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

github-actions bot left a comment

Choose a reason for hiding this comment

Uh oh!

github-actions bot Apr 14, 2026

Choose a reason for hiding this comment

Uh oh!

mrhhsg commented Apr 14, 2026

Uh oh!

hello-stephen commented Apr 14, 2026

BE Regression && UT Coverage Report

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

mrhhsg commented Apr 14, 2026 •

edited

Loading