WIP: Fix stuck workflow nodes that are starved by map tasks exceeding the max parallelism budget #6809
Why are the changes needed?
In this minimal reproducible example ...
... I observe the following, in my opinion undesirable, behaviour:

- Both the map task and the node after it start running.
- The `n1` pod succeeds quickly:
- The `n1` node stays in `Queued` state in the Flyte UI despite the underlying pod already having succeeded:
- The `n1` node is updated in the Flyte UI from `Queued` to `Succeeded` only after the map task (or at least enough tasks within it) completes as well, which could be hours later.
Reason for this behaviour
In the `RecursiveNodeHandler`, which traverses the workflow graph, we check whether the current degree of parallelism has exceeded the max parallelism:

For `n1`, `IsMaxParallelismAchieved` gives `true`: the current degree of parallelism takes a value of `31`, which is bigger than the default max parallelism of `25`, so `n1` will be evaluated only once fewer than 25 tasks are still running within the map task `n0`. A rough sketch of this gating step is shown below.
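The following is a minimal, self-contained Go sketch of the effect, not the flytepropeller implementation; the node struct, the unit counting, and the traversal loop are simplified assumptions made only to illustrate how a parallelism gate can leave an already-finished node stuck in `Queued`:

```go
package main

import "fmt"

// Minimal sketch (not flytepropeller code): a workflow node with the number of
// running "units" it contributes to the current degree of parallelism.
type node struct {
	id           string
	runningUnits int  // e.g. 1 for a plain task, 1 + <active subtasks> for a map task
	podSucceeded bool // the underlying pod has already finished
}

const maxParallelism = 25 // default budget assumed in this description

// evaluateWorkflow mimics the recursive traversal: before handling a node, it
// checks whether the parallelism budget is already exhausted and, if so, skips
// the node in this evaluation round.
func evaluateWorkflow(nodes []node) {
	currentParallelism := 0
	for _, n := range nodes {
		if currentParallelism >= maxParallelism {
			// n1 lands here: its pod is done, but it is never re-evaluated,
			// so its phase stays Queued in the UI.
			fmt.Printf("%s: skipped (parallelism %d >= %d), phase stays Queued\n",
				n.id, currentParallelism, maxParallelism)
			continue
		}
		if n.podSucceeded {
			fmt.Printf("%s: marked Succeeded\n", n.id)
		} else {
			fmt.Printf("%s: still running\n", n.id)
		}
		currentParallelism += n.runningUnits
	}
}

func main() {
	// n0 is a map task with 30 running subtasks, counted as 1 + 30 = 31.
	// n1's pod has already succeeded, but the budget is exhausted before n1 is reached.
	evaluateWorkflow([]node{
		{id: "n0", runningUnits: 31},
		{id: "n1", runningUnits: 1, podSucceeded: true},
	})
}
```

With the map task counted as 31 running units, the budget is exhausted before `n1` is visited, so `n1` is skipped in every evaluation round until enough subtasks of `n0` finish.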
What changes were proposed in this pull request?
I would expect `n1` to be marked as succeeded immediately after the pod completes, not hours later when enough of the array node tasks complete. If max parallelism is supposed to also hold back `n1`, I would expect `n1` to not start at all until `n0` is done. But as a user I wouldn't expect this "mixed" behaviour with a node that is seemingly stuck for hours despite having completed.
Discussion
The behaviour can be avoided by modifying the parallelism tracking logic to count the map task as `1` and not as `1 + 30` (in this example); the sketch below contrasts the two options. I would like to discuss which of the two is the intended behaviour.
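As a rough illustration of the two counting policies (again a simplified sketch with a hypothetical helper, not the propeller implementation):

```go
package main

import "fmt"

// parallelismContribution is a hypothetical helper showing how much a map task
// with activeSubtasks running subtasks could count against the workflow-level
// parallelism budget under each policy.
func parallelismContribution(activeSubtasks int, countSubtasks bool) int {
	if countSubtasks {
		// Behaviour observed in this issue: the array node itself plus every
		// running subtask counts against the budget.
		return 1 + activeSubtasks
	}
	// Alternative: the whole map task counts as a single unit, so a large map
	// task cannot exhaust the budget and starve sibling nodes like n1.
	return 1
}

func main() {
	const maxParallelism = 25
	activeSubtasks := 30

	for _, countSubtasks := range []bool{true, false} {
		current := parallelismContribution(activeSubtasks, countSubtasks)
		starved := current >= maxParallelism
		fmt.Printf("count subtasks=%v -> current parallelism %d, n1 starved: %v\n",
			countSubtasks, current, starved)
	}
}
```

Counting the map task as a single unit would keep `n1` schedulable, while counting every running subtask preserves a workflow-level cap on concurrently running pods; which of these matches the intended semantics of max parallelism is the open question.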
How was this patch tested?
Check all the applicable boxes
Related PRs
Docs link