Conversation
@fg91 fg91 commented Dec 18, 2025

Why are the changes needed?

In this minimal reproducible example ...

```python
from flytekit import task, workflow, map_task
import time

@task
def sleep(inp: int) -> None:
    time.sleep(inp)


@workflow
def wf() -> None:
    # Array node with long-running tasks. 30 is chosen to exceed the default max parallelism of 25.
    map_task(sleep)(inp=[3600 for _ in range(30)])
    sleep(inp=30)  # Task that completes quickly
```

... I observe the following behaviour, which in my opinion is undesirable:

  • Both the map task and the node after it start running.

  • The n1 pod succeeds quickly:

    (screenshot: the succeeded pod)

  • The n1 node stays in the Queued state in the Flyte UI despite the underlying pod already having succeeded:

    (screenshot: n1 shown as Queued in the Flyte UI)

  • The n1 node is updated in the Flyte UI from Queued to Succeeded only after the map task (or at least enough of the tasks within it) completes as well, which can be hours later.

Reason for this behaviour

  • In RecursiveNodeHandler, which traverses the workflow graph, we check whether the current degree of parallelism has reached the max parallelism:

```go
func (c *recursiveNodeExecutor) RecursiveNodeHandler(ctx context.Context, execContext executors.ExecutionContext,
	dag executors.DAGStructure, nl executors.NodeLookup, currentNode v1alpha1.ExecutableNode) (interfaces.NodeStatus, error) {
	...
	if IsMaxParallelismAchieved(ctx, currentNode, nodePhase, execContext) {
		...
		return interfaces.NodeStatusRunning, nil
	}
	...
}
```
  • For n1, IsMaxParallelismAchieved returns true:

    • In the example above, when evaluating n1, the current degree of parallelism is 31, which exceeds the default max parallelism of 25.
    • The degree of parallelism is incremented by 1 for the array node itself here and once for each of the 30 tasks in the map task here.
    • This means that the task phase of n1 is evaluated only once fewer than 25 tasks are still running within the map task n0 (see the sketch after this list).
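
To make the accounting concrete, here is a minimal, self-contained Go sketch of the arithmetic described above. `currentParallelism`, `isMaxParallelismAchieved`, and `defaultMaxParallelism` are illustrative stand-ins, not the actual flytepropeller implementation:

```go
package main

import "fmt"

// Illustrative constant mirroring the default max parallelism mentioned above.
const defaultMaxParallelism = 25

// currentParallelism mimics the counting in this example: the counter is
// incremented once for the array node itself and once per subtask.
func currentParallelism(arrayNodes, subtasksPerArrayNode int) int {
	return arrayNodes + arrayNodes*subtasksPerArrayNode
}

// isMaxParallelismAchieved stands in for the check in RecursiveNodeHandler.
func isMaxParallelismAchieved(current, maxParallelism int) bool {
	return maxParallelism > 0 && current >= maxParallelism
}

func main() {
	current := currentParallelism(1, 30)
	fmt.Println(current) // 31
	// 31 >= 25, so n1's evaluation is short-circuited with NodeStatusRunning
	// and it stays Queued in the UI although its pod already succeeded.
	fmt.Println(isMaxParallelismAchieved(current, defaultMaxParallelism)) // true
}
```

With 1 array node and 30 subtasks the counter reaches 31, so the check keeps short-circuiting n1's evaluation until enough subtasks finish.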

What changes were proposed in this pull request?

  • As a user, I would expect n1 to be marked as Succeeded immediately after the pod completes, not hours later once enough of the array node's tasks have completed.
  • Alternatively, if there is no "parallelism budget" left for n1, I would expect n1 not to start at all until n0 is done. But as a user I wouldn't expect this "mixed" behaviour, with a node that appears stuck for hours despite having completed.

Discussion

The behaviour can be avoided by modifying the parallelism tracking logic to count the map task as 1 rather than as 1 + 30 (in this example); a sketch of the two variants follows below. I would like to discuss which of the two is the intended behaviour.
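
To make the alternative concrete, here is a hedged, self-contained comparison of the two accounting variants using the numbers from the example; none of this is actual flytepropeller code:

```go
package main

import "fmt"

// Illustrative constant mirroring the default max parallelism mentioned above.
const defaultMaxParallelism = 25

func main() {
	const runningSubtasks = 30

	// Current accounting: 1 for the array node plus 1 per running subtask.
	current := 1 + runningSubtasks // 31

	// Alternative accounting: the array node counts as a single unit.
	alternative := 1

	fmt.Println(current >= defaultMaxParallelism)     // true  -> n1 is blocked
	fmt.Println(alternative >= defaultMaxParallelism) // false -> n1 could be evaluated
}
```

Counting the array node as a single unit would leave budget for n1; the trade-off is that max parallelism would then no longer bound the total number of concurrently running pods, which seems to be the crux of the question above.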

How was this patch tested?

Check all the applicable boxes

  • I updated the documentation accordingly.
  • All new and existing tests passed.
  • All commits are signed-off.

Related PRs

Docs link

… the max parallelism budget

Signed-off-by: Fabio Grätz <[email protected]>
@flyte-bot (Collaborator)

Bito Automatic Review Skipped - Draft PR

Bito didn't auto-review because this pull request is in draft status.
No action is needed if you didn't intend for the agent to review it. Otherwise, to manually trigger a review, type /review in a comment and save.


codecov bot commented Dec 18, 2025

Codecov Report

❌ Patch coverage is 66.66667% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 56.94%. Comparing base (be517bd) to head (d44e52f).

| Files with missing lines | Patch % | Lines |
|--------------------------|---------|-------|
| ...lytepropeller/pkg/controller/nodes/task/handler.go | 66.66% | 1 Missing and 2 partials ⚠️ |
Additional details and impacted files
```
@@           Coverage Diff           @@
##           master    #6809   +/-   ##
=======================================
  Coverage   56.93%   56.94%
=======================================
  Files         929      929
  Lines       58139    58146    +7
=======================================
+ Hits        33102    33110    +8
+ Misses      21996    21993    -3
- Partials     3041     3043    +2
```
| Flag | Coverage Δ |
|------|------------|
| unittests-datacatalog | 53.51% <ø> (ø) |
| unittests-flyteadmin | 53.14% <ø> (+0.03%) ⬆️ |
| unittests-flytecopilot | 43.06% <ø> (ø) |
| unittests-flytectl | 64.02% <ø> (-0.06%) ⬇️ |
| unittests-flyteidl | 75.71% <ø> (ø) |
| unittests-flyteplugins | 60.13% <ø> (ø) |
| unittests-flytepropeller | 53.54% <66.66%> (+<0.01%) ⬆️ |
| unittests-flytestdlib | 63.29% <ø> (+0.02%) ⬆️ |

Flags with carried forward coverage won't be shown.


@fg91 fg91 requested review from Sovietaced and hamersaw December 18, 2025 11:25