WIP: Fix stuck workflow nodes that are starved by map tasks exceeding the max parallelism budget #6809
Why are the changes needed?
In this minimal reproducible example ...
... I observe the following, in my opinion undesirable, behaviour:

- Both the map task and the node after it start running.
- The `n1` pod succeeds quickly:
- The `n1` node stays in `Queued` state in the Flyte UI despite the underlying pod already having succeeded:
- The `n1` node is updated in the Flyte UI from `Queued` to `Succeeded` only after the map task (or at least enough tasks within it) completes as well, which could be hours later.
Reason for this behaviour
In the `RecursiveNodeHandler`, which traverses the workflow graph, we check whether the current degree of parallelism has exceeded the max parallelism:

For `n1`, `IsMaxParallelismAchieved` gives `true`: the current degree of parallelism takes a value of `31`, which is bigger than the default max parallelism of `25`, so `n1` will be evaluated only once fewer than 25 tasks are still running within the map task `n0`. A rough sketch of this gating step is shown below.
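The following is a minimal, self-contained Go sketch of the effect, not the flytepropeller implementation; the node struct, the unit counting, and the traversal loop are simplified assumptions made only to illustrate how a parallelism gate can leave an already-finished node stuck in `Queued`:

```go
package main

import "fmt"

// Minimal sketch (not flytepropeller code): a workflow node with the number of
// running "units" it contributes to the current degree of parallelism.
type node struct {
	id           string
	runningUnits int  // e.g. 1 for a plain task, 1 + <active subtasks> for a map task
	podSucceeded bool // the underlying pod has already finished
}

const maxParallelism = 25 // default budget assumed in this description

// evaluateWorkflow mimics the recursive traversal: before handling a node, it
// checks whether the parallelism budget is already exhausted and, if so, skips
// the node in this evaluation round.
func evaluateWorkflow(nodes []node) {
	currentParallelism := 0
	for _, n := range nodes {
		if currentParallelism >= maxParallelism {
			// n1 lands here: its pod is done, but it is never re-evaluated,
			// so its phase stays Queued in the UI.
			fmt.Printf("%s: skipped (parallelism %d >= %d), phase stays Queued\n",
				n.id, currentParallelism, maxParallelism)
			continue
		}
		if n.podSucceeded {
			fmt.Printf("%s: marked Succeeded\n", n.id)
		} else {
			fmt.Printf("%s: still running\n", n.id)
		}
		currentParallelism += n.runningUnits
	}
}

func main() {
	// n0 is a map task with 30 running subtasks, counted as 1 + 30 = 31.
	// n1's pod has already succeeded, but the budget is exhausted before n1 is reached.
	evaluateWorkflow([]node{
		{id: "n0", runningUnits: 31},
		{id: "n1", runningUnits: 1, podSucceeded: true},
	})
}
```

With the map task counted as 31 running units, the budget is exhausted before `n1` is visited, so `n1` is skipped in every evaluation round until enough subtasks of `n0` finish.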
What changes were proposed in this pull request?
I would expect `n1` to be marked as succeeded immediately after the pod completes, not hours later when enough of the array node tasks complete. If max parallelism is supposed to also hold back `n1`, I would expect `n1` to not start at all until `n0` is done. But as a user I wouldn't expect this "mixed" behaviour with a node that is seemingly stuck for hours despite having completed.
Discussion
The behaviour can be avoided by modifying the parallelism tracking logic to count the map task as `1` and not as `1 + 30` (in this example); the sketch below contrasts the two options. I would like to discuss which of the two is the intended behaviour.
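As a rough illustration of the two counting policies (again a simplified sketch with a hypothetical helper, not the propeller implementation):

```go
package main

import "fmt"

// parallelismContribution is a hypothetical helper showing how much a map task
// with activeSubtasks running subtasks could count against the workflow-level
// parallelism budget under each policy.
func parallelismContribution(activeSubtasks int, countSubtasks bool) int {
	if countSubtasks {
		// Behaviour observed in this issue: the array node itself plus every
		// running subtask counts against the budget.
		return 1 + activeSubtasks
	}
	// Alternative: the whole map task counts as a single unit, so a large map
	// task cannot exhaust the budget and starve sibling nodes like n1.
	return 1
}

func main() {
	const maxParallelism = 25
	activeSubtasks := 30

	for _, countSubtasks := range []bool{true, false} {
		current := parallelismContribution(activeSubtasks, countSubtasks)
		starved := current >= maxParallelism
		fmt.Printf("count subtasks=%v -> current parallelism %d, n1 starved: %v\n",
			countSubtasks, current, starved)
	}
}
```

Counting the map task as a single unit would keep `n1` schedulable, while counting every running subtask preserves a workflow-level cap on concurrently running pods; which of these matches the intended semantics of max parallelism is the open question.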
How was this patch tested?
Check all the applicable boxes
Related PRs
Docs link