Skip to content

distributed partition resolution is easy to get wrong #395

@jayshrivastava

Description

@jayshrivastava

We have a few instances of "give an task index and/or total number of tasks, determine what partitions to execute". This logic is copy pasted in a few places.

Also, implementors of TaskEstimator or even ExecutionPlan are expected to use task_idx and num_tasks.

The risk is that an implementor may execute the wrong partition if they get the math wrong. Even within the code base, we have this math copy-pasted a few times. Depending on the type of network boundary, the math is different too - determining what partitions to run with a Coalesce is different than for a Shuffle.

let off = self.properties.partitioning.partition_count() * task_context.task_index;

off..(off + self.properties.partitioning.partition_count()),

Is there something better we can do? Ex. pass an explicit range around instead of num_tasks and task_idx.

Metadata

Metadata

Assignees

No one assigned

    Labels

    No labels
    No labels

    Type

    No type
    No fields configured for issues without a type.

    Projects

    No projects

    Milestone

    No milestone

    Relationships

    None yet

    Development

    No branches or pull requests

    Issue actions