fix: dynamic_partition_overwrite builds per-spec delete predicates after partition spec evolution #3149

Open
tusharchou wants to merge 3 commits into apache:main from tusharchou:test/manifest-pruning-spec-evolution

@tusharchou
Contributor

Rationale

While reviewing PR #3011 (manifest pruning optimization), I identified a correctness
gap when tables have undergone partition spec evolution.

Before this change, when dynamic_partition_overwrite was called on a table whose
snapshot contained mixed partition_spec_ids, the delete predicate was built using only
the current partition spec. This caused inclusive_projection to fail silently when
evaluating older manifests: the predicate contained field references (e.g. region) that
have no corresponding partition field in the old spec, so the manifest evaluator skipped
those manifests entirely. The result was silent data duplication: stale rows from
old-spec manifests were never deleted.
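
The failure mode described above can be illustrated with a small toy model. This is only a sketch of the behavior as this PR describes it, not PyIceberg's evaluator: the `Manifest` class, the `SPEC_FIELDS` registry, the `category` field, and the helper function are all hypothetical names introduced for illustration.

```python
from dataclasses import dataclass

# Toy model only: none of these names are PyIceberg APIs.
@dataclass
class Manifest:
    spec_id: int
    partition: dict  # partition field name -> value for the files it tracks

# Hypothetical spec registry: spec 0 partitions by "category" only,
# spec 1 (the current spec) partitions by "category" and "region".
SPEC_FIELDS = {0: ("category",), 1: ("category", "region")}

def manifests_to_delete_single_spec(manifests, predicate):
    """Model of the pre-fix behavior described in this PR: one predicate, built
    from the current spec, is applied to every manifest. A manifest whose spec
    lacks a referenced field is skipped without raising an error."""
    selected = []
    for m in manifests:
        fields = SPEC_FIELDS[m.spec_id]
        if any(name not in fields for name in predicate):
            continue  # silent skip: the old-spec manifest is never evaluated
        if all(m.partition.get(k) == v for k, v in predicate.items()):
            selected.append(m)
    return selected

manifests = [
    Manifest(spec_id=0, partition={"category": "A"}),
    Manifest(spec_id=1, partition={"category": "A", "region": "us"}),
]
# Predicate built only from the current spec (spec 1).
predicate = {"category": "A", "region": "us"}

deleted = manifests_to_delete_single_spec(manifests, predicate)
stale = [m for m in manifests if m not in deleted]
```

In this model only the spec-1 manifest is selected for deletion, so the spec-0 manifest survives the overwrite and its rows end up alongside the newly written data.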

Changes

  • pyiceberg/table/__init__.py: dynamic_partition_overwrite now iterates over all
    partition_spec_ids present in the current snapshot and builds a per-spec delete
    predicate, projecting the new data files' partition values into each historical spec's
    coordinate space before evaluating.

  • tests/integration/test_manifest_pruning_spec_evolution.py: two regression tests added:

    1. Mixed-spec snapshot: overwrite a partition present under both spec-0 and spec-1
    2. Spec-0-only partition: overwrite a partition that exists only in spec-0 manifests
      (the silent data duplication case: no exception raised, wrong rows survive)
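
The per-spec approach described in the first bullet can be sketched in plain Python. This is a simplified illustration under stated assumptions, not the code in pyiceberg/table/__init__.py: `SPEC_FIELDS`, `project_partition`, `build_per_spec_predicates`, and the `category`/`region` fields are hypothetical, and a "predicate" here is just a set of partition tuples.

```python
# Hypothetical spec registry: spec id -> partition field names.
SPEC_FIELDS = {0: ("category",), 1: ("category", "region")}

def partition_key(partition):
    """Order-independent, hashable form of a partition dict."""
    return tuple(sorted(partition.items()))

def project_partition(partition, spec_id):
    """Keep only the partition fields that exist in the given spec."""
    return {k: v for k, v in partition.items() if k in SPEC_FIELDS[spec_id]}

def build_per_spec_predicates(new_partitions, spec_ids_in_snapshot):
    """For every spec id present in the snapshot, collect the new data files'
    partition values projected into that spec's fields."""
    return {
        spec_id: {partition_key(project_partition(p, spec_id)) for p in new_partitions}
        for spec_id in spec_ids_in_snapshot
    }

def manifests_to_delete(manifests, predicates):
    """Evaluate each manifest against the predicate built for its own spec."""
    return [
        (spec_id, partition)
        for spec_id, partition in manifests
        if partition_key(partition) in predicates.get(spec_id, set())
    ]

# (spec_id, partition) pairs standing in for manifests in the current snapshot.
manifests = [
    (0, {"category": "A"}),
    (1, {"category": "A", "region": "us"}),
    (1, {"category": "B", "region": "eu"}),
]
new_partitions = [{"category": "A", "region": "us"}]

predicates = build_per_spec_predicates(new_partitions, {spec_id for spec_id, _ in manifests})
to_delete = manifests_to_delete(manifests, predicates)
```

Here both the spec-0 and spec-1 manifests for category A are selected, while the unrelated category B manifest is left alone. Note that in this toy model the spec-0 projection drops `region`, so it matches every spec-0 file with category A regardless of region.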

Are these changes tested?

Yes. Two new integration tests use the SQLite in-memory catalog, so no external
services are required.

Are there any user-facing changes?

Yes. dynamic_partition_overwrite now correctly deletes all matching rows across
all historical partition specs, fixing silent data duplication on evolved tables.

Related

…s after partition spec evolution

Fixes apache#3148

When a table has undergone partition spec evolution, its snapshot may
contain manifests written under different partition_spec_ids. Previously,
dynamic_partition_overwrite built the delete predicate using only the
current spec, causing the manifest evaluator to incorrectly skip manifests
from older specs — leaving stale data files silently behind.

The fix builds the delete predicate per historical spec present in the
snapshot, projecting the new data files' partition values into each spec's
coordinate space before evaluating.

Regression tests added covering:
- Mixed-spec snapshot (manifests from both spec-0 and spec-1)
- Overwrite of a partition that only exists in spec-0 manifests (silent
  data duplication case)
@tusharchou
Contributor Author

AI Disclosure
AI was used to help understand the code base and draft code changes. All code changes have been thoroughly reviewed, ensuring that the code changes are in line with a broader understanding of the codebase.

Development

Successfully merging this pull request may close these issues.

Bug: dynamic_partition_overwrite silently skips spec-0 manifests after partition spec evolution