Skip to content

polygonize: auto-detect attrs['transform'] for output georeferencing (#2536)#2541

Merged
brendancol merged 3 commits into
mainfrom
issue-2536
May 28, 2026
Merged

polygonize: auto-detect attrs['transform'] for output georeferencing (#2536)#2541
brendancol merged 3 commits into
mainfrom
issue-2536

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Summary

polygonize(return_type='geopandas') already auto-detected attrs['crs'] and stamped it on the output GeoDataFrame, but never read attrs['transform'], so geometries stayed in raw pixel space. The GeoDataFrame ended up with a .crs claiming projected space (e.g. EPSG:3857) while the geometries were not in that space. Silent misalignment, worse than the missing-CRS bug fixed in #2149: the downstream caller could not detect the inconsistency from the GeoDataFrame alone.

This PR adds _detect_raster_transform() parallel to _detect_raster_crs(). Resolution order:

  1. raster.attrs['transform'] (xrspatial.geotiff convention, a rasterio-ordered 6-tuple)
  2. raster.rio.transform() (rioxarray, if installed)
  3. None

Skipped when attrs['_xrspatial_no_georef']=True (the xrspatial.geotiff no-georef marker), and when rio.transform() returns the identity affine (rioxarray's default when no transform info is available; otherwise the output would silently shift by zero and look identical to "no transform"). Malformed attrs['transform'] (wrong length, non-numeric) is silently ignored, matching the existing unparseable-CRS-attr policy. An explicit transform= argument still wins over the attr.

The geometries are transformed inside _scan, so this applies to all return_type options. The "numpy", "awkward", "spatialpandas", and "geojson" outputs all benefit. The simplify_tolerance path is unaffected -- simplification runs after transform application.

Closes #2536.

Backend coverage

  • numpy: covered by test_transform_attr_auto_detected_numpy and test_transform_attr_auto_detected_geopandas
  • dask+numpy: covered by test_transform_attr_auto_detected_dask
  • cupy / dask+cupy: not separately exercised here. The cupy path hits _polygonize_cupy, which transfers to CPU before calling _scan, and the dask+cupy path uses the same _polygonize_dask orchestrator as dask+numpy. Both honour the transform argument that this PR now populates from the raster attrs.

Test plan

  • pytest xrspatial/tests/test_polygonize.py::TestPolygonizeTransformPropagation (12 new tests pass)
  • pytest xrspatial/tests/test_polygonize.py (full file: 153 passed, 13 skipped, no regressions)
  • pytest xrspatial/tests/test_polygonize_coverage_2026_05_19.py xrspatial/tests/test_polygonize_issue_2172.py (96 passed)
  • CI green

…2536)

`polygonize(raster, return_type='geopandas')` already auto-detects
`attrs['crs']` and stamps it on the returned `GeoDataFrame`. It never
read `attrs['transform']`, so the geometries stayed in raw pixel space
even when the raster carried a valid affine transform. The result was
a `GeoDataFrame` whose `.crs` claimed projected space while the
geometries were not in that space -- silent misalignment by the
raster's origin offset, worse than the missing-CRS bug (#2149) because
downstream callers cannot detect the inconsistency from the
`GeoDataFrame` alone.

Add `_detect_raster_transform()` parallel to `_detect_raster_crs()`
with resolution order:

1. `raster.attrs['transform']` (xrspatial.geotiff convention, a
   rasterio-ordered 6-tuple)
2. `raster.rio.transform()` (rioxarray, if installed)
3. `None`

Skip auto-detection when `attrs['_xrspatial_no_georef']=True` (the
xrspatial.geotiff no-georef marker) and when `rio.transform()`
returns the identity (rioxarray's default when no transform info is
available -- otherwise the output would silently shift by zero and
look identical to "no transform"). Malformed `attrs['transform']` (wrong
length, non-numeric) is silently ignored, matching the existing
unparseable-CRS-attr policy. An explicit `transform=` argument still
wins over the attr.

The geometries themselves are transformed inside `_scan`, so this
applies to all return types -- the "numpy", "awkward", "spatialpandas",
and "geojson" outputs all benefit. The `simplify_tolerance` path is
unaffected (simplification runs after transform application).
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 27, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: polygonize: auto-detect attrs['transform'] for output georeferencing (#2536)

Blockers

None.

Suggestions

  • Behavioural change deserves a CHANGELOG entry. Existing user code that calls polygonize(raster) without an explicit transform= will now return CRS-space coordinates rather than pixel coordinates whenever the raster carries attrs['transform'] or a non-identity rio.transform(). The fix matches the precedent set by #2149 for CRS auto-detection, but this is still observable from outside the package. A short note in CHANGELOG.md under the next release would help users who currently chain polygonize -> external transform application. (xrspatial/polygonize.py, top-level behaviour)

  • Coverage gap: mask= interaction. None of the new tests pass both mask= and a raster carrying attrs['transform']. The existing mask tests assert on polygon geometry without an auto-detected transform, and the new transform tests use no mask. A single combined assertion (mask set + attrs transform set + check bounds land in CRS space) would close the obvious chained-feature case. (xrspatial/tests/test_polygonize.py: TestPolygonizeTransformPropagation)

  • Coverage gap: connectivity=8. The transform is applied inside _scan regardless of connectivity, so 4-connectivity and 8-connectivity should produce identically-transformed geometries with the same auto-detected transform. A parametrized test over connectivity in (4, 8) would catch any future change to _scan that decouples the transform path from one of the connectivities. (xrspatial/tests/test_polygonize.py: TestPolygonizeTransformPropagation)

Nits

  • test_no_georef_marker_suppresses_auto_detect exercises a state the geotiff writer never produces. xrspatial/geotiff/_attrs.py:782-785 only emits _xrspatial_no_georef=True when has_georef=False, and in that branch it does NOT write attrs['transform']. The two attrs are mutually exclusive in practice, so the test is defensive against a malformed dict rather than a real read path. A one-line comment in the test acknowledging that would help future readers. (xrspatial/tests/test_polygonize.py:1432-1440)

  • Identity-affine short-circuit. _detect_raster_transform returns None when rio.transform() reports the identity. Applying the identity transform is a no-op, so the short-circuit is harmless but technically suppresses a legitimate "user really did stamp identity on this raster" case. The current comment explains the rationale; leaving it as-is is fine. (xrspatial/polygonize.py:644-650)

What looks good

  • The helper is structurally parallel to _detect_raster_crs, which keeps the file easy to scan.
  • Malformed attrs['transform'] falls back rather than raising, matching the existing unparseable-CRS policy. The new test_transform_attr_invalid_length_falls_back and test_explicit_transform_invalid_length_still_raises codify both sides of that split.
  • The dask path was already routing the user's transform through _merge_from_separated rather than through the per-chunk call, so the auto-detected transform reaches the merge step unchanged. test_transform_attr_auto_detected_dask exercises this end-to-end.
  • The docstring update is specific about the resolution order and notes that auto-detection applies to all return_type options, not only geopandas.

Checklist

  • Algorithm matches reference: transform application is unchanged from existing code, only the input path is widened.
  • All implemented backends produce consistent results (numpy and dask covered by tests; cupy / dask+cupy share the same orchestrators).
  • NaN handling is unchanged.
  • Edge cases covered: no transform, malformed attr (multiple shapes), explicit override, rioxarray-only source.
  • Dask chunk boundaries: transform application is at the merge step, not per-chunk, so chunk boundaries do not interact with the new path.
  • No premature materialization or unnecessary copies.
  • [N/A] Benchmark not needed for a metadata-detection helper.
  • [N/A] README feature matrix unchanged (no new function, no backend change).
  • Docstrings present and accurate.

- CHANGELOG.md: add an Unreleased "Fixed" entry calling out the
  behavioural change (auto-detected transform now applies to callers
  who previously got pixel coords from a georef'd raster), with the
  opt-out recipe.
- test_polygonize.py: add `test_transform_attr_with_mask` to verify
  the auto-detected transform composes with the `mask=` argument.
- test_polygonize.py: parametrize `test_transform_attr_works_with_
  both_connectivities` over `connectivity in (4, 8)` so a future
  change that decouples `_scan`'s transform call site from one of
  the connectivities cannot regress silently.
- test_polygonize.py: add a comment on
  `test_no_georef_marker_suppresses_auto_detect` explaining that the
  state it exercises is not produced by xrspatial.geotiff in practice
  (mutually exclusive attrs), so the test guards against malformed
  third-party dicts rather than a known read path.

The identity-affine short-circuit nit was left as-is per the review
write-up: applying identity is a no-op, the existing comment already
documents the rationale.
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Follow-up review pass

Addressed in a2835e92:

  • Suggestion 1 (CHANGELOG entry): Fixed. Added an Unreleased > Fixed entry to CHANGELOG.md documenting the behavioural change and the opt-out (transform=np.array([1,0,0,0,1,0])).
  • Suggestion 2 (mask + transform coverage): Fixed. Added test_transform_attr_with_mask which masks out the right column and asserts the surviving polygon emerges in CRS space.
  • Suggestion 3 (connectivity=8 coverage): Fixed. Replaced the single-connectivity test with test_transform_attr_works_with_both_connectivities parametrized over connectivity in (4, 8).
  • Nit 1 (no_georef test comment): Fixed. Added an explanatory comment on test_no_georef_marker_suppresses_auto_detect noting that the geotiff writer never emits the marker and a transform attr together (mutually exclusive at _attrs.py:782-785), so the test guards against malformed third-party dicts rather than a real read path.
  • Nit 2 (identity-affine short-circuit): Dismissed. The existing comment in _detect_raster_transform already documents the rationale, and applying the identity affine is a no-op, so the short-circuit causes no observable difference. The original review write-up agreed leaving as-is is fine.

Test count: 15 tests in TestPolygonizeTransformPropagation, all passing. Full file: 156 passed, 13 skipped, no regressions.

@brendancol brendancol merged commit b3c55f2 into main May 28, 2026
7 checks passed
@brendancol
Copy link
Copy Markdown
Contributor Author

Post-merge notes on #2541

Re-ran the tests against the merge commit (b3c55f2): 15/15 in TestPolygonizeTransformPropagation, 20/20 across pytest -k transform. Code matches the PR description and follows the _detect_raster_crs resolution pattern.

One real finding

CHANGELOG entry undersells the scope. It only mentions polygonize(return_type='geopandas'), but _transform_points runs inside _scan, which means the auto-detected transform also lands on return_type='numpy', 'awkward', 'spatialpandas', and 'geojson'. A caller doing polygonize(rast, return_type='numpy') who relied on pixel-space coordinates will see their coords move into CRS space and won't have a CHANGELOG line warning them. The PR body has this right; the CHANGELOG should match.

Concrete fix: drop the return_type='geopandas' qualifier in the entry, and add one line saying the change applies to every return type.

Smaller note

The identity-affine short-circuit in _detect_raster_transform is asymmetric: rio.transform() returning identity short-circuits to None, but attrs['transform'] = (1,0,0,0,1,0) is honoured (as a no-op). No correctness impact, just worth a comment saying the asymmetry is intentional: the rio identity is rioxarray's "no info" sentinel, the attrs identity is an explicit user choice.

Confirming the parity rationale

Cupy and dask+cupy aren't separately tested. The PR body's reasoning checks out: _polygonize_cupy transfers to CPU before calling _scan (line 891), and _polygonize_dask is shared between dask+numpy and dask+cupy. Both receive the transform via the same call path, so they ride along on the existing numpy + dask coverage.

Things that look good

  • Malformed attrs['transform'] falls back silently, matching the existing unparseable-CRS policy.
  • Explicit transform= still wins over the attr (test_explicit_transform_overrides_attr).
  • _xrspatial_no_georef is honoured before either attrs or rio (test_no_georef_marker_suppresses_auto_detect).
  • Sorting happens in pixel space before transform application, so polygon order is transform-independent.

Can spin up a one-line CHANGELOG follow-up PR if that's useful.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

polygonize: GeoDataFrame output drops input attrs['transform'] while keeping attrs['crs']

1 participant