Skip to content

Keep hotspots dask+cupy classification on the GPU#2739

Merged
brendancol merged 3 commits into
mainfrom
deep-sweep-performance-focal-2026-05-29
May 30, 2026
Merged

Keep hotspots dask+cupy classification on the GPU#2739
brendancol merged 3 commits into
mainfrom
deep-sweep-performance-focal-2026-05-29

Conversation

@brendancol
Copy link
Copy Markdown
Contributor

Closes #2734

The dask+cupy path of hotspots classified each chunk on the CPU. _chunk_fn inside _hotspots_dask_cupy ran cupy.asnumpy(z) to pull the z-score array off the GPU, classified it with _calc_hotspots_numpy, then pushed the int8 result back with cupy.asarray. Every chunk paid a host round trip.

This reuses _run_gpu_hotspots, the kernel the single-GPU cupy path already uses, so classification stays on the device. Convolution, z-score, and classification now all run on the GPU within the chunk, and the per-chunk cupy.asnumpy round trip is gone.

Behavior is unchanged: it reuses the same kernel the single-GPU path runs, so results match.

Backends: only the dask+cupy path of hotspots changed. The numpy, cupy single-GPU, and dask+numpy paths are untouched.

Test plan:

  • Existing hotspots tests pass (numpy, dask+numpy, cupy, zero-std, NaN handling)
  • New test_hotspots_dask_cupy: result stays a cupy-backed dask array and matches the single-GPU cupy backend on the chunk interior
  • Full test_focal.py suite passes (115 tests)

The dask+cupy hotspots chunk function copied each chunk to the host
(cupy.asnumpy), ran the CPU classifier, then copied the int8 result
back. Reuse the existing _run_gpu_hotspots kernel so classification
runs on the device, matching the single-GPU cupy path and removing
the per-chunk host round trip.

Adds test_hotspots_dask_cupy: asserts the dask+cupy result stays a
cupy-backed dask array and matches the single-GPU cupy backend on the
chunk interior.
@github-actions github-actions Bot added the performance PR touches performance-sensitive code label May 30, 2026
Copy link
Copy Markdown
Contributor Author

@brendancol brendancol left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

PR Review: Keep hotspots dask+cupy classification on the GPU

Blockers (must fix before merge)

None.

Suggestions (should fix, not blocking)

None.

Nits (optional improvements)

  • _hotspots_dask_cupy (~line 1267): cupy.zeros_like(z, dtype=cupy.int8) could be cupy.empty since _run_gpu_hotspots writes every in-bounds cell. It mirrors the single-GPU path (_hotspots_cupy line 1338 uses the same zeros_like), so staying consistent is the better call. Leave it.

What looks good

  • The fix reuses _run_gpu_hotspots the same way the single-GPU _hotspots_cupy path does: same cuda_args, same int8 zeros_like, same kernel launch. So the dask+cupy interior matches the single-GPU result rather than approximating it. The removed cupy.asnumpy/cupy.asarray pair was the only host hop in the chunk function.
  • Classification is elementwise (each z cell maps to its own int8 code), so classifying per chunk-with-halo and trimming via map_overlap gives the same answer as classifying the whole array at once.
  • test_hotspots_dask_cupy checks both that the result stays a cupy-backed dask array (_meta is a cupy ndarray) and that the interior matches the single-GPU backend exactly, using the same interior-trim pattern as test_apply_dask_cupy and test_focal_stats_dask_cupy.
  • _calc_hotspots_numpy is still used by the numpy and dask+numpy paths, so nothing is left orphaned.

Checklist

  • Algorithm matches reference (reuses existing GPU kernel; behavior unchanged)
  • All implemented backends produce consistent results (new test asserts dask+cupy == cupy interior)
  • NaN handling is correct (unchanged; _run_gpu_hotspots classifies z-scores, NaN z stays 0)
  • Edge cases covered (existing zero-std and NaN hotspots tests pass)
  • Dask chunk boundaries handled correctly (depth = kernel//2, boundary forwarded, elementwise classify)
  • No premature materialization or unnecessary copies (removes the per-chunk host round trip)
  • Benchmark exists (benchmarks/benchmarks/focal.py)
  • README feature matrix - not applicable (no new function, no backend support change)
  • Docstrings present and accurate (no API change)

@brendancol brendancol merged commit 2ac4c58 into main May 30, 2026
6 of 7 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

performance PR touches performance-sensitive code

Projects

None yet

Development

Successfully merging this pull request may close these issues.

hotspots dask+cupy backend round-trips every chunk through host memory

1 participant