Keep hotspots dask+cupy classification on the GPU#2739
Merged
Conversation
The dask+cupy hotspots chunk function copied each chunk to the host (cupy.asnumpy), ran the CPU classifier, then copied the int8 result back. Reuse the existing _run_gpu_hotspots kernel so classification runs on the device, matching the single-GPU cupy path and removing the per-chunk host round trip. Adds test_hotspots_dask_cupy: asserts the dask+cupy result stays a cupy-backed dask array and matches the single-GPU cupy backend on the chunk interior.
brendancol
commented
May 30, 2026
Contributor
Author
brendancol
left a comment
There was a problem hiding this comment.
PR Review: Keep hotspots dask+cupy classification on the GPU
Blockers (must fix before merge)
None.
Suggestions (should fix, not blocking)
None.
Nits (optional improvements)
_hotspots_dask_cupy(~line 1267):cupy.zeros_like(z, dtype=cupy.int8)could becupy.emptysince_run_gpu_hotspotswrites every in-bounds cell. It mirrors the single-GPU path (_hotspots_cupyline 1338 uses the samezeros_like), so staying consistent is the better call. Leave it.
What looks good
- The fix reuses
_run_gpu_hotspotsthe same way the single-GPU_hotspots_cupypath does: samecuda_args, same int8zeros_like, same kernel launch. So the dask+cupy interior matches the single-GPU result rather than approximating it. The removedcupy.asnumpy/cupy.asarraypair was the only host hop in the chunk function. - Classification is elementwise (each z cell maps to its own int8 code), so classifying per chunk-with-halo and trimming via
map_overlapgives the same answer as classifying the whole array at once. test_hotspots_dask_cupychecks both that the result stays a cupy-backed dask array (_metais a cupy ndarray) and that the interior matches the single-GPU backend exactly, using the same interior-trim pattern astest_apply_dask_cupyandtest_focal_stats_dask_cupy._calc_hotspots_numpyis still used by the numpy and dask+numpy paths, so nothing is left orphaned.
Checklist
- Algorithm matches reference (reuses existing GPU kernel; behavior unchanged)
- All implemented backends produce consistent results (new test asserts dask+cupy == cupy interior)
- NaN handling is correct (unchanged;
_run_gpu_hotspotsclassifies z-scores, NaN z stays 0) - Edge cases covered (existing zero-std and NaN hotspots tests pass)
- Dask chunk boundaries handled correctly (depth = kernel//2, boundary forwarded, elementwise classify)
- No premature materialization or unnecessary copies (removes the per-chunk host round trip)
- Benchmark exists (benchmarks/benchmarks/focal.py)
- README feature matrix - not applicable (no new function, no backend support change)
- Docstrings present and accurate (no API change)
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Closes #2734
The dask+cupy path of
hotspotsclassified each chunk on the CPU._chunk_fninside_hotspots_dask_cupyrancupy.asnumpy(z)to pull the z-score array off the GPU, classified it with_calc_hotspots_numpy, then pushed the int8 result back withcupy.asarray. Every chunk paid a host round trip.This reuses
_run_gpu_hotspots, the kernel the single-GPU cupy path already uses, so classification stays on the device. Convolution, z-score, and classification now all run on the GPU within the chunk, and the per-chunkcupy.asnumpyround trip is gone.Behavior is unchanged: it reuses the same kernel the single-GPU path runs, so results match.
Backends: only the dask+cupy path of
hotspotschanged. The numpy, cupy single-GPU, and dask+numpy paths are untouched.Test plan:
test_hotspots_dask_cupy: result stays a cupy-backed dask array and matches the single-GPU cupy backend on the chunk interiortest_focal.pysuite passes (115 tests)