Forward-merge release/26.06 into main by rapids-bot[bot] · Pull Request #2113 · rapidsai/cuvs

rapids-bot · 2026-05-20T14:57:09Z

Forward-merge triggered by push to release/26.06 that creates a PR to keep main up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge. See forward-merger docs for more info.

This PR converts the current Sphinx docs to Fern in preparation for the move to docs.nvidia.com. Instead of manually composing the API reference docs, this PR also generates API reference docs for all supported languages directly from the code (as is standard in Fern).

There are a lot of files in this PR. Most of the markdown files are either copied over to the Fern directory format from the old Sphinx docs or auto-generated using the new API reference docs generation scripts (`generate_api_reference.py` in the changes). When reviewing this PR, it's probably better to start with the non-markdown files, then build the docs and run them locally.

The docs can be built in the usual way with `./build.sh docs`. You can run them locally using the following command:

```
fern/build_docs.sh dev --port 3000 --backend-port 3001
```

Authors:
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Divye Gala (https://github.com/divyegala)
- Robert Maynard (https://github.com/robertmaynard)

URL: #2067

rapids-bot · 2026-05-20T14:57:12Z

FAILURE - Unable to forward-merge due to an error, manual merge is necessary. Do not use the Resolve conflicts option in this PR, follow these instructions https://docs.rapids.ai/maintainers/forward-merger/

IMPORTANT: When merging this PR, do not use the auto-merger (i.e. the /merge comment). Instead, an admin must manually merge by changing the merging strategy to Create a Merge Commit. Otherwise, history will be lost and the branches become incompatible.

Closes #1989.

Adds multi-GPU support to KMeans fit for host-resident data, with two modes:

- **OpenMP (cuVS SNMG)**: A single process drives all local GPUs via OMP threads and raw NCCL. Activated automatically when the handle is a `device_resources_snmg`.
- **RAFT comms (Ray / Dask / MPI)**: Each rank is a separate process that calls fit with its own data shard and an initialized RAFT communicator. Coordination uses the RAFT comms.

Both modes share the same core Lloyd's loop, batched streaming of host data, NCCL/comms allreduce of centroid sums and counts, and synchronized convergence.

Supports sample weights, n_init best-of-N restarts, KMeansPlusPlus initialization, and float/double. Falls back to single-GPU when neither multi-GPU resources nor comms are present.

Authors:
- Victor Lafargue (https://github.com/viclafargue)
- Tarang Jain (https://github.com/tarang-jain)

Approvers:
- Tarang Jain (https://github.com/tarang-jain)
- Micka (https://github.com/lowener)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: #2017
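For readers unfamiliar with the pattern, the shared loop described in #2017 (local assignment, allreduce of centroid sums and counts, synchronized convergence check) can be sketched in plain Python. This is only an illustration under stated assumptions: the function names (`local_sums_and_counts`, `allreduce_sums`, `allreduce_counts`, `lloyd_fit`) are hypothetical and are not cuVS APIs, each "rank" is simulated as a list of points in one process, and the allreduce is a simple element-wise sum standing in for NCCL or RAFT comms. Sample weights, batching, and initialization are omitted.

```python
def local_sums_and_counts(shard, centroids):
    """Per-rank step: assign each local point to its nearest centroid and
    accumulate per-cluster coordinate sums and point counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for point in shard:
        # Squared Euclidean distance from this point to every centroid.
        d2 = [sum((a - b) ** 2 for a, b in zip(point, c)) for c in centroids]
        j = d2.index(min(d2))
        counts[j] += 1
        for t in range(dim):
            sums[j][t] += point[t]
    return sums, counts


def allreduce_sums(parts):
    """Stand-in for an allreduce(sum) over per-rank centroid sums."""
    k, dim = len(parts[0]), len(parts[0][0])
    return [[sum(p[j][t] for p in parts) for t in range(dim)] for j in range(k)]


def allreduce_counts(parts):
    """Stand-in for an allreduce(sum) over per-rank cluster counts."""
    return [sum(col) for col in zip(*parts)]


def lloyd_fit(shards, centroids, max_iter=100, tol=1e-9):
    """Run Lloyd's iterations over simulated ranks (one shard per rank)."""
    k = len(centroids)
    for _ in range(max_iter):
        partial = [local_sums_and_counts(s, centroids) for s in shards]
        sums = allreduce_sums([p[0] for p in partial])
        counts = allreduce_counts([p[1] for p in partial])
        # Every rank computes identical new centroids from the reduced
        # values; empty clusters keep their previous centroid (an assumption).
        new = [
            [v / counts[j] for v in sums[j]] if counts[j] else list(centroids[j])
            for j in range(k)
        ]
        shift = sum(
            (a - b) ** 2
            for c_new, c_old in zip(new, centroids)
            for a, b in zip(c_new, c_old)
        )
        centroids = new
        # The shift is derived from reduced values, so all ranks stop together.
        if shift <= tol:
            break
    return centroids
```

Because only the reduced sums and counts feed the centroid update, splitting the data across shards should give the same centroids as a single-shard fit on the concatenated data, which is the property the two modes rely on.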

Closes #1901

Previous Code

- We almost always allocate device-side fp16 arrays. This was for...
  - allowing wmma usage
  - allowing data modification for `CosineExpanded` preprocessing

Current PR Changes

- This PR makes nn-descent reuse the user's device-side data buffer when input data is already on device, instead of always allocating and copying into a separate staging buffer. This roughly halves the peak device memory footprint for the common UMAP/HDBSCAN call path where the dataset is already on GPU.
- Remove preprocessing for the `CosineExpanded` metric (because we don't want to allocate additional device-side data arrays) and do the computation inside the `calculate_metric` function.

### Peak memory usage Changes

- food data (5M x 384) = 7.25GiB
- sports data (13M x 284) = 18.55GiB
- Notice how for FP32->FP16 Device (meaning data is already on device), the previous code allocates a new fp16 array, resulting in more GPU memory usage. This PR converts to fp16 on the fly (at some cost in time) instead of allocating new fp16 memory.

<img width="6570" height="3597" alt="performance_metrics" src="https://github.com/user-attachments/assets/9ea9efae-d8be-4990-a513-7c2cfcf0d718" />

### Performance Changes

- Conversion Overhead: On-the-fly conversion introduces negligible overhead.
- Cosine Metric: Now reads L2 norms inside the `calculate_metric` function, aligning with the access pattern used by the L2 distance metric. Adds minimal overhead (e.g. previously 18.2937s vs. 18.7598s for 5Mx384 data).

Authors:
- Jinsol Park (https://github.com/jinsolp)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Divye Gala (https://github.com/divyegala)

URL: #1928
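To illustrate the `CosineExpanded` change in #1928, here is a minimal Python sketch of computing cosine distance from a dot product and cached L2 norms, leaving the input rows untouched rather than normalizing them into a separate copy. The names `l2_norms` and `cosine_distance` are hypothetical, and the zero-norm convention is an assumption; the actual `calculate_metric` is device code in cuVS and is not reproduced here.

```python
import math


def l2_norms(rows):
    """Compute one L2 norm per row, once, without modifying the rows."""
    return [math.sqrt(sum(x * x for x in row)) for row in rows]


def cosine_distance(a, b, norm_a, norm_b):
    """Cosine distance from the raw dot product and precomputed norms.

    The division by the norms happens at distance time, so no normalized
    copy of the dataset is needed.
    """
    dot = sum(x * y for x, y in zip(a, b))
    denom = norm_a * norm_b
    if denom == 0.0:
        # Assumed convention for zero vectors: treat as maximally distant.
        return 1.0
    return 1.0 - dot / denom


# Usage: norms are computed once and looked up for each pair.
data = [(1.0, 0.0), (0.0, 1.0), (2.0, 0.0)]
norms = l2_norms(data)
d_orthogonal = cosine_distance(data[0], data[1], norms[0], norms[1])
d_parallel = cosine_distance(data[0], data[2], norms[0], norms[2])
```

The trade-off matches the one described in the PR: each distance evaluation reads two extra norm values, in exchange for not holding a second, preprocessed array of the data.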

Closes #1873

Authors:
- Divye Gala (https://github.com/divyegala)

Approvers:
- Dante Gama Dessavre (https://github.com/dantegd)
- Corey J. Nolet (https://github.com/cjnolet)

URL: #2030

rapids-bot Bot requested review from a team as code owners May 20, 2026 14:57

rapids-bot Bot requested a review from msarahan May 20, 2026 14:57

github-project-automation Bot added this to Unstructured Data Processing May 20, 2026

rapids-bot Bot requested review from a team as code owners May 20, 2026 19:38

jinsolp and others added 2 commits May 20, 2026 19:53

Add UDF Usage and Developer docs (#2030)

e1db65c


rapids-bot[bot] wants to merge 4 commits into main from release/26.06
