Forward-merge release/26.06 into main#2113

Open
rapids-bot[bot] wants to merge 4 commits into main from release/26.06

@rapids-bot rapids-bot Bot commented May 20, 2026

Forward-merge triggered by push to release/26.06 that creates a PR to keep main up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge. See forward-merger docs for more info.

This PR converts the current Sphinx docs to Fern in preparation for the move to docs.nvidia.com. Rather than composing the API reference docs by hand, it generates them for all supported languages directly from the code (as is standard in Fern).

There are a lot of files in this PR. Most of the markdown files were either copied over to the Fern directory format from the old Sphinx docs or auto-generated by the new API reference generation script (`generate_api_reference.py` in the changes). When reviewing this PR, it's probably better to start with the non-markdown files, then build the docs and run them locally.

The docs can be built in the usual way with `./build.sh docs`. You can run them locally using the following command:

```
fern/build_docs.sh dev --port 3000 --backend-port 3001
```

Authors:
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Bradley Dice (https://github.com/bdice)
  - Divye Gala (https://github.com/divyegala)
  - Robert Maynard (https://github.com/robertmaynard)

URL: #2067
@rapids-bot rapids-bot Bot requested review from a team as code owners May 20, 2026 14:57
@rapids-bot rapids-bot Bot requested a review from msarahan May 20, 2026 14:57
rapids-bot Bot commented May 20, 2026

FAILURE - Unable to forward-merge due to an error; a manual merge is necessary. Do not use the Resolve conflicts option in this PR. Instead, follow the instructions at https://docs.rapids.ai/maintainers/forward-merger/

IMPORTANT: When merging this PR, do not use the auto-merger (i.e. the /merge comment). Instead, an admin must manually merge by changing the merging strategy to Create a Merge Commit. Otherwise, history will be lost and the branches become incompatible.

Closes #1989.

Adds multi-GPU support to KMeans fit for host-resident data, with two modes:
- **OpenMP (cuVS SNMG)**: A single process drives all local GPUs via OMP threads and raw NCCL. Activated automatically when the handle is a `device_resources_snmg`.
- **RAFT comms (Ray / Dask / MPI)**: Each rank is a separate process that calls fit with its own data shard and an initialized RAFT communicator. Coordination uses the RAFT comms.

Both modes share the same core Lloyd's loop, batched streaming of host data, NCCL/comms allreduce of centroid sums and counts, and synchronized convergence. Supports sample weights, n_init best-of-N restarts, KMeansPlusPlus initialization, and float/double. Falls back to single-GPU when neither multi-GPU resources nor comms are present.
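
The shared update step described above can be illustrated with a minimal, pure-Python sketch. It is not the cuVS implementation: each "rank" is just a list of points, `allreduce_sum` is a hypothetical stand-in for the NCCL / RAFT comms allreduce, and sample weights, batching, and convergence checks are left out. Keeping the old centroid for an empty cluster is an assumption made here for simplicity.

```python
from typing import List, Sequence, Tuple

Point = Sequence[float]
Partial = Tuple[List[List[float]], List[int]]


def local_partials(shard: Sequence[Point], centroids: Sequence[Point]) -> Partial:
    """Per-rank work: assign local points to the nearest centroid and
    accumulate per-cluster coordinate sums and point counts."""
    k, dim = len(centroids), len(centroids[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    for p in shard:
        best = min(
            range(k),
            key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
        )
        counts[best] += 1
        for d in range(dim):
            sums[best][d] += p[d]
    return sums, counts


def allreduce_sum(partials: Sequence[Partial]) -> Partial:
    """Hypothetical stand-in for the allreduce: element-wise sum over ranks."""
    k, dim = len(partials[0][1]), len(partials[0][0][0])
    total_sums = [[0.0] * dim for _ in range(k)]
    total_counts = [0] * k
    for sums, counts in partials:
        for c in range(k):
            total_counts[c] += counts[c]
            for d in range(dim):
                total_sums[c][d] += sums[c][d]
    return total_sums, total_counts


def lloyd_step(shards: Sequence[Sequence[Point]], centroids: Sequence[Point]) -> List[List[float]]:
    """One Lloyd iteration over data split across ranks."""
    sums, counts = allreduce_sum([local_partials(s, centroids) for s in shards])
    # Assumption: an empty cluster keeps its previous centroid.
    return [
        [s / counts[c] for s in sums[c]] if counts[c] else list(centroids[c])
        for c in range(len(centroids))
    ]


shards = [[(0.0, 0.0), (0.0, 2.0)], [(10.0, 10.0), (10.0, 12.0)]]
init = [(0.0, 0.0), (10.0, 10.0)]
print(lloyd_step(shards, init))  # [[0.0, 1.0], [10.0, 11.0]]
```

Because only sums and counts cross rank boundaries, the result matches what a single rank holding all of the data would compute, which is why both modes can share one loop.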

Authors:
  - Victor Lafargue (https://github.com/viclafargue)
  - Tarang Jain (https://github.com/tarang-jain)

Approvers:
  - Tarang Jain (https://github.com/tarang-jain)
  - Micka (https://github.com/lowener)
  - Dante Gama Dessavre (https://github.com/dantegd)

URL: #2017
@rapids-bot rapids-bot Bot requested review from a team as code owners May 20, 2026 19:38
jinsolp and others added 2 commits May 20, 2026 19:53
Closes #1901

Previous Code
- nn-descent almost always allocated device-side fp16 arrays. This was done for two reasons:
  - allowing wmma usage
  - allowing data modification for `CosineExpanded` preprocessing

Current PR Changes
- This PR makes nn-descent reuse the user's device-side data buffer when the input data is already on device, instead of always allocating and copying into a separate staging buffer. This roughly halves the peak device memory footprint for the common UMAP/HDBSCAN call path where the dataset is already on GPU.
- Removes the preprocessing step for the `CosineExpanded` metric (to avoid allocating additional device-side data arrays) and does the computation inside the `calculate_metric` function instead.
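
As a rough illustration of the second change, the sketch below computes a cosine distance from the untouched input rows plus a separate array of L2 norms, rather than normalizing the data in place first. It assumes the usual definition, 1 - dot(a, b) / (||a|| ||b||), and non-zero rows; the helper names are hypothetical and the actual `calculate_metric` kernel may differ in its details.

```python
import math
from typing import List, Sequence


def l2_norms(rows: Sequence[Sequence[float]]) -> List[float]:
    """Compute one L2 norm per row; the rows themselves are left unmodified."""
    return [math.sqrt(sum(x * x for x in row)) for row in rows]


def cosine_distance(a: Sequence[float], b: Sequence[float],
                    norm_a: float, norm_b: float) -> float:
    """Cosine distance computed at metric time from raw rows and their norms.
    Assumes both norms are non-zero."""
    dot = sum(x * y for x, y in zip(a, b))
    return 1.0 - dot / (norm_a * norm_b)


data = [[3.0, 4.0], [6.0, 8.0], [4.0, -3.0]]
norms = l2_norms(data)  # small side array, one value per row

# Parallel rows give distance 0, orthogonal rows give distance 1.
print(cosine_distance(data[0], data[1], norms[0], norms[1]))
print(cosine_distance(data[0], data[2], norms[0], norms[2]))
```

The only extra storage is one norm per row, so the input buffer can be shared with the caller instead of being copied and rewritten.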
  
### Peak Memory Usage Changes
  - food data (5M x 384) = 7.25GiB
  - sports data (13M x 284) = 18.55GiB

- Note that in the FP32->FP16 Device case (the data is already on device), the previous code allocates a new fp16 array, which increases GPU memory usage. This PR converts to fp16 on the fly instead, trading a small time overhead for not allocating that extra fp16 memory.
<img width="6570" height="3597" alt="performance_metrics" src="https://github.com/user-attachments/assets/9ea9efae-d8be-4990-a513-7c2cfcf0d718" />
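
For a rough sense of what an extra fp16 staging copy costs, the arithmetic can be sketched as below. This is a back-of-the-envelope estimate only: it assumes exactly 5,000,000 x 384 elements, while the row count quoted above is rounded, so the result will not match the reported figure exactly. It also counts only the dense data arrays, not graph buffers or other allocations.

```python
GIB = 1024 ** 3


def matrix_gib(rows: int, cols: int, bytes_per_elem: int) -> float:
    """Size of a dense rows x cols matrix in GiB."""
    return rows * cols * bytes_per_elem / GIB


rows, cols = 5_000_000, 384  # assumed exact shape; the source rounds to "5M"
fp32_gib = matrix_gib(rows, cols, 4)  # user's fp32 device buffer
fp16_gib = matrix_gib(rows, cols, 2)  # extra fp16 staging copy

print(f"fp32 input:        {fp32_gib:.2f} GiB")
print(f"fp16 staging copy: {fp16_gib:.2f} GiB")
print(f"staging overhead:  {fp16_gib / fp32_gib:.0%} of the input size")
```

Under these assumptions the staging copy adds half the size of the fp32 input on top of the input itself.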

### Performance Changes
- Conversion Overhead: On-the-fly conversion introduces negligible overhead.
- Cosine Metric: Now reads L2 norms inside the `calculate_metric` function, matching the access pattern used by the L2 distance metric. This adds minimal overhead (e.g. 18.2937s previously vs. 18.7598s now for the 5M x 384 data, roughly 2.5% slower).

Authors:
  - Jinsol Park (https://github.com/jinsolp)
  - Corey J. Nolet (https://github.com/cjnolet)

Approvers:
  - Divye Gala (https://github.com/divyegala)

URL: #1928
Closes #1873

Authors:
  - Divye Gala (https://github.com/divyegala)

Approvers:
  - Dante Gama Dessavre (https://github.com/dantegd)
  - Corey J. Nolet (https://github.com/cjnolet)

URL: #2030