Forward-merge release/26.06 into main#2113
Open
rapids-bot[bot] wants to merge 4 commits into
This PR converts the current Sphinx docs to Fern in preparation for the move to docs.nvidia.com. Instead of manually composing the API reference docs, this PR also generates API reference docs for all supported languages directly from the code (as is standard in Fern).

There are many files in this PR. Most of the markdown files are either copied from the old Sphinx docs into the Fern directory format or auto-generated by the new API reference generation scripts (`generate_api_reference.py` in the changes). When reviewing this PR, it's probably better to start with the non-markdown files, then build the docs and run them locally.

The docs can be built in the usual way with `./build.sh docs`. You can run them locally using the following command:

```
fern/build_docs.sh dev --port 3000 --backend-port 3001
```

Authors:
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Bradley Dice (https://github.com/bdice)
- Divye Gala (https://github.com/divyegala)
- Robert Maynard (https://github.com/robertmaynard)

URL: #2067
FAILURE - Unable to forward-merge due to an error, manual merge is necessary. Do not use the [...]

IMPORTANT: When merging this PR, do not use the auto-merger (i.e. the [...]
Closes #1989.

Adds multi-GPU support to KMeans fit for host-resident data, with two modes:
- **OpenMP (cuVS SNMG)**: A single process drives all local GPUs via OMP threads and raw NCCL. Activated automatically when the handle is a `device_resources_snmg`.
- **RAFT comms (Ray / Dask / MPI)**: Each rank is a separate process that calls fit with its own data shard and an initialized RAFT communicator. Coordination uses the RAFT comms.

Both modes share the same core Lloyd's loop, batched streaming of host data, NCCL/comms allreduce of centroid sums and counts, and synchronized convergence. Supports sample weights, n_init best-of-N restarts, KMeansPlusPlus initialization, and float/double. Falls back to single-GPU when neither multi-GPU resources nor comms are present.

Authors:
- Victor Lafargue (https://github.com/viclafargue)
- Tarang Jain (https://github.com/tarang-jain)

Approvers:
- Tarang Jain (https://github.com/tarang-jain)
- Micka (https://github.com/lowener)
- Dante Gama Dessavre (https://github.com/dantegd)

URL: #2017
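The shared loop described above can be sketched on the CPU with NumPy. This is an illustrative sketch only, not the cuVS implementation: the function names (`local_partial_sums`, `allreduce_sum`, `fit`) are hypothetical, the "ranks" are simulated as a list of shards in one process, and `allreduce_sum` merely stands in for an NCCL or RAFT comms allreduce.

```python
import numpy as np

def local_partial_sums(shard, centroids, weights=None, batch_size=1024):
    """Stream one rank's shard in batches and accumulate per-cluster
    weighted sums and weight totals (hypothetical helper)."""
    k, d = centroids.shape
    sums = np.zeros((k, d))
    counts = np.zeros(k)
    if weights is None:
        weights = np.ones(len(shard))
    for start in range(0, len(shard), batch_size):
        x = shard[start:start + batch_size]
        w = weights[start:start + batch_size]
        # Squared L2 distance from every point in the batch to every centroid.
        d2 = ((x[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        np.add.at(sums, labels, x * w[:, None])
        np.add.at(counts, labels, w)
    return sums, counts

def allreduce_sum(per_rank_values):
    """Stand-in for an allreduce: every rank would receive this total."""
    return sum(per_rank_values)

def fit(shards, init_centroids, max_iter=50, tol=1e-6):
    """Lloyd's loop over simulated ranks; returns (centroids, iterations)."""
    centroids = np.asarray(init_centroids, dtype=np.float64).copy()
    for it in range(max_iter):
        partials = [local_partial_sums(s, centroids) for s in shards]
        sums = allreduce_sum([p[0] for p in partials])
        counts = allreduce_sum([p[1] for p in partials])
        new_centroids = centroids.copy()
        nonempty = counts > 0  # empty clusters keep their previous centroid
        new_centroids[nonempty] = sums[nonempty] / counts[nonempty, None]
        shift = float(((new_centroids - centroids) ** 2).sum())
        centroids = new_centroids
        # All ranks see the same reduced values, so they stop together.
        if shift <= tol:
            break
    return centroids, it + 1
```

Because only the reduced sums and counts drive the centroid update, splitting the data into shards should give the same centroids as a single-shard fit, up to floating-point summation order.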
Closes #1901

Previous Code
- We almost always allocate device-side fp16 arrays. This was for...
  - allowing wmma usage
  - allowing data modification for `CosineExpanded` preprocessing

Current PR Changes
- This PR makes nn-descent reuse the user's device-side data buffer when input data is already on device, instead of always allocating and copying into a separate staging buffer. This roughly halves the peak device memory footprint for the common UMAP/HDBSCAN call path where the dataset is already on GPU.
- Remove preprocessing for the `CosineExpanded` metric (because we don't want to allocate additional device-side data arrays) and do the computation inside the `calculate_metric` function.

### Peak memory usage Changes
- food data (5M x 384) = 7.25GiB
- sports data (13M x 284) = 18.55GiB
- For FP32->FP16 Device (meaning data is already on device), the previous code allocates a new fp16 array, resulting in more GPU memory usage. This PR converts to fp16 on the fly, at the cost of some conversion time, instead of allocating new fp16 memory.

<img width="6570" height="3597" alt="performance_metrics" src="https://github.com/user-attachments/assets/9ea9efae-d8be-4990-a513-7c2cfcf0d718" />

### Performance Changes
- Conversion Overhead: On-the-fly conversion introduces negligible overhead.
- Cosine Metric: Now reads L2 norms inside the `calculate_metric` function, aligning with the access pattern used by the L2 distance metric. Adds minimal overhead (e.g. 18.2937s previously vs. 18.7598s now for 5M x 384 data).

Authors:
- Jinsol Park (https://github.com/jinsolp)
- Corey J. Nolet (https://github.com/cjnolet)

Approvers:
- Divye Gala (https://github.com/divyegala)

URL: #1928
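The idea of computing the cosine metric from cached L2 norms, rather than from a normalized copy of the data, can be illustrated with a small NumPy sketch. It is not the nn-descent kernel: `row_l2_norms` and `cosine_distance` are hypothetical names, and the per-row `astype(np.float16)` only mimics the on-the-fly fp16 conversion described above.

```python
import numpy as np

def row_l2_norms(data):
    """Compute one L2 norm per row up front (a small n-element array),
    instead of storing a normalized copy of the whole dataset."""
    return np.sqrt((data.astype(np.float32) ** 2).sum(axis=1))

def cosine_distance(data, norms, i, j):
    """Return 1 - cos(x_i, x_j) using the untouched input rows and the
    cached norms. Rows are rounded to fp16 only at the point of use."""
    a = data[i].astype(np.float16)  # on-the-fly conversion, no staging buffer
    b = data[j].astype(np.float16)
    dot = np.dot(a.astype(np.float32), b.astype(np.float32))
    return 1.0 - dot / (norms[i] * norms[j])
```

The trade-off in this sketch mirrors the one in the description: the input buffer is never modified or duplicated, and the price is a conversion and a norm lookup on each distance evaluation.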
Closes #1873

Authors:
- Divye Gala (https://github.com/divyegala)

Approvers:
- Dante Gama Dessavre (https://github.com/dantegd)
- Corey J. Nolet (https://github.com/cjnolet)

URL: #2030
Forward-merge triggered by push to release/26.06 that creates a PR to keep main up-to-date. If this PR is unable to be immediately merged due to conflicts, it will remain open for the team to manually merge. See forward-merger docs for more info.