Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support) #4314
Draft
shoumikhin wants to merge 1 commit into
…tream

The delegate created and owned a private CUDA stream in init() and ran every enqueueV3() on it, so an application could not place inference on a specific CUDA stream or context (for example a CUDA green context for SM partitioning).

Let the caller select the stream instead, bringing the libtorch-free ExecuTorch runtime the same caller-stream capability the libtorch TensorRT runtime has (pytorch#4232):

- Add a scoped CudaStreamGuard (mirroring c10::cuda::CUDAStreamGuard) to select, per calling thread, the CUDA stream the delegate runs TensorRT on. With no guard active the delegate runs on cudaStreamPerThread.
- execute() runs enqueueV3() and the staging copies on the selected stream; init() no longer creates a stream and the delegate owns none.
- To confine inference to a CUDA green context's SM partition the caller scopes a guard with a stream created on that green context (cuGreenCtxStreamCreate); the partition confinement travels with the stream, so the green context need not be made current. cudaStreamPerThread is invalid while a green context is current (cudaErrorInvalidResourceHandle), so a green-context caller must scope a guard.
- cudaSetDevice() is applied only when the engine's device differs from the current device and is restored on exit, so it no longer clobbers a context the caller established.
- execute() leaves device-resident outputs enqueued (no end sync) only while a guard is active; the default path and host-staged outputs still synchronize before returning, preserving existing behavior. The caller synchronizes the selected stream when it reads device-resident results.

No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.
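The device handling and end-of-execute synchronization rules above can be illustrated with a small host-only sketch. Everything here is an assumption made for illustration: `fake_cuda` stands in for `cudaGetDevice()`/`cudaSetDevice()` so the code runs without a GPU, and `ScopedEngineDevice` and `must_sync_before_return` are hypothetical names, not the delegate's actual code.

```cpp
#include <cassert>

// Hypothetical stand-ins for cudaGetDevice()/cudaSetDevice(). They only track
// an integer per thread so the sketch runs without the CUDA toolkit.
namespace fake_cuda {
inline int& current_device() {
  thread_local int device = 0;
  return device;
}
inline int& set_device_calls() {
  thread_local int calls = 0;
  return calls;
}
inline void set_device(int device) {
  current_device() = device;
  ++set_device_calls();
}
}  // namespace fake_cuda

// Switches to the engine's device only when it differs from the current one,
// and restores the previous device when the scope ends.
class ScopedEngineDevice {
 public:
  explicit ScopedEngineDevice(int engine_device)
      : previous_(fake_cuda::current_device()),
        switched_(previous_ != engine_device) {
    if (switched_) fake_cuda::set_device(engine_device);
  }
  ~ScopedEngineDevice() {
    if (switched_) fake_cuda::set_device(previous_);
  }
  ScopedEngineDevice(const ScopedEngineDevice&) = delete;
  ScopedEngineDevice& operator=(const ScopedEngineDevice&) = delete;

 private:
  int previous_;   // device that was current on entry
  bool switched_;  // true only if set_device() was actually called
};

// Sync policy as described: skip the final synchronize only when a guard is
// active AND the outputs are device-resident; otherwise synchronize.
inline bool must_sync_before_return(bool guard_active, bool outputs_host_staged) {
  return !guard_active || outputs_host_staged;
}
```

With a matching device the guard makes no `set_device` call at all, which is what keeps it from disturbing a context the caller has already established.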
a17b517 to
283d1f3
Compare
Summary
The ExecuTorch TensorRT delegate created and owned a private CUDA stream and ran every `enqueueV3()` on it, so an application could not place inference on a specific CUDA stream or context, in particular a CUDA green context for SM partitioning.

This lets the caller select the stream, giving the libtorch-free ExecuTorch runtime the same caller-stream capability the libtorch TensorRT runtime gained in #4232.
Changes
- Add a scoped `CudaStreamGuard` (mirroring `c10::cuda::CUDAStreamGuard`) to select, per calling thread, the CUDA stream the delegate runs TensorRT on. With no guard active the delegate runs on `cudaStreamPerThread`.
- `execute()` runs `enqueueV3()` and the staging copies on the selected stream; `init()` no longer creates a stream (the delegate owns none).
- To confine inference to a CUDA green context's SM partition, the caller scopes a guard with a stream created on that green context via `cuGreenCtxStreamCreate`; the partition confinement travels with the stream, so the green context need not be made current. `cudaStreamPerThread` is invalid while a green context is current (`cudaErrorInvalidResourceHandle`), so a green-context caller must scope a guard.
- `cudaSetDevice()` is applied only when the engine's device differs from the current device and is restored on exit, so it no longer clobbers a context the caller established.

Validation
Verified on an H100 (CUDA 12.8) with an `%smid` probe: a `cuGreenCtxStreamCreate` stream confines kernels to the green context's SM partition even when the primary context is current; `cudaStreamPerThread` errors with `cudaErrorInvalidResourceHandle` while a green context is current; the non-green default path uses the full device.

No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.
Follow-up: a unit test for the stream selection (guarded vs. default) can be added.
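As a rough idea of what the stream selection and such a guarded-vs-default check might look like, here is a host-only sketch. It is not the PR's implementation: `StreamHandle` stands in for `cudaStream_t`, `per_thread_default_stream()` stands in for `cudaStreamPerThread`, and `ScopedStreamGuard`, `stream_for_execute()` and `check_stream_selection()` are hypothetical names. Restoring the previous selection on scope exit (so guards nest) is an assumption based on how scoped guards are usually written.

```cpp
#include <cassert>

// Stand-in for cudaStream_t so the sketch compiles without the CUDA toolkit.
using StreamHandle = void*;

// Stand-in for cudaStreamPerThread; the value is only a sentinel here.
inline StreamHandle per_thread_default_stream() {
  return reinterpret_cast<StreamHandle>(0x2);
}

// Per-thread selection state. A separate flag is used because a null handle
// is itself a meaningful stream value in CUDA.
struct Selection {
  StreamHandle stream;
  bool active;
};

inline Selection& current_selection() {
  thread_local Selection selection{nullptr, false};
  return selection;
}

// Selects a stream for the calling thread for the lifetime of the guard and
// restores the previous selection afterwards (assumed nesting behavior).
class ScopedStreamGuard {
 public:
  explicit ScopedStreamGuard(StreamHandle stream)
      : previous_(current_selection()) {
    current_selection() = Selection{stream, true};
  }
  ~ScopedStreamGuard() { current_selection() = previous_; }
  ScopedStreamGuard(const ScopedStreamGuard&) = delete;
  ScopedStreamGuard& operator=(const ScopedStreamGuard&) = delete;

 private:
  Selection previous_;
};

// What an execute() call would consult: the guarded stream if a guard is
// active on this thread, otherwise the per-thread default.
inline StreamHandle stream_for_execute() {
  const Selection& selection = current_selection();
  return selection.active ? selection.stream : per_thread_default_stream();
}

// Guarded vs. default check; returns true when every expectation holds.
inline bool check_stream_selection() {
  int a = 0, b = 0;  // their addresses serve as fake stream handles
  StreamHandle stream_a = &a, stream_b = &b;
  if (stream_for_execute() != per_thread_default_stream()) return false;
  {
    ScopedStreamGuard outer(stream_a);
    if (stream_for_execute() != stream_a) return false;
    {
      ScopedStreamGuard inner(stream_b);
      if (stream_for_execute() != stream_b) return false;
    }
    if (stream_for_execute() != stream_a) return false;
  }
  return stream_for_execute() == per_thread_default_stream();
}
```

A real test would pass actual CUDA streams and assert on the stream the delegate hands to `enqueueV3()`; the selection logic itself can be checked on the host as above.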