Run the ExecuTorch TensorRT delegate on a caller-selected CUDA stream (green-context support) by shoumikhin · Pull Request #4314 · pytorch/TensorRT

shoumikhin · 2026-05-30T14:18:05Z

Summary

The ExecuTorch TensorRT delegate created and owned a private CUDA stream and ran every enqueueV3() on it, so an application could not place inference on a specific CUDA stream or context — in particular a CUDA green context for SM partitioning.

This PR lets the caller select the stream, giving the libtorch-free ExecuTorch runtime the same caller-stream capability that the libtorch TensorRT runtime gained in #4232.

Changes

- Add a scoped CudaStreamGuard (mirroring c10::cuda::CUDAStreamGuard) to select, per calling thread, the CUDA stream the delegate runs TensorRT on. With no guard active the delegate runs on cudaStreamPerThread.
- execute() runs enqueueV3() and the staging copies on the selected stream; init() no longer creates a stream (the delegate owns none).
- Green context: scope a guard with a stream created on the green context via cuGreenCtxStreamCreate; the partition confinement travels with the stream, so the green context need not be made current. cudaStreamPerThread is invalid while a green context is current (cudaErrorInvalidResourceHandle), so a green-context caller must scope a guard.
- cudaSetDevice() is applied only when the engine's device differs from the current device and is restored on exit, so it no longer clobbers a context the caller established.
- Backward compatible: device-resident outputs are left enqueued (no end sync) only while a guard is active; the default path and host-staged outputs still synchronize before returning, preserving the prior "results ready on return" behavior.
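
To make the per-thread guard semantics concrete, here is a minimal sketch of how a scoped, thread-local stream selection could work. It is not the code in this PR: `SketchStreamGuard`, `StreamHandle`, `default_stream()`, `selected_stream()`, and `guard_active()` are hypothetical names, and `StreamHandle` is a plain pointer standing in for `cudaStream_t` so the example builds without the CUDA toolkit.

```cpp
#include <cassert>
#include <thread>

// Stand-in for cudaStream_t (assumption: lets the sketch build without CUDA).
using StreamHandle = void*;

// Hypothetical sentinel playing the role of cudaStreamPerThread.
inline StreamHandle default_stream() {
  static int sentinel = 0;
  return &sentinel;
}

// Per-thread selection. A separate flag is kept because a null handle is
// itself a meaningful stream value in CUDA (the legacy default stream).
struct StreamSelection {
  bool active = false;
  StreamHandle stream = nullptr;
};

inline StreamSelection& thread_selection() {
  thread_local StreamSelection selection;
  return selection;
}

// RAII guard: selects `stream` for the calling thread and restores the
// previous selection on destruction, so nested guards unwind correctly.
class SketchStreamGuard {
 public:
  explicit SketchStreamGuard(StreamHandle stream)
      : previous_(thread_selection()) {
    thread_selection() = StreamSelection{true, stream};
  }
  ~SketchStreamGuard() {
    assert(thread_selection().active);  // guards are destroyed in LIFO order
    thread_selection() = previous_;
  }
  SketchStreamGuard(const SketchStreamGuard&) = delete;
  SketchStreamGuard& operator=(const SketchStreamGuard&) = delete;

 private:
  StreamSelection previous_;
};

// What a delegate's execute() could consult before enqueueing work.
inline StreamHandle selected_stream() {
  const StreamSelection& s = thread_selection();
  return s.active ? s.stream : default_stream();
}

inline bool guard_active() { return thread_selection().active; }

// Usage check: nesting restores the outer stream, and another thread
// never observes this thread's guard.
inline bool demo_guard_scoping() {
  int a = 0, b = 0;  // their addresses serve as two distinct fake streams
  SketchStreamGuard outer(&a);
  {
    SketchStreamGuard inner(&b);
    if (selected_stream() != &b) return false;
  }
  if (selected_stream() != &a) return false;

  bool other_thread_sees_default = false;
  std::thread worker([&] {
    other_thread_sees_default =
        !guard_active() && selected_stream() == default_stream();
  });
  worker.join();
  return other_thread_sees_default;
}
```

In this sketch the selection lives in thread-local storage rather than in the delegate, which is one way to match the description that the delegate owns no stream and that the selection applies only to the calling thread.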

Validation

Verified on an H100 (CUDA 12.8) with an %smid probe: a cuGreenCtxStreamCreate stream confines kernels to the green context's SM partition even when the primary context is current; cudaStreamPerThread errors with cudaErrorInvalidResourceHandle while a green context is current; the non-green default path uses the full device.

No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.

Follow-up: a unit test for the stream selection (guarded vs. default) can be added.

…tream

The delegate created and owned a private CUDA stream in init() and ran every enqueueV3() on it, so an application could not place inference on a specific CUDA stream or context (for example a CUDA green context for SM partitioning).

Let the caller select the stream instead, bringing the libtorch-free ExecuTorch runtime the same caller-stream capability the libtorch TensorRT runtime has (pytorch#4232):

- Add a scoped CudaStreamGuard (mirroring c10::cuda::CUDAStreamGuard) to select, per calling thread, the CUDA stream the delegate runs TensorRT on. With no guard active the delegate runs on cudaStreamPerThread.
- execute() runs enqueueV3() and the staging copies on the selected stream; init() no longer creates a stream and the delegate owns none.
- To confine inference to a CUDA green context's SM partition the caller scopes a guard with a stream created on that green context (cuGreenCtxStreamCreate); the partition confinement travels with the stream, so the green context need not be made current. cudaStreamPerThread is invalid while a green context is current (cudaErrorInvalidResourceHandle), so a green-context caller must scope a guard.
- cudaSetDevice() is applied only when the engine's device differs from the current device and is restored on exit, so it no longer clobbers a context the caller established.
- execute() leaves device-resident outputs enqueued (no end sync) only while a guard is active; the default path and host-staged outputs still synchronize before returning, preserving existing behavior. The caller synchronizes the selected stream when it reads device-resident results.

No dependency on the libtorch Torch-TensorRT runtime or libtorch is added.
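
The device-switch and end-of-execute() synchronization rules described in the commit message can be sketched as follows. This is an illustration under stated assumptions, not the PR's code: `DeviceApi`, `ScopedDeviceSwitch`, `must_sync_before_return`, and `FakeDevice` are hypothetical names, and the device calls are injected so the example runs without a GPU (in real code they would wrap cudaGetDevice and cudaSetDevice).

```cpp
#include <cassert>
#include <functional>

// Injectable device calls (assumption: stand-ins for cudaGetDevice and
// cudaSetDevice so the sketch runs without a GPU).
struct DeviceApi {
  std::function<int()> get_device;
  std::function<void(int)> set_device;
};

// Switches to `target` only if it differs from the current device, and
// switches back on scope exit only if it switched on entry.
class ScopedDeviceSwitch {
 public:
  ScopedDeviceSwitch(const DeviceApi& api, int target)
      : api_(api),
        previous_(api.get_device()),
        switched_(previous_ != target) {
    if (switched_) api_.set_device(target);
  }
  ~ScopedDeviceSwitch() {
    if (switched_) api_.set_device(previous_);
  }
  ScopedDeviceSwitch(const ScopedDeviceSwitch&) = delete;
  ScopedDeviceSwitch& operator=(const ScopedDeviceSwitch&) = delete;

 private:
  const DeviceApi& api_;
  int previous_;
  bool switched_;
};

// End-of-execute() policy as described: the final synchronize is skipped
// only when a guard is active and no output is staged through the host.
inline bool must_sync_before_return(bool guard_active,
                                    bool has_host_staged_outputs) {
  return !guard_active || has_host_staged_outputs;
}

// Test double that records device switches.
struct FakeDevice {
  int current = 0;
  int set_calls = 0;
  DeviceApi api() {
    return DeviceApi{[this] { return current; },
                     [this](int d) { current = d; ++set_calls; }};
  }
};

// Usage check: no call is made when the device already matches, and a
// real switch is undone on scope exit.
inline void demo_device_scope() {
  FakeDevice fake;
  DeviceApi api = fake.api();
  {
    ScopedDeviceSwitch same(api, 0);
    assert(fake.set_calls == 0);
  }
  {
    ScopedDeviceSwitch other(api, 1);
    assert(fake.current == 1);
  }
  assert(fake.current == 0 && fake.set_calls == 2);
}
```

Skipping the set call when the device already matches is what keeps a caller-established context untouched in this sketch; the sync predicate keeps the earlier "results ready on return" behavior for every case except a guarded call with device-resident outputs.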

meta-cla Bot added the cla signed label May 30, 2026

github-actions Bot added the component: api [C++] Issues re: C++ API label May 30, 2026

github-actions Bot requested a review from narendasan May 30, 2026 14:18

shoumikhin force-pushed the fix/et-trt-caller-cuda-stream branch from a17b517 to 283d1f3 Compare May 30, 2026 22:26

shoumikhin wants to merge 1 commit into pytorch:main from shoumikhin:fix/et-trt-caller-cuda-stream
