trtutils.core package
Submodules
Module contents
CUDA backend for TRTEngine.
This module provides the CUDA backend for the TRTEngine class. It provides utilities for managing device memory, copying data between host and device, and running inference on the engine.
Submodules
cache – Tools for managing cached TensorRT engines.
Classes
Binding – A class for managing a CUDA allocation.
CUDAGraph – Wrapper around CUDA graph capture and execution.
Device – Context manager for saving and restoring the current CUDA device.
TRTEngineInterface – An interface for the TRTEngine class.
Kernel – Wrapper around CUDA kernels.
Functions
allocate_bindings() – Allocate the bindings for a TensorRT engine.
allocate_pinned_memory() – Allocate pagelocked memory using CUDA.
allocate_managed_memory() – Allocate managed memory using CUDA.
create_binding() – Create a Binding from a np.ndarray.
create_context() – Create a CUDA context.
create_engine() – Create a TensorRT engine from a serialized engine file.
create_stream() – Create a CUDA stream.
cuda_call() – A function for checking the return status of a CUDA call.
cuda_malloc() – Allocate memory on the CUDA device using CUDA runtime.
destroy_context() – Destroy a CUDA context.
destroy_stream() – Destroy a CUDA stream.
memcpy_device_to_host() – Copy data from device to host.
memcpy_host_to_device() – Copy data from host to device.
memcpy_device_to_host_async() – Copy data from device to host async.
memcpy_host_to_device_async() – Copy data from host to device async.
memcpy_device_to_device() – Copy data from device to device.
memcpy_device_to_device_async() – Copy data from device to device async.
memcpy_host_to_device_offset() – Copy data from host to device with an offset.
memcpy_host_to_device_offset_async() – Copy data from host to device with an offset async.
stream_synchronize() – Synchronize the CUDA stream.
cuda_stream_begin_capture() – Begin capturing a CUDA graph on a stream.
cuda_stream_end_capture() – End capturing a CUDA graph and return the captured graph.
cuda_graph_instantiate() – Instantiate a CUDA graph executable.
cuda_graph_launch() – Launch a CUDA graph executable.
cuda_graph_destroy() – Destroy a CUDA graph.
cuda_graph_exec_destroy() – Destroy a CUDA graph executable.
nvrtc_call() – A function for checking the return status of a NVRTC call.
compile_kernel() – Compile a kernel using NVRTC.
load_kernel() – Load a CUDA module and kernel from PTX from NVRTC.
compile_and_load_kernel() – Compile and load a kernel using NVRTC.
launch_kernel() – Launch a CUDA kernel.
create_kernel_args() – Create the argument array for a kernel call.
get_compute_capability() – Get the compute capability (SM version) of a CUDA device.
get_device() – Get the current CUDA device.
get_device_count() – Get the number of CUDA devices available.
get_device_name() – Get the name of a CUDA device.
get_sm_arch() – Get the GPU architecture name from a compute capability version.
set_device() – Set the current CUDA device.
init_cuda() – Initialize CUDA.
cuda_free() – Free a CUDA device pointer.
cuda_host_free() – Free a CUDA host pointer.
allocate_to_device() – Allocate device memory for each numpy array and copy the data over.
free_device_ptrs() – Free a list of CUDA device pointers.
- class trtutils.core.Binding(index: int, name: str, dtype: dtype, shape: list[int], is_input: bool, allocation: int, host_allocation: ndarray, tensor_format: TensorFormat, pagelocked_mem: bool, unified_mem: bool)[source]¶
Bases: object
Small wrapper for a host/device allocation pair.
- tensor_format: TensorFormat¶
- class trtutils.core.CUDAGraph(stream: cudart.cudaStream_t)[source]¶
Bases: object
Wrapper around CUDA graph capture and execution.
- stop() bool[source]¶
End graph capture and instantiate the graph.
- Returns:
True if capture and instantiation succeeded, False otherwise.
- Return type:
- launch() None[source]¶
Launch the captured graph.
- Raises:
RuntimeError – If no graph has been captured.
- class trtutils.core.Device(device: int | None)[source]¶
Bases: object
Context manager that saves and restores the current CUDA device.
When device is None the guard is a no-op: __enter__ and __exit__ only check a single attribute, adding negligible overhead on the hot path.
Instances are reusable: engines store one as self._device_guard and enter/exit it on every execute() call.
- class trtutils.core.Kernel(kernel_file: Path | str, name: str, max_arg_cache: int = 1, *, verbose: bool | None = None)[source]¶
Bases: object
Holds kernel code and PTX for execution.
- create_args(*args: int | float | np.ndarray, verbose: bool | None = False) np.ndarray[source]¶
Create the argument pointer array for a CUDA kernel call.
This method wraps trtutils.core.create_kernel_args() and stores the intermediate pointer arrays inside the class. Otherwise, the garbage collector can clean up the intermediate arrays if the kernel does not access the memory fast enough.
- Parameters:
*args (int | float | np.ndarray) – All args to pass to the kernel as integers, floats, or pre-formed args. If arrays are to be passed to the kernel, they should be given as an integer representing the pointer returned from CUDA malloc. A preformed arg is one which is already wrapped as an np.ndarray with specific type.
verbose (bool, optional) – Whether or not to output additional information about the passed args.
- Returns:
The np.ndarray of argument pointers (one pointer per arg)
- Return type:
np.ndarray
- call(num_blocks: tuple[int, int, int], num_threads: tuple[int, int, int], stream: cudart.cudaStream_t, args: np.ndarray, *, verbose: bool | None = None) None[source]¶
Launch the kernel with the specified blocks, threads, and args in a stream.
- Parameters:
num_blocks (tuple[int, int, int]) – The number of blocks to use for the kernel calls.
num_threads (tuple[int, int, int]) – The number of threads to use for the kernel calls.
stream (cudart.cudaStream_t) – The CUDA stream to execute the kernel in.
args (np.ndarray) – The NumPy array containing the pointers to the arguments. This array should be 1D containing int64 pointers to a NumPy array containing each individual argument.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, will default to the overall engine's verbose setting.
- class trtutils.core.TRTEngineInterface(engine_path: Path | str, stream: cuda.cudaStream_t | None = None, dla_core: int | None = None, device: int | None = None, *, pagelocked_mem: bool | None = None, unified_mem: bool | None = None, no_warn: bool | None = None, verbose: bool | None = None)[source]¶
Bases: ABC
- property engine: trt.ICudaEngine¶
Access the raw TensorRT CUDA engine.
- property context: trt.IExecutionContext¶
Access the TensorRT execution context for the engine.
- property logger: trt.ILogger¶
Access the TensorRT logger used for the engine.
- property stream: cudart.cudaStream_t¶
Access the underlying CUDA stream.
- property input_spec: list[tuple[list[int], np.dtype]]¶
Get the specs for the input tensor of the network. Useful to prepare memory allocations.
- property batch_size: int¶
Get the batch size of the engine (first dim of first input).
- Returns:
The batch size. Returns -1 if dynamic, 1 if no inputs.
- Return type:
- property is_dynamic_batch: bool¶
Check if the engine has dynamic batch size (-1 in first dim).
- Returns:
True if the engine has dynamic batch size.
- Return type:
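The rules stated for batch_size and is_dynamic_batch can be written out as a small sketch. It works on a plain list shaped like input_spec (pairs of shape and dtype); the function names are hypothetical and the dtype is given as a string only to keep the example free of dependencies.

```python
def batch_size_from_spec(input_spec):
    """First dim of the first input; 1 if there are no inputs; -1 means dynamic."""
    if not input_spec:
        return 1
    shape, _dtype = input_spec[0]
    return shape[0]


def is_dynamic_batch_from_spec(input_spec):
    """A batch dimension of -1 marks a dynamic batch size."""
    return batch_size_from_spec(input_spec) == -1


static_spec = [([8, 3, 224, 224], "float32")]
dynamic_spec = [([-1, 3, 224, 224], "float32")]
```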
- property input_dtypes: list[np.dtype]¶
Get the datatypes for the input tensors of the network.
- Returns:
A list with the datatype of each input tensor.
- Return type:
list[np.dtype]
- property output_spec: list[tuple[list[int], np.dtype]]¶
Get the specs for the output tensor of the network. Useful to prepare memory allocations.
- property output_shapes: list[tuple[int, ...]]¶
Get the shapes for the output tensors of the network.
- property output_dtypes: list[np.dtype]¶
Get the datatypes for the output tensors of the network.
- Returns:
A list with the datatype of each output tensor.
- Return type:
list[np.dtype]
- abstract execute(data: list[np.ndarray], *, no_copy: bool | None = None, verbose: bool | None = None, debug: bool | None = None) list[np.ndarray][source]¶
Execute the network with the given inputs.
- Parameters:
data (list[np.ndarray]) – The inputs to the network.
no_copy (bool, optional) – If True, the outputs will not be copied out from the CUDA-allocated host memory. Instead, the host memory will be returned directly. This memory WILL BE OVERWRITTEN IN PLACE by future inferences.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, will default to the overall engine's verbose setting.
debug (bool, optional) – Enable intermediate stream synchronize for debugging.
- Returns:
The outputs of the network.
- Return type:
list[np.ndarray]
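Since outputs returned with no_copy=True are overwritten by later inferences, a caller that wants to keep results across calls has to snapshot them first. A minimal sketch, assuming `engine` is any object implementing the execute() signature above; `collect_outputs` is a hypothetical helper, not part of trtutils.

```python
def collect_outputs(engine, batches):
    """Run each batch with no_copy=True and copy the outputs before the next call."""
    results = []
    for data in batches:
        outputs = engine.execute(data, no_copy=True)
        # The returned buffers are reused by the next execute() call,
        # so take an explicit copy of anything that must survive it.
        results.append([out.copy() for out in outputs])
    return results
```

If only the most recent result is ever needed, the copy can be skipped, which is the case no_copy is meant for.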
- abstract direct_exec(pointers: list[int], *, no_warn: bool | None = None, verbose: bool | None = None, debug: bool | None = None) list[np.ndarray][source]¶
Execute the network with the given GPU memory pointers.
The outputs of this function are not copied on return; the data will be updated in place by any later call to execute or direct_exec. Passing invalid pointers will crash the CUDA runtime and, with it, the program.
- Parameters:
- Returns:
The outputs of the network.
- Return type:
list[np.ndarray]
- get_random_input(*, new: bool | None = None, verbose: bool | None = None) list[np.ndarray][source]¶
Generate a random input for the network.
- Parameters:
- Returns:
The random input to the network.
- Return type:
list[np.ndarray]
- mock_execute(data: list[np.ndarray] | None = None, *, verbose: bool | None = None, debug: bool | None = None) list[np.ndarray][source]¶
Perform a mock execution of the network.
This call is useful for warming up the network and for testing/benchmarking purposes.
- Parameters:
data (list[np.ndarray], optional) – The inputs to the network, by default None. If None, random inputs will be generated.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, will default to the overall engine's verbose setting.
debug (bool, optional) – Enable intermediate stream synchronize for debugging.
- Returns:
The outputs of the network.
- Return type:
list[np.ndarray]
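As a rough illustration of the warm-up and benchmarking use mentioned above, the sketch below times repeated mock_execute() calls after a warm-up phase. `benchmark` and its default counts are hypothetical, and it assumes mock_execute() returns only once the outputs are available.

```python
import time


def benchmark(engine, warmup=10, iterations=100):
    """Return the mean seconds per mock_execute() call after a warm-up phase."""
    for _ in range(warmup):
        engine.mock_execute()  # results discarded; this only warms up the engine
    start = time.perf_counter()
    for _ in range(iterations):
        engine.mock_execute()
    elapsed = time.perf_counter() - start
    return elapsed / iterations
```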
- trtutils.core.allocate_bindings(engine: trt.IEngine, context: trt.IExecutionContext, *, pagelocked_mem: bool | None = None, unified_mem: bool | None = None) tuple[list[Binding], list[Binding], list[int]][source]¶
Allocate memory for the input and output tensors of a TensorRT engine.
- Parameters:
engine (trt.IEngine) – The TensorRT engine to allocate memory for.
context (trt.IExecutionContext) – The execution context to use.
pagelocked_mem (bool, optional) – Whether or not to use pagelocked memory for host allocations. By default None, which means pagelocked memory will be used.
unified_mem (bool, optional) – Whether or not the system has unified memory. If True, use cudaHostAllocMapped to take advantage of unified memory. By default None, which means the default host allocation will be used.
- Returns:
A tuple containing the input bindings, output bindings, and gpu memory pointers.
- Return type:
- Raises:
RuntimeError – If no optimization profiles are found. If the profile shape is not correct.
ValueError – If no input tensors are found. If no output tensors are found. If no memory allocations are found
- trtutils.core.allocate_managed_memory(nbytes: int, stream: cudaStream_t | None = None) int[source]¶
Allocate managed memory.
- trtutils.core.allocate_pinned_memory(nbytes: int, dtype: dtype, shape: tuple[int, ...] | None = None, *, unified_mem: bool | None = None) ndarray[source]¶
Allocate pinned (page-locked) memory on the host, required for asynchronous memory transfers.
The pagelocked memory is returned as a 1D numpy array, so CPU-side reshaping is required for some applications. If shape is passed, the array takes that shape instead, but memory transfers may have complications.
- Parameters:
nbytes (int) – The number of bytes to allocate.
dtype (np.dtype) – The data type for the allocated memory.
shape (tuple[int, ...], optional) – An optional shape for the pagelocked memory array. If not provided, the array will be 1D.
unified_mem (bool, optional) – If True, use cudaHostAllocMapped to take advantage of unified memory.
- Returns:
A numpy array backed by pinned memory.
- Return type:
np.ndarray
- trtutils.core.allocate_to_device(data: list[ndarray]) list[int][source]¶
Allocate device memory for each numpy array and copy the data over.
- trtutils.core.compile_and_load_kernel(kernel_code: str, name: str, opts: list[str] | None = None, *, verbose: bool | None = None) tuple[CUmodule, CUkernel][source]¶
Compile and load a kernel from source definition.
- Parameters:
kernel_code (str) – The code definition of the kernel.
name (str) – The name of the kernel.
opts (list[str]) – The optional additional arguments to pass to NVRTC during the compilation of the kernel.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, will default to the overall engine's verbose setting.
- Returns:
The CUDA module and kernel
- Return type:
tuple[cuda.CUmodule, cuda.CUkernel]
- trtutils.core.compile_kernel(kernel: str, name: str, opts: list[str] | None = None, *, verbose: bool | None = None) chararray[source]¶
Compile a CUDA kernel into PTX using NVRTC.
- Parameters:
kernel (str) – The kernel definition in CUDA.
name (str) – The name of the kernel in the definition.
opts (list[str]) – The optional additional arguments to pass to NVRTC during the compilation of the kernel.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, will default to the overall engine's verbose setting.
- Returns:
The compiled PTX kernel and the kernel name.
- Return type:
- Raises:
RuntimeError – If the version of cuda-python installed does not match the version of CUDA installed.
- trtutils.core.create_binding(array: ~numpy.ndarray, bind_id: int = 0, name: str = 'binding', tensor_format: ~tensorrt_bindings.tensorrt.TensorFormat = <TensorFormat.LINEAR: 0>, *, use_array_data: bool | None = None, is_input: bool | None = None, pagelocked_mem: bool | None = None, unified_mem: bool | None = None) Binding[source]¶
Create a binding for a TensorRT engine.
- Parameters:
array (np.ndarray) – The array to use for the binding.
bind_id (int, optional) – The index of the binding.
name (str, optional) – The name of the binding.
tensor_format (trt.TensorFormat, optional) – The format of the tensor.
use_array_data (bool, optional) – Whether to use the data from the array for the binding. By default None, which means the data will not be copied.
is_input (bool, optional) – Whether the binding is an input or output.
pagelocked_mem (bool, optional) – Whether or not to use pagelocked memory for host allocations. By default None, which means pagelocked memory will be used.
unified_mem (bool, optional) – Whether or not the system has unified memory. If True, use cudaHostAllocMapped to take advantage of unified memory.
- Returns:
The binding for the host/device memory.
- Return type:
- trtutils.core.create_context(device: int = 0) CUcontext[source]¶
Create a CUDA context.
- Parameters:
device (int) – The device to make a context for. By default 0.
- Returns:
The created CUDA context
- Return type:
cuda.CUcontext
- trtutils.core.create_engine(engine_path: Path | str, stream: cudart.cudaStream_t | None = None, dla_core: int | None = None, device: int | None = None, *, no_warn: bool | None = None) tuple[trt.ICudaEngine, trt.IExecutionContext, trt.ILogger, cudart.cudaStream_t][source]¶
Load a serialized engine from disk.
- Parameters:
engine_path (Path | str) – The path to the serialized engine file.
stream (cudart.cudaStream_t, optional) – When an already made stream is passed, no new stream is created. Useful if you want multiple engines to share the same stream. Although there is no explicit link between engine and stream, the stream returned by this function should be used for execution.
dla_core (int, optional) – The DLA core to assign DLA layers of the engine to. Default is None. If None, any DLA layers will be assigned to DLA core 0.
device (int, optional) – The CUDA device index to create the engine on. Default is None, which uses the current device.
no_warn (bool | None, optional) – If True, suppresses warnings from TensorRT. Default is None.
- Returns:
The deserialized engine, execution context, logger used, and stream created. Logger returned is the same as the input logger if not None.
- Return type:
tuple[trt.ICudaEngine, trt.IExecutionContext, trt.ILogger, cudart.cudaStream_t]
- Raises:
FileNotFoundError – If the engine file is not found.
RuntimeError – If the TRT runtime could not be created. If the engine could not be deserialized. If the execution context could not be created.
- trtutils.core.create_kernel_args(*args: int | float | ndarray, verbose: bool | None = False) tuple[ndarray, list[ndarray]][source]¶
Create the argument pointer array for a CUDA kernel call.
Adapted from the workflow present in: https://nvidia.github.io/cuda-python/overview.html#cuda-python-workflow
This MUST be called for each kernel call. If the args are not regenerated, the CUDA runtime will crash.
The intermediate argument buffers MUST be saved in a variable to ensure the garbage collector does not delete them before use. The Kernel wrapper class handles this and is the recommended way to interact with kernels inside of trtutils.
- Parameters:
*args (int | float | np.ndarray) – All args to pass to the kernel as integers, floats, or pre-formed args. If arrays are to be passed to the kernel, they should be given as an integer representing the pointer returned from CUDA malloc. A preformed arg is one which is already wrapped as an np.ndarray with specific type.
verbose (bool, optional) – Whether or not to output additional information about the passed args.
- Returns:
The np.ndarray of argument pointers (one pointer per arg), and the allocated arrays
- Return type:
- Raises:
TypeError – If the type of an argument is not integer or float
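The pointer-array idea and the lifetime requirement can be illustrated without NumPy or a GPU. The sketch below uses ctypes from the standard library in place of np.ndarray; `make_kernel_args` is a hypothetical name, and the choice of a 64-bit unsigned integer for ints and a 32-bit float for floats is an assumption, not the library's documented mapping.

```python
import ctypes


def make_kernel_args(*args):
    """Wrap each argument in its own buffer and build an array of pointers to them."""
    buffers = []
    for arg in args:
        if isinstance(arg, int):
            buffers.append(ctypes.c_uint64(arg))  # e.g. a device pointer or a size
        elif isinstance(arg, float):
            buffers.append(ctypes.c_float(arg))
        else:
            raise TypeError(f"unsupported argument type: {type(arg).__name__}")
    addresses = [ctypes.addressof(buf) for buf in buffers]
    pointers = (ctypes.c_void_p * len(buffers))(*addresses)
    # Both values must be kept alive by the caller: the pointer array only
    # stores addresses, so dropping `buffers` would leave it dangling.
    return pointers, buffers


pointers, buffers = make_kernel_args(1234, 0.5)
```

Returning the buffers alongside the pointer array mirrors the two-element return value documented above and makes the ownership explicit.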
- trtutils.core.create_stream() cudaStream_t[source]¶
Create a CUDA Stream.
- Returns:
The CUDA stream.
- Return type:
cudart.cudaStream_t
- trtutils.core.cuda_call(call: tuple[CUresult | cudaError_t, T]) T[source]¶
Call a CUDA function and check for errors.
- Parameters:
call (tuple[cuda.CUresult | cudart.cudaError_t, T]) – The CUDA function to call and its arguments.
- Returns:
The result of the CUDA function call.
- Return type:
T
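cuda-python functions return a tuple whose first element is the status code, followed by any results. A minimal sketch of the check-and-unwrap pattern follows; `FakeStatus` and `check_call` are hypothetical stand-ins so the example runs without CUDA, and the real cuda_call() may format errors or unwrap results differently.

```python
from enum import IntEnum


class FakeStatus(IntEnum):
    """Stand-in for a CUDA status enum (assumption: 0 means success)."""

    SUCCESS = 0
    INVALID_VALUE = 1


def check_call(call):
    """Raise on a non-zero status, otherwise return the remaining result(s)."""
    err, *results = call
    if int(err) != 0:
        raise RuntimeError(f"CUDA call failed: {err!r}")
    if not results:
        return None
    if len(results) == 1:
        return results[0]
    return tuple(results)


value = check_call((FakeStatus.SUCCESS, 42))
```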
- trtutils.core.cuda_free(device_ptr: int) None[source]¶
Free a CUDA device pointer.
- Parameters:
device_ptr (int) – The device pointer to free.
- trtutils.core.cuda_graph_destroy(graph: cudaGraph_t) None[source]¶
Destroy a CUDA graph.
- Parameters:
graph (cudart.cudaGraph_t) – The CUDA graph to destroy.
- trtutils.core.cuda_graph_exec_destroy(graph_exec: cudaGraphExec_t) None[source]¶
Destroy a CUDA graph executable.
- Parameters:
graph_exec (cudart.cudaGraphExec_t) – The graph executable to destroy.
- trtutils.core.cuda_graph_instantiate(graph: cudaGraph_t, flags: int = 0) cudaGraphExec_t[source]¶
Instantiate a CUDA graph executable.
- Parameters:
graph (cudart.cudaGraph_t) – The CUDA graph to instantiate.
flags (int, optional) – Flags for graph instantiation. Default is 0.
- Returns:
The instantiated graph executable.
- Return type:
cudart.cudaGraphExec_t
- trtutils.core.cuda_graph_launch(graph_exec: cudaGraphExec_t, stream: cudaStream_t) None[source]¶
Launch a CUDA graph executable.
- Parameters:
graph_exec (cudart.cudaGraphExec_t) – The graph executable to launch.
stream (cudart.cudaStream_t) – The CUDA stream to launch on.
- trtutils.core.cuda_host_free(host_ptr: int | ndarray) None[source]¶
Free a CUDA host pointer.
- Parameters:
host_ptr (int | np.ndarray) – The host pointer to free.
- trtutils.core.cuda_malloc(nbytes: int) int[source]¶
Perform a memory allocation using cudart.cudaMalloc.
- trtutils.core.cuda_stream_begin_capture(stream: cudaStream_t, mode: cudaStreamCaptureMode | None = None) None[source]¶
Begin capturing a CUDA graph on the given stream.
- Parameters:
stream (cudart.cudaStream_t) – The CUDA stream to begin capture on.
mode (cudart.cudaStreamCaptureMode, optional) – The capture mode to use. Default is ThreadLocal, which only checks CUDA calls from the capturing thread. Global mode would cause any uncapturable call in any thread to fail during capture.
- trtutils.core.cuda_stream_end_capture(stream: cudaStream_t) cudaGraph_t[source]¶
End capturing a CUDA graph and return the captured graph.
- Parameters:
stream (cudart.cudaStream_t) – The CUDA stream to end capture on.
- Returns:
The captured CUDA graph.
- Return type:
cudart.cudaGraph_t
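The graph functions documented in this module fit together as capture, instantiate, launch, destroy. The sketch below shows one plausible ordering. It takes the module as a `core` parameter (for example `trtutils.core`) so the flow can be read, and exercised with a stub, without a GPU. `capture_and_replay` and `enqueue_work` are hypothetical names, and calling `stream_synchronize(stream)` with the stream as its only argument is an assumption.

```python
def capture_and_replay(core, stream, enqueue_work, replays=3):
    """Capture the work enqueued on `stream` into a graph and launch it `replays` times."""
    core.cuda_stream_begin_capture(stream)
    enqueue_work(stream)  # asynchronous work issued on `stream` is recorded, not run
    graph = core.cuda_stream_end_capture(stream)
    graph_exec = core.cuda_graph_instantiate(graph)
    try:
        for _ in range(replays):
            core.cuda_graph_launch(graph_exec, stream)
        # Wait for the launches to finish before releasing the graph objects.
        core.stream_synchronize(stream)
    finally:
        core.cuda_graph_exec_destroy(graph_exec)
        core.cuda_graph_destroy(graph)
```

The sketch does not handle an exception raised by `enqueue_work` while capture is active; production code would need to end the capture in that case as well.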
- trtutils.core.destroy_context(context: CUcontext) None[source]¶
Destroy a CUDA context.
- Parameters:
context (cuda.CUcontext) – The CUDA context to destroy.
- trtutils.core.destroy_stream(stream: cudaStream_t) None[source]¶
Destroy a CUDA Stream.
- Parameters:
stream (cudart.cudaStream_t) – The CUDA stream to destroy.
- trtutils.core.get_compute_capability(device: int = 0) tuple[int, int][source]¶
Get the compute capability (SM version) of a CUDA device.
- trtutils.core.get_device() int[source]¶
Get the current CUDA device.
- Returns:
The current CUDA device index.
- Return type:
- trtutils.core.get_device_count() int[source]¶
Get the number of CUDA devices available.
- Returns:
The number of CUDA devices.
- Return type:
- trtutils.core.get_engine_names(engine: ICudaEngine) tuple[list[str], list[str]][source]¶
Get the input/output names of a TensorRT engine in order.
- trtutils.core.get_sm_arch(major: int, minor: int) str[source]¶
Get the GPU architecture name from a compute capability version.
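The return format of get_sm_arch() is not spelled out here, so the following is only an illustration of the kind of mapping involved. `arch_family` and `sm_tag` are hypothetical helpers: the first maps a compute capability to NVIDIA's architecture family name, and the second builds the `sm_XY` tag commonly used in compiler options.

```python
def arch_family(major, minor):
    """Map a compute capability to its architecture family name."""
    if major == 7:
        return "Turing" if minor == 5 else "Volta"
    if major == 8:
        return "Ada Lovelace" if minor == 9 else "Ampere"
    # Families not listed here fall through to "Unknown" in this sketch.
    return {5: "Maxwell", 6: "Pascal", 9: "Hopper"}.get(major, "Unknown")


def sm_tag(major, minor):
    """Build the sm_XY tag for a compute capability, e.g. (8, 6) gives sm_86."""
    return f"sm_{major}{minor}"
```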
- trtutils.core.launch_kernel(kernel: cuda.CUkernel, num_blocks: tuple[int, int, int], num_threads: tuple[int, int, int], stream: cudart.cudaStream_t, args: np.ndarray) None[source]¶
Launch a CUDA kernel with specified blocks, threads, and args in a stream.
- Parameters:
kernel (cuda.CUkernel) – The CUDA kernel as compiled by NVRTC using the compile_kernel function.
num_blocks (tuple[int, int, int]) – The number of blocks to use for the kernel call.
num_threads (tuple[int, int, int]) – The number of threads to use for the kernel call.
stream (cudart.cudaStream_t) – The CUDA stream to execute the kernel in.
args (np.ndarray) – The NumPy array containing the pointers to the arguments. This array should be 1D containing int64 pointers to a NumPy array containing each individual argument.
- trtutils.core.load_kernel(kernel_ptx: chararray, name: str, *, verbose: bool | None = None) tuple[CUmodule, CUkernel][source]¶
Load a kernel from a PTX definition.
- Parameters:
kernel_ptx (np.char.chararray) – The PTX generated by NVRTC, use the compile_kernel function.
name (str) – The name of the kernel inside the PTX definition.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, will default to the overall engine's verbose setting.
- Returns:
The CUDA module and kernel
- Return type:
tuple[cuda.CUmodule, cuda.CUkernel]
- trtutils.core.memcpy_device_to_device(dst_ptr: int, src_ptr: int, nbytes: int) None[source]¶
Copy from one device pointer to another with error checking.
- trtutils.core.memcpy_device_to_device_async(dst_ptr: int, src_ptr: int, nbytes: int, stream: cudaStream_t) None[source]¶
Copy from one device pointer to another asynchronously.
- trtutils.core.memcpy_device_to_host(host_arr: ndarray, device_ptr: int) None[source]¶
Copy a device pointer to a numpy array with error checking.
- Parameters:
host_arr (np.ndarray) – The numpy array to copy to.
device_ptr (int) – The device pointer to copy.
- trtutils.core.memcpy_device_to_host_async(host_arr: ndarray, device_ptr: int, stream: cudaStream_t) None[source]¶
Copy a device pointer to a numpy array asynchronously with error checking.
- Parameters:
host_arr (np.ndarray) – The numpy array to copy to.
device_ptr (int) – The device pointer to copy.
stream (cudart.cudaStream_t) – The stream to utilize.
- trtutils.core.memcpy_host_to_device(device_ptr: int, host_arr: ndarray) None[source]¶
Copy a numpy array to a device pointer with error checking.
- Parameters:
device_ptr (int) – The device pointer to copy to.
host_arr (np.ndarray) – The numpy array to copy.
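Note the argument order: both synchronous copies take the destination first. A minimal round trip using only functions documented in this module might look like the sketch below. The module is passed in as `core` (for example `trtutils.core`) so the flow can be exercised with a stub; `roundtrip` is a hypothetical helper, and the host buffers are assumed to expose `nbytes` as NumPy arrays do.

```python
def roundtrip(core, host_in, host_out):
    """Copy host_in to a fresh device allocation and back into host_out."""
    device_ptr = core.cuda_malloc(host_in.nbytes)
    try:
        core.memcpy_host_to_device(device_ptr, host_in)   # destination first
        core.memcpy_device_to_host(host_out, device_ptr)  # destination first
    finally:
        core.cuda_free(device_ptr)  # release the allocation even on failure
    return host_out
```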
- trtutils.core.memcpy_host_to_device_async(device_ptr: int, host_arr: ndarray, stream: cudaStream_t) None[source]¶
Copy a numpy array to a device pointer asynchronously with error checking.
- Parameters:
device_ptr (int) – The device pointer to copy to.
host_arr (np.ndarray) – The numpy array to copy.
stream (cudart.cudaStream_t) – The stream to utilize.
- trtutils.core.memcpy_host_to_device_offset(device_ptr: int, host_arr: ndarray, offset_bytes: int) None[source]¶
Copy a numpy array to a device pointer at a specific offset.
- trtutils.core.memcpy_host_to_device_offset_async(device_ptr: int, host_arr: ndarray, offset_bytes: int, stream: cudaStream_t) None[source]¶
Copy a numpy array to a device pointer at a specific offset asynchronously.
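What offset_bytes means can be shown with a host-side model: a bytearray stands in for one device allocation, and each copy lands at a byte offset from its start. `copy_with_offset` is a hypothetical helper, and the bounds check is an addition for the sketch; whether the library functions validate the range is not stated here.

```python
def copy_with_offset(device_buf, host_bytes, offset_bytes):
    """Write host_bytes into device_buf starting offset_bytes from its beginning."""
    end = offset_bytes + len(host_bytes)
    if offset_bytes < 0 or end > len(device_buf):
        raise ValueError("copy would fall outside the allocation")
    device_buf[offset_bytes:end] = host_bytes


# Pack three 4-byte items back to back into one 12-byte allocation.
device_buf = bytearray(12)
items = [b"aaaa", b"bbbb", b"cccc"]
for index, item in enumerate(items):
    copy_with_offset(device_buf, item, index * len(item))
```

This is the typical use of an offset copy: filling one large allocation, such as a batched input, from several smaller host arrays.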
- trtutils.core.nvrtc_call(call: tuple[nvrtcResult, T]) T[source]¶
Call a NVRTC function and check for errors.
- Parameters:
call (tuple[nvrtc.nvrtcResult, T]) – The NVRTC function to call and its arguments.
- Returns:
The result of the NVRTC function call.
- Return type:
T