trtutils.core package¶

Submodules¶

trtutils.core.cache module

Module contents¶

CUDA backend for TRTEngine.

This module provides the CUDA backend for the TRTEngine class. It provides utilities for managing device memory, copying data between host and device, and running inference on the engine.

Submodules¶

cache: Tools for managing cached TensorRT engines.

Classes¶

Binding: A class for managing a CUDA allocation.
CUDAGraph: Wrapper around CUDA graph capture and execution.
Device: Context manager for saving and restoring the current CUDA device.
TRTEngineInterface: An interface for the TRTEngine class.
Kernel: Wrapper around CUDA kernels.

Functions¶

allocate_bindings(): Allocate the bindings for a TensorRT engine.
allocate_pinned_memory(): Allocate pagelocked memory using CUDA.
allocate_managed_memory(): Allocate managed memory using CUDA.
create_binding(): Create a Binding from a np.ndarray.
create_context(): Create a CUDA context.
create_engine(): Create a TensorRT engine from a serialized engine file.
create_stream(): Create a CUDA stream.
cuda_call(): A function for checking the return status of a CUDA call.
cuda_malloc(): Allocate memory on the CUDA device using CUDA runtime.
destroy_context(): Destroy a CUDA context.
destroy_stream(): Destroy a CUDA stream.
memcpy_device_to_host(): Copy data from device to host.
memcpy_host_to_device(): Copy data from host to device.
memcpy_device_to_host_async(): Copy data from device to host async.
memcpy_host_to_device_async(): Copy data from host to device async.
memcpy_device_to_device(): Copy data from device to device.
memcpy_device_to_device_async(): Copy data from device to device async.
memcpy_host_to_device_offset(): Copy data from host to device with an offset.
memcpy_host_to_device_offset_async(): Copy data from host to device with an offset async.
stream_synchronize(): Synchronize the CUDA stream.
cuda_stream_begin_capture(): Begin capturing a CUDA graph on a stream.
cuda_stream_end_capture(): End capturing a CUDA graph and return the captured graph.
cuda_graph_instantiate(): Instantiate a CUDA graph executable.
cuda_graph_launch(): Launch a CUDA graph executable.
cuda_graph_destroy(): Destroy a CUDA graph.
cuda_graph_exec_destroy(): Destroy a CUDA graph executable.
nvrtc_call(): A function for checking the return status of an NVRTC call.
compile_kernel(): Compile a kernel using NVRTC.
load_kernel(): Load a CUDA module and kernel from PTX from NVRTC.
compile_and_load_kernel(): Compile and load a kernel using NVRTC.
launch_kernel(): Launch a CUDA kernel.
create_kernel_args(): Create the argument array for a kernel call.
get_compute_capability(): Get the compute capability (SM version) of a CUDA device.
get_device(): Get the current CUDA device.
get_device_count(): Get the number of CUDA devices available.
get_device_name(): Get the name of a CUDA device.
get_sm_arch(): Get the GPU architecture name from a compute capability version.
set_device(): Set the current CUDA device.
init_cuda(): Initialize CUDA.
cuda_free(): Free a CUDA device pointer.
cuda_host_free(): Free a CUDA host pointer.
allocate_to_device(): Allocate device memory for each numpy array and copy the data over.
free_device_ptrs(): Free a list of CUDA device pointers.

class trtutils.core.Binding(index: int, name: str, dtype: dtype, shape: list[int], is_input: bool, allocation: int, host_allocation: ndarray, tensor_format: TensorFormat, pagelocked_mem: bool, unified_mem: bool)[source]¶

Bases: object

Small wrapper for a host/device allocation pair.

index: int¶

name: str¶

dtype: dtype¶

shape: list[int]¶

is_input: bool¶

allocation: int¶

host_allocation: ndarray¶

tensor_format: TensorFormat¶

pagelocked_mem: bool¶

unified_mem: bool¶

free() → None[source]¶: Free the memory of the binding.

class trtutils.core.CUDAGraph(stream: cudart.cudaStream_t)[source]¶

Bases: object

Wrapper around CUDA graph capture and execution.

start() → None[source]¶

Begin graph capture.

This should be called before the operations to capture.

stop() → bool[source]¶

End graph capture and instantiate the graph.

Returns:: True if capture and instantiation succeeded, False otherwise.
Return type:: bool

launch() → None[source]¶

Launch the captured graph.

Raises:: RuntimeError – If no graph has been captured.

invalidate() → None[source]¶: Destroy the graph and graph executable, resetting state.

property is_captured: bool¶

Check if a graph has been captured.

Returns:: True if a graph has been captured, False otherwise.
Return type:: bool

class trtutils.core.Device(device: int | None)[source]¶

Bases: object

Context manager that saves and restores the current CUDA device.

When device is None the guard is a no-op: __enter__ and __exit__ only check a single attribute, adding negligible overhead on the hot path.

Instances are reusable — engines store one as self._device_guard and enter/exit it on every execute() call.
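The save/restore behaviour described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the library's implementation: `fake_get_device`, `fake_set_device`, and `DeviceGuard` are hypothetical stand-ins so the example runs without a GPU.

```python
from __future__ import annotations

# Hypothetical stand-in for the CUDA runtime's "current device" state.
_current_device = 0


def fake_get_device() -> int:
    return _current_device


def fake_set_device(device: int) -> None:
    global _current_device
    _current_device = device


class DeviceGuard:
    """Illustrative guard: switch device on enter, restore it on exit."""

    def __init__(self, device: int | None) -> None:
        self._device = device
        self._previous: int | None = None

    def __enter__(self) -> DeviceGuard:
        if self._device is not None:  # no-op path when device is None
            self._previous = fake_get_device()
            fake_set_device(self._device)
        return self

    def __exit__(self, *exc: object) -> None:
        if self._device is not None and self._previous is not None:
            fake_set_device(self._previous)


guard = DeviceGuard(1)
with guard:
    inside = fake_get_device()        # device 1 is active inside the block
after = fake_get_device()             # the previous device is restored

with guard:                           # the same instance can be entered again
    inside_again = fake_get_device()
```

Because the guard keeps no state beyond the previously active device, one instance can be stored and re-entered on every call, which matches the reuse pattern described above.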

class trtutils.core.Kernel(kernel_file: Path | str, name: str, max_arg_cache: int = 1, *, verbose: bool | None = None)[source]¶

Bases: object

Holds kernel code and PTX for execution.

free() → None[source]¶: Free the memory of the loaded kernel.

create_args(*args: int | float | np.ndarray, verbose: bool | None = False) → np.ndarray[source]¶

Create the argument pointer array for a CUDA kernel call.

This is a wrapper around trtutils.core.create_kernel_args() that stores the intermediate pointer arrays inside the class. Without a stored reference, the garbage collector can reclaim the intermediate arrays before the kernel has read them.

Parameters:

*args (int | float | np.ndarray) – All args to pass to the kernel as integers, floats, or pre-formed args. If arrays are to be passed to the kernel, they should be given as an integer representing the pointer returned from CUDA malloc. A preformed arg is one which is already wrapped as an np.ndarray with specific type.
verbose (bool, optional) – Whether or not to output additional information about the passed args.

Returns:

The np.ndarray of argument pointers (one pointer per arg)

Return type:

np.ndarray

call(num_blocks: tuple[int, int, int], num_threads: tuple[int, int, int], stream: cudart.cudaStream_t, args: np.ndarray, *, verbose: bool | None = None) → None[source]¶

Launch the kernel with the specified blocks, threads, and args in a stream.

Parameters:

num_blocks (tuple[int, int, int]) – The number of blocks to use for the kernel calls.
num_threads (tuple[int, int, int]) – The number of threads to use for the kernel calls.
stream (cudart.cudaStream_t) – The CUDA stream to execute the kernel in.
args (np.ndarray) – The NumPy array containing the pointers to the arguments. This array should be 1D containing int64 pointers to a NumPy array containing each individual argument.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the overall engine's verbose setting.

class trtutils.core.TRTEngineInterface¶

Bases: ABC

An interface for the TRTEngine class.

property name: str¶: The name of the engine, as the stem of the Path.

property engine: trt.ICudaEngine¶: Access the raw TensorRT CUDA engine.

property context: trt.IExecutionContext¶: Access the TensorRT execution context for the engine.

property logger: trt.ILogger¶: Access the TensorRT logger used for the engine.

property stream: cudart.cudaStream_t¶: Access the underlying CUDA stream.

property memsize: int¶: The size of the engine in bytes.

property dla_core: int | None¶: The DLA core assigned to the engine.

property device: int | None¶: The CUDA device assigned to the engine.

property pagelocked_mem: bool¶: Whether or not the system has pagelocked memory.

property unified_mem: bool¶: Whether or not the system has unified memory.

property input_spec: list[tuple[list[int], np.dtype]]¶

Get the specs for the input tensors of the network. Useful for preparing memory allocations.

Returns:: A list with two items per element, the shape and (numpy) datatype of each input tensor.
Return type:: list[tuple[list[int], np.dtype]]

property input_shapes: list[tuple[int, ...]]¶

Get the shapes for the input tensors of the network.

Returns:: A list with the shape of each input tensor.
Return type:: list[tuple[int, ...]]

property batch_size: int¶

Get the batch size of the engine (first dim of first input).

Returns:: The batch size. Returns -1 if dynamic, 1 if no inputs.
Return type:: int

property is_dynamic_batch: bool¶

Check if the engine has dynamic batch size (-1 in first dim).

Returns:: True if the engine has dynamic batch size.
Return type:: bool
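The rule stated for batch_size and is_dynamic_batch can be written out as a small sketch. The helper names below are hypothetical and operate on plain shape tuples rather than a real engine. Returning 1 for an empty (scalar) shape is an added assumption.

```python
from __future__ import annotations


def batch_size_from_shapes(input_shapes: list[tuple[int, ...]]) -> int:
    """Return the first dim of the first input, or 1 when there are no inputs."""
    if not input_shapes or not input_shapes[0]:
        # No inputs (or, by assumption, a scalar input): fall back to 1.
        return 1
    return input_shapes[0][0]


def has_dynamic_batch(input_shapes: list[tuple[int, ...]]) -> bool:
    """A dynamic batch dimension is reported as -1."""
    return batch_size_from_shapes(input_shapes) == -1


fixed = batch_size_from_shapes([(8, 3, 224, 224)])
dynamic = batch_size_from_shapes([(-1, 3, 224, 224)])
no_inputs = batch_size_from_shapes([])
```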

property input_dtypes: list[np.dtype]¶

Get the datatypes for the input tensors of the network.

Returns:: A list with the datatype of each input tensor.
Return type:: list[np.dtype]

property input_names: list[str]¶

Get the names of the input tensors of the network.

Returns:: A list with the name of each input tensor.
Return type:: list[str]

property output_spec: list[tuple[list[int], np.dtype]]¶

Get the specs for the output tensors of the network. Useful for preparing memory allocations.

Returns:: A list with two items per element, the shape and (numpy) datatype of each output tensor.
Return type:: list[tuple[list[int], np.dtype]]

property output_shapes: list[tuple[int, ...]]¶

Get the shapes for the output tensors of the network.

Returns:: A list with the shape of each output tensor.
Return type:: list[tuple[int, ...]]

property output_dtypes: list[np.dtype]¶

Get the datatypes for the output tensors of the network.

Returns:: A list with the datatype of each output tensor.
Return type:: list[np.dtype]

property output_names: list[str]¶

Get the names of the output tensors of the network.

Returns:: A list with the name of each output tensor.
Return type:: list[str]

property input_bindings: list[Binding]¶

Get the input bindings.

Returns:: The input bindings.
Return type:: list[Binding]

property output_bindings: list[Binding]¶

Get the output bindings.

Returns:: The output bindings.
Return type:: list[Binding]

abstract execute(data: list[np.ndarray], *, no_copy: bool | None = None, verbose: bool | None = None, debug: bool | None = None) → list[np.ndarray][source]¶

Execute the network with the given inputs.

Parameters:

data (list[np.ndarray]) – The inputs to the network.
no_copy (bool, optional) – If True, the outputs will not be copied out of the CUDA-allocated host memory. Instead, the host memory will be returned directly. This memory WILL BE OVERWRITTEN IN PLACE by future inferences.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the overall engine's verbose setting.
debug (bool, optional) – Enable intermediate stream synchronize for debugging.

Returns:

The outputs of the network.

Return type:

list[np.ndarray]
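The consequence of no_copy is easiest to see with a toy model. The sketch below assumes a hypothetical `FakeEngine` that reuses one host buffer across calls and uses plain Python lists in place of NumPy arrays. It illustrates the aliasing hazard only and is not the engine's actual code.

```python
class FakeEngine:
    """Hypothetical stand-in with one reusable host output buffer."""

    def __init__(self) -> None:
        self._host_output = [0.0, 0.0]

    def execute(self, data, *, no_copy=False):
        # Pretend inference: write results into the reused host buffer.
        for i, value in enumerate(data):
            self._host_output[i] = value * 2
        # no_copy hands back the buffer itself; otherwise return a copy.
        return self._host_output if no_copy else list(self._host_output)


engine = FakeEngine()
kept = engine.execute([1.0, 2.0])                    # independent copy
aliased = engine.execute([1.0, 2.0], no_copy=True)   # the internal buffer
engine.execute([10.0, 20.0])                         # next inference overwrites it
```

After the third call, `kept` still holds the first result while `aliased` reflects the latest inference, so results obtained with no_copy should be consumed or copied before the next execution.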

abstract direct_exec(pointers: list[int], *, no_warn: bool | None = None, verbose: bool | None = None, debug: bool | None = None) → list[np.ndarray][source]¶

Execute the network with the given GPU memory pointers.

The outputs of this function are not copied on return, so the returned data will be overwritten in place by the next call to execute or direct_exec. Passing invalid pointers to this method will crash the CUDA runtime and, with it, the program.

Parameters:

pointers (list[int]) – The GPU memory pointers for the inputs to the network.
no_warn (bool, optional) – If True, do not warn about usage.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the overall engine's verbose setting.
debug (bool, optional) – Enable intermediate stream synchronize for debugging.

Returns:

The outputs of the network.

Return type:

list[np.ndarray]

get_random_input(*, new: bool | None = None, verbose: bool | None = None) → list[np.ndarray][source]¶

Generate a random input for the network.

Parameters:

new (bool, optional) – Whether or not to generate new input. By default None/False.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the overall engine's verbose setting.

Returns:

The random input to the network.

Return type:

list[np.ndarray]

mock_execute(data: list[np.ndarray] | None = None, *, verbose: bool | None = None, debug: bool | None = None) → list[np.ndarray][source]¶

Perform a mock execution of the network.

This call is useful for warming up the network and for testing/benchmarking purposes.

Parameters:

data (list[np.ndarray], optional) – The inputs to the network. By default None, in which case random inputs will be generated.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the overall engine's verbose setting.
debug (bool, optional) – Enable intermediate stream synchronize for debugging.

Returns:

The outputs of the network.

Return type:

list[np.ndarray]

warmup(iterations: int, *, verbose: bool | None = None, debug: bool | None = None) → None[source]¶

Warmup the network for a given number of iterations.

Parameters:

iterations (int) – The number of iterations to warmup the network.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the overall engine's verbose setting.
debug (bool, optional) – Enable intermediate stream synchronize for debugging.

trtutils.core.allocate_bindings(engine: trt.IEngine, context: trt.IExecutionContext, *, pagelocked_mem: bool | None = None, unified_mem: bool | None = None) → tuple[list[Binding], list[Binding], list[int]][source]¶

Allocate memory for the input and output tensors of a TensorRT engine.

Parameters:

engine (trt.IEngine) – The TensorRT engine to allocate memory for.
context (trt.IExecutionContext) – The execution context to use.
pagelocked_mem (bool, optional) – Whether or not to use pagelocked memory for host allocations. By default None, which means pagelocked memory will be used.
unified_mem (bool, optional) – Whether or not the system has unified memory. If True, use cudaHostAllocMapped to take advantage of unified memory. By default None, which means the default host allocation will be used.

Returns:

A tuple containing the input bindings, output bindings, and gpu memory pointers.

Return type:

tuple[list[Binding], list[Binding], list[int]]

Raises:

RuntimeError – If no optimization profiles are found. If the profile shape is not correct.
ValueError – If no input tensors are found. If no output tensors are found. If no memory allocations are found.

trtutils.core.allocate_managed_memory(nbytes: int, stream: cudaStream_t | None = None) → int[source]¶

Allocate managed memory.

Parameters:

nbytes (int) – The number of bytes to allocate.
stream (cudart.cudaStream_t, optional) – The stream to utilize.

Returns:

The pointer to the allocated memory.

Return type:

int

trtutils.core.allocate_pinned_memory(nbytes: int, dtype: dtype, shape: tuple[int, ...] | None = None, *, unified_mem: bool | None = None) → ndarray[source]¶

Allocate pinned (page-locked) memory on the host, required for asynchronous memory transfers.

By default the pagelocked memory is returned as a 1D numpy array, so CPU-side reshaping is required for some applications. If shape is passed, the array takes that shape instead of being 1D, but memory transfers may have complications.

Parameters:

nbytes (int) – The number of bytes to allocate.
dtype (np.dtype) – The data type for the allocated memory.
shape (tuple[int, ...], optional) – An optional shape for the pagelocked memory array. If not provided, the array will be 1D.
unified_mem (bool, optional) – If True, use cudaHostAllocMapped to take advantage of unified memory.

Returns:

A numpy array backed by pinned memory.

Return type:

np.ndarray
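Since nbytes and dtype are passed separately, the byte count has to agree with the intended shape. A minimal sketch of that arithmetic follows. It uses only the standard library: `nbytes_for` is a hypothetical helper, and the struct code `"=f"` stands in for a 4-byte float32 dtype.

```python
import math
import struct


def nbytes_for(shape, type_code):
    """Bytes needed for an array of `shape` whose elements use `type_code`."""
    itemsize = struct.calcsize(type_code)  # "=f" is a 4-byte float
    return math.prod(shape) * itemsize


shape = (1, 3, 224, 224)
nbytes = nbytes_for(shape, "=f")
flat_length = nbytes // struct.calcsize("=f")  # length of the default 1D array
```

Here `flat_length` is the element count of the default 1D array, which is what a CPU-side reshape back to `shape` would need.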

trtutils.core.allocate_to_device(data: list[ndarray]) → list[int][source]¶

Allocate device memory for each numpy array and copy the data over.

Parameters:: data (list[np.ndarray]) – The numpy arrays to copy.
Returns:: The device pointers to the allocated memory.
Return type:: list[int]

trtutils.core.compile_and_load_kernel(kernel_code: str, name: str, opts: list[str] | None = None, *, verbose: bool | None = None) → tuple[CUmodule, CUkernel][source]¶

Compile and load a kernel from source definition.

Parameters:

kernel_code (str) – The code definition of the kernel.
name (str) – The name of the kernel.
opts (list[str], optional) – Additional arguments to pass to NVRTC during the compilation of the kernel.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the overall engine's verbose setting.

Returns:

The CUDA module and kernel

Return type:

tuple[cuda.CUmodule, cuda.CUkernel]

trtutils.core.compile_kernel(kernel: str, name: str, opts: list[str] | None = None, *, verbose: bool | None = None) → chararray[source]¶

Compile a CUDA kernel into PTX using NVRTC.

Parameters:

kernel (str) – The kernel definition in CUDA.
name (str) – The name of the kernel in the definition.
opts (list[str], optional) – Additional arguments to pass to NVRTC during the compilation of the kernel.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the overall engine's verbose setting.

Returns:

The compiled PTX kernel.

Return type:

np.char.chararray

Raises:

RuntimeError – If the version of cuda-python installed does not match the version of CUDA installed.

trtutils.core.create_binding(array: ~numpy.ndarray, bind_id: int = 0, name: str = 'binding', tensor_format: ~tensorrt_bindings.tensorrt.TensorFormat = <TensorFormat.LINEAR: 0>, *, use_array_data: bool | None = None, is_input: bool | None = None, pagelocked_mem: bool | None = None, unified_mem: bool | None = None) → Binding[source]¶

Create a binding for a TensorRT engine.

Parameters:

array (np.ndarray) – The array to use for the binding.
bind_id (int, optional) – The index of the binding.
name (str, optional) – The name of the binding.
tensor_format (trt.TensorFormat, optional) – The format of the tensor.
use_array_data (bool, optional) – Whether to use the data from the array for the binding. By default None, which means the data will not be copied.
is_input (bool, optional) – Whether the binding is an input or output.
pagelocked_mem (bool, optional) – Whether or not to use pagelocked memory for host allocations. By default None, which means pagelocked memory will be used.
unified_mem (bool, optional) – Whether or not the system has unified memory. If True, use cudaHostAllocMapped to take advantage of unified memory.

Returns:

The binding for the host/device memory.

Return type:

Binding

trtutils.core.create_context(device: int = 0) → CUcontext[source]¶

Create a CUDA context.

Parameters:: device (int) – The device to make a context for. By default 0.
Returns:: The created CUDA context
Return type:: cuda.CUcontext

trtutils.core.create_engine(engine_path: Path | str, stream: cudart.cudaStream_t | None = None, dla_core: int | None = None, device: int | None = None, *, no_warn: bool | None = None) → tuple[trt.ICudaEngine, trt.IExecutionContext, trt.ILogger, cudart.cudaStream_t][source]¶

Load a serialized engine from disk.

Parameters:

engine_path (Path | str) – The path to the serialized engine file.
stream (cudart.cudaStream_t, optional) – When an already made stream is passed, no new stream is created. Useful if you want multiple engines to share the same stream. Although there is no explicit link between engine and stream, the stream returned by this function should be used for execution.
dla_core (int, optional) – The DLA core to assign DLA layers of the engine to. Default is None. If None, any DLA layers will be assigned to DLA core 0.
device (int, optional) – The CUDA device index to create the engine on. Default is None, which uses the current device.
no_warn (bool | None, optional) – If True, suppresses warnings from TensorRT. Default is None.

Returns:

The deserialized engine, the execution context, the logger used, and the stream (either the one passed in or a newly created one).

Return type:

tuple[trt.ICudaEngine, trt.IExecutionContext, trt.ILogger, cudart.cudaStream_t]

Raises:

FileNotFoundError – If the engine file is not found.
RuntimeError – If the TRT runtime could not be created. If the engine could not be deserialized. If the execution context could not be created.

trtutils.core.create_kernel_args(*args: int | float | ndarray, verbose: bool | None = False) → tuple[ndarray, list[ndarray]][source]¶

Create the argument pointer array for a CUDA kernel call.

Adapted from the workflow presented at https://nvidia.github.io/cuda-python/overview.html#cuda-python-workflow

This MUST be called for each kernel call. If the args are not regenerated the CUDA runtime will crash.

The intermediate argument buffers MUST be saved in a variable to ensure the garbage collector does not delete them before use. The Kernel wrapper class handles this and is the recommended way to interact with kernels inside trtutils.

Parameters:

*args (int | float | np.ndarray) – All args to pass to the kernel as integers, floats, or pre-formed args. If arrays are to be passed to the kernel, they should be given as an integer representing the pointer returned from CUDA malloc. A preformed arg is one which is already wrapped as an np.ndarray with specific type.
verbose (bool, optional) – Whether or not to output additional information about the passed args.

Returns:

The np.ndarray of argument pointers (one pointer per arg), and the allocated arrays

Return type:

tuple[np.ndarray, list[np.ndarray]]

Raises:

TypeError – If an argument is not an int, float, or np.ndarray.
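The "array of pointers to argument buffers" layout, and why the buffers must stay referenced, can be illustrated without CUDA. The sketch below uses ctypes as a stand-in for the NumPy arrays the function works with. The variable names are illustrative, and this is not the library's implementation.

```python
import ctypes

# Each scalar argument lives in its own small buffer ...
arg_buffers = [ctypes.c_int(3), ctypes.c_float(2.5)]

# ... and the launch receives an array holding the address of each buffer.
# The array stores raw addresses only, so it does NOT keep the buffers alive:
# `arg_buffers` must remain referenced for as long as `arg_ptrs` is in use.
arg_ptrs = (ctypes.c_void_p * len(arg_buffers))(
    *[ctypes.addressof(buf) for buf in arg_buffers]
)

# Dereference the pointers to show they still reach the original values.
first = ctypes.cast(arg_ptrs[0], ctypes.POINTER(ctypes.c_int)).contents.value
second = ctypes.cast(arg_ptrs[1], ctypes.POINTER(ctypes.c_float)).contents.value
```

If `arg_buffers` were dropped before the pointers were read, the addresses in `arg_ptrs` would dangle, which is the failure mode the warning above describes.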

trtutils.core.create_stream() → cudaStream_t[source]¶

Create a CUDA Stream.

Returns:: The CUDA stream.
Return type:: cudart.cudaStream_t

trtutils.core.cuda_call(call: tuple[CUresult | cudaError_t, T]) → T[source]¶

Call a CUDA function and check for errors.

Parameters:: call (tuple[cuda.CUresult | cudart.cudaError_t, T]) – The return value of a CUDA function call: the status code followed by the result.
Returns:: The result of the CUDA function call.
Return type:: T
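The check-and-unwrap pattern behind this function can be sketched with stand-ins. `FakeStatus`, `fake_malloc`, and `checked_call` below are hypothetical, and raising RuntimeError on failure is an assumption rather than documented behaviour.

```python
from enum import IntEnum


class FakeStatus(IntEnum):
    """Hypothetical stand-in for a CUDA status enum."""

    SUCCESS = 0
    INVALID_VALUE = 1


def fake_malloc(nbytes):
    """Mimic a binding that returns (status, result) instead of raising."""
    if nbytes <= 0:
        return (FakeStatus.INVALID_VALUE, 0)
    return (FakeStatus.SUCCESS, 0x1000)


def checked_call(call):
    """Split off the status, raise on failure, and return the payload."""
    status, *results = call
    if status != FakeStatus.SUCCESS:
        raise RuntimeError(f"CUDA-style call failed: {status.name}")
    if not results:
        return None
    return results[0] if len(results) == 1 else tuple(results)


ptr = checked_call(fake_malloc(16))
```

Wrapping every call this way keeps error handling in one place, so call sites read as if the bindings raised exceptions themselves.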

trtutils.core.cuda_free(device_ptr: int) → None[source]¶

Free a CUDA device pointer.

Parameters:: device_ptr (int) – The device pointer to free.

trtutils.core.cuda_graph_destroy(graph: cudaGraph_t) → None[source]¶

Destroy a CUDA graph.

Parameters:: graph (cudart.cudaGraph_t) – The CUDA graph to destroy.

trtutils.core.cuda_graph_exec_destroy(graph_exec: cudaGraphExec_t) → None[source]¶

Destroy a CUDA graph executable.

Parameters:: graph_exec (cudart.cudaGraphExec_t) – The graph executable to destroy.

trtutils.core.cuda_graph_instantiate(graph: cudaGraph_t, flags: int = 0) → cudaGraphExec_t[source]¶

Instantiate a CUDA graph executable.

Parameters:

graph (cudart.cudaGraph_t) – The CUDA graph to instantiate.
flags (int, optional) – Flags for graph instantiation. Default is 0.

Returns:

The instantiated graph executable.

Return type:

cudart.cudaGraphExec_t

trtutils.core.cuda_graph_launch(graph_exec: cudaGraphExec_t, stream: cudaStream_t) → None[source]¶

Launch a CUDA graph executable.

Parameters:

graph_exec (cudart.cudaGraphExec_t) – The graph executable to launch.
stream (cudart.cudaStream_t) – The CUDA stream to launch on.

trtutils.core.cuda_host_free(host_ptr: int | ndarray) → None[source]¶

Free a CUDA host pointer.

Parameters:: host_ptr (int | np.ndarray) – The host pointer to free.

trtutils.core.cuda_malloc(nbytes: int) → int[source]¶

Perform a memory allocation using cudart.cudaMalloc.

Parameters:: nbytes (int) – The number of bytes to allocate.
Returns:: The pointer to the allocated memory.
Return type:: int

trtutils.core.cuda_stream_begin_capture(stream: cudaStream_t, mode: cudaStreamCaptureMode | None = None) → None[source]¶

Begin capturing a CUDA graph on the given stream.

Parameters:

stream (cudart.cudaStream_t) – The CUDA stream to begin capture on.
mode (cudart.cudaStreamCaptureMode, optional) – The capture mode to use. Default is ThreadLocal, which only checks CUDA calls from the capturing thread. Global mode would cause any uncapturable call in any thread to fail during capture.

trtutils.core.cuda_stream_end_capture(stream: cudaStream_t) → cudaGraph_t[source]¶

End capturing a CUDA graph and return the captured graph.

Parameters:: stream (cudart.cudaStream_t) – The CUDA stream to end capture on.
Returns:: The captured CUDA graph.
Return type:: cudart.cudaGraph_t

trtutils.core.destroy_context(context: CUcontext) → None[source]¶

Destroy a CUDA context.

Parameters:: context (cuda.CUcontext) – The CUDA context to destroy.

trtutils.core.destroy_stream(stream: cudaStream_t) → None[source]¶

Destroy a CUDA Stream.

Parameters:: stream (cudart.cudaStream_t) – The CUDA stream to destroy.

trtutils.core.free_device_ptrs(ptrs: list[int]) → None[source]¶

Free a list of CUDA device pointers.

Parameters:: ptrs (list[int]) – The device pointers to free.

trtutils.core.get_compute_capability(device: int = 0) → tuple[int, int][source]¶

Get the compute capability (SM version) of a CUDA device.

Parameters:: device (int, optional) – The CUDA device index. Default is 0.
Returns:: A tuple of (major, minor) compute capability version.
Return type:: tuple[int, int]

trtutils.core.get_device() → int[source]¶

Get the current CUDA device.

Returns:: The current CUDA device index.
Return type:: int

trtutils.core.get_device_count() → int[source]¶

Get the number of CUDA devices available.

Returns:: The number of CUDA devices.
Return type:: int

trtutils.core.get_device_name(device: int = 0) → str[source]¶

Get the name of a CUDA device.

Parameters:: device (int, optional) – The CUDA device index. Default is 0.
Returns:: The device name (e.g. “NVIDIA GeForce RTX 5080”).
Return type:: str

trtutils.core.get_engine_names(engine: ICudaEngine) → tuple[list[str], list[str]][source]¶

Get the input/output names of a TensorRT engine in order.

Parameters:: engine (trt.ICudaEngine) – The TensorRT engine to get the input and output names from.
Returns:: The input and output tensors in order of enumeration.
Return type:: tuple[list[str], list[str]]

trtutils.core.get_sm_arch(major: int, minor: int) → str[source]¶

Get the GPU architecture name from a compute capability version.

Parameters:

major (int) – The major compute capability version.
minor (int) – The minor compute capability version.

Returns:

The architecture name (e.g. “turing”, “blackwell”). Returns “unknown” if the compute capability is not recognized.

Return type:

str
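A lookup of this kind might be sketched as a simple table keyed by (major, minor). The table below is an illustrative assumption and is not copied from the library. Only "turing", "blackwell", and "unknown" are spellings confirmed by the text above, and the other names are guesses at the format.

```python
# Hypothetical table: compute capability -> architecture name.
_ARCH_BY_CC = {
    (7, 0): "volta",
    (7, 2): "volta",
    (7, 5): "turing",
    (8, 0): "ampere",
    (8, 6): "ampere",
    (8, 7): "ampere",
    (8, 9): "ada",
    (9, 0): "hopper",
    (10, 0): "blackwell",
    (12, 0): "blackwell",
}


def sm_arch_name(major: int, minor: int) -> str:
    """Return the architecture name, or "unknown" for unlisted versions."""
    return _ARCH_BY_CC.get((major, minor), "unknown")


turing = sm_arch_name(7, 5)
unrecognized = sm_arch_name(0, 0)
```

In practice the (major, minor) pair would come from get_compute_capability() for the device in question.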

trtutils.core.init_cuda() → None[source]¶: Initialize CUDA.

trtutils.core.launch_kernel(kernel: cuda.CUkernel, num_blocks: tuple[int, int, int], num_threads: tuple[int, int, int], stream: cudart.cudaStream_t, args: np.ndarray) → None[source]¶

Launch a CUDA kernel with specified blocks, threads, and args in a stream.

Parameters:

kernel (cuda.CUkernel) – The CUDA kernel as compiled by NVRTC using the compile_kernel function.
num_blocks (tuple[int, int, int]) – The number of blocks to use for the kernel call.
num_threads (tuple[int, int, int]) – The number of threads to use for the kernel call.
stream (cudart.cudaStream_t) – The CUDA stream to execute the kernel in.
args (np.ndarray) – The NumPy array containing the pointers to the arguments. This array should be 1D containing int64 pointers to a NumPy array containing each individual argument.
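Choosing num_blocks for a given num_threads is usually a ceiling division per dimension, so that the grid covers the whole problem. A minimal sketch, with `grid_for` as a hypothetical helper name:

```python
def grid_for(extent, threads):
    """Blocks per dimension so that blocks * threads covers `extent`."""
    return tuple((e + t - 1) // t for e, t in zip(extent, threads))


# A 1920x1080 image processed with 32x32 threads per block.
num_threads = (32, 32, 1)
num_blocks = grid_for((1920, 1080, 1), num_threads)
```

Because the last block in a dimension may extend past the data, kernels launched this way typically need a bounds check on the computed index.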

trtutils.core.load_kernel(kernel_ptx: chararray, name: str, *, verbose: bool | None = None) → tuple[CUmodule, CUkernel][source]¶

Load a kernel from a PTX definition.

Parameters:

kernel_ptx (np.char.chararray) – The PTX generated by NVRTC, use the compile_kernel function.
name (str) – The name of the kernel inside the PTX definition.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the overall engine's verbose setting.

Returns:

The CUDA module and kernel

Return type:

tuple[cuda.CUmodule, cuda.CUkernel]

trtutils.core.memcpy_device_to_device(dst_ptr: int, src_ptr: int, nbytes: int) → None[source]¶

Copy from one device pointer to another with error checking.

Parameters:

dst_ptr (int) – The destination device pointer.
src_ptr (int) – The source device pointer.
nbytes (int) – The number of bytes to copy.

trtutils.core.memcpy_device_to_device_async(dst_ptr: int, src_ptr: int, nbytes: int, stream: cudaStream_t) → None[source]¶

Copy from one device pointer to another asynchronously.

Parameters:

dst_ptr (int) – The destination device pointer.
src_ptr (int) – The source device pointer.
nbytes (int) – The number of bytes to copy.
stream (cudart.cudaStream_t) – The stream to utilize.

trtutils.core.memcpy_device_to_host(host_arr: ndarray, device_ptr: int) → None[source]¶

Copy a device pointer to a numpy array with error checking.

Parameters:

host_arr (np.ndarray) – The numpy array to copy to.
device_ptr (int) – The device pointer to copy.

trtutils.core.memcpy_device_to_host_async(host_arr: ndarray, device_ptr: int, stream: cudaStream_t) → None[source]¶

Copy a device pointer to a numpy array asynchronously with error checking.

Parameters:

host_arr (np.ndarray) – The numpy array to copy to.
device_ptr (int) – The device pointer to copy.
stream (cudart.cudaStream_t) – The stream to utilize.

trtutils.core.memcpy_host_to_device(device_ptr: int, host_arr: ndarray) → None[source]¶

Copy a numpy array to a device pointer with error checking.

Parameters:

device_ptr (int) – The device pointer to copy to.
host_arr (np.ndarray) – The numpy array to copy.

trtutils.core.memcpy_host_to_device_async(device_ptr: int, host_arr: ndarray, stream: cudaStream_t) → None[source]¶

Copy a numpy array to a device pointer asynchronously with error checking.

Parameters:

device_ptr (int) – The device pointer to copy to.
host_arr (np.ndarray) – The numpy array to copy.
stream (cudart.cudaStream_t) – The stream to utilize.

trtutils.core.memcpy_host_to_device_offset(device_ptr: int, host_arr: ndarray, offset_bytes: int) → None[source]¶

Copy a numpy array to a device pointer at a specific offset.

Parameters:

device_ptr (int) – The base device pointer.
host_arr (np.ndarray) – The numpy array to copy.
offset_bytes (int) – The byte offset into the device buffer.
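A common use of an offset copy is packing several equally sized items into one larger buffer, with offset_bytes computed as index times item size. The sketch below models the device buffer with a bytearray. `copy_at_offset` is a hypothetical stand-in and performs no CUDA calls.

```python
import struct

ITEM_NBYTES = struct.calcsize("=2f")            # two 4-byte floats per item
device_buffer = bytearray(3 * ITEM_NBYTES)      # stand-in for a device allocation


def copy_at_offset(buffer, payload, offset_bytes):
    """Write `payload` into `buffer` starting at `offset_bytes`."""
    end = offset_bytes + len(payload)
    if end > len(buffer):
        raise ValueError("copy would run past the end of the buffer")
    buffer[offset_bytes:end] = payload


items = [(1.0, 2.0), (3.0, 4.0), (5.0, 6.0)]
for index, item in enumerate(items):
    payload = struct.pack("=2f", *item)
    copy_at_offset(device_buffer, payload, index * ITEM_NBYTES)

packed = struct.unpack("=6f", device_buffer)
```

The bounds check is worth mirroring in real code, since a raw device pointer carries no size information of its own.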

trtutils.core.memcpy_host_to_device_offset_async(device_ptr: int, host_arr: ndarray, offset_bytes: int, stream: cudaStream_t) → None[source]¶

Copy a numpy array to a device pointer at a specific offset asynchronously.

Parameters:

device_ptr (int) – The base device pointer.
host_arr (np.ndarray) – The numpy array to copy.
offset_bytes (int) – The byte offset into the device buffer.
stream (cudart.cudaStream_t) – The stream to utilize.

trtutils.core.nvrtc_call(call: tuple[nvrtcResult, T]) → T[source]¶

Call an NVRTC function and check for errors.

Parameters:: call (tuple[nvrtc.nvrtcResult, T]) – The return value of an NVRTC function call: the status code followed by the result.
Returns:: The result of the NVRTC function call.
Return type:: T

trtutils.core.set_device(device: int) → None[source]¶

Set the current CUDA device.

Parameters:: device (int) – The CUDA device index to set.

trtutils.core.stream_synchronize(stream: cudaStream_t) → None[source]¶

Synchronize the CUDA stream.

Parameters:: stream (cudart.cudaStream_t) – The stream to synchronize calls for.