trtutils package¶

Module contents¶

A package for enabling high-level usage of TensorRT in Python.

This package provides a high-level interface for using TensorRT in Python. It provides a class for creating TensorRT engines from serialized engine files, a class for running inference on those engines, and a variety of other utilities.

Submodules¶

builder: A module for building TensorRT engines.
core: A module for the core functionality of the package.
jetson: A module implementing additional functionality for Jetson devices.
impls: A module containing implementations for different neural networks.
inspect: A module for inspecting TensorRT engines.
trtexec: A module for utilities related to the trtexec tool.

Classes¶

BenchmarkResult: A dataclass for storing profiling information from benchmarking engines.
Metric: A dataclass storing specific metric information from benchmarking.
TRTEngine: A class for creating TensorRT engines from serialized engine files.
TRTModel: A class for running inference on TensorRT engines.
ParallelTRTEngines: A class for running many TRTEngines in parallel.
ParallelTRTModels: A class for running many TRTModels in parallel.
QueuedTRTEngine: A class for running a TRTEngine asynchronously in a separate thread.
QueuedTRTModel: A class for running a TRTModel asynchronously in a separate thread.

Functions¶

benchmark_engine(): Benchmark a TensorRT engine.
benchmark_engines(): Benchmark TensorRT engines in parallel or serially.
build_engine(): Build a TensorRT engine.
find_trtexec(): Find an instance of the trtexec binary on the system.
inspect_engine(): Inspect a TensorRT engine.
run_trtexec(): Run a command with trtexec.
set_log_level(): Set the log level of the trtutils package.
enable_jit(): Enable just-in-time compilation using Numba.
disable_jit(): Disable just-in-time compilation using Numba.
register_jit(): Decorator for registering functions for potential JIT compilation.

Objects¶

FLAGS: The flag storage object for trtutils.
LOG: The TensorRT compatible logger for trtutils.
JIT: A context manager for enabling just-in-time compilation using Numba.

class trtutils.BenchmarkResult(latency: Metric)[source]¶

Bases: object

A dataclass to store the results of a benchmark.

latency: Metric¶

class trtutils.Metric

Bases: object

A dataclass storing specific metric information from benchmarking.

raw: list[float | int]¶

mean: float | int = -1.0¶

median: float | int = -1.0¶

min: float | int = -1.0¶

max: float | int = -1.0¶
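As a rough illustration of how the summary fields relate to raw, the sketch below derives mean, median, min, and max from a list of raw samples. MetricSketch and summarize are hypothetical names used only for this example, and the actual Metric class may populate its fields differently.

```python
from __future__ import annotations

import statistics
from dataclasses import dataclass


@dataclass
class MetricSketch:
    """Hypothetical stand-in mirroring the documented Metric fields."""

    raw: list[float | int]
    mean: float | int = -1.0
    median: float | int = -1.0
    min: float | int = -1.0
    max: float | int = -1.0


def summarize(raw: list[float | int]) -> MetricSketch:
    """Fill the summary fields from the raw samples (assumed behavior)."""
    if not raw:
        # Keep the documented -1.0 defaults when there is nothing to summarize.
        return MetricSketch(raw=[])
    return MetricSketch(
        raw=list(raw),
        mean=statistics.mean(raw),
        median=statistics.median(raw),
        min=min(raw),
        max=max(raw),
    )
```

Keeping the raw samples next to the summary values makes it possible to compute further statistics later without rerunning the benchmark.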

class trtutils.ParallelTRTEngines

Bases: object

Handle many TRTEngines in parallel.

get_random_input(*, new: bool | None = None) → list[list[np.ndarray]][source]¶

Get a random input to the underlying TRTEngines.

Parameters:: new (bool, optional) – Whether to generate a new input instead of returning the cached, previously generated one. By default, None/False
Returns:: The random inputs.
Return type:: list[list[np.ndarray]]

stop() → None[source]¶: Stop the underlying engine threads.

submit(inputs: list[list[np.ndarray]]) → None[source]¶

Submit data to be processed by the engines.

Parameters:: inputs (list[list[np.ndarray]]) – The inputs to pass to the engines. Should be a list with one entry per engine created.
Raises:: ValueError – If the number of inputs does not match the number of engines.

mock_submit() → None[source]¶: Send random data to the engines.

retrieve(timeout: float | None = None) → list[list[np.ndarray] | None][source]¶

Get the outputs from the engines.

Parameters:: timeout (float, optional) – Timeout for waiting for data.
Returns:: The output from the engines.
Return type:: list[list[np.ndarray] | None]
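The methods above can be combined into a single round trip. The helper below is a minimal sketch that only relies on the documented get_random_input, submit, and retrieve methods. The name run_parallel_once is hypothetical, and parallel is assumed to be an already constructed ParallelTRTEngines, since its constructor arguments are not shown in this section.

```python
from __future__ import annotations

from typing import Any


def run_parallel_once(parallel: Any, timeout: float | None = 1.0) -> list[Any]:
    """Submit one random input per engine and collect whatever came back."""
    inputs = parallel.get_random_input()  # one list of arrays per engine
    parallel.submit(inputs)  # raises ValueError on a length mismatch
    outputs = parallel.retrieve(timeout=timeout)
    # Per the return annotation an entry may be None, so filter those out.
    return [out for out in outputs if out is not None]
```

Call stop() on the object once no further work will be submitted, so the underlying engine threads are shut down.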

class trtutils.ParallelTRTModels(engine_paths: Sequence[Path | str], preprocess: Callable[[list[np.ndarray]], list[np.ndarray]] | list[Callable[[list[np.ndarray]], list[np.ndarray]]] = <function _identity>, postprocess: Callable[[list[np.ndarray]], list[np.ndarray]] | list[Callable[[list[np.ndarray]], list[np.ndarray]]] = <function _identity>, warmup_iterations: int = 5, *, warmup: bool | None = None)[source]¶

Bases: object

Handle many TRTModels in parallel.

stop() → None[source]¶: Stop the underlying engine threads.

submit(inputs: list[list[np.ndarray]], *, preprocessed: bool | None = None) → None[source]¶

Submit data to be processed by the engines.

Parameters:

inputs (list[list[np.ndarray]]) – The inputs to pass to the engines. Should be a list with one entry per engine created.
preprocessed (bool, optional) – Whether or not the inputs are already preprocessed.

Raises:

ValueError – If the number of inputs does not match the number of engines.

retrieve(timeout: float | None = None) → list[list[np.ndarray] | None][source]¶

Get the outputs from the engines.

Parameters:: timeout (float, optional) – Timeout for waiting for data.
Returns:: The output from the engines.
Return type:: list[list[np.ndarray] | None]

class trtutils.QueuedTRTEngine(engine: TRTEngine | Path | str, warmup_iterations: int = 5, dla_core: int | None = None, *, warmup: bool | None = None)[source]¶

Bases: object

Interact with TRTEngine over Thread and Queue.

property input_spec: list[tuple[list[int], np.dtype]]¶

Get the specs for the input tensors of the network. Useful to prepare memory allocations.

Returns:: A list with two items per element, the shape and (numpy) datatype of each input tensor.
Return type:: list[tuple[list[int], np.dtype]]

property input_shapes: list[tuple[int, ...]]¶

Get the shapes for the input tensors of the network.

Returns:: A list with the shape of each input tensor.
Return type:: list[tuple[int, …]]

property input_dtypes: list[np.dtype]¶

Get the datatypes for the input tensors of the network.

Returns:: A list with the datatype of each input tensor.
Return type:: list[np.dtype]

property output_spec: list[tuple[list[int], np.dtype]]¶

Get the specs for the output tensors of the network. Useful to prepare memory allocations.

Returns:: A list with two items per element, the shape and (numpy) datatype of each output tensor.
Return type:: list[tuple[list[int], np.dtype]]

property output_shapes: list[tuple[int, ...]]¶

Get the shapes for the output tensors of the network.

Returns:: A list with the shape of each output tensor.
Return type:: list[tuple[int, …]]

property output_dtypes: list[np.dtype]¶

Get the datatypes for the output tensors of the network.

Returns:: A list with the datatype of each output tensor.
Return type:: list[np.dtype]

get_random_input(*, new: bool | None = None) → list[np.ndarray][source]¶

Get a random input to the underlying TRTEngine.

Parameters:: new (bool, optional) – Whether to generate a new input instead of returning the cached, previously generated one. By default, None/False
Returns:: The random input.
Return type:: list[np.ndarray]

stop() → None[source]¶: Stop the thread containing the TRTEngine.

submit(data: list[np.ndarray]) → None[source]¶

Put data in the input queue.

Parameters:: data (list[np.ndarray]) – The data to have the engine run.

mock_submit() → None[source]¶: Send a random input to the engine.

retrieve(timeout: float | None = None) → list[np.ndarray] | None[source]¶

Get an output from the engine thread.

Parameters:: timeout (float, optional) – Timeout for waiting for data.
Returns:: The output from the engine.
Return type:: list[np.ndarray] | None
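To make the thread-and-queue interaction concrete, here is a minimal standard-library sketch of the pattern. QueuedWorkerSketch is a hypothetical class, fn stands in for the engine, and returning None when the timeout expires is an assumption suggested by the return annotation of retrieve. It is not the implementation used by QueuedTRTEngine.

```python
from __future__ import annotations

import queue
import threading
from typing import Any, Callable


class QueuedWorkerSketch:
    """Hypothetical worker that runs `fn` on submitted items in a thread."""

    def __init__(self, fn: Callable[[Any], Any]) -> None:
        self._fn = fn
        self._inputs: queue.Queue = queue.Queue()
        self._outputs: queue.Queue = queue.Queue()
        self._stop = threading.Event()
        self._thread = threading.Thread(target=self._loop, daemon=True)
        self._thread.start()

    def _loop(self) -> None:
        while not self._stop.is_set():
            try:
                item = self._inputs.get(timeout=0.05)
            except queue.Empty:
                continue  # wake up periodically to check the stop flag
            self._outputs.put(self._fn(item))

    def submit(self, data: Any) -> None:
        self._inputs.put(data)

    def retrieve(self, timeout: float | None = None) -> Any | None:
        try:
            return self._outputs.get(timeout=timeout)
        except queue.Empty:
            return None  # nothing arrived within the timeout

    def stop(self) -> None:
        self._stop.set()
        self._thread.join()
```

Because submit returns immediately, the caller can prepare the next input while the worker is still busy with the previous one.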

class trtutils.QueuedTRTModel(engine_path: Path | str, preprocess: Callable[[list[np.ndarray]], list[np.ndarray]] = <function _identity>, postprocess: Callable[[list[np.ndarray]], list[np.ndarray]] = <function _identity>, warmup_iterations: int = 5, engine_type: type[TRTEngine] | None = None, *, warmup: bool | None = None)[source]¶

Bases: object

Interact with TRTModel over a Thread and Queue.

stop() → None[source]¶: Stop the thread containing the TRTEngine.

submit(data: list[np.ndarray], *, preprocessed: bool | None = None) → None[source]¶

Put data in the input queue.

Parameters:

data (list[np.ndarray]) – The data to have the engine run.
preprocessed (bool, optional) – Whether or not the input is already preprocessed.

retrieve(timeout: float | None = None) → list[np.ndarray] | None[source]¶

Get an output from the engine thread.

Parameters:: timeout (float, optional) – Timeout for waiting for data.
Returns:: The output from the engine.
Return type:: list[np.ndarray] | None

class trtutils.TRTEngine

Bases: TRTEngineInterface

Implements a generic interface for TensorRT engines.

It is thread- and process-safe to create multiple TRTEngines. It is valid to create a TRTEngine in one thread and use it in another. Each TRTEngine has its own CUDA context, and the class implements no safeguards against data races. As such, a single TRTEngine should not be used from multiple threads or processes at the same time.

execute(data: list[np.ndarray], *, no_copy: bool | None = None, verbose: bool | None = None, debug: bool | None = None) → list[np.ndarray][source]¶

Execute the network with the given inputs.

Parameters:

data (list[np.ndarray]) – The inputs to the network.
no_copy (bool, optional) – If True, the outputs will not be copied out of the CUDA-allocated host memory. Instead, the host memory will be returned directly. This memory WILL BE OVERWRITTEN IN PLACE by future inferences.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the engine's overall verbose setting.
debug (bool, optional) – Enable intermediate stream synchronization for debugging.

Returns:

The outputs of the network.

Return type:

list[np.ndarray]
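The warning on no_copy is easiest to see with a toy model of buffer reuse. The sketch below uses plain Python lists in place of CUDA-allocated host memory. BufferReuseSketch is a hypothetical stand-in and does not call TensorRT.

```python
from __future__ import annotations


class BufferReuseSketch:
    """Hypothetical engine stand-in that reuses one host output buffer."""

    def __init__(self, size: int) -> None:
        self._host_out = [0.0] * size  # stands in for CUDA-allocated host memory

    def execute(self, data: list[float], *, no_copy: bool = False) -> list[float]:
        for i, value in enumerate(data):
            self._host_out[i] = value * 2.0  # "inference" writes in place
        # With no_copy the caller receives the shared buffer itself.
        return self._host_out if no_copy else list(self._host_out)


engine = BufferReuseSketch(size=2)
first = engine.execute([1.0, 2.0], no_copy=True)
engine.execute([3.0, 4.0], no_copy=True)
print(first)  # → [6.0, 8.0], the first result was overwritten

kept = engine.execute([1.0, 2.0])  # default: a copy is returned
engine.execute([3.0, 4.0])
print(kept)  # → [2.0, 4.0], unchanged by the later call
```

In practice this means an output obtained with no_copy=True should be consumed or copied before the next inference runs.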

direct_exec(pointers: list[int], *, no_warn: bool | None = None, verbose: bool | None = None, debug: bool | None = None) → list[np.ndarray][source]¶

Execute the network with the given GPU memory pointers.

The outputs of this function are not copied on return. The data will be updated in place if execute or direct_exec is called again. Calling this method with invalid pointers will crash the CUDA runtime and, with it, the program.

Parameters:

pointers (list[int]) – The GPU memory pointers for the inputs to the network.
no_warn (bool, optional) – If True, do not warn about usage.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the engine's overall verbose setting.
debug (bool, optional) – Enable intermediate stream synchronization for debugging.

Returns:

The outputs of the network.

Return type:

list[np.ndarray]

raw_exec(pointers: list[int], *, no_warn: bool | None = None, verbose: bool | None = None, debug: bool | None = None) → list[int][source]¶

Execute the network with the given GPU memory pointers.

The outputs of this function are the direct GPU pointers of the output allocations.

Parameters:

pointers (list[int]) – The GPU memory pointers for the inputs to the network.
no_warn (bool, optional) – If True, do not warn about usage.
verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the engine's overall verbose setting.
debug (bool, optional) – Enable intermediate stream synchronization for debugging.

Returns:

The pointers to the network outputs.

Return type:

list[int]

class trtutils.TRTModel(engine_path: Path | str, preprocess: Callable[[list[np.ndarray]], list[np.ndarray]] = <function _identity>, postprocess: Callable[[list[np.ndarray]], list[np.ndarray]] = <function _identity>, warmup_iterations: int = 5, engine_type: type[TRTEngine] | None = None, *, warmup: bool | None = None)[source]¶

Bases: object

A wrapper around a TensorRT engine that handles the device memory.

It is thread- and process-safe to create multiple TRTModels. It is valid to create a TRTModel in one thread and use it in another. Each TRTModel has its own CUDA context, and the class implements no safeguards against data races. As such, a single TRTModel should not be used from multiple threads or processes at the same time.

property engine: TRTEngine¶: Access the underlying TRTEngine.

property stream: cudart.cudaStream_t¶: Access the underlying CUDA stream.

property preprocessor: Callable[[list[np.ndarray]], list[np.ndarray]]¶: The preprocessing function used in this model.

property postprocessor: Callable[[list[np.ndarray]], list[np.ndarray]]¶: The postprocessing function used in this model.

mock_run(data: list[np.ndarray] | None = None) → list[np.ndarray][source]¶

Execute the model with random inputs.

Parameters:: data (list[np.ndarray], optional) – The inputs to the model, by default None. If None, random inputs will be used.
Returns:: The outputs of the model
Return type:: list[np.ndarray]

preprocess(inputs: list[np.ndarray]) → list[np.ndarray][source]¶

Preprocess the inputs.

Parameters:: inputs (list[np.ndarray]) – The inputs to preprocess
Returns:: The preprocessed inputs
Return type:: list[np.ndarray]

postprocess(outputs: list[np.ndarray]) → list[np.ndarray][source]¶

Postprocess the outputs.

Parameters:: outputs (list[np.ndarray]) – The outputs to postprocess
Returns:: The postprocessed outputs
Return type:: list[np.ndarray]

run(inputs: list[np.ndarray], *, preprocessed: bool | None = None, postprocess: bool | None = None) → list[np.ndarray][source]¶

Execute the model with the given inputs.

Parameters:

inputs (list[np.ndarray]) – The inputs to the model
preprocessed (bool, optional) – Whether the inputs are already preprocessed, by default None. If None, the inputs will be preprocessed.
postprocess (bool, optional) – Whether or not to postprocess the outputs, by default None. If None, the outputs will be postprocessed.

Returns:

The outputs of the model

Return type:

list[np.ndarray]
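The interplay of the two flags can be summarized in a few lines. The following is a sketch of the documented defaults only: run_sketch and do_postprocess are hypothetical names, nested lists stand in for list[np.ndarray], and engine is any callable standing in for the TensorRT engine.

```python
from __future__ import annotations

from typing import Callable, List

Arrays = List[list]  # stand-in for list[np.ndarray]
Stage = Callable[[Arrays], Arrays]


def run_sketch(
    inputs: Arrays,
    engine: Stage,
    preprocess: Stage,
    postprocess: Stage,
    *,
    preprocessed: bool | None = None,
    do_postprocess: bool | None = None,
) -> Arrays:
    """Mirror the documented defaults: None means 'apply the stage'."""
    if not preprocessed:  # None or False: inputs still need preprocessing
        inputs = preprocess(inputs)
    outputs = engine(inputs)
    if do_postprocess is None or do_postprocess:
        outputs = postprocess(outputs)
    return outputs
```

Passing preprocessed=True is useful when preprocessing was already done elsewhere, for example in a separate thread.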

trtutils.benchmark_engine()

Benchmark a TensorRT engine.

Parameters:

engine (TRTEngine | Path | str) – The engine to benchmark. Either a TRTEngine object or path to the engine file. If a path is given, then a TRTEngine will be created automatically.
iterations (int, optional) – The number of iterations to run the benchmark for, by default 1000.
warmup_iterations (int, optional) – The number of warmup iterations to run before the benchmark, by default 50.
dla_core (int, optional) – The DLA core to assign DLA layers of the engine to. Default is None. If None, any DLA layers will be assigned to DLA core 0.
warmup (bool, optional) – Whether to do warmup iterations, by default None. If None, warmup will be set to True.
verbose (bool, optional) – Whether or not to output additional information to stdout. Default None/False.

Returns:

A dataclass containing the results of the benchmark.

Return type:

BenchmarkResult
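The warmup and iteration parameters describe a common timing pattern, sketched below with the standard library. time_calls is a hypothetical helper, fn stands in for one engine execution, and reporting latencies in seconds is a choice made for this sketch rather than a statement about the units trtutils reports.

```python
from __future__ import annotations

import time
from typing import Callable


def time_calls(
    fn: Callable[[], object],
    iterations: int = 1000,
    warmup_iterations: int = 50,
    *,
    warmup: bool | None = None,
) -> list[float]:
    """Return per-call latencies in seconds, after optional warmup calls."""
    if warmup is None or warmup:  # None is treated as True, as documented
        for _ in range(warmup_iterations):
            fn()
    raw: list[float] = []
    for _ in range(iterations):
        start = time.perf_counter()
        fn()
        raw.append(time.perf_counter() - start)
    return raw
```

Warmup calls are excluded from the returned list so that one-time costs such as lazy initialization do not skew the statistics.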

trtutils.benchmark_engines()

Benchmark TensorRT engines in parallel or serially.

Parameters:

engines (Sequence[TRTEngine | Path | str | tuple[TRTEngine | Path | str, int]]) – The engines to benchmark, given as TRTEngine objects or paths to the engine files.
iterations (int, optional) – The number of iterations to run the benchmark for, by default 1000.
warmup_iterations (int, optional) – The number of warmup iterations to run before the benchmark, by default 50.
warmup (bool, optional) – Whether to do warmup iterations, by default None. If None, warmup will be set to True.
parallel (bool, optional) – Whether or not to process the engines in parallel. Useful for assessing concurrent execution performance. Will execute the engines in lockstep. If None, will benchmark each engine individually.
verbose (bool, optional) – Whether or not to output additional information to stdout. Default None/False.

Returns:

A list of dataclasses containing the results of the benchmark. If parallel was True, will only contain one item.

Return type:

list[BenchmarkResult]

trtutils.build_engine(onnx: Path | str, output: Path | str, default_device: trt.DeviceType | str = <DeviceType.GPU: 0>, timing_cache: Path | str | None = None, workspace: float = 4.0, dla_core: int | None = None, calibration_cache: Path | str | None = None, data_batcher: AbstractBatcher | None = None, layer_precision: list[tuple[int, trt.DataType | None]] | None = None, layer_device: list[tuple[int, trt.DeviceType | None]] | None = None, shapes: list[tuple[str, tuple[int, ...]]] | None = None, input_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, output_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, hooks: list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]] | None = None, *, gpu_fallback: bool = False, direct_io: bool = False, prefer_precision_constraints: bool = False, reject_empty_algorithms: bool = False, ignore_timing_mismatch: bool = False, fp16: bool | None = None, int8: bool | None = None, cache: bool | None = None, verbose: bool | None = None) → None[source]¶

Build a TensorRT engine from an ONNX model.

The order in which operations occur inside build_engine:

1. Parse the ONNX model
2. Apply any network hooks
3. Create optimization profile and apply any manual shapes
4. Apply builder flags (precision constraints, empty algorithms, direct I/O)
5. Configure tensor formats if specified
6. Configure precision (FP16, INT8)
7. Set default device and DLA core
8. Apply individual layer precision and device settings
9. Set up timing cache
10. Build the engine
11. Save timing cache and engine

Parameters:

onnx (Path, str) – The path to the ONNX model.
output (Path, str) – The location to save the TensorRT engine.
default_device (trt.DeviceType, str, optional) – The device to use for the engine. By default, trt.DeviceType.GPU. Options are trt.DeviceType.GPU, trt.DeviceType.DLA, or a string of “gpu” or “dla”.
timing_cache (Path, str, optional) – Where to store the timing cache data. Default is None.
workspace (float) – The size of the workspace in gigabytes. Default is 4.0 GiB.
calibration_cache (Path, str, optional) – The path to the calibration cache.
data_batcher (AbstractBatcher, optional) – The data batcher to use for calibration.
dla_core (int, optional) – The DLA core to build the engine for. By default, None, in which case the engine is built for GPU.
layer_precision (list[tuple[int, trt.DataType | None]], optional) – The precision to use for specific layers. By default, None.
layer_device (list[tuple[int, trt.DeviceType | None]], optional) – The device to use for specific layers. By default, None.
shapes (list[tuple[str, tuple[int, ...]]], optional) – A list of (input_name, shape) pairs to specify the shapes of the input layers. For example, shapes=[("images", (1, 3, imgsz, imgsz))] will set the input "images" to a fixed shape. This shape will be used as the min, optimal, and max shape for the binding. By default, None.
input_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples to allow deep specification of input layers. For example, input_tensor_formats=[("input", trt.DataType.UINT8, trt.TensorFormat.HWC)]. By default, None.
output_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples to allow deep specification of output layers. For example, output_tensor_formats=[("output", trt.DataType.HALF, trt.TensorFormat.LINEAR)]. By default, None.
hooks (list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]], optional) – An optional list of ‘hook’ functions to modify the TensorRT network before the remainder of the build phase occurs. By default, None
gpu_fallback (bool) – Whether or not to allow GPU fallback for unsupported layers when building the engine for DLA. By default, False
direct_io (bool) – Use direct IO for the engine. By default, False
prefer_precision_constraints (bool) – Whether or not to prefer precision constraints. By default, False
reject_empty_algorithms (bool) – Whether or not to reject empty algorithms. By default, False
ignore_timing_mismatch (bool) – Whether or not to allow timing caches generated on a different CUDA device to be used when building engines. By default, False
fp16 (bool, optional) – If True, quantize the engine to FP16 precision.
int8 (bool, optional) – If True, quantize the engine to INT8 precision.
cache (bool, optional) – Whether or not to cache the engine in the trtutils engine cache. If an existing version is found, it will be used. The name of the output file is used to determine whether the engine has been compiled before. As such, naming the output ‘engine’, ‘model’, or similar will result in unintended caching behavior. By default, None, which does not cache the engine.
verbose (bool, optional) – If True, print verbose output. By default, None or False

Raises:

RuntimeError – If the ONNX model cannot be parsed
RuntimeError – If the TensorRT engine fails to build
ValueError – If a layer is manually assigned to DLA, DLA is not supported, and gpu_fallback is False
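As a small usage sketch, the helper below collects keyword arguments for an FP16 build with a single fixed input shape, using only parameters listed above. fixed_shape_fp16_kwargs and the file names are hypothetical, and "images" is simply the input name from the shapes example, so it has to be replaced with the real input name of the model being built.

```python
from __future__ import annotations

from pathlib import Path


def fixed_shape_fp16_kwargs(onnx: Path | str, output: Path | str, imgsz: int) -> dict:
    """Collect keyword arguments for an FP16 build with one fixed input shape."""
    return {
        "onnx": Path(onnx),
        "output": Path(output),
        "fp16": True,
        # "images" is the input name from the shapes example in the docs;
        # use the actual input name of your ONNX model.
        "shapes": [("images", (1, 3, imgsz, imgsz))],
    }


kwargs = fixed_shape_fp16_kwargs("model.onnx", "model_fp16.engine", imgsz=640)
# With trtutils and TensorRT installed, the build itself would be:
#     import trtutils
#     trtutils.build_engine(**kwargs)
```

Giving the output file a descriptive name also avoids the caching pitfall described for the cache parameter.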

trtutils.disable_jit() → None[source]¶: Disable JIT compilation.

trtutils.enable_jit() → None[source]¶: Enable just-in-time compilation using Numba for some functions.

trtutils.find_trtexec() → Path[source]¶

Find an instance of the trtexec binary on the system.

Requires the locate command to be installed on the system. As such, only works on Unix-like systems.

Returns:: The path to the trtexec binary
Return type:: Path
Raises:: FileNotFoundError – If the trtexec binary is not found on the system
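On systems without the locate command, a lookup on PATH can serve as a fallback. The sketch below uses shutil.which from the standard library. It is a swapped-in alternative rather than the method find_trtexec uses, and find_on_path is a hypothetical name.

```python
from __future__ import annotations

import shutil
from pathlib import Path


def find_on_path(name: str = "trtexec") -> Path:
    """Look up a binary on PATH; a portable alternative, not trtutils' method."""
    found = shutil.which(name)
    if found is None:
        raise FileNotFoundError(f"{name} was not found on PATH")
    return Path(found)
```

A path found this way could then be passed to run_trtexec through its trtexec_path parameter.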

trtutils.inspect_engine(engine: Path | str | ICudaEngine, *, verbose: bool | None = None) → tuple[int, int, list[tuple[str, tuple[int, ...], DataType, TensorFormat]], list[tuple[str, tuple[int, ...], DataType, TensorFormat]]][source]¶

Inspect a TensorRT engine.

Parameters:

engine (Path | str | trt.ICudaEngine) – Path to the TensorRT engine file or an already loaded engine
verbose (bool | None, optional) – Whether to print verbose output, by default None

Returns:

The size in bytes of the engine, the max batch size, and two lists of input and output tensors

Return type:

tuple[int, int, list[tuple[str, tuple[int, …], trt.DataType, trt.TensorFormat]], list[tuple[str, tuple[int, …], trt.DataType, trt.TensorFormat]]]
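The returned tuple is compact but not very readable on its own. The helper below is a minimal sketch that unpacks it into printable lines. describe_inspection is a hypothetical name, and the function only assumes the four-element layout described above.

```python
from __future__ import annotations


def describe_inspection(result: tuple[int, int, list, list]) -> list[str]:
    """Turn the 4-tuple described above into printable lines."""
    size_bytes, max_batch, inputs, outputs = result
    lines = [f"engine size: {size_bytes} bytes", f"max batch size: {max_batch}"]
    for kind, tensors in (("input", inputs), ("output", outputs)):
        for name, shape, dtype, fmt in tensors:
            lines.append(
                f"{kind} {name}: shape={tuple(shape)} dtype={dtype} format={fmt}"
            )
    return lines
```

With trtutils installed, the result of inspect_engine could be passed to this helper directly.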

trtutils.register_jit(*, fastmath: bool = False, parallel: bool = False, nogil: bool = False, cache: bool = False, inline: str = 'never') → Callable[[Callable[_P, _R]], Callable[_P, _R]][source]¶

Decorator for registering functions for potential JIT compilation.

Parameters:

func (Callable[_P, _R], optional) – The function to optionally JIT compile. If None, the decorator returns a partially applied function.
fastmath (bool, optional) – If True, enable fastmath during jit. Default is False.
parallel (bool, optional) – If True, enable parallel jit. Default is False.
nogil (bool, optional) – If True, disable the GIL when running jit compiled functions. Default is False.
cache (bool, optional) – If True, cache jit compiled functions to disk. Default is False.
inline (str, optional) – Whether or not to inline functions at the Numba IR level. Default is ‘never’. Options are: [‘never’, ‘always’]

Returns:

The registered and optionally JIT-compiled function.

Return type:

Callable[[Callable[_P, _R]], Callable[_P, _R]]

Examples

>>> @register_jit(fastmath=True, parallel=True)
... def my_func(x):
...     return x * x

trtutils.run_trtexec(command: str, trtexec_path: Path | str | None = None) → tuple[bool, str, str][source]¶

Run a command using trtexec.

The goal of this function is to make it easier to use trtexec within Python scripts. Returning the stdout and stderr streams to the Python program as strings can simplify the logic of scripts that use trtexec.

Parameters:

command (str) – The command to run using trtexec
trtexec_path (Path | str | None, optional) – The path to the trtexec binary to use. If None, find_trtexec will be used.

Returns:

A tuple containing the following elements: (success, stdout, stderr)

Return type:

tuple[bool, str, str]
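The (success, stdout, stderr) shape can be reproduced with subprocess from the standard library. The sketch below is not the implementation of run_trtexec: run_captured is a hypothetical name, treating a zero exit code as success is an assumption, and the Python interpreter is used as the command so the example runs without trtexec installed.

```python
from __future__ import annotations

import subprocess
import sys


def run_captured(args: list[str]) -> tuple[bool, str, str]:
    """Run a command and return (success, stdout, stderr) as strings."""
    proc = subprocess.run(args, capture_output=True, text=True, check=False)
    return proc.returncode == 0, proc.stdout, proc.stderr


# Demonstrated with the Python interpreter, since trtexec may not be installed.
ok, out, err = run_captured([sys.executable, "-c", "print('ok')"])
```

Returning the streams as strings lets the caller check for specific messages without redirecting output to temporary files.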

trtutils.set_log_level(level: str) → None[source]¶

Set the log level for the trtutils package.

Parameters:: level (str) – The log level to set. One of “DEBUG”, “INFO”, “WARNING”, “ERROR”, “CRITICAL”.
Raises:: ValueError – If the level is not one of the allowed values.
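The validation described above can be sketched with the standard logging module. set_level_sketch is a hypothetical helper operating on an ordinary logging.Logger. Whether trtutils accepts lowercase names or uses this mechanism internally is not stated here, so the strict uppercase check is an assumption.

```python
from __future__ import annotations

import logging

_ALLOWED = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def set_level_sketch(logger: logging.Logger, level: str) -> None:
    """Validate the level name, then apply it to a standard-library logger."""
    if level not in _ALLOWED:
        raise ValueError(f"level must be one of {_ALLOWED}, got {level!r}")
    logger.setLevel(getattr(logging, level))
```

Rejecting unknown names early surfaces typos such as "WARN" immediately instead of leaving the previous level silently in place.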