trtutils package

Subpackages

Module contents

A package for enabling high-level usage of TensorRT in Python.

This package provides a high-level interface for using TensorRT in Python. It provides a class for creating TensorRT engines from serialized engine files, a class for running inference on those engines, and a variety of other utilities.

Submodules

builder

A module for building TensorRT engines.

compat

A module for compatibility with other libraries.

core

A module for the core functionality of the package.

download

A module for downloading and converting models to ONNX.

jetson

A module implementing additional functionality for Jetson devices.

image

A module for image processing with TensorRT.

models

A module containing implementations of DNN models.

inspect

A module for inspecting TensorRT engines.

trtexec

A module for utilities related to the trtexec tool.

Classes

BenchmarkResult

A dataclass for storing profiling information from benchmarking engines.

Metric

A dataclass storing specific metric information from benchmarking.

TRTEngine

A class for creating TensorRT engines from serialized engine files.

Functions

benchmark_engine()

Benchmark a TensorRT engine.

benchmark_engines()

Benchmark TensorRT engines in parallel or serially.

build_engine()

Build a TensorRT engine.

find_trtexec()

Find an instance of the trtexec binary on the system.

inspect_engine()

Inspect a TensorRT engine.

run_trtexec()

Run a command with trtexec.

set_log_level()

Set the log level of the trtutils package.

enable_jit()

Enable just-in-time compilation using Numba.

disable_jit()

Disable just-in-time compilation using Numba.

register_jit()

Decorator for registering functions for potential JIT compilation.

enable_nvtx()

Enable trtutils NVTX profiling.

disable_nvtx()

Disable trtutils NVTX profiling.

Objects

CONFIG

The config storage object for trtutils.

FLAGS

The flag storage object for trtutils.

LOG

The TensorRT compatible logger for trtutils.

JIT

A context manager for enabling just-in-time compilation using Numba.

NVTX

A context manager for enabling NVTX profiling.

class trtutils.NVTX(name: str)[source]

Bases: object

Context manager and static helpers for trtutils NVTX profiling.

class trtutils.BenchmarkResult(latency: Metric)[source]

Bases: object

A dataclass to store the results of a benchmark.

latency: Metric

class trtutils.Device(device: int | None)[source]

Bases: object

Context manager that saves and restores the current CUDA device.

When device is None the guard is a no-op: __enter__ and __exit__ only check a single attribute, adding negligible overhead on the hot path.

Instances are reusable: engines store one as self._device_guard and enter/exit it on every execute() call.

class trtutils.Metric(raw: list[float | int], mean: float | int = -1.0, median: float | int = -1.0, min: float | int = -1.0, max: float | int = -1.0, std: float = 0.0, ci95: float = 0.0)[source]

Bases: object

A dataclass storing specific metric information from benchmarking.
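To show how the fields relate to the raw samples, here is a minimal sketch that fills a Metric-like dataclass from a list of measurements. MetricSketch and summarize are hypothetical names. The choice of population standard deviation and a normal-approximation 95% interval half-width is an assumption, since this page does not state how trtutils computes std and ci95.

```python
import math
import statistics
from dataclasses import dataclass
from typing import List


@dataclass
class MetricSketch:
    """Hypothetical stand-in mirroring the documented Metric fields."""

    raw: List[float]
    mean: float = -1.0
    median: float = -1.0
    min: float = -1.0
    max: float = -1.0
    std: float = 0.0
    ci95: float = 0.0


def summarize(raw: List[float]) -> MetricSketch:
    """Compute summary statistics for a non-empty list of samples."""
    if not raw:
        raise ValueError("raw must contain at least one sample")
    # Assumption: population standard deviation; trtutils may differ.
    std = statistics.pstdev(raw)
    return MetricSketch(
        raw=raw,
        mean=statistics.fmean(raw),
        median=statistics.median(raw),
        min=min(raw),
        max=max(raw),
        std=std,
        # Assumption: half-width of a normal-approximation 95% interval.
        ci95=1.96 * std / math.sqrt(len(raw)),
    )


metric = summarize([1.0, 2.0, 3.0, 4.0])
```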

raw: list[float | int]
mean: float | int = -1.0
median: float | int = -1.0
min: float | int = -1.0
max: float | int = -1.0
std: float = 0.0
ci95: float = 0.0

class trtutils.TRTEngine(engine_path: Path | str, warmup_iterations: int = 5, backend: str = 'auto', stream: cuda.cudaStream_t | None = None, dla_core: int | None = None, device: int | None = None, *, warmup: bool | None = None, pagelocked_mem: bool | None = None, unified_mem: bool | None = None, cuda_graph: bool | None = None, no_warn: bool | None = None, verbose: bool | None = None)[source]

Bases: TRTEngineInterface

Implements a generic interface for TensorRT engines.

Creating multiple TRTEngines is thread- and process-safe. It is valid to create a TRTEngine in one thread and use it in another. Each TRTEngine has its own CUDA context, and the class implements no safeguards against data races. As such, a single TRTEngine should not be used by multiple threads or processes concurrently.
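One way to follow this guidance is to give every thread its own engine. The sketch below is not part of trtutils: PerThreadEngine is a hypothetical helper that lazily builds one object per thread from a caller-supplied factory. In real use the factory might be something like lambda: TRTEngine("model.engine"), where the path is a placeholder.

```python
import threading
from typing import Any, Callable


class PerThreadEngine:
    """Lazily create one engine per thread via a caller-supplied factory."""

    def __init__(self, factory: Callable[[], Any]) -> None:
        self._factory = factory
        self._local = threading.local()

    def get(self) -> Any:
        # Each thread sees its own attributes on the threading.local object.
        engine = getattr(self._local, "engine", None)
        if engine is None:
            engine = self._factory()
            self._local.engine = engine
        return engine


# Demonstration with a stand-in factory; `object` replaces a real engine.
pool = PerThreadEngine(object)
main_engine = pool.get()
```

Repeated calls to get() from the same thread return the same instance, while a different thread receives a separate one, so no engine is shared across threads.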

execute(data: list[np.ndarray], *, no_copy: bool | None = None, verbose: bool | None = None, debug: bool | None = None) list[np.ndarray][source]

Execute the network with the given inputs.

Parameters:
  • data (list[np.ndarray]) – The inputs to the network.

  • no_copy (bool, optional) – If True, the outputs will not be copied out of the CUDA-allocated host memory. Instead, the host memory will be returned directly. This memory WILL BE OVERWRITTEN IN PLACE by future inferences.

  • verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the engine's overall verbose setting.

  • debug (bool, optional) – Enable intermediate stream synchronize for debugging.

Returns:

The outputs of the network.

Return type:

list[np.ndarray]

Notes

This method always synchronizes the stream before returning, ensuring outputs are ready to read on the host.
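The consequence of no_copy=True is easiest to see with a toy model. The sketch below is a hypothetical stand-in, not TRTEngine: it uses plain Python lists instead of np.ndarray and "infers" by doubling each input, but it reuses a single host buffer in the same way the parameter description warns about.

```python
from typing import List


class ReusedBufferEngine:
    """Hypothetical stand-in that writes every result into one host buffer."""

    def __init__(self, size: int) -> None:
        self._host = [0.0] * size

    def execute(self, data: List[float], *, no_copy: bool = False) -> List[float]:
        # Pretend inference: double each input into the shared buffer, in place.
        for i, value in enumerate(data):
            self._host[i] = value * 2.0
        # With no_copy the caller receives the shared buffer itself.
        return self._host if no_copy else list(self._host)


engine = ReusedBufferEngine(2)
first = engine.execute([1.0, 2.0], no_copy=True)
kept = list(first)  # explicit copy taken before the next inference
engine.execute([10.0, 20.0], no_copy=True)
# `first` now holds the second result; `kept` still holds the first one.
```

If results must outlive the next call, either leave no_copy unset or copy the returned data explicitly, as kept does here.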

graph_exec(*, debug: bool | None = None) None[source]

Launch the captured CUDA graph.

This method only launches the graph - it does not handle input/output memory transfers or graph capture. The graph must already be captured (via warmup or prior execute() calls).

This method does NOT synchronize the stream by default, allowing the graph to be embedded in a larger pipeline. Use debug=True to force synchronization.

Parameters:

debug (bool, optional) – If True, synchronize the stream after graph launch. By default False (no synchronization).

Raises:

RuntimeError – If no CUDA graph has been captured or CUDA graphs are disabled.

direct_exec(pointers: list[int], *, set_pointers: bool = True, no_warn: bool | None = None, verbose: bool | None = None, debug: bool | None = None) list[np.ndarray][source]

Execute the network with the given GPU memory pointers.

The outputs of this function are not copied on return. The data will be updated in place if execute or direct_exec is called again. Passing invalid pointers to this method will crash the CUDA runtime and, with it, the program.

Parameters:
  • pointers (list[int]) – The inputs to the network. Pointers must be in the order of expected inputs for the engine.

  • set_pointers (bool, optional) – Whether to set tensor addresses before execution. If True, tensor addresses will be set. If False, tensor addresses are assumed to already be configured. By default, True.

  • no_warn (bool, optional) – If True, do not warn about usage.

  • verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the engine's overall verbose setting.

  • debug (bool, optional) – Enable intermediate stream synchronize for debugging.

Returns:

The outputs of the network.

Return type:

list[np.ndarray]

Notes

This method always synchronizes the stream before returning, ensuring outputs are ready to read on the host.

raw_exec(pointers: list[int], *, set_pointers: bool = True, no_warn: bool | None = None, verbose: bool | None = None, debug: bool | None = None) list[int][source]

Execute the network with the given GPU memory pointers.

The outputs of this function are the direct GPU pointers of the output allocations.

Parameters:
  • pointers (list[int]) – The inputs to the network. Pointers must be in the order of expected inputs for the engine.

  • set_pointers (bool, optional) – Whether to set tensor addresses before execution. If True, tensor addresses will be set. If False, tensor addresses are assumed to already be configured. By default, True.

  • no_warn (bool, optional) – If True, do not warn about usage.

  • verbose (bool, optional) – Whether or not to output additional information to stdout. If not provided, defaults to the engine's overall verbose setting.

  • debug (bool, optional) – Enable intermediate stream synchronize for debugging.

Returns:

The pointers to the network outputs.

Return type:

list[int]

Notes

This method does NOT synchronize the stream by default. The caller is responsible for synchronization if needed. Use debug=True to force synchronization after execution.

trtutils.benchmark_engine(engine: TRTEngine | Path | str, iterations: int = 1000, warmup_iterations: int = 50, dla_core: int | None = None, device: int | None = None, *, warmup: bool | None = None, verbose: bool | None = None) BenchmarkResult[source]

Benchmark a TensorRT engine.

Parameters:
  • engine (TRTEngine | Path | str) – The engine to benchmark. Either a TRTEngine object or path to the engine file. If a path is given, then a TRTEngine will be created automatically.

  • iterations (int, optional) – The number of iterations to run the benchmark for, by default 1000.

  • warmup_iterations (int, optional) – The number of warmup iterations to run before the benchmark, by default 50.

  • dla_core (int, optional) – The DLA core to assign DLA layers of the engine to. Default is None. If None, any DLA layers will be assigned to DLA core 0.

  • device (int, optional) – The CUDA device index to use for the engine. Default is None, which uses the current device.

  • warmup (bool, optional) – Whether to do warmup iterations, by default None. If None, warmup will be set to True.

  • verbose (bool, optional) – Whether or not to output additional information to stdout. Default None/False.

Returns:

A dataclass containing the results of the benchmark.

Return type:

BenchmarkResult
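A possible usage pattern, sketched under a few assumptions: format_latency and benchmark_file are hypothetical helper names, the iteration counts are arbitrary, and the engine path is supplied by the caller. Only the documented latency fields (raw, mean, median, min, max) are read, and no unit is assumed for them. The trtutils import is deferred so the sketch can be loaded on a machine without TensorRT.

```python
from types import SimpleNamespace
from typing import Any


def format_latency(result: Any) -> str:
    """Format the documented latency fields of a BenchmarkResult-like object."""
    lat = result.latency
    return (
        f"mean={lat.mean:.3f} median={lat.median:.3f} "
        f"min={lat.min:.3f} max={lat.max:.3f} (n={len(lat.raw)})"
    )


def benchmark_file(engine_path: str) -> str:
    """Benchmark an engine file and return a one-line summary."""
    # Imported lazily so this sketch loads on machines without TensorRT.
    import trtutils

    result = trtutils.benchmark_engine(
        engine_path, iterations=200, warmup_iterations=20
    )
    return format_latency(result)


# Stand-in object with the same attribute names, used only to show the format.
fake = SimpleNamespace(
    latency=SimpleNamespace(raw=[1.0, 3.0], mean=2.0, median=2.0, min=1.0, max=3.0)
)
line = format_latency(fake)
```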

trtutils.benchmark_engines(engines: Sequence[TRTEngine | Path | str | tuple[TRTEngine | Path | str, int] | tuple[TRTEngine | Path | str, int | None, int | None]], iterations: int = 1000, warmup_iterations: int = 50, *, warmup: bool | None = None, parallel: bool | None = None, verbose: bool | None = None) list[BenchmarkResult][source]

Benchmark TensorRT engines in parallel or serially.

Parameters:
  • engines (Sequence[...]) – The engines to benchmark. Each element can be a TRTEngine, Path, str, a 2-tuple of (engine, dla_core), or a 3-tuple of (engine, dla_core, device).

  • iterations (int, optional) – The number of iterations to run the benchmark for, by default 1000.

  • warmup_iterations (int, optional) – The number of warmup iterations to run before the benchmark, by default 50.

  • warmup (bool, optional) – Whether to do warmup iterations, by default None. If None, warmup will be set to True.

  • parallel (bool, optional) – Whether or not to process the engines in parallel. Useful for assessing concurrent execution performance. Will execute the engines in lockstep. If None, will benchmark each engine individually.

  • verbose (bool, optional) – Whether or not to output additional information to stdout. Default None/False.

Returns:

A list of dataclasses containing the results of the benchmark. If parallel was True, will only contain one item.

Return type:

list[BenchmarkResult]
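The three accepted shapes for an engines entry can be made concrete with a small normalizer. This is an illustration of the documented input forms only: normalize_spec is a hypothetical helper, and how trtutils unpacks the entries internally is not shown on this page.

```python
from typing import Any, Optional, Tuple

EngineSpec = Tuple[Any, Optional[int], Optional[int]]


def normalize_spec(item: Any) -> EngineSpec:
    """Expand an engines entry to an (engine, dla_core, device) triple."""
    if isinstance(item, tuple):
        if len(item) == 2:
            engine, dla_core = item
            return engine, dla_core, None
        if len(item) == 3:
            engine, dla_core, device = item
            return engine, dla_core, device
        raise ValueError(f"expected a 2-tuple or 3-tuple, got {len(item)} items")
    # A bare TRTEngine, Path, or str carries no DLA core or device.
    return item, None, None


specs = [
    normalize_spec("a.engine"),
    normalize_spec(("b.engine", 0)),
    normalize_spec(("c.engine", None, 1)),
]
```

The file names are placeholders. The third entry shows that dla_core may be None in the 3-tuple form, which matches the int | None annotation in the signature.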

trtutils.build_engine(onnx: Path | str, output: Path | str, default_device: trt.DeviceType | str = <DeviceType.GPU: 0>, workspace: float = 4.0, dla_core: int | None = None, calibration_cache: Path | str | None = None, data_batcher: AbstractBatcher | None = None, layer_precision: list[tuple[int, trt.DataType | None]] | None = None, layer_device: list[tuple[int, trt.DeviceType | None]] | None = None, shapes: Sequence[tuple[str, tuple[int, ...]]] | None = None, input_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, output_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, hooks: list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]] | None = None, optimization_level: int = 3, profiling_verbosity: trt.ProfilingVerbosity | None = None, tiling_optimization_level: trt.TilingOptimizationLevel | None = None, tiling_l2_cache_limit: int | None = None, device: int | None = None, *, timing_cache: Path | str | bool | None = None, gpu_fallback: bool = False, direct_io: bool = False, prefer_precision_constraints: bool = False, reject_empty_algorithms: bool = False, ignore_timing_mismatch: bool = False, fp16: bool | None = None, fp8: bool | None = None, int8: bool | None = None, cache: bool | None = None, verbose: bool | None = None) None[source]

Build a TensorRT engine from an ONNX model.

The order in which operations occur inside build_engine:

  1. Parse the ONNX model

  2. Apply any network hooks

  3. Create optimization profile and apply any manual shapes

  4. Apply builder flags (precision constraints, empty algorithms, direct I/O)

  5. Configure tensor formats if specified

  6. Configure precision (FP16, FP8, INT8)

  7. Set default device and DLA core

  8. Apply individual layer precision and device settings

  9. Set up timing cache

  10. Build the engine

  11. Save timing cache and engine
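Step 2 in the list above runs user-supplied hooks on the parsed network. A minimal sketch of such a hook follows. make_layer_recorder is a hypothetical helper; the hook it returns relies only on the num_layers attribute and get_layer method of trt.INetworkDefinition and on the name attribute of each layer, and it returns the network unchanged as the hook signature requires.

```python
from typing import Any, Callable, List


def make_layer_recorder(names: List[str]) -> Callable[[Any], Any]:
    """Build a hook that appends every layer name to `names`."""

    def hook(network: Any) -> Any:
        # `network` is expected to behave like trt.INetworkDefinition.
        for index in range(network.num_layers):
            names.append(network.get_layer(index).name)
        # Hooks must hand the (possibly modified) network back to the builder.
        return network

    return hook


layer_names: List[str] = []
record_layers = make_layer_recorder(layer_names)
# Hypothetical call, with placeholder file names:
# trtutils.build_engine("model.onnx", "model.engine", hooks=[record_layers])
```

Because hooks run before the optimization profile and builder flags are applied, the names recorded this way reflect the network as parsed from ONNX.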

Parameters:
  • onnx (Path, str) – The path to the onnx model.

  • output (Path, str) – The location to save the TensorRT engine.

  • default_device (trt.DeviceType, str, optional) – The device to use for the engine. By default, trt.DeviceType.GPU. Options are trt.DeviceType.GPU, trt.DeviceType.DLA, or a string of “gpu” or “dla”.

  • timing_cache (Path, str, bool, optional) – Where to store the timing cache data. Can be a Path or str to a specific file, “global” or True to use the global timing cache stored in the trtutils cache directory, or None to not use a timing cache. Default is None.

  • workspace (float) – The size of the workspace in gigabytes. Default is 4.0 GiB.

  • calibration_cache (Path, str, optional) – The path to the calibration cache.

  • data_batcher (AbstractBatcher, optional) – The data batcher to use for calibration.

  • dla_core (int, optional) – The DLA core to build the engine for. By default, None, which builds the engine for GPU.

  • layer_precision (list[tuple[int, trt.DataType | None]], optional) – The precision to use for specific layers. By default, None.

  • layer_device (list[tuple[int, trt.DeviceType | None]], optional) – The device to use for specific layers. By default, None.

  • shapes (list[tuple[str, tuple[int, ...]]], optional) – A list of (input_name, shape) pairs to specify the shapes of the input layers. For example, shapes=[(“images”, (1, 3, imgsz, imgsz))] will set the input “images” to a fixed shape. This shape will be used as the min, optimal, and max shape for the binding. By default, None.

  • input_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples to allow deep specification of input layers. For example, input_tensor_formats=[(“input”, trt.DataType.UINT8, trt.TensorFormat.HWC)]. By default, None.

  • output_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples to allow deep specification of output layers. For example, output_tensor_formats=[(“output”, trt.DataType.HALF, trt.TensorFormat.LINEAR)]. By default, None.

  • hooks (list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]], optional) – An optional list of ‘hook’ functions to modify the TensorRT network before the remainder of the build phase occurs. By default, None

  • optimization_level (int, optional) – Optimization level to apply to the TensorRT builder config (0-5). By default, 3.

  • profiling_verbosity (trt.ProfilingVerbosity | None, optional) – Level of detail for profiling information in the built engine. Options are: trt.ProfilingVerbosity.NONE, trt.ProfilingVerbosity.LAYER_NAMES_ONLY, trt.ProfilingVerbosity.DETAILED. DETAILED is recommended for best layer names when using profile_engine. By default, None (uses TensorRT’s default).

  • tiling_optimization_level (trt.TilingOptimizationLevel | None, optional) – Tiling optimization level to enable cross-kernel tiled inference. By default, None (no tiling optimization).

  • tiling_l2_cache_limit (int, None, optional) – L2 cache limit (in bytes) for tiling optimization. By default, None (TensorRT manages the default value).

  • device (int, optional) – The CUDA device index to build the engine on. Default is None, which uses the current device.

  • gpu_fallback (bool) – Whether or not to allow GPU fallback for unsupported layers when building the engine for DLA. By default, False

  • direct_io (bool) – Use direct IO for the engine. By default, False

  • prefer_precision_constraints (bool) – Whether or not to prefer precision constraints. By default, False

  • reject_empty_algorithms (bool) – Whether or not to reject empty algorithms. By default, False

  • ignore_timing_mismatch (bool) – Whether or not to allow different CUDA device generated timing caches to be used in the building of engines. By default, False

  • fp16 (bool, optional) – If True, quantize the engine to FP16 precision.

  • fp8 (bool, optional) – If True, enable FP8 precision for the engine. Requires compute capability >= 8.9 (Ada Lovelace / Hopper or newer).

  • int8 (bool, optional) – If True, quantize the engine to INT8 precision.

  • cache (bool, optional) – Whether or not to cache the engine in the trtutils engine cache. If an existing version is found, it will be used. The name of the output file is used to assess whether the engine has been compiled before. As such, naming the output ‘engine’, ‘model’ or similar will result in unintended caching behavior. By default None, will not cache the engine.

  • verbose (bool, optional) – If True, print verbose output. By default, None or False

Raises:
  • RuntimeError – If the ONNX model cannot be parsed

  • RuntimeError – If the TensorRT engine fails to build

  • ValueError – If a layer is manually assigned to DLA, DLA is not supported, and gpu_fallback is False
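As a usage illustration, the sketch below assembles keyword arguments for a fixed-shape FP16 build. fixed_shape_build_kwargs and build are hypothetical helpers, the 640 default for imgsz is an assumption, and the input name "images" is taken from the shapes example above and may not match your model. Every key is a parameter documented in the signature.

```python
from typing import Any, Dict


def fixed_shape_build_kwargs(onnx: str, output: str, imgsz: int = 640) -> Dict[str, Any]:
    """Collect build_engine arguments for a fixed-shape FP16 engine."""
    return {
        "onnx": onnx,
        "output": output,
        # One (input_name, shape) pair, used as min, optimal, and max shape.
        "shapes": [("images", (1, 3, imgsz, imgsz))],
        "fp16": True,
        # "global" selects the timing cache in the trtutils cache directory.
        "timing_cache": "global",
        "optimization_level": 3,
    }


def build(onnx: str, output: str) -> None:
    """Build the engine; requires trtutils and TensorRT to be installed."""
    # Imported lazily so this sketch loads on machines without TensorRT.
    import trtutils

    trtutils.build_engine(**fixed_shape_build_kwargs(onnx, output))


kwargs = fixed_shape_build_kwargs("model.onnx", "model_fp16.engine", imgsz=320)
```

Keeping the arguments in a dictionary makes it easy to log or compare build configurations before starting a build.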

trtutils.disable_jit() None[source]

Disable JIT compilation.

trtutils.disable_nvtx() None[source]

Disable trtutils NVTX profiling.

trtutils.enable_jit() None[source]

Enable just-in-time compilation using Numba for some functions.

trtutils.enable_nvtx() None[source]

Enable trtutils NVTX profiling.

trtutils.find_trtexec() Path[source]

Find an instance of the trtexec binary on the system.

Requires the locate command to be installed on the system. As such, only works on Unix-like systems.

Returns:

The path to the trtexec binary

Return type:

Path

Raises:

FileNotFoundError – If the trtexec binary is not found on the system

trtutils.get_device() int[source]

Get the current CUDA device.

Returns:

The current CUDA device index.

Return type:

int

trtutils.inspect_engine(engine: Path | str | ICudaEngine, *, verbose: bool | None = None) tuple[int, int, list[tuple[str, tuple[int, ...], DataType, TensorFormat]], list[tuple[str, tuple[int, ...], DataType, TensorFormat]]][source]

Inspect a TensorRT engine.

Parameters:
  • engine (Path | str | trt.ICudaEngine) – Path to the TensorRT engine file or an already loaded engine

  • verbose (bool | None, optional) – Whether to print verbose output, by default None

Returns:

The size in bytes of the engine, the max batch size, and two lists of input and output tensors

Return type:

tuple[int, int, list[tuple[str, tuple[int, …], trt.DataType, trt.TensorFormat]], list[tuple[str, tuple[int, …], trt.DataType, trt.TensorFormat]]]
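The returned 4-tuple can be unpacked as sketched below. describe is a hypothetical helper, and the sample tuple uses plain strings where the real function returns trt.DataType and trt.TensorFormat values. All sample values are invented for illustration.

```python
from typing import Any, List, Tuple

TensorInfo = Tuple[str, Tuple[int, ...], Any, Any]
EngineInfo = Tuple[int, int, List[TensorInfo], List[TensorInfo]]


def describe(info: EngineInfo) -> List[str]:
    """Turn an inspect_engine-style result into printable lines."""
    size_bytes, max_batch, inputs, outputs = info
    lines = [f"engine size: {size_bytes} bytes, max batch: {max_batch}"]
    for kind, tensors in (("input", inputs), ("output", outputs)):
        for name, shape, dtype, fmt in tensors:
            lines.append(f"{kind} {name}: shape={shape} dtype={dtype} format={fmt}")
    return lines


# Invented sample data; strings stand in for the TensorRT enum values.
sample: EngineInfo = (
    1024,
    1,
    [("images", (1, 3, 640, 640), "FLOAT", "LINEAR")],
    [("output0", (1, 84, 8400), "FLOAT", "LINEAR")],
)
report = describe(sample)
```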

trtutils.profile_engine(engine: Path | str | TRTEngine, iterations: int = 100, warmup_iterations: int = 10, dla_core: int | None = None, device: int | None = None, tegra_interval: int = 5, *, jetson: bool = False, warmup: bool | None = None, verbose: bool | None = None) ProfilerResult | JetsonProfilerResult[source]

Profile a TensorRT engine layer-by-layer.

This is a dispatcher function that calls either the standard profiler or the Jetson-specific profiler based on the jetson parameter.

This function runs inference multiple times and collects per-layer execution times using TensorRT’s IProfiler interface. On Jetson devices with jetson=True, it also collects power and energy metrics. It returns aggregated statistics (mean, median, min, max) for each layer across all iterations.

Notes

For best results, build the engine with profiling_verbosity set to DETAILED when calling build_engine. Otherwise, layer names may be numeric indices.

When jetson=True, the Jetson profiler function has a default of 10000 iterations (instead of 100) to ensure adequate tegrastats sampling coverage across all layers. You can override this by explicitly providing the iterations parameter.

Parameters:
  • engine (Path | str | TRTEngine) – The engine to profile. Either a TRTEngine object or path to the engine file. If a path is given, then a TRTEngine will be created automatically.

  • iterations (int, optional) – The number of profiling iterations to run, by default 100 for standard profiling. Note: The Jetson profiler uses 10000 by default if not explicitly specified.

  • warmup_iterations (int, optional) – The number of warmup iterations to run before profiling, by default 10.

  • dla_core (int, optional) – The DLA core to assign DLA layers of the engine to. Default is None. If None, any DLA layers will be assigned to DLA core 0.

  • device (int, optional) – The CUDA device index to use for the engine. Default is None, which uses the current device.

  • tegra_interval (int, optional) – The interval in milliseconds between tegrastats samples (Jetson only), by default 5. Only used when jetson=True.

  • jetson (bool, optional) – Whether to use Jetson-specific profiling with power/energy metrics, by default False.

  • warmup (bool, optional) – Whether to do warmup iterations, by default None. If None, warmup will be set to True.

  • verbose (bool, optional) – Whether to output additional information to stdout. Default None/False.

Returns:

If jetson=False: ProfilerResult containing per-layer timing statistics and total execution time. If jetson=True: JetsonProfilerResult containing per-layer timing statistics with power/energy data, total execution time, overall power draw, and overall energy consumption.

Return type:

ProfilerResult | JetsonProfilerResult

trtutils.register_jit(*, fastmath: bool = False, parallel: bool = False, nogil: bool = False, cache: bool = False, inline: str = 'never') Callable[[Callable[..., _R]], Callable[..., _R]][source]

Decorate a function to register to be re-imported whenever JIT status changes.

Parameters:
  • fastmath (bool, optional) – If True, enable fastmath during jit. Default is False.

  • parallel (bool, optional) – If True, enable parallel jit. Default is False.

  • nogil (bool, optional) – If True, disable the GIL when running jit compiled functions. Default is False.

  • cache (bool, optional) – If True, cache jit compiled functions to disk. Default is False.

  • inline (str, optional) – Whether or not to inline functions at the Numba IR level. Default is ‘never’. Options are: [‘never’, ‘always’]

Returns:

The registered and optionally JIT-compiled function.

Return type:

Callable[[Callable[_P, _R]], Callable[_P, _R]]

Examples

>>> @register_jit(fastmath=True, parallel=True)
... def my_func(x):
...     return x * x

trtutils.run_trtexec(command: str, trtexec_path: Path | str | None = None) tuple[bool, str, str][source]

Run a command using trtexec.

The goal of this function is to make it easier to use trtexec within Python scripts. Returning the stdout and stderr streams to the Python program as strings can simplify the logic of scripts which utilize trtexec.

Parameters:
  • command (str) – The command to run using trtexec

  • trtexec_path (Path | str | None, optional) – The path to the trtexec binary to use. If None, find_trtexec will be used.

Returns:

A tuple containing the following elements: (success, stdout, stderr)

Return type:

tuple[bool, str, str]
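The (success, stdout, stderr) return shape can be sketched with the standard library. This is not how run_trtexec is implemented, which this page does not describe: run_captured is a hypothetical helper that treats a zero exit code as success, and the demonstration runs the Python interpreter rather than trtexec so that it works on any machine.

```python
import subprocess
import sys
from typing import Sequence, Tuple


def run_captured(argv: Sequence[str]) -> Tuple[bool, str, str]:
    """Run a command and return (success, stdout, stderr) as strings."""
    proc = subprocess.run(list(argv), capture_output=True, text=True, check=False)
    # Assumption: success means the process exited with status 0.
    return proc.returncode == 0, proc.stdout, proc.stderr


ok, out, err = run_captured([sys.executable, "-c", "print('hello')"])
```

Returning the streams as strings lets the caller search the output for specific text without managing pipes directly.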

trtutils.set_device(device: int) None[source]

Set the current CUDA device.

Parameters:

device (int) – The CUDA device index to set.

trtutils.set_log_level(level: str) None[source]

Set the log level for the trtutils package.

Parameters:

level (str) – The log level to set. One of “DEBUG”, “INFO”, “WARNING”, “ERROR”, “CRITICAL”.

Raises:

ValueError – If the level is not one of the allowed values.
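The validation behaviour can be illustrated with the standard logging module. This is a sketch and not the trtutils code: set_level_checked is a hypothetical helper, the logger name is a placeholder, and rejecting lowercase names is an assumption, since this page lists only the uppercase level names.

```python
import logging

_ALLOWED = ("DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL")


def set_level_checked(logger: logging.Logger, level: str) -> None:
    """Set the level on `logger`, rejecting names outside the allowed set."""
    if level not in _ALLOWED:
        raise ValueError(f"Invalid log level {level!r}; expected one of {_ALLOWED}")
    # Each allowed name is also a module-level constant in `logging`.
    logger.setLevel(getattr(logging, level))


demo_logger = logging.getLogger("trtutils_sketch")
set_level_checked(demo_logger, "DEBUG")
```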