trtutils.builder package

Module contents

Submodule containing tools for building TensorRT engines.

Submodules

hooks: Submodule containing hooks for building TensorRT engines.
onnx: Submodule containing tools for working with ONNX models.
quantize: Submodule for ONNX post-training quantization using NVIDIA modelopt.

Classes

AbstractBatcher: Abstract base class for data batching classes.
EngineCalibrator: Calibrates an engine during quantization.
ImageBatcher: Batches images for calibration during engine building.
SyntheticBatcher: Generates synthetic data batches for calibration during engine building.
ProgressBar: Progress bar implementation for TensorRT engine building.

Functions

build_engine(): Build a TensorRT engine from an ONNX file.
build_dla_engine(): Build an efficient TensorRT engine for DLA.
can_run_on_dla(): Evaluate if the model can run on a DLA.
read_onnx(): Read an ONNX file and get TensorRT objects.

class trtutils.builder.AbstractBatcher

Bases: ABC

Abstract base class for data batching classes.

abstract property num_batches: int

Get the number of batches.

abstract property batch_size: int

Get the batch size.

abstract get_next_batch() → np.ndarray | None

Get the next batch of data, or None when no batches remain.

save_calibration_data(output_path: Path | str, *, verbose: bool | None = None) → Path

Drain all batches and save concatenated calibration data to a .npy file.

Parameters:

output_path (Path, str) – The path to save the calibration data to.
verbose (bool, optional) – Whether to print verbose output, by default None.

Returns:

The resolved path to the saved calibration data.

Return type:

Path

Raises:

ValueError – If no batches could be retrieved from the batcher.
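
The interface above amounts to a drain contract: call get_next_batch() until it returns None. The following is a minimal sketch of that contract, not the trtutils implementation. It uses a stand-in base class and plain Python lists so it runs without trtutils or NumPy installed; BatcherLike, ListBatcher, and drain are hypothetical names, and the real AbstractBatcher yields np.ndarray batches.

```python
from __future__ import annotations

from abc import ABC, abstractmethod


class BatcherLike(ABC):
    """Stand-in mirroring the documented AbstractBatcher interface (hypothetical)."""

    @property
    @abstractmethod
    def num_batches(self) -> int: ...

    @property
    @abstractmethod
    def batch_size(self) -> int: ...

    @abstractmethod
    def get_next_batch(self) -> list[float] | None: ...


class ListBatcher(BatcherLike):
    """Serves fixed-size batches from an in-memory list of samples."""

    def __init__(self, samples: list[float], batch_size: int) -> None:
        self._samples = samples
        self._batch_size = batch_size
        self._index = 0

    @property
    def num_batches(self) -> int:
        # Only full batches are served; a trailing partial batch is dropped.
        return len(self._samples) // self._batch_size

    @property
    def batch_size(self) -> int:
        return self._batch_size

    def get_next_batch(self) -> list[float] | None:
        if self._index >= self.num_batches:
            return None  # signals that the batcher is exhausted
        start = self._index * self._batch_size
        self._index += 1
        return self._samples[start : start + self._batch_size]


def drain(batcher: BatcherLike) -> list[list[float]]:
    """Collect every batch, raising ValueError if none could be retrieved."""
    batches = []
    while (batch := batcher.get_next_batch()) is not None:
        batches.append(batch)
    if not batches:
        raise ValueError("no batches could be retrieved from the batcher")
    return batches
```

A subclass of the real AbstractBatcher would implement the same three members and inherit save_calibration_data().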

class trtutils.builder.EngineCalibrator(calibration_cache: Path | str | None = None)

Bases: IInt8EntropyCalibrator2

Implements the trt.IInt8EntropyCalibrator2 interface.

set_batcher(batcher: AbstractBatcher) → None

Set the batcher.

get_batch_size() → int

Get the batch size.

Overrides from trt.IInt8EntropyCalibrator2.

Returns: The batch size
Return type: int

get_batch(names: list[str]) → list[int] | None

Get the next batch of data.

Overrides from trt.IInt8EntropyCalibrator2.

Parameters: names (list[str]) – The list of inputs, if useful to define the batch.
Returns: GPU memory pointers for the next batch
Return type: list[int] | None

read_calibration_cache() → bytes | None

Read the calibration cache file if it exists.

Overrides from trt.IInt8EntropyCalibrator2.

Returns: The calibration cache contents if it exists
Return type: bytes | None

write_calibration_cache(cache: bytes) → None

Write the calibration data to the calibration cache file.

Overrides from trt.IInt8EntropyCalibrator2.

Parameters: cache (bytes) – The calibration data generated.
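
The cache read/write pair above can be pictured as a small file round trip. This is a rough sketch under stated assumptions rather than the EngineCalibrator source: CacheStore is a hypothetical stand-in, and skipping the write when no cache path is configured is an assumption.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


class CacheStore:
    """Hypothetical stand-in for the cache read/write pair described above."""

    def __init__(self, calibration_cache: Path | str | None = None) -> None:
        self._path = Path(calibration_cache) if calibration_cache is not None else None

    def read_calibration_cache(self) -> bytes | None:
        # Return the cached bytes only when a path was given and the file exists.
        if self._path is None or not self._path.exists():
            return None
        return self._path.read_bytes()

    def write_calibration_cache(self, cache: bytes) -> None:
        # Assumption: writing is skipped when no cache path was configured.
        if self._path is None:
            return
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._path.write_bytes(cache)


with tempfile.TemporaryDirectory() as tmp:
    store = CacheStore(Path(tmp) / "calib.cache")
    first_read = store.read_calibration_cache()   # no file yet
    store.write_calibration_cache(b"calibration-table")
    second_read = store.read_calibration_cache()  # bytes written above
```

When the read returns data, TensorRT can reuse it instead of requesting calibration batches again.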

class trtutils.builder.ImageBatcher(image_dir: Path | str, shape: tuple[int, int, int], dtype: np.dtype | type[np.generic], batch_size: int = 8, order: str = 'NCHW', max_images: int | None = None, resize_method: str = 'letterbox', input_scale: tuple[float, float] = (0.0, 1.0), *, verbose: bool | None = None)

Bases: AbstractBatcher

Creates image batches for calibrating TensorRT engines.

property num_batches: int

Get the number of batches.

property batch_size: int

Get the batch size.

get_next_batch() → np.ndarray | None

Get a batch of images which have been preprocessed.

Returns: The batch of images if one exists
Return type: np.ndarray | None

class trtutils.builder.SyntheticBatcher(shape: tuple[int, int, int], dtype: np.dtype | type[np.generic], batch_size: int = 8, num_batches: int = 10, data_range: tuple[float, float] = (0.0, 1.0), order: str = 'NCHW', *, verbose: bool | None = None)

Bases: AbstractBatcher

Creates synthetic data batches for calibrating TensorRT engines.

property num_batches: int

Get the number of batches.

property batch_size: int

Get the batch size.

get_next_batch() → np.ndarray | None

Get a batch of synthetic data.

Returns: The batch of synthetic data if one exists, None if all batches have been returned
Return type: np.ndarray | None

trtutils.builder.build_dla_engine(onnx: Path | str, output_path: Path | str, data_batcher: AbstractBatcher, dla_core: int, max_chunks: int = 1, min_layers: int = 20, workspace: float = 4.0, calibration_cache: Path | str | None = None, timing_cache: Path | str | None = None, shapes: list[tuple[str, tuple[int, ...]]] | None = None, input_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, output_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, hooks: list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]] | None = None, optimization_level: int = 3, *, direct_io: bool = False, prefer_precision_constraints: bool = False, reject_empty_algorithms: bool = False, ignore_timing_mismatch: bool = False, fp8: bool | None = None, cache: bool | None = None, verbose: bool | None = None) → None

Build a TensorRT engine for DLA with automatic layer assignment.

This function will:

1. Check which layers can run on DLA
2. Find the largest chunk of DLA-compatible layers
3. Assign those layers to DLA with INT8 precision
4. Assign remaining layers to GPU with FP16 precision

Parameters:

onnx (Path, str) – The path to the ONNX model or a pre-made TensorRT network
output_path (Path, str) – The path where the engine should be saved
data_batcher (AbstractBatcher) – The data batcher instance for INT8 calibration
dla_core (int) – The DLA core to use
max_chunks (int, optional) – The maximum number of DLA-compatible chunks to assign to the DLA. By default 1, which will assign the first compatible chunk. Can set to 0 to assign all chunks which meet min_layers.
min_layers (int, optional) – The minimum number of layers in a chunk to be assigned to DLA. By default 20, which will assign chunks with at least 20 layers. Can set to 0 to assign all chunks.
workspace (float) – The size of the workspace in gigabytes. Default is 4.0 GiB.
calibration_cache (Path, str, optional) – The path to the calibration cache.
timing_cache (Path, str, optional) – Where to store the timing cache data. Default is None.
shapes (list[tuple[str, tuple[int, ...]]], optional) – A list of (input_name, shape) pairs to specify the shapes of the input layers. For example, shapes=[("images", (1, 3, imgsz, imgsz))] will set the input "images" to a fixed shape. This shape will be used as the min, optimal, and max shape for the binding. By default, None.
input_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples for fine-grained specification of input layers. For example, input_tensor_formats=[("input", trt.DataType.UINT8, trt.TensorFormat.HWC)]. By default, None.
output_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples for fine-grained specification of output layers. For example, output_tensor_formats=[("output", trt.DataType.HALF, trt.TensorFormat.LINEAR)]. By default, None.
hooks (list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]], optional) – An optional list of ‘hook’ functions to modify the TensorRT network before the remainder of the build phase occurs. By default, None
optimization_level (int, optional) – Optimization level to apply to the TensorRT builder config (0-5). By default, 3.
direct_io (bool) – Use direct IO for the engine. By default, False
prefer_precision_constraints (bool) – Whether or not to prefer precision constraints. By default, False
reject_empty_algorithms (bool) – Whether or not to reject empty algorithms. By default, False
ignore_timing_mismatch (bool) – Whether or not to allow different CUDA device generated timing caches to be used in the building of engines. By default, False
fp8 (bool, optional) – If True, enable FP8 precision for GPU layers. Requires compute capability >= 8.9 (Ada Lovelace / Hopper or newer). DLA layers will still use INT8 precision.
cache (bool, optional) – Whether or not to cache the engine in the trtutils engine cache. If an existing version is found, it will be used. The name of the output file is used to determine whether the engine has been compiled before, so naming the output 'engine', 'model', or similar will result in unintended caching behavior. By default, None, which does not cache the engine.
verbose (bool, optional) – Whether to print verbose output. By default, None.
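
The interaction between max_chunks and min_layers can be sketched with plain tuples. This is an illustration under stated assumptions, not the library's selection code: select_dla_chunks is a hypothetical helper, the chunk layout (layers, start_index, end_index, runs_on_dla) is assumed from the can_run_on_dla return type, layer names stand in for trt.ILayer objects, and chunks are taken in order of appearance as the max_chunks description suggests.

```python
from __future__ import annotations

from typing import List, Tuple

# Assumed layout, mirroring the can_run_on_dla return type:
# (layers, start_index, end_index, runs_on_dla). Names stand in for trt.ILayer.
Chunk = Tuple[List[str], int, int, bool]


def select_dla_chunks(
    chunks: List[Chunk], max_chunks: int = 1, min_layers: int = 20
) -> List[Chunk]:
    """Pick DLA-compatible chunks that satisfy min_layers, capped by max_chunks."""
    eligible = [c for c in chunks if c[3] and len(c[0]) >= min_layers]
    # max_chunks == 0 means "no cap"; otherwise keep the first max_chunks chunks.
    return eligible if max_chunks == 0 else eligible[:max_chunks]


chunks: List[Chunk] = [
    (["conv1", "relu1", "conv2"], 0, 2, True),
    (["nms"], 3, 3, False),
    (["conv3", "relu3"], 4, 5, True),
]
```

With these toy chunks, the defaults select nothing because no chunk reaches 20 layers, while max_chunks=0 and min_layers=0 select both DLA-compatible chunks.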

trtutils.builder.build_engine(onnx: Path | str, output: Path | str, default_device: trt.DeviceType | str = <DeviceType.GPU: 0>, workspace: float = 4.0, dla_core: int | None = None, calibration_cache: Path | str | None = None, data_batcher: AbstractBatcher | None = None, layer_precision: list[tuple[int, trt.DataType | None]] | None = None, layer_device: list[tuple[int, trt.DeviceType | None]] | None = None, shapes: Sequence[tuple[str, tuple[int, ...]]] | None = None, input_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, output_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, hooks: list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]] | None = None, optimization_level: int = 3, profiling_verbosity: trt.ProfilingVerbosity | None = None, tiling_optimization_level: trt.TilingOptimizationLevel | None = None, tiling_l2_cache_limit: int | None = None, device: int | None = None, *, timing_cache: Path | str | bool | None = None, gpu_fallback: bool = False, direct_io: bool = False, prefer_precision_constraints: bool = False, reject_empty_algorithms: bool = False, ignore_timing_mismatch: bool = False, fp16: bool | None = None, fp8: bool | None = None, int8: bool | None = None, cache: bool | None = None, verbose: bool | None = None) → None

Build a TensorRT engine from an ONNX model.

Operations inside build_engine occur in the following order:

1. Parse the ONNX model
2. Apply any network hooks
3. Create optimization profile and apply any manual shapes
4. Apply builder flags (precision constraints, empty algorithms, direct I/O)
5. Configure tensor formats if specified
6. Configure precision (FP16, FP8, INT8)
7. Set default device and DLA core
8. Apply individual layer precision and device settings
9. Set up timing cache
10. Build the engine
11. Save timing cache and engine
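
The hook step in the order above can be pictured as a simple fold: each hook receives the network and returns the (possibly modified) network. The sketch below is illustrative only. It assumes hooks run in list order, which the documentation does not state explicitly, and it uses a plain dict as a hypothetical stand-in for trt.INetworkDefinition; apply_hooks and tag are not part of trtutils.

```python
from __future__ import annotations

from typing import Callable, Dict, List, Optional

# Hypothetical stand-in for trt.INetworkDefinition: a dict holding a log of edits.
Network = Dict[str, List[str]]
Hook = Callable[[Network], Network]


def apply_hooks(network: Network, hooks: Optional[List[Hook]]) -> Network:
    """Run each hook in list order, feeding the result into the next hook."""
    for hook in hooks or []:
        network = hook(network)
    return network


def tag(label: str) -> Hook:
    """Build a toy hook that records its label and returns the network."""

    def hook(network: Network) -> Network:
        network["applied"].append(label)
        return network

    return hook


network = apply_hooks({"applied": []}, [tag("mark_outputs"), tag("rename_inputs")])
```

A real hook would take and return a trt.INetworkDefinition, matching the Callable type in the hooks parameter.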

Parameters:

onnx (Path, str) – The path to the ONNX model.
output (Path, str) – The location to save the TensorRT engine.
default_device (trt.DeviceType, str, optional) – The device to use for the engine. By default, trt.DeviceType.GPU. Options are trt.DeviceType.GPU, trt.DeviceType.DLA, or a string of "gpu" or "dla".
timing_cache (Path, str, bool, optional) – Where to store the timing cache data. Can be a Path or str to a specific file, "global" or True to use the global timing cache stored in the trtutils cache directory, or None to not use a timing cache. Default is None.
workspace (float) – The size of the workspace in gigabytes. Default is 4.0 GiB.
calibration_cache (Path, str, optional) – The path to the calibration cache.
data_batcher (AbstractBatcher, optional) – The data batcher to use for calibration.
dla_core (int, optional) – The DLA core to build the engine for. By default, None, which builds the engine for the GPU.
layer_precision (list[tuple[int, trt.DataType | None]], optional) – The precision to use for specific layers. By default, None.
layer_device (list[tuple[int, trt.DeviceType | None]], optional) – The device to use for specific layers. By default, None.
shapes (Sequence[tuple[str, tuple[int, ...]]], optional) – A sequence of (input_name, shape) pairs to specify the shapes of the input layers. For example, shapes=[("images", (1, 3, imgsz, imgsz))] will set the input "images" to a fixed shape. This shape will be used as the min, optimal, and max shape for the binding. By default, None.
input_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples for fine-grained specification of input layers. For example, input_tensor_formats=[("input", trt.DataType.UINT8, trt.TensorFormat.HWC)]. By default, None.
output_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples for fine-grained specification of output layers. For example, output_tensor_formats=[("output", trt.DataType.HALF, trt.TensorFormat.LINEAR)]. By default, None.
hooks (list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]], optional) – An optional list of ‘hook’ functions to modify the TensorRT network before the remainder of the build phase occurs. By default, None
optimization_level (int, optional) – Optimization level to apply to the TensorRT builder config (0-5). By default, 3.
profiling_verbosity (trt.ProfilingVerbosity | None, optional) – Level of detail for profiling information in the built engine. Options are trt.ProfilingVerbosity.NONE, trt.ProfilingVerbosity.LAYER_NAMES_ONLY, and trt.ProfilingVerbosity.DETAILED. DETAILED is recommended for best layer names when using profile_engine. By default, None (uses TensorRT's default).
tiling_optimization_level (int, optional) – Tiling optimization level to enable cross-kernel tiled inference. By default, 0 (no tiling optimization).
tiling_l2_cache_limit (int, None, optional) – L2 cache limit (in bytes) for tiling optimization. By default, None (TensorRT manages the default value).
device (int, optional) – The CUDA device index to build the engine on. Default is None, which uses the current device.
gpu_fallback (bool) – Whether or not to allow GPU fallback for unsupported layers when building the engine for DLA. By default, False
direct_io (bool) – Use direct IO for the engine. By default, False
prefer_precision_constraints (bool) – Whether or not to prefer precision constraints. By default, False
reject_empty_algorithms (bool) – Whether or not to reject empty algorithms. By default, False
ignore_timing_mismatch (bool) – Whether or not to allow different CUDA device generated timing caches to be used in the building of engines. By default, False
fp16 (bool, optional) – If True, quantize the engine to FP16 precision.
fp8 (bool, optional) – If True, enable FP8 precision for the engine. Requires compute capability >= 8.9 (Ada Lovelace / Hopper or newer).
int8 (bool, optional) – If True, quantize the engine to INT8 precision.
cache (bool, optional) – Whether or not to cache the engine in the trtutils engine cache. If an existing version is found, it will be used. The name of the output file is used to determine whether the engine has been compiled before, so naming the output 'engine', 'model', or similar will result in unintended caching behavior. By default, None, which does not cache the engine.
verbose (bool, optional) – If True, print verbose output. By default, None or False

Raises:

RuntimeError – If the ONNX model cannot be parsed
RuntimeError – If the TensorRT engine fails to build
ValueError – If a layer is manually assigned to DLA, DLA is not supported, and gpu_fallback is False
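
As a usage sketch, an FP16 build with a fixed input shape might look like the following. It uses only parameters from the signature above; the wrapper name build_fp16_engine, the input name "images", and the 640x640 shape are illustrative assumptions, and the import is deferred so the snippet can be loaded on a machine without TensorRT installed.

```python
from pathlib import Path


def build_fp16_engine(onnx_path: Path, engine_path: Path) -> None:
    """Build an FP16 engine with one fixed input shape (illustrative wrapper)."""
    # Imported lazily so this module loads even where TensorRT is unavailable.
    from trtutils.builder import build_engine

    build_engine(
        onnx_path,
        engine_path,
        # Assumed input name and shape; replace with the model's actual input.
        shapes=[("images", (1, 3, 640, 640))],
        timing_cache="global",  # reuse the global timing cache across builds
        fp16=True,
        verbose=True,
    )
```

Calling build_fp16_engine(Path("model.onnx"), Path("model_fp16.engine")) would then write the engine to the given output path; both file names here are placeholders.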

trtutils.builder.can_run_on_dla(onnx: Path | str | trt.INetworkDefinition, config: trt.IBuilderConfig | None = None, *, verbose_layers: bool | None = None, verbose_chunks: bool | None = None) → tuple[bool, list[tuple[list[trt.ILayer], int, int, bool]]]

Evaluate whether the entire model can run on a DLA.

Parameters:

onnx (Path, str, or trt.INetworkDefinition) – The path to the ONNX file or a pre-made TensorRT network.
config (trt.IBuilderConfig, optional) – The TensorRT builder config. Required if onnx is a network.
verbose_layers (bool, optional) – Whether to print verbose output for individual layers, by default None
verbose_chunks (bool, optional) – Whether to print verbose output for layer chunks, by default None

Returns:

A tuple of (1) whether the entire model can run on DLA and (2) a list of layer blocks, where each block runs on a single device, either DLA or GPU.

Return type:

tuple[bool, list[tuple[list[trt.ILayer], int, int, bool]]]

Raises:

ValueError – If config is not provided when onnx is a network

trtutils.builder.read_onnx(onnx: Path | str, workspace: float = 4.0) → tuple[trt.INetworkDefinition, trt.IBuilder, trt.IBuilderConfig, trt.IOnnxParser]

Open an ONNX model and generate TensorRT network, builder, config, and parser.

Parameters:

onnx (Path, str) – The path to the ONNX model.
workspace (float) – The size of the workspace in gigabytes. Default is 4.0 GiB.

Returns:

The network, builder, config, and parser.

Return type:

tuple[trt.INetworkDefinition, trt.IBuilder, trt.IBuilderConfig, trt.IOnnxParser]

Raises:

FileNotFoundError – If the ONNX model does not exist
IsADirectoryError – If the ONNX model path is a directory
ValueError – If the ONNX model path does not have a .onnx extension
RuntimeError – If the ONNX model cannot be parsed
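
The path-related exceptions listed above correspond to a validation pass along these lines. This is a hedged sketch, not the read_onnx source: validate_onnx_path and check are hypothetical helpers, and the order of the checks is an assumption.

```python
from __future__ import annotations

import tempfile
from pathlib import Path


def validate_onnx_path(onnx: Path | str) -> Path:
    """Raise the documented path errors, or return the resolved path."""
    path = Path(onnx)
    if not path.exists():
        raise FileNotFoundError(f"ONNX model not found: {path}")
    if path.is_dir():
        raise IsADirectoryError(f"ONNX model path is a directory: {path}")
    if path.suffix != ".onnx":
        raise ValueError(f"expected a .onnx file, got: {path.name}")
    return path.resolve()


def check(path: Path) -> str:
    """Return 'ok' or the name of the raised exception (demo helper)."""
    try:
        validate_onnx_path(path)
    except (OSError, ValueError) as exc:
        return type(exc).__name__
    return "ok"


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "model.onnx").write_bytes(b"")  # empty placeholder file
    (root / "model.txt").write_bytes(b"")
    results = {
        "valid": check(root / "model.onnx"),
        "missing": check(root / "missing.onnx"),
        "directory": check(root),
        "wrong_suffix": check(root / "model.txt"),
    }
```

The remaining RuntimeError comes from the ONNX parser itself and is not covered by this path check.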

class trtutils.builder.ProgressBar

Bases: IProgressMonitor

A progress bar for building TensorRT engines.

phase_start(phase_name: str, parent_phase: str | None, num_steps: int) → None

Start a new phase.

Parameters:

phase_name (str) – The name of the phase.
parent_phase (str | None) – The name of the parent phase, or None if the phase is a root phase.
num_steps (int) – The number of steps in the phase.

step_complete(phase_name: str, step: int) → bool

Mark a step in the current phase as completed.

Parameters:

phase_name (str) – The name of the phase.
step (int) – The step number.

Returns:

True if the build should continue, False if it should be interrupted.

Return type:

bool

phase_finish(phase_name: str) → None

Finish the current phase.
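
The three callbacks form a small protocol: a phase starts, its steps complete one by one, and the phase finishes. The sketch below mirrors that protocol without TensorRT. PhaseTracker is a hypothetical stand-in rather than the ProgressBar implementation, and treating step as a zero-based index is an assumption.

```python
from __future__ import annotations


class PhaseTracker:
    """Hypothetical stand-in with the same three callbacks as ProgressBar."""

    def __init__(self) -> None:
        self.phases: dict[str, dict] = {}
        self.cancelled = False

    def phase_start(self, phase_name: str, parent_phase: str | None, num_steps: int) -> None:
        # Register the phase with its parent and total step count.
        self.phases[phase_name] = {"parent": parent_phase, "total": num_steps, "done": 0}

    def step_complete(self, phase_name: str, step: int) -> bool:
        # Assumption: step is a zero-based index, so step + 1 steps are done.
        self.phases[phase_name]["done"] = step + 1
        # Returning False asks the builder to interrupt the build.
        return not self.cancelled

    def phase_finish(self, phase_name: str) -> None:
        self.phases.pop(phase_name, None)


tracker = PhaseTracker()
tracker.phase_start("build", None, 3)
keep_going = tracker.step_complete("build", 0)
progress = tracker.phases["build"]["done"]
tracker.cancelled = True
interrupted = not tracker.step_complete("build", 1)
tracker.phase_finish("build")
```

Returning False from step_complete is the mechanism by which a monitor can request that the build stop early.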