trtutils.builder package

Module contents

Submodule containing tools for building TensorRT engines.

Submodules

hooks

Submodule containing hooks for building TensorRT engines.

onnx

Submodule containing tools for working with ONNX models.

quantize

Submodule for ONNX post-training quantization using NVIDIA modelopt.

Classes

AbstractBatcher

Abstract base class for data batching classes.

EngineCalibrator

Calibrates an engine during quantization.

ImageBatcher

Batches images for calibration during engine building.

SyntheticBatcher

Generates synthetic data batches for calibration during engine building.

ProgressBar

Progress bar implementation for TensorRT engine building.

Functions

build_engine()

Build a TensorRT engine from an ONNX file.

build_dla_engine()

Build an efficient TensorRT engine for DLA.

can_run_on_dla()

Evaluate if the model can run on a DLA.

read_onnx()

Read an ONNX file and get TensorRT objects.

class trtutils.builder.AbstractBatcher

Bases: ABC

Abstract base class for data batching classes.

abstract property num_batches: int

Get the number of batches.

abstract property batch_size: int

Get the batch size.

abstract get_next_batch() -> np.ndarray | None

Get the next batch of data.

save_calibration_data(output_path: Path | str, *, verbose: bool | None = None) -> Path

Drain all batches and save concatenated calibration data to a .npy file.

Parameters:
  • output_path (Path, str) – The path to save the calibration data to.

  • verbose (bool, optional) – Whether to print verbose output, by default None.

Returns:

The resolved path to the saved calibration data.

Return type:

Path

Raises:

ValueError – If no batches could be retrieved from the batcher.
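
The batcher contract described above (a fixed num_batches, a fixed batch_size, and get_next_batch() returning None once the data is exhausted) can be illustrated without TensorRT or NumPy. The sketch below is only an illustration: ListBatcher is a hypothetical stand-in built on plain Python lists, it does not subclass AbstractBatcher, and the drain helper only mirrors what "drain all batches" plausibly means, not how save_calibration_data is actually implemented.

```python
from typing import List, Optional


class ListBatcher:
    """Hypothetical stand-in that mimics the AbstractBatcher interface."""

    def __init__(self, samples: List[float], batch_size: int) -> None:
        self._samples = samples
        self._batch_size = batch_size
        self._cursor = 0  # index of the next batch to serve

    @property
    def batch_size(self) -> int:
        return self._batch_size

    @property
    def num_batches(self) -> int:
        # Assumption: only full batches are served; a trailing partial batch is dropped.
        return len(self._samples) // self._batch_size

    def get_next_batch(self) -> Optional[List[float]]:
        if self._cursor >= self.num_batches:
            return None  # exhausted
        start = self._cursor * self._batch_size
        self._cursor += 1
        return self._samples[start : start + self._batch_size]


def drain(batcher: ListBatcher) -> List[List[float]]:
    """Pull batches until the batcher signals exhaustion with None."""
    batches = []
    while (batch := batcher.get_next_batch()) is not None:
        batches.append(batch)
    if not batches:
        raise ValueError("no batches could be retrieved from the batcher")
    return batches
```

A real subclass would return np.ndarray batches instead of lists; the control flow around None is the part this sketch is meant to show.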

class trtutils.builder.EngineCalibrator(calibration_cache: Path | str | None = None)

Bases: IInt8EntropyCalibrator2

Implements trt.IInt8EntropyCalibrator2.

set_batcher(batcher: AbstractBatcher) -> None

Set the batcher.

get_batch_size() -> int

Get the batch size.

Overrides from trt.IInt8EntropyCalibrator2.

Returns:

The batch size

Return type:

int

get_batch(names: list[str]) -> list[int] | None

Get the next batch of data.

Overrides from trt.IInt8EntropyCalibrator2.

Parameters:

names (list[str]) – The list of inputs, if useful to define the batch.

Returns:

GPU memory pointers for the next batch, or None when no batches remain

Return type:

list[int] | None

read_calibration_cache() -> bytes | None

Read the calibration cache file if it exists.

Overrides from trt.IInt8EntropyCalibrator2.

Returns:

The calibration cache contents if it exists

Return type:

bytes | None

write_calibration_cache(cache: bytes) -> None

Write the calibration data to the calibration cache file.

Overrides from trt.IInt8EntropyCalibrator2.

Parameters:

cache (bytes) – The calibration data generated.
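
The two cache methods form a simple round trip: the read side returns the stored bytes if a cache file exists (which lets TensorRT reuse earlier calibration results) and None otherwise, and the write side persists whatever bytes TensorRT hands back. A minimal sketch of that pattern using only the standard library follows. CalibrationCacheFile is a hypothetical helper written for illustration; it is not part of trtutils and may differ from how EngineCalibrator handles its file.

```python
from pathlib import Path
from typing import Optional


class CalibrationCacheFile:
    """Hypothetical helper mirroring read/write calibration cache semantics."""

    def __init__(self, path: Optional[Path] = None) -> None:
        self._path = path

    def read(self) -> Optional[bytes]:
        # No cache configured, or nothing written yet: report "no cache".
        if self._path is None or not self._path.exists():
            return None
        return self._path.read_bytes()

    def write(self, cache: bytes) -> None:
        # Assumption: writing is silently skipped when no path was configured.
        if self._path is None:
            return
        self._path.parent.mkdir(parents=True, exist_ok=True)
        self._path.write_bytes(cache)
```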

class trtutils.builder.ImageBatcher(image_dir: Path | str, shape: tuple[int, int, int], dtype: np.dtype | type[np.generic], batch_size: int = 8, order: str = 'NCHW', max_images: int | None = None, resize_method: str = 'letterbox', input_scale: tuple[float, float] = (0.0, 1.0), *, verbose: bool | None = None)

Bases: AbstractBatcher

Creates image batches for calibrating TensorRT engines.

property num_batches: int

Get the number of batches.

property batch_size: int

Get the batch size.

get_next_batch() -> np.ndarray | None

Get a batch of images which have been preprocessed.

Returns:

The batch of images if one exists

Return type:

np.ndarray | None
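
The default resize_method is 'letterbox'. As a rough illustration of what letterboxing generally involves, the sketch below computes the resized dimensions and padding offsets for fitting an image into a target size while preserving aspect ratio. The function name letterbox_geometry and the centered-padding choice are assumptions made for this example; the preprocessing inside ImageBatcher may round or pad differently.

```python
from typing import Tuple


def letterbox_geometry(
    src_hw: Tuple[int, int], dst_hw: Tuple[int, int]
) -> Tuple[Tuple[int, int], Tuple[int, int]]:
    """Return ((new_h, new_w), (pad_top, pad_left)) for a centered letterbox."""
    src_h, src_w = src_hw
    dst_h, dst_w = dst_hw
    # Scale by the tighter dimension so the whole image fits inside the target.
    scale = min(dst_h / src_h, dst_w / src_w)
    new_h, new_w = round(src_h * scale), round(src_w * scale)
    # Assumption: the leftover border is split evenly, extra pixel on the far side.
    pad_top = (dst_h - new_h) // 2
    pad_left = (dst_w - new_w) // 2
    return (new_h, new_w), (pad_top, pad_left)
```

For a 480x640 (height x width) image and a 640x640 target, the image keeps its size and receives 80 rows of padding above it, with the remaining 80 rows below.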

class trtutils.builder.SyntheticBatcher(shape: tuple[int, int, int], dtype: np.dtype | type[np.generic], batch_size: int = 8, num_batches: int = 10, data_range: tuple[float, float] = (0.0, 1.0), order: str = 'NCHW', *, verbose: bool | None = None)

Bases: AbstractBatcher

Creates synthetic data batches for calibrating TensorRT engines.

property num_batches: int

Get the number of batches.

property batch_size: int

Get the batch size.

get_next_batch() -> np.ndarray | None

Get a batch of synthetic data.

Returns:

The batch of synthetic data if one exists, None if all batches have been returned

Return type:

np.ndarray | None

trtutils.builder.build_dla_engine(onnx: Path | str, output_path: Path | str, data_batcher: AbstractBatcher, dla_core: int, max_chunks: int = 1, min_layers: int = 20, workspace: float = 4.0, calibration_cache: Path | str | None = None, timing_cache: Path | str | None = None, shapes: list[tuple[str, tuple[int, ...]]] | None = None, input_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, output_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, hooks: list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]] | None = None, optimization_level: int = 3, *, direct_io: bool = False, prefer_precision_constraints: bool = False, reject_empty_algorithms: bool = False, ignore_timing_mismatch: bool = False, fp8: bool | None = None, cache: bool | None = None, verbose: bool | None = None) -> None

Automatically build a TensorRT engine for DLA with automatic layer assignments.

This function will:

  1. Check which layers can run on DLA

  2. Find the largest chunk of DLA-compatible layers

  3. Assign those layers to DLA with INT8 precision

  4. Assign remaining layers to GPU with FP16 precision

Parameters:
  • onnx (Path, str) – The path to the ONNX model or a pre-made TensorRT network

  • output_path (Path, str) – The path where the engine should be saved

  • data_batcher (AbstractBatcher) – The data batcher instance for INT8 calibration

  • dla_core (int) – The DLA core to use

  • max_chunks (int, optional) – The maximum number of DLA-compatible chunks to assign to the DLA. By default 1, which will assign the first compatible chunk. Can set to 0 to assign all chunks which meet min_layers.

  • min_layers (int, optional) – The minimum number of layers in a chunk to be assigned to DLA. By default 20, which will assign chunks with at least 20 layers. Can set to 0 to assign all chunks.

  • workspace (float) – The size of the workspace in gigabytes. Default is 4.0 GiB.

  • calibration_cache (Path, str, optional) – The path to the calibration cache.

  • timing_cache (Path, str, optional) – Where to store the timing cache data. Default is None.

  • shapes (list[tuple[str, tuple[int, ...]]], optional) – A list of (input_name, shape) pairs to specify the shapes of the input layers. For example, shapes=[("images", (1, 3, imgsz, imgsz))] will set the input "images" to a fixed shape. This shape will be used as the min, optimal, and max shape for the binding. By default, None.

  • input_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples to allow deep specification of input layers. For example, input_tensor_formats=[("input", trt.DataType.UINT8, trt.TensorFormat.HWC)]. By default, None.

  • output_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples to allow deep specification of output layers. For example, output_tensor_formats=[("output", trt.DataType.HALF, trt.TensorFormat.LINEAR)]. By default, None.

  • hooks (list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]], optional) – An optional list of ‘hook’ functions to modify the TensorRT network before the remainder of the build phase occurs. By default, None

  • optimization_level (int, optional) – Optimization level to apply to the TensorRT builder config (0-5). By default, 3.

  • direct_io (bool) – Use direct IO for the engine. By default, False

  • prefer_precision_constraints (bool) – Whether or not to prefer precision constraints. By default, False

  • reject_empty_algorithms (bool) – Whether or not to reject empty algorithms. By default, False

  • ignore_timing_mismatch (bool) – Whether or not to allow different CUDA device generated timing caches to be used in the building of engines. By default, False

  • fp8 (bool, optional) – If True, enable FP8 precision for GPU layers. Requires compute capability >= 8.9 (Ada Lovelace / Hopper or newer). DLA layers will still use INT8 precision.

  • cache (bool, optional) – Whether or not to cache the engine in the trtutils engine cache. If a cached version already exists, it is used instead of rebuilding. The name of the output file is used to determine whether the engine has been compiled before, so naming the output 'engine', 'model', or similar will result in unintended caching behavior. By default None, which does not cache the engine.

  • verbose (bool, optional) – Whether to print verbose output, by default False
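
To make the interaction between max_chunks and min_layers concrete, here is a small selection sketch over chunk tuples shaped like the ones can_run_on_dla returns. Several things are assumptions: the tuple layout is taken to be (layers, start_index, end_index, on_dla), the helper name select_dla_chunks is hypothetical, and chunks are ranked by size to follow step 2 of the overview. The real function may pick chunks in a different order.

```python
from typing import List, Sequence, Tuple

# Assumed layout of one chunk: (layers, start_index, end_index, on_dla).
Chunk = Tuple[Sequence[object], int, int, bool]


def select_dla_chunks(
    chunks: Sequence[Chunk], max_chunks: int = 1, min_layers: int = 20
) -> List[Chunk]:
    """Pick DLA-compatible chunks that are large enough, largest first."""
    candidates = [c for c in chunks if c[3] and len(c[0]) >= min_layers]
    # The sort is stable, so equally sized chunks keep their network order.
    candidates.sort(key=lambda c: len(c[0]), reverse=True)
    # max_chunks == 0 means "no limit", as described for the parameter above.
    if max_chunks > 0:
        candidates = candidates[:max_chunks]
    return candidates
```

With the defaults, only the single largest DLA-compatible chunk of at least 20 layers is returned; passing max_chunks=0 and min_layers=0 returns every DLA-compatible chunk.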

trtutils.builder.build_engine(onnx: Path | str, output: Path | str, default_device: trt.DeviceType | str = trt.DeviceType.GPU, workspace: float = 4.0, dla_core: int | None = None, calibration_cache: Path | str | None = None, data_batcher: AbstractBatcher | None = None, layer_precision: list[tuple[int, trt.DataType | None]] | None = None, layer_device: list[tuple[int, trt.DeviceType | None]] | None = None, shapes: Sequence[tuple[str, tuple[int, ...]]] | None = None, input_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, output_tensor_formats: list[tuple[str, trt.DataType, trt.TensorFormat]] | None = None, hooks: list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]] | None = None, optimization_level: int = 3, profiling_verbosity: trt.ProfilingVerbosity | None = None, tiling_optimization_level: trt.TilingOptimizationLevel | None = None, tiling_l2_cache_limit: int | None = None, device: int | None = None, *, timing_cache: Path | str | bool | None = None, gpu_fallback: bool = False, direct_io: bool = False, prefer_precision_constraints: bool = False, reject_empty_algorithms: bool = False, ignore_timing_mismatch: bool = False, fp16: bool | None = None, fp8: bool | None = None, int8: bool | None = None, cache: bool | None = None, verbose: bool | None = None) -> None

Build a TensorRT engine from an ONNX model.

The order in which operations occur inside build_engine:

  1. Parse the ONNX model

  2. Apply any network hooks

  3. Create optimization profile and apply any manual shapes

  4. Apply builder flags (precision constraints, empty algorithms, direct I/O)

  5. Configure tensor formats if specified

  6. Configure precision (FP16, FP8, INT8)

  7. Set default device and DLA core

  8. Apply individual layer precision and device settings

  9. Set up timing cache

  10. Build the engine

  11. Save timing cache and engine

Parameters:
  • onnx (Path, str) – The path to the onnx model.

  • output (Path, str) – The location to save the TensorRT engine.

  • default_device (trt.DeviceType, str, optional) – The device to use for the engine. By default, trt.DeviceType.GPU. Options are trt.DeviceType.GPU, trt.DeviceType.DLA, or a string of “gpu” or “dla”.

  • timing_cache (Path, str, bool, optional) – Where to store the timing cache data. Can be a Path or str to a specific file, “global” or True to use the global timing cache stored in the trtutils cache directory, or None to not use a timing cache. Default is None.

  • workspace (float) – The size of the workspace in gigabytes. Default is 4.0 GiB.

  • calibration_cache (Path, str, optional) – The path to the calibration cache.

  • data_batcher (AbstractBatcher, optional) – The data batcher to use for calibration.

  • dla_core (int, optional) – The DLA core to build the engine for. By default, None, which builds the engine for the GPU.

  • layer_precision (list[tuple[int, trt.DataType | None]], optional) – The precision to use for specific layers. By default, None.

  • layer_device (list[tuple[int, trt.DeviceType | None]], optional) – The device to use for specific layers. By default, None.

  • shapes (list[tuple[str, tuple[int, ...]]], optional) – A list of (input_name, shape) pairs to specify the shapes of the input layers. For example, shapes=[("images", (1, 3, imgsz, imgsz))] will set the input "images" to a fixed shape. This shape will be used as the min, optimal, and max shape for the binding. By default, None.

  • input_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples to allow deep specification of input layers. For example, input_tensor_formats=[("input", trt.DataType.UINT8, trt.TensorFormat.HWC)]. By default, None.

  • output_tensor_formats (list[tuple[str, trt.DataType, trt.TensorFormat]], optional) – A list of (name, dtype, format) tuples to allow deep specification of output layers. For example, output_tensor_formats=[("output", trt.DataType.HALF, trt.TensorFormat.LINEAR)]. By default, None.

  • hooks (list[Callable[[trt.INetworkDefinition], trt.INetworkDefinition]], optional) – An optional list of ‘hook’ functions to modify the TensorRT network before the remainder of the build phase occurs. By default, None

  • optimization_level (int, optional) – Optimization level to apply to the TensorRT builder config (0-5). By default, 3.

  • profiling_verbosity (trt.ProfilingVerbosity | None, optional) – Level of detail for profiling information in the built engine. Options are trt.ProfilingVerbosity.NONE, trt.ProfilingVerbosity.LAYER_NAMES_ONLY, and trt.ProfilingVerbosity.DETAILED. DETAILED is recommended for best layer names when using profile_engine. By default, None (uses TensorRT's default).

  • tiling_optimization_level (trt.TilingOptimizationLevel, optional) – Tiling optimization level to enable cross-kernel tiled inference. By default, None (no tiling optimization).

  • tiling_l2_cache_limit (int, None, optional) – L2 cache limit (in bytes) for tiling optimization. By default, None (TensorRT manages the default value).

  • device (int, optional) – The CUDA device index to build the engine on. Default is None, which uses the current device.

  • gpu_fallback (bool) – Whether or not to allow GPU fallback for unsupported layers when building the engine for DLA. By default, False

  • direct_io (bool) – Use direct IO for the engine. By default, False

  • prefer_precision_constraints (bool) – Whether or not to prefer precision constraints. By default, False

  • reject_empty_algorithms (bool) – Whether or not to reject empty algorithms. By default, False

  • ignore_timing_mismatch (bool) – Whether or not to allow different CUDA device generated timing caches to be used in the building of engines. By default, False

  • fp16 (bool, optional) – If True, quantize the engine to FP16 precision.

  • fp8 (bool, optional) – If True, enable FP8 precision for the engine. Requires compute capability >= 8.9 (Ada Lovelace / Hopper or newer).

  • int8 (bool, optional) – If True, quantize the engine to INT8 precision.

  • cache (bool, optional) – Whether or not to cache the engine in the trtutils engine cache. If a cached version already exists, it is used instead of rebuilding. The name of the output file is used to determine whether the engine has been compiled before, so naming the output 'engine', 'model', or similar will result in unintended caching behavior. By default None, which does not cache the engine.

  • verbose (bool, optional) – If True, print verbose output. By default, None, which is treated as False.

Raises:
  • RuntimeError – If the ONNX model cannot be parsed

  • RuntimeError – If the TensorRT engine fails to build

  • ValueError – If layer is manually assigned to DLA and DLA is not supported and gpu_fallback is False
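
As a usage sketch for the parameters above, the example below wires together a fixed input shape, a network hook, FP16, and the global timing cache. The helper names (tag_outputs, fixed_shape_kwargs, build), the input name "images", and the renaming performed by the hook are illustrative assumptions. The hook relies on the TensorRT network attributes num_outputs and get_output, and on tensor names being assignable; trtutils is imported lazily so the helpers can be inspected without TensorRT installed.

```python
from typing import Any, Dict


def tag_outputs(network):
    """Example hook: receives the network and returns it after renaming outputs."""
    for i in range(network.num_outputs):
        tensor = network.get_output(i)
        tensor.name = f"out_{i}"  # hypothetical naming scheme
    return network


def fixed_shape_kwargs(input_name: str, imgsz: int) -> Dict[str, Any]:
    """Keyword arguments for build_engine with one fixed-shape image input."""
    return {
        # The same shape is used as the min, optimal, and max shape.
        "shapes": [(input_name, (1, 3, imgsz, imgsz))],
        "hooks": [tag_outputs],
        "fp16": True,
        "timing_cache": "global",
    }


def build(onnx_path: str, engine_path: str, imgsz: int = 640) -> None:
    # Imported here so the helpers above work without TensorRT installed.
    from trtutils.builder import build_engine

    build_engine(onnx_path, engine_path, **fixed_shape_kwargs("images", imgsz))
```

Because hooks run right after the ONNX model is parsed (step 2 in the order of operations), any change the hook makes is visible to the profile, precision, and device steps that follow.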

trtutils.builder.can_run_on_dla(onnx: Path | str | trt.INetworkDefinition, config: trt.IBuilderConfig | None = None, *, verbose_layers: bool | None = None, verbose_chunks: bool | None = None) -> tuple[bool, list[tuple[list[trt.ILayer], int, int, bool]]]

Whether or not the entire model can be run on a DLA.

Parameters:
  • onnx (Path, str, or trt.INetworkDefinition) – The path to the onnx file or a pre-made TensorRT network.

  • config (trt.IBuilderConfig, optional) – The TensorRT builder config. Required if onnx is a network.

  • verbose_layers (bool, optional) – Whether to print verbose output for individual layers, by default None

  • verbose_chunks (bool, optional) – Whether to print verbose output for layer chunks, by default None

Returns:

A tuple of (1) whether or not the entire model can run on DLA and (2) the list of layer blocks, where each block can run on a single device, either DLA or GPU.

Return type:

tuple[bool, list[tuple[list[trt.ILayer], int, int, bool]]]

Raises:

ValueError – If config is not provided when onnx is a network
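
The nested return type is easier to work with once it is reduced to a few counts. The sketch below summarizes a chunk list, assuming each tuple is laid out as (layers, start_index, end_index, on_dla). That layout is inferred from the type signature and is not confirmed by the text above, and summarize_chunks is a hypothetical helper.

```python
from typing import Dict, Sequence, Tuple, Union

# Assumed layout of one chunk: (layers, start_index, end_index, on_dla).
Chunk = Tuple[Sequence[object], int, int, bool]


def summarize_chunks(chunks: Sequence[Chunk]) -> Dict[str, Union[int, bool]]:
    """Count layers per device and report whether every chunk is on DLA."""
    dla_layers = sum(len(layers) for layers, _start, _end, on_dla in chunks if on_dla)
    gpu_layers = sum(len(layers) for layers, _start, _end, on_dla in chunks if not on_dla)
    return {
        "num_chunks": len(chunks),
        "dla_layers": dla_layers,
        "gpu_layers": gpu_layers,
        # Under the assumed layout this should match the first returned element.
        "fully_dla": bool(chunks) and all(c[3] for c in chunks),
    }
```

In practice the input would come from something like `full, chunks = can_run_on_dla("model.onnx")`, where the file name is a placeholder.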

trtutils.builder.read_onnx(onnx: Path | str, workspace: float = 4.0) -> tuple[trt.INetworkDefinition, trt.IBuilder, trt.IBuilderConfig, trt.IOnnxParser]

Open an ONNX model and generate TensorRT network, builder, config, and parser.

Parameters:
  • onnx (Path, str) – The path to the onnx model.

  • workspace (float) – The size of the workspace in gigabytes. Default is 4.0 GiB.

Returns:

The network, builder, config, and parser.

Return type:

tuple[trt.INetworkDefinition, trt.IBuilder, trt.IBuilderConfig, trt.IOnnxParser]

class trtutils.builder.ProgressBar

Bases: IProgressMonitor

A progress bar for building TensorRT engines.

phase_start(phase_name: str, parent_phase: str | None, num_steps: int) -> None

Start a new phase.

Parameters:
  • phase_name (str) – The name of the phase.

  • parent_phase (str | None) – The name of the parent phase, or None if the phase is a root phase.

  • num_steps (int) – The number of steps in the phase.

step_complete(phase_name: str, step: int) -> bool

Step in current phase is completed.

Parameters:
  • phase_name (str) – The name of the phase.

  • step (int) – The step number.

Returns:

True if the build should continue, False if it should be interrupted.

Return type:

bool

phase_finish(phase_name: str) -> None

Finish the current phase.
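
The three callbacks form a small protocol: a phase is announced with its step count, each completed step is reported (and the return value decides whether the build continues), and the phase is then closed. The sketch below tracks that state in plain Python. PhaseTracker is a hypothetical stand-in that does not subclass IProgressMonitor or draw anything, and it assumes step is a zero-based index.

```python
from typing import Dict, Optional


class PhaseTracker:
    """Hypothetical stand-in with the same three callbacks as ProgressBar."""

    def __init__(self) -> None:
        self.totals: Dict[str, int] = {}  # phase name -> number of steps
        self.done: Dict[str, int] = {}    # phase name -> steps completed
        self.cancelled = False            # set to True to interrupt the build

    def phase_start(
        self, phase_name: str, parent_phase: Optional[str], num_steps: int
    ) -> None:
        self.totals[phase_name] = num_steps
        self.done[phase_name] = 0

    def step_complete(self, phase_name: str, step: int) -> bool:
        # Assumption: `step` is zero-based, so step + 1 steps are finished.
        self.done[phase_name] = step + 1
        # True lets the build continue, False requests an interruption.
        return not self.cancelled

    def phase_finish(self, phase_name: str) -> None:
        self.totals.pop(phase_name, None)
        self.done.pop(phase_name, None)
```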