trtutils.jetson package

Module contents

A submodule implementing additional tools for Jetson devices.

Classes

JetsonBenchmarkResult: The results of benchmarking a TRTEngine on a Jetson device.
JetsonLayerInfo: Per-layer timing with power and energy metrics for Jetson profiling.
JetsonProfilerResult: The results of profiling a TRTEngine on a Jetson device.

Functions

benchmark_engine(): A mirror of trtutils.benchmark_engine, but also measures energy usage.
benchmark_engines(): A mirror of trtutils.benchmark_engines, but also measures energy usage.
profile_engine(): A mirror of trtutils.inspect.profile_engine, but also measures per-layer energy usage.

class trtutils.jetson.JetsonBenchmarkResult(latency: Metric, power_draw: Metric, energy: Metric)

Bases: object

latency: Metric

power_draw: Metric

energy: Metric

class trtutils.jetson.JetsonLayerInfo(name: str, mean: float, median: float, min: float, max: float, raw: list[float], power: float, energy: float)

Bases: LayerTiming

A dataclass to store per-layer profiling statistics for Jetson devices.

Extends LayerTiming with power and energy metrics.

name

The name of the layer.

Type: str

mean

The mean execution time in milliseconds.

Type: float

median

The median execution time in milliseconds.

Type: float

min

The minimum execution time in milliseconds.

Type: float

max

The maximum execution time in milliseconds.

Type: float

raw

The raw execution times in milliseconds across all iterations.

Type: list[float]

power

The mean power draw in milliwatts during layer execution.

Type: float

energy

The mean energy consumption in millijoules per layer execution.

Type: float

power: float

energy: float
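The units of the fields above are tied together by energy = power × time: milliwatts multiplied by milliseconds gives microjoules, so dividing by 1000 yields millijoules. The sketch below illustrates only that unit relationship. Whether the library derives `energy` in exactly this way is an assumption, and `layer_energy_mj` is a hypothetical helper, not part of trtutils.

```python
def layer_energy_mj(power_mw: float, time_ms: float) -> float:
    """Convert a mean power draw (mW) and an execution time (ms) into energy (mJ)."""
    # mW * ms = microjoules; divide by 1000 to express the result in millijoules.
    return power_mw * time_ms / 1000.0


# A layer drawing 5000 mW for 2 ms consumes 10 mJ.
print(layer_energy_mj(5000.0, 2.0))
```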

class trtutils.jetson.JetsonProfilerResult(layers: Sequence[JetsonLayerInfo], total_time: LayerTiming, iterations: int, power_draw: Metric, energy: Metric)

Bases: ProfilerResult

A dataclass to store the complete profiling results for Jetson devices.

This extends the standard profiling results with energy and power metrics.

layers

The per-layer timing, power, and energy statistics.

Type: list[JetsonLayerInfo]

total_time

The total execution time statistics across all layers.

Type: LayerTiming

iterations

The number of profiling iterations performed.

Type: int

power_draw

The power draw statistics in milliwatts.

Type: Metric

energy

The energy consumption statistics in milliwatt-seconds (equivalent to millijoules).

Type: Metric

layers: Sequence[JetsonLayerInfo]

power_draw: Metric

energy: Metric
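A common use of the `layers` attribute is finding the layers that cost the most energy. The following is a minimal sketch that relies only on the documented `name`, `mean`, `power`, and `energy` fields. `LayerStub` and `top_energy_layers` are hypothetical names introduced for illustration, and the numbers are made-up placeholders rather than measurements.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class LayerStub:
    # Hypothetical stand-in exposing the documented JetsonLayerInfo fields used below.
    name: str
    mean: float    # mean execution time in ms
    power: float   # mean power draw in mW
    energy: float  # mean energy in mJ


def top_energy_layers(layers: Iterable[LayerStub], k: int = 3) -> List[LayerStub]:
    """Return the k layers with the highest mean energy, largest first."""
    return sorted(layers, key=lambda layer: layer.energy, reverse=True)[:k]


# Placeholder values for illustration only.
layers = [
    LayerStub("conv1", 0.40, 4200.0, 1.68),
    LayerStub("relu1", 0.05, 3900.0, 0.195),
    LayerStub("fc", 0.10, 4000.0, 0.40),
]
for layer in top_energy_layers(layers, k=2):
    print(f"{layer.name}: {layer.energy:.3f} mJ at {layer.power:.0f} mW over {layer.mean:.2f} ms")
```

With a real profiling run, the `layers` sequence of a JetsonProfilerResult could be passed in place of the stub list, since the helper only reads attributes by name.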

trtutils.jetson.benchmark_engine(engine: TRTEngine | Path | str, iterations: int = 1000, warmup_iterations: int = 50, tegra_interval: int = 5, dla_core: int | None = None, *, warmup: bool | None = None, cuda_graph: bool | None = None, verbose: bool | None = None) → JetsonBenchmarkResult

Benchmark a TensorRT engine on a Jetson device.

Parameters:

engine (TRTEngine | Path | str) – The engine to benchmark. Either a TRTEngine object or path to the engine file. If a path is given, then a TRTEngine will be created automatically.
iterations (int, optional) – The number of iterations to run the benchmark for, by default 1000.
warmup_iterations (int, optional) – The number of warmup iterations to run before the benchmark, by default 50.
tegra_interval (int, optional) – The number of milliseconds between each tegrastats sampling. The smaller the number, the more samples per second are generated. By default 5 milliseconds between samples.
dla_core (int, optional) – The DLA core to assign DLA layers of the engine to. Default is None. If None, any DLA layers will be assigned to DLA core 0.
warmup (bool, optional) – Whether to do warmup iterations, by default None. If None, warmup will be set to True.
cuda_graph (bool, optional) – Whether to enable CUDA graph capture for optimized execution. By default None, which enables CUDA graphs. Set to False for engines with DLA layers, as DLA does not support CUDA graphs.
verbose (bool, optional) – Whether or not to output additional information to stdout. Default None/False.

Returns:

A dataclass containing the results of the benchmark.

Return type:

JetsonBenchmarkResult
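A call to benchmark_engine might look like the sketch below. It uses only the parameters and result fields documented above. The wrapper names `run_jetson_benchmark` and `format_result` are hypothetical, and the import is done lazily so the snippet can be loaded on a machine without trtutils or Jetson hardware.

```python
from types import SimpleNamespace


def run_jetson_benchmark(engine_path: str):
    """Benchmark an engine file with energy measurement (needs a Jetson with trtutils)."""
    # Imported lazily so this sketch can be loaded without trtutils installed.
    from trtutils.jetson import benchmark_engine

    return benchmark_engine(
        engine_path,
        iterations=1000,        # documented default
        warmup_iterations=50,   # documented default
        tegra_interval=5,       # milliseconds between tegrastats samples
    )


def format_result(result) -> str:
    """Render the three documented JetsonBenchmarkResult fields as text."""
    return (
        f"latency: {result.latency}\n"
        f"power_draw: {result.power_draw}\n"
        f"energy: {result.energy}"
    )


# Stand-in object so the formatting can be tried without hardware; values are placeholders.
demo = SimpleNamespace(latency="<Metric>", power_draw="<Metric>", energy="<Metric>")
print(format_result(demo))
```

On a device, `format_result(run_jetson_benchmark("model.engine"))` would be the equivalent call, where the engine path is a placeholder.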

benchmark_engines(): Benchmark multiple TensorRT engines on a Jetson device.

Parameters:

engines (Sequence[TRTEngine | Path | str | tuple[TRTEngine | Path | str, int]]) – The engines to benchmark, each given as a TRTEngine object or a path to an engine file, optionally wrapped in a tuple with an int.
iterations (int, optional) – The number of iterations to run the benchmark for, by default 1000.
warmup_iterations (int, optional) – The number of warmup iterations to run before the benchmark, by default 50.
tegra_interval (int, optional) – The number of milliseconds between each tegrastats sampling. The smaller the number, the more samples per second are generated. By default 5 milliseconds between samples.
warmup (bool, optional) – Whether to do warmup iterations, by default None. If None, warmup will be set to True.
cuda_graph (bool, optional) – Whether to enable CUDA graph capture for optimized execution. By default None, which enables CUDA graphs. Set to False for engines with DLA layers, as DLA does not support CUDA graphs.
parallel (bool, optional) – Whether or not to process the engines in parallel. Useful for assessing concurrent execution performance. Will execute the engines in lockstep. If None, will benchmark each engine individually.
verbose (bool, optional) – Whether or not to output additional information to stdout. Default None/False.

Returns:

A list of dataclasses containing the results of the benchmark. If parallel was True, will only contain one item.

Return type:

list[JetsonBenchmarkResult]
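Because the length of the returned list depends on `parallel`, code that consumes the results of benchmark_engines has to handle both shapes. The sketch below is one way to do that. `benchmark_many` and `label_results` are hypothetical helpers, all arguments are passed by keyword using the documented parameter names, and it is an assumption that non-parallel results come back in the same order as the input engines.

```python
from typing import List, Optional, Sequence, Tuple


def benchmark_many(engine_paths: Sequence[str], parallel: Optional[bool] = None):
    """Benchmark several engine files (needs a Jetson with trtutils)."""
    # Imported lazily so this sketch can be loaded without trtutils installed.
    from trtutils.jetson import benchmark_engines

    return benchmark_engines(
        engine_paths,
        iterations=1000,
        warmup_iterations=50,
        tegra_interval=5,
        parallel=parallel,
    )


def label_results(
    engine_paths: Sequence[str], results: Sequence[object], parallel: Optional[bool] = None
) -> List[Tuple[str, object]]:
    """Pair results with engine names; a parallel run yields one combined entry."""
    if parallel:
        return [(" + ".join(engine_paths), results[0])]
    # Assumption: individual results are returned in input order.
    return list(zip(engine_paths, results))


# Placeholder paths and results, for illustration only.
paths = ["a.engine", "b.engine"]
print(label_results(paths, ["result_a", "result_b"]))
print(label_results(paths, ["combined"], parallel=True))
```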

profile_engine(): Profile a TensorRT engine layer-by-layer on a Jetson device.

This function runs inference multiple times and collects per-layer execution times using TensorRT’s IProfiler interface, along with power and energy metrics using tegrastats. It returns aggregated statistics (mean, median, min, max) for each layer across all iterations, plus per-layer power and energy consumption.

Notes

For best results, build the engine with profiling_verbosity set to DETAILED when calling build_engine. Otherwise, layer names may be numeric indices.

The default iteration count is 10000 (higher than standard profiling) to ensure adequate tegrastats sampling coverage across all layers, especially fast-executing ones.
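The reasoning behind the high iteration count can be made concrete with a rough estimate of how many tegrastats samples a run produces. This is a simplified sketch: it ignores profiler overhead and sampling jitter, `expected_tegrastats_samples` is a hypothetical helper, and the 2 ms per-inference latency is an assumed figure rather than a measurement.

```python
def expected_tegrastats_samples(
    iterations: int, latency_ms: float, tegra_interval_ms: int = 5
) -> int:
    """Estimate how many tegrastats samples fall inside a profiling run."""
    # Total wall time of the run, assuming back-to-back inferences.
    total_runtime_ms = iterations * latency_ms
    # One sample is taken every tegra_interval_ms milliseconds.
    return int(total_runtime_ms // tegra_interval_ms)


# Assumed 2 ms per inference: compare a short run with the 10000-iteration default.
for n in (100, 10000):
    print(n, expected_tegrastats_samples(n, latency_ms=2.0))
```

Under these assumptions a 100-iteration run yields about 40 samples in total, which must be shared among every layer of the engine, while 10000 iterations yield about 4000.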

Parameters:

engine (Path | str | TRTEngine) – The engine to profile. Either a TRTEngine object or path to the engine file. If a path is given, then a TRTEngine will be created automatically.
iterations (int, optional) – The number of profiling iterations to run, by default 10000. Higher iteration counts provide better coverage for per-layer power metrics.
warmup_iterations (int, optional) – The number of warmup iterations to run before profiling, by default 10.
tegra_interval (int, optional) – The interval in milliseconds between tegrastats samples, by default 5.
dla_core (int, optional) – The DLA core to assign DLA layers of the engine to. Default is None. If None, any DLA layers will be assigned to DLA core 0.
device (int, optional) – The CUDA device index to use for the engine. Default is None, which uses the current device.
warmup (bool, optional) – Whether to do warmup iterations, by default None. If None, warmup will be set to True.
cuda_graph (bool, optional) – Whether to enable CUDA graph capture for optimized execution. By default None, which enables CUDA graphs. Set to False for engines with DLA layers, as DLA does not support CUDA graphs.
verbose (bool, optional) – Whether to output additional information to stdout. Default None/False.

Returns:

A dataclass containing per-layer timing/power/energy statistics, total execution time, overall power draw, and overall energy consumption.

Return type:

JetsonProfilerResult
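A profiling run and a per-layer report could be sketched as follows. The full signature of profile_engine is not shown here, so the call passes arguments by keyword using the documented parameter names. `profile_on_jetson` and `layer_rows` are hypothetical helpers, and the demo object holds placeholder values in place of a real JetsonProfilerResult.

```python
from types import SimpleNamespace
from typing import List


def profile_on_jetson(engine_path: str):
    """Profile an engine file layer by layer (needs a Jetson with trtutils)."""
    # Imported lazily so this sketch can be loaded without trtutils installed.
    from trtutils.jetson import profile_engine

    return profile_engine(
        engine_path,
        iterations=10000,       # documented default
        warmup_iterations=10,   # documented default
        tegra_interval=5,       # milliseconds between tegrastats samples
    )


def layer_rows(result) -> List[str]:
    """Format one text row per layer from the documented JetsonLayerInfo fields."""
    rows = [f"iterations: {result.iterations}"]
    for layer in result.layers:
        rows.append(
            f"{layer.name}: mean={layer.mean:.3f} ms, "
            f"power={layer.power:.1f} mW, energy={layer.energy:.3f} mJ"
        )
    return rows


# Stand-in result with placeholder values so the report can be tried without hardware.
demo = SimpleNamespace(
    iterations=10000,
    layers=[SimpleNamespace(name="conv1", mean=0.4, power=4200.0, energy=1.68)],
)
print("\n".join(layer_rows(demo)))
```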