trtutils.jetson package
Module contents
A submodule implementing additional tools for Jetson devices.
Classes
JetsonBenchmarkResult – The results of benchmarking a TRTEngine on a Jetson device.
JetsonLayerInfo – Per-layer timing with power and energy metrics for Jetson profiling.
JetsonProfilerResult – The results of profiling a TRTEngine on a Jetson device.
Functions
benchmark_engine() – A mirror of trtutils.benchmark_engine that also measures energy usage.
benchmark_engines() – A mirror of trtutils.benchmark_engines that also measures energy usage.
profile_engine() – A mirror of trtutils.inspect.profile_engine that also measures per-layer energy usage.
- class trtutils.jetson.JetsonBenchmarkResult(latency: Metric, power_draw: Metric, energy: Metric)
Bases:
object
- class trtutils.jetson.JetsonLayerInfo(name: str, mean: float, median: float, min: float, max: float, raw: list[float], power: float, energy: float)
Bases:
LayerTiming
A dataclass to store per-layer profiling statistics for Jetson devices.
Extends LayerTiming with power and energy metrics.
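As a rough illustration of how a power figure and a timing figure combine into an energy figure, the sketch below applies energy = power × time. The helper name `energy_mj` and the units (milliwatts, milliseconds, millijoules) are assumptions made for this example; this page does not state which units trtutils reports or how it derives the `energy` field.

```python
def energy_mj(power_mw: float, duration_ms: float) -> float:
    """Return energy in millijoules for an average power (mW) over a duration (ms).

    Hypothetical helper: the units are assumed, not taken from trtutils.
    """
    # mW * ms gives microjoules, so divide by 1000 to express millijoules.
    return power_mw * duration_ms / 1000.0


# Illustrative numbers only, not measured results.
layer_energy = energy_mj(power_mw=5000.0, duration_ms=0.2)
```

The same relation applies at the whole-engine level, using the overall power draw and total latency instead of per-layer values.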
- class trtutils.jetson.JetsonProfilerResult(layers: Sequence[JetsonLayerInfo], total_time: LayerTiming, iterations: int, power_draw: Metric, energy: Metric)
Bases:
ProfilerResult
A dataclass to store the complete profiling results for Jetson devices.
This extends the standard profiling results with energy and power metrics.
- layers
The per-layer timing, power, and energy statistics.
- Type:
Sequence[JetsonLayerInfo]
- total_time
The total execution time statistics across all layers.
- Type:
LayerTiming
- trtutils.jetson.benchmark_engine(engine: TRTEngine | Path | str, iterations: int = 1000, warmup_iterations: int = 50, tegra_interval: int = 5, dla_core: int | None = None, *, warmup: bool | None = None, cuda_graph: bool | None = None, verbose: bool | None = None) → JetsonBenchmarkResult
Benchmark a TensorRT engine on a Jetson device.
- Parameters:
engine (TRTEngine | Path | str) – The engine to benchmark. Either a TRTEngine object or path to the engine file. If a path is given, then a TRTEngine will be created automatically.
iterations (int, optional) – The number of iterations to run the benchmark for, by default 1000.
warmup_iterations (int, optional) – The number of warmup iterations to run before the benchmark, by default 50.
tegra_interval (int, optional) – The interval in milliseconds between tegrastats samples, by default 5. Smaller values generate more samples per second.
dla_core (int, optional) – The DLA core to assign DLA layers of the engine to. Default is None. If None, any DLA layers will be assigned to DLA core 0.
warmup (bool, optional) – Whether to do warmup iterations, by default None. If None, warmup will be set to True.
cuda_graph (bool, optional) – Whether to enable CUDA graph capture for optimized execution. By default None, which enables CUDA graphs. Set to False for engines with DLA layers, as DLA does not support CUDA graphs.
verbose (bool, optional) – Whether to output additional information to stdout. Default None/False.
- Returns:
A dataclass containing the results of the benchmark.
- Return type:
JetsonBenchmarkResult
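A minimal usage sketch for the parameters above. The helper names `jetson_benchmark_kwargs` and `run_benchmark` and the `uses_dla` flag are hypothetical; only the `benchmark_engine` keyword arguments and the `latency`, `power_draw`, and `energy` fields come from the signatures on this page. The import is done lazily so the helper can be defined on a machine without trtutils installed.

```python
from pathlib import Path
from typing import Any, Dict


def jetson_benchmark_kwargs(uses_dla: bool, dla_core: int = 0) -> Dict[str, Any]:
    """Build keyword arguments for trtutils.jetson.benchmark_engine (hypothetical helper)."""
    kwargs: Dict[str, Any] = {
        "iterations": 1000,        # documented default
        "warmup_iterations": 50,   # documented default
        "tegra_interval": 5,       # milliseconds between tegrastats samples
    }
    if uses_dla:
        # The parameter notes advise disabling CUDA graphs for engines with DLA layers.
        kwargs["dla_core"] = dla_core
        kwargs["cuda_graph"] = False
    return kwargs


def run_benchmark(engine_path: str, uses_dla: bool = False):
    """Benchmark one engine file and return its three metrics."""
    # Lazy import: requires trtutils and a Jetson device at call time.
    from trtutils.jetson import benchmark_engine

    result = benchmark_engine(Path(engine_path), **jetson_benchmark_kwargs(uses_dla))
    return result.latency, result.power_draw, result.energy
```

Calling `run_benchmark("model.engine", uses_dla=True)` would then benchmark a (hypothetical) engine file with CUDA graphs turned off.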
- trtutils.jetson.benchmark_engines(engines: Sequence[TRTEngine | Path | str | tuple[TRTEngine | Path | str, int]], iterations: int = 1000, warmup_iterations: int = 50, tegra_interval: int = 5, *, warmup: bool | None = None, cuda_graph: bool | None = None, parallel: bool | None = None, verbose: bool | None = None) → list[JetsonBenchmarkResult]
Benchmark multiple TensorRT engines on a Jetson device.
- Parameters:
engines (Sequence[TRTEngine | Path | str | tuple[TRTEngine | Path | str, int]]) – The engines to benchmark. Each entry is either a TRTEngine object, a path to an engine file, or a tuple pairing one of those with an int.
iterations (int, optional) – The number of iterations to run the benchmark for, by default 1000.
warmup_iterations (int, optional) – The number of warmup iterations to run before the benchmark, by default 50.
tegra_interval (int, optional) – The interval in milliseconds between tegrastats samples, by default 5. Smaller values generate more samples per second.
warmup (bool, optional) – Whether to do warmup iterations, by default None. If None, warmup will be set to True.
cuda_graph (bool, optional) – Whether to enable CUDA graph capture for optimized execution. By default None, which enables CUDA graphs. Set to False for engines with DLA layers, as DLA does not support CUDA graphs.
parallel (bool, optional) – Whether or not to process the engines in parallel. Useful for assessing concurrent execution performance. Will execute the engines in lockstep. If None, will benchmark each engine individually.
verbose (bool, optional) – Whether to output additional information to stdout. Default None/False.
- Returns:
A list of dataclasses containing the results of the benchmark. If parallel is True, the list contains only one item.
- Return type:
list[JetsonBenchmarkResult]
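The sketch below shows one way to call `benchmark_engines` on a list of engine paths and to reason about how many results to expect, based on the return description above. The helper names `expected_result_count` and `run_benchmarks` are hypothetical, and the handling of an empty list is an assumption rather than documented behavior.

```python
from pathlib import Path
from typing import Optional, Sequence


def expected_result_count(num_engines: int, parallel: Optional[bool]) -> int:
    """Number of results expected from benchmark_engines (hypothetical helper).

    Per the return description: one combined result when parallel is True,
    otherwise one result per engine. Zero engines is assumed to yield zero results.
    """
    if parallel and num_engines > 0:
        return 1
    return num_engines


def run_benchmarks(engine_paths: Sequence[str], parallel: Optional[bool] = None):
    """Benchmark several engine files, optionally in lockstep."""
    # Lazy import: requires trtutils and a Jetson device at call time.
    from trtutils.jetson import benchmark_engines

    engines = [Path(p) for p in engine_paths]
    return benchmark_engines(engines, parallel=parallel)
```

With `parallel=True`, callers should index the single returned item rather than iterating per engine.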
- trtutils.jetson.profile_engine(engine: Path | str | TRTEngine, iterations: int = 10000, warmup_iterations: int = 10, tegra_interval: int = 5, dla_core: int | None = None, device: int | None = None, *, warmup: bool | None = None, cuda_graph: bool | None = None, verbose: bool | None = None) → JetsonProfilerResult
Profile a TensorRT engine layer-by-layer on a Jetson device.
This function runs inference multiple times and collects per-layer execution times using TensorRT’s IProfiler interface, along with power and energy metrics using tegrastats. It returns aggregated statistics (mean, median, min, max) for each layer across all iterations, plus per-layer power and energy consumption.
Notes
For best results, build the engine with profiling_verbosity set to DETAILED when calling build_engine. Otherwise, layer names may be numeric indices.
The default iteration count is 10000 (higher than standard profiling) to ensure adequate tegrastats sampling coverage across all layers, especially fast-executing ones.
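To see why a higher iteration count helps sampling coverage, the back-of-the-envelope sketch below estimates how many tegrastats samples land in a run and in a single layer. The function names are hypothetical, and the model is an idealization: it assumes samples arrive exactly every `tegra_interval` milliseconds, ignores profiling overhead, and assumes a layer receives samples in proportion to its share of the runtime.

```python
def expected_tegrastats_samples(
    iterations: int, latency_ms: float, tegra_interval: int = 5
) -> int:
    """Approximate sample count for a run (idealized, hypothetical helper)."""
    total_ms = iterations * latency_ms
    return int(total_ms // tegra_interval)


def expected_layer_samples(total_samples: int, layer_ms: float, latency_ms: float) -> int:
    """Approximate samples attributed to one layer, by its share of per-iteration time."""
    return round(total_samples * (layer_ms / latency_ms))


# Illustrative numbers only: a 2 ms engine profiled at the documented defaults.
samples_10k = expected_tegrastats_samples(iterations=10000, latency_ms=2.0)
samples_1k = expected_tegrastats_samples(iterations=1000, latency_ms=2.0)
```

Under these assumptions, a tenfold increase in iterations yields roughly ten times as many samples, which matters most for layers that occupy a small fraction of each iteration.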
- Parameters:
engine (Path | str | TRTEngine) – The engine to profile. Either a TRTEngine object or path to the engine file. If a path is given, then a TRTEngine will be created automatically.
iterations (int, optional) – The number of profiling iterations to run, by default 10000. Higher iteration counts provide better coverage for per-layer power metrics.
warmup_iterations (int, optional) – The number of warmup iterations to run before profiling, by default 10.
tegra_interval (int, optional) – The interval in milliseconds between tegrastats samples, by default 5.
dla_core (int, optional) – The DLA core to assign DLA layers of the engine to. Default is None. If None, any DLA layers will be assigned to DLA core 0.
device (int, optional) – The CUDA device index to use for the engine. Default is None, which uses the current device.
warmup (bool, optional) – Whether to do warmup iterations, by default None. If None, warmup will be set to True.
cuda_graph (bool, optional) – Whether to enable CUDA graph capture for optimized execution. By default None, which enables CUDA graphs. Set to False for engines with DLA layers, as DLA does not support CUDA graphs.
verbose (bool, optional) – Whether to output additional information to stdout. Default None/False.
- Returns:
A dataclass containing per-layer timing/power/energy statistics, total execution time, overall power draw, and overall energy consumption.
- Return type:
JetsonProfilerResult
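Once a profile is available, a common next step is to find the layers that consume the most energy. The sketch below does this on a stand-in dataclass so it runs without a Jetson device. `LayerStandIn`, `top_energy_layers`, and `energy_share` are hypothetical names, the stand-in carries only a subset of the fields listed for JetsonLayerInfo, and the sample values are made up for illustration.

```python
from dataclasses import dataclass
from typing import Dict, List, Sequence


@dataclass
class LayerStandIn:
    """Hypothetical stand-in exposing a subset of the JetsonLayerInfo fields."""

    name: str
    mean: float
    power: float
    energy: float


def top_energy_layers(layers: Sequence[LayerStandIn], k: int = 3) -> List[LayerStandIn]:
    """Return the k layers with the highest energy value, largest first."""
    return sorted(layers, key=lambda layer: layer.energy, reverse=True)[:k]


def energy_share(layers: Sequence[LayerStandIn]) -> Dict[str, float]:
    """Return each layer's fraction of the summed per-layer energy."""
    total = sum(layer.energy for layer in layers)
    return {
        layer.name: (layer.energy / total if total else 0.0) for layer in layers
    }


# Made-up values, not measured results.
sample_layers = [
    LayerStandIn(name="conv1", mean=0.5, power=6.0, energy=3.0),
    LayerStandIn(name="relu1", mean=0.1, power=5.0, energy=1.0),
]
hottest = top_energy_layers(sample_layers, k=1)
shares = energy_share(sample_layers)
```

With a real result, the same helpers could be applied to `result.layers`, since they only read the `name` and `energy` attributes.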