Changelog

0.6.1 (2025-06-17)

Added

  • core.cuda_free and core.cuda_host_free for freeing CUDA device and host memory

  • core.allocate_to_device and core.free_device_ptrs for managing CUDA device memory

Fixed

  • TRTEngine: Correctly reset input bindings when using direct_exec or raw_exec

0.6.0 (2025-06-16)

Added

  • TensorRT-based image preprocessing pipeline (impls.yolo._preprocessors._trt)

    • Automatically falls back to CUDA or CPU preprocessors when unsupported

  • Engine cache utilities (core.cache) enabling on-disk reuse of timing caches and faster start-up

  • Just-In-Time kernel generation via _jit module and --jit CLI flag

  • Managed (Unified) CUDA memory option for TRTEngine

  • Device-specific DLA core selection through --dla-core flag

  • Faster scale-swap-transpose (SST) CUDA kernel implementation

  • Multi-stream and live-webcam demo applications residing in demos/

  • Alternative preprocessing methods measured in benchmarking

  • Additional verbose / debug CLI flags and richer inspect command output

Changed

  • Lazy loading of TensorRT plugins and YOLO preprocessors reduces import overhead

  • Default host-side memory allocator uses pagelocked memory for all allocations; managed memory is enabled by an additional flag

Fixed

  • Resolved occasional overwrite of benchmark result files

  • Miscellaneous stability fixes identified by the expanded CI test-suite

0.5.0 (2025-04-24)

Added

  • Comprehensive engine builder subpackage (builder) and trtutils build CLI command

    • Layer-wise progress bar, timing cache support, and INT8 calibration via ImageBatcher / Calibrator

    • Dedicated DLA build helpers (builder._dla) with per-layer precision and chunk-size specification

  • Support for TensorRT execute_async_v3 backend

  • Unified logging facility exported through trtutils.get_logger

  • Extensive benchmarking scripts with automatic documentation generation

  • Continuous Integration pipelines and greatly expanded automated test-suite (#40)

Improved

  • Enhanced YOLO CLI (image batching, new convenience flags)

  • Build and benchmarking documentation now generated automatically in CI

  • Utilization of timing caches when building engines with trtexec for faster rebuilds

Fixed

  • Memory leak in CUDAPreprocessor due to incorrect explicit free call

  • Numerous minor bugs uncovered by the new tests

0.4.1 (2025-03-04)

Added

  • core.init_cuda

    • Use to start CUDA when working only with the core utilities and not a TRTEngine

  • benchmark_engines and jetson.benchmark_engines

    • Can benchmark TRTEngines running concurrently

  • Example yolo CLI program

    • Currently only supports video file input and display

  • General CLI fixes for trtexec

  • Experimental non-pagelocked memory addressing for TRTEngines

    • Unstable, should be used with caution. Will be refined in the future

    • Does not improve performance; it exists only to measure the speedup gained from pagelocked memory, so it is low priority

  • Basic internal profiling setup for YOLO objects

    • No current public access, but accessible through: (_pre)(_infer)(_post)_profile attributes

    • Only stores the last timestamp tuple

    • No end2end method support yet

Fixed

  • yolo.CUDAPreprocessor using the wrong block size during resize call

  • Various fixes and extensions to ParallelTRTEngines

0.4.0 (2024-12-05)

Added

  • CUDA-based resize kernels

    • Perform linear or letterbox resizing

  • core.create_kernel_args and core.Kernel
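
To illustrate what the letterbox resizing mentioned above computes, here is a minimal pure-Python sketch of the geometry only. The function name letterbox_params and its return layout are hypothetical and are not part of the library; the actual resize runs in CUDA kernels.

```python
def letterbox_params(src_w, src_h, dst_w, dst_h):
    """Return (new_w, new_h, pad_left, pad_top) for a letterbox resize.

    Hypothetical helper for illustration; not a trtutils function.
    """
    # One uniform scale so the whole source fits inside the destination
    # without changing its aspect ratio.
    scale = min(dst_w / src_w, dst_h / src_h)
    new_w = round(src_w * scale)
    new_h = round(src_h * scale)
    # Split the leftover space evenly; an odd pixel goes to the right/bottom.
    pad_left = (dst_w - new_w) // 2
    pad_top = (dst_h - new_h) // 2
    return new_w, new_h, pad_left, pad_top
```

For a 1920x1080 frame and a 640x640 target, the image is scaled to 640x360 and padded by 140 rows at the top, with the remaining rows at the bottom.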

0.3.4 (2024-11-12)

Added

  • CUDA-based preprocessing for YOLO:

    • Introduced CUDAPreprocessor and CPUPreprocessor

    • Additional parameters in YOLO constructor and methods:

      • conf_thres

      • extra_nms, agnostic_nms

      • resize_method, preprocessing_unit

  • Runtime CUDA kernel generation with NVRTC:

    • Final transform (transpose from HWC to BCHW) reduced from 50ms to 5ms for 1280x1280, achieving a 10x speedup
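
As a reference for what the HWC to BCHW transform does, the sketch below performs the same reordering on nested Python lists. It is a slow illustration of the data layout only; the function name hwc_to_bchw is hypothetical, and the library performs this step in a CUDA kernel compiled with NVRTC.

```python
def hwc_to_bchw(image):
    """Convert an H x W x C nested list into a 1 x C x H x W nested list.

    Hypothetical reference implementation for illustration only.
    """
    height = len(image)
    width = len(image[0])
    channels = len(image[0][0])
    # Gather each channel into its own H x W plane.
    chw = [
        [[image[y][x][c] for x in range(width)] for y in range(height)]
        for c in range(channels)
    ]
    return [chw]  # add a leading batch dimension of size 1
```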

Improved

  • Multi-threading safety:

    • ParallelYOLO enforces serial deserialization of engine files

    • CUDAPreprocessor now serializes initialization

    • Core CUDA/NVRTC calls use mutexes
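
The general pattern behind serialized initialization can be sketched with a module-level lock. This is an illustration under assumptions, not the library's code: _INIT_LOCK, SerializedInit, and build_in_threads are hypothetical names, and the counters exist only to show that no two threads run the guarded section at once.

```python
import threading

_INIT_LOCK = threading.Lock()  # hypothetical module-level lock


class SerializedInit:
    """Hypothetical object whose setup must not run concurrently."""

    active = 0      # threads currently inside the guarded setup
    max_active = 0  # highest concurrency ever observed

    def __init__(self):
        with _INIT_LOCK:
            cls = type(self)
            cls.active += 1
            cls.max_active = max(cls.max_active, cls.active)
            # ... expensive, non-thread-safe setup would happen here ...
            cls.active -= 1


def build_in_threads(count):
    """Construct `count` objects from separate threads; return peak overlap."""
    threads = [threading.Thread(target=SerializedInit) for _ in range(count)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return SerializedInit.max_active
```

Because every constructor takes the same lock, the observed peak overlap stays at one regardless of how many threads are started.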

0.3.3 (2024-10-31)

Added

  • impls.yolo.YOLO:

    • Added input_range parameter for specifying the input range

    • YOLOX uses [0:255], all others use [0:1]
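
To show what an input range means in practice, the sketch below maps 8-bit pixel values onto a target range. It is a plain-Python illustration; scale_to_range is a hypothetical helper, and only the parameter name input_range is taken from the entry above.

```python
def scale_to_range(pixels, input_range=(0.0, 1.0)):
    """Map 8-bit pixel values (0..255) linearly onto input_range.

    Hypothetical helper for illustration; not a trtutils function.
    """
    low, high = input_range
    span = high - low
    return [low + (p / 255.0) * span for p in pixels]
```

With the default range a pixel value of 255 becomes 1.0, while passing (0.0, 255.0) leaves the values effectively unchanged, matching the YOLOX convention noted above.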

0.3.2 (2024-10-31)

Added

  • Variations of impls.yolo.YOLO: YOLO7, YOLO8, YOLO9, YOLO10, and YOLOX

Changed

  • impls.yolo.YOLO:

    • Version inference is now automatic

    • Postprocessing determined from outputs

0.3.1 (2024-10-29)

Improved

  • Outputs from impls.yolo.YOLO now use standard Python types:

    • Improved compatibility with JIT compilers like numba

0.3.0 (2024-10-25)

Added

  • impls.yolo.ParallelYOLO: Enables running multiple YOLO models simultaneously

Improved

  • TRTEngine:

    • Uses async memory copies and execution

    • Implements pagelocked memory on host

Removed

  • backend submodule: Deprecated in favor of CUDA Python engines

0.2.3 (2024-10-17)

Added

  • jetson.benchmark_engine integrated with jetsontools > 0.0.3

Improved

  • TRTEngine: Enhanced threading documentation

Fixed

  • trtexec.build_engine: Correctly builds for DLA core 0

0.2.2 (2024-10-17)

Changed

  • TRTEngine:

    • Uses execute_async_v2 for inference

    • core.create_engine now creates a cudaStream

0.2.1 (2024-10-16)

Added

  • Locks for TensorRT engine creation and CUDA memory allocation

0.2.0 (2024-10-02)

Added

  • benchmark_engine: Measures engine latency

  • Submodules:

    • jetson

    • impls

    • impls.yolo: Supports YOLO variants (V7 to V10)

Changed

  • trtexec.build_from_onnx renamed to trtexec.build_engine

0.1.2 (2024-10-10)

Added

  • Async and parallel execution classes:

    • QueuedTRTEngine, QueuedTRTModel

    • ParallelTRTEngine, ParallelTRTModel

0.1.1 (2024-07-30)

Fixed

  • Resolved AttributeError crashes during deallocation

0.1.0 (2024-07-30)

Changed

  • Default TRTEngine now uses CUDA Python:

    • Improved stability and compatibility

    • Legacy PyCUDA version available via trtutils.backends.PyCudaTRTEngine

0.0.8 (2024-07-21)

Added

  • trtexec submodule:

    • Locate and run trtexec commands programmatically

0.0.3 (2024-02-22)

Fixed

  • Package is now correctly detected as fully typed

Improved

  • Examples, documentation, and stricter linting/typing

Added

  • PyCUDA install script for Linux