Advanced Usage Tutorial¶

This tutorial covers advanced usage of trtutils, including lower-level interfaces and CUDA operations.

trtutils provides some lower-level interfaces which are used inside of TRTEngine. These involve creating TensorRT engines allocating memory with CUDA.

These lower-level utilities can be found inside of the core submodule. All utilties included are:

CUDA context helpers
CUDA stream helpers
CUDA memory allocation and transfer functions
NVRTC to compile CUDA kernels
TensorRT engine deserialization
Binding abstraction for managed memory

Bindings¶

Binding manages CUDA-allocated memory. You can create bindings directly if you allocate memory manually with cuda_malloc() or allocate_pagelocked(). CudaMalloc allocates memory directly on the GPU and allocate_pagelocked allocates page-locked memory to share between CUDA and the CPU. Pagelocked memory can enable large speedup on some systems. You can also create a binding with create_binding() which automatically allocates memory based on a given Numpy array.

Example of binding allocation with create_binding:

import numpy as np
from trtutils.core import create_binding

# float16 image
arr = np.random.default_rng().integers(0, 255, (480, 640, 3), dtype=np.float16)

# get a binding to represent the array
binding = create_binding(arr)

# allocate with pagelocked memory
pl_binding = create_binding(arr, pagelocked_mem=True)

TensorRT Engine deserialization¶

You can deserialize a TensorRT engine using the create_engine() function. This function also allocates an execution context, a tensorrt logger, and a CUDA stream which can be used to execute the engine.

from trtutils.core import create_engine

# given a path to a compiled TensorRT engine, deserialize
# tensorrt.ICudaEngine, tensorrt.IExecutionContext, tensorrt.ILogger, cuda.cudaStream_t
engine, context, logger, stream = create_engine("engine.engine")

CUDA Kernel Compilation¶

It may be nessecary to define preprocessing or postprocessing operations for your TensorRT engine with a CUDA kernel if the CPU is not fast enough. trtutils provides a small wrapper around NVRTC (NVIDIA real-time compiler) which allows you to define CUDA kernels as Python strings and compile at runtime.

An example of compiling a kernel:

# kernel which handles preprocessing for a YOLO model
# after the image has been resize to the models input size
# this assumes the model takes RGB and image is BGR (OpenCV)
KERNEL_CODE = """\
extern "C" __global__
void scaleSwapTranspose(
    const unsigned char* __restrict__ inputArr,
    float* outputArr,
    const float scale,
    const float offset,
    const int height,
    const int width
) {
    const int tx = blockIdx.x * blockDim.x + threadIdx.x;
    const int ty = blockIdx.y * blockDim.y + threadIdx.y;
    const int tz = blockIdx.z * blockDim.z + threadIdx.z;
    if (tx < height && ty < width && tz < 3) {
        const int inputIdx = (tx * width * 3) + (ty * 3) + tz;
        const float val = static_cast<float>(inputArr[inputIdx]);
        const float scaledVal = val * scale + offset;
        const int dstChannel = 2 - tz;
        const int outputIdx = (dstChannel * height * width) + (tx * width) + ty;
        outputArr[outputIdx] = scaledVal;
    }
}
"""

from trtutils.core import Kernel

# compile and load kernel
kernel = Kernel(KERNEL_CODE, "scaleSwapTranspose")

# to run the kernel need input and output CUDA data
import numpy as np
from trtutils.core import create_binding

input_arr = np.zeros((640, 640, 3), dtype=np.uint8)
output_arr = np.zeros((1, 3, 640, 640), dtype=np.float32)

input_binding = create_binding(input_arr)
output_binding = create_binding(output_arr)

# create some args for the kernel
# the args is a pointer to an array of pointers
# a new arg array has to be created for each call to cuLaunchKernel
input_arg: np.ndarray = np.array(
    [input_binding.allocation],
    dtype=np.uint64,
)
output_arg: np.ndarray = np.array(
    [output_binding.allocation],
    dtype=np.uint64,
)
# assume no scale and offset
args = kernel.create_args(
    input_binding.allocation,
    output_binding.allocation,
    height,
    width,
    scale,
    offset,
)

# launch the kernel
from trtutils.core import create_stream, stream_synchronize, memcpy_host_to_device_async, memcpy_device_to_host_async

stream = create_stream()
memcpy_host_to_device_async(
    input_binding.allocation,
    input_arr,
    stream,
)
kernel.call((32, 32, 1), (32, 32, 1), stream, args)
memcpy_device_to_host_async(
    output_binding.host_allocation,
    output_binding.allocation,
    stream,
)
stream_synchronize(stream)

# print the completed output shape
print(output_binding.host_allocation.shape)