tensorrt ¶
Attributes¶
Classes¶
TensorRTRuntimeMixin ¶
Shared TensorRT runtime logic for sync and async implementations.
This mixin provides common TensorRT-specific logic that is shared between synchronous and asynchronous runtime implementations. It handles:
- Device validation (TensorRT requires CUDA)
- Binding memory size calculation
- Output shape computation
This mixin is pure logic with no I/O operations, making it safe to reuse across sync and async implementations.
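Because the mixin is pure logic, its responsibilities boil down to simple arithmetic. The sketch below illustrates the kind of computation it centralizes; the helper names are illustrative, not the mixin's actual private API.

```python
import numpy as np


def binding_nbytes(shape: tuple[int, ...], dtype: np.dtype) -> int:
    """Bytes needed for one binding: product of dims times element size."""
    return int(np.prod(shape)) * np.dtype(dtype).itemsize


def output_shape(batch_size: int, per_sample_shape: tuple[int, ...]) -> tuple[int, ...]:
    """Prepend the runtime batch dimension to the engine's per-sample shape."""
    return (batch_size, *per_sample_shape)
```

Keeping this logic free of I/O and CUDA calls is what makes it safe to share between the sync and async runtimes.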
Attributes:
| Name | Type | Description |
|---|---|---|
| device | Any | Device configuration (provided by subclass). |
Example

```python
# In sync runtime
class TensorRTRuntime(
    TensorRTRuntimeMixin,
    RuntimeConfigMixin,
    BatchableRuntime,
):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._validate_tensorrt_device()  # Use mixin


# In async runtime - same!
class TensorRTRuntime(
    TensorRTRuntimeMixin,
    RuntimeConfigMixin,
    BatchableRuntime,
):
    def __init__(self, *args, **kwargs):
        super().__init__(*args, **kwargs)
        self._validate_tensorrt_device()  # Same mixin!
```
TensorRTRuntime ¶
TensorRTRuntime(model_path: str | PathLike[str], device: str | Device, precision: Precision = FP32, warmup_iterations: int = 3, warmup_shape: tuple[int, ...] = (1, 3, 224, 224))
Bases: RuntimeConfigMixin, TensorRTRuntimeMixin, BatchableRuntime[ndarray, Any]
TensorRT Runtime for optimized inference (sync version).
Supports
- TensorRT (.engine, .trt) models
- CUDA devices only
- FP32, FP16, INT8 precision
- Batch inference
- Automatic warmup
This is the synchronous version. For async support, see inferflow.asyncio.runtime.tensorrt.TensorRTRuntime.
Attributes:
| Name | Type | Description |
|---|---|---|
| runtime | Runtime \| None | TensorRT runtime instance (None before load()). |
| engine | ICudaEngine \| None | TensorRT engine (None before load()). |
| context | IExecutionContext \| None | TensorRT execution context (None before load()). |
| inputs | list[DeviceAllocation] | List of input device memory allocations. |
| outputs | list[DeviceAllocation] | List of output device memory allocations. |
| bindings | list[int] | List of binding pointers for execution. |
| stream | Stream \| None | CUDA stream for async operations. |
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| model_path | str \| PathLike[str] | Path to TensorRT engine file. | required |
| device | str \| Device | CUDA device (e.g. "cuda:0"). | required |
| precision | Precision | Inference precision (FP32, FP16, or INT8). | FP32 |
| warmup_iterations | int | Number of warmup iterations. | 3 |
| warmup_shape | tuple[int, ...] | Input shape for warmup. | (1, 3, 224, 224) |
Raises:
| Type | Description |
|---|---|
FileNotFoundError | If model file does not exist. |
ValueError | If device is not CUDA. |
ImportError | If tensorrt or pycuda is not installed. |
Example

```python
import inferflow as iff
import numpy as np

# Initialize runtime
runtime = iff.TensorRTRuntime(
    model_path="model.engine",
    device="cuda:0",
)

# Single inference
with runtime:
    input_array = np.random.randn(1, 3, 640, 640).astype(np.float32)
    output = runtime.infer(input_array)

# Batch inference
with runtime:
    batch = [
        np.random.randn(1, 3, 640, 640).astype(np.float32),
        np.random.randn(1, 3, 640, 640).astype(np.float32),
    ]
    outputs = runtime.infer_batch(batch)
```
Source code in inferflow/runtime/tensorrt.py
Attributes¶
Functions¶
load ¶
Load TensorRT engine and prepare for inference.
Performs
- Load engine from disk
- Create execution context
- Allocate device memory for inputs/outputs
- Create CUDA stream
- Warmup inference
Raises:
| Type | Description |
|---|---|
FileNotFoundError | If engine file does not exist. |
RuntimeError | If TensorRT fails to deserialize engine. |
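The first failure mode in the table can be reproduced without any TensorRT installation, since it is a plain filesystem check. The helper below is an illustrative sketch of that pre-check, not part of the inferflow API.

```python
from pathlib import Path


def check_engine_path(model_path: str) -> Path:
    """Mirror of load()'s first failure mode: a missing engine file
    raises FileNotFoundError before any TensorRT work happens.
    (Illustrative helper, not part of the inferflow API.)"""
    path = Path(model_path)
    if not path.exists():
        raise FileNotFoundError(f"Engine file not found: {path}")
    return path
```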
infer ¶
Run inference on a single input.
Uses CUDA async operations for efficient processing.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
input | ndarray | Input numpy array. | required |
Returns:
| Type | Description |
|---|---|
Any | Output array with shape determined by model. |
Raises:
| Type | Description |
|---|---|
RuntimeError | If engine is not loaded. |
Example
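Running infer itself requires a loaded engine on a CUDA device, so the sketch below shows only the host-side input preparation, which is pure numpy. The shape matches the default warmup_shape; a real model may expect a different one.

```python
import numpy as np

# The engine expects a contiguous float32 array; preparing one is
# pure numpy and works without a GPU.
x = np.random.randn(1, 3, 224, 224).astype(np.float32)
x = np.ascontiguousarray(x)  # host-to-device copies need contiguous memory

# output = runtime.infer(x)  # requires load() / `with runtime:` first
```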
infer_batch ¶
Run inference on a batch of inputs.
Concatenates inputs and runs batch inference, then splits the output back into individual results.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
inputs | list[ndarray] | List of input arrays. Each should have shape (1, C, H, W). | required |
Returns:
| Type | Description |
|---|---|
list[Any] | List of outputs, one per input. Each maintains batch dimension. |
Raises:
| Type | Description |
|---|---|
RuntimeError | If engine is not loaded. |
Example
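The concatenate-then-split pattern described above can be sketched in plain numpy; the real method runs the concatenated batch through the engine where the doubling stand-in appears below.

```python
import numpy as np

# Three single-sample inputs, each with a leading batch dimension.
inputs = [np.ones((1, 3, 4, 4), dtype=np.float32) * i for i in range(3)]

batch = np.concatenate(inputs, axis=0)  # shape (3, 3, 4, 4)
batch_out = batch * 2.0                 # stand-in for engine execution

# Split the output back into one array per input, keeping the batch dim.
outputs = np.split(batch_out, len(inputs), axis=0)
```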
unload ¶
Unload engine and free resources.
Performs
- Free CUDA device memory
- Release engine and context
Safe to call multiple times.