tensorrt¶
Classes¶
TensorRTRuntime ¶
TensorRTRuntime(model_path: str | PathLike[str], device: str | Device, precision: Precision = FP32, warmup_iterations: int = 3, warmup_shape: tuple[int, ...] = (1, 3, 224, 224))
Bases: RuntimeConfigMixin, TensorRTRuntimeMixin, BatchableRuntime[ndarray, Any]
TensorRT Runtime for optimized inference (async version).
Asynchronous version of inferflow.runtime.tensorrt.TensorRTRuntime.
All I/O operations (engine loading, inference) are executed in a thread pool to avoid blocking the event loop. The API is identical to the sync version, but all methods are async.
Supports
- TensorRT (.engine, .trt) models
- CUDA devices only
- FP32, FP16, INT8 precision
- Batch inference
- Automatic warmup
Attributes:

| Name | Type | Description |
|---|---|---|
| runtime | Runtime \| None | TensorRT runtime instance (None before load()). |
| engine | ICudaEngine \| None | TensorRT engine (None before load()). |
| context | IExecutionContext \| None | TensorRT execution context (None before load()). |
| inputs | list[DeviceAllocation] | List of input device memory allocations. |
| outputs | list[DeviceAllocation] | List of output device memory allocations. |
| bindings | list[int] | List of binding pointers for execution. |
| stream | Stream \| None | CUDA stream for async operations. |
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| model_path | str \| PathLike[str] | Path to TensorRT engine file. | required |
| device | str \| Device | CUDA device (e.g. "cuda:0"). | required |
| precision | Precision | Inference precision (FP32, FP16, or INT8). | FP32 |
| warmup_iterations | int | Number of warmup iterations. | 3 |
| warmup_shape | tuple[int, ...] | Input shape for warmup. | (1, 3, 224, 224) |
Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If model file does not exist. |
| ValueError | If device is not CUDA. |
| ImportError | If tensorrt or pycuda is not installed. |
Example

```python
import inferflow.asyncio as iff
import numpy as np

# Initialize runtime
runtime = iff.TensorRTRuntime(
    model_path="model.engine",
    device="cuda:0",
)

# Single inference
async with runtime:
    input_array = np.random.randn(1, 3, 640, 640).astype(np.float32)
    output = await runtime.infer(input_array)

# Batch inference
async with runtime:
    batch = [
        np.random.randn(1, 3, 640, 640).astype(np.float32),
        np.random.randn(1, 3, 640, 640).astype(np.float32),
    ]
    outputs = await runtime.infer_batch(batch)
```
Source code in inferflow/asyncio/runtime/tensorrt.py
Functions¶
load async ¶
Load TensorRT engine and prepare for inference (async).
Performs
- Load engine from disk (in thread pool)
- Create execution context
- Allocate device memory for inputs/outputs
- Create CUDA stream
- Warmup inference (in thread pool)
Raises:

| Type | Description |
|---|---|
| FileNotFoundError | If engine file does not exist. |
| RuntimeError | If TensorRT fails to deserialize engine. |
Source code in inferflow/asyncio/runtime/tensorrt.py
infer async ¶
Run inference on a single input (async).
Uses CUDA async operations for efficient processing.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| input | ndarray | Input numpy array. | required |

Returns:

| Type | Description |
|---|---|
| Any | Output array with shape determined by model. |

Raises:

| Type | Description |
|---|---|
| RuntimeError | If engine is not loaded. |
Example
Source code in inferflow/asyncio/runtime/tensorrt.py
infer_batch async ¶
Run inference on a batch of inputs (async).
Concatenates inputs and runs batch inference, then splits the output back into individual results.
Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| inputs | list[ndarray] | List of input arrays. Each should have shape (1, C, H, W). | required |

Returns:

| Type | Description |
|---|---|
| list[Any] | List of outputs, one per input. Each maintains batch dimension. |

Raises:

| Type | Description |
|---|---|
| RuntimeError | If engine is not loaded. |
Example
Source code in inferflow/asyncio/runtime/tensorrt.py
unload async ¶
Unload engine and free resources (async).
Performs
- Free CUDA device memory
- Release engine and context
Safe to call multiple times.