I/O Operations¶
io ¶
Functions¶
load_dataset ¶
load_dataset(path: str | PathLike[str], name: str | None = None, format: str = 'auto', **kwargs: Any) -> Dataset
Load dataset from file or directory.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| path | str \| PathLike[str] | Path to dataset | required |
| name | str \| None | Optional name for the dataset | None |
| format | str | Dataset format ('coco', 'yolo', or 'auto' to detect) | 'auto' |
| **kwargs | Any | Additional format-specific parameters | {} |
Returns:
| Type | Description |
|---|---|
| Dataset | Loaded dataset |
Raises:
| Type | Description |
|---|---|
| ValueError | If format is unsupported or auto-detection fails |
| FileNotFoundError | If path doesn't exist |
Source code in boxlab/dataset/io.py
export_dataset ¶
export_dataset(dataset: Dataset, output_dir: str | PathLike[str], format: str, split_ratio: SplitRatio | None = None, seed: int | None = None, naming: str | NamingStrategy = 'original', copy_images: bool = True, **kwargs: Any) -> None
Export dataset to specified format.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| dataset | Dataset | Dataset to export | required |
| output_dir | str \| PathLike[str] | Output directory path | required |
| format | str | Export format ('coco', 'yolo', etc.) | required |
| split_ratio | SplitRatio \| None | Optional train/val/test split ratios | None |
| seed | int \| None | Random seed for reproducibility | None |
| naming | str \| NamingStrategy | File naming strategy ('original', 'prefix', 'uuid', 'sequential') or a NamingStrategy instance | 'original' |
| copy_images | bool | Whether to copy image files | True |
| **kwargs | Any | Additional format-specific parameters | {} |
Raises:
| Type | Description |
|---|---|
| ValueError | If format is unsupported |
Examples:
>>> # Export to YOLO with splits
>>> export_dataset(
...     dataset,
...     "./output",
...     format="yolo",
...     split_ratio=SplitRatio(train=0.7, val=0.2, test=0.1),
...     naming="prefix",
... )
Source code in boxlab/dataset/io.py
get_supported_formats ¶
Get information about all supported formats.
Returns:
| Type | Description |
|---|---|
| dict[str, dict[str, Any]] | Dictionary with "loaders" and "exporters" entries, each mapping a format name to its information |
Source code in boxlab/dataset/io.py
merge ¶
merge(*datasets: Dataset, name: str = 'merged_dataset', resolve_conflicts: Literal['skip', 'rename', 'error'] = 'skip', preserve_sources: bool = True, fix_duplicates: bool = True) -> Dataset
Merge multiple datasets into one.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| *datasets | Dataset | Datasets to merge | () |
| name | str | Name for the merged dataset | 'merged_dataset' |
| resolve_conflicts | Literal['skip', 'rename', 'error'] | How to handle category conflicts | 'skip' |
| preserve_sources | bool | Whether to preserve source information | True |
| fix_duplicates | bool | Whether to fix duplicate category names | True |
Returns:
| Type | Description |
|---|---|
| Dataset | Merged dataset |
Source code in boxlab/dataset/io.py
Overview¶
The I/O module provides high-level convenience functions for common dataset operations. It simplifies loading, exporting, and merging datasets with automatic format detection and sensible defaults.
Key Concepts¶
Automatic Format Detection¶
The load_dataset() function can automatically detect the format based on file structure:
- .json files → COCO format
- Directories with .yaml/.yml + images/ + labels/ → YOLO format
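The detection rules above can be sketched as a small standalone function. This is only an illustration of the layout checks described here, not the library's actual implementation; `detect_format` is a hypothetical name, and the choice of exceptions simply mirrors the ones documented for load_dataset().

```python
from pathlib import Path


def detect_format(path):
    """Guess a dataset format from the layout rules above (hypothetical helper)."""
    p = Path(path)
    if not p.exists():
        raise FileNotFoundError(f"Path does not exist: {p}")
    # A single .json file is treated as a COCO annotation file.
    if p.is_file() and p.suffix.lower() == ".json":
        return "coco"
    # A directory with a YAML config plus images/ and labels/ is treated as YOLO.
    if p.is_dir():
        has_yaml = any(
            child.is_file() and child.suffix.lower() in {".yaml", ".yml"}
            for child in p.iterdir()
        )
        if has_yaml and (p / "images").is_dir() and (p / "labels").is_dir():
            return "yolo"
    raise ValueError(f"Could not detect dataset format for: {p}")
```

A check like this inspects only names and directory structure, so a layout that matches neither rule is reported as a detection failure rather than guessed.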
Naming Strategies¶
When exporting datasets, you can control how output files are named:
| Strategy | Pattern | Example |
|---|---|---|
| original | Keep original name | image001.jpg |
| uuid | Random UUID | a1b2c3d4-e5f6-7890.jpg |
| uuid_prefix | UUID + source prefix | camera1_a1b2c3d4.jpg |
| sequential | Sequential numbers | 000001.jpg |
| sequential_prefix | Numbers + source prefix | camera1_000001.jpg |
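To make the patterns in the table concrete, here is a minimal sketch that builds a file name for each of the five strategies. The function `make_name` and its parameters are hypothetical; the six-digit padding and the eight-character UUID fragment are assumptions read off the example column, not confirmed library behavior.

```python
import uuid
from pathlib import Path


def make_name(strategy, original, index=0, source="source"):
    """Build an output file name for one image (hypothetical helper)."""
    suffix = Path(original).suffix  # keep the original extension, e.g. ".jpg"
    if strategy == "original":
        return Path(original).name
    if strategy == "uuid":
        return f"{uuid.uuid4()}{suffix}"
    if strategy == "uuid_prefix":
        # Assumption: a short UUID fragment, as in the table's example.
        return f"{source}_{uuid.uuid4().hex[:8]}{suffix}"
    if strategy == "sequential":
        return f"{index:06d}{suffix}"
    if strategy == "sequential_prefix":
        return f"{source}_{index:06d}{suffix}"
    raise ValueError(f"Unknown naming strategy: {strategy!r}")
```

The prefixed variants are useful after merging, because two sources can both contain a file called image001.jpg.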
Conflict Resolution¶
When merging datasets with duplicate category names:
- skip: Keep the first occurrence
- rename: Add _other suffix to duplicates
- error: Raise an exception
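The three strategies can be illustrated on plain lists of category names. This is a simplified sketch, not how merge() is implemented: `merge_category_names` is a hypothetical helper, and ValueError stands in for whatever exception the library actually raises.

```python
def merge_category_names(first, second, resolve_conflicts="skip"):
    """Combine two lists of category names (hypothetical helper)."""
    merged = list(first)
    for name in second:
        if name not in merged:
            merged.append(name)
        elif resolve_conflicts == "skip":
            continue  # keep the first occurrence only
        elif resolve_conflicts == "rename":
            merged.append(f"{name}_other")  # suffix the duplicate
        elif resolve_conflicts == "error":
            # Stand-in exception type for this sketch.
            raise ValueError(f"Duplicate category name: {name!r}")
        else:
            raise ValueError(f"Unknown strategy: {resolve_conflicts!r}")
    return merged
```

With `["cat", "dog"]` and `["dog", "bird"]`, skip leaves a single "dog" entry, while rename keeps both as "dog" and "dog_other". The sketch does not handle a second collision on the renamed value.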
Common Workflows¶
Convert Between Formats¶
from boxlab.dataset.io import load_dataset, export_dataset
# Load COCO dataset
dataset = load_dataset("coco/instances.json", format="coco")
# Export to YOLO format
export_dataset(dataset, "output/yolo", format="yolo")
Split Dataset for Training¶
from boxlab.dataset.io import load_dataset, export_dataset
from boxlab.dataset.types import SplitRatio
# Load dataset
dataset = load_dataset("my_dataset/")
# Export with 70/20/10 split
split_ratio = SplitRatio(train=0.7, val=0.2, test=0.1)
export_dataset(
    dataset,
    "output/split_data",
    format="yolo",
    split_ratio=split_ratio,
    seed=42,  # Reproducible split
)
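To show why a fixed seed makes the split reproducible, the sketch below shuffles items with a seeded random generator and cuts the result by ratio. It is an assumption about the general technique, not the library's splitting code; `split_items` is a hypothetical name.

```python
import random


def split_items(items, train, val, test, seed=None):
    """Shuffle with a seeded RNG and cut into three parts (hypothetical helper)."""
    if abs(train + val + test - 1.0) > 1e-9:
        raise ValueError("Ratios must sum to 1.0")
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    n_train = int(len(shuffled) * train)
    n_val = int(len(shuffled) * val)
    # The test part takes whatever remains, so rounding never drops an item.
    return (
        shuffled[:n_train],
        shuffled[n_train:n_train + n_val],
        shuffled[n_train + n_val:],
    )
```

Calling it twice with the same seed returns identical parts; with `seed=None` the order differs between runs.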
Combine Multiple Datasets¶
from boxlab.dataset.io import load_dataset, merge, export_dataset
# Load datasets from different sources
ds1 = load_dataset("source1/")
ds2 = load_dataset("source2/")
ds3 = load_dataset("source3/")
# Merge all datasets
merged = merge(ds1, ds2, ds3, name="combined_dataset")
# Export merged dataset
export_dataset(merged, "output/merged", format="coco")
Format Support Discovery¶
List Available Formats¶
from boxlab.dataset.io import get_supported_formats
formats = get_supported_formats()
print("Available Loaders:")
for name, info in formats["loaders"].items():
    print(f" • {name}: {info['description']}")
print("\nAvailable Exporters:")
for name, info in formats["exporters"].items():
    print(f" • {name}: {info['description']}")
Check Format Capabilities¶
from boxlab.dataset.io import get_supported_formats
formats = get_supported_formats()
# Check COCO loader extensions
coco_info = formats["loaders"]["coco"]
print(f"COCO supports: {coco_info['supported_extensions']}")
# Check YOLO exporter defaults
yolo_info = formats["exporters"]["yolo"]
print(f"YOLO defaults: {yolo_info['default_config']}")
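The same dictionary can drive simple lookups, such as finding which loaders accept a given file extension. The sketch below runs on a hand-written sample shaped like the keys used in the examples above; the sample values and the helper `loaders_for_extension` are hypothetical, and the real return value may contain different entries.

```python
def loaders_for_extension(formats, extension):
    """Return loader names whose 'supported_extensions' include the extension."""
    return sorted(
        name
        for name, info in formats["loaders"].items()
        if extension in info.get("supported_extensions", [])
    )


# Hypothetical sample; real values from get_supported_formats() may differ.
sample_formats = {
    "loaders": {
        "coco": {"description": "COCO JSON", "supported_extensions": [".json"]},
        "yolo": {"description": "YOLO directory", "supported_extensions": [".yaml", ".yml"]},
    },
    "exporters": {
        "yolo": {"description": "YOLO directory", "default_config": {}},
    },
}

print(loaders_for_extension(sample_formats, ".json"))
```

In real use, the result of get_supported_formats() would be passed in place of `sample_formats`.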
Error Handling¶
Handle Format Detection Failures¶
from boxlab.dataset.io import load_dataset
try:
    # Try auto-detection
    dataset = load_dataset("unknown_structure/")
except ValueError as e:
    print(f"Auto-detection failed: {e}")
    # Fall back to explicit format
    dataset = load_dataset("unknown_structure/", format="yolo")
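When more than one explicit format is plausible, the fallback can be generalized into a loop. The helper below is a hypothetical sketch: it takes the loader as an argument so it can be exercised without the library, and it assumes that a failed explicit load also raises ValueError, which the documentation above states only for unsupported formats and failed auto-detection.

```python
def load_with_fallback(loader, path, candidates=("coco", "yolo")):
    """Try auto-detection first, then each explicit format in order."""
    try:
        return loader(path, format="auto")
    except ValueError:
        pass  # auto-detection failed; try explicit formats next
    errors = []
    for fmt in candidates:
        try:
            return loader(path, format=fmt)
        except ValueError as exc:
            errors.append(f"{fmt}: {exc}")
    raise ValueError(f"No format could load {path!r} ({'; '.join(errors)})")
```

A call such as `load_with_fallback(load_dataset, "unknown_structure/")` would then return the first dataset that loads, or raise a ValueError listing every attempt.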
Handle Merge Conflicts¶
from boxlab.dataset.io import merge
from boxlab.exceptions import CategoryConflictError
# ds1 and ds2 are Dataset objects loaded beforehand, e.g. with load_dataset()
try:
    merged = merge(
        ds1, ds2,
        resolve_conflicts="error",  # Strict mode
    )
except CategoryConflictError as e:
    print(f"Category conflict: {e}")
    # Retry with automatic resolution
    merged = merge(ds1, ds2, resolve_conflicts="rename")
See Also¶
- Dataset Core: Core dataset management
- Plugin System: Extend dataset functionality
- Types: Data structures and type definitions
- PyTorch Adapter: Training integration