Dataset¶
Base class for dataset management with multi-source support.
This class loads, exports, merges, and analyzes object detection datasets. It tracks multiple data sources, supports flexible file naming strategies, and reports detailed statistics.
Features
- Load from COCO or YOLO format
- Export to COCO or YOLO format
- Multi-source dataset management
- Flexible file naming strategies
- Comprehensive statistics including per-source analysis
- Category management with conflict resolution
- Dataset splitting and merging
- Visualization tools for samples and statistics
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
name | str | Dataset name identifier. Defaults to "dataset". | 'dataset' |
Attributes:
| Name | Type | Description |
|---|---|---|
name | str | The dataset name identifier. |
images | dict[str, ImageInfo] | Mapping of image IDs to image information. |
annotations | dict[str, list[Annotation]] | Mapping of image IDs to their annotations. |
categories | dict[int, str] | Mapping of category IDs to category names. |
category_name_to_id | dict[str, int] | Reverse mapping of category names to IDs. |
source_info | dict[str, str] | Mapping of image IDs to their source names. |
Example
Basic dataset creation and usage:
from boxlab.dataset import Dataset
from boxlab.dataset.types import ImageInfo, Annotation, BBox
# Create a new dataset
dataset = Dataset(name="my_dataset")
# Add categories
dataset.add_category(1, "person")
dataset.add_category(2, "car")
# Add an image
img_info = ImageInfo(
image_id="001",
file_name="image1.jpg",
width=640,
height=480,
path="/path/to/image1.jpg",
)
dataset.add_image(img_info, source_name="camera1")
# Add an annotation
annotation = Annotation(
bbox=BBox(x_min=10, y_min=20, x_max=100, y_max=150),
category_id=1,
category_name="person",
image_id="001",
annotation_id="ann_001",
)
dataset.add_annotation(annotation)
# Get statistics
stats = dataset.get_statistics()
print(f"Total images: {stats['num_images']}")
print(f"Total annotations: {stats['num_annotations']}")
Example
Merging multiple datasets:
from boxlab.dataset import Dataset
# Create two datasets
dataset1 = Dataset(name="dataset_a")
dataset2 = Dataset(name="dataset_b")
# ... add data to both datasets ...
# Merge datasets
merged = dataset1.merge(
dataset2,
resolve_conflicts="skip",
preserve_sources=True,
)
# Or use the + operator
merged = dataset1 + dataset2
Source code in boxlab/dataset/__init__.py
Functions¶
add_category ¶
Add a category to the dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
category_id | int | Unique identifier for the category. | required |
category_name | str | Human-readable name for the category. | required |
get_category_name ¶
Get category name by ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
category_id | int | The category ID to look up. | required |
Returns:
| Type | Description |
|---|---|
| str \| None | The category name if found, None otherwise. |
get_category_id ¶
Get category ID by name.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
category_name | str | The category name to look up. | required |
Returns:
| Type | Description |
|---|---|
| int \| None | The category ID if found, None otherwise. |
Example
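The two lookups above mirror the `categories` and `category_name_to_id` attributes. As a rough illustration of that pairing, and not BoxLab's actual implementation, the sketch below keeps both mappings in sync with plain dictionaries. `CategoryIndex` is a hypothetical stand-in class, and the rule for which ID wins the reverse lookup is an assumption.

```python
from __future__ import annotations


class CategoryIndex:
    """Hypothetical stand-in that mimics Dataset's category lookups."""

    def __init__(self) -> None:
        self.categories: dict[int, str] = {}
        self.category_name_to_id: dict[str, int] = {}

    def add_category(self, category_id: int, category_name: str) -> None:
        self.categories[category_id] = category_name
        # Assumption: the first ID registered for a name wins the reverse lookup.
        self.category_name_to_id.setdefault(category_name, category_id)

    def get_category_name(self, category_id: int) -> str | None:
        # dict.get returns None for unknown IDs, matching the Returns table.
        return self.categories.get(category_id)

    def get_category_id(self, category_name: str) -> int | None:
        return self.category_name_to_id.get(category_name)


index = CategoryIndex()
index.add_category(1, "person")
index.add_category(2, "car")

print(index.get_category_name(1))
print(index.get_category_id("car"))
print(index.get_category_id("bicycle"))
```

Unknown IDs and names return None rather than raising, so callers should check the result before using it.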
fix_duplicate_categories ¶
Fix duplicate category names by merging them.
When multiple category IDs map to the same category name, this method consolidates them into a single canonical ID (the smallest ID) and remaps all affected annotations.
Returns:
| Type | Description |
|---|---|
dict[int, int] | A mapping from old category IDs to new (canonical) category IDs. |
Example
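The consolidation rule described above (the smallest ID per name becomes canonical) can be sketched on a plain `dict[int, str]`. This is a minimal illustration and not BoxLab's code: `consolidate_duplicates` is a hypothetical helper, and returning only the IDs that change is an assumption about the shape of the mapping.

```python
from __future__ import annotations

from collections import defaultdict


def consolidate_duplicates(categories: dict[int, str]) -> dict[int, int]:
    """Map each duplicate category ID to the smallest ID sharing its name."""
    ids_by_name: dict[str, list[int]] = defaultdict(list)
    for category_id, name in categories.items():
        ids_by_name[name].append(category_id)

    mapping: dict[int, int] = {}
    for ids in ids_by_name.values():
        canonical = min(ids)
        for category_id in ids:
            # Assumption: only IDs that actually change appear in the mapping.
            if category_id != canonical:
                mapping[category_id] = canonical
    return mapping


categories = {1: "person", 2: "car", 5: "person", 7: "car"}
mapping = consolidate_duplicates(categories)
print(mapping)

# Remap annotation category IDs, leaving unaffected IDs untouched.
annotation_category_ids = [1, 5, 7, 2]
remapped = [mapping.get(cid, cid) for cid in annotation_category_ids]
print(remapped)
```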
add_image ¶
add_image(image_info: ImageInfo, source_name: str | None = None) -> None
Add image metadata to dataset with optional source tracking.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| image_info | ImageInfo | ImageInfo object containing image metadata. | required |
| source_name | str \| None | Optional name of the data source. Useful for tracking which dataset or camera the image came from. | None |
get_image ¶
get_image(image_id: str) -> ImageInfo | None
Get image information by ID.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image_id | str | The unique identifier of the image. | required |
Returns:
| Type | Description |
|---|---|
| ImageInfo \| None | ImageInfo object if found, None otherwise. |
get_image_source ¶
Get the source name for an image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image_id | str | The unique identifier of the image. | required |
Returns:
| Type | Description |
|---|---|
| str \| None | Source name if tracked, None otherwise. |
get_sources ¶
Get all unique source names in the dataset.
Returns:
| Type | Description |
|---|---|
set[str] | Set of unique source names. |
Example
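Source tracking pairs each image ID with an optional source name, as in the `source_info` attribute. The sketch below shows one way that bookkeeping could work with a plain dictionary. The module-level functions are hypothetical stand-ins for the methods of the same name, not the library's implementation.

```python
from __future__ import annotations

# Stand-in for Dataset.source_info: image ID -> source name.
source_info: dict[str, str] = {}


def track_source(image_id: str, source_name: str | None = None) -> None:
    """Record a source only when one is given (assumed behavior)."""
    if source_name is not None:
        source_info[image_id] = source_name


def get_image_source(image_id: str) -> str | None:
    return source_info.get(image_id)


def get_sources() -> set[str]:
    return set(source_info.values())


track_source("001", "camera1")
track_source("002", "camera2")
track_source("003", "camera1")
track_source("004")  # no source recorded for this image

print(get_image_source("001"))
print(get_image_source("004"))
print(sorted(get_sources()))
```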
add_annotation ¶
add_annotation(annotation: Annotation) -> None
Add annotation to dataset.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
annotation | Annotation | Annotation object containing bounding box and category info. | required |
Example
from boxlab.dataset import Dataset
from boxlab.dataset.types import Annotation, BBox
dataset = Dataset(name="my_dataset")
annotation = Annotation(
bbox=BBox(x_min=100, y_min=50, x_max=200, y_max=150),
category_id=1,
category_name="person",
image_id="001",
annotation_id="ann_001",
)
dataset.add_annotation(annotation)
get_annotations ¶
get_annotations(image_id: str) -> list[Annotation]
Get all annotations for a specific image.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
image_id | str | The unique identifier of the image. | required |
Returns:
| Type | Description |
|---|---|
| list[Annotation] | List of Annotation objects for the specified image. Returns an empty list if no annotations are found. |
Example
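Because a missing image yields an empty list rather than None, callers can iterate over the result without a guard. A minimal sketch of that contract, using a plain dictionary in place of the `annotations` attribute and strings in place of Annotation objects (both simplifications are assumptions made for illustration):

```python
from __future__ import annotations

# Stand-in for Dataset.annotations: image ID -> list of annotations.
annotations: dict[str, list[str]] = {
    "001": ["person", "car"],
    "002": ["person"],
}


def get_annotations(image_id: str) -> list[str]:
    """Return the annotations for an image, or an empty list if none exist."""
    return annotations.get(image_id, [])


for label in get_annotations("001"):
    print(label)

# Unknown image IDs are safe to iterate over.
print(len(get_annotations("999")))

# Total annotation count across all images.
total = sum(len(items) for items in annotations.values())
print(total)
```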
num_images ¶
Get total number of images in the dataset.
Returns:
| Type | Description |
|---|---|
int | Number of images. |
num_annotations ¶
Get total number of annotations in the dataset.
Returns:
| Type | Description |
|---|---|
int | Total count of all annotations across all images. |
num_categories ¶
Get total number of categories in the dataset.
Returns:
| Type | Description |
|---|---|
int | Number of unique categories. |
get_statistics ¶
get_statistics(by_source: Literal[False] = False) -> DatasetStatistics
get_statistics(by_source: Literal[True]) -> dict[str, DatasetStatistics]
get_statistics(by_source: bool = False) -> DatasetStatistics | dict[str, DatasetStatistics]
Calculate dataset statistics.
Computes comprehensive statistics about the dataset, including image counts, annotation counts, category distribution, and bounding box area metrics.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| by_source | bool | If True, return statistics grouped by source name. If False, return overall statistics for the entire dataset. | False |
Returns:
| Type | Description |
|---|---|
| DatasetStatistics | Overall statistics, when by_source is False. |
| dict[str, DatasetStatistics] | A dict mapping source names to their respective DatasetStatistics objects, when by_source is True. |
Example
```python
from boxlab.dataset import Dataset

dataset = Dataset(name="my_dataset")
# ... add data ...

# Get overall statistics
stats = dataset.get_statistics()
print(f"Images: {stats['num_images']}")
print(f"Annotations: {stats['num_annotations']}")
print(
    "Avg annotations per image: "
    f"{stats['avg_annotations_per_image']:.2f}"
)

# Get statistics by source
stats_by_source = dataset.get_statistics(by_source=True)
for source, source_stats in stats_by_source.items():
    print(f"\nSource: {source}")
    print(f"  Images: {source_stats['num_images']}")
```
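To make the reported quantities concrete, the sketch below computes comparable figures from toy data without using BoxLab. The keys `num_images`, `num_annotations`, and `avg_annotations_per_image` appear in the example above; `category_distribution` and `avg_bbox_area` are hypothetical key names chosen for illustration, and the real DatasetStatistics structure may differ.

```python
from __future__ import annotations

from collections import Counter

# image ID -> list of (category name, (x_min, y_min, x_max, y_max))
annotations = {
    "001": [("person", (10, 20, 100, 150)), ("car", (0, 0, 50, 40))],
    "002": [("person", (5, 5, 25, 45))],
    "003": [],
}


def summarize(annotations: dict) -> dict:
    """Compute simple dataset statistics from the toy structure above."""
    flat = [item for items in annotations.values() for item in items]
    areas = [
        (x_max - x_min) * (y_max - y_min)
        for _, (x_min, y_min, x_max, y_max) in flat
    ]
    num_images = len(annotations)
    num_annotations = len(flat)
    return {
        "num_images": num_images,
        "num_annotations": num_annotations,
        "avg_annotations_per_image": (
            num_annotations / num_images if num_images else 0.0
        ),
        # Hypothetical keys, not confirmed by the BoxLab documentation.
        "category_distribution": dict(Counter(name for name, _ in flat)),
        "avg_bbox_area": sum(areas) / len(areas) if areas else 0.0,
    }


stats = summarize(annotations)
print(stats["num_images"], stats["num_annotations"])
print(stats["category_distribution"])
```

Grouping by source would amount to running the same summary once per source name over the images tagged with that source.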
print_statistics ¶
Print dataset statistics to console.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
by_source | bool | If True, print statistics for each source separately. If False, print overall statistics. | False |
merge ¶
merge(other: Dataset, resolve_conflicts: Literal['skip', 'rename', 'error'] = 'skip', preserve_sources: bool = True, regen_ids: bool = True) -> Dataset
Merge this dataset with another dataset into a new dataset.
Creates a new dataset containing all images, annotations, and categories from both datasets. Handles category conflicts according to the specified resolution strategy.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| other | Dataset | Another Dataset object to merge with this one. | required |
| resolve_conflicts | Literal['skip', 'rename', 'error'] | Strategy for handling category name conflicts: "skip" uses the existing category ID (default); "rename" renames the conflicting category from the other dataset; "error" raises CategoryConflictError. | 'skip' |
| preserve_sources | bool | If True, maintain source tracking information from both datasets. | True |
| regen_ids | bool | If True, regenerate image and annotation IDs to avoid conflicts. Recommended when merging datasets with overlapping IDs. | True |
Returns:
| Type | Description |
|---|---|
Dataset | A new Dataset object containing the merged data. |
Raises:
| Type | Description |
|---|---|
CategoryConflictError | If resolve_conflicts is "error" and a category name conflict is detected. |
DatasetError | If category mapping fails during merge. |
Example
from boxlab.dataset import Dataset
# Create two datasets
dataset_a = Dataset(name="dataset_a")
dataset_b = Dataset(name="dataset_b")
# ... populate datasets ...
# Merge with default settings
merged = dataset_a.merge(dataset_b)
# Merge with conflict resolution
merged = dataset_a.merge(
dataset_b,
resolve_conflicts="rename",
preserve_sources=True,
)
# Using the + operator (equivalent to merge with defaults)
merged = dataset_a + dataset_b
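The three `resolve_conflicts` strategies can be illustrated on bare category mappings. The sketch below is one reading of the parameter table, not the library's implementation: it treats a conflict as a category name that already exists in the first dataset, invents a `_other` suffix for the "rename" case, and assigns new IDs sequentially. `merge_categories` is a hypothetical helper, and `CategoryConflictError` is redefined locally as a stand-in.

```python
from __future__ import annotations


class CategoryConflictError(Exception):
    """Local stand-in for the exception named in the Raises table."""


def merge_categories(
    base: dict[int, str],
    other: dict[int, str],
    resolve_conflicts: str = "skip",
) -> tuple[dict[int, str], dict[int, int]]:
    """Return merged categories and a map from other's IDs to merged IDs."""
    if resolve_conflicts not in ("skip", "rename", "error"):
        raise ValueError(f"Unknown strategy: {resolve_conflicts!r}")

    merged = dict(base)
    name_to_id = {name: cid for cid, name in merged.items()}
    id_map: dict[int, int] = {}
    next_id = max(merged, default=0) + 1  # assumption: sequential new IDs

    for other_id, name in other.items():
        if name in name_to_id:
            if resolve_conflicts == "skip":
                # Reuse the ID that the first dataset already assigned.
                id_map[other_id] = name_to_id[name]
                continue
            if resolve_conflicts == "error":
                raise CategoryConflictError(f"Category {name!r} already exists")
            # "rename": the suffix is an assumption made for this sketch.
            name = f"{name}_other"
        merged[next_id] = name
        name_to_id[name] = next_id
        id_map[other_id] = next_id
        next_id += 1
    return merged, id_map


base = {1: "person", 2: "car"}
other = {1: "car", 2: "dog"}

skip_merged, skip_map = merge_categories(base, other, "skip")
rename_merged, rename_map = merge_categories(base, other, "rename")
print(skip_merged, skip_map)
print(rename_merged, rename_map)
```

The returned ID map is what a merge would use to rewrite the `category_id` of each annotation coming from the second dataset.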
split ¶
split(split_ratio: SplitRatio, seed: int | None = None) -> dict[str, list[str]]
Split dataset into train, validation, and test sets.
Randomly shuffles and divides the dataset images according to the specified ratios. Useful for creating training/validation/test splits for machine learning.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| split_ratio | SplitRatio | SplitRatio object defining the proportions for train, validation, and test sets. Must sum to 1.0. | required |
| seed | int \| None | Optional random seed for reproducible splits. If None, the split will be non-deterministic. | None |
Returns:
| Type | Description |
|---|---|
| dict[str, list[str]] | Dictionary with keys "train", "val", and "test", each mapping to a list of image IDs in that split. |
Raises:
| Type | Description |
|---|---|
ValueError | If split_ratio proportions don't sum to 1.0 (raised by split_ratio.validate()). |
Example
from boxlab.dataset import Dataset
from boxlab.dataset.types import SplitRatio
dataset = Dataset(name="my_dataset")
# ... populate dataset ...
# Define split ratios: 70% train, 20% val, 10% test
split_ratio = SplitRatio(train=0.7, val=0.2, test=0.1)
# Split with fixed seed for reproducibility
splits = dataset.split(split_ratio, seed=42)
print(f"Train images: {len(splits['train'])}")
print(f"Val images: {len(splits['val'])}")
print(f"Test images: {len(splits['test'])}")
# Access specific split
train_image_ids = splits["train"]
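The shuffle-then-slice idea behind a seeded split can be sketched with the standard library alone. This is an illustration under stated assumptions, not BoxLab's code: `split_ids` is a hypothetical function, the rounding rule is a guess, and the test set simply receives whatever remains after the train and validation slices.

```python
from __future__ import annotations

import math
import random


def split_ids(
    image_ids: list[str],
    train: float,
    val: float,
    test: float,
    seed: int | None = None,
) -> dict[str, list[str]]:
    """Shuffle image IDs and slice them into train/val/test lists."""
    if not math.isclose(train + val + test, 1.0, abs_tol=1e-6):
        raise ValueError("Split ratios must sum to 1.0")

    shuffled = list(image_ids)
    # A dedicated Random instance keeps the global RNG state untouched.
    random.Random(seed).shuffle(shuffled)

    # Assumption: round the train and val sizes, give the remainder to test.
    n_train = round(len(shuffled) * train)
    n_val = round(len(shuffled) * val)
    return {
        "train": shuffled[:n_train],
        "val": shuffled[n_train:n_train + n_val],
        "test": shuffled[n_train + n_val:],
    }


image_ids = [f"{i:03d}" for i in range(1, 11)]
splits = split_ids(image_ids, train=0.7, val=0.2, test=0.1, seed=42)
print({name: len(ids) for name, ids in splits.items()})
```

Passing the same seed and the same input order reproduces the same split, which is why fixing `seed` matters when results need to be comparable across runs.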
visualize_sample ¶
visualize_sample(image_id: str, figsize: tuple[int, int] = (12, 8), show_labels: bool = True, save_path: Path | None = None) -> None
Visualize a single image with its annotations.
Displays the image with bounding boxes and category labels overlaid. Each category is assigned a unique color, and annotations are drawn as rectangles with optional text labels.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| image_id | str | The unique identifier of the image to visualize. | required |
| figsize | tuple[int, int] | Tuple of (width, height) for the matplotlib figure size. | (12, 8) |
| show_labels | bool | If True, display category names above bounding boxes. | True |
| save_path | Path \| None | Optional path to save the visualization as an image file. If None, only displays the plot. | None |
Raises:
| Type | Description |
|---|---|
DatasetError | If the image is not found or has no path defined. |
Example
from pathlib import Path
from boxlab.dataset import Dataset
dataset = Dataset(name="my_dataset")
# ... populate dataset ...
# Display a sample image
dataset.visualize_sample("001")
# Save visualization to file
dataset.visualize_sample(
"001",
figsize=(16, 10),
show_labels=True,
save_path=Path("output/sample_001.png"),
)
visualize_category_distribution ¶
visualize_category_distribution(figsize: tuple[int, int] = (12, 6), save_path: Path | None = None) -> None
Visualize category distribution as a bar chart.
Creates a bar chart showing the number of annotations for each category in the dataset. Useful for understanding class balance and distribution.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
| figsize | tuple[int, int] | Tuple of (width, height) for the matplotlib figure size. | (12, 6) |
| save_path | Path \| None | Optional path to save the visualization as an image file. If None, only displays the plot. | None |
Example
from pathlib import Path
from boxlab.dataset import Dataset
dataset = Dataset(name="my_dataset")
# ... populate dataset ...
# Display category distribution
dataset.visualize_category_distribution()
# Save to file
dataset.visualize_category_distribution(
figsize=(16, 8),
save_path=Path("output/category_distribution.png"),
)
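The bar heights in such a chart are per-category annotation counts. As a dependency-free illustration of that tally (the data is made up, and the text output only approximates what a matplotlib bar chart would show):

```python
from collections import Counter

# Category name of every annotation in a toy dataset (hypothetical data).
annotation_categories = ["person", "car", "person", "bicycle", "person", "car"]

counts = Counter(annotation_categories)

# Print a text version of the bar chart, most frequent category first.
for name, count in counts.most_common():
    print(f"{name:<10} {'#' * count} ({count})")
```

A strongly skewed tally is an early sign of class imbalance, which is the main thing this visualization is meant to expose.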
__add__ ¶
__add__(other: object) -> Dataset
Enable merging datasets using the + operator.
Parameters:
| Name | Type | Description | Default |
|---|---|---|---|
other | object | Another Dataset object to merge. | required |
Returns:
| Type | Description |
|---|---|
Dataset | A new merged Dataset. |
Raises:
| Type | Description |
|---|---|
TypeError | If other is not a Dataset instance. |
Example
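The operator behavior described above (delegate to a merge, reject non-Dataset operands) follows a common Python pattern. The sketch below shows that pattern on `MiniDataset`, a hypothetical stand-in that stores only image IDs. The merged name and the error message are assumptions, not BoxLab's actual output.

```python
from __future__ import annotations


class MiniDataset:
    """Hypothetical stand-in for Dataset that stores only image IDs."""

    def __init__(self, name: str = "dataset", image_ids: list[str] | None = None):
        self.name = name
        self.image_ids = list(image_ids or [])

    def merge(self, other: MiniDataset) -> MiniDataset:
        # Assumption: the merged name joins both names; the real rule may differ.
        return MiniDataset(
            name=f"{self.name}+{other.name}",
            image_ids=self.image_ids + other.image_ids,
        )

    def __add__(self, other: object) -> MiniDataset:
        if not isinstance(other, MiniDataset):
            raise TypeError(f"Cannot add MiniDataset and {type(other).__name__}")
        return self.merge(other)

    def __len__(self) -> int:
        # len(dataset) reports the number of images.
        return len(self.image_ids)


dataset_a = MiniDataset("dataset_a", ["001", "002"])
dataset_b = MiniDataset("dataset_b", ["003"])

merged = dataset_a + dataset_b
print(len(merged))

try:
    dataset_a + 5
except TypeError as exc:
    print(f"TypeError: {exc}")
```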
__len__ ¶
Return the number of images in the dataset.
Returns:
| Type | Description |
|---|---|
int | Number of images. |
Overview¶
The Dataset class is the core component of BoxLab's dataset management system. It provides comprehensive functionality for managing object detection datasets, including loading, exporting, merging, and analyzing data from multiple sources.
Key Features¶
- Multi-format Support: Load and export datasets in COCO, YOLO, and other formats
- Multi-source Management: Track and manage data from multiple sources
- Category Management: Handle categories with conflict resolution
- Dataset Operations: Split, merge, and transform datasets
- Statistics & Visualization: Comprehensive dataset analysis and visualization tools
- Flexible Architecture: Plugin-based system for extending functionality
Quick Start¶
import pathlib
from boxlab.dataset import Dataset
from boxlab.dataset.types import ImageInfo, Annotation, BBox
# Create a new dataset
dataset = Dataset(name="my_dataset")
# Add categories
dataset.add_category(1, "person")
dataset.add_category(2, "car")
# Add an image
img_info = ImageInfo(
image_id="001",
file_name="image1.jpg",
width=640,
height=480,
path=pathlib.Path("/path/to/image1.jpg"),
)
dataset.add_image(img_info, source_name="camera1")
# Add an annotation
annotation = Annotation(
bbox=BBox(x_min=10, y_min=20, x_max=100, y_max=150),
category_id=1,
category_name="person",
image_id="001",
annotation_id="ann_001",
)
dataset.add_annotation(annotation)
# Get statistics
stats = dataset.get_statistics()
print(f"Total images: {stats['num_images']}")
print(f"Total annotations: {stats['num_annotations']}")
Category Management¶
Manage object categories in your dataset:
- add_category(): Add a new category
- get_category_name(): Retrieve category name by ID
- get_category_id(): Retrieve category ID by name
- fix_duplicate_categories(): Resolve duplicate category names
Image Management¶
Handle image metadata and sources:
- add_image(): Add image with optional source tracking
- get_image(): Retrieve image information
- get_image_source(): Get source name for an image
- get_sources(): List all unique data sources
Annotation Management¶
Work with object annotations:
- add_annotation(): Add bounding box annotation
- get_annotations(): Get all annotations for an image
Statistics & Analysis¶
Analyze your dataset:
- get_statistics(): Compute comprehensive statistics
- print_statistics(): Display statistics in console
- num_images(): Get image count
- num_annotations(): Get annotation count
- num_categories(): Get category count
Dataset Operations¶
Transform and combine datasets:
- split(): Split dataset into train/val/test sets
- merge(): Combine multiple datasets
- __add__(): Merge using the + operator
Visualization¶
Visualize dataset content:
- visualize_sample(): Display image with annotations
- visualize_category_distribution(): Show category balance
See Also¶
- Plugin System: Extend dataset functionality
- Types: Data structures and type definitions
- I/O Operations: Loading and exporting datasets
- PyTorch Adapter: Training integration