Dataset¶

Dataset(name: str = 'dataset')

Base class for dataset management with multi-source support.

This class loads, exports, merges, and analyzes object detection datasets. It tracks which source each image came from, supports flexible file naming strategies, and reports detailed statistics, either overall or per source.

Features

Load from COCO or YOLO format
Export to COCO or YOLO format
Multi-source dataset management
Flexible file naming strategies
Comprehensive statistics including per-source analysis
Category management with conflict resolution
Dataset splitting and merging
Visualization tools for samples and statistics

Parameters:

Name	Type	Description	Default
`name`	`str`	Dataset name identifier. Defaults to "dataset".	`'dataset'`

Attributes:

Name	Type	Description
`name`	`str`	The dataset name identifier.
`images`	`dict[str, ImageInfo]`	Mapping of image IDs to image information.
`annotations`	`dict[str, list[Annotation]]`	Mapping of image IDs to their annotations.
`categories`	`dict[int, str]`	Mapping of category IDs to category names.
`category_name_to_id`	`dict[str, int]`	Reverse mapping of category names to IDs.
`source_info`	`dict[str, str]`	Mapping of image IDs to their source names.

Example

Basic dataset creation and usage:

from boxlab.dataset import Dataset
from boxlab.dataset.types import ImageInfo, Annotation, BBox

# Create a new dataset
dataset = Dataset(name="my_dataset")

# Add categories
dataset.add_category(1, "person")
dataset.add_category(2, "car")

# Add an image
img_info = ImageInfo(
    image_id="001",
    file_name="image1.jpg",
    width=640,
    height=480,
    path="/path/to/image1.jpg",
)
dataset.add_image(img_info, source_name="camera1")

# Add an annotation
annotation = Annotation(
    bbox=BBox(x_min=10, y_min=20, x_max=100, y_max=150),
    category_id=1,
    category_name="person",
    image_id="001",
    annotation_id="ann_001",
)
dataset.add_annotation(annotation)

# Get statistics
stats = dataset.get_statistics()
print(f"Total images: {stats['num_images']}")
print(f"Total annotations: {stats['num_annotations']}")

Example

Merging multiple datasets:

from boxlab.dataset import Dataset

# Create two datasets
dataset1 = Dataset(name="dataset_a")
dataset2 = Dataset(name="dataset_b")

# ... add data to both datasets ...

# Merge datasets
merged = dataset1.merge(
    dataset2,
    resolve_conflicts="skip",
    preserve_sources=True,
)

# Or use the + operator
merged = dataset1 + dataset2

Source code in boxlab/dataset/__init__.py

def __init__(self, name: str = "dataset") -> None:
    self.name = name
    self.images: dict[str, ImageInfo] = {}
    self.annotations: dict[str, list[Annotation]] = coll.defaultdict(list)
    self.categories: dict[int, str] = {}
    self.category_name_to_id: dict[str, int] = {}
    self.source_info: dict[str, str] = {}  # image_id -> source_name
    logger.debug(f"Initialized dataset: {name}")

Functions¶

add_category ¶

add_category(category_id: int, category_name: str) -> None

Add a category to the dataset.

Parameters:

Name	Type	Description	Default
`category_id`	`int`	Unique identifier for the category.	required
`category_name`	`str`	Human-readable name for the category.	required

Example

dataset = Dataset(name="my_dataset")
dataset.add_category(1, "person")
dataset.add_category(2, "car")
dataset.add_category(3, "bicycle")

Source code in boxlab/dataset/__init__.py

def add_category(self, category_id: int, category_name: str) -> None:
    """Add a category to the dataset.

    Args:
        category_id: Unique identifier for the category.
        category_name: Human-readable name for the category.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        dataset.add_category(1, "person")
        dataset.add_category(2, "car")
        dataset.add_category(3, "bicycle")
        ```
    """
    self.categories[category_id] = category_name
    self.category_name_to_id[category_name] = category_id
    logger.debug(f"Added category: {category_id} -> {category_name}")
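
Because the method writes to both mappings without checking for an existing name, registering one name under two IDs leaves both IDs in `categories` while `category_name_to_id` keeps only the most recent one. The sketch below illustrates this with plain dictionaries standing in for the two attributes; the module-level `add_category` function is a hypothetical stand-in that copies the two assignments shown above, not the library method itself.

```python
# Plain dicts standing in for Dataset.categories and Dataset.category_name_to_id.
categories: dict[int, str] = {}
category_name_to_id: dict[str, int] = {}


def add_category(category_id: int, category_name: str) -> None:
    """Stand-in that mirrors the two assignments made by the method above."""
    categories[category_id] = category_name
    category_name_to_id[category_name] = category_id


add_category(1, "person")
add_category(2, "car")
add_category(5, "person")  # same name registered under a second ID

print(categories)           # IDs 1 and 5 both map to "person"
print(category_name_to_id)  # "person" now points at ID 5 only
```

This is the situation that `fix_duplicate_categories` is meant to clean up.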

get_category_name ¶

get_category_name(category_id: int) -> str | None

Get category name by ID.

Parameters:

Name	Type	Description	Default
`category_id`	`int`	The category ID to look up.	required

Returns:

Type	Description
`str \| None`	The category name if found, None otherwise.

Example

dataset = Dataset(name="my_dataset")
dataset.add_category(1, "person")

name = dataset.get_category_name(1)
print(name)  # Output: "person"

Source code in boxlab/dataset/__init__.py

def get_category_name(self, category_id: int) -> str | None:
    """Get category name by ID.

    Args:
        category_id: The category ID to look up.

    Returns:
        The category name if found, None otherwise.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        dataset.add_category(1, "person")

        name = dataset.get_category_name(1)
        print(name)  # Output: "person"
        ```
    """
    return self.categories.get(category_id)

get_category_id ¶

get_category_id(category_name: str) -> int | None

Get category ID by name.

Parameters:

Name	Type	Description	Default
`category_name`	`str`	The category name to look up.	required

Returns:

Type	Description
`int \| None`	The category ID if found, None otherwise.

Example

dataset = Dataset(name="my_dataset")
dataset.add_category(1, "person")

cat_id = dataset.get_category_id("person")
print(cat_id)  # Output: 1

Source code in boxlab/dataset/__init__.py

def get_category_id(self, category_name: str) -> int | None:
    """Get category ID by name.

    Args:
        category_name: The category name to look up.

    Returns:
        The category ID if found, None otherwise.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        dataset.add_category(1, "person")

        cat_id = dataset.get_category_id("person")
        print(cat_id)  # Output: 1
        ```
    """
    return self.category_name_to_id.get(category_name)

fix_duplicate_categories ¶

fix_duplicate_categories() -> dict[int, int]

Fix duplicate category names by merging them.

When multiple category IDs map to the same category name, this method consolidates them into a single canonical ID (the smallest ID) and remaps all affected annotations.

Returns:

Type	Description
`dict[int, int]`	A mapping from old category IDs to new (canonical) category IDs.

Example

dataset = Dataset(name="my_dataset")
# Suppose categories were loaded with duplicates:
# {1: "person", 2: "car", 5: "person"}

mapping = dataset.fix_duplicate_categories()
# mapping = {1: 1, 2: 2, 5: 1}
# Now only categories {1: "person", 2: "car"} remain

Source code in boxlab/dataset/__init__.py

def fix_duplicate_categories(self) -> dict[int, int]:
    """Fix duplicate category names by merging them.

    When multiple category IDs map to the same category name, this method
    consolidates them into a single canonical ID (the smallest ID) and
    remaps all affected annotations.

    Returns:
        A mapping from old category IDs to new (canonical) category IDs.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # Suppose categories were loaded with duplicates:
        # {1: "person", 2: "car", 5: "person"}

        mapping = dataset.fix_duplicate_categories()
        # mapping = {1: 1, 2: 2, 5: 1}
        # Now only categories {1: "person", 2: "car"} remain
        ```
    """
    logger.info(f"Fixing duplicate categories in dataset: {self.name}")

    name_to_ids: dict[str, list[int]] = coll.defaultdict(list)
    for cat_id, cat_name in self.categories.items():
        name_to_ids[cat_name].append(cat_id)

    category_mapping: dict[int, int] = {}
    duplicates_found = 0

    for cat_name, cat_ids in name_to_ids.items():
        if len(cat_ids) > 1:
            cat_ids.sort()
            canonical_id = cat_ids[0]
            duplicates_found += len(cat_ids) - 1

            logger.warning(
                f"Found duplicate category '{cat_name}' with IDs {cat_ids}, "
                f"keeping ID {canonical_id}"
            )

            for cat_id in cat_ids:
                category_mapping[cat_id] = canonical_id
        else:
            category_mapping[cat_ids[0]] = cat_ids[0]

    if duplicates_found == 0:
        logger.info("No duplicate categories found")
        return category_mapping

    # Rebuild categories
    new_categories: dict[int, str] = {}
    new_category_name_to_id: dict[str, int] = {}

    for cat_name, cat_ids in name_to_ids.items():
        canonical_id = min(cat_ids)
        new_categories[canonical_id] = cat_name
        new_category_name_to_id[cat_name] = canonical_id

    self.categories = new_categories
    self.category_name_to_id = new_category_name_to_id

    # Remap annotations
    annotations_updated = 0
    for img_id in list(self.annotations.keys()):
        new_anns = []
        for ann in self.annotations[img_id]:
            new_cat_id = category_mapping[ann.category_id]
            if ann.category_id != new_cat_id:
                new_ann = Annotation(
                    bbox=ann.bbox,
                    category_id=new_cat_id,
                    category_name=ann.category_name,
                    image_id=ann.image_id,
                    annotation_id=ann.annotation_id,
                    area=ann.area,
                    iscrowd=ann.iscrowd,
                )
                new_anns.append(new_ann)
                annotations_updated += 1
            else:
                new_anns.append(ann)

        self.annotations[img_id] = new_anns

    logger.info(
        f"Fixed {duplicates_found} duplicate categories, "
        f"updated {annotations_updated} annotations"
    )

    return category_mapping
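
The consolidation step can be tried in isolation. The following is a minimal sketch, assuming categories are held in a plain `dict[int, str]`; the function name `canonical_category_mapping` is hypothetical and only reproduces the "smallest ID wins" rule from the source above, without touching annotations.

```python
from collections import defaultdict


def canonical_category_mapping(categories: dict[int, str]) -> dict[int, int]:
    """Map every category ID to the smallest ID that shares its name."""
    name_to_ids: dict[str, list[int]] = defaultdict(list)
    for cat_id, cat_name in categories.items():
        name_to_ids[cat_name].append(cat_id)

    mapping: dict[int, int] = {}
    for cat_ids in name_to_ids.values():
        canonical_id = min(cat_ids)
        for cat_id in cat_ids:
            mapping[cat_id] = canonical_id
    return mapping


categories = {1: "person", 2: "car", 5: "person"}
mapping = canonical_category_mapping(categories)

# Keep one entry per canonical ID.
deduplicated = {cid: categories[cid] for cid in sorted(set(mapping.values()))}

print(mapping)       # {1: 1, 5: 1, 2: 2}
print(deduplicated)  # {1: 'person', 2: 'car'}
```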

add_image ¶

add_image(image_info: ImageInfo, source_name: str | None = None) -> None

Add image metadata to dataset with optional source tracking.

Parameters:

Name	Type	Description	Default
`image_info`	`ImageInfo`	ImageInfo object containing image metadata.	required
`source_name`	`str \| None`	Optional name of the data source. Useful for tracking which dataset or camera the image came from.	`None`

Example

from boxlab.dataset import Dataset
from boxlab.dataset.types import ImageInfo

dataset = Dataset(name="my_dataset")

img_info = ImageInfo(
    image_id="001",
    file_name="image1.jpg",
    width=1920,
    height=1080,
    path="/data/images/image1.jpg",
)

dataset.add_image(img_info, source_name="camera_front")

Source code in boxlab/dataset/__init__.py

def add_image(self, image_info: ImageInfo, source_name: str | None = None) -> None:
    """Add image metadata to dataset with optional source tracking.

    Args:
        image_info: ImageInfo object containing image metadata.
        source_name: Optional name of the data source. Useful for tracking
            which dataset or camera the image came from.

    Example:
        ```python
        from boxlab.dataset import Dataset
        from boxlab.dataset.types import ImageInfo

        dataset = Dataset(name="my_dataset")

        img_info = ImageInfo(
            image_id="001",
            file_name="image1.jpg",
            width=1920,
            height=1080,
            path="/data/images/image1.jpg",
        )

        dataset.add_image(img_info, source_name="camera_front")
        ```
    """
    self.images[image_info.image_id] = image_info
    if source_name:
        self.source_info[image_info.image_id] = source_name
    logger.debug(f"Added image: {image_info.image_id} from source: {source_name}")

get_image ¶

get_image(image_id: str) -> ImageInfo | None

Get image information by ID.

Parameters:

Name	Type	Description	Default
`image_id`	`str`	The unique identifier of the image.	required

Returns:

Type	Description
`ImageInfo \| None`	ImageInfo object if found, None otherwise.

Example

dataset = Dataset(name="my_dataset")
# ... add images ...

img_info = dataset.get_image("001")
if img_info:
    print(f"Image: {img_info.file_name}")
    print(f"Size: {img_info.width}x{img_info.height}")

Source code in boxlab/dataset/__init__.py

def get_image(self, image_id: str) -> ImageInfo | None:
    """Get image information by ID.

    Args:
        image_id: The unique identifier of the image.

    Returns:
        ImageInfo object if found, None otherwise.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images ...

        img_info = dataset.get_image("001")
        if img_info:
            print(f"Image: {img_info.file_name}")
            print(f"Size: {img_info.width}x{img_info.height}")
        ```
    """
    return self.images.get(image_id)

get_image_source ¶

get_image_source(image_id: str) -> str | None

Get the source name for an image.

Parameters:

Name	Type	Description	Default
`image_id`	`str`	The unique identifier of the image.	required

Returns:

Type	Description
`str \| None`	Source name if tracked, None otherwise.

Example

dataset = Dataset(name="my_dataset")
# ... add images with sources ...

source = dataset.get_image_source("001")
print(f"Image source: {source}")  # Output: Image source: camera_front

Source code in boxlab/dataset/__init__.py

def get_image_source(self, image_id: str) -> str | None:
    """Get the source name for an image.

    Args:
        image_id: The unique identifier of the image.

    Returns:
        Source name if tracked, None otherwise.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images with sources ...

        source = dataset.get_image_source("001")
        print(f"Image source: {source}")  # Output: Image source: camera_front
        ```
    """
    return self.source_info.get(image_id)

get_sources ¶

get_sources() -> set[str]

Get all unique source names in the dataset.

Returns:

Type	Description
`set[str]`	Set of unique source names.

Example

dataset = Dataset(name="my_dataset")
# ... add images from multiple sources ...

sources = dataset.get_sources()
print(f"Data sources: {sources}")
# Output: {'camera_front', 'camera_rear', 'dataset_a'}

Source code in boxlab/dataset/__init__.py

def get_sources(self) -> set[str]:
    """Get all unique source names in the dataset.

    Returns:
        Set of unique source names.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images from multiple sources ...

        sources = dataset.get_sources()
        print(f"Data sources: {sources}")
        # Output: {'camera_front', 'camera_rear', 'dataset_a'}
        ```
    """
    return set(self.source_info.values())
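
Source tracking is sparse: `add_image` records a source only when `source_name` is truthy, so images added without a source (or with an empty string) never appear in `source_info` and do not contribute to `get_sources()`. A small sketch of that behavior, using a plain dict and a hypothetical `track_source` helper in place of the real methods:

```python
from typing import Optional

# Plain dict standing in for Dataset.source_info (image_id -> source_name).
source_info: dict[str, str] = {}


def track_source(image_id: str, source_name: Optional[str] = None) -> None:
    """Record a source only for a non-empty name, as add_image does."""
    if source_name:
        source_info[image_id] = source_name


track_source("001", "camera_front")
track_source("002", "camera_rear")
track_source("003", "camera_front")
track_source("004")      # no source given
track_source("005", "")  # empty string is falsy, so nothing is recorded

sources = set(source_info.values())
print(sorted(sources))         # ['camera_front', 'camera_rear']
print(source_info.get("004"))  # None
```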

add_annotation ¶

add_annotation(annotation: Annotation) -> None

Add annotation to dataset.

Parameters:

Name	Type	Description	Default
`annotation`	`Annotation`	Annotation object containing bounding box and category info.	required

Example

from boxlab.dataset import Dataset
from boxlab.dataset.types import Annotation, BBox

dataset = Dataset(name="my_dataset")

annotation = Annotation(
    bbox=BBox(x_min=100, y_min=50, x_max=200, y_max=150),
    category_id=1,
    category_name="person",
    image_id="001",
    annotation_id="ann_001",
)

dataset.add_annotation(annotation)

Source code in boxlab/dataset/__init__.py

def add_annotation(self, annotation: Annotation) -> None:
    """Add annotation to dataset.

    Args:
        annotation: Annotation object containing bounding box and category
            info.

    Example:
        ```python
        from boxlab.dataset import Dataset
        from boxlab.dataset.types import Annotation, BBox

        dataset = Dataset(name="my_dataset")

        annotation = Annotation(
            bbox=BBox(x_min=100, y_min=50, x_max=200, y_max=150),
            category_id=1,
            category_name="person",
            image_id="001",
            annotation_id="ann_001",
        )

        dataset.add_annotation(annotation)
        ```
    """
    self.annotations[annotation.image_id].append(annotation)

get_annotations ¶

get_annotations(image_id: str) -> list[Annotation]

Get all annotations for a specific image.

Parameters:

Name	Type	Description	Default
`image_id`	`str`	The unique identifier of the image.	required

Returns:

Type	Description
`list[Annotation]`	List of Annotation objects for the specified image. Returns an empty list if no annotations are found.

Example

dataset = Dataset(name="my_dataset")
# ... add images and annotations ...

annotations = dataset.get_annotations("001")
print(f"Found {len(annotations)} annotations")

for ann in annotations:
    print(
        f"Category: {ann.category_name}, BBox: {ann.bbox}"
    )

Source code in boxlab/dataset/__init__.py

def get_annotations(self, image_id: str) -> list[Annotation]:
    """Get all annotations for a specific image.

    Args:
        image_id: The unique identifier of the image.

    Returns:
        List of Annotation objects for the specified image. Returns empty
        list if no annotations found.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images and annotations ...

        annotations = dataset.get_annotations("001")
        print(f"Found {len(annotations)} annotations")

        for ann in annotations:
            print(
                f"Category: {ann.category_name}, BBox: {ann.bbox}"
            )
        ```
    """
    return self.annotations.get(image_id, [])
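
The use of `.get(image_id, [])` rather than indexing matters here because `annotations` is a `defaultdict(list)`: indexing a missing key would silently create an empty entry, while `.get()` leaves the mapping untouched. The sketch below demonstrates the difference with standard-library behavior only; the string values are placeholders for `Annotation` objects.

```python
from collections import defaultdict

# Stand-in for Dataset.annotations, with strings in place of Annotation objects.
annotations: dict[str, list[str]] = defaultdict(list)
annotations["001"].append("ann_001")

# .get() does not trigger the default factory, so no key is created.
missing = annotations.get("999", [])
keys_after_get = sorted(annotations)

# Indexing does trigger it and stores an empty list under the new key.
_ = annotations["999"]
keys_after_index = sorted(annotations)

print(missing)           # []
print(keys_after_get)    # ['001']
print(keys_after_index)  # ['001', '999']
```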

num_images ¶

num_images() -> int

Get total number of images in the dataset.

Returns:

Type	Description
`int`	Number of images.

Example

dataset = Dataset(name="my_dataset")
# ... add images ...

print(f"Total images: {dataset.num_images()}")

Source code in boxlab/dataset/__init__.py

def num_images(self) -> int:
    """Get total number of images in the dataset.

    Returns:
        Number of images.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images ...

        print(f"Total images: {dataset.num_images()}")
        ```
    """
    return len(self.images)

num_annotations ¶

num_annotations() -> int

Get total number of annotations in the dataset.

Returns:

Type	Description
`int`	Total count of all annotations across all images.

Example

dataset = Dataset(name="my_dataset")
# ... add annotations ...

print(f"Total annotations: {dataset.num_annotations()}")

Source code in boxlab/dataset/__init__.py

def num_annotations(self) -> int:
    """Get total number of annotations in the dataset.

    Returns:
        Total count of all annotations across all images.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add annotations ...

        print(f"Total annotations: {dataset.num_annotations()}")
        ```
    """
    return sum(len(anns) for anns in self.annotations.values())

num_categories ¶

num_categories() -> int

Get total number of categories in the dataset.

Returns:

Type	Description
`int`	Number of unique categories.

Example

dataset = Dataset(name="my_dataset")
# ... add categories ...

print(f"Total categories: {dataset.num_categories()}")

Source code in boxlab/dataset/__init__.py

def num_categories(self) -> int:
    """Get total number of categories in the dataset.

    Returns:
        Number of unique categories.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add categories ...

        print(f"Total categories: {dataset.num_categories()}")
        ```
    """
    return len(self.categories)

get_statistics ¶

get_statistics(by_source: Literal[False] = False) -> DatasetStatistics

get_statistics(by_source: Literal[True]) -> dict[str, DatasetStatistics]

get_statistics(by_source: bool = False) -> DatasetStatistics | dict[str, DatasetStatistics]

Calculate dataset statistics.

Computes comprehensive statistics about the dataset including image counts, annotation counts, category distribution, and bounding box area metrics.

Parameters:

Name	Type	Description	Default
`by_source`	`bool`	If True, return statistics grouped by source name. If False, return overall statistics for the entire dataset.	`False`

Returns:

Type	Description
`DatasetStatistics \| dict[str, DatasetStatistics]`	If by_source is False: a DatasetStatistics object with overall stats. If by_source is True: a dict mapping source names to their respective DatasetStatistics objects.

Example

dataset = Dataset(name="my_dataset")
# ... add data ...

# Get overall statistics
stats = dataset.get_statistics()
print(f"Images: {stats['num_images']}")
print(f"Annotations: {stats['num_annotations']}")
print(
    "Avg annotations per image: "
    f"{stats['avg_annotations_per_image']:.2f}"
)

# Get statistics by source
stats_by_source = dataset.get_statistics(by_source=True)
for source, source_stats in stats_by_source.items():
    print(f"\nSource: {source}")
    print(f"  Images: {source_stats['num_images']}")
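
The three signatures listed for `get_statistics` follow the usual `typing.overload` pattern: two overloads tell a type checker which return type goes with which literal value of `by_source`, and the third is the runtime implementation. The sketch below shows the pattern on a free function; `Stats` is a hypothetical stand-in for `DatasetStatistics`, and the returned numbers are placeholder values, not real statistics.

```python
from typing import Literal, Union, overload

Stats = dict[str, int]  # hypothetical stand-in for DatasetStatistics


@overload
def get_statistics(by_source: Literal[False] = False) -> Stats: ...
@overload
def get_statistics(by_source: Literal[True]) -> dict[str, Stats]: ...
def get_statistics(by_source: bool = False) -> Union[Stats, dict[str, Stats]]:
    """Runtime implementation; the overloads above exist only for type checkers."""
    if by_source:
        # Placeholder per-source values for illustration.
        return {"camera_front": {"num_images": 2}, "camera_rear": {"num_images": 1}}
    return {"num_images": 3}


overall = get_statistics()                   # type checkers infer Stats
per_source = get_statistics(by_source=True)  # type checkers infer dict[str, Stats]

print(overall["num_images"])
print(sorted(per_source))
```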

Source code in boxlab/dataset/__init__.py

def get_statistics(
    self,
    by_source: bool = False,
) -> DatasetStatistics | dict[str, DatasetStatistics]:
    """Calculate dataset statistics.

    Computes comprehensive statistics about the dataset including image
    counts, annotation counts, category distribution, and bounding box area
    metrics.

    Args:
        by_source: If True, return statistics grouped by source name.
            If False, return overall statistics for the entire dataset.

    Returns:
        If by_source is False: A DatasetStatistics object with overall
            stats.
        If by_source is True: A dict mapping source names to their
            respective DatasetStatistics objects.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add data ...

        # Get overall statistics
        stats = dataset.get_statistics()
        print(f"Images: {stats['num_images']}")
        print(f"Annotations: {stats['num_annotations']}")
        print(
            "Avg annotations per image: "
            f"{stats['avg_annotations_per_image']:.2f}"
        )

        # Get statistics by source
        stats_by_source = dataset.get_statistics(by_source=True)
        for source, source_stats in stats_by_source.items():
            print(f"\nSource: {source}")
            print(f"  Images: {source_stats['num_images']}")
        ```
    """
    if by_source:
        return self._get_statistics_by_source()

    return self._calculate_statistics(self.images.keys())
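
The private helpers `_calculate_statistics` and `_get_statistics_by_source` are not shown on this page. As a rough sketch of what grouping by source could look like, the example below buckets image IDs by their recorded source and computes the same counts per bucket. Everything here is an assumption for illustration: the helper names, the `"unknown"` label for untracked images, and the restriction to three of the statistics keys used in the examples above.

```python
from collections import defaultdict

# Hypothetical stand-ins for Dataset.images, Dataset.annotations, Dataset.source_info.
image_ids = ["001", "002", "003"]
annotations = {"001": ["a", "b"], "002": ["c"], "003": []}
source_info = {"001": "camera_front", "002": "camera_front", "003": "camera_rear"}


def count_stats(ids):
    """Compute basic counts for the given image IDs."""
    ids = list(ids)
    num_annotations = sum(len(annotations.get(i, [])) for i in ids)
    return {
        "num_images": len(ids),
        "num_annotations": num_annotations,
        "avg_annotations_per_image": num_annotations / len(ids) if ids else 0.0,
    }


def stats_by_source():
    """Group image IDs by source, then compute counts for each group."""
    groups = defaultdict(list)
    for image_id in image_ids:
        # "unknown" is an assumed label for images without a tracked source.
        groups[source_info.get(image_id, "unknown")].append(image_id)
    return {source: count_stats(ids) for source, ids in groups.items()}


overall = count_stats(image_ids)
per_source = stats_by_source()
print(overall)
print(per_source)
```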

print_statistics ¶

print_statistics(by_source: bool = False) -> None

Print dataset statistics to console.

Parameters:

Name	Type	Description	Default
`by_source`	`bool`	If True, print statistics for each source separately. If False, print overall statistics.	`False`

Example

dataset = Dataset(name="my_dataset")
# ... add data ...

# Print overall statistics
dataset.print_statistics()

# Print per-source statistics
dataset.print_statistics(by_source=True)

Source code in boxlab/dataset/__init__.py

def print_statistics(self, by_source: bool = False) -> None:
    """Print dataset statistics to console.

    Args:
        by_source: If True, print statistics for each source separately.
            If False, print overall statistics.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add data ...

        # Print overall statistics
        dataset.print_statistics()

        # Print per-source statistics
        dataset.print_statistics(by_source=True)
        ```
    """
    if by_source:
        self._print_statistics_by_source()
    else:
        self._print_single_statistics(self.get_statistics(by_source=False), self.name)

merge ¶

merge(other: Dataset, resolve_conflicts: Literal['skip', 'rename', 'error'] = 'skip', preserve_sources: bool = True, regen_ids: bool = True) -> Dataset

Merge another dataset into a new dataset.

Creates a new dataset containing all images, annotations, and categories from both datasets. Handles category conflicts according to the specified resolution strategy.

Parameters:

Name	Type	Description	Default
`other`	`Dataset`	Another Dataset object to merge with this one.	required
`resolve_conflicts`	`Literal['skip', 'rename', 'error']`	Strategy for handling category name conflicts. "skip" (default) maps the conflicting category to the existing category ID; "rename" adds the category from the other dataset under a new ID with an `_other` suffix; "error" raises CategoryConflictError.	`'skip'`
`preserve_sources`	`bool`	If True, maintain source tracking information from both datasets.	`True`
`regen_ids`	`bool`	If True, regenerate image and annotation IDs to avoid conflicts. Recommended when merging datasets with overlapping IDs.	`True`

Returns:

Type	Description
`Dataset`	A new Dataset object containing the merged data.

Raises:

Type	Description
`CategoryConflictError`	If resolve_conflicts is "error" and a category name conflict is detected.
`DatasetError`	If category mapping fails during merge.

Example

from boxlab.dataset import Dataset

# Create two datasets
dataset_a = Dataset(name="dataset_a")
dataset_b = Dataset(name="dataset_b")

# ... populate datasets ...

# Merge with default settings
merged = dataset_a.merge(dataset_b)

# Merge with conflict resolution
merged = dataset_a.merge(
    dataset_b,
    resolve_conflicts="rename",
    preserve_sources=True,
)

# Using the + operator (equivalent to merge with defaults)
merged = dataset_a + dataset_b

Source code in boxlab/dataset/__init__.py

def merge(
    self,
    other: Dataset,
    resolve_conflicts: t.Literal["skip", "rename", "error"] = "skip",
    preserve_sources: bool = True,
    regen_ids: bool = True,
) -> Dataset:
    """Merge another dataset into a new dataset.

    Creates a new dataset containing all images, annotations, and
    categories from both datasets. Handles category conflicts according to
    the specified resolution strategy.

    Args:
        other: Another Dataset object to merge with this one.
        resolve_conflicts: Strategy for handling category name conflicts:
            - "skip": Use the existing category ID (default)
            - "rename": Rename the conflicting category from other dataset
            - "error": Raise CategoryConflictError
        preserve_sources: If True, maintain source tracking information
            from both datasets.
        regen_ids: If True, regenerate image and annotation IDs to avoid
            conflicts. Recommended when merging datasets with overlapping
            IDs.

    Returns:
        A new Dataset object containing the merged data.

    Raises:
        CategoryConflictError: If resolve_conflicts is "error" and a
            category name conflict is detected.
        DatasetError: If category mapping fails during merge.

    Example:
        ```python
        from boxlab.dataset import Dataset

        # Create two datasets
        dataset_a = Dataset(name="dataset_a")
        dataset_b = Dataset(name="dataset_b")

        # ... populate datasets ...

        # Merge with default settings
        merged = dataset_a.merge(dataset_b)

        # Merge with conflict resolution
        merged = dataset_a.merge(
            dataset_b,
            resolve_conflicts="rename",
            preserve_sources=True,
        )

        # Using the + operator (equivalent to merge with defaults)
        merged = dataset_a + dataset_b
        ```
    """
    logger.info(f"Merging '{other.name}' into '{self.name}'")

    merged = Dataset(name=f"{self.name}_merged")

    # Merge categories
    category_mapping: dict[int, int] = {}
    next_category_id = max(self.categories.keys()) + 1 if self.categories else 1

    for cat_id, cat_name in self.categories.items():
        merged.add_category(cat_id, cat_name)
        category_mapping[cat_id] = cat_id

    for cat_id, cat_name in other.categories.items():
        if cat_name in merged.category_name_to_id:
            if resolve_conflicts == "skip":
                category_mapping[cat_id] = merged.category_name_to_id[cat_name]
            elif resolve_conflicts == "rename":
                new_name = f"{cat_name}_other"
                merged.add_category(next_category_id, new_name)
                category_mapping[cat_id] = next_category_id
                next_category_id += 1
            elif resolve_conflicts == "error":
                raise CategoryConflictError(cat_name, f"Category name conflict: {cat_name}")
        else:
            merged.add_category(cat_id, cat_name)
            category_mapping[cat_id] = cat_id

    # Merge from self
    for img_id, img_info in self.images.items():
        new_iid = utils.gen_uid(prefix="img_") if regen_ids else img_id
        source = self.source_info.get(img_id, self.name) if preserve_sources else None

        # Create new ImageInfo with potentially new ID
        new_img_info = ImageInfo(
            image_id=new_iid,
            file_name=img_info.file_name,
            width=img_info.width,
            height=img_info.height,
            path=img_info.path,
        )
        merged.add_image(new_img_info, source_name=source)
        for ann in self.get_annotations(img_id):
            new_ann = Annotation(
                bbox=ann.bbox,
                category_id=ann.category_id,
                category_name=ann.category_name,
                image_id=new_iid,
                annotation_id=utils.gen_uid(prefix="ann_") if regen_ids else ann.annotation_id,
                area=ann.area,
                iscrowd=ann.iscrowd,
            )
            merged.add_annotation(new_ann)

    # Merge from other
    for img_id, img_info in other.images.items():
        new_iid = (
            utils.gen_uid(prefix="img_")
            if regen_ids
            else self._resolve_id_conflict(img_id, lambda x: x in merged.images)
        )
        new_img_info = ImageInfo(
            image_id=new_iid,
            file_name=img_info.file_name,
            width=img_info.width,
            height=img_info.height,
            path=img_info.path,
        )

        source = other.source_info.get(img_id, other.name) if preserve_sources else None
        merged.add_image(new_img_info, source_name=source)

        for ann in other.get_annotations(img_id):
            new_cat_name = merged.get_category_name(category_mapping[ann.category_id])
            if new_cat_name is None:
                raise DatasetError(f"Category mapping failed for {ann.category_id}")

            new_ann = Annotation(
                bbox=ann.bbox,
                category_id=category_mapping[ann.category_id],
                category_name=new_cat_name,
                image_id=new_iid,
                annotation_id=utils.gen_uid(prefix="ann_") if regen_ids else ann.annotation_id,
                area=ann.area,
                iscrowd=ann.iscrowd,
            )
            merged.add_annotation(new_ann)

    logger.info(f"Merge completed: {merged.num_images()} images")
    return merged
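
The three conflict strategies are easiest to compare on the category step alone. The sketch below re-creates that step with plain dictionaries, following the logic of the source above. The function name `merge_categories` and the exception class `CategoryConflict` are hypothetical stand-ins, and the returned mapping covers only the IDs of the second dataset.

```python
class CategoryConflict(Exception):
    """Hypothetical stand-in for CategoryConflictError."""


def merge_categories(ours, theirs, resolve_conflicts="skip"):
    """Return (merged categories, mapping from theirs' IDs to merged IDs)."""
    merged = dict(ours)
    name_to_id = {name: cat_id for cat_id, name in ours.items()}
    mapping = {}
    next_id = max(ours) + 1 if ours else 1

    for cat_id, name in theirs.items():
        if name in name_to_id:
            if resolve_conflicts == "skip":
                # Reuse the ID already assigned to this name.
                mapping[cat_id] = name_to_id[name]
            elif resolve_conflicts == "rename":
                # Keep both categories, the incoming one under a new ID and name.
                new_name = f"{name}_other"
                merged[next_id] = new_name
                name_to_id[new_name] = next_id
                mapping[cat_id] = next_id
                next_id += 1
            elif resolve_conflicts == "error":
                raise CategoryConflict(name)
        else:
            merged[cat_id] = name
            name_to_id[name] = cat_id
            mapping[cat_id] = cat_id
    return merged, mapping


ours = {1: "person", 2: "car"}
theirs = {1: "person", 7: "truck"}

skip_merged, skip_mapping = merge_categories(ours, theirs, "skip")
rename_merged, rename_mapping = merge_categories(ours, theirs, "rename")

print(skip_merged, skip_mapping)
print(rename_merged, rename_mapping)
```

With "skip", annotations from the second dataset labeled "person" end up under ID 1; with "rename", they move to a separate `person_other` category.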

split ¶

split(split_ratio: SplitRatio, seed: int | None = None) -> dict[str, list[str]]

Split dataset into train, validation, and test sets.

Randomly shuffles and divides the dataset images according to the specified ratios. Useful for creating training/validation/test splits for machine learning.

Parameters:

Name	Type	Description	Default
`split_ratio`	`SplitRatio`	SplitRatio object defining the proportions for train, validation, and test sets. Must sum to 1.0.	required
`seed`	`int \| None`	Optional random seed for reproducible splits. If None, the split will be non-deterministic.	`None`

Returns:

Type	Description
`dict[str, list[str]]`	Dictionary with keys "train", "val", and "test", each mapping to a list of image IDs in that split.

Raises:

Type	Description
`ValueError`	If split_ratio proportions don't sum to 1.0 (raised by split_ratio.validate()).

Example

from boxlab.dataset import Dataset
from boxlab.dataset.types import SplitRatio

dataset = Dataset(name="my_dataset")
# ... populate dataset ...

# Define split ratios: 70% train, 20% val, 10% test
split_ratio = SplitRatio(train=0.7, val=0.2, test=0.1)

# Split with fixed seed for reproducibility
splits = dataset.split(split_ratio, seed=42)

print(f"Train images: {len(splits['train'])}")
print(f"Val images: {len(splits['val'])}")
print(f"Test images: {len(splits['test'])}")

# Access specific split
train_image_ids = splits["train"]

Source code in boxlab/dataset/__init__.py

def split(self, split_ratio: SplitRatio, seed: int | None = None) -> dict[str, list[str]]:
    """Split dataset into train, validation, and test sets.

    Randomly shuffles and divides the dataset images according to the
    specified ratios. Useful for creating training/validation/test splits
    for machine learning.

    Args:
        split_ratio: SplitRatio object defining the proportions for train,
            validation, and test sets. Must sum to 1.0.
        seed: Optional random seed for reproducible splits. If None, the
            split will be non-deterministic.

    Returns:
        Dictionary with keys "train", "val", and "test", each mapping to
        a list of image IDs in that split.

    Raises:
        ValueError: If split_ratio proportions don't sum to 1.0 (raised by
            split_ratio.validate()).

    Example:
        ```python
        from boxlab.dataset import Dataset
        from boxlab.dataset.types import SplitRatio

        dataset = Dataset(name="my_dataset")
        # ... populate dataset ...

        # Define split ratios: 70% train, 20% val, 10% test
        split_ratio = SplitRatio(train=0.7, val=0.2, test=0.1)

        # Split with fixed seed for reproducibility
        splits = dataset.split(split_ratio, seed=42)

        print(f"Train images: {len(splits['train'])}")
        print(f"Val images: {len(splits['val'])}")
        print(f"Test images: {len(splits['test'])}")

        # Access specific split
        train_image_ids = splits["train"]
        ```
    """
    split_ratio.validate()

    image_ids = list(self.images.keys())
    if seed is not None:
        random.seed(seed)
    random.shuffle(image_ids)

    total = len(image_ids)
    train_end = int(total * split_ratio.train)
    val_end = train_end + int(total * split_ratio.val)

    return {
        "train": image_ids[:train_end],
        "val": image_ids[train_end:val_end],
        "test": image_ids[val_end:],
    }
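The slicing arithmetic in the source above truncates each boundary with `int()`, so any images left over after truncation end up in the test split. The sketch below reproduces that arithmetic on a plain list of IDs. `split_ids` is a hypothetical helper, not part of BoxLab, and it deliberately differs from the method in one respect: it shuffles with a private `random.Random` instance, whereas the method above calls `random.seed(seed)` and therefore reseeds Python's global `random` module.

```python
import random


def split_ids(ids, train, val, seed=None):
    """Hypothetical stand-in that mirrors the slicing in Dataset.split."""
    shuffled = list(ids)
    # A local generator leaves the global random state untouched
    # (the library method calls random.seed instead).
    random.Random(seed).shuffle(shuffled)

    total = len(shuffled)
    train_end = int(total * train)          # truncates toward zero
    val_end = train_end + int(total * val)  # truncates toward zero

    return {
        "train": shuffled[:train_end],
        "val": shuffled[train_end:val_end],
        "test": shuffled[val_end:],         # receives any remainder
    }


ids = [f"img_{i}" for i in range(10)]
splits = split_ids(ids, train=0.5, val=0.25, seed=42)
sizes = {name: len(members) for name, members in splits.items()}
print(sizes)  # {'train': 5, 'val': 2, 'test': 3}
```

With 10 images and ratios 0.5/0.25/0.25, the validation split gets `int(2.5) == 2` images and the test split gets the remaining 3, so small datasets can end up with a test share slightly larger than the requested ratio.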

visualize_sample ¶

visualize_sample(image_id: str, figsize: tuple[int, int] = (12, 8), show_labels: bool = True, save_path: Path | None = None) -> None

Visualize a single image with its annotations.

Displays the image with bounding boxes and category labels overlaid. Each category is assigned a unique color, and annotations are drawn as rectangles with optional text labels.

Parameters:

Name	Type	Description	Default
`image_id`	`str`	The unique identifier of the image to visualize.	required
`figsize`	`tuple[int, int]`	Tuple of (width, height) for the matplotlib figure size. Defaults to (12, 8).	`(12, 8)`
`show_labels`	`bool`	If True, display category names above bounding boxes. Defaults to True.	`True`
`save_path`	`Path \| None`	Optional path to save the visualization as an image file. If None, only displays the plot.	`None`

Raises:

Type	Description
`DatasetError`	If the image is not found or has no path defined.

Example

from pathlib import Path
from boxlab.dataset import Dataset

dataset = Dataset(name="my_dataset")
# ... populate dataset ...

# Display a sample image
dataset.visualize_sample("001")

# Save visualization to file
dataset.visualize_sample(
    "001",
    figsize=(16, 10),
    show_labels=True,
    save_path=Path("output/sample_001.png"),
)

Source code in boxlab/dataset/__init__.py

def visualize_sample(
    self,
    image_id: str,
    figsize: tuple[int, int] = (12, 8),
    show_labels: bool = True,
    save_path: pathlib.Path | None = None,
) -> None:
    """Visualize a single image with its annotations.

    Displays the image with bounding boxes and category labels overlaid.
    Each category is assigned a unique color, and annotations are drawn
    as rectangles with optional text labels.

    Args:
        image_id: The unique identifier of the image to visualize.
        figsize: Tuple of (width, height) for the matplotlib figure size.
            Defaults to (12, 8).
        show_labels: If True, display category names above bounding boxes.
            Defaults to True.
        save_path: Optional path to save the visualization as an image file.
            If None, only displays the plot.

    Raises:
        DatasetError: If the image is not found or has no path defined.

    Example:
        ```python
        from pathlib import Path
        from boxlab.dataset import Dataset

        dataset = Dataset(name="my_dataset")
        # ... populate dataset ...

        # Display a sample image
        dataset.visualize_sample("001")

        # Save visualization to file
        dataset.visualize_sample(
            "001",
            figsize=(16, 10),
            show_labels=True,
            save_path=Path("output/sample_001.png"),
        )
        ```
    """
    img_info = self.get_image(image_id)
    if img_info is None or img_info.path is None:
        raise DatasetError(f"Image {image_id} not found or has no path")

    img = Image.open(img_info.path)
    anns = self.get_annotations(image_id)
    source = self.get_image_source(image_id)

    _fig, ax = plt.subplots(1, figsize=figsize)
    ax.imshow(img)

    colors = plt.cm.rainbow(np.linspace(0, 1, self.num_categories()))  # type: ignore
    category_colors = {cat_id: colors[i] for i, cat_id in enumerate(self.categories.keys())}

    for ann in anns:
        bbox = ann.bbox
        color = category_colors[ann.category_id]

        rect = patches.Rectangle(
            (bbox.x_min, bbox.y_min),
            bbox.x_max - bbox.x_min,
            bbox.y_max - bbox.y_min,
            linewidth=2,
            edgecolor=color,
            facecolor="none",
        )
        ax.add_patch(rect)

        if show_labels:
            ax.text(
                bbox.x_min,
                bbox.y_min - 5,
                ann.category_name,
                color="white",
                fontsize=10,
                bbox={"facecolor": color, "alpha": 0.7, "edgecolor": "none", "pad": 2},
            )

    ax.axis("off")
    title = f"Image: {img_info.file_name} | Annotations: {len(anns)}"
    if source:
        title += f" | Source: {source}"
    ax.set_title(title)
    plt.tight_layout()

    if save_path:
        plt.savefig(save_path, bbox_inches="tight", dpi=150)
        logger.info(f"Visualization saved to: {save_path}")

    plt.show()
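The drawing loop above converts each box from corner form (`x_min`, `y_min`, `x_max`, `y_max`) into the anchor-plus-size form that matplotlib's `patches.Rectangle` expects, and places the label 5 pixels above the top edge. A minimal sketch of that conversion follows; the `Box` named tuple and the `to_rectangle_args` and `label_position` helpers are hypothetical stand-ins for illustration, not BoxLab APIs.

```python
from typing import NamedTuple


class Box(NamedTuple):
    """Hypothetical corner-format box, standing in for BoxLab's BBox."""

    x_min: float
    y_min: float
    x_max: float
    y_max: float


def to_rectangle_args(box):
    """Return ((x, y), width, height) as used for patches.Rectangle."""
    return (box.x_min, box.y_min), box.x_max - box.x_min, box.y_max - box.y_min


def label_position(box, offset=5):
    """Return the text anchor placed `offset` pixels above the top-left corner."""
    return box.x_min, box.y_min - offset


box = Box(x_min=10, y_min=20, x_max=100, y_max=150)
anchor, width, height = to_rectangle_args(box)
print(anchor, width, height)  # (10, 20) 90 130
print(label_position(box))    # (10, 15)
```

Because image coordinates grow downward, subtracting from `y_min` moves the label up, which is why a box touching the top edge of the image can have its label drawn outside the visible area.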

visualize_category_distribution ¶

visualize_category_distribution(figsize: tuple[int, int] = (12, 6), save_path: Path | None = None) -> None

Visualize category distribution as a bar chart.

Creates a bar chart showing the number of annotations for each category in the dataset. Useful for understanding class balance and distribution.

Parameters:

Name	Type	Description	Default
`figsize`	`tuple[int, int]`	Tuple of (width, height) for the matplotlib figure size. Defaults to (12, 6).	`(12, 6)`
`save_path`	`Path \| None`	Optional path to save the visualization as an image file. If None, only displays the plot.	`None`

Example

from pathlib import Path
from boxlab.dataset import Dataset

dataset = Dataset(name="my_dataset")
# ... populate dataset ...

# Display category distribution
dataset.visualize_category_distribution()

# Save to file
dataset.visualize_category_distribution(
    figsize=(16, 8),
    save_path=Path("output/category_distribution.png"),
)

Source code in boxlab/dataset/__init__.py

def visualize_category_distribution(
    self,
    figsize: tuple[int, int] = (12, 6),
    save_path: pathlib.Path | None = None,
) -> None:
    """Visualize category distribution as a bar chart.

    Creates a bar chart showing the number of annotations for each category
    in the dataset. Useful for understanding class balance and distribution.

    Args:
        figsize: Tuple of (width, height) for the matplotlib figure size.
            Defaults to (12, 6).
        save_path: Optional path to save the visualization as an image file.
            If None, only displays the plot.

    Example:
        ```python
        from pathlib import Path
        from boxlab.dataset import Dataset

        dataset = Dataset(name="my_dataset")
        # ... populate dataset ...

        # Display category distribution
        dataset.visualize_category_distribution()

        # Save to file
        dataset.visualize_category_distribution(
            figsize=(16, 8),
            save_path=Path("output/category_distribution.png"),
        )
        ```
    """
    logger.info(f"Visualizing category distribution for dataset: {self.name}")

    stats = self.get_statistics(by_source=False)
    cat_dist = stats["category_distribution"]

    if not cat_dist:
        logger.warning("No categories to visualize")
        return

    categories = list(cat_dist.keys())
    counts = list(map(float, cat_dist.values()))

    _fig, ax = plt.subplots(1, figsize=figsize)
    bars = ax.bar(categories, counts, color="skyblue", edgecolor="navy", alpha=0.7)

    for bar in bars:
        height = bar.get_height()
        ax.text(
            bar.get_x() + bar.get_width() / 2,
            height,
            f"{int(height)}",
            ha="center",
            va="bottom",
            fontsize=10,
        )

    ax.set_xlabel("Category", fontsize=12)
    ax.set_ylabel("Count", fontsize=12)
    ax.set_title(f"Category Distribution - {self.name}", fontsize=14, fontweight="bold")
    ax.tick_params(axis="x", rotation=45)
    plt.tight_layout()

    if save_path:
        plt.savefig(save_path, bbox_inches="tight", dpi=150)
        logger.info(f"Category distribution saved to: {save_path}")

    plt.show()
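The chart above plots the `category_distribution` mapping returned by `get_statistics()`, whose implementation is not shown in this section. As a rough illustration of the data preparation, the sketch below assumes that mapping goes from category name to annotation count and builds an equivalent tally with `collections.Counter`; the sample annotation pairs are made up for the example.

```python
from collections import Counter

# Hypothetical annotations as (image_id, category_name) pairs.
annotations = [
    ("001", "person"),
    ("001", "car"),
    ("002", "person"),
    ("003", "person"),
]

# Tally annotations per category (assumed shape of category_distribution).
cat_dist = Counter(name for _, name in annotations)

# Same preparation steps as the method: x labels, float heights, bar labels.
categories = list(cat_dist.keys())
counts = list(map(float, cat_dist.values()))
bar_labels = [f"{int(height)}" for height in counts]

print(categories)  # ['person', 'car']
print(counts)      # [3.0, 1.0]
print(bar_labels)  # ['3', '1']
```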

__add__ ¶

__add__(other: object) -> Dataset

Enable merging datasets using the + operator.

Parameters:

Name	Type	Description	Default
`other`	`object`	Another Dataset object to merge.	required

Returns:

Type	Description
`Dataset`	A new merged Dataset.

Raises:

Type	Description
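`TypeError`	Raised by Python when other is not a Dataset instance: the method returns NotImplemented, and the interpreter raises TypeError if no reflected operation handles the operand.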

Example

from boxlab.dataset import Dataset

dataset_a = Dataset(name="dataset_a")
dataset_b = Dataset(name="dataset_b")

# Merge using + operator
merged = dataset_a + dataset_b

Source code in boxlab/dataset/__init__.py

def __add__(self, other: object) -> Dataset:
    """Enable merging datasets using the + operator.

    Args:
        other: Another Dataset object to merge.

    Returns:
        A new merged Dataset.

    Raises:
        TypeError: If other is not a Dataset instance.

    Example:
        ```python
        dataset_a = Dataset(name="dataset_a")
        dataset_b = Dataset(name="dataset_b")

        # Merge using + operator
        merged = dataset_a + dataset_b
        ```
    """
    if not isinstance(other, Dataset):
        return NotImplemented
    return self.merge(other)
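The method above never raises `TypeError` itself. It returns `NotImplemented`, which tells Python to try the other operand's reflected `__radd__`; only when that also fails does the interpreter raise the error. The sketch below shows the same pattern on a hypothetical `Bag` class, which stands in for `Dataset` and is not part of BoxLab.

```python
class Bag:
    """Hypothetical container used only to illustrate the operator protocol."""

    def __init__(self, items):
        self.items = list(items)

    def merge(self, other):
        # Return a new object; neither operand is modified.
        return Bag(self.items + other.items)

    def __add__(self, other):
        if not isinstance(other, Bag):
            # Defer to Python's operator dispatch instead of raising here.
            return NotImplemented
        return self.merge(other)


merged = Bag([1, 2]) + Bag([3])
print(merged.items)  # [1, 2, 3]

try:
    Bag([1]) + 5
except TypeError as exc:
    error_message = str(exc)
    print(error_message)
```

Returning `NotImplemented` rather than raising directly leaves room for another type to support the operation through its own `__radd__`.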

__len__ ¶

__len__() -> int

Return the number of images in the dataset.

Returns:

Type	Description
`int`	Number of images.

Example

from boxlab.dataset import Dataset

dataset = Dataset(name="my_dataset")
# ... add images ...

print(f"Dataset contains {len(dataset)} images")

Source code in boxlab/dataset/__init__.py

def __len__(self) -> int:
    """Return the number of images in the dataset.

    Returns:
        Number of images.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images ...

        print(f"Dataset contains {len(dataset)} images")
        ```
    """
    return self.num_images()
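One side effect of defining `__len__` is worth knowing: when a class has no `__bool__` method, Python uses `__len__` for truth testing, so an object with zero items is falsy. The sketch below demonstrates this on a hypothetical `Collection` class. Whether `Dataset` behaves the same way depends on it not defining `__bool__`, which this section does not show.

```python
class Collection:
    """Hypothetical stand-in; not BoxLab's Dataset."""

    def __init__(self):
        self.images = {}

    def num_images(self):
        return len(self.images)

    def __len__(self):
        # Delegates to num_images, as the method above does.
        return self.num_images()


empty = Collection()
filled = Collection()
filled.images["001"] = "image1.jpg"

print(len(empty), bool(empty))    # 0 False
print(len(filled), bool(filled))  # 1 True
```

Under that assumption, a check such as `if dataset is None:` is safer than `if not dataset:` when the goal is to tell a missing dataset apart from an empty one.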


Overview¶

The Dataset class is the core component of BoxLab's dataset management system. It provides comprehensive functionality for managing object detection datasets, including loading, exporting, merging, and analyzing data from multiple sources.

Key Features¶

Multi-format Support: Load and export datasets in COCO, YOLO, and other formats
Multi-source Management: Track and manage data from multiple sources
Category Management: Handle categories with conflict resolution
Dataset Operations: Split, merge, and transform datasets
Statistics & Visualization: Comprehensive dataset analysis and visualization tools
Flexible Architecture: Plugin-based system for extending functionality

Quick Start¶

import pathlib

from boxlab.dataset import Dataset
from boxlab.dataset.types import ImageInfo, Annotation, BBox

# Create a new dataset
dataset = Dataset(name="my_dataset")

# Add categories
dataset.add_category(1, "person")
dataset.add_category(2, "car")

# Add an image
img_info = ImageInfo(
    image_id="001",
    file_name="image1.jpg",
    width=640,
    height=480,
    path=pathlib.Path("/path/to/image1.jpg"),
)
dataset.add_image(img_info, source_name="camera1")

# Add an annotation
annotation = Annotation(
    bbox=BBox(x_min=10, y_min=20, x_max=100, y_max=150),
    category_id=1,
    category_name="person",
    image_id="001",
    annotation_id="ann_001",
)
dataset.add_annotation(annotation)

# Get statistics
stats = dataset.get_statistics()
print(f"Total images: {stats['num_images']}")
print(f"Total annotations: {stats['num_annotations']}")

Category Management¶

Manage object categories in your dataset:

add_category(): Add a new category
get_category_name(): Retrieve category name by ID
get_category_id(): Retrieve category ID by name
fix_duplicate_categories(): Resolve duplicate category names

Image Management¶

Handle image metadata and sources:

add_image(): Add image with optional source tracking
get_image(): Retrieve image information
get_image_source(): Get source name for an image
get_sources(): List all unique data sources

Annotation Management¶

Work with object annotations:

add_annotation(): Add bounding box annotation
get_annotations(): Get all annotations for an image

Statistics & Analysis¶

Analyze your dataset:

get_statistics(): Compute comprehensive statistics
print_statistics(): Display statistics in console
num_images(): Get image count
num_annotations(): Get annotation count
num_categories(): Get category count

Dataset Operations¶

Transform and combine datasets:

split(): Split dataset into train/val/test sets
merge(): Combine multiple datasets
__add__(): Merge using + operator

Visualization¶

Visualize dataset content:

visualize_sample(): Display image with annotations
visualize_category_distribution(): Show category balance
