Dataset

Dataset(name: str = 'dataset')

Base class for dataset management with multi-source support.

This class loads, exports, merges, and analyzes object detection datasets. It tracks which source each image came from, supports flexible file naming strategies, and reports statistics for the whole dataset or per source.

Features
  • Load from COCO or YOLO format
  • Export to COCO or YOLO format
  • Multi-source dataset management
  • Flexible file naming strategies
  • Comprehensive statistics including per-source analysis
  • Category management with conflict resolution
  • Dataset splitting and merging
  • Visualization tools for samples and statistics

Parameters:

Name Type Description Default
name str

Dataset name identifier. Defaults to "dataset".

'dataset'

Attributes:

Name Type Description
name str

The dataset name identifier.

images dict[str, ImageInfo]

Mapping of image IDs to image information.

annotations dict[str, list[Annotation]]

Mapping of image IDs to their annotations.

categories dict[int, str]

Mapping of category IDs to category names.

category_name_to_id dict[str, int]

Reverse mapping of category names to IDs.

source_info dict[str, str]

Mapping of image IDs to their source names.

Example

Basic dataset creation and usage:

from boxlab.dataset import Dataset
from boxlab.dataset.types import ImageInfo, Annotation, BBox

# Create a new dataset
dataset = Dataset(name="my_dataset")

# Add categories
dataset.add_category(1, "person")
dataset.add_category(2, "car")

# Add an image
img_info = ImageInfo(
    image_id="001",
    file_name="image1.jpg",
    width=640,
    height=480,
    path="/path/to/image1.jpg",
)
dataset.add_image(img_info, source_name="camera1")

# Add an annotation
annotation = Annotation(
    bbox=BBox(x_min=10, y_min=20, x_max=100, y_max=150),
    category_id=1,
    category_name="person",
    image_id="001",
    annotation_id="ann_001",
)
dataset.add_annotation(annotation)

# Get statistics
stats = dataset.get_statistics()
print(f"Total images: {stats['num_images']}")
print(f"Total annotations: {stats['num_annotations']}")
Example

Merging multiple datasets:

from boxlab.dataset import Dataset

# Create two datasets
dataset1 = Dataset(name="dataset_a")
dataset2 = Dataset(name="dataset_b")

# ... add data to both datasets ...

# Merge datasets
merged = dataset1.merge(
    dataset2,
    resolve_conflicts="skip",
    preserve_sources=True,
)

# Or use the + operator
merged = dataset1 + dataset2
Source code in boxlab/dataset/__init__.py
def __init__(self, name: str = "dataset") -> None:
    self.name = name
    self.images: dict[str, ImageInfo] = {}
    self.annotations: dict[str, list[Annotation]] = coll.defaultdict(list)
    self.categories: dict[int, str] = {}
    self.category_name_to_id: dict[str, int] = {}
    self.source_info: dict[str, str] = {}  # image_id -> source_name
    logger.debug(f"Initialized dataset: {name}")

Functions

add_category

add_category(category_id: int, category_name: str) -> None

Add a category to the dataset.

Parameters:

Name Type Description Default
category_id int

Unique identifier for the category.

required
category_name str

Human-readable name for the category.

required
Example
dataset = Dataset(name="my_dataset")
dataset.add_category(1, "person")
dataset.add_category(2, "car")
dataset.add_category(3, "bicycle")
Source code in boxlab/dataset/__init__.py
def add_category(self, category_id: int, category_name: str) -> None:
    """Add a category to the dataset.

    Args:
        category_id: Unique identifier for the category.
        category_name: Human-readable name for the category.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        dataset.add_category(1, "person")
        dataset.add_category(2, "car")
        dataset.add_category(3, "bicycle")
        ```
    """
    self.categories[category_id] = category_name
    self.category_name_to_id[category_name] = category_id
    logger.debug(f"Added category: {category_id} -> {category_name}")
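
Because `add_category` writes to both mappings unconditionally, registering the same name under a second ID leaves two entries in `categories`, while `category_name_to_id` keeps only the most recent ID. The sketch below reproduces that behaviour with plain dictionaries; `register` is a hypothetical stand-in for the method and is not part of boxlab.

```python
# Standalone sketch: plain dicts stand in for Dataset.categories and
# Dataset.category_name_to_id. `register` is a hypothetical helper that
# mirrors the two assignments made by add_category.
categories: dict[int, str] = {}
category_name_to_id: dict[str, int] = {}


def register(category_id: int, category_name: str) -> None:
    categories[category_id] = category_name
    category_name_to_id[category_name] = category_id


register(1, "person")
register(2, "car")
register(5, "person")  # same name, different ID

print(categories)           # {1: 'person', 2: 'car', 5: 'person'}
print(category_name_to_id)  # {'person': 5, 'car': 2}
```

This is the kind of state that `fix_duplicate_categories` is described as repairing.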

get_category_name

get_category_name(category_id: int) -> str | None

Get category name by ID.

Parameters:

Name Type Description Default
category_id int

The category ID to look up.

required

Returns:

Type Description
str | None

The category name if found, None otherwise.

Example
dataset = Dataset(name="my_dataset")
dataset.add_category(1, "person")

name = dataset.get_category_name(1)
print(name)  # Output: "person"
Source code in boxlab/dataset/__init__.py
def get_category_name(self, category_id: int) -> str | None:
    """Get category name by ID.

    Args:
        category_id: The category ID to look up.

    Returns:
        The category name if found, None otherwise.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        dataset.add_category(1, "person")

        name = dataset.get_category_name(1)
        print(name)  # Output: "person"
        ```
    """
    return self.categories.get(category_id)

get_category_id

get_category_id(category_name: str) -> int | None

Get category ID by name.

Parameters:

Name Type Description Default
category_name str

The category name to look up.

required

Returns:

Type Description
int | None

The category ID if found, None otherwise.

Example
dataset = Dataset(name="my_dataset")
dataset.add_category(1, "person")

cat_id = dataset.get_category_id("person")
print(cat_id)  # Output: 1
Source code in boxlab/dataset/__init__.py
def get_category_id(self, category_name: str) -> int | None:
    """Get category ID by name.

    Args:
        category_name: The category name to look up.

    Returns:
        The category ID if found, None otherwise.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        dataset.add_category(1, "person")

        cat_id = dataset.get_category_id("person")
        print(cat_id)  # Output: 1
        ```
    """
    return self.category_name_to_id.get(category_name)

fix_duplicate_categories

fix_duplicate_categories() -> dict[int, int]

Fix duplicate category names by merging them.

When multiple category IDs map to the same category name, this method consolidates them into a single canonical ID (the smallest ID) and remaps all affected annotations.

Returns:

Type Description
dict[int, int]

A mapping from old category IDs to new (canonical) category IDs.

Example
dataset = Dataset(name="my_dataset")
# Suppose categories were loaded with duplicates:
# {1: "person", 2: "car", 5: "person"}

mapping = dataset.fix_duplicate_categories()
# mapping = {1: 1, 2: 2, 5: 1}
# Now only categories {1: "person", 2: "car"} remain
Source code in boxlab/dataset/__init__.py
def fix_duplicate_categories(self) -> dict[int, int]:
    """Fix duplicate category names by merging them.

    When multiple category IDs map to the same category name, this method
    consolidates them into a single canonical ID (the smallest ID) and
    remaps all affected annotations.

    Returns:
        A mapping from old category IDs to new (canonical) category IDs.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # Suppose categories were loaded with duplicates:
        # {1: "person", 2: "car", 5: "person"}

        mapping = dataset.fix_duplicate_categories()
        # mapping = {1: 1, 2: 2, 5: 1}
        # Now only categories {1: "person", 2: "car"} remain
        ```
    """
    logger.info(f"Fixing duplicate categories in dataset: {self.name}")

    name_to_ids: dict[str, list[int]] = coll.defaultdict(list)
    for cat_id, cat_name in self.categories.items():
        name_to_ids[cat_name].append(cat_id)

    category_mapping: dict[int, int] = {}
    duplicates_found = 0

    for cat_name, cat_ids in name_to_ids.items():
        if len(cat_ids) > 1:
            cat_ids.sort()
            canonical_id = cat_ids[0]
            duplicates_found += len(cat_ids) - 1

            logger.warning(
                f"Found duplicate category '{cat_name}' with IDs {cat_ids}, "
                f"keeping ID {canonical_id}"
            )

            for cat_id in cat_ids:
                category_mapping[cat_id] = canonical_id
        else:
            category_mapping[cat_ids[0]] = cat_ids[0]

    if duplicates_found == 0:
        logger.info("No duplicate categories found")
        return category_mapping

    # Rebuild categories
    new_categories: dict[int, str] = {}
    new_category_name_to_id: dict[str, int] = {}

    for cat_name, cat_ids in name_to_ids.items():
        canonical_id = min(cat_ids)
        new_categories[canonical_id] = cat_name
        new_category_name_to_id[cat_name] = canonical_id

    self.categories = new_categories
    self.category_name_to_id = new_category_name_to_id

    # Remap annotations
    annotations_updated = 0
    for img_id in list(self.annotations.keys()):
        new_anns = []
        for ann in self.annotations[img_id]:
            new_cat_id = category_mapping[ann.category_id]
            if ann.category_id != new_cat_id:
                new_ann = Annotation(
                    bbox=ann.bbox,
                    category_id=new_cat_id,
                    category_name=ann.category_name,
                    image_id=ann.image_id,
                    annotation_id=ann.annotation_id,
                    area=ann.area,
                    iscrowd=ann.iscrowd,
                )
                new_anns.append(new_ann)
                annotations_updated += 1
            else:
                new_anns.append(ann)

        self.annotations[img_id] = new_anns

    logger.info(
        f"Fixed {duplicates_found} duplicate categories, "
        f"updated {annotations_updated} annotations"
    )

    return category_mapping
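
The consolidation step can be tried in isolation. The following is a minimal sketch of the same idea (group IDs by name, keep the smallest ID as canonical) using plain dictionaries; `canonical_mapping` is a hypothetical helper written for illustration, not a boxlab function.

```python
from collections import defaultdict


def canonical_mapping(categories: dict[int, str]) -> dict[int, int]:
    """Map every category ID to the smallest ID that shares its name."""
    name_to_ids: dict[str, list[int]] = defaultdict(list)
    for cat_id, cat_name in categories.items():
        name_to_ids[cat_name].append(cat_id)

    mapping: dict[int, int] = {}
    for cat_ids in name_to_ids.values():
        canonical_id = min(cat_ids)
        for cat_id in cat_ids:
            mapping[cat_id] = canonical_id
    return mapping


categories = {1: "person", 2: "car", 5: "person"}
mapping = canonical_mapping(categories)
print(mapping)  # {1: 1, 5: 1, 2: 2}

# Remap a list of annotation category IDs with the result.
annotation_category_ids = [5, 2, 1, 5]
remapped = [mapping[cat_id] for cat_id in annotation_category_ids]
print(remapped)  # [1, 2, 1, 1]
```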

add_image

add_image(image_info: ImageInfo, source_name: str | None = None) -> None

Add image metadata to dataset with optional source tracking.

Parameters:

Name Type Description Default
image_info ImageInfo

ImageInfo object containing image metadata.

required
source_name str | None

Optional name of the data source. Useful for tracking which dataset or camera the image came from.

None
Example
from boxlab.dataset import Dataset
from boxlab.dataset.types import ImageInfo

dataset = Dataset(name="my_dataset")

img_info = ImageInfo(
    image_id="001",
    file_name="image1.jpg",
    width=1920,
    height=1080,
    path="/data/images/image1.jpg",
)

dataset.add_image(img_info, source_name="camera_front")
Source code in boxlab/dataset/__init__.py
def add_image(self, image_info: ImageInfo, source_name: str | None = None) -> None:
    """Add image metadata to dataset with optional source tracking.

    Args:
        image_info: ImageInfo object containing image metadata.
        source_name: Optional name of the data source. Useful for tracking
            which dataset or camera the image came from.

    Example:
        ```python
        from boxlab.dataset import Dataset
        from boxlab.dataset.types import ImageInfo

        dataset = Dataset(name="my_dataset")

        img_info = ImageInfo(
            image_id="001",
            file_name="image1.jpg",
            width=1920,
            height=1080,
            path="/data/images/image1.jpg",
        )

        dataset.add_image(img_info, source_name="camera_front")
        ```
    """
    self.images[image_info.image_id] = image_info
    if source_name:
        self.source_info[image_info.image_id] = source_name
    logger.debug(f"Added image: {image_info.image_id} from source: {source_name}")

get_image

get_image(image_id: str) -> ImageInfo | None

Get image information by ID.

Parameters:

Name Type Description Default
image_id str

The unique identifier of the image.

required

Returns:

Type Description
ImageInfo | None

ImageInfo object if found, None otherwise.

Example
dataset = Dataset(name="my_dataset")
# ... add images ...

img_info = dataset.get_image("001")
if img_info:
    print(f"Image: {img_info.file_name}")
    print(f"Size: {img_info.width}x{img_info.height}")
Source code in boxlab/dataset/__init__.py
def get_image(self, image_id: str) -> ImageInfo | None:
    """Get image information by ID.

    Args:
        image_id: The unique identifier of the image.

    Returns:
        ImageInfo object if found, None otherwise.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images ...

        img_info = dataset.get_image("001")
        if img_info:
            print(f"Image: {img_info.file_name}")
            print(f"Size: {img_info.width}x{img_info.height}")
        ```
    """
    return self.images.get(image_id)

get_image_source

get_image_source(image_id: str) -> str | None

Get the source name for an image.

Parameters:

Name Type Description Default
image_id str

The unique identifier of the image.

required

Returns:

Type Description
str | None

Source name if tracked, None otherwise.

Example
dataset = Dataset(name="my_dataset")
# ... add images with sources ...

source = dataset.get_image_source("001")
print(f"Image source: {source}")  # Output: "camera_front"
Source code in boxlab/dataset/__init__.py
def get_image_source(self, image_id: str) -> str | None:
    """Get the source name for an image.

    Args:
        image_id: The unique identifier of the image.

    Returns:
        Source name if tracked, None otherwise.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images with sources ...

        source = dataset.get_image_source("001")
        print(f"Image source: {source}")  # Output: "camera_front"
        ```
    """
    return self.source_info.get(image_id)

get_sources

get_sources() -> set[str]

Get all unique source names in the dataset.

Returns:

Type Description
set[str]

Set of unique source names.

Example
dataset = Dataset(name="my_dataset")
# ... add images from multiple sources ...

sources = dataset.get_sources()
print(f"Data sources: {sources}")
# Output: {'camera_front', 'camera_rear', 'dataset_a'}
Source code in boxlab/dataset/__init__.py
def get_sources(self) -> set[str]:
    """Get all unique source names in the dataset.

    Returns:
        Set of unique source names.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images from multiple sources ...

        sources = dataset.get_sources()
        print(f"Data sources: {sources}")
        # Output: {'camera_front', 'camera_rear', 'dataset_a'}
        ```
    """
    return set(self.source_info.values())
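
As the `add_image` listing shows, an image is recorded in `source_info` only when `source_name` is truthy, so images added without a source do not contribute to `get_sources()`. A minimal sketch with plain dictionaries follows; `add` is a hypothetical stand-in for `add_image`, and the per-source count is an illustration rather than a boxlab API.

```python
from collections import Counter
from typing import Optional

images: dict[str, str] = {}       # image_id -> file name (stand-in for ImageInfo)
source_info: dict[str, str] = {}  # image_id -> source name


def add(image_id: str, file_name: str, source_name: Optional[str] = None) -> None:
    images[image_id] = file_name
    if source_name:  # None and "" are both skipped
        source_info[image_id] = source_name


add("001", "a.jpg", source_name="camera_front")
add("002", "b.jpg", source_name="camera_front")
add("003", "c.jpg", source_name="camera_rear")
add("004", "d.jpg")  # no source recorded

sources = set(source_info.values())
per_source = Counter(source_info.values())
untracked = [image_id for image_id in images if image_id not in source_info]

print(sorted(sources))  # ['camera_front', 'camera_rear']
print(per_source)       # Counter({'camera_front': 2, 'camera_rear': 1})
print(untracked)        # ['004']
```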

add_annotation

add_annotation(annotation: Annotation) -> None

Add annotation to dataset.

Parameters:

Name Type Description Default
annotation Annotation

Annotation object containing bounding box and category info.

required
Example
from boxlab.dataset import Dataset
from boxlab.dataset.types import Annotation, BBox

dataset = Dataset(name="my_dataset")

annotation = Annotation(
    bbox=BBox(x_min=100, y_min=50, x_max=200, y_max=150),
    category_id=1,
    category_name="person",
    image_id="001",
    annotation_id="ann_001",
)

dataset.add_annotation(annotation)
Source code in boxlab/dataset/__init__.py
def add_annotation(self, annotation: Annotation) -> None:
    """Add annotation to dataset.

    Args:
        annotation: Annotation object containing bounding box and category
            info.

    Example:
        ```python
        from boxlab.dataset import Dataset
        from boxlab.dataset.types import Annotation, BBox

        dataset = Dataset(name="my_dataset")

        annotation = Annotation(
            bbox=BBox(x_min=100, y_min=50, x_max=200, y_max=150),
            category_id=1,
            category_name="person",
            image_id="001",
            annotation_id="ann_001",
        )

        dataset.add_annotation(annotation)
        ```
    """
    self.annotations[annotation.image_id].append(annotation)

get_annotations

get_annotations(image_id: str) -> list[Annotation]

Get all annotations for a specific image.

Parameters:

Name Type Description Default
image_id str

The unique identifier of the image.

required

Returns:

Type Description
list[Annotation]

List of Annotation objects for the specified image. Returns an empty list if no annotations are found.

Example
dataset = Dataset(name="my_dataset")
# ... add images and annotations ...

annotations = dataset.get_annotations("001")
print(f"Found {len(annotations)} annotations")

for ann in annotations:
    print(
        f"Category: {ann.category_name}, BBox: {ann.bbox}"
    )
Source code in boxlab/dataset/__init__.py
def get_annotations(self, image_id: str) -> list[Annotation]:
    """Get all annotations for a specific image.

    Args:
        image_id: The unique identifier of the image.

    Returns:
        List of Annotation objects for the specified image. Returns empty
        list if no annotations found.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images and annotations ...

        annotations = dataset.get_annotations("001")
        print(f"Found {len(annotations)} annotations")

        for ann in annotations:
            print(
                f"Category: {ann.category_name}, BBox: {ann.bbox}"
            )
        ```
    """
    return self.annotations.get(image_id, [])
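
Since `annotations` is a `defaultdict(list)`, it matters that the lookup uses `.get()` rather than indexing: `.get()` does not invoke the default factory, so querying an unknown image ID does not add an empty entry to the mapping. The standalone sketch below demonstrates the difference with the standard library only; strings stand in for `Annotation` objects.

```python
from collections import defaultdict

annotations: dict[str, list[str]] = defaultdict(list)
annotations["001"].append("ann_001")

# .get() does not trigger the default factory, so no key is created.
missing = annotations.get("999", [])
print(missing)               # []
print("999" in annotations)  # False
size_after_get = len(annotations)

# Indexing does trigger the default factory and inserts an empty list.
_ = annotations["999"]
print("999" in annotations)  # True
print(len(annotations))      # 2
```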

num_images

num_images() -> int

Get total number of images in the dataset.

Returns:

Type Description
int

Number of images.

Example
dataset = Dataset(name="my_dataset")
# ... add images ...

print(f"Total images: {dataset.num_images()}")
Source code in boxlab/dataset/__init__.py
def num_images(self) -> int:
    """Get total number of images in the dataset.

    Returns:
        Number of images.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images ...

        print(f"Total images: {dataset.num_images()}")
        ```
    """
    return len(self.images)

num_annotations

num_annotations() -> int

Get total number of annotations in the dataset.

Returns:

Type Description
int

Total count of all annotations across all images.

Example
dataset = Dataset(name="my_dataset")
# ... add annotations ...

print(f"Total annotations: {dataset.num_annotations()}")
Source code in boxlab/dataset/__init__.py
def num_annotations(self) -> int:
    """Get total number of annotations in the dataset.

    Returns:
        Total count of all annotations across all images.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add annotations ...

        print(f"Total annotations: {dataset.num_annotations()}")
        ```
    """
    return sum(len(anns) for anns in self.annotations.values())

num_categories

num_categories() -> int

Get total number of categories in the dataset.

Returns:

Type Description
int

Number of unique categories.

Example
dataset = Dataset(name="my_dataset")
# ... add categories ...

print(f"Total categories: {dataset.num_categories()}")
Source code in boxlab/dataset/__init__.py
def num_categories(self) -> int:
    """Get total number of categories in the dataset.

    Returns:
        Number of unique categories.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add categories ...

        print(f"Total categories: {dataset.num_categories()}")
        ```
    """
    return len(self.categories)

get_statistics

get_statistics(by_source: Literal[False] = False) -> DatasetStatistics
get_statistics(by_source: Literal[True]) -> dict[str, DatasetStatistics]
get_statistics(by_source: bool = False) -> DatasetStatistics | dict[str, DatasetStatistics]

Calculate dataset statistics.

Computes comprehensive statistics about the dataset including image counts, annotation counts, category distribution, and bounding box area metrics.

Parameters:

Name Type Description Default
by_source bool

If True, return statistics grouped by source name. If False, return overall statistics for the entire dataset.

False

Returns:

Type Description
DatasetStatistics | dict[str, DatasetStatistics]

If by_source is False: a DatasetStatistics object with overall stats. If by_source is True: a dict mapping source names to their respective DatasetStatistics objects.

Example
dataset = Dataset(name="my_dataset")
# ... add data ...

# Get overall statistics
stats = dataset.get_statistics()
print(f"Images: {stats['num_images']}")
print(f"Annotations: {stats['num_annotations']}")
print(
    "Avg annotations per image: "
    f"{stats['avg_annotations_per_image']:.2f}"
)

# Get statistics by source
stats_by_source = dataset.get_statistics(by_source=True)
for source, source_stats in stats_by_source.items():
    print(f"\nSource: {source}")
    print(f"  Images: {source_stats['num_images']}")

Source code in boxlab/dataset/__init__.py
def get_statistics(
    self,
    by_source: bool = False,
) -> DatasetStatistics | dict[str, DatasetStatistics]:
    """Calculate dataset statistics.

    Computes comprehensive statistics about the dataset including image
    counts, annotation counts, category distribution, and bounding box area
    metrics.

    Args:
        by_source: If True, return statistics grouped by source name.
            If False, return overall statistics for the entire dataset.

    Returns:
        If by_source is False: A DatasetStatistics object with overall
            stats.
        If by_source is True: A dict mapping source names to their
            respective DatasetStatistics objects.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add data ...

        # Get overall statistics
        stats = dataset.get_statistics()
        print(f"Images: {stats['num_images']}")
        print(f"Annotations: {stats['num_annotations']}")
        print(
            "Avg annotations per image: "
            f"{stats['avg_annotations_per_image']:.2f}"
        )

        # Get statistics by source
        stats_by_source = dataset.get_statistics(by_source=True)
        for source, source_stats in stats_by_source.items():
            print(f"\nSource: {source}")
            print(f"  Images: {source_stats['num_images']}")
        ```
    """
    if by_source:
        return self._get_statistics_by_source()

    return self._calculate_statistics(self.images.keys())
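
The three signatures listed for `get_statistics` suggest that the method is declared with `typing.overload`, so that a type checker can narrow the return type from the literal value of `by_source`. The sketch below shows that pattern on a hypothetical miniature class; `TinyDataset` and the `Stats` alias are assumptions made for illustration and do not reflect how boxlab computes its statistics.

```python
from typing import Literal, Union, overload

Stats = dict[str, int]  # hypothetical stand-in for DatasetStatistics


class TinyDataset:
    """Hypothetical miniature class used only to illustrate the overloads."""

    def __init__(self) -> None:
        self.source_info: dict[str, str] = {}  # image_id -> source name

    @overload
    def get_statistics(self, by_source: Literal[False] = False) -> Stats: ...
    @overload
    def get_statistics(self, by_source: Literal[True]) -> dict[str, Stats]: ...

    def get_statistics(self, by_source: bool = False) -> Union[Stats, dict[str, Stats]]:
        if by_source:
            # Count images per source name.
            per_source: dict[str, Stats] = {}
            for source in self.source_info.values():
                per_source.setdefault(source, {"num_images": 0})["num_images"] += 1
            return per_source
        return {"num_images": len(self.source_info)}


ds = TinyDataset()
ds.source_info = {"001": "cam_a", "002": "cam_a", "003": "cam_b"}

overall = ds.get_statistics()                # checker infers Stats
grouped = ds.get_statistics(by_source=True)  # checker infers dict[str, Stats]
print(overall)  # {'num_images': 3}
print(grouped)  # {'cam_a': {'num_images': 2}, 'cam_b': {'num_images': 1}}
```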

print_statistics

print_statistics(by_source: bool = False) -> None

Print dataset statistics to console.

Parameters:

Name Type Description Default
by_source bool

If True, print statistics for each source separately. If False, print overall statistics.

False
Example
dataset = Dataset(name="my_dataset")
# ... add data ...

# Print overall statistics
dataset.print_statistics()

# Print per-source statistics
dataset.print_statistics(by_source=True)
Source code in boxlab/dataset/__init__.py
def print_statistics(self, by_source: bool = False) -> None:
    """Print dataset statistics to console.

    Args:
        by_source: If True, print statistics for each source separately.
            If False, print overall statistics.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add data ...

        # Print overall statistics
        dataset.print_statistics()

        # Print per-source statistics
        dataset.print_statistics(by_source=True)
        ```
    """
    if by_source:
        self._print_statistics_by_source()
    else:
        self._print_single_statistics(self.get_statistics(by_source=False), self.name)

merge

merge(other: Dataset, resolve_conflicts: Literal['skip', 'rename', 'error'] = 'skip', preserve_sources: bool = True, regen_ids: bool = True) -> Dataset

Merge another dataset into a new dataset.

Creates a new dataset containing all images, annotations, and categories from both datasets. Handles category conflicts according to the specified resolution strategy.

Parameters:

Name Type Description Default
other Dataset

Another Dataset object to merge with this one.

required
resolve_conflicts Literal['skip', 'rename', 'error']

Strategy for handling category name conflicts:
  • "skip": Use the existing category ID (default)
  • "rename": Rename the conflicting category from the other dataset
  • "error": Raise CategoryConflictError

'skip'
preserve_sources bool

If True, maintain source tracking information from both datasets.

True
regen_ids bool

If True, regenerate image and annotation IDs to avoid conflicts. Recommended when merging datasets with overlapping IDs.

True

Returns:

Type Description
Dataset

A new Dataset object containing the merged data.

Raises:

Type Description
CategoryConflictError

If resolve_conflicts is "error" and a category name conflict is detected.

DatasetError

If category mapping fails during merge.

Example
from boxlab.dataset import Dataset

# Create two datasets
dataset_a = Dataset(name="dataset_a")
dataset_b = Dataset(name="dataset_b")

# ... populate datasets ...

# Merge with default settings
merged = dataset_a.merge(dataset_b)

# Merge with conflict resolution
merged = dataset_a.merge(
    dataset_b,
    resolve_conflicts="rename",
    preserve_sources=True,
)

# Using the + operator (equivalent to merge with defaults)
merged = dataset_a + dataset_b
Source code in boxlab/dataset/__init__.py
def merge(
    self,
    other: Dataset,
    resolve_conflicts: t.Literal["skip", "rename", "error"] = "skip",
    preserve_sources: bool = True,
    regen_ids: bool = True,
) -> Dataset:
    """Merge another dataset into a new dataset.

    Creates a new dataset containing all images, annotations, and
    categories from both datasets. Handles category conflicts according to
    the specified resolution strategy.

    Args:
        other: Another Dataset object to merge with this one.
        resolve_conflicts: Strategy for handling category name conflicts:
            - "skip": Use the existing category ID (default)
            - "rename": Rename the conflicting category from other dataset
            - "error": Raise CategoryConflictError
        preserve_sources: If True, maintain source tracking information
            from both datasets.
        regen_ids: If True, regenerate image and annotation IDs to avoid
            conflicts. Recommended when merging datasets with overlapping
            IDs.

    Returns:
        A new Dataset object containing the merged data.

    Raises:
        CategoryConflictError: If resolve_conflicts is "error" and a
            category name conflict is detected.
        DatasetError: If category mapping fails during merge.

    Example:
        ```python
        from boxlab.dataset import Dataset

        # Create two datasets
        dataset_a = Dataset(name="dataset_a")
        dataset_b = Dataset(name="dataset_b")

        # ... populate datasets ...

        # Merge with default settings
        merged = dataset_a.merge(dataset_b)

        # Merge with conflict resolution
        merged = dataset_a.merge(
            dataset_b,
            resolve_conflicts="rename",
            preserve_sources=True,
        )

        # Using the + operator (equivalent to merge with defaults)
        merged = dataset_a + dataset_b
        ```
    """
    logger.info(f"Merging '{other.name}' into '{self.name}'")

    merged = Dataset(name=f"{self.name}_merged")

    # Merge categories
    category_mapping: dict[int, int] = {}
    next_category_id = max(self.categories.keys()) + 1 if self.categories else 1

    for cat_id, cat_name in self.categories.items():
        merged.add_category(cat_id, cat_name)
        category_mapping[cat_id] = cat_id

    for cat_id, cat_name in other.categories.items():
        if cat_name in merged.category_name_to_id:
            if resolve_conflicts == "skip":
                category_mapping[cat_id] = merged.category_name_to_id[cat_name]
            elif resolve_conflicts == "rename":
                new_name = f"{cat_name}_other"
                merged.add_category(next_category_id, new_name)
                category_mapping[cat_id] = next_category_id
                next_category_id += 1
            elif resolve_conflicts == "error":
                raise CategoryConflictError(cat_name, f"Category name conflict: {cat_name}")
        else:
            merged.add_category(cat_id, cat_name)
            category_mapping[cat_id] = cat_id

    # Merge from self
    for img_id, img_info in self.images.items():
        new_iid = utils.gen_uid(prefix="img_") if regen_ids else img_id
        source = self.source_info.get(img_id, self.name) if preserve_sources else None

        # Create new ImageInfo with potentially new ID
        new_img_info = ImageInfo(
            image_id=new_iid,
            file_name=img_info.file_name,
            width=img_info.width,
            height=img_info.height,
            path=img_info.path,
        )
        merged.add_image(new_img_info, source_name=source)
        for ann in self.get_annotations(img_id):
            new_ann = (
                Annotation(
                    bbox=ann.bbox,
                    category_id=ann.category_id,
                    category_name=ann.category_name,
                    image_id=new_iid,
                    annotation_id=utils.gen_uid(prefix="ann_")
                    if regen_ids
                    else ann.annotation_id,
                    area=ann.area,
                    iscrowd=ann.iscrowd,
                )
                if regen_ids
                else Annotation(
                    bbox=ann.bbox,
                    category_id=ann.category_id,
                    category_name=ann.category_name,
                    image_id=new_iid,
                    annotation_id=ann.annotation_id,
                    area=ann.area,
                    iscrowd=ann.iscrowd,
                )
            )
            merged.add_annotation(new_ann)

    # Merge from other
    for img_id, img_info in other.images.items():
        new_iid = (
            utils.gen_uid(prefix="img_")
            if regen_ids
            else self._resolve_id_conflict(img_id, lambda x: x in merged.images)
        )
        new_img_info = ImageInfo(
            image_id=new_iid,
            file_name=img_info.file_name,
            width=img_info.width,
            height=img_info.height,
            path=img_info.path,
        )

        source = other.source_info.get(img_id, other.name) if preserve_sources else None
        merged.add_image(new_img_info, source_name=source)

        for ann in other.get_annotations(img_id):
            new_cat_name = merged.get_category_name(category_mapping[ann.category_id])
            if new_cat_name is None:
                raise DatasetError(f"Category mapping failed for {ann.category_id}")

            new_ann = Annotation(
                bbox=ann.bbox,
                category_id=category_mapping[ann.category_id],
                category_name=new_cat_name,
                image_id=new_iid,
                annotation_id=utils.gen_uid(prefix="ann_") if regen_ids else ann.annotation_id,
                area=ann.area,
                iscrowd=ann.iscrowd,
            )
            merged.add_annotation(new_ann)

    logger.info(f"Merge completed: {merged.num_images()} images")
    return merged
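
The category-resolution part of the listing above can be isolated into a small function, which makes the three strategies easier to compare. This is a sketch that follows the same branching with plain dictionaries; `merge_categories` is a hypothetical helper, and it raises a plain `ValueError` where the library raises `CategoryConflictError`.

```python
from typing import Literal


def merge_categories(
    base: dict[int, str],
    other: dict[int, str],
    resolve_conflicts: Literal["skip", "rename", "error"] = "skip",
) -> tuple[dict[int, str], dict[int, int]]:
    """Return (merged categories, mapping from other's IDs to merged IDs)."""
    merged = dict(base)
    name_to_id = {name: cat_id for cat_id, name in merged.items()}
    next_id = max(base) + 1 if base else 1
    mapping: dict[int, int] = {}

    for cat_id, name in other.items():
        if name in name_to_id:
            if resolve_conflicts == "skip":
                # Reuse the ID already assigned to this name.
                mapping[cat_id] = name_to_id[name]
            elif resolve_conflicts == "rename":
                # Keep both categories; the incoming one gets a suffix and a new ID.
                new_name = f"{name}_other"
                merged[next_id] = new_name
                name_to_id[new_name] = next_id
                mapping[cat_id] = next_id
                next_id += 1
            else:  # "error"
                raise ValueError(f"Category name conflict: {name}")
        else:
            # A new name keeps its original ID.
            merged[cat_id] = name
            name_to_id[name] = cat_id
            mapping[cat_id] = cat_id
    return merged, mapping


base = {1: "person", 2: "car"}
other = {7: "person", 8: "dog"}

skip_result = merge_categories(base, other, "skip")
rename_result = merge_categories(base, other, "rename")
print(skip_result)
# ({1: 'person', 2: 'car', 8: 'dog'}, {7: 1, 8: 8})
print(rename_result)
# ({1: 'person', 2: 'car', 3: 'person_other', 8: 'dog'}, {7: 3, 8: 8})
```

As in the listing, a category from `other` whose name is new is stored under its original ID, so the sketch assumes those IDs are not already in use in the merged mapping.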

split

split(split_ratio: SplitRatio, seed: int | None = None) -> dict[str, list[str]]

Split dataset into train, validation, and test sets.

Randomly shuffles and divides the dataset images according to the specified ratios. Useful for creating training/validation/test splits for machine learning.

Parameters:

Name Type Description Default
split_ratio SplitRatio

SplitRatio object defining the proportions for train, validation, and test sets. Must sum to 1.0.

required
seed int | None

Optional random seed for reproducible splits. If None, the split will be non-deterministic.

None

Returns:

Type Description
dict[str, list[str]]

Dictionary with keys "train", "val", and "test", each mapping to a list of image IDs in that split.

Raises:

Type Description
ValueError

If split_ratio proportions don't sum to 1.0 (raised by split_ratio.validate()).

Example
from boxlab.dataset import Dataset
from boxlab.dataset.types import SplitRatio

dataset = Dataset(name="my_dataset")
# ... populate dataset ...

# Define split ratios: 70% train, 20% val, 10% test
split_ratio = SplitRatio(train=0.7, val=0.2, test=0.1)

# Split with fixed seed for reproducibility
splits = dataset.split(split_ratio, seed=42)

print(f"Train images: {len(splits['train'])}")
print(f"Val images: {len(splits['val'])}")
print(f"Test images: {len(splits['test'])}")

# Access specific split
train_image_ids = splits["train"]
Source code in boxlab/dataset/__init__.py
def split(self, split_ratio: SplitRatio, seed: int | None = None) -> dict[str, list[str]]:
    """Split dataset into train, validation, and test sets.

    Randomly shuffles and divides the dataset images according to the
    specified ratios. Useful for creating training/validation/test splits
    for machine learning.

    Args:
        split_ratio: SplitRatio object defining the proportions for train,
            validation, and test sets. Must sum to 1.0.
        seed: Optional random seed for reproducible splits. If None, the
            split will be non-deterministic.

    Returns:
        Dictionary with keys "train", "val", and "test", each mapping to
        a list of image IDs in that split.

    Raises:
        ValueError: If split_ratio proportions don't sum to 1.0 (raised by
            split_ratio.validate()).

    Example:
        ```python
        from boxlab.dataset import Dataset
        from boxlab.dataset.types import SplitRatio

        dataset = Dataset(name="my_dataset")
        # ... populate dataset ...

        # Define split ratios: 70% train, 20% val, 10% test
        split_ratio = SplitRatio(train=0.7, val=0.2, test=0.1)

        # Split with fixed seed for reproducibility
        splits = dataset.split(split_ratio, seed=42)

        print(f"Train images: {len(splits['train'])}")
        print(f"Val images: {len(splits['val'])}")
        print(f"Test images: {len(splits['test'])}")

        # Access specific split
        train_image_ids = splits["train"]
        ```
    """
    split_ratio.validate()

    image_ids = list(self.images.keys())
    if seed is not None:
        random.seed(seed)
    random.shuffle(image_ids)

    total = len(image_ids)
    train_end = int(total * split_ratio.train)
    val_end = train_end + int(total * split_ratio.val)

    return {
        "train": image_ids[:train_end],
        "val": image_ids[train_end:val_end],
        "test": image_ids[val_end:],
    }
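
The slicing arithmetic in the source above truncates the train and validation sizes with `int()`, so any leftover images fall into the test split. A minimal standalone sketch of that arithmetic follows; `split_sizes` is a hypothetical helper written for illustration and is not part of BoxLab.

```python
def split_sizes(total: int, train: float, val: float) -> tuple[int, int, int]:
    """Mirror the index arithmetic of Dataset.split() for `total` images."""
    train_end = int(total * train)          # truncated, never rounded up
    val_end = train_end + int(total * val)  # truncated as well
    return train_end, val_end - train_end, total - val_end


# 10 images with ratios 0.5 / 0.25 / 0.25: validation truncates 2.5 to 2,
# so the test split receives the extra image.
print(split_sizes(10, train=0.5, val=0.25))  # (5, 2, 3)
```

For small datasets this means the test split can end up larger than its nominal ratio suggests, which is worth checking before training.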

visualize_sample

visualize_sample(image_id: str, figsize: tuple[int, int] = (12, 8), show_labels: bool = True, save_path: Path | None = None) -> None

Visualize a single image with its annotations.

Displays the image with bounding boxes and category labels overlaid. Each category is assigned a unique color, and annotations are drawn as rectangles with optional text labels.

Parameters:

Name Type Description Default
image_id str

The unique identifier of the image to visualize.

required
figsize tuple[int, int]

Tuple of (width, height) for the matplotlib figure size. Defaults to (12, 8).

(12, 8)
show_labels bool

If True, display category names above bounding boxes. Defaults to True.

True
save_path Path | None

Optional path to save the visualization as an image file. If None, only displays the plot.

None

Raises:

Type Description
DatasetError

If the image is not found or has no path defined.

Example
from pathlib import Path
from boxlab.dataset import Dataset

dataset = Dataset(name="my_dataset")
# ... populate dataset ...

# Display a sample image
dataset.visualize_sample("001")

# Save visualization to file
dataset.visualize_sample(
    "001",
    figsize=(16, 10),
    show_labels=True,
    save_path=Path("output/sample_001.png"),
)
Source code in boxlab/dataset/__init__.py
def visualize_sample(
    self,
    image_id: str,
    figsize: tuple[int, int] = (12, 8),
    show_labels: bool = True,
    save_path: pathlib.Path | None = None,
) -> None:
    """Visualize a single image with its annotations.

    Displays the image with bounding boxes and category labels overlaid.
    Each category is assigned a unique color, and annotations are drawn
    as rectangles with optional text labels.

    Args:
        image_id: The unique identifier of the image to visualize.
        figsize: Tuple of (width, height) for the matplotlib figure size.
            Defaults to (12, 8).
        show_labels: If True, display category names above bounding boxes.
            Defaults to True.
        save_path: Optional path to save the visualization as an image file.
            If None, only displays the plot.

    Raises:
        DatasetError: If the image is not found or has no path defined.

    Example:
        ```python
        from pathlib import Path
        from boxlab.dataset import Dataset

        dataset = Dataset(name="my_dataset")
        # ... populate dataset ...

        # Display a sample image
        dataset.visualize_sample("001")

        # Save visualization to file
        dataset.visualize_sample(
            "001",
            figsize=(16, 10),
            show_labels=True,
            save_path=Path("output/sample_001.png"),
        )
        ```
    """
    img_info = self.get_image(image_id)
    if img_info is None or img_info.path is None:
        raise DatasetError(f"Image {image_id} not found or has no path")

    img = Image.open(img_info.path)
    anns = self.get_annotations(image_id)
    source = self.get_image_source(image_id)

    _fig, ax = plt.subplots(1, figsize=figsize)
    ax.imshow(img)

    colors = plt.cm.rainbow(np.linspace(0, 1, self.num_categories()))  # type: ignore
    category_colors = {cat_id: colors[i] for i, cat_id in enumerate(self.categories.keys())}

    for ann in anns:
        bbox = ann.bbox
        color = category_colors[ann.category_id]

        rect = patches.Rectangle(
            (bbox.x_min, bbox.y_min),
            bbox.x_max - bbox.x_min,
            bbox.y_max - bbox.y_min,
            linewidth=2,
            edgecolor=color,
            facecolor="none",
        )
        ax.add_patch(rect)

        if show_labels:
            ax.text(
                bbox.x_min,
                bbox.y_min - 5,
                ann.category_name,
                color="white",
                fontsize=10,
                bbox={"facecolor": color, "alpha": 0.7, "edgecolor": "none", "pad": 2},
            )

    ax.axis("off")
    title = f"Image: {img_info.file_name} | Annotations: {len(anns)}"
    if source:
        title += f" | Source: {source}"
    ax.set_title(title)
    plt.tight_layout()

    if save_path:
        plt.savefig(save_path, bbox_inches="tight", dpi=150)
        logger.info(f"Visualization saved to: {save_path}")

    plt.show()
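
The drawing loop above converts each corner-format box (x_min, y_min, x_max, y_max) into the anchor-plus-size form that matplotlib's `Rectangle` expects. The conversion alone can be sketched without matplotlib; `corners_to_rect` is a hypothetical helper used only for this illustration, not a BoxLab function.

```python
def corners_to_rect(
    x_min: float, y_min: float, x_max: float, y_max: float
) -> tuple[tuple[float, float], float, float]:
    """Convert corner coordinates to ((x, y), width, height)."""
    return (x_min, y_min), x_max - x_min, y_max - y_min


# The box used in the class-level example: x_min=10, y_min=20, x_max=100, y_max=150
print(corners_to_rect(10, 20, 100, 150))  # ((10, 20), 90, 130)
```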

visualize_category_distribution

visualize_category_distribution(figsize: tuple[int, int] = (12, 6), save_path: Path | None = None) -> None

Visualize category distribution as a bar chart.

Creates a bar chart showing the number of annotations for each category in the dataset. Useful for understanding class balance and distribution.

Parameters:

Name Type Description Default
figsize tuple[int, int]

Tuple of (width, height) for the matplotlib figure size. Defaults to (12, 6).

(12, 6)
save_path Path | None

Optional path to save the visualization as an image file. If None, only displays the plot.

None
Example
from pathlib import Path
from boxlab.dataset import Dataset

dataset = Dataset(name="my_dataset")
# ... populate dataset ...

# Display category distribution
dataset.visualize_category_distribution()

# Save to file
dataset.visualize_category_distribution(
    figsize=(16, 8),
    save_path=Path("output/category_distribution.png"),
)
Source code in boxlab/dataset/__init__.py
def visualize_category_distribution(
    self,
    figsize: tuple[int, int] = (12, 6),
    save_path: pathlib.Path | None = None,
) -> None:
    """Visualize category distribution as a bar chart.

    Creates a bar chart showing the number of annotations for each category
    in the dataset. Useful for understanding class balance and distribution.

    Args:
        figsize: Tuple of (width, height) for the matplotlib figure size.
            Defaults to (12, 6).
        save_path: Optional path to save the visualization as an image file.
            If None, only displays the plot.

    Example:
        ```python
        from pathlib import Path
        from boxlab.dataset import Dataset

        dataset = Dataset(name="my_dataset")
        # ... populate dataset ...

        # Display category distribution
        dataset.visualize_category_distribution()

        # Save to file
        dataset.visualize_category_distribution(
            figsize=(16, 8),
            save_path=Path("output/category_distribution.png"),
        )
        ```
    """
    logger.info(f"Visualizing category distribution for dataset: {self.name}")

    stats = self.get_statistics(by_source=False)
    cat_dist = stats["category_distribution"]

    if not cat_dist:
        logger.warning("No categories to visualize")
        return

    categories = list(cat_dist.keys())
    counts = list(map(float, cat_dist.values()))

    _fig, ax = plt.subplots(1, figsize=figsize)
    bars = ax.bar(categories, counts, color="skyblue", edgecolor="navy", alpha=0.7)

    for bar in bars:
        height = bar.get_height()
        ax.text(
            bar.get_x() + bar.get_width() / 2,
            height,
            f"{int(height)}",
            ha="center",
            va="bottom",
            fontsize=10,
        )

    ax.set_xlabel("Category", fontsize=12)
    ax.set_ylabel("Count", fontsize=12)
    ax.set_title(f"Category Distribution - {self.name}", fontsize=14, fontweight="bold")
    ax.tick_params(axis="x", rotation=45)
    plt.tight_layout()

    if save_path:
        plt.savefig(save_path, bbox_inches="tight", dpi=150)
        logger.info(f"Category distribution saved to: {save_path}")

    plt.show()
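
The chart plots the mapping stored under `stats["category_distribution"]`, which the code above treats as category keys paired with numeric counts. How `get_statistics()` builds that mapping is not shown here; the sketch below only illustrates the kind of per-category count being plotted, using made-up category names.

```python
from collections import Counter

# Hypothetical category names, one entry per annotation.
annotation_categories = ["person", "car", "person", "person", "car", "bike"]

category_distribution = dict(Counter(annotation_categories))
print(category_distribution)  # {'person': 3, 'car': 2, 'bike': 1}

# A large gap between the most and least frequent class signals imbalance.
imbalance_ratio = max(category_distribution.values()) / min(category_distribution.values())
print(imbalance_ratio)  # 3.0
```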

__add__

__add__(other: object) -> Dataset

Enable merging datasets using the + operator.

Parameters:

Name Type Description Default
other object

Another Dataset object to merge.

required

Returns:

Type Description
Dataset

A new merged Dataset.

Raises:

Type Description
TypeError

If other is not a Dataset instance. Python raises it after the method returns NotImplemented.

Example
from boxlab.dataset import Dataset

dataset_a = Dataset(name="dataset_a")
dataset_b = Dataset(name="dataset_b")

# Merge using + operator
merged = dataset_a + dataset_b
Source code in boxlab/dataset/__init__.py
def __add__(self, other: object) -> Dataset:
    """Enable merging datasets using the + operator.

    Args:
        other: Another Dataset object to merge.

    Returns:
        A new merged Dataset.

    Raises:
        TypeError: If other is not a Dataset instance.

    Example:
        ```python
        dataset_a = Dataset(name="dataset_a")
        dataset_b = Dataset(name="dataset_b")

        # Merge using + operator
        merged = dataset_a + dataset_b
        ```
    """
    if not isinstance(other, Dataset):
        return NotImplemented
    return self.merge(other)
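
The source above returns `NotImplemented` instead of raising directly. Python then tries the other operand's reflected method and raises `TypeError` itself when nothing handles the operation. A small self-contained sketch of that protocol follows, using a hypothetical `Bag` class as a stand-in for Dataset.

```python
class Bag:
    """Toy stand-in for a mergeable container (not a BoxLab class)."""

    def __init__(self, items: list[str]) -> None:
        self.items = items

    def __add__(self, other: object) -> "Bag":
        if not isinstance(other, Bag):
            return NotImplemented  # let Python try other.__radd__, then fail
        return Bag(self.items + other.items)


merged = Bag(["a"]) + Bag(["b"])
print(merged.items)  # ['a', 'b']

error = None
try:
    Bag(["a"]) + 42
except TypeError as exc:
    error = exc  # Python raised this, not Bag.__add__
    print(f"TypeError: {exc}")
```

Returning `NotImplemented` keeps the operator open to other types that define their own reflected addition, which a direct `raise TypeError` would prevent.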

__len__

__len__() -> int

Return the number of images in the dataset.

Returns:

Type Description
int

Number of images.

Example
from boxlab.dataset import Dataset

dataset = Dataset(name="my_dataset")
# ... add images ...

print(f"Dataset contains {len(dataset)} images")
Source code in boxlab/dataset/__init__.py
def __len__(self) -> int:
    """Return the number of images in the dataset.

    Returns:
        Number of images.

    Example:
        ```python
        dataset = Dataset(name="my_dataset")
        # ... add images ...

        print(f"Dataset contains {len(dataset)} images")
        ```
    """
    return self.num_images()


Overview

The Dataset class is the core component of BoxLab's dataset management system. It provides comprehensive functionality for managing object detection datasets, including loading, exporting, merging, and analyzing data from multiple sources.

Key Features

  • Multi-format Support: Load and export datasets in COCO, YOLO, and other formats
  • Multi-source Management: Track and manage data from multiple sources
  • Category Management: Handle categories with conflict resolution
  • Dataset Operations: Split, merge, and transform datasets
  • Statistics & Visualization: Comprehensive dataset analysis and visualization tools
  • Flexible Architecture: Plugin-based system for extending functionality

Quick Start

import pathlib

from boxlab.dataset import Dataset
from boxlab.dataset.types import ImageInfo, Annotation, BBox

# Create a new dataset
dataset = Dataset(name="my_dataset")

# Add categories
dataset.add_category(1, "person")
dataset.add_category(2, "car")

# Add an image
img_info = ImageInfo(
    image_id="001",
    file_name="image1.jpg",
    width=640,
    height=480,
    path=pathlib.Path("/path/to/image1.jpg"),
)
dataset.add_image(img_info, source_name="camera1")

# Add an annotation
annotation = Annotation(
    bbox=BBox(x_min=10, y_min=20, x_max=100, y_max=150),
    category_id=1,
    category_name="person",
    image_id="001",
    annotation_id="ann_001",
)
dataset.add_annotation(annotation)

# Get statistics
stats = dataset.get_statistics()
print(f"Total images: {stats['num_images']}")
print(f"Total annotations: {stats['num_annotations']}")

Category Management

Manage object categories in your dataset:

  • add_category(): Add a new category
  • get_category_name(): Retrieve category name by ID
  • get_category_id(): Retrieve category ID by name
  • fix_duplicate_categories(): Resolve duplicate category names

Image Management

Handle image metadata and sources:

  • add_image(): Add image with optional source tracking
  • get_image(): Retrieve image information
  • get_image_source(): Get source name for an image
  • get_sources(): List all unique data sources

Annotation Management

Work with object annotations:

  • add_annotation(): Add bounding box annotation
  • get_annotations(): Get all annotations for an image

Statistics & Analysis

Analyze your dataset:

  • get_statistics(): Compute comprehensive statistics
  • print_statistics(): Display statistics in console
  • num_images(): Get image count
  • num_annotations(): Get annotation count
  • num_categories(): Get category count

Dataset Operations

Transform and combine datasets:

  • split(): Split dataset into train/val/test sets
  • merge(): Combine multiple datasets
  • __add__(): Merge using + operator

Visualization

Visualize dataset content:

  • visualize_sample(): Display image with annotations
  • visualize_category_distribution(): Show category balance

See Also