DataGradients

DataGradients is an open-source, Python-based library for computer vision dataset analysis.

Extract valuable insights from your datasets and get comprehensive reports effortlessly.

🔍 Detect Common Data Issues

Corrupted data
Labeling errors
Underlying biases, and more.

💡 Extract Insights for Better Model Design

Informed decisions based on data characteristics.
Object size and location distributions.
High frequency details.

🎯 Reduce Guesswork for Hyperparameters

Define the correct NMS and filtering parameters.
Identify class distribution issues.
Calibrate metrics for your unique dataset.

🛠 Capabilities

Non-exhaustive list of supported features.

General Image Metrics: Explore key attributes like resolution, color distribution, and average brightness.
Class Overview: Get a snapshot of class distributions, most frequent classes, and unlabelled images.
Positional Heatmaps: Visualize where objects tend to appear within your images.
Bounding Box & Mask Details: Delve into dimensions, area coverages, and resolutions of objects.
Class Frequencies Deep Dive: Dive deeper into class distributions, understanding anomalies and rare classes.
Detailed Object Counts: Examine the granularity of components per image, identifying patterns and outliers.
And many more!

📘 Deep Dive into Data Profiling
Puzzled by some dataset challenges while using DataGradients? We've got you covered.
Enrich your understanding with this 🎓 free online course. Dive into dataset profiling, confront its complexities, and harness the full potential of DataGradients.

Example of pages from the Report

Example of specific features

Check out the pre-computed dataset analysis for a deeper dive into reports.

Installation

You can install DataGradients from PyPI with pip.

pip install data-gradients

Quick Start

Prerequisites

Dataset: Includes a Train set and a Validation or a Test set.
Dataset Iterable: A method to iterate over your Dataset providing images and labels. Can be any of the following:
- PyTorch Dataloader
- PyTorch Dataset
- Generator that yields image/label pairs
- Any other iterable you use for model training/validation
One of:
- Class Names: Either the list of all class names in the dataset OR dictionary mapping of class_id -> class_name.
- Number of classes: Indicate how many unique classes are in your dataset. Ensure this number is greater than the highest class index (e.g., if your highest class index is 9, the number of classes should be at least 10).

Please ensure all the points above are checked before you proceed with DataGradients.
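The class-count rule above can be checked before starting an analysis. The snippet below is a minimal sketch, not part of DataGradients: `missing_class_ids` is a hypothetical helper that takes the class ids found in your labels plus either `class_names` or a class count, and reports any ids that are not covered.

```python
from typing import Dict, Iterable, List, Optional, Union


def missing_class_ids(
    label_ids: Iterable[int],
    class_names: Optional[Union[List[str], Dict[int, str]]] = None,
    n_classes: Optional[int] = None,
) -> List[int]:
    """Hypothetical helper (not a DataGradients API).

    Returns the sorted class ids that appear in the labels but are not
    covered by the given class names or class count.
    """
    if class_names is not None:
        if isinstance(class_names, dict):
            # Dictionary mapping class_id -> class_name
            known_ids = set(class_names.keys())
        else:
            # List of names: ids are the list positions 0..len-1
            known_ids = set(range(len(class_names)))
    elif n_classes is not None:
        # A count of N covers the ids 0..N-1
        known_ids = set(range(n_classes))
    else:
        raise ValueError("Provide either class_names or n_classes")
    return sorted(set(label_ids) - known_ids)


# Highest class index is 9, so at least 10 classes are required.
print(missing_class_ids([0, 3, 9], n_classes=10))
print(missing_class_ids([0, 3, 9], n_classes=9))
```

An empty result means every class id in the labels is covered; any id listed needs a name or a larger class count.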

Example

from torchvision.datasets import CocoDetection

train_data = CocoDetection(...)
val_data = CocoDetection(...)
class_names = ["person", "bicycle", "car", "motorcycle", ...]
# OR
# class_names = {0: "person", 1:"bicycle", 2:"car", 3: "motorcycle", ...}
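
The prerequisites also allow a plain generator that yields image/label pairs. The snippet below is a minimal sketch using random in-memory data. The function name `iter_image_label_pairs` and the label layout `(class_id, x_min, y_min, x_max, y_max)` are assumptions chosen for illustration, not a format required by DataGradients.

```python
import numpy as np


def iter_image_label_pairs(num_samples=4, image_size=64, num_classes=3, seed=0):
    """Yield (image, boxes) pairs made of random data.

    image: uint8 array of shape (image_size, image_size, 3)
    boxes: float32 array of shape (N, 5), laid out as
           (class_id, x_min, y_min, x_max, y_max)  <- assumed layout
    """
    rng = np.random.default_rng(seed)
    for _ in range(num_samples):
        image = rng.integers(0, 256, size=(image_size, image_size, 3), dtype=np.uint8)
        num_boxes = int(rng.integers(1, 4))  # 1 to 3 boxes per image
        class_ids = rng.integers(0, num_classes, size=num_boxes)
        # Top-left corners in the first half of the image, so boxes stay inside it
        top_left = rng.integers(0, image_size // 2, size=(num_boxes, 2))
        sizes = rng.integers(1, image_size // 2, size=(num_boxes, 2))
        boxes = np.column_stack([class_ids, top_left, top_left + sizes])
        yield image, boxes.astype(np.float32)


train_data = iter_image_label_pairs(num_samples=8, seed=0)
val_data = iter_image_label_pairs(num_samples=4, seed=1)
```

Note that a Python generator can only be consumed once. If your data needs to be iterated several times, a class that implements `__iter__` (or a PyTorch Dataset) may be a better fit.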

Good to Know - DataGradients will try to find out how the dataset returns images and labels.

If something cannot be automatically determined, you will be asked to provide some extra information through a text input.

In some extreme cases, the process will crash and invite you to implement a custom dataset extractor.

Heads up - DataGradients provides a few out-of-the-box dataset/dataloader implementations. You can find more dataset implementations in PyTorch or SuperGradients.

Dataset Analysis

You are now ready to go. Choose the relevant analyzer for your task and run it over your datasets!

Image Classification

from data_gradients.managers.classification_manager import ClassificationAnalysisManager 

train_data = ...  # Your dataset iterable (torch dataset/dataloader/...)
val_data = ...    # Your dataset iterable (torch dataset/dataloader/...)
class_names = ... # [<class-1>, <class-2>, ...]

analyzer = ClassificationAnalysisManager(
    report_title="Testing Data-Gradients Classification",
    train_data=train_data,
    val_data=val_data,
    class_names=class_names,
)

analyzer.run()

Object Detection

from data_gradients.managers.detection_manager import DetectionAnalysisManager

train_data = ...  # Your dataset iterable (torch dataset/dataloader/...)
val_data = ...    # Your dataset iterable (torch dataset/dataloader/...)
class_names = ... # [<class-1>, <class-2>, ...]

analyzer = DetectionAnalysisManager(
    report_title="Testing Data-Gradients Object Detection",
    train_data=train_data,
    val_data=val_data,
    class_names=class_names,
)

analyzer.run()

Semantic Segmentation

from data_gradients.managers.segmentation_manager import SegmentationAnalysisManager 

train_data = ...  # Your dataset iterable (torch dataset/dataloader/...)
val_data = ...    # Your dataset iterable (torch dataset/dataloader/...)
class_names = ... # [<class-1>, <class-2>, ...]

analyzer = SegmentationAnalysisManager(
    report_title="Testing Data-Gradients Segmentation",
    train_data=train_data,
    val_data=val_data,
    class_names=class_names,
)

analyzer.run()

Example

You can test the segmentation analysis tool in the following example, which does not require you to download any additional data.

Report

Once the analysis is done, the path to your PDF report will be printed. Examples of pre-computed dataset analysis reports are listed in the Pre-computed Dataset Analysis section.

Feature Configuration

The feature configuration allows you to run the analysis on a subset of features or adjust the parameters of existing features. If you are interested in customizing this configuration, you can check out the documentation on that topic.

Dataset Extractors

Ensuring Comprehensive Dataset Compatibility

DataGradients can usually infer the dataset structure automatically; however, certain specificities, such as nested annotation structures or unique annotation formats, may require a tailored approach.

To address this, DataGradients offers extractors tailored for enhancing compatibility with diverse dataset formats.

For an in-depth understanding and implementation details, we encourage a thorough review of the Dataset Extractors Documentation.
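
To make the idea of an extractor concrete, here is a minimal sketch of a function that flattens a nested annotation structure into a plain array. The sample layout, the key names (`"objects"`, `"category_id"`, `"bbox"`) and the function name `extract_labels` are all hypothetical. How such a function is registered with an analysis manager is not shown here; refer to the Dataset Extractors Documentation for that.

```python
import numpy as np


def extract_labels(sample):
    """Hypothetical extractor for a nested detection annotation.

    Assumes each sample is (image, annotations), where annotations looks like
    {"objects": [{"category_id": int, "bbox": [x, y, w, h]}, ...]}.
    Returns a float32 array of shape (N, 5): (class_id, x, y, w, h).
    """
    _image, annotations = sample
    rows = [[obj["category_id"], *obj["bbox"]] for obj in annotations["objects"]]
    if not rows:
        # Image without any labeled object
        return np.zeros((0, 5), dtype=np.float32)
    return np.array(rows, dtype=np.float32)


sample = (
    np.zeros((32, 32, 3), dtype=np.uint8),
    {"objects": [{"category_id": 2, "bbox": [4, 5, 10, 12]}]},
)
labels = extract_labels(sample)
print(labels.shape)
```

Keeping the extractor as a small pure function makes it easy to test on a handful of samples before running a full analysis.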

Pre-computed Dataset Analysis

Example notebook on Colab

Detection

Common Datasets

COCO
VOC

Roboflow 100 Datasets

4-fold-defect
abdomen-mri
acl-x-ray
activity-diagrams-qdobr
aerial-cows
aerial-pool
aerial-spheres
animals-ij5d2
apex-videogame
apples-fvpl5
aquarium-qlnqy
asbestos
avatar-recognition-nuexe
axial-mri
bacteria-ptywi
bccd-ouzjz
bees-jt5in
bone-fracture-7fylg
brain-tumor-m2pbp
cable-damage
cables-nl42k
cavity-rs0uf
cell-towers
cells-uyemf
chess-pieces-mjzgj
circuit-elements
circuit-voltages
cloud-types
coins-1apki
construction-safety-gsnvb
coral-lwptl
corrosion-bi3q3
cotton-20xz5
cotton-plant-disease
csgo-videogame
currency-v4f8j
digits-t2eg6
document-parts
excavators-czvg9
farcry6-videogame
fish-market-ggjso
flir-camera-objects
furniture-ngpea
gauge-u2lwv
grass-weeds
gynecology-mri
halo-infinite-angel-videogame
hand-gestures-jps7z
insects-mytwu
leaf-disease-nsdsr
lettuce-pallets
liver-disease
marbles
mask-wearing-608pr
mitosis-gjs3g
number-ops
paper-parts
paragraphs-co84b
parasites-1s07h
peanuts-sd4kf
peixos-fish
people-in-paintings
pests-2xlvx
phages
pills-sxdht
poker-cards-cxcvz
printed-circuit-board
radio-signal
road-signs-6ih4y
road-traffic
robomasters-285km
secondary-chains
sedimentary-features-9eosf
shark-teeth-5atku
sign-language-sokdr
signatures-xc8up
smoke-uvylj
soccer-players-5fuqs
soda-bottles
solar-panels-taxvb
stomata-cells
street-work
tabular-data-wf9uh
team-fight-tactics
thermal-cheetah-my4dp
thermal-dogs-and-people-x6ejw
trail-camera
truck-movement
tweeter-posts
tweeter-profile
underwater-objects-5v7p8
underwater-pipes-4ng4t
uno-deck
valentines-chocolate
vehicles-q0x2v
wall-damage
washroom-rf1fa
weed-crop-aerial
wine-labels
x-ray-rheumatology

Segmentation

COCO
Cityscapes
VOC

Community

Click here to join our Discord Community

License

This project is released under the Apache 2.0 license.
