openvinotoolkit · zhiltsov-max · Jan 23, 2021 · Nov 13, 2020 · Nov 23, 2020 · Dec 1, 2020
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -6,21 +6,36 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
 and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).
 
 
-## [Unreleased]
+## 01/23/2021 - Release v0.1.5
 ### Added
--
+- `WiderFace` dataset format (<https://github.com/openvinotoolkit/datumaro/pull/65>, <https://github.com/openvinotoolkit/datumaro/pull/90>)
+- Function to transform annotations to labels (<https://github.com/openvinotoolkit/datumaro/pull/66>)
+- Dataset splits for classification, detection and re-id tasks (<https://github.com/openvinotoolkit/datumaro/pull/68>, <https://github.com/openvinotoolkit/datumaro/pull/81>)
+- `VGGFace2` dataset format (<https://github.com/openvinotoolkit/datumaro/pull/69>, <https://github.com/openvinotoolkit/datumaro/pull/82>)
+- Unique image count statistic (<https://github.com/openvinotoolkit/datumaro/pull/87>)
+- Installation with pip by name `datumaro`
 
 ### Changed
--
+- `Dataset` class extended with new operations: `save`, `load`, `export`, `import_from`, `detect`, `run_model` (<https://github.com/openvinotoolkit/datumaro/pull/71>)
+- Allowed importing `Extractor`-only defined formats (in `Project.import_from`, `dataset.import_from` and CLI/`project import`) (<https://github.com/openvinotoolkit/datumaro/pull/71>)
+- `datum project ...` commands replaced with `datum ...` commands (<https://github.com/openvinotoolkit/datumaro/pull/84>)
+- Supported more image formats in `ImageNet` extractors (<https://github.com/openvinotoolkit/datumaro/pull/85>)
+- Allowed adding `Importer`-defined formats as project sources (`source add`) (<https://github.com/openvinotoolkit/datumaro/pull/86>)
+- Added max search depth in `ImageDir` format and importers (<https://github.com/openvinotoolkit/datumaro/pull/86>)
 
 ### Deprecated
--
+- `datum project ...` CLI context (<https://github.com/openvinotoolkit/datumaro/pull/84>)
 
 ### Removed
 -
 
 ### Fixed
--
+- Allow plugins inherited from `Extractor` (instead of only `SourceExtractor`) (<https://github.com/openvinotoolkit/datumaro/pull/70>)
+- Windows installation with `pip` for `pycocotools` (<https://github.com/openvinotoolkit/datumaro/pull/73>)
+- `YOLO` extractor path matching on Windows (<https://github.com/openvinotoolkit/datumaro/pull/73>)
+- Fixed inplace file copying when saving images (<https://github.com/openvinotoolkit/datumaro/pull/76>)
+- Fixed `labelmap` parameter type checking in `VOC` converter (<https://github.com/openvinotoolkit/datumaro/pull/76>)
+- Fixed model copying on addition in CLI (<https://github.com/openvinotoolkit/datumaro/pull/94>)
 
 ### Security
 -

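Note on the "Dataset splits for classification, detection and re-id tasks" changelog entry: the README hunk in this change describes the re-identification split as one that avoids having the same IDs in the training and test subsets. A minimal pure-Python sketch of that one constraint follows. It is not Datumaro's implementation; `split_by_identity`, its parameters, and the sample data are hypothetical, and the sketch makes no attempt to preserve label or attribute distributions.

```python
import random
from collections import defaultdict
from typing import Dict, Hashable, List, Sequence, Tuple


def split_by_identity(items: Sequence[Tuple[str, Hashable]],
        test_ratio: float = 0.5, seed: int = 0) -> Dict[str, List[str]]:
    """Split (item_id, identity) pairs so that no identity spans both subsets.

    Hypothetical helper for illustration only.
    """
    # Group item ids by the identity they depict
    groups = defaultdict(list)
    for item_id, identity in items:
        groups[identity].append(item_id)

    # Shuffle whole identities, not individual items
    identities = sorted(groups, key=str)
    random.Random(seed).shuffle(identities)

    test_count = int(round(len(identities) * test_ratio))
    test_identities = set(identities[:test_count])

    splits = {'train': [], 'test': []}
    for identity, members in groups.items():
        subset = 'test' if identity in test_identities else 'train'
        splits[subset].extend(members)
    return splits


items = [('a1', 'A'), ('a2', 'A'), ('b1', 'B'), ('c1', 'C'), ('c2', 'C'), ('d1', 'D')]
splits = split_by_identity(items, test_ratio=0.5, seed=0)
print(splits)
```

Assigning whole identities to a subset is what guarantees the "no shared IDs" property; splitting item by item would not.
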
diff --git a/README.md b/README.md
@@ -44,23 +44,23 @@ CVAT annotations                             ---> Publication, statistics etc.
 - Convert only non-`occluded` annotations from a [CVAT](https://github.com/opencv/cvat) project to TFrecord:
   ```bash
   # export Datumaro dataset in CVAT UI, extract somewhere, go to the project dir
-  datum project filter -e '/item/annotation[occluded="False"]' \
+  datum filter -e '/item/annotation[occluded="False"]' \
     --mode items+anno --output-dir not_occluded
-  datum project export --project not_occluded \
+  datum export --project not_occluded \
     --format tf_detection_api -- --save-images
   ```
 
 - Annotate MS COCO dataset, extract image subset, re-annotate it in [CVAT](https://github.com/opencv/cvat), update old dataset:
   ```bash
   # Download COCO dataset http://cocodataset.org/#download
   # Put images to coco/images/ and annotations to coco/annotations/
-  datum project import --format coco --input-path <path/to/coco>
-  datum project export --filter '/image[images_I_dont_like]' --format cvat \
+  datum import --format coco --input-path <path/to/coco>
+  datum export --filter '/image[images_I_dont_like]' --format cvat \
     --output-dir reannotation
   # import dataset and images to CVAT, re-annotate
   # export Datumaro project, extract to 'reannotation-upd'
-  datum project project merge reannotation-upd
-  datum project export --format coco
+  datum merge reannotation-upd
+  datum export --format coco
   ```
 
 - Annotate instance polygons in [CVAT](https://github.com/opencv/cvat), export as masks in COCO:
@@ -72,18 +72,18 @@ CVAT annotations                             ---> Publication, statistics etc.
 - Apply an OpenVINO detection model to some COCO-like dataset,
   then compare annotations with ground truth and visualize in TensorBoard:
   ```bash
-  datum project import --format coco --input-path <path/to/coco>
+  datum import --format coco --input-path <path/to/coco>
   # create model results interpretation script
   datum model add mymodel openvino \
     --weights model.bin --description model.xml \
     --interpretation-script parse_results.py
   datum model run --model mymodel --output-dir mymodel_inference/
-  datum project diff mymodel_inference/ --format tensorboard --output-dir diff
+  datum diff mymodel_inference/ --format tensorboard --output-dir diff
   ```
 
 - Change colors in PASCAL VOC-like `.png` masks:
   ```bash
-  datum project import --format voc --input-path <path/to/voc/dataset>
+  datum import --format voc --input-path <path/to/voc/dataset>
 
   # Create a color map file with desired colors:
   #
@@ -93,24 +93,42 @@ CVAT annotations                             ---> Publication, statistics etc.
   #
   # Save as mycolormap.txt
 
-  datum project export --format voc_segmentation -- --label-map mycolormap.txt
+  datum export --format voc_segmentation -- --label-map mycolormap.txt
   # add "--apply-colormap=0" to save grayscale (indexed) masks
   # check "--help" option for more info
   # use "datum --loglevel debug" for extra conversion info
   ```
 
+- Create a custom COCO-like dataset:
+  ```python
+  import numpy as np
+  from datumaro.components.extractor import (DatasetItem,
+    Bbox, LabelCategories, AnnotationType)
+  from datumaro.components.dataset import Dataset
+
+  dataset = Dataset(categories={
+    AnnotationType.label: LabelCategories.from_iterable(['cat', 'dog'])
+  })
+  dataset.put(DatasetItem(id=0, image=np.ones((5, 5, 3)), annotations=[
+    Bbox(1, 2, 3, 4, label=0),
+  ]))
+  dataset.export('test_dataset', 'coco')
+  ```
+
 <!--lint enable list-item-bullet-indent-->
 <!--lint enable list-item-indent-->
 
 ## Features
 
 [(Back to top)](#table-of-contents)
 
-- Dataset reading, writing, conversion in any direction. Supported formats:
+- Dataset reading, writing, conversion in any direction. [Supported formats](docs/user_manual.md#supported-formats):
   - [COCO](http://cocodataset.org/#format-data) (`image_info`, `instances`, `person_keypoints`, `captions`, `labels`*)
   - [PASCAL VOC](http://host.robots.ox.ac.uk/pascal/VOC/voc2012/htmldoc/index.html) (`classification`, `detection`, `segmentation`, `action_classification`, `person_layout`)
   - [YOLO](https://github.com/AlexeyAB/darknet#how-to-train-pascal-voc-data) (`bboxes`)
   - [TF Detection API](https://github.com/tensorflow/models/blob/master/research/object_detection/g3doc/using_your_own_dataset.md) (`bboxes`, `masks`)
+  - [WIDER Face](http://shuoyang1213.me/WIDERFACE/) (`bboxes`)
+  - [VGGFace2](https://github.com/ox-vgg/vgg_face2) (`landmarks`, `bboxes`)
   - [MOT sequences](https://arxiv.org/pdf/1906.04567.pdf)
   - [MOTS PNG](https://www.vision.rwth-aachen.de/page/mots)
   - [ImageNet](http://image-net.org/)
@@ -129,6 +147,14 @@ CVAT annotations                             ---> Publication, statistics etc.
     - polygons to instance masks and vice versa
     - apply a custom colormap for mask annotations
     - rename or remove dataset labels
+  - Splitting a dataset into multiple subsets like `train`, `val`, and `test`:
+    - random split
+    - task-specific splits based on annotations,
+      which keep initial label and attribute distributions
+      - for classification task, based on labels
+      - for detection task, based on bboxes
+      - for re-identification task, based on labels,
+        avoiding having same IDs in training and test splits
 - Dataset quality checking
   - Simple checking for errors
   - Comparison with model inference
@@ -162,7 +188,7 @@ python -m virtualenv venv
 Install Datumaro package:
 
 ``` bash
-pip install 'git+https://github.com/openvinotoolkit/datumaro'
+pip install datumaro
 ```
 
 ## Usage
@@ -208,13 +234,14 @@ dataset = dataset.transform(project.env.transforms.get('remap_labels'),
   {'cat': 'dog', # rename cat to dog
     'truck': 'car', # rename truck to car
     'person': '', # remove this label
-  }, default='delete')
+  }, default='delete') # remove everything else
 
+# iterate over dataset elements
 for item in dataset:
   print(item.id, item.annotations)
 
 # export the resulting dataset in COCO format
-project.env.converters.get('coco').convert(dataset, save_dir='dst/dir')
+dataset.export('dst/dir', 'coco')
 ```
 
 > Check our [developer guide](docs/developer_guide.md) for additional information.

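Note on the `remap_labels` usage in the last README hunk: the comments there describe three behaviors, namely renaming a label, removing a label by mapping it to an empty string, and removing everything else with `default='delete'`. The pure-Python sketch below shows only that mapping logic. It is not Datumaro's transform; `remap_label` and `remap_labels` are hypothetical helpers, and the `'keep'` mode is an assumption added for contrast.

```python
from typing import Dict, List, Optional


def remap_label(name: str, mapping: Dict[str, str],
        default: str = 'keep') -> Optional[str]:
    """Return the new label name, or None if the label is removed.

    Hypothetical helper for illustration only.
    """
    if name in mapping:
        # An empty target means "remove this label"
        return mapping[name] or None
    if default == 'delete':
        return None
    if default == 'keep':
        return name
    raise ValueError("Unknown default action '%s'" % default)


def remap_labels(names: List[str], mapping: Dict[str, str],
        default: str = 'keep') -> List[str]:
    """Remap every name, dropping removed labels and duplicates."""
    result = []
    for name in names:
        new_name = remap_label(name, mapping, default)
        if new_name is not None and new_name not in result:
            result.append(new_name)
    return result


mapping = {'cat': 'dog', 'truck': 'car', 'person': ''}
labels = ['cat', 'dog', 'truck', 'person', 'bus']
print(remap_labels(labels, mapping, default='delete'))  # ['dog', 'car']
print(remap_labels(labels, mapping, default='keep'))    # ['dog', 'car', 'bus']
```

Note that with `default='delete'` the original `dog` label is dropped as well, because it is not a key of the mapping; only the renamed `cat` survives under that name.
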
diff --git a/datumaro/cli/__init__.py b/datumaro/cli/__init__.py
@@ -1,4 +1,4 @@
 
-# Copyright (C) 2019-2020 Intel Corporation
+# Copyright (C) 2019-2021 Intel Corporation
 #
 # SPDX-License-Identifier: MIT
diff --git a/datumaro/cli/__main__.py b/datumaro/cli/__main__.py
@@ -1,5 +1,5 @@
 
-# Copyright (C) 2019-2020 Intel Corporation
+# Copyright (C) 2019-2021 Intel Corporation
 #
 # SPDX-License-Identifier: MIT
 
@@ -58,18 +58,25 @@ def make_parser():
     _LogManager._define_loglevel_option(parser)
 
     known_contexts = [
-        ('project', contexts.project, "Actions on projects (datasets)"),
-        ('source', contexts.source, "Actions on data sources"),
-        ('model', contexts.model, "Actions on models"),
+        ('project', contexts.project, "Actions with project (deprecated)"),
+        ('source', contexts.source, "Actions with data sources"),
+        ('model', contexts.model, "Actions with models"),
     ]
     known_commands = [
         ('create', commands.create, "Create project"),
-        ('add', commands.add, "Add source to project"),
-        ('remove', commands.remove, "Remove source from project"),
-        ('export', commands.export, "Export project"),
+        ('import', commands.import_, "Create project from existing dataset"),
+        ('add', commands.add, "Add data source to project"),
+        ('remove', commands.remove, "Remove data source from project"),
+        ('export', commands.export, "Export project in some format"),
+        ('filter', commands.filter, "Filter project"),
+        ('transform', commands.transform, "Transform project"),
+        ('merge', commands.merge, "Merge projects"),
+        ('convert', commands.convert, "Convert dataset into another format"),
+        ('diff', commands.diff, "Compare projects with intersection"),
+        ('ediff', commands.ediff, "Compare projects for equality"),
+        ('stats', commands.stats, "Compute project statistics"),
+        ('info', commands.info, "Print project info"),
         ('explain', commands.explain, "Run Explainable AI algorithm for model"),
-        ('merge', commands.merge, "Merge datasets"),
-        ('convert', commands.convert, "Convert dataset"),
     ]
 
     # Argparse doesn't support subparser groups:

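Note on the `known_contexts` / `known_commands` tables above: top-level commands such as `export` and the deprecated `project` context are registered from `(name, module, description)` tuples. The standalone `argparse` sketch below shows the general table-driven pattern under simplifying assumptions: plain handler functions stand in for the command modules, and `make_parser` here is a hypothetical reduction, not the function from this file.

```python
import argparse


def make_parser(known_contexts, known_commands):
    """Build a parser with one subcommand per (name, handler, help) entry."""
    parser = argparse.ArgumentParser(prog='datum')
    subparsers = parser.add_subparsers(dest='command')
    for name, handler, help_text in known_contexts + known_commands:
        sub = subparsers.add_parser(name, help=help_text)
        # Remember which handler to call for this subcommand
        sub.set_defaults(handler=handler)
    return parser


# Hypothetical handlers standing in for the real command modules
def project_handler(args):
    return 'project (deprecated)'


def export_handler(args):
    return 'export'


parser = make_parser(
    known_contexts=[
        ('project', project_handler, "Actions with project (deprecated)"),
    ],
    known_commands=[
        ('export', export_handler, "Export project in some format"),
    ],
)

args = parser.parse_args(['export'])
print(args.command, '->', args.handler(args))
```

Keeping contexts and commands in one flat subparser list is what lets `datum export` and `datum project ...` coexist while the old context is being deprecated.
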
diff --git a/datumaro/cli/commands/__init__.py b/datumaro/cli/commands/__init__.py
@@ -1,6 +1,13 @@
-
-# Copyright (C) 2019-2020 Intel Corporation
+# Copyright (C) 2019-2021 Intel Corporation
 #
 # SPDX-License-Identifier: MIT
 
-from . import add, create, explain, export, remove, merge, convert
+# pylint: disable=redefined-builtin
+
+from . import (
+    create, add, remove, import_,
+    explain,
+    export, merge, convert, transform, filter,
+    diff, ediff, stats,
+    info
+)
diff --git a/datumaro/cli/commands/add.py b/datumaro/cli/commands/add.py
@@ -1,5 +1,4 @@
-
-# Copyright (C) 2019-2020 Intel Corporation
+# Copyright (C) 2020-2021 Intel Corporation
 #
 # SPDX-License-Identifier: MIT
 

diff --git a/datumaro/cli/commands/convert.py b/datumaro/cli/commands/convert.py
@@ -1,5 +1,4 @@
-
-# Copyright (C) 2019-2020 Intel Corporation
+# Copyright (C) 2019-2021 Intel Corporation
 #
 # SPDX-License-Identifier: MIT
 
@@ -9,6 +8,7 @@
 import os.path as osp
 
 from datumaro.components.project import Environment
+from datumaro.components.dataset import Dataset
 
 from ..contexts.project import FilterModes
 from ..util import CliException, MultilineFormatter, make_file_name
@@ -63,51 +63,29 @@ def convert_command(args):
     env = Environment()
 
     try:
-        converter = env.converters.get(args.output_format)
+        converter = env.converters[args.output_format]
     except KeyError:
         raise CliException("Converter for format '%s' is not found" % \
             args.output_format)
-    extra_args = converter.from_cmdline(args.extra_args)
-    def converter_proxy(extractor, save_dir):
-        return converter.convert(extractor, save_dir, **extra_args)
+    extra_args = converter.parse_cmdline(args.extra_args)
 
     filter_args = FilterModes.make_filter_args(args.filter_mode)
 
+    fmt = args.input_format
     if not args.input_format:
-        matches = []
-        for format_name in env.importers.items:
-            log.debug("Checking '%s' format...", format_name)
-            importer = env.make_importer(format_name)
-            try:
-                match = importer.detect(args.source)
-                if match:
-                    log.debug("format matched")
-                    matches.append((format_name, importer))
-            except NotImplementedError:
-                log.debug("Format '%s' does not support auto detection.",
-                    format_name)
-
+        matches = env.detect_dataset(args.source)
         if len(matches) == 0:
             log.error("Failed to detect dataset format. "
                 "Try to specify format with '-if/--input-format' parameter.")
             return 1
         elif len(matches) != 1:
             log.error("Multiple formats match the dataset: %s. "
                 "Try to specify format with '-if/--input-format' parameter.",
-                ', '.join(m[0] for m in matches))
+                ', '.join(matches))
             return 2
 
-        format_name, importer = matches[0]
-        args.input_format = format_name
+        fmt = matches[0]
-        log.info("Source dataset format detected as '%s'", args.input_format)
+        log.info("Source dataset format detected as '%s'", fmt)
-    else:
-        try:
-            importer = env.make_importer(args.input_format)
-            if hasattr(importer, 'from_cmdline'):
-                extra_args = importer.from_cmdline()
-        except KeyError:
-            raise CliException("Importer for format '%s' is not found" % \
-                args.input_format)
 
     source = osp.abspath(args.source)
 
@@ -121,15 +99,12 @@ def converter_proxy(extractor, save_dir):
             (osp.basename(source), make_file_name(args.output_format)))
     dst_dir = osp.abspath(dst_dir)
 
-    project = importer(source)
-    dataset = project.make_dataset()
+    dataset = Dataset.import_from(source, fmt)
 
     log.info("Exporting the dataset")
-    dataset.export_project(
-        save_dir=dst_dir,
-        converter=converter_proxy,
-        filter_expr=args.filter,
-        **filter_args)
+    if args.filter:
+        dataset = dataset.filter(args.filter, **filter_args)
+    dataset.export(format=args.output_format, save_dir=dst_dir, **extra_args)
 
     log.info("Dataset exported to '%s' as '%s'" % \
         (dst_dir, args.output_format))

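Note on the input format handling in `convert_command`: an explicit `--input-format` wins, otherwise the list of detected formats must contain exactly one entry, with exit code 1 for no match and 2 for an ambiguous match. The sketch below isolates that decision so it can be checked without a dataset. `resolve_format` is a hypothetical helper, and `matches` is assumed to be a list of format names, as the `', '.join(matches)` call in the hunk suggests.

```python
from typing import List, Optional, Tuple


def resolve_format(explicit: Optional[str],
        matches: List[str]) -> Tuple[Optional[str], int]:
    """Return (format name, exit code) following the rules in the hunk above.

    Hypothetical helper for illustration only.
    """
    if explicit:
        # A user-specified format skips auto-detection entirely
        return explicit, 0
    if len(matches) == 0:
        return None, 1  # nothing detected
    if len(matches) != 1:
        return None, 2  # ambiguous detection
    return matches[0], 0


print(resolve_format(None, ['coco']))         # ('coco', 0)
print(resolve_format(None, []))               # (None, 1)
print(resolve_format(None, ['coco', 'voc']))  # (None, 2)
print(resolve_format('yolo', ['coco']))       # ('yolo', 0)
```

Returning the resolved name instead of mutating `args.input_format` also makes it harder to log or use the stale, empty argument value after detection.
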
diff --git a/datumaro/cli/commands/create.py b/datumaro/cli/commands/create.py
@@ -1,5 +1,4 @@
-
-# Copyright (C) 2019-2020 Intel Corporation
+# Copyright (C) 2019-2021 Intel Corporation
 #
 # SPDX-License-Identifier: MIT
 

diff --git a/datumaro/cli/commands/diff.py b/datumaro/cli/commands/diff.py
@@ -0,0 +1,7 @@
+# Copyright (C) 2019-2021 Intel Corporation
+#
+# SPDX-License-Identifier: MIT
+
+# pylint: disable=unused-import
+
+from ..contexts.project import build_diff_parser as build_parser
diff --git a/datumaro/cli/commands/ediff.py b/datumaro/cli/commands/ediff.py
@@ -0,0 +1,7 @@
+# Copyright (C) 2019-2021 Intel Corporation
+#
+# SPDX-License-Identifier: MIT
+
+# pylint: disable=unused-import
+
+from ..contexts.project import build_ediff_parser as build_parser
diff --git a/datumaro/cli/commands/explain.py b/datumaro/cli/commands/explain.py
@@ -1,5 +1,4 @@
-
-# Copyright (C) 2019-2020 Intel Corporation
+# Copyright (C) 2019-2021 Intel Corporation
 #
 # SPDX-License-Identifier: MIT
 

diff --git a/datumaro/cli/commands/export.py b/datumaro/cli/commands/export.py
@@ -1,5 +1,4 @@
-
-# Copyright (C) 2019-2020 Intel Corporation
+# Copyright (C) 2019-2021 Intel Corporation
 #
 # SPDX-License-Identifier: MIT
 

diff --git a/datumaro/cli/commands/filter.py b/datumaro/cli/commands/filter.py
@@ -0,0 +1,7 @@
+# Copyright (C) 2020-2021 Intel Corporation
+#
+# SPDX-License-Identifier: MIT
+
+# pylint: disable=unused-import
+
+from ..contexts.project import build_filter_parser as build_parser