Merge branch 'develop' into dependabot/pip/pyyaml-6.0.2

sovrasov authored Nov 29, 2024
2 parents 11e5b9c + 75797c5 commit f01a7aa
Showing 157 changed files with 3,724 additions and 320 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/code_scan.yml
Original file line number Diff line number Diff line change
@@ -31,14 +31,14 @@ jobs:
mkdir -p .ci/base/docs
pip-compile -o .ci/base/docs/requirements.txt docs/requirements.txt
- name: Run Trivy Scan (full, csv)
uses: aquasecurity/trivy-action@6e7b7d1fd3e4fef0c5fa8cce1229c54b2c9bd0d8 # 0.24.0
uses: aquasecurity/trivy-action@18f2510ee396bbf400402947b394f2dd8c87dbb0 # 0.29.0
with:
trivy-config: ".ci/trivy-csv.yaml"
scan-type: 'fs'
scan-ref: ".ci/"
scanners: vuln,secret
- name: Run Trivy Scan (prod, spdx.json)
uses: aquasecurity/trivy-action@6e7b7d1fd3e4fef0c5fa8cce1229c54b2c9bd0d8 # 0.24.0
uses: aquasecurity/trivy-action@18f2510ee396bbf400402947b394f2dd8c87dbb0 # 0.29.0
with:
trivy-config: ".ci/trivy-json.yaml"
scan-type: 'fs'
4 changes: 2 additions & 2 deletions .github/workflows/codeql.yml
@@ -52,7 +52,7 @@ jobs:

# Initializes the CodeQL tools for scanning.
- name: Initialize CodeQL
uses: github/codeql-action/init@afb54ba388a7dca6ecae48f608c4ff05ff4cc77a # v3.25.15
uses: github/codeql-action/init@f09c1c0a94de965c15400f5634aa42fac8fb8f88 # v3.27.5
with:
languages: ${{ matrix.language }}
# If you wish to specify custom queries, you can do so here or in a config file.
@@ -73,7 +73,7 @@ jobs:
python -m build
- name: Perform CodeQL Analysis
uses: github/codeql-action/analyze@afb54ba388a7dca6ecae48f608c4ff05ff4cc77a # v3.25.15
uses: github/codeql-action/analyze@f09c1c0a94de965c15400f5634aa42fac8fb8f88 # v3.27.5
with:
category: "/language:${{matrix.language}}"
- name: Generate Security Report
4 changes: 2 additions & 2 deletions .github/workflows/issue_assignment.yml
@@ -14,9 +14,9 @@ jobs:

steps:
- name: Auto-assign Issue
uses: pozil/[email protected].0
uses: pozil/[email protected].1
with:
repo-token: ${{ secrets.GITHUB_TOKEN }}
assignees: vinnamkim,jihyeonyi,sooahleex,itrushkin
assignees: jihyeonyi,sooahleex,itrushkin
numOfAssignee: 1
allowSelfAssign: false
2 changes: 1 addition & 1 deletion .github/workflows/pr_check.yml
@@ -62,6 +62,6 @@ jobs:
run: |
tox -vvv -e tests-py${{ matrix.tox-env-py }}-${{ matrix.tox-env-os }} -- tests/integration
- name: Upload coverage reports to Codecov
uses: codecov/codecov-action@v4
uses: codecov/codecov-action@v5
with:
flags: ${{ matrix.os }}_Python-${{ matrix.python-version }}
4 changes: 2 additions & 2 deletions .github/workflows/publish_to_pypi.yml
@@ -80,12 +80,12 @@ jobs:
file_glob: true
- name: Publish package distributions to PyPI
if: ${{ steps.check-tag.outputs.match != '' }}
uses: pypa/gh-action-pypi-publish@v1.9.0
uses: pypa/gh-action-pypi-publish@v1.12.2
with:
password: ${{ secrets.PYPI_API_TOKEN }}
- name: Publish package distributions to TestPyPI
if: ${{ steps.check-tag.outputs.match == '' }}
uses: pypa/gh-action-pypi-publish@v1.9.0
uses: pypa/gh-action-pypi-publish@v1.12.2
with:
password: ${{ secrets.TESTPYPI_API_TOKEN }}
repository-url: https://test.pypi.org/legacy/
2 changes: 1 addition & 1 deletion .github/workflows/scorecard.yml
@@ -67,6 +67,6 @@ jobs:

# Upload the results to GitHub's code scanning dashboard.
- name: "Upload to code-scanning"
uses: github/codeql-action/upload-sarif@afb54ba388a7dca6ecae48f608c4ff05ff4cc77a # v3.25.15
uses: github/codeql-action/upload-sarif@f09c1c0a94de965c15400f5634aa42fac8fb8f88 # v3.27.5
with:
sarif_file: results.sarif
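A pattern worth noting in the workflow updates above: every third-party action is bumped to a new full commit SHA, with the released tag kept as a trailing comment. As a rough illustration of why that matters, here is a small, hypothetical checker (not part of this repository) that flags `uses:` references pinned to a mutable tag instead of a 40-character commit SHA:

```python
import re

# Matches "uses: owner/repo@ref"; the ref may be a tag, branch, or commit SHA.
USES_RE = re.compile(r"uses:\s*(?P<action>[\w./-]+)@(?P<ref>[\w.]+)")
SHA_RE = re.compile(r"\A[0-9a-f]{40}\Z")

def unpinned_actions(workflow_text: str) -> list[str]:
    """Return actions referenced by tag/branch instead of a full commit SHA."""
    offenders = []
    for match in USES_RE.finditer(workflow_text):
        if not SHA_RE.match(match.group("ref")):
            offenders.append(match.group("action"))
    return offenders

workflow = """
      - uses: aquasecurity/trivy-action@18f2510ee396bbf400402947b394f2dd8c87dbb0 # 0.29.0
      - uses: codecov/codecov-action@v5
"""
print(unpinned_actions(workflow))  # only the tag-pinned codecov action is flagged
```

Tags can be moved to point at different code after release, while a full SHA is immutable, which is why security scanners such as Scorecard reward SHA pinning.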
17 changes: 17 additions & 0 deletions 3rd-party.txt
@@ -7518,5 +7518,22 @@ Apache-2.0
See the License for the specific language governing permissions and
limitations under the License.
-------------------------------------------------------------
portalocker

BSD-3-Clause

Copyright 2022 Rick van Hattem

Redistribution and use in source and binary forms, with or without modification, are permitted provided that the following conditions are met:

1. Redistributions of source code must retain the above copyright notice, this list of conditions and the following disclaimer.

2. Redistributions in binary form must reproduce the above copyright notice, this list of conditions and the following disclaimer in the documentation and/or other materials provided with the distribution.

3. Neither the name of the copyright holder nor the names of its contributors may be used to endorse or promote products derived from this software without specific prior written permission.

THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT HOLDER OR CONTRIBUTORS BE LIABLE FOR ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES (INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES; LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT (INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.

-------------------------------------------------------------

* Other names and brands may be claimed as the property of others.
65 changes: 64 additions & 1 deletion CHANGELOG.md
@@ -5,16 +5,79 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/),
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## \[unreleased\]
## \[Unreleased\]

### New features
- Convert Cuboid2D annotation to/from 3D data
(<https://github.com/openvinotoolkit/datumaro/pull/1639>)
- Add label groups for hierarchical classification in ImageNet
(<https://github.com/openvinotoolkit/datumaro/pull/1645>)

### Enhancements
- Enhance 'id_from_image_name' transform to ensure each identifier is unique
(<https://github.com/openvinotoolkit/datumaro/pull/1635>)
- Optimize path assignment to handle point cloud in JSON without images
(<https://github.com/openvinotoolkit/datumaro/pull/1643>)
- Add documentation for framework conversion
(<https://github.com/openvinotoolkit/datumaro/pull/1659>)

### Bug fixes
- Fix assertion to compare hashkeys against expected value
(<https://github.com/openvinotoolkit/datumaro/pull/1641>)

## Q4 2024 Release 1.10.0

### New features
- Support KITTI 3D format
(<https://github.com/openvinotoolkit/datumaro/pull/1619>, <https://github.com/openvinotoolkit/datumaro/pull/1621>)
- Add PseudoLabeling transform for unlabeled dataset
(<https://github.com/openvinotoolkit/datumaro/pull/1594>)

### Enhancements
- Raise an appropriate error when exporting a datumaro dataset if its subset name contains path separators.
(<https://github.com/openvinotoolkit/datumaro/pull/1615>)
- Update docs for transform plugins
(<https://github.com/openvinotoolkit/datumaro/pull/1599>)
- Update ov ir model for explorer openvino launcher with CLIP ViT-L/14@336px model
(<https://github.com/openvinotoolkit/datumaro/pull/1603>)
- Optimize path assignment to handle point cloud in JSON without images
(<https://github.com/openvinotoolkit/datumaro/pull/1643>)
- Set TabularTransform to process clean transform in parallel
(<https://github.com/openvinotoolkit/datumaro/pull/1648>)

### Bug fixes
- Fix datumaro format to load visibility information from Points annotations
(<https://github.com/openvinotoolkit/datumaro/pull/1644>)

## Q4 2024 Release 1.9.1
### Enhancements
- Support multiple labels for kaggle format
(<https://github.com/openvinotoolkit/datumaro/pull/1607>)
- Use DataFrame.map instead of DataFrame.applymap
(<https://github.com/openvinotoolkit/datumaro/pull/1613>)

### Bug fixes
- Fix StreamDataset merging when importing in eager mode
(<https://github.com/openvinotoolkit/datumaro/pull/1609>)

## Q3 2024 Release 1.9.0
### New features
- Add a new CLI command: datum format
(<https://github.com/openvinotoolkit/datumaro/pull/1570>)
- Add a new Cuboid2D annotation type
(<https://github.com/openvinotoolkit/datumaro/pull/1601>)
- Support language dataset for DmTorchDataset
(<https://github.com/openvinotoolkit/datumaro/pull/1592>)

### Enhancements
- Change _Shape to Shape and add comments for subclasses of Shape
(<https://github.com/openvinotoolkit/datumaro/pull/1568>)
- Fix `kitti_raw` importer and exporter for dimensions (height, width, length) in meters
(<https://github.com/openvinotoolkit/datumaro/pull/1596>)

### Bug fixes
- Fix KITTI-3D importer and exporter
(<https://github.com/openvinotoolkit/datumaro/pull/1596>)

## Q3 2024 Release 1.8.0
### New features
1 change: 1 addition & 0 deletions README.md
@@ -2,6 +2,7 @@

[![Build status](https://github.com/openvinotoolkit/datumaro/actions/workflows/health_check.yml/badge.svg)](https://github.com/openvinotoolkit/datumaro/actions/workflows/health_check.yml)
[![codecov](https://codecov.io/gh/openvinotoolkit/datumaro/branch/develop/graph/badge.svg?token=FG25VU096Q)](https://codecov.io/gh/openvinotoolkit/datumaro)
[![Downloads](https://static.pepy.tech/badge/datumaro)](https://pepy.tech/project/datumaro)

A framework and CLI tool to build, transform, and analyze datasets.

59 changes: 27 additions & 32 deletions docker/segment-anything/requirements.txt
@@ -371,38 +371,33 @@ numpy==1.26.4 \
# onnxruntime
# opencv-python
# pycocotools
onnx==1.16.0 \
--hash=sha256:034ae21a2aaa2e9c14119a840d2926d213c27aad29e5e3edaa30145a745048e1 \
--hash=sha256:03a627488b1a9975d95d6a55582af3e14c7f3bb87444725b999935ddd271d352 \
--hash=sha256:0e60ca76ac24b65c25860d0f2d2cdd96d6320d062a01dd8ce87c5743603789b8 \
--hash=sha256:0efeb46985de08f0efe758cb54ad3457e821a05c2eaf5ba2ccb8cd1602c08084 \
--hash=sha256:209fe84995a28038e29ae8369edd35f33e0ef1ebc3bddbf6584629823469deb1 \
--hash=sha256:237c6987c6c59d9f44b6136f5819af79574f8d96a760a1fa843bede11f3822f7 \
--hash=sha256:257858cbcb2055284f09fa2ae2b1cfd64f5850367da388d6e7e7b05920a40c90 \
--hash=sha256:298f28a2b5ac09145fa958513d3d1e6b349ccf86a877dbdcccad57713fe360b3 \
--hash=sha256:30f02beaf081c7d9fa3a8c566a912fc4408e28fc33b1452d58f890851691d364 \
--hash=sha256:3e0860fea94efde777e81a6f68f65761ed5e5f3adea2e050d7fbe373a9ae05b3 \
--hash=sha256:5202559070afec5144332db216c20f2fff8323cf7f6512b0ca11b215eacc5bf3 \
--hash=sha256:62a2e27ae8ba5fc9b4a2620301446a517b5ffaaf8566611de7a7c2160f5bcf4c \
--hash=sha256:66300197b52beca08bc6262d43c103289c5d45fde43fb51922ed1eb83658cf0c \
--hash=sha256:70a90649318f3470985439ea078277c9fb2a2e6e2fd7c8f3f2b279402ad6c7e6 \
--hash=sha256:71839546b7f93be4fa807995b182ab4b4414c9dbf049fee11eaaced16fcf8df2 \
--hash=sha256:7449241e70b847b9c3eb8dae622df8c1b456d11032a9d7e26e0ee8a698d5bf86 \
--hash=sha256:7532343dc5b8b5e7c3e3efa441a3100552f7600155c4db9120acd7574f64ffbf \
--hash=sha256:7665217c45a61eb44718c8e9349d2ad004efa0cb9fbc4be5c6d5e18b9fe12b52 \
--hash=sha256:7755cbd5f4e47952e37276ea5978a46fc8346684392315902b5ed4a719d87d06 \
--hash=sha256:77579e7c15b4df39d29465b216639a5f9b74026bdd9e4b6306cd19a32dcfe67c \
--hash=sha256:7fb29a9a692b522deef1f6b8f2145da62c0c43ea1ed5b4c0f66f827fdc28847d \
--hash=sha256:81b4ee01bc554e8a2b11ac6439882508a5377a1c6b452acd69a1eebb83571117 \
--hash=sha256:8cf3e518b1b1b960be542e7c62bed4e5219e04c85d540817b7027029537dec92 \
--hash=sha256:9eadbdce25b19d6216f426d6d99b8bc877a65ed92cbef9707751c6669190ba4f \
--hash=sha256:ae0029f5e47bf70a1a62e7f88c80bca4ef39b844a89910039184221775df5e43 \
--hash=sha256:c392faeabd9283ee344ccb4b067d1fea9dfc614fa1f0de7c47589efd79e15e78 \
--hash=sha256:d7886c05aa6d583ec42f6287678923c1e343afc4350e49d5b36a0023772ffa22 \
--hash=sha256:ddf14a3d32234f23e44abb73a755cb96a423fac7f004e8f046f36b10214151ee \
--hash=sha256:e5752bbbd5717304a7643643dba383a2fb31e8eb0682f4e7b7d141206328a73b \
--hash=sha256:ec22a43d74eb1f2303373e2fbe7fbcaa45fb225f4eb146edfed1356ada7a9aea \
--hash=sha256:f51179d4af3372b4f3800c558d204b592c61e4b4a18b8f61e0eea7f46211221a
onnx==1.17.0 \
--hash=sha256:0141c2ce806c474b667b7e4499164227ef594584da432fd5613ec17c1855e311 \
--hash=sha256:081ec43a8b950171767d99075b6b92553901fa429d4bc5eb3ad66b36ef5dbe3a \
--hash=sha256:0e906e6a83437de05f8139ea7eaf366bf287f44ae5cc44b2850a30e296421f2f \
--hash=sha256:23b8d56a9df492cdba0eb07b60beea027d32ff5e4e5fe271804eda635bed384f \
--hash=sha256:317870fca3349d19325a4b7d1b5628f6de3811e9710b1e3665c68b073d0e68d7 \
--hash=sha256:3193a3672fc60f1a18c0f4c93ac81b761bc72fd8a6c2035fa79ff5969f07713e \
--hash=sha256:38b5df0eb22012198cdcee527cc5f917f09cce1f88a69248aaca22bd78a7f023 \
--hash=sha256:3d955ba2939878a520a97614bcf2e79c1df71b29203e8ced478fa78c9a9c63c2 \
--hash=sha256:3e19fd064b297f7773b4c1150f9ce6213e6d7d041d7a9201c0d348041009cdcd \
--hash=sha256:48ca1a91ff73c1d5e3ea2eef20ae5d0e709bb8a2355ed798ffc2169753013fd3 \
--hash=sha256:4a183c6178be001bf398260e5ac2c927dc43e7746e8638d6c05c20e321f8c949 \
--hash=sha256:4f3fb5cc4e2898ac5312a7dc03a65133dd2abf9a5e520e69afb880a7251ec97a \
--hash=sha256:5ca7a0894a86d028d509cdcf99ed1864e19bfe5727b44322c11691d834a1c546 \
--hash=sha256:659b8232d627a5460d74fd3c96947ae83db6d03f035ac633e20cd69cfa029227 \
--hash=sha256:67e1c59034d89fff43b5301b6178222e54156eadd6ab4cd78ddc34b2f6274a66 \
--hash=sha256:76884fe3e0258c911c749d7d09667fb173365fd27ee66fcedaf9fa039210fd13 \
--hash=sha256:8167295f576055158a966161f8ef327cb491c06ede96cc23392be6022071b6ed \
--hash=sha256:95c03e38671785036bb704c30cd2e150825f6ab4763df3a4f1d249da48525957 \
--hash=sha256:d545335cb49d4d8c47cc803d3a805deb7ad5d9094dc67657d66e568610a36d7d \
--hash=sha256:d6fc3a03fc0129b8b6ac03f03bc894431ffd77c7d79ec023d0afd667b4d35869 \
--hash=sha256:dfd777d95c158437fda6b34758f0877d15b89cbe9ff45affbedc519b35345cf9 \
--hash=sha256:e4673276b558b5b572b960b7f9ef9214dce9305673683eb289bb97a7df379a4b \
--hash=sha256:ea5023a8dcdadbb23fd0ed0179ce64c1f6b05f5b5c34f2909b4e927589ebd0e4 \
--hash=sha256:ecf2b617fd9a39b831abea2df795e17bac705992a35a98e1f0363f005c4a5247 \
--hash=sha256:f01a4b63d4e1d8ec3e2f069e7b798b2955810aa434f7361f01bc8ca08d69cce4 \
--hash=sha256:f0e437f8f2f0c36f629e9743d28cf266312baa90be6a899f405f78f2d4cb2e1d
# via segment_anything (./segment-anything/setup.py)
onnxruntime==1.17.1 \
--hash=sha256:2dff1a24354220ac30e4a4ce2fb1df38cb1ea59f7dac2c116238d63fe7f4c5ff \
4 changes: 2 additions & 2 deletions docs/requirements.txt
@@ -5,9 +5,9 @@
opencv-python-headless==4.10.0.84

# docs
markupsafe==2.1.5
markupsafe==3.0.2
nbconvert>=7.2.3
ipython==8.26.0
ipython==8.29.0
sphinx==7.2.6
pydata-sphinx-theme==0.15.2
sphinx-copybutton
2 changes: 1 addition & 1 deletion docs/source/docs/command-reference/context_free/prune.md
@@ -13,7 +13,7 @@ Prune supports various methodology.

By default, datasets are updated in-place. The `-o/--output-dir` option can be used to specify another output directory. When updating in-place, use the `--overwrite` parameter (in-place updates fail by default to prevent data loss), unless a project target is modified.

The current project (`-p/--project`) is also used as a context for plugins, so it can be useful for datasest paths having custom formats. When not specified, the current project's working tree is used.
The current project (`-p/--project`) is also used as a context for plugins, so it can be useful for dataset paths having custom formats. When not specified, the current project's working tree is used.

The command can be applied to a dataset or a project build target, a stage or the combined `project` target, in which case all the project targets will be affected.

85 changes: 81 additions & 4 deletions docs/source/docs/command-reference/context_free/transform.md
@@ -101,7 +101,10 @@ Basic dataset item manipulations:
- [`remove_images`](#remove_images) - Removes specific images
- [`remove_annotations`](#remove_annotations) - Removes annotations
- [`remove_attributes`](#remove_attributes) - Removes attributes
- [`astype_annotations`](#astype_annotations) - Convert annotation type
- [`astype_annotations`](#astype_annotations) - Transforms annotation types
- [`pseudo_labeling`](#pseudo_labeling) - Generates pseudo labels for unlabeled data
- [`correct`](#correct) - Corrects a dataset based on a validation report
- [`clean`](#clean) - Removes noisy data from tabular datasets

Subset manipulations:
- [`random_split`](#random_split) - Splits dataset into subsets
@@ -173,15 +176,36 @@

#### `id_from_image_name`

Renames items in the dataset using image file name (without extension).
Renames items in the dataset based on the image file name, excluding the extension.
When `ensure_unique` is enabled, a random suffix is appended to ensure each identifier is unique
in cases where the image name is not distinct. By default, the random suffix is three characters long,
but this can be adjusted with the `suffix_length` parameter.

Usage:
```console
id_from_image_name [-h]
id_from_image_name [-h] [-u] [-l SUFFIX_LENGTH]
```

Optional arguments:
- `-h`, `--help` (flag) - Show this help message and exit
- `-h`, `--help` (flag) - show this help message and exit
- `-u`, `--ensure_unique` (flag) - Appends a random suffix to ensure each identifier is unique if the image name is duplicated
- `-l`, `--suffix_length` (int) - Alters the length of the random suffix when `ensure_unique` is enabled (default: 3)

Examples:
- Renames items without duplication check
```console
datum transform -t id_from_image_name
```

- Renames items with duplication check
```console
datum transform -t id_from_image_name -- --ensure_unique
```

- Renames items with duplication check and a custom suffix length (default: 3)
```console
datum transform -t id_from_image_name -- --ensure_unique --suffix_length 2
```
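The `ensure_unique` behavior described above — appending a short random suffix only when a derived identifier collides — can be sketched roughly as follows (a standalone illustration, not Datumaro's actual implementation):

```python
import random
import string

def unique_ids(image_names, suffix_length=3):
    """Derive item ids from image file names, randomly suffixing duplicates."""
    seen = set()
    ids = []
    for name in image_names:
        stem = name.rsplit(".", 1)[0]  # drop the file extension
        item_id = stem
        while item_id in seen:  # collision: retry with a random suffix
            suffix = "".join(random.choices(string.ascii_lowercase, k=suffix_length))
            item_id = f"{stem}-{suffix}"
        seen.add(item_id)
        ids.append(item_id)
    return ids

print(unique_ids(["cat.jpg", "cat.png", "dog.jpg"]))
```

Without the suffixing step, `cat.jpg` and `cat.png` would both map to the id `cat`, which is exactly the duplication this transform option guards against.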

#### `reindex`

@@ -826,6 +850,35 @@
Optional arguments:
- `-h`, `--help` (flag) - Show this help message and exit

#### `pseudo_labeling`

Assigns pseudo-labels to items in a dataset based on their similarity to predefined labels. This transform is useful for semi-supervised learning when dealing with missing or uncertain labels.

The process includes:

- Similarity Computation: Uses hashing techniques to compute the similarity between items and predefined labels.
- Pseudo-Label Assignment: Assigns the most similar label as a pseudo-label to each item.

Attributes:

- `extractor` (IDataset) - Provides access to dataset items and their annotations.
- `labels` (Optional[List[str]]) - List of predefined labels for pseudo-labeling. Defaults to all available labels if not provided.
- `explorer` (Optional[Explorer]) - Computes hash keys for items and labels. If not provided, a new Explorer is created.

Usage:
```console
pseudo_labeling [-h] [--labels LABELS]
```

Optional arguments:
- `-h`, `--help` (flag) - Show this help message and exit
- `--labels` (str) - Comma-separated list of label names for pseudo-labeling

Examples:
- Assign pseudo-labels based on predefined labels
```console
datum transform -t pseudo_labeling -- --labels 'label1,label2'
```
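The two steps described above — computing similarity, then assigning the most similar label — can be illustrated with plain cosine similarity over toy feature vectors. The actual transform computes hash keys via an Explorer; everything below is a simplified, hypothetical stand-in:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

def pseudo_label(item_vec, label_vecs):
    """Assign the label whose vector is most similar to the item's vector."""
    return max(label_vecs, key=lambda name: cosine(item_vec, label_vecs[name]))

# Toy "embeddings" for two labels and one unlabeled item.
labels = {"cat": [1.0, 0.1], "dog": [0.1, 1.0]}
print(pseudo_label([0.9, 0.2], labels))  # closest to the "cat" vector
```

The real implementation differs in how similarity is computed (hashing rather than dense vectors), but the assignment rule — pick the nearest predefined label — is the same idea.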

#### `correct`

Corrects the dataset based on a validation report.
@@ -838,3 +891,27 @@
Optional arguments:
- `-h`, `--help` (flag) - Show this help message and exit
- `-r`, `--reports` (str) - A validation report from a 'validate' CLI (default=validation_reports.json)

#### `clean`

Refines and preprocesses media items in a dataset, focusing on string, numeric, and categorical data. This transform is designed to clean and improve the quality of the data, making it more suitable for analysis and modeling.

The cleaning process includes:

- String Data: Removes unnecessary characters using NLP techniques.
- Numeric Data: Identifies and handles outliers and missing values.
- Categorical Data: Cleans and refines categorical information.

Usage:
```console
clean [-h]
```

Optional arguments:
- `-h`, `--help` (flag) - Show this help message and exit

Examples:
- Clean and preprocess dataset items
```console
datum transform -t clean
```
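For the numeric side of the cleaning described above — handling outliers and missing values — an interquartile-range (IQR) rule is a common choice. The sketch below is illustrative only and is not the transform's actual algorithm:

```python
import statistics

def clean_numeric(values, k=1.5):
    """Replace IQR outliers and missing values (None) with the median."""
    present = [v for v in values if v is not None]
    q1, _, q3 = statistics.quantiles(present, n=4, method="inclusive")
    iqr = q3 - q1
    low, high = q1 - k * iqr, q3 + k * iqr  # Tukey's fences
    median = statistics.median(present)
    return [median if v is None or v < low or v > high else v for v in values]

print(clean_numeric([10, 12, 11, None, 300, 13]))
```

Here the missing value and the obvious outlier (300) are both replaced by the column median, leaving the in-range values untouched.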