[RFC] Datasets API #53

Open
4 tasks
hassiahk opened this issue Nov 24, 2020 · 2 comments
Labels: datasets (Providing Datasets to users), enhancement (New feature or request)

Comments

@hassiahk
Contributor

hassiahk commented Nov 24, 2020

🚀 Feature

Having a Datasets API for commonly used formats would come in handy.

Pitch

A non-exhaustive list of formats that are commonly used:

  • CSV file with image_id and target columns (Binary or Multi-Class Classification). Two layouts are most common:
image_id        target
100011               1
100015               0
100007               2

The layout above is already implemented by CSVSingleLabelDataset. Should we support the layout below, where image_id includes the file extension, in the same class, or create a separate one? I think one class can handle both.

image_id        target
100011.png           1
100015.png           0
100007.png           2
  • CSV file with image_id and target columns (Multi-Label Classification). Similarly, two layouts are most common, without and with the file extension in image_id:
image_id        target
100011             0 1
100015             0 2
100007             1 2

image_id        target
100011.png         0 1
100015.png         0 2
100007.png         1 2
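
A minimal sketch of how one reader could cover all four classification layouts above. It is not the existing CSVSingleLabelDataset; the helper names `resolve_image_path`, `parse_target`, and `read_classification_csv` are hypothetical. It assumes a comma-separated file with an `image_id,target` header, multiple labels separated by spaces inside the target cell, and `.png` as the default extension.

```python
import csv
import io
from pathlib import Path


def resolve_image_path(image_dir, image_id, default_ext=".png"):
    """Build the image path, adding default_ext only when image_id has no extension."""
    name = str(image_id).strip()
    if not Path(name).suffix:
        name += default_ext
    return Path(image_dir) / name


def parse_target(raw, multi_label=False):
    """Turn "1" into 1, or (with multi_label=True) "0 2" into [0, 2]."""
    labels = [int(token) for token in raw.split()]
    return labels if multi_label else labels[0]


def read_classification_csv(csv_file, image_dir, multi_label=False, default_ext=".png"):
    """Return (image_path, target) pairs from an open CSV file object."""
    return [
        (
            resolve_image_path(image_dir, row["image_id"], default_ext),
            parse_target(row["target"], multi_label),
        )
        for row in csv.DictReader(csv_file)
    ]


# Both image_id styles can appear in the same file.
demo = io.StringIO("image_id,target\n100011,0 1\n100015.png,0 2\n")
samples = read_classification_csv(demo, "train", multi_label=True)
```

Keeping `multi_label` as an explicit flag avoids guessing from the data: a multi-label row that happens to carry a single label still comes back as a list.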
  • Folder structure like below:
folder
|-- test
`-- train
    |-- class_1
    |   |-- 10001.png
    |   `-- 10002.png
    |-- class_2
    |   |-- 10005.png
    |   `-- 10009.png
    `-- class_3
        |-- 10014.png
        `-- 10027.png

The layout above is implemented by create_folder_dataset, but we don't always need to split train into train_set and valid_set, because valid_set may be pre-defined, as below:

folder
|-- test
|-- train
|   |-- class_1
|   |   |-- 10001.png
|   |   `-- 10002.png
|   |-- class_2
|   |   |-- 10005.png
|   |   `-- 10009.png
|   `-- class_3
|       |-- 10014.png
|       `-- 10027.png
`-- valid
    |-- class_1
    |   |-- 10023.png
    |   `-- 10035.png
    |-- class_2
    |   |-- 1002.png
    |   `-- 10042.png
    `-- class_3
        |-- 10029.png
        `-- 10076.png
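
A rough sketch of the behaviour described above: use a pre-defined valid folder when it exists, and fall back to splitting train otherwise. This is not the signature of create_folder_dataset; `list_classes`, `list_samples`, `load_train_valid`, and the `valid_fraction` and `seed` parameters are hypothetical names chosen for illustration.

```python
import random
from pathlib import Path


def list_classes(split_dir):
    """Class names are the sorted sub-folder names of a split directory."""
    return sorted(p.name for p in Path(split_dir).iterdir() if p.is_dir())


def list_samples(split_dir, classes):
    """Return (image_path, class_index) pairs for a class-per-folder layout."""
    samples = []
    for index, name in enumerate(classes):
        class_dir = Path(split_dir) / name
        if not class_dir.is_dir():
            continue  # a split may be missing a class
        for image_path in sorted(class_dir.iterdir()):
            if image_path.is_file():
                samples.append((image_path, index))
    return samples


def load_train_valid(folder, valid_fraction=0.2, seed=0):
    """Return (train_samples, valid_samples) for either folder layout."""
    folder = Path(folder)
    # Class indices always come from train so both splits agree on them.
    classes = list_classes(folder / "train")
    train = list_samples(folder / "train", classes)

    valid_dir = folder / "valid"
    if valid_dir.is_dir():
        # Pre-defined validation set: keep all of train, use valid as-is.
        return train, list_samples(valid_dir, classes)

    # No valid folder: fall back to a seeded random split of train.
    shuffled = train[:]
    random.Random(seed).shuffle(shuffled)
    n_valid = int(len(shuffled) * valid_fraction)
    return shuffled[n_valid:], shuffled[:n_valid]
```

Deriving the class list from train only is a design choice of this sketch, so that a class missing from valid does not shift the indices.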
  • CSV file with image_id and bbox columns (Object Detection). As with the classification tasks, two layouts are most common:
image_id        bbox
100011          [834.0, 222.0, 56.0, 36.0]
100011          [226.0, 548.0, 130.0, 58.0]
100007          [377.0, 504.0, 74.0, 160.0]

Honestly, I have never seen the format below, but we can still support it.

image_id             bbox
100011.jpg           [834.0, 222.0, 56.0, 36.0]
100011.jpg           [226.0, 548.0, 130.0, 58.0]
100007.jpg           [377.0, 504.0, 74.0, 160.0]
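
Since one image can span several rows in the detection layout, a reader has to group boxes by image_id. The sketch below shows one way to do that; `read_bbox_csv` is a hypothetical helper, and it assumes a comma-separated file in which each bbox cell is a quoted, JSON-style list of four numbers.

```python
import csv
import io
import json
from collections import defaultdict


def read_bbox_csv(csv_file):
    """Group bbox rows by image_id: {image_id: [[v1, v2, v3, v4], ...]}."""
    boxes_by_image = defaultdict(list)
    for row in csv.DictReader(csv_file):
        # The bbox cell is text such as "[834.0, 222.0, 56.0, 36.0]".
        box = [float(value) for value in json.loads(row["bbox"])]
        boxes_by_image[row["image_id"]].append(box)
    return dict(boxes_by_image)


demo = io.StringIO(
    "image_id,bbox\n"
    '100011,"[834.0, 222.0, 56.0, 36.0]"\n'
    '100011,"[226.0, 548.0, 130.0, 58.0]"\n'
    '100007,"[377.0, 504.0, 74.0, 160.0]"\n'
)
boxes = read_bbox_csv(demo)
```

The same function works whether or not image_id carries a file extension, because the id is only used as a grouping key here.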

I have come across only the four formats above, but do let me know if I missed any. Also let me know your thoughts on the above.

cc @zhiqwang

@hassiahk hassiahk added the enhancement New feature or request label Nov 24, 2020
@oke-aditya oke-aditya added the datasets Providing Datasets to users label Nov 24, 2020
@zhiqwang
Contributor

zhiqwang commented Nov 24, 2020

For object detection, there are two other frequently used formats: Pascal VOC and MS COCO, both of which are supported in torchvision. Were these two left out because we plan to simply use torchvision's implementations when we encounter them?

@oke-aditya
Owner

I think we should discuss this further. Datasets are really tricky, especially when it comes to object detection.
For the torchvision models, we expect VOC format.
And for DETR, a normalized YOLO format.
We haven't enforced these ourselves, as they come from the models themselves.
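
To make the difference between the two box formats concrete, here is a small conversion sketch. It assumes VOC-style boxes are absolute `[x_min, y_min, x_max, y_max]` and that the normalized YOLO-style box is `[center_x, center_y, width, height]` divided by the image size. It also assumes the CSV boxes in the issue are `[x_min, y_min, width, height]`, which the issue does not state. The function names and the 1024x1024 image size are illustrative only.

```python
def xywh_to_xyxy(box):
    """[x_min, y_min, width, height] -> [x_min, y_min, x_max, y_max]."""
    x_min, y_min, width, height = box
    return [x_min, y_min, x_min + width, y_min + height]


def xyxy_to_normalized_cxcywh(box, image_width, image_height):
    """Absolute corner box -> [center_x, center_y, width, height] in the 0..1 range."""
    x_min, y_min, x_max, y_max = box
    return [
        (x_min + x_max) / 2 / image_width,
        (y_min + y_max) / 2 / image_height,
        (x_max - x_min) / image_width,
        (y_max - y_min) / image_height,
    ]


# First box from the CSV example, on a hypothetical 1024x1024 image.
voc_box = xywh_to_xyxy([834.0, 222.0, 56.0, 36.0])
yolo_box = xyxy_to_normalized_cxcywh(voc_box, image_width=1024, image_height=1024)
```

Doing such conversions in one place in the dataset layer would let each model keep the format it expects.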
