Allow changing Dataset (class) in Pipelines #528

AlekseySh · 2024-04-12T00:18:41Z

Let's allow changing Dataset in Pipelines.
This also means we need a registry for Datasets.

Note, let's keep backward compatibility with the previous format. We can add a condition that checks whether the dataset config is in the old format, for example, whether one of the old keys is present (like dataframe_name or dataset_root). If you see such keys, first reorganize the yaml/dict, then process it with the updated parser.


leoromanovich · 2024-06-08T12:04:58Z

I've started work. WIP PR will be attached shortly.

AlekseySh · 2024-06-08T14:19:01Z

@leoromanovich great, waiting for it

leoromanovich · 2024-06-09T16:02:49Z

Started work here: #585

AlekseySh · 2024-06-09T17:42:10Z

Let's start with the first PR, where we don't add text support but refactor how image datasets are processed.
In particular, the get_retrieval_datasets function used to be hardcoded; now we introduce a registry of functions like it.

Registry

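from typing import Tuple

import pandas as pd

def build_img_dataset(cfg) -> Tuple[ILD, IQGLD]:
    df = pd.read_csv(cfg["df_path"])
    # pseudocode in the original: df = enumerate(df)
    # split by the "split" column (assuming the values "train" / "validation")
    df_train = df[df["split"] == "train"]
    df_val = df[df["split"] == "validation"]

    dataset_train = ImageLD(df_train)
    dataset_val = ImageQGLD(df_val)

    # or just reuse get_retrieval_datasets

    return dataset_train, dataset_val

def build_txt_dataset(cfg) -> Tuple[ILD, IQGLD]:
    pass

# the registry is defined after the builders so that the names already exist
DATASETS_BUILDER_REGISTRY = {"oml_img_datasets": build_img_dataset, "oml_txt_datasets": build_txt_dataset}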

...
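The lookup side of the registry is not spelled out above. Below is a minimal sketch of how it could work, assuming the config carries a `dataset_builder` name and an `args` mapping as in the Config.yaml example. `get_datasets_builder` is the name used in the pipeline snippet of this comment; its body, the placeholder builders (which return strings instead of real datasets) and the error message are illustrative assumptions, not the OML implementation.

```python
from typing import Any, Callable, Dict, Tuple


# Placeholder builders: real ones would return (train, val) dataset objects.
def build_img_dataset(args: Dict[str, Any]) -> Tuple[str, str]:
    return f"img_train({args['df']})", f"img_val({args['df']})"


def build_txt_dataset(args: Dict[str, Any]) -> Tuple[str, str]:
    return f"txt_train({args['df']})", f"txt_val({args['df']})"


DATASETS_BUILDER_REGISTRY: Dict[str, Callable[[Dict[str, Any]], Tuple[Any, Any]]] = {
    "oml_img_datasets": build_img_dataset,
    "oml_txt_datasets": build_txt_dataset,
}


def get_datasets_builder(cfg: Dict[str, Any]) -> Tuple[Any, Any]:
    """Resolve the builder named in cfg and call it with cfg["args"]."""
    name = cfg["dataset_builder"]
    if name not in DATASETS_BUILDER_REGISTRY:
        known = ", ".join(sorted(DATASETS_BUILDER_REGISTRY))
        raise KeyError(f"Unknown dataset builder '{name}'. Known builders: {known}")
    return DATASETS_BUILDER_REGISTRY[name](cfg.get("args", {}))


cfg = {"dataset_builder": "oml_img_datasets", "args": {"df": "df_full.csv"}}
dataset_train, dataset_val = get_datasets_builder(cfg)
```

Failing early with the list of known builders should make a typo in the config easier to spot than a bare KeyError.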

Update oml.configs and test_registry, and update the doc about pipeline customisation: link

Config.yaml

dataset_builder: oml_img_datasets
args:
    df: df_full.csv
    cache_size: 100
    transforms_train:
      name: hypvit_resize
      args:
        im_size: 224
    transforms_val:
      name: hypvit_resize
      args:
        im_size: 224

Backward compatibility

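def convert_to_new_format_if_needed(cfg):
    # old-format configs are recognised by their top-level keys
    old_keys = ("dataset_root", "transforms_train", ...)  # key list is elided in the original
    if all(key in cfg for key in old_keys):
        cfg["dataset_train"] = {"name": "image_label_dataset", "args": {"df": ..., "transform": ...}}
        # don't forget to delete refactored keys
        ...
    return cfg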

def extractor_training_pipeline(cfg):
    cfg = dictconfig_to_dict(cfg)
    cfg = convert_to_new_format_if_needed(cfg)

    dataset_train, dataset_val = get_datasets_builder(cfg)
    # check types with isinstance; "is" would compare against the class object itself
    assert isinstance(dataset_train, ILD) and isinstance(dataset_val, IQGLD)
    assert check_consistency(dataset_train, dataset_val)
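To make the conversion step concrete, here is a minimal sketch that works on plain dicts. It follows the rule from the issue description (a config is old-style if at least one old key is present) and targets the `dataset_builder` / `args` layout from the Config.yaml example. The exact list of old keys and the choice of `oml_img_datasets` as the default builder are assumptions made for illustration.

```python
import copy
from typing import Any, Dict

# Assumption: keys that live at the top level of an old-format config.
OLD_DATASET_KEYS = (
    "dataset_root",
    "dataframe_name",
    "cache_size",
    "transforms_train",
    "transforms_val",
)


def convert_to_new_format_if_needed(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Return cfg unchanged if no old key is present, else a converted copy."""
    if not any(key in cfg for key in OLD_DATASET_KEYS):
        return cfg

    new_cfg = copy.deepcopy(cfg)
    # Move every old key that is present under "args", removing it from the top level.
    args = {key: new_cfg.pop(key) for key in OLD_DATASET_KEYS if key in new_cfg}
    new_cfg["dataset_builder"] = "oml_img_datasets"
    new_cfg["args"] = args
    return new_cfg


old_cfg = {
    "dataset_root": "/data",
    "dataframe_name": "df.csv",
    "transforms_train": {"name": "hypvit_resize", "args": {"im_size": 224}},
    "max_epochs": 10,
}
new_cfg = convert_to_new_format_if_needed(old_cfg)
```

Working on a deep copy keeps the caller's config intact, and running the function on an already converted config returns it as is, so calling it twice is harmless.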

Update mock dataset and pipelines test

@hydra.main(config_path="configs", config_name="train_postprocessor.yaml", version_base=HYDRA_BEHAVIOUR)
def main_hydra(cfg: DictConfig) -> None:
    cfg = dictconfig_to_dict(cfg)
    download_mock_dataset(MOCK_DATASET_PATH) 
    cfg["dataset_builder"]["dataset_root"] = str(MOCK_DATASET_PATH)
    extractor_training_pipeline(cfg)


if __name__ == "__main__":
    main_hydra()

TESTS PIPELINES

We keep some configs in the old format.
We rework some configs to the new format.
One of the configs uses a custom dataset builder (which is just a mocked default img_dataset_builder); use custom_augmentations in train_with_bboxes for reference.
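The custom-builder test case could look roughly like the sketch below: a custom builder is registered under its own name and simply delegates to the default one, so the test only checks that the registry plumbing works. The `register_builder` decorator, the `custom_img_datasets` name and the placeholder return values are hypothetical and are not part of OML.

```python
from typing import Any, Callable, Dict, Tuple

Builder = Callable[[Dict[str, Any]], Tuple[Any, Any]]
DATASETS_BUILDER_REGISTRY: Dict[str, Builder] = {}


def register_builder(name: str) -> Callable[[Builder], Builder]:
    """Decorator that stores a builder in the registry under the given name."""

    def decorator(fn: Builder) -> Builder:
        DATASETS_BUILDER_REGISTRY[name] = fn
        return fn

    return decorator


@register_builder("oml_img_datasets")
def build_img_dataset(args: Dict[str, Any]) -> Tuple[Any, Any]:
    # Placeholder for the default image builder.
    return ("train", args["df"]), ("val", args["df"])


@register_builder("custom_img_datasets")
def build_custom_img_dataset(args: Dict[str, Any]) -> Tuple[Any, Any]:
    # The "custom" builder only delegates to the default one, as in the test plan.
    return build_img_dataset(args)


def test_custom_builder_matches_default() -> None:
    args = {"df": "df_full.csv"}
    default = DATASETS_BUILDER_REGISTRY["oml_img_datasets"](args)
    custom = DATASETS_BUILDER_REGISTRY["custom_img_datasets"](args)
    assert custom == default


test_custom_builder_matches_default()
```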

AlekseySh added the rework label Apr 12, 2024

AlekseySh changed the title ~~Allow changing dataset in Pipelines~~ Allow changing Dataset (class) in Pipelines Apr 12, 2024

AlekseySh added the new feature label Jun 7, 2024

AlekseySh assigned leoromanovich Jun 7, 2024

AlekseySh linked a pull request Jun 9, 2024 that will close this issue

[WIP] Add datasets to config api (pipelines) #585

Open

3 tasks

github-project-automation bot added this to OML-planning Aug 30, 2024

github-project-automation bot moved this to In progress in OML-planning Aug 30, 2024
