[WIP] Add datasets to config api (pipelines) #585

Open · wants to merge 16 commits into main
Conversation

@leoromanovich (Collaborator) commented Jun 9, 2024

For now, we don't require you to follow any predefined issue schema, but make sure you've already read our contribution guide: https://open-metric-learning.readthedocs.io/en/latest/from_readme/contributing.html.

  • Example of making a custom dataset builder
  • Change docs
  • Keep some configs in the old format (to check that nothing breaks)

@leoromanovich (Collaborator, Author)

@AlekseySh what do you think about an implementation like that?

Outdated review thread (resolved) on oml/registry/datasets.py
@AlekseySh (Contributor) left a comment

Thanks for the PR.

We also need to update the documentation here, and also update all existing examples in pipelines.

Outdated review threads (all resolved) on:

  • oml/lightning/pipelines/parser.py
  • oml/lightning/pipelines/logging.py
  • oml/lightning/pipelines/parser.py
  • tests/test_runs/test_pipelines/configs/train.yaml
  • tests/test_runs/test_pipelines/configs/validate.yaml
  • oml/registry/datasets.py
  • oml/registry/datasets.py
  • tests/test_oml/test_registry/test_registry.py
  • tests/test_oml/test_registry/test_registry.py
@AlekseySh (Contributor)

@leoromanovich, by the way, we also need to update the postprocessing pipeline. It seems like get_loaders_with_embeddings is already in a format very close to what we expect in the builders registry, right?
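
For reference, a registry entry in that builder format might look roughly like this (a minimal sketch; build_image_datasets and DATASETS_REGISTRY are illustrative names, not OML's actual API):

# sketch only: illustrative names, not OML's actual API
from typing import Any, Callable, Dict, Tuple

from torch.utils.data import Dataset


def build_image_datasets(cfg: Dict[str, Any]) -> Tuple[Dataset, Dataset]:
    # a builder consumes the datasets sub-config and returns ready-to-use
    # train/val datasets, similar to what get_loaders_with_embeddings prepares
    ...


DATASETS_REGISTRY: Dict[str, Callable[[Dict[str, Any]], Tuple[Dataset, Dataset]]] = {
    "oml_image_dataset": build_image_datasets,
}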

@leoromanovich (Collaborator, Author)

> by the way, we also need to update the postprocessing pipeline. It seems like get_loaders_with_embeddings is already in a format very close to what we expect in the builders registry, right?

Looks like it :) I've added some changes, and the tests pass.

@leoromanovich (Collaborator, Author) commented Jul 18, 2024

@AlekseySh please check the changes for the reranking builder.
What I don't like about the current solution: because a feature extractor is used inside, we need to pass extractor args (like precision, num_workers, bs_inference) into the builder.
Not critical, but if we decide that the reranking dataset builder approach is good enough, I believe we should move those options, which are used in several other places, higher up in the config.
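
To illustrate the coupling: the reranking builder ends up with a signature roughly like the sketch below (hypothetical; the parameter names mirror the options mentioned above):

# hypothetical sketch: inference options leak into the dataset builder
# because it runs a feature extractor internally
from typing import Any, Dict

from torch.utils.data import Dataset


def build_reranking_datasets(
    dataset_root: str,
    dataframe_name: str,
    extractor_cfg: Dict[str, Any],
    precision: int = 32,      # these three arguably belong higher
    num_workers: int = 8,     # up in the config, next to the other
    bs_inference: int = 128,  # inference settings
) -> Dataset:
    ...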

@@ -0,0 +1,29 @@
name: "oml_reranking_dataset"
@AlekseySh (Contributor):

"oml_reranking_dataset" assumes a way the dataset is going to be used, which is not the dataset's business.
Let's name it oml_image_dataset_with_embeddings.

@@ -0,0 +1,29 @@
name: "oml_reranking_dataset"
args:
# dataset_root: /path/to/dataset/ # use your dataset root here
@AlekseySh (Contributor):

Let's do as in the test_pipelines (check the entrypoint and config).

args:
# dataset_root: /path/to/dataset/ # use your dataset root here
dataframe_name: df.csv
# embeddings_cache_dir: ${args.dataset_root} # CACHE EMBEDDINGS PRODUCED BY BASELINE FEATURE EXTRACTOR
@AlekseySh (Contributor):

Let's use None. And why is the comment in UPPER CASE?

if cfg["datasets"]["args"].get("dataframe_name", None) is not None:
# log dataframe
self.run["dataset"].upload(
str(Path(cfg["datasets"]["args"]["dataset_root"]) / cfg["datasets"]["args"]["dataframe_name"])
@AlekseySh (Contributor):

You only checked that dataframe_name is present; what if there is no dataset_root? I think text datasets don't have a root, but they do have a dataframe.

@AlekseySh (Contributor):

The same for the changes above in this file.

@AlekseySh (Contributor):

Let's also check that dataset_root is present.

@AlekseySh (Contributor):

And let's not repeat ourselves: introduce a function

def dataframe_from_cfg_if_presented()

and reuse it in all places.
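
A possible shape for that helper, assuming the cfg layout from the snippets above (whether it returns the path or the loaded dataframe is still a design choice):

# sketch of the suggested helper, assuming the cfg layout shown above
from pathlib import Path
from typing import Any, Dict, Optional


def dataframe_from_cfg_if_presented(cfg: Dict[str, Any]) -> Optional[Path]:
    args = cfg["datasets"]["args"]
    dataframe_name = args.get("dataframe_name")
    if dataframe_name is None:
        return None
    # text datasets may have a dataframe but no root, so root is optional
    dataset_root = args.get("dataset_root")
    return Path(dataset_root) / dataframe_name if dataset_root else Path(dataframe_name)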

@@ -104,7 +104,32 @@ def parse_ckpt_callback_from_config(cfg: TCfg) -> ModelCheckpoint:
)


def convert_to_new_format_if_needed(
@AlekseySh (Contributor):

Discussed offline: we don't support backward compatibility for the re-ranking pipeline.

@@ -30,12 +30,12 @@ def extractor_prediction_pipeline(cfg: TCfg) -> None:

pprint(cfg)

transforms = get_transforms_by_cfg(cfg["transforms_predict"])
filenames = [list(Path(cfg["data_dir"]).glob(f"**/*.{ext}")) for ext in IMAGE_EXTENSIONS]
transforms = get_transforms_by_cfg(cfg["datasets"]["args"]["transforms_predict"])
@AlekseySh (Contributor):

Something weird happened here; I think you should not touch this pipeline at the moment.

@AlekseySh (Contributor)

Based on the offline discussion:

------------------------
PREDICT.YAML

precision: 32
accelerator: gpu
devices: 1

dataset:
  name: BaseImgDataset
  im_paths: ...  # or im_dir
  transforms_predict:
    name: norm_resize_albu
    args:
      im_size: 224

save_dir: "."

bs: 64
num_workers: 10

extractor:
  name: vit
  args:
    arch: vits16
    normalise_features: False
    use_multi_scale: False
    weights: vits16_cars

hydra:
  run:
    dir: ${save_dir}
  searchpath:
   - pkg://oml.configs
  job:
    chdir: True

-------------------
VALIDATE.YAML

accelerator: gpu
devices: 1
precision: 32


bs_val: 256
num_workers: 8

val_dataset:
  name: image_dataset
  dataframe_name: df_with_bboxes.csv  # df/path_to_df
  args:
    dataset_root: data/CARS196/
    transforms_val:
      name: norm_resize_albu
      args:
        im_size: 224

extractor:
  name: vit
  args:
    arch: vits16
    normalise_features: False
    use_multi_scale: False
    weights: vits16_cars

metric_args:
  metrics_to_exclude_from_visualization: [cmc,]
  cmc_top_k: [1, 5]
  map_top_k: [5]
  precision_top_k: [5]
  fmr_vals: [0.01]
  pcf_variance: [0.5, 0.9, 0.99]
  return_only_overall_category: False
  visualize_only_overall_category: True

hydra:
  searchpath:
   - pkg://oml.configs
  job:
    chdir: True

-----------------------------
REGISTRY.PY

REGISTRY_DATASETS = {
  "image_dataset": ImageQGLDataset
}

def get_dataset_by_cfg(cfg, split_val=None):
  if split_val and "dataframe_name" in cfg:
    df = pd.read_csv(cfg["dataframe_name"])
    df = df[df.split == split_val]

  return REGISTRY_DATASETS["image_dataset"](df)  # or pass df_path instead of df


-------------------------------
PIPELINES.PY


train_dataset = get_dataset_by_cfg(cfg["train_dataset"], split_val="train")
val_dataset = get_dataset_by_cfg(cfg["valid_dataset"], split_val="validate")
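
Filled out just enough to run, the REGISTRY.PY sketch could look as follows (ImageQGLDataset here is a stand-in class; the name-based dispatch and split filtering follow the sketch above, everything else is assumed):

# runnable version of the sketch above; ImageQGLDataset is a stand-in
from typing import Any, Dict, Optional

import pandas as pd


class ImageQGLDataset:  # stand-in for the real query/gallery/labels dataset
    def __init__(self, df: Optional[pd.DataFrame] = None, **kwargs: Any) -> None:
        self.df = df
        self.extra = kwargs


REGISTRY_DATASETS: Dict[str, Any] = {
    "image_dataset": ImageQGLDataset,
}


def get_dataset_by_cfg(cfg: Dict[str, Any], split_val: Optional[str] = None) -> Any:
    df = None
    if split_val and "dataframe_name" in cfg:
        df = pd.read_csv(cfg["dataframe_name"])
        df = df[df["split"] == split_val].reset_index(drop=True)
    return REGISTRY_DATASETS[cfg["name"]](df=df, **cfg.get("args", {}))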

@leoromanovich (Collaborator, Author) commented Jul 28, 2024

Looks like we can't avoid a small builder, because of transforms initialisation and mapping.
From this:

REGISTRY_DATASETS = {
  "image_dataset": ImageQGLDataset
}

def get_dataset_by_cfg(cfg, split_val=None):
  if split_val and "dataframe_name" in cfg:
    df = pd.read_csv(cfg["dataframe_name"])
    df = df[df.split == split_val]

  return REGISTRY_DATASETS["image_dataset"](df)  # or pass df_path instead of df

to something like that:

def qg_builder(args):
    transforms = ...
    dataset = QGDataset(...)
    return dataset


REGISTRY_DATASETS = {
  "image_qg_dataset": qg_builder,  # builder defined above so the name resolves
}


def get_dataset_by_cfg(cfg, split_val=None):
  if split_val and "dataframe_name" in cfg:
    df = pd.read_csv(cfg["dataframe_name"])
    mapper = {...}
    df = df[df.split == split_val]
    df["label"] = df["label"].map(mapper)  # column name illustrative
    cfg["df"] = df  # because not all datasets will have a dataframe, we pack it inside cfg
  return REGISTRY_DATASETS["image_qg_dataset"](cfg)

Update: we decided offline to resolve transforms before dataset initialisation; a sketch follows below.
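
A sketch of that decision: transforms are resolved first and handed to the dataset ready-made, so the per-dataset builder stays thin (registry contents and the dataset class are illustrative; get_transforms_by_cfg is the existing helper already used in this PR, import path assumed):

# sketch: resolve transforms before the dataset is initialised
from oml.registry.transforms import get_transforms_by_cfg  # path assumed


class QGDataset:  # stand-in for the real dataset class
    def __init__(self, transform=None, **kwargs):
        self.transform = transform
        self.extra = kwargs


REGISTRY_DATASETS = {"image_qg_dataset": QGDataset}


def get_dataset_by_cfg(cfg, split_val=None):
    args = dict(cfg["args"])
    # transforms are built here, before the dataset, and passed in ready-made
    transforms = get_transforms_by_cfg(args.pop("transforms"))
    return REGISTRY_DATASETS[cfg["name"]](transform=transforms, **args)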

Status: Review in progress

Merging this pull request may close: Allow changing Dataset (class) in Pipelines

3 participants