[WIP] Add datasets to config api (pipelines) #585

Open · wants to merge 16 commits into main
Conversation

@leoromanovich (Collaborator) commented Jun 9, 2024

For now, we don't require you to follow any predefined issue schema, but make sure you've already read our contribution guide: https://open-metric-learning.readthedocs.io/en/latest/from_readme/contributing.html.

  • Example of making a custom dataset builder
  • Change docs
  • Keep some configs in the old format (to check that nothing breaks)

@leoromanovich (Collaborator, Author)

@AlekseySh what do you think about an implementation like that?

Outdated review thread (resolved) on oml/registry/datasets.py
@AlekseySh (Contributor) left a comment

Thanks for the PR.

We also need to update the documentation here, and also update all existing examples in pipelines.

Outdated review threads (all resolved) on:

  • oml/lightning/pipelines/parser.py
  • oml/lightning/pipelines/logging.py
  • oml/lightning/pipelines/parser.py
  • tests/test_runs/test_pipelines/configs/train.yaml
  • tests/test_runs/test_pipelines/configs/validate.yaml
  • oml/registry/datasets.py
  • oml/registry/datasets.py
  • tests/test_oml/test_registry/test_registry.py
  • tests/test_oml/test_registry/test_registry.py
@AlekseySh (Contributor)

@leoromanovich, by the way, we also need to update the postprocessing pipeline. It seems like get_loaders_with_embeddings is already in a format very close to what we expect in the builders registry, right?
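
For reference, a registry entry in that builder format might look roughly like this (a minimal sketch; build_image_datasets and DATASETS_REGISTRY are illustrative names, not OML's actual API):

# sketch only: illustrative names, not OML's actual API
from typing import Any, Callable, Dict, Tuple

from torch.utils.data import Dataset


def build_image_datasets(cfg: Dict[str, Any]) -> Tuple[Dataset, Dataset]:
    # a builder consumes the datasets sub-config and returns ready-to-use
    # train/val datasets, similar to what get_loaders_with_embeddings prepares
    ...


DATASETS_REGISTRY: Dict[str, Callable[[Dict[str, Any]], Tuple[Dataset, Dataset]]] = {
    "oml_image_dataset": build_image_datasets,
}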

@leoromanovich (Collaborator, Author)

> by the way, we also need to update the postprocessing pipeline. It seems like get_loaders_with_embeddings is already in a format very close to what we expect in the builders registry, right?

Looks like it :) I've added some changes, and the tests pass.

@leoromanovich (Collaborator, Author) commented Jul 18, 2024

@AlekseySh please check the changes for the reranking builder.
What I don't like about the current solution: because a feature extractor is used inside, we need to pass extractor args (like precision, num_workers, bs_inference) into the builder.
Not critical, but if we decide that the reranking dataset builder approach is good enough, I believe we should move those options, which are used in several other places, higher up in the config.
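
To illustrate the coupling: the reranking builder ends up with a signature roughly like the sketch below (hypothetical; the parameter names mirror the options mentioned above):

# hypothetical sketch: inference options leak into the dataset builder
# because it runs a feature extractor internally
from typing import Any, Dict

from torch.utils.data import Dataset


def build_reranking_datasets(
    dataset_root: str,
    dataframe_name: str,
    extractor_cfg: Dict[str, Any],
    precision: int = 32,      # these three arguably belong higher
    num_workers: int = 8,     # up in the config, next to the other
    bs_inference: int = 128,  # inference settings
) -> Dataset:
    ...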

@@ -0,0 +1,29 @@
name: "oml_reranking_dataset"
@AlekseySh (Contributor):

"oml_reranking_dataset" assumes a way the dataset is going to be used, which is not the dataset's business.
Let's name it oml_image_dataset_with_embeddings.

@@ -0,0 +1,29 @@
name: "oml_reranking_dataset"
args:
# dataset_root: /path/to/dataset/ # use your dataset root here
@AlekseySh (Contributor):

Let's do as in the test_pipelines (check the entrypoint and config).

args:
# dataset_root: /path/to/dataset/ # use your dataset root here
dataframe_name: df.csv
# embeddings_cache_dir: ${args.dataset_root} # CACHE EMBEDDINGS PRODUCED BY BASELINE FEATURE EXTRACTOR
@AlekseySh (Contributor):

Let's use None. And why is the comment in UPPER CASE?

if cfg["datasets"]["args"].get("dataframe_name", None) is not None:
# log dataframe
self.run["dataset"].upload(
str(Path(cfg["datasets"]["args"]["dataset_root"]) / cfg["datasets"]["args"]["dataframe_name"])
@AlekseySh (Contributor):

You only checked that dataframe_name is present; what if there is no dataset_root? I think text datasets don't have a root, but they do have a dataframe.

@AlekseySh (Contributor):

The same for the changes above in this file.

@AlekseySh (Contributor):

Let's also check that dataset_root is present.

@AlekseySh (Contributor):

And let's not repeat ourselves: introduce a function

def dataframe_from_cfg_if_presented()

and reuse it in all places.
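
A possible shape for that helper, assuming the cfg layout from the snippets above (whether it returns the path or the loaded dataframe is still a design choice):

# sketch of the suggested helper, assuming the cfg layout shown above
from pathlib import Path
from typing import Any, Dict, Optional


def dataframe_from_cfg_if_presented(cfg: Dict[str, Any]) -> Optional[Path]:
    args = cfg["datasets"]["args"]
    dataframe_name = args.get("dataframe_name")
    if dataframe_name is None:
        return None
    # text datasets may have a dataframe but no root, so root is optional
    dataset_root = args.get("dataset_root")
    return Path(dataset_root) / dataframe_name if dataset_root else Path(dataframe_name)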

@@ -104,7 +104,32 @@ def parse_ckpt_callback_from_config(cfg: TCfg) -> ModelCheckpoint:
)


def convert_to_new_format_if_needed(
@AlekseySh (Contributor):

Discussed offline: we don't support backward compatibility for the re-ranking pipeline.

@@ -30,12 +30,12 @@ def extractor_prediction_pipeline(cfg: TCfg) -> None:

pprint(cfg)

transforms = get_transforms_by_cfg(cfg["transforms_predict"])
filenames = [list(Path(cfg["data_dir"]).glob(f"**/*.{ext}")) for ext in IMAGE_EXTENSIONS]
transforms = get_transforms_by_cfg(cfg["datasets"]["args"]["transforms_predict"])
@AlekseySh (Contributor):

Something weird happened here; I think you should not touch this pipeline at the moment.

@AlekseySh (Contributor)

Based on the offline discussion:

------------------------
PREDICT.YAML

precision: 32
accelerator: gpu
devices: 1

dataset:
  name: BaseImgDataset
  im_paths: ...  # or im_dir
  transforms_predict:
    name: norm_resize_albu
    args:
      im_size: 224

save_dir: "."

bs: 64
num_workers: 10

extractor:
  name: vit
  args:
    arch: vits16
    normalise_features: False
    use_multi_scale: False
    weights: vits16_cars

hydra:
  run:
    dir: ${save_dir}
  searchpath:
   - pkg://oml.configs
  job:
    chdir: True

-------------------
VALIDATE.YAML

accelerator: gpu
devices: 1
precision: 32


bs_val: 256
num_workers: 8

val_dataset:
  name: image_dataset
  dataframe_name: df_with_bboxes.csv  # df/path_to_df
  args:
    dataset_root: data/CARS196/
    transforms_val:
      name: norm_resize_albu
      args:
        im_size: 224

extractor:
  name: vit
  args:
    arch: vits16
    normalise_features: False
    use_multi_scale: False
    weights: vits16_cars

metric_args:
  metrics_to_exclude_from_visualization: [cmc,]
  cmc_top_k: [1, 5]
  map_top_k: [5]
  precision_top_k: [5]
  fmr_vals: [0.01]
  pcf_variance: [0.5, 0.9, 0.99]
  return_only_overall_category: False
  visualize_only_overall_category: True

hydra:
  searchpath:
   - pkg://oml.configs
  job:
    chdir: True

-----------------------------
REGISTRY.PY

REGISTRY_DATASETS = {
  "image_dataset": ImageQGLDataset
}

def get_dataset_by_cfg(cfg, split_val=None):
  if split_val and "dataframe_name" in cfg:
    df = pd.read_csv(cfg["dataframe_name"])
    df = df[df.split == split_val]

  return REGISTRY_DATASETS["image_dataset"](df)  # or pass df_path instead of df


-------------------------------
PIPELINES.PY


train_dataset = get_dataset_by_cfg(cfg["train_dataset"], split_val="train")
val_dataset = get_dataset_by_cfg(cfg["valid_dataset"], split_val="validate")
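
Filled out just enough to run, the REGISTRY.PY sketch could look as follows (ImageQGLDataset here is a stand-in class; the name-based dispatch and split filtering follow the sketch above, everything else is assumed):

# runnable version of the sketch above; ImageQGLDataset is a stand-in
from typing import Any, Dict, Optional

import pandas as pd


class ImageQGLDataset:  # stand-in for the real query/gallery/labels dataset
    def __init__(self, df: Optional[pd.DataFrame] = None, **kwargs: Any) -> None:
        self.df = df
        self.extra = kwargs


REGISTRY_DATASETS: Dict[str, Any] = {
    "image_dataset": ImageQGLDataset,
}


def get_dataset_by_cfg(cfg: Dict[str, Any], split_val: Optional[str] = None) -> Any:
    df = None
    if split_val and "dataframe_name" in cfg:
        df = pd.read_csv(cfg["dataframe_name"])
        df = df[df["split"] == split_val].reset_index(drop=True)
    return REGISTRY_DATASETS[cfg["name"]](df=df, **cfg.get("args", {}))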

@leoromanovich (Collaborator, Author) commented Jul 28, 2024

Looks like we can't avoid a small builder, because of transforms initialisation and mapping.
From this:

REGISTRY_DATASETS = {
  "image_dataset": ImageQGLDataset
}

def get_dataset_by_cfg(cfg, split_val=None):
  if split_val and "dataframe_name" in cfg:
    df = pd.read_csv(cfg["dataframe_name"])
    df = df[df.split == split_val]

  return REGISTRY_DATASETS["image_dataset"](df)  # or pass df_path instead of df

to something like that:

def qg_builder(args):
    transforms = ...
    dataset = QGDataset(...)
    return dataset


REGISTRY_DATASETS = {
  "image_qg_dataset": qg_builder,  # builder defined above so the name resolves
}


def get_dataset_by_cfg(cfg, split_val=None):
  if split_val and "dataframe_name" in cfg:
    df = pd.read_csv(cfg["dataframe_name"])
    mapper = {...}
    df = df[df.split == split_val]
    df["label"] = df["label"].map(mapper)  # column name illustrative
    cfg["df"] = df  # because not all datasets will have a dataframe, we pack it inside cfg
  return REGISTRY_DATASETS["image_qg_dataset"](cfg)

Update: we decided offline to resolve transforms before dataset initialisation; a sketch follows below.
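
A sketch of that decision: transforms are resolved first and handed to the dataset ready-made, so the per-dataset builder stays thin (registry contents and the dataset class are illustrative; get_transforms_by_cfg is the existing helper already used in this PR, import path assumed):

# sketch: resolve transforms before the dataset is initialised
from oml.registry.transforms import get_transforms_by_cfg  # path assumed


class QGDataset:  # stand-in for the real dataset class
    def __init__(self, transform=None, **kwargs):
        self.transform = transform
        self.extra = kwargs


REGISTRY_DATASETS = {"image_qg_dataset": QGDataset}


def get_dataset_by_cfg(cfg, split_val=None):
    args = dict(cfg["args"])
    # transforms are built here, before the dataset, and passed in ready-made
    transforms = get_transforms_by_cfg(args.pop("transforms"))
    return REGISTRY_DATASETS[cfg["name"]](transform=transforms, **args)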

Status: Review in progress

Merging this pull request may close: Allow changing Dataset (class) in Pipelines

3 participants