[WIP] Add datasets to config api (pipelines) #585
base: main
Conversation
test for registry passed
add dataset registry config to configs
forgot to run previously
@AlekseySh what do you think about an implementation like that?
Thanks for the PR.
We also need to update the documentation here.
Also, update all existing examples in pipelines.
@leoromanovich by the way, we also need to update the postprocessing pipeline
Looks like it) I've added some changes, tests passed)
@AlekseySh Check the changes for the reranking builder, please.
    @@ -0,0 +1,29 @@
    name: "oml_reranking_dataset"
"oml_reranking_dataset" assumes the way the dataset is going to be used, which is not the dataset's business.
Let's name it oml_image_dataset_with_embeddings.
    @@ -0,0 +1,29 @@
    name: "oml_reranking_dataset"
    args:
      # dataset_root: /path/to/dataset/ # use your dataset root here
Let's do as in the test_pipelines (check the entrypoint and config).
    args:
      # dataset_root: /path/to/dataset/ # use your dataset root here
      dataframe_name: df.csv
      # embeddings_cache_dir: ${args.dataset_root} # CACHE EMBEDDINGS PRODUCED BY BASELINE FEATURE EXTRACTOR
Let's use None. And why is the comment in UPPER CASE?
    if cfg["datasets"]["args"].get("dataframe_name", None) is not None:
        # log dataframe
        self.run["dataset"].upload(
            str(Path(cfg["datasets"]["args"]["dataset_root"]) / cfg["datasets"]["args"]["dataframe_name"])
        )
You only checked that you have dataframe_name; what if you don't have dataset_root? I think text datasets don't have a root, but they have a dataframe.
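To make the failure mode concrete, here is a minimal reproduction with a dummy config (the dict layout mirrors the snippet quoted above; this is an illustration, not OML code):

```python
from pathlib import Path

# A text-like dataset config: it has a dataframe_name but no dataset_root.
cfg = {"datasets": {"args": {"dataframe_name": "df.csv"}}}

args = cfg["datasets"]["args"]
path = None
if args.get("dataframe_name", None) is not None:
    try:
        # Mirrors the quoted code: direct indexing assumes dataset_root exists.
        path = str(Path(args["dataset_root"]) / args["dataframe_name"])
    except KeyError:
        # The guard should also check dataset_root before uploading.
        path = None

print(path)  # None: no path can be built without dataset_root
```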
the same for the changes above in this file
Let's also check that dataset_root is present.
And let's not repeat ourselves: introduce a function
def dataframe_from_cfg_if_presented()
and reuse it in all places.
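The suggested helper might look like this (a sketch only; the function name comes from the comment above, the config layout from the snippets in this thread, and a None return signals "nothing to log"):

```python
from pathlib import Path
from typing import Optional


def dataframe_from_cfg_if_presented(cfg) -> Optional[Path]:
    # Return the dataframe path only if BOTH dataset_root and dataframe_name
    # are present; text datasets may have a dataframe but no root.
    args = cfg["datasets"]["args"]
    dataset_root = args.get("dataset_root")
    dataframe_name = args.get("dataframe_name")
    if dataset_root is None or dataframe_name is None:
        return None
    return Path(dataset_root) / dataframe_name
```

Call sites then reduce to `path = dataframe_from_cfg_if_presented(cfg)` followed by an upload only when `path is not None`, which covers the missing-root and missing-name cases in one place.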
    @@ -104,7 +104,32 @@ def parse_ckpt_callback_from_config(cfg: TCfg) -> ModelCheckpoint:
        )


    def convert_to_new_format_if_needed(
Discussed offline: we don't support backward compatibility for the re-ranking pipeline.
    @@ -30,12 +30,12 @@ def extractor_prediction_pipeline(cfg: TCfg) -> None:

        pprint(cfg)

        transforms = get_transforms_by_cfg(cfg["transforms_predict"])
        filenames = [list(Path(cfg["data_dir"]).glob(f"**/*.{ext}")) for ext in IMAGE_EXTENSIONS]
        transforms = get_transforms_by_cfg(cfg["datasets"]["args"]["transforms_predict"])
something weird happened here, I think you should not touch this pipeline at the moment
based on offline discussion:
Looks like we can't avoid a small builder, because of transforms initialisation and mapping.

    REGISTRY_DATASETS = {
        "image_dataset": ImageQGLDataset
    }

    def get_dataset_by_cfg(cfg, split_val=None):
        if split_val and dataframe_name in cfg:
            df = pd.read_csv(cfg["dataframe_name"])
            df = df[df.split == split_val]
        return REGISTRY_DATASETS["image_dataset"](df_path/df)

to something like that:

    REGISTRY_DATASETS = {
        "image_qg_dataset": qg_builder,
    }

    def qg_builder(args):
        transforms = ...
        dataset = QGDataset(....)
        return dataset

    def get_dataset_by_cfg(cfg, split_val=None):
        if split_val and dataframe_name in cfg:
            df = pd.read_csv(cfg["dataframe_name"])
            mapper = {...}
            df = df[df.split == split_val]
            df = df[].map(mapper)
            cfg['df'] = df  # because not all datasets will have a dataframe, we pack it inside cfg.
        return REGISTRY_DATASETS["image_qg_dataset"](df_path/df)

upd: offline decided to get transforms before dataset initialisation
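A runnable distillation of the proposed builder/registry sketch, with the pseudocode gaps filled by placeholder choices (the `QGDataset` class, the `name` config key, and the `split` column are assumptions for illustration, not OML's actual API):

```python
import pandas as pd


class QGDataset:
    # Placeholder dataset: a real OML dataset would also load images etc.
    def __init__(self, df, transforms=None):
        self.df = df
        self.transforms = transforms

    def __len__(self):
        return len(self.df)


def qg_builder(cfg):
    # Per the "upd" note above: transforms are resolved before the dataset is built.
    transforms = cfg.get("transforms")
    return QGDataset(cfg["df"], transforms=transforms)


REGISTRY_DATASETS = {
    "image_qg_dataset": qg_builder,
}


def get_dataset_by_cfg(cfg, split_val=None):
    if split_val and "dataframe_name" in cfg:
        df = pd.read_csv(cfg["dataframe_name"])
        df = df[df["split"] == split_val].reset_index(drop=True)
        # Not all datasets will have a dataframe, so we pack it inside cfg.
        cfg["df"] = df
    return REGISTRY_DATASETS[cfg["name"]](cfg)
```

For example, `get_dataset_by_cfg({"name": "image_qg_dataset", "dataframe_name": "df.csv"}, split_val="train")` would build a dataset holding only the train rows of `df.csv`.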