Allow changing Dataset (class) in Pipelines #528
Comments
I've started work. WIP PR will be attached shortly.
@leoromanovich great, waiting for it
Start work here: #585
Let's start with the first PR, where we don't add text support but refactor the way image datasets are processed.

Registry

    import pandas as pd

    def build_img_dataset(cfg) -> tuple[ILD, IQGLD]:
        df = pd.read_csv(cfg["df_path"])
        df = enumerate(df)  # pseudocode
        df_train, df_val = df.split(by='split')  # pseudocode: split rows by the "split" column
        dataset_train = ImageLD(df_train)
        dataset_val = ImageQGLD(df_val)
        # or just reuse get_retrieval_datasets
        return dataset_train, dataset_val

    def build_txt_dataset(cfg) -> tuple[ILD, IQGLD]:
        pass

    # defined after the builders so that the names resolve
    DATASETS_BUILDER_REGISTRY = {"oml_img_datasets": build_img_dataset, "oml_txt_datasets": build_txt_dataset}
    ...
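
The registry idea above can be tried out in isolation. The following is only a minimal sketch under stated assumptions: `register_builder`, `get_datasets`, the `toy_img_datasets` builder, and the list-of-dicts records are hypothetical stand-ins invented for illustration, not OML classes or functions, and the `"train"` / `"validation"` split values are assumed.

```python
from typing import Any, Callable, Dict, List, Tuple

Record = Dict[str, Any]
Builder = Callable[[Dict[str, Any]], Tuple[List[Record], List[Record]]]

# Maps a builder name used in the config to the function that builds the datasets.
DATASETS_BUILDER_REGISTRY: Dict[str, Builder] = {}


def register_builder(name: str) -> Callable[[Builder], Builder]:
    """Hypothetical helper: register a builder function under `name`."""

    def decorator(fn: Builder) -> Builder:
        if name in DATASETS_BUILDER_REGISTRY:
            raise KeyError(f"Builder '{name}' is already registered")
        DATASETS_BUILDER_REGISTRY[name] = fn
        return fn

    return decorator


@register_builder("toy_img_datasets")
def build_toy_img_datasets(args: Dict[str, Any]) -> Tuple[List[Record], List[Record]]:
    # Plain lists of dicts stand in for the real train / validation dataset objects.
    records = args["records"]
    train = [r for r in records if r["split"] == "train"]
    val = [r for r in records if r["split"] == "validation"]
    return train, val


def get_datasets(cfg: Dict[str, Any]) -> Tuple[List[Record], List[Record]]:
    """Look up the builder named in the config and call it with its args."""
    name = cfg["dataset_builder"]
    if name not in DATASETS_BUILDER_REGISTRY:
        available = sorted(DATASETS_BUILDER_REGISTRY)
        raise KeyError(f"Unknown dataset builder '{name}'. Available: {available}")
    return DATASETS_BUILDER_REGISTRY[name](cfg.get("args", {}))


cfg = {
    "dataset_builder": "toy_img_datasets",
    "args": {
        "records": [
            {"path": "a.jpg", "label": 0, "split": "train"},
            {"path": "b.jpg", "label": 0, "split": "validation"},
            {"path": "c.jpg", "label": 1, "split": "validation"},
        ]
    },
}
train, val = get_datasets(cfg)
```

With this layout a pipeline only needs the builder name and its arguments, so a text builder could be added later by registering one more function, without touching the pipeline code.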

Config.yaml

    dataset_builder: oml_img_datasets
    args:
      df: df_full.csv
      cache_size: 100
      transforms_train:
        name: hypvit_resize
        args:
          im_size: 224
      transforms_val:
        name: hypvit_resize
        args:
          im_size: 224

Backward compatibility

    def convert_to_oml_three_format_if_needed(cfg):
        # each old-format key needs its own membership test
        if "dataset_root" in cfg and "transforms_train" in cfg and ...:
            cfg["dataset_train"] = {"name": "image_label_dataset", "args": {"df": ..., "transform": ...}}
            # don't forget to delete refactored keys
        ...
        return cfg

    def extractor_training_pipeline(cfg):
        cfg = dictconfig_to_dict(cfg)
        cfg = convert_to_oml_three_format_if_needed(cfg)
        dataset_train, dataset_val = get_datasets_builder(cfg)
        assert isinstance(dataset_train, ILD) and isinstance(dataset_val, IQGLD)
        assert check_consistency(dataset_train, dataset_val)

Update mock dataset and pipelines test

    @hydra.main(config_path="configs", config_name="train_postprocessor.yaml", version_base=HYDRA_BEHAVIOUR)
    def main_hydra(cfg: DictConfig) -> None:
        cfg = dictconfig_to_dict(cfg)
        download_mock_dataset(MOCK_DATASET_PATH)
        cfg["dataset_builder"]["dataset_root"] = str(MOCK_DATASET_PATH)
        extractor_training_pipeline(cfg)

    if __name__ == "__main__":
        main_hydra()

TESTS PIPELINES

Let's allow changing the Dataset class in Pipelines.
This also implies that we need a registry for Datasets.
Note: let's keep backward compatibility with the previous config format. We can add a condition that checks whether the config is in the old format, for example whether one of the old keys is present (like dataframe_name or dataset_root). If such keys are found, first reorganize the yaml/dict into the new layout, then process it with the updated parser.
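
The two-step backward-compatibility check described above might look roughly like this. It is a sketch, not the project's implementation: the target layout (`dataset_builder` plus `args`), the default builder name, and the helper names `is_old_format` and `convert_to_new_format_if_needed` are assumptions made for illustration. Only the two old keys come from the issue text.

```python
from copy import deepcopy
from typing import Any, Dict

# Keys named in the issue as markers of the old config format.
OLD_FORMAT_KEYS = ("dataframe_name", "dataset_root")
# Assumption: old configs always described image datasets.
DEFAULT_BUILDER = "oml_img_datasets"


def is_old_format(cfg: Dict[str, Any]) -> bool:
    """The config is treated as old-format if any old key is present."""
    return any(key in cfg for key in OLD_FORMAT_KEYS)


def convert_to_new_format_if_needed(cfg: Dict[str, Any]) -> Dict[str, Any]:
    """Step 1: reorganize an old-format dict; new-format dicts pass through unchanged."""
    if not is_old_format(cfg):
        return cfg
    new_cfg = deepcopy(cfg)  # leave the caller's dict untouched
    # Move the old top-level keys under the builder's args and drop the originals.
    moved = {key: new_cfg.pop(key) for key in OLD_FORMAT_KEYS if key in new_cfg}
    new_cfg["dataset_builder"] = DEFAULT_BUILDER
    new_cfg["args"] = {**new_cfg.get("args", {}), **moved}
    return new_cfg


old_cfg = {"dataframe_name": "df.csv", "dataset_root": "/data/mock", "max_epochs": 10}
new_cfg = convert_to_new_format_if_needed(old_cfg)
# Step 2 (not shown): hand new_cfg to the updated parser.
```

Converting up front keeps the updated parser free of old-format branches, since it only ever sees the new layout.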