Inconsistency when setting version via versioned flag and dataset parameter #4326

Closed
ElenaKhaustova opened this issue Nov 13, 2024 · 3 comments
Assignees
Labels
Component: Framework Issue/PR that addresses core framework functionality Component: IO Issue/PR addresses data loading/saving/versioning and validation, the DataCatalog and DataSets

Comments

@ElenaKhaustova
Contributor

Description

Currently, there are several ways to mark a dataset as versioned.

Option 1 - set versioned: true via configuration

regressor:
  type: pickle.PickleDataset
  filepath: data/06_models/regressor.pickle
  versioned: true

Option 2 - pass a Version object to the dataset constructor

from kedro.io import Version
from kedro_datasets.pandas import ExcelDataset

version = Version(
    load="load_version.csv",  # load exact version
    save="save_version.csv",  # save to exact version
)

test_dataset = ExcelDataset(
    filepath="data/01_raw/shuttles.xlsx",
    load_args={"engine": "openpyxl"},
    version=version,
)

Our KedroDataCatalog.from_config method allows passing load_versions and save_version:

@classmethod
def from_config(
    cls,
    catalog: dict[str, dict[str, Any]] | None,
    credentials: dict[str, dict[str, Any]] | None = None,
    load_versions: dict[str, str] | None = None,
    save_version: str | None = None,
) -> KedroDataCatalog:

However, a version is only set when the versioned flag is True:

if config.pop(VERSIONED_FLAG_KEY, False) or getattr(

Otherwise, the passed load and save versions are ignored.

So there is also Option 3: set the version via KedroDataCatalog.from_config, which requires both the versioned flag and load_versions/save_version to be set.
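
The gating described above can be illustrated with a minimal, self-contained sketch. It is not Kedro's actual code: `resolve_version_current` is a hypothetical helper, `Version` here is a local stand-in for Kedro's class, and only the `versioned` key check is modelled (the truncated `getattr(` branch of the real condition is left out).

```python
from typing import Any, NamedTuple, Optional

VERSIONED_FLAG_KEY = "versioned"


class Version(NamedTuple):
    """Local stand-in for a (load, save) version pair."""
    load: Optional[str]
    save: Optional[str]


def resolve_version_current(
    config: dict[str, Any],
    load_version: Optional[str] = None,
    save_version: Optional[str] = None,
) -> Optional[Version]:
    """Hypothetical helper: build a Version only if the config is flagged."""
    config = dict(config)  # copy so the caller's dict is not mutated
    if config.pop(VERSIONED_FLAG_KEY, False):
        return Version(load_version, save_version)
    # Without the flag, the requested versions are silently dropped.
    return None


flagged = {"type": "pickle.PickleDataset", "versioned": True}
unflagged = {"type": "pickle.PickleDataset"}

with_flag = resolve_version_current(flagged, "v1", "v2")
without_flag = resolve_version_current(unflagged, "v1", "v2")
```

In this sketch `without_flag` ends up as `None` even though both versions were passed, which mirrors the behaviour reported in this issue.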

Context

  1. First of all, it is very confusing to have three different ways of setting a version.
  2. The load_versions/save_version parameters are ignored when creating a catalog via KedroDataCatalog.from_config if the versioned flag is not set.
  3. It is impossible to set the versioned flag on a dataset object and set load_versions/save_version via config.
  4. Some datasets set the versioned flag when a version is passed, but most don't:
    if version:
  5. The above problem introduces corner cases that make it harder to implement the serialization/deserialization feature for the catalog ([DataCatalog]: Spike - Catalog serialization and deserialization support #3932).

Possible Implementation

  1. Consider removing the versioned flag
  2. Allow setting a version based on the load_versions and/or save_version provided
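
As a rough illustration of step 2, the sketch below treats a dataset as versioned when either the flag is set or an explicit version is supplied. Everything here is an assumption made for illustration: `resolve_version_proposed` is a hypothetical helper, `Version` is a local stand-in, and versions are passed per dataset, which may not match how a real implementation would handle the catalog-wide save_version.

```python
from typing import Any, NamedTuple, Optional

VERSIONED_FLAG_KEY = "versioned"


class Version(NamedTuple):
    """Local stand-in for a (load, save) version pair."""
    load: Optional[str]
    save: Optional[str]


def resolve_version_proposed(
    config: dict[str, Any],
    load_version: Optional[str] = None,
    save_version: Optional[str] = None,
) -> Optional[Version]:
    """Hypothetical helper: the flag OR an explicit version enables versioning."""
    config = dict(config)  # copy so the caller's dict is not mutated
    flagged = config.pop(VERSIONED_FLAG_KEY, False)
    has_explicit_version = load_version is not None or save_version is not None
    if flagged or has_explicit_version:
        return Version(load_version, save_version)
    return None


unflagged = {"type": "pickle.PickleDataset"}

explicit = resolve_version_proposed(unflagged, load_version="v1")
plain = resolve_version_proposed(unflagged)
flag_only = resolve_version_proposed({"versioned": True})
```

Keeping the flag check in place is what would make this variant non-breaking: existing `versioned: true` entries behave as before, while explicitly passed versions are no longer ignored.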

Possible Alternatives

Implement only step 2 as a temporary solution that avoids a breaking change.

@ElenaKhaustova ElenaKhaustova added Component: Framework Issue/PR that addresses core framework functionality Component: IO Issue/PR addresses data loading/saving/versioning and validation, the DataCatalog and DataSets labels Nov 13, 2024
@ElenaKhaustova ElenaKhaustova added this to the Dataset Versioning milestone Nov 13, 2024
@ElenaKhaustova ElenaKhaustova moved this to In Progress in Kedro Framework Nov 14, 2024
@ElenaKhaustova ElenaKhaustova self-assigned this Nov 14, 2024
@ElenaKhaustova
Contributor Author

For now, we have solved the problem for #4329 by adding logic to update VERSIONED_FLAG_KEY if a version is provided.
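
The added logic is not shown in this thread. One possible shape, purely as an assumption, is a step that turns a dataset's constructor arguments back into a config entry and sets the flag whenever a version was passed. The helper `init_args_to_config` and the `"version"` key are hypothetical names used only for this sketch.

```python
from typing import Any

VERSIONED_FLAG_KEY = "versioned"


def init_args_to_config(init_args: dict[str, Any]) -> dict[str, Any]:
    """Hypothetical helper: derive a config entry from constructor arguments."""
    config = dict(init_args)  # copy so the caller's dict is not mutated
    # The version object itself is dropped from the config entry;
    # its presence is recorded through the versioned flag instead.
    version = config.pop("version", None)
    if version is not None:
        config[VERSIONED_FLAG_KEY] = True
    return config


with_version = init_args_to_config(
    {"filepath": "data/01_raw/shuttles.xlsx", "version": ("v1", "v2")}
)
no_version = init_args_to_config({"filepath": "data/01_raw/shuttles.xlsx"})
```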

@ElenaKhaustova
Contributor Author

We will keep this issue open until we decide whether to fix it within the Dataset Versioning workstream.

@ElenaKhaustova
Contributor Author

After discussing this with @idanov, we decided not to proceed with this issue, as #4329 was unblocked by adding extra logic to update VERSIONED_FLAG_KEY if a version is provided.

@github-project-automation github-project-automation bot moved this from In Progress to Done in Kedro Framework Nov 27, 2024