Kedro versioning system throws (seemingly) randomly kedro.io.core.DatasetError to some versioned datasets #3628

Closed
EloyID opened this issue Feb 19, 2024 · 2 comments

EloyID commented Feb 19, 2024

Description

For some versioned datasets, Kedro raises kedro.io.core.DatasetError: Cannot save versioned dataset, even though no non-versioned dataset with the same name exists at the expected path. Kedro does create the folder where the versioned dataset should be saved. The image shows the created folder for the dataset that triggers the error, next to a similar versioned dataset that works.

image

Context

This error prevents me from being able to save some versioned datasets.

Steps to Reproduce

# the error causing one

X_train_batched_energy_pca_as_target_dataset_preprocessed:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl
  versioned: true
  metadata:
    kedro-viz:
      layer: train_data

# one correctly working

X_train_merged_input_data_preprocessed:
  type: pickle.PickleDataset
  filepath: data/05_model_input/X_train_merged_input_data_preprocessed.pkl
  versioned: true
  metadata:
    kedro-viz:
      layer: train_data

Expected Result

No error is raised and the dataset is created.

Actual Result

The containing folder is created, but Kedro raises an error instead of creating the dataset.

                   INFO     Saving data to                                                       data_catalog.py:525
                             X_train_batched_energy_pca_as_target_dataset_preprocessed
                             (PickleDataset)...

Traceback (most recent call last):
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\io\core.py", line 614, in save
    super().save(data)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\io\core.py", line 214, in save
    self._save(data)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro_datasets\pickle\pickle_dataset.py", line 225, in _save
    with self._fs.open(save_path, **self._fs_open_args_save) as fs_file:
  File "C:\Path\to\my\windows\python\lib\site-packages\fsspec\spec.py", line 1295, in open
    f = self._open(
  File "C:\Path\to\my\windows\python\lib\site-packages\fsspec\implementations\local.py", line 180, in _open
    return LocalFileOpener(path, mode, fs=self, **kwargs)
  File "C:\Path\to\my\windows\python\lib\site-packages\fsspec\implementations\local.py", line 302, in __init__
    self._open()
  File "C:\Path\to\my\windows\python\lib\site-packages\fsspec\implementations\local.py", line 307, in _open
    self.f = open(self.path, mode=self.mode)
FileNotFoundError: [Errno 2] No such file or directory: 'C:/Path/to/my/kedroproject/energy-market-forecast/data/05_model_input/X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl/2024-02-19T08.33.20.180Z/X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\sequential_runner.py", line 75, in _run
    run_node(node, catalog, hook_manager, self._is_async, session_id)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 331, in run_node
    node = _run_node_sequential(node, catalog, hook_manager, session_id)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 444, in _run_node_sequential
    catalog.save(name, data)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\io\data_catalog.py", line 532, in save
    dataset.save(data)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\io\core.py", line 618, in save
    raise DatasetError(
kedro.io.core.DatasetError: Cannot save versioned dataset 'X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl' to 'C:/Path/to/my/kedroproject/energy-market-forecast/data/05_model_input' because a file with the same name already exists in the directory. This is likely because versioning was enabled on a dataset already saved previously. Either remove 'X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl' from the directory or manually convert it into a versioned dataset by placing it in a versioned directory (e.g. with default versioning format 'C:/Path/to/my/kedroproject/energy-market-forecast/data/05_model_input/X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl/YYYY-MM-DDThh.mm.ss.sssZ/X_train_batched_energy_pca_as_target_dataset_preprocessed.pkl').

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "C:\Path\to\my\windows\python\lib\runpy.py", line 196, in _run_module_as_main
    return _run_code(code, main_globals, None,
  File "C:\Path\to\my\windows\python\lib\runpy.py", line 86, in _run_code       
    exec(code, run_globals)
  File "C:\Path\to\my\windows\python\Scripts\kedro.exe\__main__.py", line 7, in <module>
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\framework\cli\cli.py", line 198, in main
    cli_collection()
  File "C:\Path\to\my\windows\python\lib\site-packages\click\core.py", line 1157, in __call__
    return self.main(*args, **kwargs)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\framework\cli\cli.py", line 127, in main
    super().main(
  File "C:\Path\to\my\windows\python\lib\site-packages\click\core.py", line 1078, in main
    rv = self.invoke(ctx)
  File "C:\Path\to\my\windows\python\lib\site-packages\click\core.py", line 1688, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "C:\Path\to\my\windows\python\lib\site-packages\click\core.py", line 1434, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "C:\Path\to\my\windows\python\lib\site-packages\click\core.py", line 783, in invoke
    return __callback(*args, **kwargs)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\framework\cli\project.py", line 225, in run
    session.run(
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\framework\session\session.py", line 392, in run
    run_result = runner.run(
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 117, in run
    self._run(pipeline, catalog, hook_or_null_manager, session_id)  # type: ignore[arg-type]
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\sequential_runner.py", line 78, in _run
    self._suggest_resume_scenario(pipeline, done_nodes, catalog)
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 206, in _suggest_resume_scenario
    start_p_persistent_ancestors = _find_persistent_ancestors(
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 249, in _find_persistent_ancestors
    if _has_persistent_inputs(current_node, catalog):
  File "C:\Path\to\my\windows\python\lib\site-packages\kedro\runner\runner.py", line 290, in _has_persistent_inputs
    if isinstance(catalog._datasets[node_input], MemoryDataset):
KeyError: 'pca_target_regression.trained_pca_target_regression'

Your Environment

  • Kedro version used (pip show kedro or kedro -V): kedro, version 0.19.2
  • Python version used (python -V): Python 3.10.13
  • Operating system and version: Microsoft Windows [Version 10.0.22621.1928]

Thank you for your help and your work, I really like Kedro!

EloyID (Author) commented Feb 20, 2024

I am not 100% sure, but it seems to be related to the length of the filename: saving works when I switch to short names and fails in the same way with random long names. Maybe the error message should be more explicit about this.

noklam (Contributor) commented Mar 5, 2024

https://stackoverflow.com/questions/62606023/filenotfounderror-on-long-pathname-in-python-in-windows

Closing this as this is a Windows issue and Kedro cannot detect anything from the FileNotFoundError.
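
For anyone who wants to check whether a catalog entry is at risk, here is a minimal sketch. It assumes the versioned layout shown in the error message above (filepath/version/filename, so the file name appears twice in the final path) and the classic Windows MAX_PATH limit of 260 characters. The helper names `versioned_save_path` and `exceeds_max_path` are hypothetical and not part of Kedro.

```python
from pathlib import PurePosixPath

# Classic Windows MAX_PATH limit. It includes the terminating NUL,
# so the longest usable path is 259 characters.
WINDOWS_MAX_PATH = 260


def versioned_save_path(filepath: str, version: str) -> str:
    """Build <filepath>/<version>/<filename>, the layout seen in the error message."""
    path = PurePosixPath(filepath)
    return str(path / version / path.name)


def exceeds_max_path(filepath: str, version: str, limit: int = WINDOWS_MAX_PATH) -> bool:
    """Return True if the versioned save path would not fit within the limit."""
    return len(versioned_save_path(filepath, version)) >= limit


if __name__ == "__main__":
    # Hypothetical absolute path, for illustration only.
    example = "C:/project/data/05_model_input/" + "x" * 120 + ".pkl"
    version = "2024-02-19T08.33.20.180Z"
    full = versioned_save_path(example, version)
    print(len(full), exceeds_max_path(example, version))
```

Pass the absolute path of the dataset (project root included), since the limit applies to the full path that Windows receives. Because the file name is repeated, each extra character in the dataset name costs two characters in the final path.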

noklam closed this as completed Mar 5, 2024