Node caching: avoid unnecessary node execution when inputs, outputs, or logic remain unchanged #4350
Labels: Community; Issue: Feature Request
Description
First of all, thank you for your efforts in developing Kedro.
I believe it would be highly beneficial if Kedro had a built-in node caching feature. By node caching, I mean a mechanism to avoid re-executing a node when its inputs, outputs, and logic remain unchanged.
Context
This feature is important to me because, in some scenarios, it is necessary to run the entire pipeline multiple times with different configurations. Re-executing nodes that remain unchanged between runs can significantly increase the time required for experiments.
For instance, when tracking pipeline parameters with MLflow, we need to run the entire pipeline to record the parameters of every node, because kedro-mlflow records parameters node by node.
Possible Implementation
There is already an existing plugin, kedro-cache, that implements similar functionality. The plugin is well-written and could work effectively with some adjustments. However, it is outdated and incompatible with the most recent Kedro releases. Moreover, there are compatibility issues with specific datasets, such as tracking.JSONDataset and tracking.MetricsDataset, which are write-only and cannot be loaded.
I believe that integrating node caching directly into Kedro's core design would help mitigate such compatibility issues and provide a more robust solution for users.
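To make the idea concrete, here is a minimal sketch of the kind of check I have in mind, written in plain Python with no Kedro imports. The names `fingerprint` and `NodeCache` are hypothetical and are not part of Kedro or kedro-cache. It assumes inputs are picklable, and it uses the function's bytecode and constants as a rough stand-in for "logic". A real implementation would also need to confirm that the persisted outputs still exist, which is exactly where write-only datasets get tricky.

```python
import hashlib
import pickle
from typing import Any, Callable, Dict, Tuple


def fingerprint(func: Callable[..., Any], inputs: Dict[str, Any]) -> str:
    """Hash a node's compiled logic together with its pickled inputs."""
    digest = hashlib.sha256()
    code = func.__code__
    # Bytecode plus constants approximates "the logic has not changed".
    digest.update(code.co_code)
    digest.update(repr(code.co_consts).encode("utf-8"))
    for name in sorted(inputs):
        digest.update(name.encode("utf-8"))
        digest.update(pickle.dumps(inputs[name]))
    return digest.hexdigest()


class NodeCache:
    """Hypothetical in-memory cache: node name -> (fingerprint, output)."""

    def __init__(self) -> None:
        self._store: Dict[str, Tuple[str, Any]] = {}
        self.executions = 0  # counts how often a node function really ran

    def run(self, name: str, func: Callable[..., Any], inputs: Dict[str, Any]) -> Any:
        key = fingerprint(func, inputs)
        cached = self._store.get(name)
        if cached is not None and cached[0] == key:
            return cached[1]  # inputs and logic unchanged: skip execution
        output = func(**inputs)
        self.executions += 1
        self._store[name] = (key, output)
        return output


def double(x: int) -> int:
    return x * 2


cache = NodeCache()
cache.run("double", double, {"x": 3})  # executes the node
cache.run("double", double, {"x": 3})  # served from the cache
cache.run("double", double, {"x": 4})  # input changed, executes again
```

In this sketch only the second call is skipped. Keeping the fingerprint next to the node name means a change to either the inputs or the function body invalidates the entry.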