
[WIP] Azureml v2-datasets local execution using fsspec #61

Closed
fdroessler wants to merge 1 commit

Conversation

fdroessler
Contributor

Hi all,

As mentioned by @tomasvanpottelbergh in kedro-org/kedro-plugins#60, here is a proposal for enabling local execution of kedro pipelines that have AML datasets in the catalog as input and intermediate datasets. In essence, it converts all AML datasets that are not root input datasets of a pipeline run into Pickle datasets and saves them in a local-run folder. It currently depends on "activating" azureml-fsspec in kedro-datasets (see this comment: kedro-org/kedro#4314), but other than that it works for me in some local tests.
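The conversion described above can be sketched in plain Python. This is a minimal illustration of the idea, not the PR's actual code: the function name, the `data/local_run` folder, and the set-based inputs are all hypothetical stand-ins.

```python
# Sketch of the replacement logic described above: during a local run,
# AML datasets that are *not* free (root) inputs of the pipeline are
# swapped for locally persisted pickles. All names here are illustrative.
from pathlib import Path

LOCAL_RUN_DIR = Path("data/local_run")  # hypothetical local-run folder


def plan_local_catalog(aml_dataset_names, pipeline_free_inputs):
    """Decide, per AML dataset, whether to read from AML or use a local pickle.

    aml_dataset_names: names of catalog entries backed by AML datasets.
    pipeline_free_inputs: free (root) inputs of the pipeline being run,
        e.g. what kedro's ``pipeline.inputs()`` returns.
    Returns a dict mapping dataset name -> "read-from-aml" | local pickle path.
    """
    plan = {}
    for name in aml_dataset_names:
        if name in pipeline_free_inputs:
            # Root inputs are still read from AML, so metadata flow and
            # traceability are preserved.
            plan[name] = "read-from-aml"
        else:
            # Intermediate/output datasets become local pickles; nothing
            # is written back to AML during a local run.
            plan[name] = str(LOCAL_RUN_DIR / f"{name}.pkl")
    return plan
```

In the real hook this decision would be made in `before_pipeline_run`, where both the pipeline object and the catalog are available.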

As mentioned in the other PR, the guiding idea here is that during local execution there are only reads from, and no writes to, AML datasets. This is to "guarantee" proper metadata flow and traceability. An option to also upload from local runs would be possible, but it would be more involved and should maybe be controlled via a CLI argument? I would still probably prefer that local executions not overwrite AML data.

This can easily be adapted to work with @tomasvanpottelbergh's PR for AML folder datasets as well. Looking forward to some thoughts.

@fdroessler fdroessler changed the title Azureml v2-datasets local execution using fsspec [WIP] Azureml v2-datasets local execution using fsspec May 19, 2023
@tomasvanpottelbergh
Contributor

Thanks for this @fdroessler! How do you activate or deactivate the hook?

I actually had a slightly different solution in mind:

  1. Implement the AzureMLFolderDataSet specifically for local loading
  2. In the AzurePipelinesRunner convert every AzureMLFolderDataSet to an AzureMLPipelineDataSet

I think that might be a bit easier to make the dataset work locally by default, since we fully control the AzurePipelinesRunner. What do you think?

@marrrcin
Contributor

Yeah, we definitely need a way to conveniently enable/disable the hook.
Also: unit test coverage.

FYI, I will be off until 2023/06/05, so please be patient with the PR 🙃

@fdroessler
Contributor Author

> Thanks for this @fdroessler! How do you activate or deactivate the hook?

Multiple ideas, all of which are hacky, so I am still thinking about this one.

> I actually had a slightly different solution in mind:
>
>   1. Implement the AzureMLFolderDataSet specifically for local loading

I agree with you here; this would be ideal and I think not too difficult. We can also use the dataset to resolve the issue of azureml-fsspec not yet being supported by creating the self.fs in the dataset according to the AML specifications. I have that implemented for AzureMLFolderDataSet in a branch forked from yours.

For me the challenge, which I tried to solve with the hook, is that only once we have the pipeline object do we know which datasets to load from AML and which are intermediate. I am not sure if there is another way to make a dataset "aware" that it is a root-input dataset for a particular pipeline run. But maybe there is another way of doing that. Any ideas?

>   2. In the AzurePipelinesRunner convert every AzureMLFolderDataSet to an AzureMLPipelineDataSet
>
> I think that might be a bit easier to make the dataset work locally by default, since we fully control the AzurePipelinesRunner. What do you think?

@tomasvanpottelbergh
Contributor

> > Thanks for this @fdroessler! How do you activate or deactivate the hook?
>
> Multiple ideas, all of which are hacky, so I am still thinking about this one.

Although it is not in the documentation, the hook should have access to the name of the runner (https://github.com/kedro-org/kedro/blob/main/kedro/framework/session/session.py#L409). So maybe this can be used to turn it off when using the AzurePipelinesRunner?
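A minimal sketch of that switch, assuming (as the linked session.py code suggests) that the hook's `run_params` dict carries the runner name under a `"runner"` key; the function name is illustrative:

```python
# Sketch of disabling the local-execution hook based on the runner name.
# Assumption: ``before_pipeline_run`` receives ``run_params`` containing
# a "runner" entry, as the linked kedro session.py code suggests.
def hook_enabled(run_params: dict) -> bool:
    """Enable the hook for every runner except the Azure ML one."""
    runner_name = run_params.get("runner", "")
    return "AzurePipelinesRunner" not in runner_name
```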

> > I actually had a slightly different solution in mind:
> >
> >   1. Implement the AzureMLFolderDataSet specifically for local loading
>
> I agree with you here; this would be ideal and I think not too difficult. We can also use the dataset to resolve the issue of azureml-fsspec not yet being supported by creating the self.fs in the dataset according to the AML specifications. I have that implemented for AzureMLFolderDataSet in a branch forked from yours.

👍 It would also be great to use the underlying dataset definition instead of PickleDataSet when doing the replacement.

> For me the challenge, which I tried to solve with the hook, is that only once we have the pipeline object do we know which datasets to load from AML and which are intermediate. I am not sure if there is another way to make a dataset "aware" that it is a root-input dataset for a particular pipeline run. But maybe there is another way of doing that. Any ideas?

Good point, I didn't think too much about using the dataset for registering intermediate datasets. I can't immediately see any other robust solution, but as long as the hook can be automatically disabled on Azure ML, I think it's a nice solution!

@fdroessler
Contributor Author

Will be included in kedro-org/kedro-plugins#60

@fdroessler fdroessler closed this Jun 7, 2023