
[WIP] Azureml v2-datasets local execution using fsspec #61

Closed
fdroessler wants to merge 1 commit

Conversation

fdroessler
Contributor

Hi all,

As mentioned by @tomasvanpottelbergh in kedro-org/kedro-plugins#60, here is a proposal for enabling local execution of kedro pipelines that have AML datasets in the catalog as input and intermediate datasets. In essence, it converts all AML datasets that are not root input datasets of a pipeline run into Pickle datasets and saves them in a local-run folder. It currently depends on "activating" azureml-fsspec in kedro-datasets (see this comment: kedro-org/kedro#4314), but other than that it works for me in some local tests.
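The conversion described above can be sketched in plain Python. This is a minimal illustration of the idea, not the PR's actual code: the function name, the `data/local_run` folder, and the set-based inputs are all hypothetical stand-ins.

```python
# Sketch of the replacement logic described above: during a local run,
# AML datasets that are *not* free (root) inputs of the pipeline are
# swapped for locally persisted pickles. All names here are illustrative.
from pathlib import Path

LOCAL_RUN_DIR = Path("data/local_run")  # hypothetical local-run folder


def plan_local_catalog(aml_dataset_names, pipeline_free_inputs):
    """Decide, per AML dataset, whether to read from AML or use a local pickle.

    aml_dataset_names: names of catalog entries backed by AML datasets.
    pipeline_free_inputs: free (root) inputs of the pipeline being run,
        e.g. what kedro's ``pipeline.inputs()`` returns.
    Returns a dict mapping dataset name -> "read-from-aml" | local pickle path.
    """
    plan = {}
    for name in aml_dataset_names:
        if name in pipeline_free_inputs:
            # Root inputs are still read from AML, so metadata flow and
            # traceability are preserved.
            plan[name] = "read-from-aml"
        else:
            # Intermediate/output datasets become local pickles; nothing
            # is written back to AML during a local run.
            plan[name] = str(LOCAL_RUN_DIR / f"{name}.pkl")
    return plan
```

In the real hook this decision would be made in `before_pipeline_run`, where both the pipeline object and the catalog are available.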

As mentioned in the other PR, the guiding idea here is that during local execution there are only reads from, and no writes to, AML datasets. This is to "guarantee" proper metadata flow and traceability. An option to also upload from local runs would be possible, but it would be more involved and should maybe be controlled via a CLI argument? I would still probably prefer that local executions not overwrite AML data.

This can easily be adapted to work with @tomasvanpottelbergh's PR for AML folder datasets as well. Looking forward to some thoughts.

@fdroessler fdroessler changed the title Azureml v2-datasets local execution using fsspec [WIP] Azureml v2-datasets local execution using fsspec May 19, 2023
@tomasvanpottelbergh
Contributor

Thanks for this @fdroessler! How do you activate or deactivate the hook?

I actually had a slightly different solution in mind:

  1. Implement the AzureMLFolderDataSet specifically for local loading
  2. In the AzurePipelinesRunner convert every AzureMLFolderDataSet to an AzureMLPipelineDataSet

I think that might be a bit easier to make the dataset work locally by default, since we fully control the AzurePipelinesRunner. What do you think?

@marrrcin
Contributor

Yeah, we definitely need a way to conveniently enable/disable the hook.
Also: unit test coverage.

FYI, I will be off until 2023/06/05, so please be patient with the PR 🙃

@fdroessler
Contributor Author

> Thanks for this @fdroessler! How do you activate or deactivate the hook?

Multiple ideas, all of which are hacky, so I am still thinking about this one.

> I actually had a slightly different solution in mind:
>
>   1. Implement the AzureMLFolderDataSet specifically for local loading

I agree with you here; this would be ideal and I think not too difficult. We can also use the dataset to resolve the issue of azureml-fsspec not yet being supported by creating the self.fs in the dataset according to the AML specifications. I have that implemented for AzureMLFolderDataSet in a branch forked from yours.

For me the challenge, which I tried to solve with the hook, is that only once we have the pipeline object do we know which datasets to load from AML and which are intermediate. I am not sure if there is another way to make a dataset "aware" that it is a root-input dataset for a particular pipeline run. But maybe there is another way of doing that. Any ideas?

>   2. In the AzurePipelinesRunner convert every AzureMLFolderDataSet to an AzureMLPipelineDataSet
>
> I think that might be a bit easier to make the dataset work locally by default, since we fully control the AzurePipelinesRunner. What do you think?

@tomasvanpottelbergh
Contributor

> > Thanks for this @fdroessler! How do you activate or deactivate the hook?
>
> Multiple ideas, all of which are hacky, so I am still thinking about this one.

Although it is not in the documentation, the hook should have access to the name of the runner (https://github.com/kedro-org/kedro/blob/main/kedro/framework/session/session.py#L409). So maybe this can be used to turn it off when using the AzurePipelinesRunner?
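A minimal sketch of that switch, assuming (as the linked session.py code suggests) that the hook's `run_params` dict carries the runner name under a `"runner"` key; the function name is illustrative:

```python
# Sketch of disabling the local-execution hook based on the runner name.
# Assumption: ``before_pipeline_run`` receives ``run_params`` containing
# a "runner" entry, as the linked kedro session.py code suggests.
def hook_enabled(run_params: dict) -> bool:
    """Enable the hook for every runner except the Azure ML one."""
    runner_name = run_params.get("runner", "")
    return "AzurePipelinesRunner" not in runner_name
```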

> > I actually had a slightly different solution in mind:
> >
> >   1. Implement the AzureMLFolderDataSet specifically for local loading
>
> I agree with you here; this would be ideal and I think not too difficult. We can also use the dataset to resolve the issue of azureml-fsspec not yet being supported by creating the self.fs in the dataset according to the AML specifications. I have that implemented for AzureMLFolderDataSet in a branch forked from yours.

👍 It would also be great to use the underlying dataset definition instead of PickleDataSet when doing the replacement.

> For me the challenge, which I tried to solve with the hook, is that only once we have the pipeline object do we know which datasets to load from AML and which are intermediate. I am not sure if there is another way to make a dataset "aware" that it is a root-input dataset for a particular pipeline run. But maybe there is another way of doing that. Any ideas?

Good point, I didn't think too much about using the dataset for registering intermediate datasets. I can't immediately see any other robust solution, but as long as the hook can be automatically disabled on Azure ML, I think it's a nice solution!

@fdroessler
Contributor Author

Will be included in kedro-org/kedro-plugins#60

@fdroessler fdroessler closed this Jun 7, 2023