Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Please rework the pipeline interactions with azureml.data.OutputFileDatasetConfig #23565

Closed
weishengtoh opened this issue Mar 17, 2022 · 7 comments
Assignees
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. ML-Pipelines AreaPath needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Attention Workflow: This issue is responsible by Azure service team.

Comments

@weishengtoh
Copy link

Problem Description

Azure ML Python SDK documentation has provided numerous options to pass data between training pipelines, but currently the recommended option appears to be azureml.data.OutputFileDatasetConfig.

However, azureml.data.OutputFileDatasetConfig has a limitation that it cannot be accepted as a valid input for the inputs parameter for all the classes in azureml.pipeline.steps - e.g. PythonScriptStep and HyperDriveStep.

To define the OutputFileDatasetConfig as an input of a pipeline step, the function as_input() has to be called on the object, and the function is not called if the OutputFileDatasetConfig is used as an output of a pipeline step.

This is extremely convoluted, as it clearly suggests that the OutputFileDatasetConfig was originally designed only as an output to a pipeline step.

Proposed solution

  1. The name of the class should be changed - OutputFileDatasetConfig suggests that it is meant only as an output, and it is some kind of a config file to be used by internal classes (which it clearly is not). If the intention is to use it also as the input to downstream pipeline steps then the name should reflect that.
  2. Allow this class to be used in the inputs parameter for all classes in azureml.pipeline.steps. The azureml.pipeline.core.PipelineData class allows the user to specify it as both the input and output of a pipeline step. However, it is not the recommended approach. PipelineData is also a much better name for a class that transfer data between pipeline steps.
  3. Alternatively to point 2. above, please remove the inputs and outputs parameters for all classes in azureml.pipeline.steps and enforce that inputs be declared with as_input() and outputs as as_output().
@ghost ghost added needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. customer-reported Issues that are reported by GitHub users external to the Azure organization. question The issue doesn't require a change to the product in order to be resolved. Most issues start as that labels Mar 17, 2022
@azure-sdk azure-sdk added Client This issue points to a problem in the data-plane of the library. Data Factory needs-team-triage Workflow: This issue needs the team to triage. labels Mar 17, 2022
@ghost ghost removed the needs-triage Workflow: This is a new issue that needs to be triaged to the appropriate team. label Mar 17, 2022
@kashifkhan kashifkhan added ML-Pipelines AreaPath CXP Attention and removed needs-team-triage Workflow: This issue needs the team to triage. labels Mar 21, 2022
@ghost ghost added the needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team label Mar 21, 2022
@navba-MSFT navba-MSFT added Service Attention Workflow: This issue is responsible by Azure service team. and removed CXP Attention labels Mar 22, 2022
@ghost
Copy link

ghost commented Mar 22, 2022

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @Jingshu923, @zhangyd2015, @Frey-Wang.

Issue Details

Problem Description

Azure ML Python SDK documentation has provided numerous options to pass data between training pipelines, but currently the recommended option appears to be azureml.data.OutputFileDatasetConfig.

However, azureml.data.OutputFileDatasetConfig has a limitation that it cannot be accepted as a valid input for the inputs parameter for all the classes in azureml.pipeline.steps - e.g. PythonScriptStep and HyperDriveStep.

To define the OutputFileDatasetConfig as an input of a pipeline step, the function as_input() has to be called on the object, and the function is not called if the OutputFileDatasetConfig is used as an output of a pipeline step.

This is extremely convoluted, as it clearly suggests that the OutputFileDatasetConfig was originally designed only as an output to a pipeline step.

Proposed solution

  1. The name of the class should be changed - OutputFileDatasetConfig suggests that it is meant only as an output, and it is some kind of a config file to be used by internal classes (which it clearly is not). If the intention is to use it also as the input to downstream pipeline steps then the name should reflect that.
  2. Allow this class to be used in the inputs parameter for all classes in azureml.pipeline.steps. The azureml.pipeline.core.PipelineData class allows the user to specify it as both the input and output of a pipeline step. However, it is not the recommended approach. PipelineData is also a much better name for a class that transfer data between pipeline steps.
  3. Alternatively to point 2. above, please remove the inputs and outputs parameters for all classes in azureml.pipeline.steps and enforce that inputs be declared with as_input() and outputs as as_output().
Author: weishengtoh
Assignees: bandsina
Labels:

question, Data Factory, Service Attention, Client, customer-reported, needs-team-attention, ML-Pipelines

Milestone: -

@navba-MSFT
Copy link
Contributor

@weishengtoh Apologies for the late reply. Thanks for reaching out to us and sharing the feedback. Tagging the Service team to look into this ask.

@bandsina @Jingshu923 @zhangyd2015 @Frey-Wang Could you please look into this issue and provide an update once you get a chance ? Awaiting your reply.

@Frey-Wang
Copy link
Member

@navba-MSFT , this issue is about azureml, not data factory. Thanks!

@navba-MSFT navba-MSFT removed Data Factory Service Attention Workflow: This issue is responsible by Azure service team. labels Mar 23, 2022
@ghost ghost added the needs-team-triage Workflow: This issue needs the team to triage. label Mar 23, 2022
@navba-MSFT navba-MSFT added Service Attention Workflow: This issue is responsible by Azure service team. and removed needs-team-triage Workflow: This issue needs the team to triage. labels Mar 23, 2022
@ghost
Copy link

ghost commented Mar 23, 2022

Thanks for the feedback! We are routing this to the appropriate team for follow-up. cc @shbijlan.

Issue Details

Problem Description

Azure ML Python SDK documentation has provided numerous options to pass data between training pipelines, but currently the recommended option appears to be azureml.data.OutputFileDatasetConfig.

However, azureml.data.OutputFileDatasetConfig has a limitation that it cannot be accepted as a valid input for the inputs parameter for all the classes in azureml.pipeline.steps - e.g. PythonScriptStep and HyperDriveStep.

To define the OutputFileDatasetConfig as an input of a pipeline step, the function as_input() has to be called on the object, and the function is not called if the OutputFileDatasetConfig is used as an output of a pipeline step.

This is extremely convoluted, as it clearly suggests that the OutputFileDatasetConfig was originally designed only as an output to a pipeline step.

Proposed solution

  1. The name of the class should be changed - OutputFileDatasetConfig suggests that it is meant only as an output, and it is some kind of a config file to be used by internal classes (which it clearly is not). If the intention is to use it also as the input to downstream pipeline steps then the name should reflect that.
  2. Allow this class to be used in the inputs parameter for all classes in azureml.pipeline.steps. The azureml.pipeline.core.PipelineData class allows the user to specify it as both the input and output of a pipeline step. However, it is not the recommended approach. PipelineData is also a much better name for a class that transfer data between pipeline steps.
  3. Alternatively to point 2. above, please remove the inputs and outputs parameters for all classes in azureml.pipeline.steps and enforce that inputs be declared with as_input() and outputs as as_output().
Author: weishengtoh
Assignees: bandsina
Labels:

question, Service Attention, Client, customer-reported, needs-team-attention, ML-Pipelines

Milestone: -

@navba-MSFT
Copy link
Contributor

@shbijlan Could you please look into this issue and provide an update on this once you get a chance ? Awaiting your reply.

@likebupt
Copy link

@weishengtoh , Thanks for your suggestions and sorry for the inconvenience caused.
Indeed, the data flow between pipeline step are confusing sometimes, with so many concept.
We actually bring the new experience ML component, which is easier to define input and output, and to reuse among different projects. You can refer to the following document,
ML components
Create pipeline using components

We are also happy to set up a call to learn from your use case and see if components can better suit your scenario.
Please contact [email protected] if you are interested.

Thanks!

@weishengtoh
Copy link
Author

@likebupt Thanks!

The ML component does appear to be a cleaner approach thou I was initially hesitant to use it as it is still under "Preview".

Will try to reformat the codes to work with ML components and see how it goes.

Cheers :)

@github-actions github-actions bot locked and limited conversation to collaborators Apr 11, 2023
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
Client This issue points to a problem in the data-plane of the library. customer-reported Issues that are reported by GitHub users external to the Azure organization. ML-Pipelines AreaPath needs-team-attention Workflow: This issue needs attention from Azure service team or SDK team question The issue doesn't require a change to the product in order to be resolved. Most issues start as that Service Attention Workflow: This issue is responsible by Azure service team.
Projects
None yet
Development

No branches or pull requests

7 participants