
build_python_component does not automatically configure the file outputs such as mlpipeline-ui-metadata in the containerop #2317

Closed
gaoning777 opened this issue Oct 7, 2019 · 13 comments

Comments

@gaoning777
Contributor

When the component Python function contains an output such as mlpipeline-metrics.json or mlpipeline-ui-metadata.json, the generated ContainerOp (by build_python_component) does not contain the corresponding file_output.

@Ark-kun
Contributor

Ark-kun commented Oct 8, 2019

This does not seem to be a regression, since artifacts did not previously appear in ContainerOp.file_outputs.
However, this issue should be fixed as part of my efforts to improve data passing.
Thank you for the reminder.
I wanted to clean up this part of ContainerOp, but forgot that it's critical for passing artifacts when using ContainerOp.

I'll fix this issue.

@gaoning777
Contributor Author

Agreed that this is not a regression, but it is a bug that we need to fix.
Thanks, Alexey

@Ark-kun
Contributor

Ark-kun commented Oct 12, 2019

I discovered two problems:

  1. This is a breaking change that will break some of the existing samples: there are some components (especially the GCP ones) that have only one output, so in the samples we refer to that output as task.output. Adding a mlpipeline-ui-metadata output makes task.output stop working, because there are then multiple outputs (see the sketch after this comment).
  2. Artifact names. The DSL compiler prepends the task name to every output name: task_name-output_name (probably the easiest way for Bradley to ensure uniqueness of downstream input names, which are based on upstream output names). The names in the compiled workflow would then be my-task-3-mlpipeline-ui-metadata, and this will prevent the frontend and backend from recognizing those artifacts.

So, adding those two artifacts to file_outputs is not possible at this moment.
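
To make problem 1 concrete, here is a minimal sketch, not from the thread, assuming the kfp v1 DSL; the component names, image, command, and paths are placeholders:

from kfp import dsl

@dsl.pipeline(name='single-output-sample')
def sample_pipeline():
    producer = dsl.ContainerOp(
        name='producer',            # placeholder name
        image='alpine',             # placeholder image
        command=['sh', '-c', 'echo hello > /tmp/out.txt'],
        file_outputs={'result': '/tmp/out.txt'},
    )
    # With exactly one declared output, the singular shortcut works:
    dsl.ContainerOp(
        name='consumer',
        image='alpine',
        command=['echo', producer.output],
    )
    # If 'mlpipeline-ui-metadata' were implicitly added to file_outputs,
    # producer.output would no longer resolve, because there would be
    # more than one output.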

@Ark-kun Ark-kun changed the title from "build_python_component does not automatically configure the file outputs in the containerop" to "The mlpipeline-ui-metadata output artifact is missing from ContainerOp.outputs" on Oct 12, 2019
@Ark-kun
Contributor

Ark-kun commented Oct 12, 2019

Changed the title since this is not limited to components; if you try to do the same with ContainerOp directly, you'll get the same result:

from kfp.dsl import ContainerOp

task = ContainerOp(
    name='my-task',    # name and image are placeholder values; ContainerOp requires them
    image='alpine',
    file_outputs={
        'mlpipeline-ui-metadata': '/tmp/mlpipeline-ui-metadata.json',
    },
)
# The mlpipeline-* entries are not added to task.outputs:
assert not task.outputs

@gaoning777
Contributor Author

I discovered two problems:

  1. This is a breaking change that will break some of the existing samples: there are some components (especially the GCP ones) that have only one output, so in the samples we refer to that output as task.output. Adding a mlpipeline-ui-metadata output makes task.output stop working, because there are then multiple outputs.

It will not be a breaking change, because the outputs are only injected if mlpipeline-ui-metadata is written in the Python code. The current (Python component build) samples, which do not output metrics/visualizations, are not affected.

  2. Artifact names. The DSL compiler prepends the task name to every output name: task_name-output_name (probably the easiest way for Bradley to ensure uniqueness of downstream input names, which are based on upstream output names). The names in the compiled workflow would then be my-task-3-mlpipeline-ui-metadata, and this will prevent the frontend and backend from recognizing those artifacts.

AFAIK, the frontend/backend look for the metrics/visualizations based on the keys (see the sketch after this comment): https://www.kubeflow.org/docs/pipelines/sdk/output-viewer/

So, adding those two artifacts to file_outputs is not possible at this moment.
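
As a reference for "the keys", a sketch of what the output-viewer docs linked above describe a step writing (based on those docs and the sample later in this thread; the GCS path is illustrative):

import json

# Write the UI metadata to the conventional path so the frontend can find it.
metadata = {
    'outputs': [{
        'type': 'tensorboard',
        'source': 'gs://ml-pipeline-dataset/tensorboard-train',
    }]
}
with open('/mlpipeline-ui-metadata.json', 'w') as f:
    json.dump(metadata, f)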

@gaoning777
Contributor Author

Changed the title since this is not limited to components; if you try to do the same with ContainerOp directly, you'll get the same result:

from kfp.dsl import ContainerOp

task = ContainerOp(
    name='my-task',    # name and image are placeholder values; ContainerOp requires them
    image='alpine',
    file_outputs={
        'mlpipeline-ui-metadata': '/tmp/mlpipeline-ui-metadata.json',
    },
)
# The mlpipeline-* entries are not added to task.outputs:
assert not task.outputs

Are you sure? I tried on my machine and it worked fine.

@Ark-kun
Contributor

Ark-kun commented Oct 14, 2019

The current (Python component build) samples, which do not output metrics/visualizations, are not affected.

The XGBoost sample and some other e2e samples might be affected. Many of our 1st-party components produce the mlpipeline-ui-metadata artifact, and some samples use task.output (singular).

AFAIK, the frontend/backend look for the metrics/visualizations based on the keys:

I'm not fully sure about that. In any case, the keys depend on the artifact name.

Are you sure? I tried on my machine and it worked fine.

It's expected. That snippet demonstrates that the mlpipeline-* artifacts are missing from task.outputs even if you just use ContainerOp (not build_python_component or func_to_container_op or load_component).

@Ark-kun
Contributor

Ark-kun commented Oct 14, 2019

Let's take a step back and understand the high-level CUJ (critical user journey) here.

The current state of the issue is: the "mlpipeline-ui-metadata" and "mlpipeline-metrics" entries are never added to the task.outputs dictionary, which is used to pass output references to dependent tasks.

What CUJs are affected by this issue? What scenarios require passing "mlpipeline-ui-metadata" to another step?
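
For context, a sketch of the downstream passing that task.outputs normally enables (assumed, not from the thread; names, image, and paths are placeholders):

from kfp import dsl

@dsl.pipeline(name='output-passing-sample')
def passing_pipeline():
    producer = dsl.ContainerOp(
        name='producer',
        image='alpine',
        command=['sh', '-c', 'echo data > /tmp/data.txt'],
        file_outputs={'data': '/tmp/data.txt'},
    )
    # task.outputs maps output name -> reference; the mlpipeline-ui-metadata
    # and mlpipeline-metrics entries never show up in this dictionary.
    dsl.ContainerOp(
        name='consumer',
        image='alpine',
        command=['echo', producer.outputs['data']],
    )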

@gaoning777 gaoning777 changed the title from "The mlpipeline-ui-metadata output artifact is missing from ContainerOp.outputs" to "build_python_component does not automatically configure the file outputs such as mlpipeline-ui-metadata in the containerop" on Oct 15, 2019
@gaoning777
Contributor Author

Let's take a step back and understand the high-level CUJ here.

The current state of the issue is: the "mlpipeline-ui-metadata" and "mlpipeline-metrics" entries are never added to the task.outputs dictionary, which is used to pass output references to dependent tasks.

What CUJs are affected by this issue? What scenarios require passing "mlpipeline-ui-metadata" to another step?

You might've misunderstood the issue. I was not referring to task.outputs, and I wasn't expecting the "mlpipeline-ui-metadata" and "mlpipeline-metrics" entries to be used by the downstream components.
The issue is that when I author a Python function and turn it into a component using build_python_component (or the other functions, if this one is deprecated), the outputs section in the Argo YAML is not generated. The resulting effect is that the metrics are not displayed in the UI.

@Ark-kun
Contributor

Ark-kun commented Oct 16, 2019

The issue is that when I author a Python function and turn it into a component using build_python_component (or the other functions, if this one is deprecated), the outputs section in the Argo YAML is not generated. The resulting effect is that the metrics are not displayed in the UI.

Oh. This changes everything. I was confused by this part of the description: "contains an output such as mlpipeline-metrics.json or mlpipeline-ui-metadata.json, the generated ContainerOp (by build_python_component) does not contain the corresponding file_output."

Does your component function have explicit outputs named 'mlpipeline_ui_metadata' and 'mlpipeline_metrics'? See the lightweight sample: https://github.com/kubeflow/pipelines/blob/master/samples/core/lightweight_component/lightweight_component.ipynb
All outputs need to be explicitly specified now. (See #1422 and #1250 (comment) )

The metrics and visualizations should then work.

@gaoning777
Contributor Author

Makes sense. Similar to the code below, if I'm understanding it correctly.

from typing import NamedTuple
def func() -> NamedTuple('output', [('mlpipeline_ui_metadata', 'UI_metadata'), ('mlpipeline_metrics', 'Metrics')]):
    import json

    # Placeholder computation so the sample metrics below have values:
    quotient, remainder = divmod(10, 3)

    # Exports a sample tensorboard:
    metadata = {
      'outputs' : [{
        'type': 'tensorboard',
        'source': 'gs://ml-pipeline-dataset/tensorboard-train',
      }]
    }

    # Exports two sample metrics:
    metrics = {
      'metrics': [{
          'name': 'quotient',
          'numberValue':  float(quotient),
        },{
          'name': 'remainder',
          'numberValue':  float(remainder),
        }]}

    from collections import namedtuple
    divmod_output = namedtuple('output', ['mlpipeline_ui_metadata', 'mlpipeline_metrics'])
    return divmod_output(json.dumps(metadata), json.dumps(metrics))

@Ark-kun
Contributor

Ark-kun commented Oct 19, 2019

Makes sense. Similar to the code below, if I'm understanding it correctly.

Yes. This code should work. Let me know if it does not.
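
For completeness, a sketch of the typical wiring with the kfp SDK of that era (an assumption on my part; the pipeline name and output filename are placeholders, and func refers to the function in the previous comment):

import kfp
from kfp import components, dsl

# Turn the Python function above into a pipeline component.
produce_metrics_op = components.func_to_container_op(func)

@dsl.pipeline(name='metrics-sample')
def metrics_pipeline():
    # Because the NamedTuple return annotation declares the outputs explicitly,
    # the compiled workflow gets mlpipeline-ui-metadata and mlpipeline-metrics
    # output artifacts for this step.
    produce_metrics_op()

kfp.compiler.Compiler().compile(metrics_pipeline, 'metrics_pipeline.yaml')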

I think I should put a notice about the mlpipeline_ui_metadata change in the README.md (e.g. Notable changes in v0.1.29). WDYT?

@Ark-kun
Contributor

Ark-kun commented Feb 11, 2020

The issue is resolved.

@Ark-kun Ark-kun closed this as completed Feb 11, 2020