
Define a set of metadata specifications for v0.6 #4

Closed
zhenghuiwang opened this issue Apr 8, 2019 · 9 comments

Comments

@zhenghuiwang
Contributor

These metadata specifications should

  1. cover metadata needed for v0.6 CUJ, including model, hyperparameters, metrics etc.
  2. show how customized metadata can be defined in a similar way.

These metadata specifications are preloaded during server start, while customized ones are registered via endpoints.
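For illustration only, here is a minimal sketch of what registering a customized type at runtime could look like; the endpoint path, payload shape, and service address are assumptions, not the actual API:

```python
# Hypothetical sketch: register a customized metadata type by POSTing its
# JSON schema to the metadata service. The URL and payload layout below are
# assumptions for illustration, not the real endpoint.
import json
import requests

custom_type_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "threshold": {"type": "number"},
    },
    "required": ["name"],
}

resp = requests.post(
    "http://metadata-service:8080/api/v1/types",  # assumed endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps({"name": "my_custom_type", "schema": custom_type_schema}),
)
resp.raise_for_status()
```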


@zhenghuiwang
Contributor Author

zhenghuiwang commented Apr 9, 2019

Some thoughts on what format/tool should be used to define the spec.

We need to handle customized types at runtime. Protobuf is not a good fit here, because it requires compiling schemas and generating code. JSON Schema seems to be a good fit:

  1. Schema can be referenced and extended.
  2. Many libraries are available to compile schema and validate JSON during runtime.

Following this JSON Schema guideline on defining complex schemas, it would be good if we could extract common fields and reuse them. It is also important to categorize metadata by semantic meaning.
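To illustrate the second point above, a minimal sketch (using the Python `jsonschema` package; the schema fields are illustrative assumptions, not the proposed spec) of compiling a schema and validating a record at runtime:

```python
# A minimal sketch, not the proposed spec: validate a metadata record against
# a JSON schema at runtime with the `jsonschema` package. Field names are
# illustrative assumptions.
import jsonschema

model_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "version": {"type": "string"},
        "hyperparameters": {"type": "object"},
        "metrics": {"type": "object"},
    },
    "required": ["name", "version"],
}

record = {
    "name": "mnist-classifier",
    "version": "v1",
    "hyperparameters": {"learning_rate": 0.01},
    "metrics": {"accuracy": 0.97},
}

# Raises jsonschema.ValidationError if the record does not conform.
jsonschema.validate(instance=record, schema=model_schema)
```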

Here is my attempt to define a type hierarchy for metadata of ML workflows. A child is a subtype of its parent. For example, we have a generic entity (schema draft); artifact (schema draft) is a subtype of entity; data_set (schema draft) is a subtype of artifact.

entity
├── artifact
│       ├── data_set
│       └── model
│
├── execution
│
└── container
         ├── workspace
         └── katib_experiment

entity is a piece of metadata that represents an entity in Kubeflow. There are three categories of entities:

  - artifact represents input data or derived data in an ML workflow, such as a data set or a model.
  - executable represents an executable in an ML workflow, such as training code or a data transformation.
  - container represents a group of artifacts, executables, and other containers, e.g. workspaces and experiments.
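To make the subtype relation concrete, here is a hedged sketch of how data_set could extend artifact via allOf and $ref; the field names are my assumptions, not the linked schema drafts:

```python
# A hedged sketch, not the actual schema drafts: data_set is composed as a
# subtype of artifact using `allOf` and `$ref`. Field names are illustrative.
import jsonschema

data_set_schema = {
    "definitions": {
        "artifact": {
            "type": "object",
            "properties": {
                "id": {"type": "string"},
                "name": {"type": "string"},
                "create_time": {"type": "string", "format": "date-time"},
            },
            "required": ["id", "name"],
        }
    },
    "allOf": [
        {"$ref": "#/definitions/artifact"},
        {
            "properties": {
                "uri": {"type": "string"},
                "query": {"type": "string"},
            },
            "required": ["uri"],
        },
    ],
}

# A data_set instance must satisfy both the shared artifact fields and its own.
jsonschema.validate(
    instance={"id": "1", "name": "training-data", "uri": "gs://bucket/path"},
    schema=data_set_schema,
)
```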

@terrytangyuan
Member

+1 for using JSON schema. Two questions:

  1. Should we use “data_set” or “dataset” (personally prefer “dataset”)?
  2. Could you clarify what kind of data transformation is executable? In some scenarios, data transformations may be part of the model itself, e.g. a TensorFlow graph containing both preprocessing and the trained model. There are also other types of model artifacts such as PMML and PFA where people might also store preprocessing logic within the model.

@zhenghuiwang
Contributor Author

  1. Should we use “data_set” or “dataset” (personally prefer “dataset”)?

I don't have a strong preference. Based on Wikipedia and Google Trends, it seems "data set" is more widely used than "dataset".

  2. Could you clarify what kind of data transformation is executable?

I meant to use the data transformation from raw data to training data as an example.

I made a mistake in the type hierarchy: an executable should be defined as an artifact that has inputs and outputs, so executable should be a subtype of artifact. I think your example is really about whether a model should be viewed as an executable or an artifact, which are two different views: one can view a model as an artifact, where a separate program/library makes predictions based on the model plus input; one can also view a model as an executable that makes predictions itself.

What do you think?

@terrytangyuan
Member

Sounds good then, that makes more sense. Yes, it depends on which way we look at it. It should be fine as long as we clarify our definition of a model. Users can also define their own customized metadata if needed.

@neuromage
Contributor

I made a mistake in the type hierarchy: An executable should be defined as an artifact that has input and output. So executable should be a subtype of artifact.

I think these should be two separate concepts. An artifact represents anything produced by an ML pipeline or run. An executable represents the pipeline/run itself.

@zhenghuiwang
Contributor Author

Absolutely, these two should be treated as separate concepts.

I don't mean to go into taxonomy, but to point out that a model is viewed as derived data in the training phase but can be viewed as an executable in the serving phase.

@jlewi
Contributor

jlewi commented Jun 10, 2019

@neuromage @zhenghuiwang How is this coming? What is the remaining work for 0.6?

@zhenghuiwang
Contributor Author

The metadata definitions for v0.6 were completed in #17. They are preloaded in the backend service during startup and exposed in the Python SDK.

We can add more fields into these definitions for future use cases.
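For anyone landing here, a rough sketch of what logging against the preloaded types from the Python SDK might look like; class and parameter names below are assumptions and may differ from the released SDK:

```python
# A rough sketch, not a definitive API reference: log a data set and a model
# with the preloaded metadata types via the Python SDK. Class and parameter
# names are assumptions and may differ from the released SDK.
from kubeflow.metadata import metadata

ws = metadata.Workspace(
    backend_url_prefix="metadata-service.kubeflow:8080",  # assumed address
    name="ws1",
    description="a workspace for testing",
)
run = metadata.Run(workspace=ws, name="run-1")
execution = metadata.Execution(name="execution-1", workspace=ws, run=run)

execution.log_input(metadata.DataSet(
    name="mytable-dump",
    uri="file://path/to/dataset",
    version="v1.0.0",
))
execution.log_output(metadata.Model(
    name="MNIST",
    uri="gs://my-bucket/mnist",
    model_type="neural network",
    hyperparameters={"learning_rate": 0.5},
    version="v0.0.1",
))
```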
