
Define a set of metadata specifications for v0.6 #4

Closed
zhenghuiwang opened this issue Apr 8, 2019 · 9 comments

Comments

@zhenghuiwang
Contributor

These metadata specifications should

  1. cover metadata needed for v0.6 CUJ, including model, hyperparameters, metrics etc.
  2. show how customized metadata can be defined in a similar way.

These metadata specifications are preloaded during server start, while customized ones are registered via endpoints.
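For illustration only, here is a minimal sketch of what registering a customized type at runtime could look like; the endpoint path, payload shape, and service address are assumptions, not the actual API:

```python
# Hypothetical sketch: register a customized metadata type by POSTing its
# JSON schema to the metadata service. The URL and payload layout below are
# assumptions for illustration, not the real endpoint.
import json
import requests

custom_type_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "threshold": {"type": "number"},
    },
    "required": ["name"],
}

resp = requests.post(
    "http://metadata-service:8080/api/v1/types",  # assumed endpoint
    headers={"Content-Type": "application/json"},
    data=json.dumps({"name": "my_custom_type", "schema": custom_type_schema}),
)
resp.raise_for_status()
```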


@zhenghuiwang
Contributor Author

zhenghuiwang commented Apr 9, 2019

Some thoughts on what format/tool should be used to define the spec.

We need to handle customized types at runtime. Protobuf is not a good fit here, because it requires compiling schemas and generating code. JSON Schema seems to be a good fit:

  1. Schema can be referenced and extended.
  2. Many libraries are available to compile schema and validate JSON during runtime.

Following this JSON Schema guideline on defining complex schemas, it would be good if we could extract common fields and reuse them. It is also important to categorize metadata by semantic meaning.
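To illustrate the second point above, a minimal sketch (using the Python `jsonschema` package; the schema fields are illustrative assumptions, not the proposed spec) of compiling a schema and validating a record at runtime:

```python
# A minimal sketch, not the proposed spec: validate a metadata record against
# a JSON schema at runtime with the `jsonschema` package. Field names are
# illustrative assumptions.
import jsonschema

model_schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        "version": {"type": "string"},
        "hyperparameters": {"type": "object"},
        "metrics": {"type": "object"},
    },
    "required": ["name", "version"],
}

record = {
    "name": "mnist-classifier",
    "version": "v1",
    "hyperparameters": {"learning_rate": 0.01},
    "metrics": {"accuracy": 0.97},
}

# Raises jsonschema.ValidationError if the record does not conform.
jsonschema.validate(instance=record, schema=model_schema)
```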

Here is my attempt to define a type hierarchy for metadata of ML workflows. A child is a subtype of its parent. For example, we have a generic entity (schema draft); artifact (schema draft) is a subtype of entity; data_set (schema draft) is a subtype of artifact.

entity
├── artifact
│       ├── data_set
│       └── model
│
├── execution
│
└── container
         ├── workspace
         └── katib_experiment

entity is a piece of metadata that represents an entity in Kubeflow. There are three categories of entities:

  - artifact represents input data or derived data in an ML workflow, such as a data set or a model.
  - executable represents an executable in an ML workflow, such as training code or a data transformation.
  - container represents a group of artifacts, executables, and other containers, e.g. workspaces and experiments.
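To make the subtype relation concrete, here is a hedged sketch of how data_set could extend artifact via allOf and $ref; the field names are my assumptions, not the linked schema drafts:

```python
# A hedged sketch, not the actual schema drafts: data_set is composed as a
# subtype of artifact using `allOf` and `$ref`. Field names are illustrative.
import jsonschema

data_set_schema = {
    "definitions": {
        "artifact": {
            "type": "object",
            "properties": {
                "id": {"type": "string"},
                "name": {"type": "string"},
                "create_time": {"type": "string", "format": "date-time"},
            },
            "required": ["id", "name"],
        }
    },
    "allOf": [
        {"$ref": "#/definitions/artifact"},
        {
            "properties": {
                "uri": {"type": "string"},
                "query": {"type": "string"},
            },
            "required": ["uri"],
        },
    ],
}

# A data_set instance must satisfy both the shared artifact fields and its own.
jsonschema.validate(
    instance={"id": "1", "name": "training-data", "uri": "gs://bucket/path"},
    schema=data_set_schema,
)
```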

@terrytangyuan
Member

+1 for using JSON schema. Two questions:

  1. Should we use “data_set” or “dataset” (personally prefer “dataset”)?
  2. Could you clarify what kind of data transformation is executable? In some scenarios, data transformations may be part of the model itself, e.g. a TensorFlow graph containing both preprocessing and the trained model. There are also other types of model artifacts such as PMML and PFA where people might also store preprocessing logic within the model.

@zhenghuiwang
Contributor Author

  1. Should we use “data_set” or “dataset” (personally prefer “dataset”)?

I don't have a strong preference. Based on Wikipedia and Google Trends, it seems "data set" is more widely used than "dataset".

  2. Could you clarify what kind of data transformation is executable?

I meant to use the data transformation from raw data to training data as an example.

I made a mistake in the type hierarchy: an executable should be defined as an artifact that has inputs and outputs, so executable should be a subtype of artifact. I think your example is really about whether a model should be viewed as an executable or an artifact, which are two different views: one can view a model as an artifact, where a separate program/library makes predictions based on the model plus input; one can also view a model as an executable that makes predictions itself.

What do you think?

@terrytangyuan
Member

Sounds good then, that makes more sense. Yes, it depends on which way we look at it. It should be fine as long as we clarify our definition of a model. Users can also define their own customized metadata if needed.

@neuromage
Contributor

I made a mistake in the type hierarchy: An executable should be defined as an artifact that has input and output. So executable should be a subtype of artifact.

I think these should be two separate concepts. An artifact represents anything produced by an ML pipeline or run. An executable represents the pipeline/run itself.

@zhenghuiwang
Contributor Author

Absolutely, these two should be treated as separate concepts.

I don't mean to go into taxonomy, but to point out that a model is viewed as derived data in the training phase but can be viewed as an executable in the serving phase.

@jlewi
Contributor

jlewi commented Jun 10, 2019

@neuromage @zhenghuiwang How is this coming? What is the remaining work for 0.6?

@zhenghuiwang
Contributor Author

The metadata definitions for v0.6 were completed in #17. They are preloaded in the backend service during startup and exposed in the Python SDK.

We can add more fields into these definitions for future use cases.
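For anyone landing here, a rough sketch of what logging against the preloaded types from the Python SDK might look like; class and parameter names below are assumptions and may differ from the released SDK:

```python
# A rough sketch, not a definitive API reference: log a data set and a model
# with the preloaded metadata types via the Python SDK. Class and parameter
# names are assumptions and may differ from the released SDK.
from kubeflow.metadata import metadata

ws = metadata.Workspace(
    backend_url_prefix="metadata-service.kubeflow:8080",  # assumed address
    name="ws1",
    description="a workspace for testing",
)
run = metadata.Run(workspace=ws, name="run-1")
execution = metadata.Execution(name="execution-1", workspace=ws, run=run)

execution.log_input(metadata.DataSet(
    name="mytable-dump",
    uri="file://path/to/dataset",
    version="v1.0.0",
))
execution.log_output(metadata.Model(
    name="MNIST",
    uri="gs://my-bucket/mnist",
    model_type="neural network",
    hyperparameters={"learning_rate": 0.5},
    version="v0.0.1",
))
```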
