Migrate component registry to metadata service #2031

Closed
kiersten-stokes opened this issue Aug 11, 2021 · 2 comments · Fixed by #2083
Labels
component:metadata metadata runtime kind:enhancement New feature or request

@kiersten-stokes
Member

kiersten-stokes commented Aug 11, 2021

Is your feature request related to a problem? Please describe.
The component registries are currently stored in etc/config/components. Every time a change is made to the registry, a rebuild is required. Additionally, it is not very user-friendly to add or remove components from the registry JSON files, as seen in discussion #1881.

Describe the solution you'd like
As we look toward adding a GUI for managing custom components (#1880), it could make sense to move the implementation of the registries to the metadata service. This also has the added benefit of improved consistency among stored information.

Design Considerations

I think this issue can be broken down into the following 3 parts:

  1. Completely reorganizing the component registry to point to the location (directory or url) of a list of component specifications
  2. Moving the component registry to the metadata service
  3. Handling component categories

Descriptions, design considerations, and questions for each part follow.


1. Restructuring component registry

Motivation

Allows for a user to easily add more components, rather than having to define each individual component one-by-one. This will pair nicely with the ability for users to disable/not display certain components in the palette (issue #2009).

As a standalone feature, this may be of more 'moderate' importance, but if we intend to support users customizing their palette components via moving the component registries to metadata, then it would probably be best to implement this first so as to not cause confusion when we change things down the line.

High-level direction

Currently, the component registries (one for each runtime) contain a list of components and the location of each component spec in an 'unofficial' JSON format like the below.

 "components": {
    "bash-operator": {
      "location": {
        "url": "https://raw.githubusercontent.com/apache/airflow/1.10.15/airflow/operators/bash_operator.py"
      },
      "category": "airflow"
    },
    "email-operator": {
      "location": {
        "url": "https://raw.githubusercontent.com/apache/airflow/1.10.15/airflow/operators/email_operator.py"
      },
      "category": "airflow"
    }
}

Instead, we want to move to user-defined component registries. Each registry holds one or more "location specifiers"; each specifier names exactly one location, and that location points to one or several component spec files to be parsed. To start, a location specifier will be one of three types: filename, url, or directory, as shown below. See the comment below for more details.

{
  "display_name": "KFP Preloaded Components",
  "metadata": {
    "description": "Preloaded components that are supported by Kubeflow Pipelines",
    "runtime": "kfp",
    "locations": [
      {
        "location_path": "kfp",    *this is a relative directory for our preloaded components
        "location_type": "directory",
        "categories": ["Preloaded KFP"]
      },
      {
        "location_path": "https://some/url/here/component.yaml",
        "location_type": "url",
        "categories": ["Preloaded KFP"]
      },
      {
        "location_path": "some/path/to/component.yaml", *this is a relative path for our preloaded components
        "location_type": "filename",
        "categories": ["Preloaded KFP"]
      }
    ]
  },
  "schema_name": "component-registry"
}
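
To make the single-valued versus multi-valued distinction concrete, here is a minimal sketch of how one location specifier from the JSON above might be expanded into a list of spec locations. The function name `resolve_location`, the `base_dir` parameter, and the `SPEC_EXTENSIONS` set are assumptions made for illustration; none of them exist in the codebase or are fixed by this issue.

```python
from pathlib import Path
from typing import Dict, List

# Hypothetical set of spec-file extensions; the issue does not fix one.
SPEC_EXTENSIONS = {".yaml", ".yml", ".py"}


def resolve_location(location: Dict[str, object], base_dir: Path) -> List[str]:
    """Return the spec locations that a single location specifier points to."""
    path = str(location["location_path"])
    kind = location["location_type"]

    if kind == "url":
        # A url specifier is single-valued: return it unchanged.
        return [path]
    if kind == "filename":
        # A filename specifier is single-valued and relative to base_dir.
        return [str(base_dir / path)]
    if kind == "directory":
        # A directory specifier is multi-valued: one entry per spec file found.
        directory = base_dir / path
        return sorted(
            str(entry)
            for entry in directory.iterdir()
            if entry.is_file() and entry.suffix in SPEC_EXTENSIONS
        )
    raise ValueError(f"Unsupported location_type: {kind!r}")
```

Filtering by extension is only one possible answer to the question of how to pick out component specs in a directory; the real rule would need to be agreed on.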

Requirements:

  • Remove current component catalogs (airflow_component_catalog.json and kfp_component_catalog.json) and instead create a component-registry metadata instance using a schema like the above JSON, with a directory location of where the preconfigured component specs reside (probably an ENV_JUPYTER_PATH in our case)
    • More schema info in next section
    • Maybe named something like elyra-preconfigured-kfp and elyra-preconfigured-airflow
  • Removal of the component catalogs will push assigning the id until after spec content read, rather than getting it from the catalog directly as it is done now
    • We will need a new heuristic for assigning component ids: maybe [filename w/o extension]_[component name], which should provide enough of an assurance against repeats
  • Will probably have to remove the ability to parse just a single Airflow class in order to support the multi-valued type registries (directory-based or github repo-based, e.g.)
    • This shouldn't cause much of a latency issue since we can expect an operator file to only define a handful of classes
    • This will also remove the catalog_entry_id attribute on Component among other minor changes to existing functions, such as the function that retrieves only a singular component (again, due to caching and how get_component currently works, this shouldn't cause a huge latency issue)
  • Assignment of the required reader (based on registry location type) will need to be pushed into component_registry.py and out of the parser classes (this helps to clear up the function of the parsers as well, which is simply "to parse" given content)
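
As a rough illustration of the proposed id heuristic ([filename w/o extension]_[component name]), the sketch below derives an id from a spec location and a component name. `make_component_id` is a hypothetical helper, and replacing non-alphanumeric characters in the name with hyphens is an added assumption, not something decided in this issue. It assumes POSIX-style paths or URLs.

```python
import re
from pathlib import PurePosixPath
from urllib.parse import urlparse


def make_component_id(location: str, component_name: str) -> str:
    """Build an id of the form '<file stem>_<component name>'."""
    # urlparse leaves plain relative paths in .path, so this handles both
    # URLs and file paths.
    path = urlparse(location).path or location
    stem = PurePosixPath(path).stem

    # Assumption: collapse anything that is not a letter or digit into a
    # single hyphen so the id is safe to use as a key.
    safe_name = re.sub(r"[^0-9A-Za-z]+", "-", component_name).strip("-")
    return f"{stem}_{safe_name}"
```

Under these assumptions, the Airflow `bash_operator.py` URL from the current catalog combined with a class named `BashOperator` would give `bash_operator_BashOperator`.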

Questions:

  • What is the best way to grab the details/location of the individual component specs from a given multi-valued registry location?
  • How will we handle what a user could enter for a directory for file-based registries? Will the directory have to be somewhere accessible from the JL root dir?
    • Handling our preconfigured components will be different from handling the user-created registries (same as with preconfigured runtime-images, but functionality with parsing registries is much more involved)
  • Will all components in a given registry be required to be only-KFP or only-Airflow? If not, how do we determine which parser will be used to parse the component specs? (Allowing this sort of defeats the purpose of having per-processor parsing.)

2. Moving component registry to metadata

Motivation

Allows a user to add/modify/delete their own component registries, as explained in the above section. This is fairly high importance/impact as it will significantly improve the user experience for customizing their component list.

(If we decide to not implement component registry definitions and instead stick to defining individual components and their locations, most things in this section still apply with minor changes to the schema and other details.)

High-level direction

The potential design raises some questions because components (and the preprocessing required to define them) are quite different from the other metadata namespaces that we currently store and access.

At a high level, I think it might be best to keep much of the component parsing logic as-is. The metadata service has its own fetch endpoints, but we can't really use these due to the difference mentioned above. I think the best way to proceed would be to continue to use the existing palette and properties API endpoints (in pipeline/handlers) that invoke the processors to parse and return their component details. The majority of changes will occur in the _read_component_registry function in component_registry.py, which will call MetadataManager.get_all() to retrieve registry details and loop through specs to construct Component objects as it does currently. The existing Metadata API endpoints can then be used to add or modify (PUT method) and delete (DELETE method) registries.
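
The read path described above could look roughly like the following. This is only a sketch: `InMemoryMetadataManager` is a stand-in that models nothing but a `get_all()` call, registry instances are represented as plain dicts rather than real metadata objects, and `parse_location` is a hypothetical callback that yields `(id, name)` pairs for one location specifier.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List, Tuple


@dataclass
class Component:
    id: str
    name: str
    runtime: str
    location_path: str


class InMemoryMetadataManager:
    """Stand-in for the metadata service; only get_all() is modeled."""

    def __init__(self, instances: Iterable[Dict]):
        self._instances = list(instances)

    def get_all(self) -> List[Dict]:
        return list(self._instances)


def read_component_registry(
    manager: InMemoryMetadataManager,
    runtime: str,
    parse_location: Callable[[Dict], Iterable[Tuple[str, str]]],
) -> List[Component]:
    """Collect Component objects for every registry of the given runtime."""
    components: List[Component] = []
    for registry in manager.get_all():
        metadata = registry["metadata"]
        if metadata["runtime"] != runtime:
            continue  # each registry is assumed to target a single runtime
        for location in metadata["locations"]:
            for component_id, name in parse_location(location):
                components.append(
                    Component(component_id, name, runtime, location["location_path"])
                )
    return components
```

Keeping the parser behind a callback mirrors the idea that the processors keep doing the parsing while the registry code only decides what to read.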

Requirements:

  • Create a component-registries namespace
  • Create a component-registries schema and add to metadata/schemas
    • Proposal:
    {
      "display_name": "Built-in KFP Components",
      "metadata": {
        "runtime":  "kfp",       *any supported runtime processor
        "description": "Some KFP components",
        "locations": {
             "location_path": "some/path/to/kfp/registry/filename.yaml",
             "location_type": "file",     *one of ['file', 'url', 'directory']
             "categories": ["category_name1", "category_name2"]  *any arbitrary string names that will translate into category names
        }
      },
      "schema_name": "component-registry"
    }
    
  • Add JSON schema instances for each predefined component registry and move them to etc/config/metadata/component-registries, as explained in the above section (these will be loaded into the share folder on build as they are currently)
  • Add a GUI (not strictly necessary to start, but obviously is the big user-experience piece)
  • We can also handle cache update considerations in a future PR, but we will eventually want to ensure that when a user adds/updates a component registry that the new components populate in the palette ASAP
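
To show the kind of checks the proposed schema implies, here is a small hand-rolled validator. It is a sketch only: the real checks would live in a JSON schema file under metadata/schemas, `validate_registry` is a hypothetical name, and accepting `locations` as either a single object or a list is an assumption made because both shapes appear in this issue. The allowed location types follow the proposal above ('file', 'url', 'directory').

```python
from typing import Any, Dict, List

ALLOWED_LOCATION_TYPES = {"file", "url", "directory"}


def validate_registry(instance: Dict[str, Any]) -> List[str]:
    """Return a list of problems found in a component-registry instance."""
    errors: List[str] = []
    if instance.get("schema_name") != "component-registry":
        errors.append("schema_name must be 'component-registry'")

    metadata = instance.get("metadata") or {}
    runtime = metadata.get("runtime")
    if not isinstance(runtime, str) or not runtime:
        errors.append("metadata.runtime must be a non-empty string")

    locations = metadata.get("locations")
    if isinstance(locations, dict):
        locations = [locations]  # assumption: tolerate a single object
    if not isinstance(locations, list) or not locations:
        errors.append("metadata.locations must contain at least one location")
        return errors

    for index, location in enumerate(locations):
        prefix = f"locations[{index}]"
        if not isinstance(location, dict):
            errors.append(f"{prefix} must be an object")
            continue
        if not isinstance(location.get("location_path"), str):
            errors.append(f"{prefix}.location_path must be a string")
        if location.get("location_type") not in ALLOWED_LOCATION_TYPES:
            errors.append(f"{prefix}.location_type must be one of {sorted(ALLOWED_LOCATION_TYPES)}")
        categories = location.get("categories", [])
        if not isinstance(categories, list) or not all(isinstance(c, str) for c in categories):
            errors.append(f"{prefix}.categories must be a list of strings")
    return errors
```

An empty list means the instance passed every check in this sketch.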

Questions/Considerations:

  • The given proposal restricts users to assigning the same categories to every component in a given location -> is this acceptable?
  • The given proposal restricts users to assigning the same processor/runtime to every component in the given registry -> is this acceptable? If not, further questions are raised as to how the runtime would be determined and how to do per-processor parsing

3. Handling component categories

Motivation

Allows a user to assign multiple categories to a single component or set of components.

Moderate importance. Would allow users to organize components in a way that makes sense to them.

High-level direction

We want to greatly simplify the way that categories are fetched and rendered today, while still letting the user enter a list of categories (which will be translated to a list of strings).

Currently, we only have two component categories: KFP and Airflow. We can have the preconfigured components assigned to these categories, with the option for the user to edit the registry definition to change the category.

Requirements:

  • Remove the ComponentCategory class and all functions that fetch or create such an instance
  • Change the categories attribute on a Component object to be a list of strings
  • Change how the palette is rendered in to_canvas_palette and in the jinja template to organize components by category (components that have more than one category will therefore be rendered multiple times)
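
Once categories are plain strings, the palette grouping could be sketched as below. Components are modeled as simple dicts here, and the fallback name `Uncategorized` is a placeholder assumption, since the name for components without a category is still an open question.

```python
from collections import defaultdict
from typing import Dict, Iterable, List

# Placeholder only; the fallback category name is undecided.
UNCATEGORIZED = "Uncategorized"


def group_by_category(components: Iterable[Dict]) -> Dict[str, List[str]]:
    """Map each category name to the ids of the components it contains."""
    palette: Dict[str, List[str]] = defaultdict(list)
    for component in components:
        # A component with several categories is listed once per category.
        for category in component.get("categories") or [UNCATEGORIZED]:
            palette[category].append(component["id"])
    return dict(palette)
```

A component tagged with two categories therefore shows up under both headings, matching the "rendered multiple times" behavior described above.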

Questions/Considerations:

  • Removing the ComponentCategory class will result in losing the ability to add a description to the category (viewable when hovering over a component in the palette)
  • What category name do we want to assign to components that have no assigned category?

Additional changes required

  • Documentation will have to be updated (see discussion linked above)
  • Tests will have to be updated; major changes will be required for the component parser tests, and minor changes to the handlers and other pipeline tests may be required based on the new approach
  • The validation service will have to be tweaked so as not to refer to ComponentCategory objects (details to be added)
  • processor_kfp.py may require minor changes to ensure that filename-based components are loaded correctly according to their stored path
@kevin-bates
Member

@kiersten-stokes and I met yesterday to talk about some approaches and thought the following might satisfy a number of requirements (although we also realized we might not have a crystal-clear understanding of the requirements). All names are placeholders to help with communication and can be changed.

  • A Component Registry consists of a list of location specifiers
  • A Location Specifier is one of three types: file, url, or directory.
  • File and Url location specifiers identify a specific component definition (are single-valued)
  • Directory location specifiers identify a set of component definitions (are multi-valued)
  • Application-based properties like category are associated with location specifiers - not components. As a result, they apply to all components in a multi-valued location (like directory).
  • Category is multi-valued and should be named Categories in order to enable the ability for a given component to appear in different categories.
  • A Component Registry specifies a runtime type (i.e., a platform - Kubeflow Pipelines, Apache Airflow, etc)

Notes:
If we wanted to have a multi-valued URL-based location specifier, we could add one (e.g., to support a GitHub repo), but we'd probably want to identify a sub-directory in the repo as the location of the component definitions. Otherwise, we can't really distinguish component definition files from other files.

We can also hang location-specific attributes/properties within a location specifier - they are essentially object-valued relative to the schema. So something like a COS location-specifier could include specific attributes like credentials, bucket name, etc.

Although location-specifier properties like categories reside on the specifier, what the front-end receives could include those values on the component. I.e., location specifier (and even component registry) properties can be distributed onto the component definition upon the component definition's retrieval.
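
A minimal sketch of that distribution step, assuming components, location specifiers, and registries are all plain dicts shaped like the JSON in the issue description. The function name `to_frontend_component` is hypothetical, and copying only `runtime` and `categories` is an assumption about which properties the front-end needs.

```python
from typing import Any, Dict


def to_frontend_component(
    component: Dict[str, Any],
    location: Dict[str, Any],
    registry: Dict[str, Any],
) -> Dict[str, Any]:
    """Return a copy of the component with inherited properties attached."""
    return {
        **component,
        # Registry-level property: every component shares the runtime.
        "runtime": registry["metadata"]["runtime"],
        # Location-level property: copied so later edits do not leak back.
        "categories": list(location.get("categories", [])),
    }
```

The stored component definition stays untouched; only the payload built at retrieval time carries the inherited values.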

@kiersten-stokes - please add additional comments if I missed or misrepresented anything.

@kiersten-stokes
Member Author

I now have the groundwork for implementing this issue laid out on a local branch. I'm planning on moving forward with the design as laid out in the issue description above. @lresende or anyone else, let me know if you think another design discussion is needed before I dive in.
