Open-Catalog - Planning (metadata catalog)
General
The purpose of this document is to lay out a high-level plan for a metadata catalog capability, as specified in issue 685.
The work will be incremental, such that the external API is created first to fulfill the different requirements that arise from existing metadata needs, such as:
PPL connectivity metadata discovery API, described in 561
Query Cost based Optimizer statistics API, described in 612
Setup Stage
In this stage we need to create a new open-search project named open-knowledge.
Stage 1:
This part will include creating the external API that will be used by PPL and additional clients.
To be agile in defining the API spec (and allow simple evolution), and to share a common language, we will use a generic DSL.
Here we have a few alternatives:
Smithy DSL (https://github.com/awslabs/smithy) will allow us to decouple the specification from the actual implementation.
GraphQL DSL (https://graphql.org/learn/schema/) will also allow us to create a specification independent of the actual implementation and let it evolve according to requirements; see RFC 698.
YangDB's DSL
This proprietary DSL supports both a logical entity-relation topology and lower-level storage usability concerns.
If we select this option to represent the API, we will need to implement an API generation mechanism, since it currently has no API auto-generation capability.
This step will result in 3 (or more) DSL specifications for different needs:
validation API
statistics API
connectivity API
The resulting artifact will also be able to auto-generate a Java implementation of the client/server API spec, to function as a stub for external usage in system tests.
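To make the stub idea concrete, here is a minimal sketch of what such an auto-generated client/server API stub could look like. All interface and method names are hypothetical illustrations of the three planned API specs, not a committed contract:

```java
// Sketch of the kind of client/server stub a DSL code generator might emit
// for the three API specs (validation, statistics, connectivity).
// All names here are hypothetical, for illustration only.
import java.util.List;
import java.util.Map;

public class CatalogApiStub {

    /** Connectivity API: discovers which datasources a caller may query. */
    public interface ConnectivityApi {
        List<String> listDatasources();
        Map<String, String> describeDatasource(String name);
    }

    /** Statistics API: exposes table-level statistics for a cost-based optimizer. */
    public interface StatisticsApi {
        long rowCount(String table);
        double columnCardinality(String table, String column);
    }

    /** Validation API: checks that a field exists in the catalog's schema. */
    public interface ValidationApi {
        boolean fieldExists(String table, String field);
    }

    /** A trivial in-memory stub, the sort of artifact system tests could run against. */
    public static class InMemoryValidation implements ValidationApi {
        private final Map<String, List<String>> schema;

        public InMemoryValidation(Map<String, List<String>> schema) {
            this.schema = schema;
        }

        @Override
        public boolean fieldExists(String table, String field) {
            return schema.getOrDefault(table, List.of()).contains(field);
        }
    }
}
```

System tests would program against the interfaces only, so the generated stub can later be swapped for the real implementation without changing the tests.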
References
https://github.com/netflix/dgs-framework
https://graphql.org/learn/schema/
https://github.com/awslabs/smithy
https://github.com/YANG-DB/yang-db/blob/dev-opensearch/docs/info/components/ontology.md
https://linkedin.github.io/rest.li/pdl_schema
https://cwiki.apache.org/confluence/display/HCATALO/Design+Document+-+Java+APIs+for+HCatalog+DDL+Commands
https://calcite.apache.org/javadocAggregate/org/apache/calcite/schema/Statistics.html
https://calcite.apache.org/javadocAggregate/org/apache/calcite/schema/Table.html
Stage 2:
Once the external API is created with some level of confidence, the next phase will be to implement the internal storage catalog Ontology, which will allow representing these aspects with the ability to evolve.
We will take advantage of knowledge-graph concepts to create a generic Ontology that reflects the API's common entities.
Essential entities that will be common to the Ontology:
Example proposed Entities
Dataset
The DATASET entity represents collections of data that are typically represented as Tables or Views, Streams in a stream-processing environment, or bundles of data found as Files or Folders in data lake systems.
Table (Index)
The TABLE entity represents a collection of columns that is typically treated as a logical unit with business meaning or significance.
Dashboard
The DASHBOARD entity represents a collection of Tables or Queries for visualization.
Role
The ROLE entity represents a logical action that can be performed upon another asset (resource).
This entity list is partial and is to be considered an example only.
See suggested GraphQL partial schema RFC 698
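As a complement to the GraphQL partial schema, the entities above can be sketched as simple ontology nodes carrying a type label and relations to other assets. The class and factory names below are assumptions for illustration, not the final model:

```java
// A minimal sketch (assumed names) of the example entities modeled as
// ontology nodes, each with a type label and relations to other assets.
import java.util.ArrayList;
import java.util.List;

public class OntologyEntities {

    /** Every catalog asset shares a type label, a unique name, and relations. */
    public static class Entity {
        public final String type;   // DATASET, TABLE, DASHBOARD, ROLE, ...
        public final String name;
        public final List<Entity> relations = new ArrayList<>();

        public Entity(String type, String name) {
            this.type = type;
            this.name = name;
        }

        public Entity relateTo(Entity other) {
            relations.add(other);
            return this;
        }
    }

    public static Entity dataset(String name)   { return new Entity("DATASET", name); }
    public static Entity table(String name)     { return new Entity("TABLE", name); }
    public static Entity dashboard(String name) { return new Entity("DASHBOARD", name); }
    public static Entity role(String name)      { return new Entity("ROLE", name); }
}
```

For example, a DASHBOARD relating to a TABLE would simply be `dashboard("ops").relateTo(table("logs"))`; new entity kinds are added without touching the shared node shape, which is what lets the Ontology evolve.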
Ontology
This Ontology will support versioning for backward and forward compatibility, and it will be maintained in a dedicated location.
Since we intend open-search to be the data store holding the metadata registry and content, this topology must have an index-generator capability that allows schema creation in the underlying open-search engine.
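The index-generator step can be sketched as deriving an open-search index mapping from an entity's typed properties. The property-type-to-field-type rules below are illustrative assumptions, not the real generator:

```java
// A sketch of the index-generator idea: derive an open-search index mapping
// body from an ontology entity's typed properties. The type-mapping rules
// here are illustrative assumptions only.
import java.util.Map;
import java.util.TreeMap;

public class IndexGenerator {

    /** Map an assumed ontology property type to an open-search field type. */
    static String fieldType(String ontologyType) {
        switch (ontologyType) {
            case "string": return "keyword";
            case "text":   return "text";
            case "int":    return "long";
            case "date":   return "date";
            default:       return "keyword";
        }
    }

    /** Render a minimal mappings body for one entity's properties. */
    public static String mappingFor(Map<String, String> properties) {
        StringBuilder sb = new StringBuilder("{\"mappings\":{\"properties\":{");
        boolean first = true;
        // TreeMap gives deterministic field order in the generated mapping.
        for (Map.Entry<String, String> p : new TreeMap<>(properties).entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(p.getKey()).append("\":{\"type\":\"")
              .append(fieldType(p.getValue())).append("\"}");
            first = false;
        }
        return sb.append("}}}").toString();
    }
}
```

The generated body is what would be sent when creating the backing index for an entity, so schema creation stays driven by the Ontology rather than hand-written mappings.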
Currently, the only available option for representing such a general-purpose topology (with index generation) is YangDB's DSL.
Depending on which DSL we choose to generate the API, we may have to implement a DSL-to-DSL translator from the chosen API DSL to YangDB's Ontology.
The resulting artifacts will be as follows:
Ontology representation
API DSL to Ontology converter (currently one direction only)
This tool will need to support multi-API conversion in the future
Ontology index-generator support
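The one-directional DSL-to-DSL conversion can be sketched as rewriting a hypothetical API type description into an ontology entity description. Both the source and target shapes below are assumptions for illustration:

```java
// Sketch of the one-directional API-DSL-to-Ontology converter: a hypothetical
// API type (name plus field types) is rewritten into an ontology entity
// description. The source and target shapes are assumptions only.
import java.util.LinkedHashMap;
import java.util.Map;

public class DslToOntologyConverter {

    /** Convert one API DSL type (field name -> field type) into ontology form. */
    public static Map<String, Object> convert(String typeName, Map<String, String> fields) {
        Map<String, Object> entity = new LinkedHashMap<>();
        entity.put("eType", typeName.toUpperCase());      // ontology entity label
        Map<String, String> props = new LinkedHashMap<>();
        // One ontology property per API field, with normalized type names.
        fields.forEach((field, type) -> props.put(field, type.toLowerCase()));
        entity.put("properties", props);
        return entity;
    }
}
```

Supporting multiple API DSLs later would mean one such converter per source DSL, all targeting the same ontology shape.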
Stage 3:
This stage will focus on a general-purpose query language that allows asking cross-domain questions of the metadata catalog.
We will select a cross-domain, general-purpose (graph) query language that lets us reflect the multiple aspects we need to collect in a single query.
Currently, the most likely candidates are open-cypher, or GQL with its preliminary release.
In contrast with the specific API, such a query is untyped and returns a collection of rows that may later be transformed into meaningful logical entities.
Supporting a general-purpose graph language is a very large task, so we will take advantage of YangDB's existing query translators to convert a given query into an open-search specific query.
This stage will depend heavily on the progress of the open-graph feature and will in fact be a joint effort.
The resulting artifacts will be as follows:
A general-purpose graph query language, validated by the previously created metadata-catalog ontology.
Executing such queries against already existing catalog indices and returning data from these tables.
Performing cross-domain queries (joins) across the catalog indices.
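The query-translation direction can be illustrated with a toy example: a single node pattern (label plus one property filter) rewritten into an open-search style bool/filter query. Real open-cypher translation, as done by YangDB's translators, is far more involved; the `eType` field name below is an assumption:

```java
// A toy sketch of the query-translation step: a single node pattern,
// e.g. MATCH (n:TABLE {name: "logs"}), rewritten into an open-search
// style bool/filter query body. Field names are assumptions.
public class GraphQueryTranslator {

    /** Translate one labeled node pattern with a property filter. */
    public static String translate(String label, String prop, String value) {
        return "{\"query\":{\"bool\":{\"filter\":["
             + "{\"term\":{\"eType\":\"" + label + "\"}},"
             + "{\"term\":{\"" + prop + "\":\"" + value + "\"}}"
             + "]}}}";
    }
}
```

Cross-domain questions would expand this pattern to multiple labels and relation hops, which is where joining across the catalog indices comes in.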