[PLANNING] Open-Catalog project #699

Closed
YANG-DB opened this issue Jul 16, 2022 · 1 comment
Labels
enhancement New feature or request

Comments

@YANG-DB
Member

YANG-DB commented Jul 16, 2022

  • Category: Enhancement
  • Scope: Planning

Open-Catalog - Planning (metadata catalog)

General

The purpose of this document is to create a high-level plan for a metadata catalog capability, as specified in issue 685.

The work will be incremental: the external API will be created first, to fulfill the requirements that arise from existing
metadata needs such as:

  • PPL connectivity metadata discovery API described in 561
  • PPL query validation API described in 561
  • Query Cost based Optimizer statistics API described in 612

Setup Stage

In this stage we need to create a new OpenSearch project named open-knowledge.

Stage 1:

This part will include creating the external API that will be in use by PPL and additional clients.
To stay agile (and allow simple evolution) in the API spec definition, while still sharing a common language, we will use a generic DSL.

Here we have a few alternatives:

This proprietary DSL has support for both logical entity-relation topology and for lower level aspects of storage usability concerns.

If we select this option to represent the API, we need to implement an API generation mechanism, since it currently has no API auto-generation capability.


This step will result in three (or more) DSLs for different needs:

  • validation API
  • statistics API
  • connectivity API

The resulting artifact will also be able to auto-generate a Java implementation of the client/server API spec, to function as a stub for external usage in system tests.
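
As a rough illustration of what such a stub could offer to system tests, the sketch below models the three APIs as one in-memory Python class. Every name here (`CatalogStub`, `list_tables`, `validate_fields`, `table_stats`) is hypothetical and is not taken from issues 561 or 612; the plan itself targets a generated Java stub, so this only shows the shape of the idea.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogStub:
    """In-memory stand-in for the connectivity, validation and statistics APIs."""

    # table name -> {column name -> type name}; hypothetical storage layout
    tables: dict[str, dict[str, str]] = field(default_factory=dict)
    # table name -> row count; hypothetical statistic
    row_counts: dict[str, int] = field(default_factory=dict)

    def list_tables(self) -> list[str]:
        """Connectivity/discovery: which tables are known to the catalog."""
        return sorted(self.tables)

    def validate_fields(self, table: str, columns: list[str]) -> list[str]:
        """Validation: return the referenced columns the table does not have."""
        known = self.tables.get(table, {})
        return [c for c in columns if c not in known]

    def table_stats(self, table: str) -> dict[str, int]:
        """Statistics: minimal numbers a cost-based optimizer might ask for."""
        return {
            "row_count": self.row_counts.get(table, 0),
            "column_count": len(self.tables.get(table, {})),
        }


stub = CatalogStub(
    tables={"logs": {"timestamp": "date", "status": "integer"}},
    row_counts={"logs": 1200},
)
```

A query validator could call `stub.validate_fields("logs", ["status", "host"])` and reject the query if the returned list of unknown columns is not empty.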

references

Stage 2:

Once the external API is created with some level of confidence, the next phase will be to implement the internal storage catalog Ontology that will
allow representing these aspects with the ability to evolve.

We will take advantage of knowledge graph concepts to create a generic Ontology that reflects what the APIs have in common.
Essential entities that will be common to the Ontology:

Example proposed Entities

  • Dataset
    The DATASET entity represents collections of data, typically represented as Tables or Views, as Streams in a stream-processing environment, or as bundles of data found as Files or Folders in data lake systems.

  • Table (Index)
    The TABLE entity represents a collection of columns, typically forming a logical unit with business meaning or significance.

  • Dashboard
    The DASHBOARD entity represents a collection of Tables or Queries for visualization.

  • Role
    The ROLE entity represents a logical action that can be performed upon another asset (resource).

This list of entities is partial and is to be considered an example only.

See suggested GraphQL partial schema RFC 698
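
To make the example entities more concrete, here is a minimal sketch of them as Python dataclasses. The field names (`kind`, `columns`, `queries`, `action`, `resource`) are assumptions chosen for illustration; they are not taken from the GraphQL schema in RFC 698.

```python
from dataclasses import dataclass, field


@dataclass
class Dataset:
    name: str
    kind: str  # assumed values: "table", "view", "stream", "file", "folder"


@dataclass
class Table:
    name: str
    columns: dict[str, str] = field(default_factory=dict)  # column -> type


@dataclass
class Dashboard:
    name: str
    tables: list[str] = field(default_factory=list)   # names of visualized tables
    queries: list[str] = field(default_factory=list)  # query texts


@dataclass
class Role:
    name: str
    action: str    # the logical action, e.g. "read"
    resource: str  # name of the asset the action applies to


orders = Table(name="orders", columns={"id": "keyword", "total": "double"})
sales = Dashboard(name="sales", tables=[orders.name])
reader = Role(name="reader", action="read", resource=orders.name)
```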


Ontology

This Ontology will support versioning for backward and forward compatibility, and it will be maintained in a dedicated location.
Since we intend the data store holding the metadata registry and content to be OpenSearch, this topology must have an index generator
capability that allows schema creation in the underlying OpenSearch engine.

Currently, the only available option representing such a general purpose topology (and index generation) is YangDB's DSL.
Depending on which DSL we choose to generate the API, we may have to implement a DSL to DSL translator from the chosen API DSL to YangDB's Ontology.

The resulting artifacts will be as follows:

  • Ontology representation
  • API DSL to Ontology Converter (currently one direction only)
    • This tool will need to support multi-API conversion in the future
  • Ontology Index-Generator Support
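
The index-generator idea can be sketched as a function that turns one ontology entity into an OpenSearch create-index body. The ontology layout and the type names on the left side of `TYPE_MAP` are hypothetical and are not YangDB's actual DSL; only the output follows the standard OpenSearch mapping structure (`mappings`, `properties`, `_meta`).

```python
import json

# Hypothetical ontology fragment; this is NOT YangDB's actual DSL format.
ONTOLOGY = {
    "version": "1.0.0",
    "entities": {
        "table": {"name": "string", "created": "date", "column_count": "int"},
    },
}

# Assumed ontology type -> OpenSearch field type
TYPE_MAP = {"string": "keyword", "text": "text", "int": "integer", "date": "date"}


def generate_index(ontology: dict, entity: str) -> dict:
    """Build a create-index request body for one ontology entity."""
    properties = {
        prop: {"type": TYPE_MAP[otype]}
        for prop, otype in ontology["entities"][entity].items()
    }
    return {
        "mappings": {
            # Record the ontology version so later schema changes can be detected.
            "_meta": {"ontology_version": ontology["version"]},
            "properties": properties,
        }
    }


body = generate_index(ONTOLOGY, "table")
print(json.dumps(body, indent=2))
```

Storing the ontology version in the mapping's `_meta` section is one possible way to support the versioning requirement; the plan does not prescribe where the version is kept.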

Stage 3:

This stage will focus on the general purpose query language that allows asking cross-domain questions on the metadata catalog.
We will select a cross-domain, general purpose (graph) query language that lets us reflect the multiple aspects we need to collect in a single query.

Currently, the most likely language is openCypher, or GQL in its preliminary release.

In contrast with the specific API, this query is untyped and will return a collection of rows that may later be transformed into meaningful logical entities.
Supporting a general purpose graph language is a very large task, so we will take advantage of YangDB's existing query translators to convert the test query into
an OpenSearch-specific query.

This stage will be heavily dependent on the progress of the open-graph feature and will in fact be a joint effort.

The resulting artifacts will be as follows:

  • General purpose graph Query language that will be validated by the previously created metadata-catalog ontology.
  • Executing such a query against already existing catalog indices and returning data from these tables.
  • Performing cross-domain queries (joins) across the catalog indices
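
The kind of cross-domain join described above can be emulated in memory to show what the untyped result rows look like. The two indices, their fields, and the `USES` relationship in the commented openCypher query are all hypothetical; no real query translator is involved in this sketch.

```python
# Roughly the question being asked, written as a hypothetical openCypher query:
#   MATCH (d:Dashboard)-[:USES]->(t:Table)
#   RETURN d.name, t.name, t.row_count

# Two hypothetical catalog indices, each a list of untyped documents.
dashboards = [
    {"name": "sales", "tables": ["orders", "customers"]},
    {"name": "ops", "tables": ["logs"]},
]
tables = [
    {"name": "orders", "row_count": 500},
    {"name": "logs", "row_count": 1200},
]


def dashboards_with_tables(dashboards, tables):
    """Join dashboards to the tables they use and return untyped rows."""
    by_name = {t["name"]: t for t in tables}
    rows = []
    for d in dashboards:
        for table_name in d["tables"]:
            t = by_name.get(table_name)
            if t is not None:  # inner join: skip tables missing from the catalog
                rows.append((d["name"], t["name"], t["row_count"]))
    return rows


rows = dashboards_with_tables(dashboards, tables)
```

Each row is a plain tuple; a client that needs typed results would map the rows back onto logical entities in a separate step.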

YANG-DB added the enhancement (New feature or request) and untriaged labels on Jul 16, 2022
YANG-DB removed their assignment on Jul 19, 2022

@YANG-DB
Member Author

YANG-DB commented May 23, 2023

Use the existing project for these concepts: https://github.com/opensearch-project/opensearch-catalog

@YANG-DB YANG-DB closed this as completed May 23, 2023