Open-Catalog - Planning (metadata catalog)
General
The purpose of this document is to lay out a high-level plan for a metadata catalog capability, as specified in issue 685.
The work will be incremental, such that the external API is created first to fulfill the different requirements that arise from existing metadata needs, such as:
PPL connectivity metadata discovery API, described in 561
Query Cost based Optimizer statistics API, described in 612
Setup Stage
In this stage we need to create a new open-search project named open-knowledge.
Stage 1:
This part will include creating the external API that will be used by PPL and additional clients.
To be agile in defining the API spec (and allow simple evolution), and to share a common language, we will use a generic DSL.
Here we have a few alternatives:
Smithy DSL (https://github.com/awslabs/smithy) will allow us to decouple the specification from the actual implementation.
GraphQL DSL (https://graphql.org/learn/schema/) will also allow us to create a specification independent of the actual implementation and let it evolve according to requirements; see RFC 698.
YangDB's DSL
This proprietary DSL supports both a logical entity-relation topology and lower-level storage usability concerns.
If we select this option to represent the API, we will need to implement an API generation mechanism, since it currently has no API auto-generation capability.
This step will result in 3 (or more) DSL specifications for different needs:
validation API
statistics API
connectivity API
The resulting artifact will also be able to auto-generate a Java implementation of the client/server API spec, to function as a stub for external usage in system tests.
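To make the stub idea concrete, here is a minimal sketch of what such an auto-generated client/server API stub could look like. All interface and method names are hypothetical illustrations of the three planned API specs, not a committed contract:

```java
// Sketch of the kind of client/server stub a DSL code generator might emit
// for the three API specs (validation, statistics, connectivity).
// All names here are hypothetical, for illustration only.
import java.util.List;
import java.util.Map;

public class CatalogApiStub {

    /** Connectivity API: discovers which datasources a caller may query. */
    public interface ConnectivityApi {
        List<String> listDatasources();
        Map<String, String> describeDatasource(String name);
    }

    /** Statistics API: exposes table-level statistics for a cost-based optimizer. */
    public interface StatisticsApi {
        long rowCount(String table);
        double columnCardinality(String table, String column);
    }

    /** Validation API: checks that a field exists in the catalog's schema. */
    public interface ValidationApi {
        boolean fieldExists(String table, String field);
    }

    /** A trivial in-memory stub, the sort of artifact system tests could run against. */
    public static class InMemoryValidation implements ValidationApi {
        private final Map<String, List<String>> schema;

        public InMemoryValidation(Map<String, List<String>> schema) {
            this.schema = schema;
        }

        @Override
        public boolean fieldExists(String table, String field) {
            return schema.getOrDefault(table, List.of()).contains(field);
        }
    }
}
```

System tests would program against the interfaces only, so the generated stub can later be swapped for the real implementation without changing the tests.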
References
https://github.com/netflix/dgs-framework
https://graphql.org/learn/schema/
https://github.com/awslabs/smithy
https://github.com/YANG-DB/yang-db/blob/dev-opensearch/docs/info/components/ontology.md
https://linkedin.github.io/rest.li/pdl_schema
https://cwiki.apache.org/confluence/display/HCATALO/Design+Document+-+Java+APIs+for+HCatalog+DDL+Commands
https://calcite.apache.org/javadocAggregate/org/apache/calcite/schema/Statistics.html
https://calcite.apache.org/javadocAggregate/org/apache/calcite/schema/Table.html
Stage 2:
Once the external API is created with some level of confidence, the next phase will be to implement the internal storage catalog Ontology, which will allow representing these aspects with the ability to evolve.
We will take advantage of knowledge-graph concepts to create a generic Ontology that reflects the API's common entities.
Essential entities that will be common to the Ontology:
Example proposed Entities
Dataset
The DATASET entity represents collections of data that are typically represented as Tables or Views, Streams in a stream-processing environment, or bundles of data found as Files or Folders in data lake systems.
Table (Index)
The TABLE entity represents a collection of columns that is typically treated as a logical unit with business meaning or significance.
Dashboard
The DASHBOARD entity represents a collection of Tables or Queries for visualization.
Role
The ROLE entity represents a logical action that can be performed upon another asset (resource).
This entity list is partial and is to be considered an example only.
See suggested GraphQL partial schema RFC 698
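As a complement to the GraphQL partial schema, the entities above can be sketched as simple ontology nodes carrying a type label and relations to other assets. The class and factory names below are assumptions for illustration, not the final model:

```java
// A minimal sketch (assumed names) of the example entities modeled as
// ontology nodes, each with a type label and relations to other assets.
import java.util.ArrayList;
import java.util.List;

public class OntologyEntities {

    /** Every catalog asset shares a type label, a unique name, and relations. */
    public static class Entity {
        public final String type;   // DATASET, TABLE, DASHBOARD, ROLE, ...
        public final String name;
        public final List<Entity> relations = new ArrayList<>();

        public Entity(String type, String name) {
            this.type = type;
            this.name = name;
        }

        public Entity relateTo(Entity other) {
            relations.add(other);
            return this;
        }
    }

    public static Entity dataset(String name)   { return new Entity("DATASET", name); }
    public static Entity table(String name)     { return new Entity("TABLE", name); }
    public static Entity dashboard(String name) { return new Entity("DASHBOARD", name); }
    public static Entity role(String name)      { return new Entity("ROLE", name); }
}
```

For example, a DASHBOARD relating to a TABLE would simply be `dashboard("ops").relateTo(table("logs"))`; new entity kinds are added without touching the shared node shape, which is what lets the Ontology evolve.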
Ontology
This Ontology will support versioning for backward and forward compatibility, and it will be maintained in a dedicated location.
Since we intend open-search to be the data store holding the metadata registry and content, this topology must have an index-generator capability that allows schema creation in the underlying open-search engine.
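The index-generator step can be sketched as deriving an open-search index mapping from an entity's typed properties. The property-type-to-field-type rules below are illustrative assumptions, not the real generator:

```java
// A sketch of the index-generator idea: derive an open-search index mapping
// body from an ontology entity's typed properties. The type-mapping rules
// here are illustrative assumptions only.
import java.util.Map;
import java.util.TreeMap;

public class IndexGenerator {

    /** Map an assumed ontology property type to an open-search field type. */
    static String fieldType(String ontologyType) {
        switch (ontologyType) {
            case "string": return "keyword";
            case "text":   return "text";
            case "int":    return "long";
            case "date":   return "date";
            default:       return "keyword";
        }
    }

    /** Render a minimal mappings body for one entity's properties. */
    public static String mappingFor(Map<String, String> properties) {
        StringBuilder sb = new StringBuilder("{\"mappings\":{\"properties\":{");
        boolean first = true;
        // TreeMap gives deterministic field order in the generated mapping.
        for (Map.Entry<String, String> p : new TreeMap<>(properties).entrySet()) {
            if (!first) sb.append(',');
            sb.append('"').append(p.getKey()).append("\":{\"type\":\"")
              .append(fieldType(p.getValue())).append("\"}");
            first = false;
        }
        return sb.append("}}}").toString();
    }
}
```

The generated body is what would be sent when creating the backing index for an entity, so schema creation stays driven by the Ontology rather than hand-written mappings.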
Currently, the only available option for representing such a general-purpose topology (with index generation) is YangDB's DSL.
Depending on which DSL we choose to generate the API, we may have to implement a DSL-to-DSL translator from the chosen API DSL to YangDB's Ontology.
The resulting artifacts will be as follows:
Ontology representation
API DSL to Ontology converter (currently one direction only)
This tool will need to support multi-API conversion in the future
Ontology index-generator support
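The one-directional DSL-to-DSL conversion can be sketched as rewriting a hypothetical API type description into an ontology entity description. Both the source and target shapes below are assumptions for illustration:

```java
// Sketch of the one-directional API-DSL-to-Ontology converter: a hypothetical
// API type (name plus field types) is rewritten into an ontology entity
// description. The source and target shapes are assumptions only.
import java.util.LinkedHashMap;
import java.util.Map;

public class DslToOntologyConverter {

    /** Convert one API DSL type (field name -> field type) into ontology form. */
    public static Map<String, Object> convert(String typeName, Map<String, String> fields) {
        Map<String, Object> entity = new LinkedHashMap<>();
        entity.put("eType", typeName.toUpperCase());      // ontology entity label
        Map<String, String> props = new LinkedHashMap<>();
        // One ontology property per API field, with normalized type names.
        fields.forEach((field, type) -> props.put(field, type.toLowerCase()));
        entity.put("properties", props);
        return entity;
    }
}
```

Supporting multiple API DSLs later would mean one such converter per source DSL, all targeting the same ontology shape.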
Stage 3:
This stage will focus on a general-purpose query language that allows asking cross-domain questions of the metadata catalog.
We will select a cross-domain, general-purpose (graph) query language that lets us reflect the multiple aspects we need to collect in a single query.
Currently, the most likely candidates are open-cypher, or GQL with its preliminary release.
In contrast with the specific API, such a query is untyped and returns a collection of rows that may later be transformed into meaningful logical entities.
Supporting a general-purpose graph language is a very large task, so we will take advantage of YangDB's existing query translators to convert a given query into an open-search specific query.
This stage will depend heavily on the progress of the open-graph feature and will in fact be a joint effort.
The resulting artifacts will be as follows:
A general-purpose graph query language, validated by the previously created metadata-catalog ontology.
Executing such queries against already existing catalog indices and returning data from these tables.
Performing cross-domain queries (joins) across the catalog indices.
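The query-translation direction can be illustrated with a toy example: a single node pattern (label plus one property filter) rewritten into an open-search style bool/filter query. Real open-cypher translation, as done by YangDB's translators, is far more involved; the `eType` field name below is an assumption:

```java
// A toy sketch of the query-translation step: a single node pattern,
// e.g. MATCH (n:TABLE {name: "logs"}), rewritten into an open-search
// style bool/filter query body. Field names are assumptions.
public class GraphQueryTranslator {

    /** Translate one labeled node pattern with a property filter. */
    public static String translate(String label, String prop, String value) {
        return "{\"query\":{\"bool\":{\"filter\":["
             + "{\"term\":{\"eType\":\"" + label + "\"}},"
             + "{\"term\":{\"" + prop + "\":\"" + value + "\"}}"
             + "]}}}";
    }
}
```

Cross-domain questions would expand this pattern to multiple labels and relation hops, which is where joining across the catalog indices comes in.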