The Schema catalog for OpenSearch introduces the concept of organized, structured catalog data. A catalog of schemas is a collection of the data schemas or structures that can be used to represent information.
It provides a standardized way of organizing and describing the structure of data, making it easier to analyze, compare, and share data across different systems and applications. Structured data refers to any data that is organized in a specific format or schema, for example observability data or security data.
By using a catalog of schemas, data analysts and scientists can easily identify and understand the different structures of data, allowing them to correlate and analyze information more effectively. One of the key benefits of a catalog of schemas is that it promotes interoperability between different systems and applications. By using standardized schema descriptions, data can be shared and exchanged more easily, regardless of the system or application being used.
The use of a catalog of schemas can also improve data quality by ensuring that data is consistent and accurate. This is because the schema provides a clear definition of the data structure and the rules for how data should be entered, validated, and stored.
OpenSearch supports the following schemas out of the box:
Simple Schema for Observability allows ingestion of both OTEL and ECS formats and internally consolidates them, to the best of its capabilities, to present a unified Observability platform.
OpenSearch Security is a plugin for OpenSearch that offers encryption, authentication, and authorization. When combined with OpenSearch Security-Advanced Modules, it supports authentication via Active Directory, LDAP, Kerberos, JSON Web Tokens, SAML, OpenID, and more. It includes fine-grained role-based access control to indices, documents, and fields. It also provides multi-tenancy support in OpenSearch Dashboards.
For a catalog schema to support dashboards, queries, and alerts, the catalog needs to generate the appropriate templates representing them.
This allows any type of structured asset to use these catalogs without defining them explicitly, thus maintaining a unified common schema.
Each catalog may support semantic versioning so that it may evolve its schema as needed.
In the future, the catalog will make it possible to associate domains with catalogs and to import external catalogs into OpenSearch for additional collaboration.
A catalog is structured in the following way:
- Catalog named folder: Observability
  - Categories named folder: Logs, Traces, Metrics
    - Component named file: http, communication, traces, metrics
Each level encapsulates additional internal structure that allows a greater level of composability and agility. The details of each catalog structure are described in the catalog.json file that resides at the root level of each catalog folder.
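The hierarchy above can be sketched as a nested mapping from catalog to category to component. This is only an illustration: the assignment of each component to a category and the helper name component_paths are assumptions, and the authoritative layout is whatever catalog.json declares.

```python
from pathlib import PurePosixPath

# Hypothetical in-memory view of the hierarchy: catalog -> category -> components.
# Which component sits under which category is assumed for illustration.
CATALOG = {
    "observability": {
        "logs": ["http", "communication"],
        "traces": ["traces"],
        "metrics": ["metrics"],
    }
}


def component_paths(catalog):
    """Return one relative path per component, as catalog/category/component."""
    paths = []
    for catalog_name, categories in catalog.items():
        for category, components in categories.items():
            for component in components:
                paths.append(PurePosixPath(catalog_name, category, component))
    return paths


for path in component_paths(CATALOG):
    print(path)
```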
Component
The component is the leaf-level definition of the catalog hierarchy; it details the actual building blocks of the catalog's types and fields.
Each component has two flavors:
- $component.mapping: describes how the type is physically stored in the underlying index
- $component.schema: describes the actual JSON schema for this component type
A component may be classified as a container, which can group or combine multiple components.
For example, the logs component can combine additional components (such as http, communication, and more):
...
"composed_of": [
"http_template",
"communication_template"
],
...
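To make the container idea concrete, here is a minimal sketch that expands a container into the full list of templates it is composed of. The registry, the logs_template name, and the expand helper are hypothetical; only the composed_of key and the two child template names come from the snippet above.

```python
# Hypothetical template registry; "logs_template" is an assumed name.
TEMPLATES = {
    "logs_template": {"composed_of": ["http_template", "communication_template"]},
    "http_template": {},
    "communication_template": {},
}


def expand(name, templates, path=()):
    """Return `name` followed by everything it is composed of, depth-first."""
    if name in path:
        # A template that (indirectly) contains itself cannot be expanded.
        raise ValueError(f"cyclic composition at {name!r}")
    result = [name]
    for child in templates[name].get("composed_of", []):
        result.extend(expand(child, templates, path + (name,)))
    return result


print(expand("logs_template", TEMPLATES))
```

A non-container component simply expands to itself, so callers can treat containers and leaf components the same way.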
A component also has a list of tags, which are aliases for the component name and can be used to reference it directly from an integration's components list.
...
{
"component": "communication",
"version": "1.0",
"url": "https://github.com/opensearch-project/opensearch-catalog/tree/docs/schema/observability/logs/communication",
"tags": ["web"],
"container": false
}
...
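One way to read the tags field is as an alias lookup. The sketch below resolves a reference by component name or by tag; the find_component helper and the second (http) entry are assumptions added for illustration, not part of the catalog.

```python
# The first entry mirrors the example above (URL omitted for brevity);
# the second entry is hypothetical.
COMPONENTS = [
    {"component": "communication", "version": "1.0", "tags": ["web"], "container": False},
    {"component": "http", "version": "1.0", "tags": [], "container": False},
]


def find_component(reference, components):
    """Return the first entry whose name or one of whose tags equals `reference`."""
    for entry in components:
        if reference == entry["component"] or reference in entry.get("tags", []):
            return entry
    return None


print(find_component("web", COMPONENTS)["component"])
```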
To correlate information across different signals (represented in different indices), we introduced the notion of correlation into the schema. This information is represented explicitly in both the declarative schema file and the physical mapping file.
It enables this knowledge to be projected and allows an analytic engine to produce a join query that takes advantage of these relationships. The correlation metadata is exported in the following way:
In JSON Schema, there is no built-in way to represent relationships directly between multiple schemas, like you would find in a relational database. However, you can establish relationships indirectly by using a combination of $id, $ref, and consistent property naming across your schemas.
For example, the logs.schema file contains the following $ref references for the traceId and spanId fields, which belong to the traces.schema:
...
"traceId": {
"$ref": "https://opensearch.org/schemas/observability/Span#/properties/traceId"
},
"spanId": {
"$ref": "https://opensearch.org/schemas/observability/Span#/properties/spanId"
},
...
We can observe that the traceId field is defined by referencing the Span schema, and specifically the #/properties/traceId location within it.
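As a rough illustration of how such a reference could be read programmatically, the sketch below splits a $ref value into the schema URL and the referenced property name. The split_ref helper is hypothetical and only handles fragments of the form /properties/<name>, as in the example above.

```python
from urllib.parse import urldefrag


def split_ref(ref):
    """Split a $ref into (schema URL, property name)."""
    url, fragment = urldefrag(ref)
    prefix = "/properties/"
    if not fragment.startswith(prefix):
        raise ValueError(f"unsupported $ref fragment: {fragment!r}")
    return url, fragment[len(prefix):]


schema_url, field = split_ref(
    "https://opensearch.org/schemas/observability/Span#/properties/traceId"
)
print(schema_url, field)
```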
Each mapping template contains the foreign schemas that are referenced in that specific mapping file. For example, the logs.mapping file contains the following correlation object in the mapping _meta section:
"_meta": {
"description": "Simple Schema For Observability",
"catalog": "observability",
"type": "logs",
"correlations": [
{
"field": "spanId",
"foreign-schema": "traces",
"foreign-field": "spanId"
},
{
"field": "traceId",
"foreign-schema": "traces",
"foreign-field": "traceId"
}
]
}
Each entry in correlations contains the foreign-key field name (spanId), the referenced schema (traces), and the source field name in that schema (spanId).
This information can be used to generate the correct join queries on a contextual basis.
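A minimal sketch of how the correlations metadata might drive query generation is shown below. It renders SQL-style equality conditions grouped by foreign schema; the join_conditions helper and the output format are assumptions for illustration and do not represent the syntax of any particular OpenSearch query engine.

```python
# Subset of the _meta section shown above.
META = {
    "catalog": "observability",
    "type": "logs",
    "correlations": [
        {"field": "spanId", "foreign-schema": "traces", "foreign-field": "spanId"},
        {"field": "traceId", "foreign-schema": "traces", "foreign-field": "traceId"},
    ],
}


def join_conditions(meta):
    """Map each foreign schema to the equality conditions that join it to this type."""
    local = meta["type"]
    grouped = {}
    for c in meta["correlations"]:
        condition = f'{local}.{c["field"]} = {c["foreign-schema"]}.{c["foreign-field"]}'
        grouped.setdefault(c["foreign-schema"], []).append(condition)
    return {schema: " AND ".join(conds) for schema, conds in grouped.items()}


print(join_conditions(META))
```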