[RFC] OpenSearch Data Format #8639

penghuo · 2023-07-11T18:50:24Z

User Pain Points

Not an open-format dataset: Currently, all data is indexed inside the OpenSearch service, so users must rely on the service to access and retrieve the dataset.
No segregation of read and write operations: a single cluster serves both indexing traffic and query traffic.
Complex data management workflow: users usually need to set up a pipeline to (1) ingest data into S3, (2) ingest it into OpenSearch, and (3) manage the index lifecycle with customized rules.

The index data itself is encoded in Lucene format, one Lucene index per shard.
The metadata is kept in the object store. It also includes a skipping index, such as min/max values for each shard.
No server needs to be running to maintain an OpenSearch index. Transactions are achieved using an optimistic concurrency protocol.
Users only need to launch servers when they run queries, and they benefit from scaling compute and storage separately.
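
The RFC does not spell out the optimistic concurrency protocol, so the following is only a minimal sketch of one common approach: a metadata log in which each commit is a numbered file and a writer wins a version only if it creates that file first. The function names (`commit`, `latest_version`, `commit_with_retry`), the log layout, and the use of a local directory in place of an object store are all assumptions made for illustration. A real object store would need an equivalent atomic "create if absent" operation.

```python
import json
import os


def commit(log_dir: str, expected_version: int, shard_entry: dict) -> bool:
    """Try to publish metadata version expected_version + 1.

    Returns False if another writer already claimed that version.
    """
    # Hypothetical layout: one zero-padded JSON file per committed version.
    path = os.path.join(log_dir, f"{expected_version + 1:010d}.json")
    try:
        # Mode "x" is an exclusive create: it fails if the file already exists.
        with open(path, "x", encoding="utf-8") as f:
            json.dump(shard_entry, f)
        return True
    except FileExistsError:
        return False


def latest_version(log_dir: str) -> int:
    """Return the highest committed version, or 0 for an empty log."""
    versions = [
        int(name.split(".")[0])
        for name in os.listdir(log_dir)
        if name.endswith(".json")
    ]
    return max(versions, default=0)


def commit_with_retry(log_dir: str, shard_entry: dict, max_attempts: int = 5) -> int:
    """Re-read the latest version and retry when a concurrent writer wins."""
    for _ in range(max_attempts):
        version = latest_version(log_dir)
        if commit(log_dir, version, shard_entry):
            return version + 1
    raise RuntimeError("too many concurrent writers")
```

In this sketch no server coordinates the writers: a writer that loses the race simply observes the newer version and retries on top of it.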

penghuo · 2023-07-11T18:52:51Z

Demo
Streaming application
- Write 1 shard to fs every 5s.
- Write a skipping index for each shard.
Reuse searchable snapshot interface
- Restore unassigned shards from fs every 10s.
Rewrite DSL query with skipping index
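
To make the last step concrete, here is a rough sketch of how per-shard min/max statistics could be used to decide which shards a range query needs to load. It is not the demo's actual code: the `shards_to_load` function and the shape of the `skipping_index` dictionary are hypothetical, and only the standard `range` query clauses (`gte`, `gt`, `lte`, `lt`) are handled.

```python
def shards_to_load(query: dict, skipping_index: dict) -> list:
    """Return the shards whose [min, max] range may contain matches.

    query: a DSL range query, e.g. {"range": {"timestamp": {"gte": 150}}}
    skipping_index: hypothetical layout {shard: {field: {"min": x, "max": y}}}
    """
    field, bounds = next(iter(query["range"].items()))
    # Treat gt/lt like gte/lte: this is conservative, so a shard may be
    # loaded unnecessarily but is never skipped incorrectly.
    lo = bounds.get("gte", bounds.get("gt", float("-inf")))
    hi = bounds.get("lte", bounds.get("lt", float("inf")))

    selected = []
    for shard, stats in skipping_index.items():
        field_stats = stats.get(field)
        if field_stats is None:
            # No statistics for this field: the shard cannot be skipped safely.
            selected.append(shard)
        elif field_stats["max"] >= lo and field_stats["min"] <= hi:
            # The shard's value range overlaps the query range.
            selected.append(shard)
    return selected


skipping_index = {
    "shard-0": {"timestamp": {"min": 0, "max": 99}},
    "shard-1": {"timestamp": {"min": 100, "max": 199}},
    "shard-2": {"timestamp": {"min": 200, "max": 299}},
}
query = {"range": {"timestamp": {"gte": 150, "lte": 250}}}
print(shards_to_load(query, skipping_index))  # ['shard-1', 'shard-2']
```

Here shard-0 is never restored because its maximum timestamp is below the lower bound of the query.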

dai-chen · 2023-07-11T20:33:44Z

Here is the demo video that covers the following topics:

OpenSearch Data Format, proposed in this issue, which removes the hard dependency on an OpenSearch cluster and separates the read and write paths
Virtual / External Index, which makes datasets on an object store accessible to OpenSearch. Please find more details in [FEATURE] Materialized views (aka virtual indexes) on object stores sql#1080
Skipping Index, which avoids unnecessary shard loads and scans. Please find more details in Add data skipping index support opensearch-spark#2

OpenSearch-data-format-demo.mov

schenksj · 2023-07-17T19:48:35Z

This is very cool! Has any progress been made on the Spark SQL execution data sources side?

penghuo added the enhancement and untriaged labels Jul 11, 2023

penghuo mentioned this issue Jul 11, 2023

OpenSearch on Spark (without an OpenSearch cluster) - has this been contemplated? #8566

Open

penghuo changed the title ~~[RFC] OpenSearch Storage Format~~ [RFC] OpenSearch Data Format Jul 11, 2023

minalsha removed the untriaged label Jul 12, 2023

andrross added the Roadmap:Search label May 31, 2024

getsaurabh02 added this to OpenSearch Roadmap May 31, 2024

github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024