[RFC] OpenSearch Data Format #8639

Open
penghuo opened this issue Jul 11, 2023 · 3 comments
Labels
enhancement (Enhancement or improvement to existing feature or request), Roadmap:Search (Project-wide roadmap label)

Comments

@penghuo
Contributor

penghuo commented Jul 11, 2023

User Pain point

  • Not an open-format dataset: currently, all indexed data lives inside the OpenSearch service, and users must go through the service to access and retrieve it.
  • No separation of read and write operations: a single cluster serves both indexing traffic and query traffic.
  • Complex data management workflow: users usually need to set up a pipeline to (1) ingest data into S3, (2) ingest it into OpenSearch, and (3) manage the index lifecycle with customized rules.

Proposed Solution

  • Each shard's index is encoded in the Lucene format.
  • The metadata is kept in the object store. It also includes a skipping index, such as min/max values for each shard.
  • No server needs to be running to maintain an OpenSearch index. Transactions are achieved using an optimistic concurrency protocol.
  • Users only need to launch servers when running queries, and benefit from scaling compute and storage separately.
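
To make the per-shard min/max idea concrete, here is a minimal sketch of what a skipping-index entry could look like on the writer side. The `ShardStats` record, the `build_stats` and `may_contain` helpers, and the field name `ts` are all hypothetical; the RFC does not define a metadata layout.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass(frozen=True)
class ShardStats:
    """Hypothetical skipping-index entry: min/max of one field in one shard."""
    shard_id: str
    field: str
    min_value: float
    max_value: float


def build_stats(shard_id: str, field: str, docs: Iterable[dict]) -> ShardStats:
    """Compute the min/max of `field` over the documents written to a shard."""
    values = [doc[field] for doc in docs if field in doc]
    if not values:
        raise ValueError(f"no values for field {field!r} in shard {shard_id!r}")
    return ShardStats(shard_id, field, min(values), max(values))


def may_contain(stats: ShardStats, lower: float, upper: float) -> bool:
    """True if the shard's [min, max] overlaps the inclusive range [lower, upper]."""
    return stats.min_value <= upper and stats.max_value >= lower


# Two shards with disjoint timestamp ranges (made-up data).
stats = [
    build_stats("shard-0", "ts", [{"ts": 10}, {"ts": 20}]),
    build_stats("shard-1", "ts", [{"ts": 30}, {"ts": 40}]),
]
candidates = [s.shard_id for s in stats if may_contain(s, 25, 35)]
```

A reader that only needs `ts` between 25 and 35 would load `shard-1` and skip `shard-0` without opening its Lucene files.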


Technical Challenge

  • OpenSearch Data Format on Object Store
    • OpenSearch Data Format structure on object store
    • Metadata specification. A transaction log records actions, which include (1) Add/Remove and (2) Metadata Change
    • Access Protocols
      • Optimistic concurrency protocols
      • Serializable isolation
  • Implement the OpenSearch Data Format Writer/Reader as a library
  • Implement a Virtual Index in OpenSearch that is attached to the OpenSearch Data Format
  • Implement DSL query rewrite logic with skipping index.
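
One way the transaction log and the optimistic concurrency protocol could fit together is sketched below. This is an illustration under assumptions, not the RFC's specification: the log is a directory of numbered JSON files, each holding a list of actions with a hypothetical `op` of `add`, `remove`, or `metadata`, and a local exclusive file create stands in for an object store's put-if-absent operation.

```python
import json
import os
import tempfile


class CommitConflict(Exception):
    """Raised when another writer has already committed the same version."""


def latest_version(log_dir: str) -> int:
    """Return the highest committed version, or -1 for an empty log."""
    versions = [int(n.split(".")[0]) for n in os.listdir(log_dir) if n.endswith(".json")]
    return max(versions, default=-1)


def commit(log_dir: str, read_version: int, actions: list) -> int:
    """Try to commit `actions` as version read_version + 1.

    Mode "x" fails if the file already exists; this stands in for a
    put-if-absent primitive on the object store (an assumption of this sketch).
    """
    version = read_version + 1
    path = os.path.join(log_dir, f"{version:020d}.json")
    try:
        with open(path, "x") as f:
            json.dump(actions, f)
    except FileExistsError:
        raise CommitConflict(f"version {version} already committed") from None
    return version


def replay(log_dir: str) -> dict:
    """Rebuild state from the log: the live shard set plus the latest metadata."""
    shards, metadata = set(), {}
    for name in sorted(os.listdir(log_dir)):
        if not name.endswith(".json"):
            continue
        with open(os.path.join(log_dir, name)) as f:
            for action in json.load(f):
                if action["op"] == "add":
                    shards.add(action["shard"])
                elif action["op"] == "remove":
                    shards.discard(action["shard"])
                elif action["op"] == "metadata":
                    metadata.update(action["changes"])
    return {"shards": shards, "metadata": metadata}
```

A writer reads the latest version, prepares its actions, and calls `commit`. If it gets `CommitConflict`, it would re-read the log, check whether its actions are still valid against the newer state, and retry. Because every commit lands at exactly one version number, replaying the log in order yields a single serial history.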
@penghuo penghuo added enhancement Enhancement or improvement to existing feature or request untriaged labels Jul 11, 2023
@penghuo
Contributor Author

penghuo commented Jul 11, 2023

Demo setup

  • Demo Streaming application
    • Write 1 shard to fs every 5s
    • Write skipping index for each shard.
  • Reuse searchable snapshot interface
    • Restore unassigned shards from fs every 10s.
  • Rewrite the DSL query with the skipping index
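
The rewrite step can be pictured roughly as follows. This is a sketch under assumptions, not the demo's code: it only understands a top-level `range` query, the skipping index is a plain dict mapping shard id to per-field `(min, max)` tuples, and `range_bounds` and `select_shards` are hypothetical names.

```python
import math


def range_bounds(query: dict):
    """Return (field, lower, upper) for a top-level `range` query, else None."""
    clause = query.get("query", {}).get("range")
    if not clause or len(clause) != 1:
        return None
    field, cond = next(iter(clause.items()))
    # gt/lt are treated like gte/lte: conservative, may keep an extra shard.
    lower = cond.get("gte", cond.get("gt", -math.inf))
    upper = cond.get("lte", cond.get("lt", math.inf))
    return field, lower, upper


def select_shards(query: dict, skipping_index: dict) -> list:
    """Pick shards whose min/max may match; keep every shard when unsure."""
    bounds = range_bounds(query)
    if bounds is None:
        return sorted(skipping_index)  # unsupported query shape: skip nothing
    field, lower, upper = bounds
    selected = []
    for shard_id, fields in skipping_index.items():
        if field not in fields:
            selected.append(shard_id)  # no statistics: cannot skip safely
            continue
        lo, hi = fields[field]
        if lo <= upper and hi >= lower:
            selected.append(shard_id)
    return sorted(selected)


# Made-up skipping index: shard-2 has no statistics for `ts`.
skipping_index = {
    "shard-0": {"ts": (10, 20)},
    "shard-1": {"ts": (30, 40)},
    "shard-2": {},
}
dsl = {"query": {"range": {"ts": {"gte": 25, "lte": 35}}}}
shards_to_load = select_shards(dsl, skipping_index)
```

Here `shard-0` is skipped because its range cannot match, while `shard-2` is kept because nothing is known about it. Only the selected shards would then need to be restored and searched.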


@dai-chen

dai-chen commented Jul 11, 2023

Here is the demo video, which covers the following topics:

  1. The OpenSearch Data Format proposed in this issue, which removes the hard dependency on an OpenSearch cluster and separates the read and write paths
  2. Virtual / External Index, which makes datasets on an object store accessible to OpenSearch. Please find more details in [FEATURE] Materialized views (aka virtual indexes) on object stores sql#1080
  3. Skipping Index, which avoids unnecessary shard loads and scans. Please find more details in Add data skipping index support opensearch-spark#2
OpenSearch-data-format-demo.mov

@penghuo penghuo changed the title [RFC] OpenSearch Storage Format [RFC] OpenSearch Data Format Jul 11, 2023
@schenksj

This is very cool! Has any progress been made on the Spark SQL execution data sources side?

@andrross andrross added the Roadmap:Search Project-wide roadmap label label May 31, 2024
@github-project-automation github-project-automation bot moved this to Planned work items in OpenSearch Roadmap May 31, 2024