[RFC] OpenSearch and Apache Spark Integration #1875

Open
penghuo opened this issue Jul 17, 2023 · 0 comments
Labels
enhancement New feature or request

Comments

Collaborator

penghuo commented Jul 17, 2023

Introduction

We received a feature request for query execution on object stores in OpenSearch.

We investigated the possibility of building a new solution for OpenSearch that uses an object store as storage.

The challenges we found are:

  • The OpenSearch aggregation framework is a simplified MPP framework and does not support a shuffle stage.
  • The OpenSearch query framework lacks support for key features, e.g. JOIN and subqueries.

These problems have already been solved by general-purpose data processing systems, e.g. Presto, Spark, and Trino. Building such a platform from scratch would take years to mature.
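To make the missing shuffle stage concrete, here is a minimal single-process sketch of a hash-partitioned join. It is illustrative only: it is not OpenSearch or Spark code, and the function names `shuffle` and `shuffled_join` are hypothetical. The point is that a join across two datasets generally needs rows with the same key to be redistributed to the same worker first, which is the step the aggregation framework does not provide.

```python
from collections import defaultdict


def shuffle(rows, key, num_partitions):
    """Redistribute rows so that equal keys land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def shuffled_join(left, right, left_key, right_key, num_partitions=4):
    """Inner-join two lists of dicts by hash-partitioning both on the join key."""
    left_parts = shuffle(left, left_key, num_partitions)
    right_parts = shuffle(right, right_key, num_partitions)
    joined = []
    # After the shuffle, matching keys are co-located, so each partition pair
    # can be joined independently (on separate workers in a real engine).
    for lpart, rpart in zip(left_parts, right_parts):
        index = defaultdict(list)
        for r in rpart:
            index[r[right_key]].append(r)
        for l in lpart:
            for r in index.get(l[left_key], []):
                joined.append({**l, **r})
    return joined


if __name__ == "__main__":
    logs = [{"uid": 1, "msg": "a"}, {"uid": 2, "msg": "b"}, {"uid": 3, "msg": "c"}]
    users = [{"id": 1, "name": "x"}, {"id": 2, "name": "y"}]
    for row in shuffled_join(logs, users, "uid", "id"):
        print(row)
```

Without the `shuffle` step, each node could only join the rows it happens to hold locally, which is why an engine that already implements distributed shuffles is attractive here.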

Idea

The initial idea is to:

  1. Use SQL as the interface.
  2. Leverage Spark as the query/compute execution engine.

High level diagram:

(Diagram image: Screenshot 2023-06-16 at 8 21 37 AM)

User Experience

  1. The user configures a Spark cluster as the computation resource, e.g. https://SPARK:7707.
  2. The user submits SQL to the OpenSearch cluster using the _plugins/_sql REST API.
    1. The SQL engine parses and analyzes the SQL query.
    2. The SQL engine decides whether to route the query to the Spark cluster or run it locally.
  3. In phase 1, we provide an interface that lets the user create a derived dataset from data on the object store and store it in OpenSearch. Queries are then optimized automatically against the derived dataset at query time.
  4. In phase 2, we provide an opt-in optimization for the user: derived datasets are created automatically based on query patterns.
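Step 2 of the flow above can be sketched from the client and engine sides as follows. The request shape (a POST to `_plugins/_sql` with a JSON `query` field) follows the existing SQL plugin API. The routing rule is an assumption made for illustration: the RFC does not say how the engine decides, so `choose_engine` simply sends queries that contain a JOIN or a subquery to Spark, mirroring the missing features listed in the introduction. The host `http://localhost:9200` and the function names are likewise hypothetical.

```python
import json
import re
import urllib.request

# Hypothetical rule, not taken from the RFC: queries that use JOIN or a
# subquery go to Spark; everything else runs locally in OpenSearch.
_NEEDS_SPARK = re.compile(r"\bJOIN\b|\(\s*SELECT\b", re.IGNORECASE)


def choose_engine(sql: str) -> str:
    """Return "spark" or "local" for a SQL string (illustrative heuristic)."""
    return "spark" if _NEEDS_SPARK.search(sql) else "local"


def build_sql_request(base_url: str, sql: str) -> urllib.request.Request:
    """Build (but do not send) a POST request to the _plugins/_sql endpoint."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/_plugins/_sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    sql = "SELECT a.msg FROM logs a JOIN users b ON a.uid = b.id"
    request = build_sql_request("http://localhost:9200", sql)
    print(request.full_url, choose_engine(sql))
```

A real implementation would presumably make this decision on the analyzed query plan inside the SQL engine rather than on the raw string; the regular expression here only stands in for that step.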