[RFC] OpenSearch and Apache Spark Integration #1875

penghuo · 2023-07-17T18:17:12Z

Introduction

We received a feature request for query execution on object stores in OpenSearch.

[FEATURE] Materialized views (aka virtual indexes) on object stores #1080

We investigated building a new solution in which OpenSearch uses an object store as storage.

The challenges we found are:

The OpenSearch aggregation framework is a simplified MPP framework and does not support a shuffle stage.
The OpenSearch query framework is missing key features, e.g. JOIN and subqueries.

We found that these problems have already been solved by general-purpose data preprocessing systems, e.g. Presto, Spark, and Trino, and that building such a platform takes years to mature.
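
To make the missing shuffle stage concrete, here is a minimal single-process sketch of what a shuffle does for a JOIN: rows from both inputs are hash-partitioned on the join key so that matching keys land in the same partition, and each partition is then joined independently. The function names (`shuffle`, `shuffle_join`) and the sample field names are hypothetical and only illustrate the general technique; this is not how Spark or OpenSearch implement it.

```python
from collections import defaultdict


def shuffle(rows, key, num_partitions):
    """Hash-partition rows so that equal join keys land in the same partition."""
    partitions = [[] for _ in range(num_partitions)]
    for row in rows:
        partitions[hash(row[key]) % num_partitions].append(row)
    return partitions


def shuffle_join(left, right, key, num_partitions=4):
    """Inner join: shuffle both sides on the key, then join partition by partition."""
    left_parts = shuffle(left, key, num_partitions)
    right_parts = shuffle(right, key, num_partitions)
    joined = []
    for left_part, right_part in zip(left_parts, right_parts):
        # Build a hash index over the right-hand partition only.
        index = defaultdict(list)
        for row in right_part:
            index[row[key]].append(row)
        # Probe the index with each left-hand row.
        for row in left_part:
            for match in index.get(row[key], []):
                joined.append({**row, **match})
    return joined


# Hypothetical sample data.
orders = [
    {"user_id": 1, "total": 10},
    {"user_id": 2, "total": 5},
    {"user_id": 1, "total": 7},
]
users = [{"user_id": 1, "name": "a"}, {"user_id": 3, "name": "c"}]
result = shuffle_join(orders, users, "user_id")
```

In a distributed engine each partition would live on a different worker, and the shuffle is the network exchange that moves rows to the worker owning their key. Without that exchange, each node can only combine the rows it already holds.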

Idea

The initial idea is:

Use SQL as the interface.
Leverage Spark as the query/compute execution engine.

High level diagram:

User Experience

The user configures a Spark cluster as the computation resource, e.g. https://SPARK:7707.
The user submits SQL to the OpenSearch cluster using the _plugins/_sql REST API.
1. The SQL engine parses and analyzes the SQL query.
2. The SQL engine decides whether to route the query to the Spark cluster or run it locally.
In phase 1, we provide an interface that lets the user create a derived dataset from data on the object store and store it in OpenSearch. Queries are then optimized automatically at query time based on the derived dataset.
In phase 2, we provide an opt-in optimization choice for the user: derived datasets are created automatically based on query patterns.
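
The flow above can be sketched roughly as follows. `build_sql_request` prepares a POST to the `_plugins/_sql` endpoint with the query in a JSON body, and `choose_engine` is a purely hypothetical stand-in for the routing decision. The host `http://localhost:9200`, the table name `s3.logs`, and the rule "route to Spark when the query references an object-store table" are assumptions made for illustration. The RFC does not specify the routing criteria, and a real engine would decide from the analyzed query plan rather than a regular expression.

```python
import json
import re
from urllib.request import Request


def build_sql_request(opensearch_url, sql):
    """Build (but do not send) a POST request to the _plugins/_sql REST API."""
    body = json.dumps({"query": sql}).encode("utf-8")
    return Request(
        url=f"{opensearch_url.rstrip('/')}/_plugins/_sql",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def choose_engine(sql, object_store_tables):
    """Hypothetical routing rule: use Spark if any referenced table is on the object store."""
    # Toy extraction of table names following FROM or JOIN.
    referenced = set(
        re.findall(r"\b(?:FROM|JOIN)\s+([\w.]+)", sql, flags=re.IGNORECASE)
    )
    return "spark" if referenced & set(object_store_tables) else "local"


# Assumed names for illustration only.
object_store_tables = {"s3.logs"}
request = build_sql_request(
    "http://localhost:9200", "SELECT status, COUNT(*) FROM s3.logs GROUP BY status"
)
remote = choose_engine("SELECT * FROM s3.logs", object_store_tables)
local = choose_engine("SELECT * FROM accounts", object_store_tables)
```

Passing `request` to `urllib.request.urlopen` would submit it to a running cluster. The sketch stops short of that so it can be read and run without one.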

penghuo mentioned this issue Jul 17, 2023

OpenSearch on Spark (without an OpenSearch cluster) - has this been contemplated? opensearch-project/OpenSearch#8566

Open

brijos mentioned this issue Aug 23, 2023

[META] Spark Support - Dashboard Connections/Sources, Materialized Views, and Covering Indexes #2027

Closed

YANG-DB mentioned this issue Aug 30, 2023

EXPERIMENTAL: PPL to catalyst plan translator #2041

Closed

6 tasks

brijos mentioned this issue Sep 21, 2023

[DOC] Spark Support - Dashboard Sources, Materialized Views, and Covering Indexes opensearch-project/documentation-website#5061

Closed

3 tasks
