[META] Bucket of Improvements to the Spark Integration for 2.13 #277

brijos · 2024-03-11T19:29:19Z

Is your feature request related to a problem?
We released our Spark integration starting in 2.9 with skipping index support and have received feedback that we need to focus on making it easier for customers to connect to their data in object stores. In 2.13, we will improve the admin UX in Data Sources by reinforcing OpenSearch Integrations as the out-of-the-box way to easily connect to your data using Hive, Spark, and your object store. We'll also make it easier for customers to create and manage accelerations from within data sources, and include Bloom filters for high-cardinality skipping indexes such as hostnames and IP addresses. For querying, we will include better error reporting and will extend JOIN functionality to PPL.

Below is the full list of other optimizations and bug fixes that will be addressed in 2.13:

Features
Exclude logically deleted Flint index from query rewrite
Show Flint indexes support
Creating an index fails with OpenSearch Carbon domain, and the error message is not meaningful
Standardize Flint log output format
Avoid expensive S3 listing by using file list in skipping index
Checkpoint folder data cleanup limitation
Skipping index and materialized view refresh synchronization
Large backlog processing on materialized view cold start
Support partial indexing for skipping and covering index
Covering index and materialized view refresh idempotency
Improve validation for SQL statement
Add stateCode if job execution failed

Bugs
Stop Spark context and JVM explicitly in FlintJob
Manually canceling a REPL session in the EMR-S console: shutdown hook does not update state
Create MV should expose error message properly
Flint index stuck in transient state
Gracefully terminate index refresh job when Flint index is deleted accidentally
Flint data source cannot read nested field value
Remove unnecessary locking when committing and rolling back a transaction
If a user provides an invalid S3 location as the checkpoint, the streaming query still executes but uses local disk as the checkpoint
Remove 2s delay between result write and statement state update
Missing errors when connecting to a Parquet table with incorrect types
Find correct catalog name in query rewriter
Field with null value is omitted when written back to result index
Flint does not support complex schema

Improve Error Reporting
Error message is not meaningful: "Fail to run query, cause: Failed to refresh Flint index"

anirudha · 2024-04-15T20:06:42Z

Moved the major 2.13 items to release/; we need to rebucket the low-hanging items here.

brijos added the enhancement (New feature or request) and untriaged labels Mar 11, 2024

bbarani added the v2.13.0 label Mar 11, 2024

getsaurabh02 removed the untriaged label Mar 18, 2024

anirudha closed this as completed Apr 15, 2024

github-project-automation bot added this to OpenSearch Project Roadmap Aug 30, 2024

github-project-automation bot moved this to 2.13.0 (Launched) in OpenSearch Project Roadmap Aug 30, 2024
