[META] Bucket of Improvements to the Spark Integration for 2.13 #277

Closed
brijos opened this issue Mar 11, 2024 · 1 comment
Labels
enhancement (New feature or request), v2.13.0

Comments


brijos commented Mar 11, 2024

Is your feature request related to a problem?
We released our Spark integration in 2.9 with skipping index support, and have since received feedback that we need to make it easier for customers to connect to their data in object stores. In 2.13, we will improve the admin UX in Data Sources by reinforcing OpenSearch Integrations as the out-of-the-box way to connect to your data using Hive, Spark, and your object store. We'll also make it easier for customers to create and manage accelerations from within Data Sources, and add Bloom filters for skipping indexes on high-cardinality fields such as hostnames and IP addresses. For querying, we will include better error reporting and extend JOIN functionality to PPL.
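To illustrate why a Bloom filter suits high-cardinality fields, here is a minimal sketch of the general idea, not the Flint implementation: one small filter is kept per data file, and a query only scans files whose filter says the value might be present. The class, function names, bucket, and file paths below are all hypothetical.

```python
import hashlib


class BloomFilter:
    """Tiny Bloom filter: may report false positives, never false negatives."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = 0  # a Python int used as a bit array

    def _positions(self, value):
        # Derive the bit positions by salting SHA-256 with the hash index.
        for i in range(self.num_hashes):
            digest = hashlib.sha256(f"{i}:{value}".encode("utf-8")).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits |= 1 << pos

    def might_contain(self, value):
        return all((self.bits >> pos) & 1 for pos in self._positions(value))


def build_skipping_index(files):
    """Build one Bloom filter per file over a high-cardinality column."""
    index = {}
    for path, values in files.items():
        bloom = BloomFilter()
        for value in values:
            bloom.add(value)
        index[path] = bloom
    return index


def files_to_scan(index, value):
    """Return only the files whose filter says the value might be present."""
    return [path for path, bloom in index.items() if bloom.might_contain(value)]


# Hypothetical file paths and client IPs, for illustration only.
files = {
    "s3://example-bucket/logs/part-0.parquet": ["10.0.0.1", "10.0.0.2"],
    "s3://example-bucket/logs/part-1.parquet": ["10.0.1.7", "10.0.1.9"],
}
index = build_skipping_index(files)
candidates = files_to_scan(index, "10.0.1.7")
```

The file that actually holds the value is always among the candidates. Files that do not hold it are usually skipped, at the cost of an occasional false positive. The filter size stays fixed regardless of how many distinct values a file holds, which is the reason it fits fields like hostnames and IP addresses better than a stored value set.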

Below is a comprehensive view of what other optimizations and bugs will be addressed in 2.13:

Features
Exclude logical deleted Flint index from query rewrite
Show Flint indexes support
Creating an index fails with OpenSearch Carbon domain, and the error message is not meaningful
Standardize Flint log output format
Avoid expensive S3 listing by using file list in skipping index
Checkpoint folder data cleanup limitation
Skipping index and materialized view refresh synchronization
Large backlog processing when materialized view cold start
Support partial indexing for skipping and covering index
Covering index and materialized view refresh idempotency
Improve validation for SQL statement
Add stateCode if job execution failed

Bugs
Stop Spark context and JVM explicitly in FlintJob
Manual cancel of REPL session in EMR-S console: shutdown hook does not update state
Create MV should expose error message properly
Flint index stuck in transient state
Gracefully terminate index refresh job when Flint index deleted accidentally
Flint data source cannot read nested field value
Remove unnecessary locking when commit and rollback transaction
If user provides an invalid S3 location as checkpoint, the streaming query still executes, but uses local disk as checkpoint
Remove 2s delay between result write and statement state update
Missing errors when connecting to a Parquet table with incorrect types
Find correct catalog name in query rewriter
Field with null value is omitted when written back to result index
Flint not supporting Complex schema

Improve Error Reporting
Error message is not meaningful: "Fail to run query, cause: Failed to refresh Flint index"

brijos added the enhancement (New feature or request) and untriaged labels on Mar 11, 2024
Collaborator

anirudha commented Apr 15, 2024

Moved the major 2.13 items to release/; we need to rebucket the low-hanging items here.

Projects
Status: 2.13.0 (Launched)