
Roadmap 2022 H1 (discussion) #920

Closed
dennyglee opened this issue Feb 3, 2022 · 18 comments
Assignees
Labels
enhancement New feature or request

Comments

@dennyglee
Contributor

dennyglee commented Feb 3, 2022

This is the proposed Delta Lake 2022 H1 roadmap discussion thread. Below are the initially proposed items for the roadmap to be completed by June 2022. We will also be sending out a survey (we will update this issue with the survey) to get more feedback from the Delta Lake community!

Performance Optimizations

Based on the overwhelming feedback from the Delta Users Slack, Google Groups, Community AMAs (on Delta Lake YouTube), Delta Lake 2021H2 survey, and 2021H2 roadmap, we propose the following Delta Lake performance enhancements in the next two quarters.

| Issue | Description | Target CY2022 |
| --- | --- | --- |
| #927 | OPTIMIZE (file compaction): an operation that rearranges the data and/or metadata to speed up queries and/or reduce the metadata size | Released in 1.2 |
| #923 | File skipping using column stats: a performance optimization aimed at speeding up queries that contain filters (WHERE clauses) on non-partitionBy columns | Released in 1.2 |
| #931 | Automatic data skipping using generated columns: enhance generated columns to include automatic data skipping | Released in 1.2 |
| #1134 | OPTIMIZE ZORDER: data clustering via multi-column, locality-preserving space-filling curves with offline sorting | Q3/Q4 |
| | MERGE performance improvements: we will be providing a project improvement plan (PIP) document shortly on the proposed design for discussion | Q2/Q3 |
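For intuition on the "multi-column, locality-preserving space-filling curve" behind OPTIMIZE ZORDER, here is a toy Python sketch of Z-order bit interleaving (purely illustrative; this is not Delta's actual implementation, and `z_order_key` is a hypothetical helper):

```python
def z_order_key(x: int, y: int, bits: int = 8) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order sort key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)       # even bit positions come from x
        key |= ((y >> i) & 1) << (2 * i + 1)   # odd bit positions come from y
    return key

# Sorting rows by this key clusters rows that are close in BOTH columns,
# so per-file min/max statistics can skip files for filters on either column.
rows = [(x, y) for x in range(4) for y in range(4)]
rows.sort(key=lambda r: z_order_key(*r))
```

The point of the interleaving is that a range scan on either column touches only a contiguous-ish slice of the sorted order, which is what makes file skipping effective on more than one column at once.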

Schema Operations

For this year, our focus will be on column mapping.

| Issue | Description | Target CY2022 |
| --- | --- | --- |
| #958 | Support for renaming columns: rename a column with ALTER TABLE | Released in 1.2 |
| #957 | Support for arbitrary column names: support characters in column names not allowed by Parquet | Released in 1.2 |
| #1064 | Support for dropping columns: drop a column with ALTER TABLE | Released in 2.0 |
| #348 | Support for dynamic partition overwrite: you can currently overwrite using the replaceWhere option, but in various scenarios it is more convenient to specify the partition to overwrite | Q2 |
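The rename and drop items above build on column mapping: logical column names map to stable physical field IDs, so both operations are metadata-only and never rewrite the underlying Parquet files. A toy sketch of the idea (hypothetical data structures, not Delta's actual metadata format):

```python
class ColumnMapping:
    """Illustrative sketch: logical names -> permanent physical field ids."""

    def __init__(self, columns):
        # each column gets a permanent physical id once, at table creation
        self.logical_to_physical = {name: i for i, name in enumerate(columns)}

    def rename(self, old: str, new: str) -> None:
        # metadata-only: the physical id (and the Parquet data) is untouched
        self.logical_to_physical[new] = self.logical_to_physical.pop(old)

    def drop(self, name: str) -> None:
        # metadata-only: readers simply stop projecting this physical id
        del self.logical_to_physical[name]

schema = ColumnMapping(["id", "event_ts", "payload"])
schema.rename("event_ts", "event_time")  # ALTER TABLE ... RENAME COLUMN
schema.drop("payload")                   # ALTER TABLE ... DROP COLUMN
```

This indirection is also what enables arbitrary column names: the logical name can contain characters Parquet disallows, because only the physical id is stored in the files.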

Integrations

Extending from the recent releases of PrestoDB, Hive 3, and Delta Sink for Apache Flink Streams API, we have additional integrations planned.

| Issue | Description | Target CY2022 |
| --- | --- | --- |
| #112 | Delta Source for Apache Pulsar: build a Pulsar/Delta reader leveraging Delta Standalone. Join us via the Delta Users Slack #connector-pulsar channel | Q2 |
| #238 | Flink Sink on Table API: build a Flink/Delta sink (i.e., Flink writes to Delta Lake) using the Apache Flink Table API. Join us via the Delta Users Slack #flink-delta-connector channel; we have bi-weekly meetings on Tuesdays | Q2/Q3 |
| #110 | Delta Source for Apache Flink: build a Flink/Delta source (i.e., Flink reads from Delta Lake) leveraging Delta Standalone. Join us via the Delta Users Slack #flink-delta-connector channel; we have bi-weekly meetings on Tuesdays | Q2/Q3 |
| #82 | Delta Source for Trino: joint Delta Lake and Trino community collaboration on PRs 10987 and 10300. This is a community effort and all are welcome! Join us via the Delta Users Slack #trino channel; we have bi-weekly meetings on Thursdays | Released |
| | Delta Source for BigQuery: allows BigQuery to natively read Delta Lake tables | Q2/Q3 |
| #523, #566 | Delta Rust Writer: extend the Delta Rust API to write to Delta Lake | Q2/Q3 |
| | Hive/Delta writer: extend Hive to write to Delta Lake | Q3 |

Operations Enhancements

Two very popular requests are planned for this half: Table Restore and S3 multi-cluster writes.

| Issue | Description | Target CY2022 |
| --- | --- | --- |
| #903, #863 | Table Restore: roll back to a previous version of a Delta table using the Python, Scala, and/or SQL APIs | Released in 1.2 |
| #41 | S3 multi-cluster writes: allows multiple clusters/drivers/JVMs to concurrently write to S3 using DynamoDB as the lock store. Please refer to this PIP: [2021-12-22] Delta OSS S3 Multi-Cluster Writes | Released in 1.2 |
| #747 | delta.io Guide: enhance the Delta Lake documentation by creating a new guide (PIP to follow soon) | Q2/Q3 |
| | Iceberg to Delta converter: ability to convert an Iceberg table to a Delta table without a rewrite | Q3 |
| | Table cloning: clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy over the data from the source, shallow clones do not | Q3 |
| #1105 | Change Data Feed: the Delta change data feed represents row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records "change events" for all the data written into the table | Q2 |
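For intuition on the Change Data Feed item, the row-level "change events" it describes can be sketched as a diff between two versions of a keyed table. This is an illustrative helper only (Delta records these events as data is written, not by diffing versions after the fact):

```python
def change_events(old: dict, new: dict) -> list:
    """Toy sketch: row-level change events between two versions of a keyed table."""
    events = []
    for key, value in new.items():
        if key not in old:
            events.append((key, "insert", value))
        elif old[key] != value:
            events.append((key, "update_postimage", value))
    for key, value in old.items():
        if key not in new:
            events.append((key, "delete", value))
    return events

# two versions of a tiny table keyed by id
v1 = {1: "a", 2: "b"}
v2 = {1: "a", 2: "c", 3: "d"}
```

A downstream consumer can then replay just these events (e.g. into a streaming sink) instead of re-reading the whole table at every version.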

Updates

  • 2022-05-18: Include Issue 348 for the dynamic partition overwrite feature request
  • 2022-05-03: Updated tables with Delta Lake 1.2 release.
  • 2022-03-08: Based on community feedback, we are also prioritizing Hive/Delta writer, clones, and CDF

If there are other issues that should be considered within this roadmap, let's have a discussion here or via the Delta Users Slack #deltalake-oss channel.

@dennyglee dennyglee added the enhancement New feature or request label Feb 3, 2022
@dennyglee dennyglee self-assigned this Feb 3, 2022
@dennyglee dennyglee pinned this issue Feb 3, 2022
@subramanianmohan

Coincidentally good timing, together with the Onehouse announcement yesterday!

@Kimahriman
Contributor

Auto optimize would be a great addition too. From the Databricks docs it looks like there are two types:

  • Optimize writes: does some kind of smart repartitioning before writing. Is this possible in current Spark, or does it involve something special in the Databricks Spark fork / Delta engine? The tricky part to me seems to be getting a single file per partition without grouping too much data into any one partition, while dealing with potentially skewed data.
  • Auto compaction: this seems straightforward once OPTIMIZE is implemented. My main question is whether this is (or should be) a two-commit process (commit the original files, then trigger a compaction and commit the compaction result), or whether in this case the compaction happens before committing at all?
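To make the two-commit question concrete, here is a minimal, purely hypothetical sketch of an auto-compaction trigger (all names and thresholds are invented for illustration, not Delta's or Databricks' actual behavior): after a data commit lands, the writer checks whether the table has accumulated enough small files and, if so, issues a second, separate compaction commit.

```python
SMALL_FILE_BYTES = 128 * 1024 * 1024   # files under 128 MB count as "small"
MIN_SMALL_FILES = 50                   # don't bother compacting fewer than this

def should_auto_compact(file_sizes: list) -> bool:
    """Hypothetical post-commit check: enough small files to justify compaction?"""
    small = [s for s in file_sizes if s < SMALL_FILE_BYTES]
    return len(small) >= MIN_SMALL_FILES

# e.g. 60 tiny files written by a streaming job would trigger a second,
# separate compaction commit after the data commit succeeds
sizes = [1024 * 1024] * 60
```

The two-commit shape has the advantage that the data commit is never delayed or failed by compaction; the compaction commit can lose a concurrency race and simply retry.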

@vinijaiswal
Contributor

Created an item for Big Query connector: https://github.com/delta-io/connectors/issues/282

This was referenced Feb 23, 2022
@novemberdude

It would be great if CDF were open-sourced soon. I'm really interested in this feature!

@Shadlezzz

@novemberdude agree, please add CDF!

@dennyglee
Contributor Author

Thanks for your feedback @novemberdude and @Shadlezzz - we’ll definitely take this into consideration!

@sliu4

sliu4 commented Mar 8, 2022

Would love to see a built-in solution for implementing a retention policy / archiving delta data on append-only tables - this would be a huge help for my team!

@dennyglee
Contributor Author

> It would be great if CDF were open-sourced soon. I'm really interested in this feature!

Based on your feedback from here and Slack, we've added CDF, Cloning, and Hive/Delta writer to the roadmap. HTH!

@dennyglee
Contributor Author

> Would love to see a built-in solution for implementing a retention policy / archiving delta data on append-only tables - this would be a huge help for my team!

Hi @sliu4 - there are a number of possible solutions for what you're facing. Could you provide more context here, and/or don't hesitate to chime in on the Delta Users Slack? This may be worth developing a PIP (project improvement proposal) for, so we can get more feedback on the design.

@sliu4

sliu4 commented Mar 14, 2022

hi @dennyglee - we have several very large prod level delta tables that we would like to gradually archive/glacierize and then we also have some delta data that we just want to delete outright after a certain period of time.

I know from speaking to our Databricks reps that there are solutions we can implement ourselves. For the first case, we can set up a view on the prod level data with a filter based on our glaciering schedule and advise users against querying the s3 path directly. For the second case, we can set up an automated job to do a DELETE and VACUUM.

Implementing these solutions is possible but currently has to be done on a case by case basis. Ideally we'd like a built in feature that can address this in a systematic way. We had hoped to use lifecycle policies and apply them across multiple buckets, but we know this doesn't play nicely with delta and the transaction log.

@dennyglee
Contributor Author

dennyglee commented Mar 15, 2022

> hi @dennyglee - we have several very large prod level delta tables that we would like to gradually archive/glacierize and then we also have some delta data that we just want to delete outright after a certain period of time.
>
> I know from speaking to our Databricks reps that there are solutions we can implement ourselves. For the first case, we can set up a view on the prod level data with a filter based on our glaciering schedule and advise users against querying the s3 path directly. For the second case, we can set up an automated job to do a DELETE and VACUUM.
>
> Implementing these solutions is possible but currently has to be done on a case by case basis. Ideally we'd like a built in feature that can address this in a systematic way. We had hoped to use lifecycle policies and apply them across multiple buckets, but we know this doesn't play nicely with delta and the transaction log.

Hi @sliu4 - this is super interesting. While the devil is in the details, the concept of a lifecycle policy may in fact work well in the context of Delta's transaction log, since we could use the log to determine what to DELETE/VACUUM, etc. It implies that the lifecycle policy itself would categorize tables at different policy granularities (e.g., HBI/MBI/LBI or GDPR compliance), read the Delta transaction log, and then initiate the process within that context to ensure transactional consistency when running the policy. Adding to this, you could probably utilize user metadata to track a single lifecycle policy across multiple tables, and/or create a policy table that includes the table/version numbers for the associated application of the policy. It would be worth diving into more - if you're up for it, ping me on Slack and we can find a time to dive in further, eh?! HTH! Denny
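To make the policy-table idea concrete, here is a rough, purely hypothetical Python sketch (nothing like this exists in Delta today; every name is invented): a per-table policy decides, from a commit's timestamp in the transaction log, whether the associated data should be archived, deleted, or left alone.

```python
from datetime import datetime, timedelta

# hypothetical policy table: per-table retention tiers
POLICIES = {
    "events_prod": {"archive_after_days": 90, "delete_after_days": 365},
}

def lifecycle_action(table: str, commit_time: datetime, now: datetime) -> str:
    """Decide the lifecycle action for data written at `commit_time`."""
    policy = POLICIES.get(table)
    if policy is None:
        return "none"
    age = now - commit_time
    if age > timedelta(days=policy["delete_after_days"]):
        return "delete"   # would be followed by DELETE + VACUUM on the table
    if age > timedelta(days=policy["archive_after_days"]):
        return "archive"  # e.g. transition the files to S3 Glacier
    return "none"

now = datetime(2022, 6, 1)
```

Driving the decision from commit timestamps in the transaction log (rather than raw S3 object lifecycle rules) is exactly what keeps the policy transactionally consistent with the table.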

@skhurana333

Following.

@hoffrocket

@dennyglee I see that you have CLONE support as a Q3 priority. I'm wondering if that's still on track, and is the goal to fully migrate the closed-source Databricks implementation into this code base?

@dennyglee
Copy link
Contributor Author

Hey @hoffrocket - let me get back to you on the timing of this - thanks!

@Qiuzhuang

> @dennyglee I see that you have CLONE support as a Q3 priority. I'm wondering if that's still on track? and is the goal to fully migrate the closed source databricks implementation into this code base?

This feature is very important for data DR.

@p2bauer

p2bauer commented Jun 30, 2022

Not to pile on, but we're also very keen on the CLONE functionality :-). It's the last blocker for us to build out our full DR plan, and it would be great to keep this one on track for Q3 if possible.

@dennyglee
Contributor Author

We're trying our best @p2bauer - we're going to send out the proposed priorities in the next week or so for all of us to review and help prioritize. Thanks!

@dennyglee
Contributor Author

Closing this issue as we're working on #1307

@dennyglee dennyglee unpinned this issue Aug 2, 2022