Roadmap 2022 H1 (discussion) #920

dennyglee · 2022-02-03T03:57:52Z

This is the proposed Delta Lake 2022 H1 roadmap discussion thread. Below are the initially proposed items for the roadmap to be completed by June 2022. We will also be sending out a survey (we will update this issue with the survey) to get more feedback from the Delta Lake community!

Performance Optimizations

Based on the overwhelming feedback from the Delta Users Slack, Google Groups, Community AMAs (on Delta Lake YouTube), Delta Lake 2021H2 survey, and 2021H2 roadmap, we propose the following Delta Lake performance enhancements in the next two quarters.

Issue	Description	Target CY2022
927	OPTIMIZE (file compaction): Table optimize is an operation to rearrange the data and/or metadata to speed up queries and/or reduce the metadata size	Released in 1.2
923	File skipping using columns stats: This is a performance optimization that aims at speeding up queries that contain filters (WHERE clauses) on non-partitionBy columns.	Released in 1.2
931	Automatic data skipping using generated columns: Enhance generated columns to include automatic data skipping	Released in 1.2
1134	OPTIMIZE ZORDER: Data clustering via multi-column locality-preserving space-filling curves with offline sorting.	Q3/Q4
	MERGE Performance Improvements: We will be providing a project improvement plan (PIP) document shortly on the proposed design for discussion.	Q2/Q3

Schema Operations

For this year, our focus will be on columnar mappings.

Issue	Description	Target CY2022
958	Support for renaming column: Rename column with ALTER TABLE	Released in 1.2
957	Support for arbitrary column names: Support characters in column names not allowed by Parquet	Released in 1.2
1064	Support for dropping columns: Drop column with ALTER TABLE	Released in 2.0
348	Support for dynamic partition overwrite: Currently you can overwrite using the `replaceWhere` option but in various scenarios, it is more convenient to specify overwrite partition.	Q2

Integrations

Extending from the recent releases of PrestoDB, Hive 3, and Delta Sink for Apache Flink Streams API, we have additional integrations planned.

Issue	Description	Target CY2022
112	Delta Source for Apache Pulsar: Build a Pulsar/Delta reader leveraging Delta Standalone. Join us via the Delta Users Slack #connector-pulsar channel.	Q2
238	Flink Sink on Table API: Build a Flink/Delta sink (i.e., Flink writes to Delta Lake) using the Apache Flink Table API. Join us via the Delta Users Slack #flink-delta-connector channel and we have bi-weekly meetings on Tuesdays.	Q2/Q3
110	Delta Source for Apache Flink: Build a Flink/Delta source (i.e., Flink reads from Delta Lake) leveraging Delta Standalone. Join us via the Delta Users Slack #flink-delta-connector channel and we have bi-weekly meetings on Tuesdays.	Q2/Q3
82	Delta Source for Trino: Joint Delta Lake and Trino community collaboration on the following PRs: 10987, 10300. This is a community effort and all are welcome! Join us via the Delta User Slack channel #trino channel and we will have bi-weekly meetings on Thursdays.	Released
	Delta Source for Big Query: Allows Big Query to natively read Delta Lake tables.	Q2/Q3
523, 566	Delta Rust Writer: Extending Delta Rust API to write to Delta Lake.	Q2/Q3
	Hive/Delta writer: Extending Hive to write to Delta Lake	Q3

Operations Enhancements

Two very popular requests are planned for this semester: Table Restore, S3 multi-cluster writes.

Issue	Description	Target CY2022
903, 863	Table Restore: Rollback to a previous version of a Delta table using Python, Scala, and/or SQL APIs.	Released in 1.2
41	S3 multi-cluster writes: Allows multiple clusters/drivers/JVMs to concurrently write to S3 using DynanoDB as the lock store. Please refer to this PIP: [2021-12-22] Delta OSS S3 Multi-Cluster Writes	Released in 1.2
747	delta.io.Guide: Enhance the Delta Lake documentation by creating a new guide (PIP will follow soon)	Q2/Q3
	Iceberg to Delta Converter: Ability to convert Iceberg table to Delta table without a rewrite.	Q3
	Table Cloning: Clones a source Delta table to a target destination at a specific version. A clone can be either deep or shallow: deep clones copy over the data from the source and shallow clones do not.	Q3
1105	Change Data Feed: The Delta change data feed represents row-level changes between versions of a Delta table. When enabled on a Delta table, the runtime records “change events” for all the data written into the table.	Q2

Updates

2022-05-18: Include Issue 348 for the dynamic partition overwrite feature request
2022-05-03: Updated tables with Delta Lake 1.2 release.
2022-03-08: Based on community feedback, we are also prioritizing Hive/Delta writer, clones, and CDF

If there are other issue that should be considered within this roadmap, let's have a discussion here or via the Delta Users Slack #deltalake-oss channel.

The text was updated successfully, but these errors were encountered:

subramanianmohan · 2022-02-03T14:51:15Z

Coincidentally good timing, together with The Onehouse announcement yesterday!!!

Kimahriman · 2022-02-18T11:57:54Z

Auto optimize would be a great addition too. From the Databricks docs it looks like there are two types:

Optimize writes: does some kind of smart repartition before writing. Is this possible in current Spark? Or does this involve something special in the Databricks Spark fork/Delta engine? The tricky part to me seems to be not grouping too much data into a single partition and getting a single file per partition, while dealing with potentially skewed data.
Auto compaction: this seems straightforward after the OPTIMIZE is implemented. My main question is is this (or should it be) a two commit process (commit original files then just trigger a compaction and commit the compaction result), or in this case does it do the compaction before committing at all?

vinijaiswal · 2022-02-22T06:18:38Z

Created an item for Big Query connector: https://github.com/delta-io/connectors/issues/282

novemberdude · 2022-02-26T17:03:56Z

It would be great if the CDF was open source on the latest date. I really interest with this feature!

Shadlezzz · 2022-03-04T20:19:00Z

@novemberdude agree, add cdf please!

dennyglee · 2022-03-06T05:14:10Z

Thanks for your feedback @novemberdude and @Shadlezzz - we’ll definitely take this into consideration!

sliu4 · 2022-03-08T21:36:04Z

Would love to see a built-in solution for implementing a retention policy / archiving delta data on append-only tables - this would be a huge help for my team!

dennyglee · 2022-03-09T03:02:38Z

It would be great if the CDF was open source on the latest date. I really interest with this feature!

Based on your feedback from here and Slack, we've added CDF, Cloning, and Hive/Delta writer to the roadmap. HTH!

dennyglee · 2022-03-09T03:04:10Z

Would love to see a built-in solution for implementing a retention policy / archiving delta data on append-only tables - this would be a huge help for my team!

Hi @sliu4 - there are a number of possible solutions to what you're facing. Could you provide the context here and/or do not hesitate to chime in the Delta Users slack. This may be worthy of us developing a PIP (project improvement proposal) so we can get more feedback on design.

sliu4 · 2022-03-14T15:43:28Z

hi @dennyglee - we have several very large prod level delta tables that we would like to gradually archive/glacierize and then we also have some delta data that we just want to delete outright after a certain period of time.

I know from speaking to our Databricks reps that there are solutions we can implement ourselves. For the first case, we can set up a view on the prod level data with a filter based on our glaciering schedule and advise users against querying the s3 path directly. For the second case, we can set up an automated job to do a DELETE and VACUUM.

Implementing these solutions is possible but currently has to be done on a case by case basis. Ideally we'd like a built in feature that can address this in a systematic way. We had hoped to use lifecycle policies and apply them across multiple buckets, but we know this doesn't play nicely with delta and the transaction log.

dennyglee · 2022-03-15T01:35:32Z

hi @dennyglee - we have several very large prod level delta tables that we would like to gradually archive/glacierize and then we also have some delta data that we just want to delete outright after a certain period of time.

I know from speaking to our Databricks reps that there are solutions we can implement ourselves. For the first case, we can set up a view on the prod level data with a filter based on our glaciering schedule and advise users against querying the s3 path directly. For the second case, we can set up an automated job to do a DELETE and VACUUM.

Implementing these solutions is possible but currently has to be done on a case by case basis. Ideally we'd like a built in feature that can address this in a systematic way. We had hoped to use lifecycle policies and apply them across multiple buckets, but we know this doesn't play nicely with delta and the transaction log.

Hi @sliu4 - this is super interesting and while I do think the "devil is in the details", the concept of a lifecycle policy may in fact work well with the context of Delta's transaction log since we would be able to use the transaction log to determine what to DELETE/VACUUM, etc. It implies that the lifecycle policy itself would categorize different tables with different policy granular levels (e.g. HBI, MBI, LBI or GDPR compliance etc.), read the Delta transaction log, and then initiate the process within that context to ensure transactional consistency when running the lifecycle policy. Adding to this, you could probably utilize user metadata as a way to track a single life cycle policy across multiple tables and/or create a policy table that includes the table/version numbers for the associated application of policy. It would be worth diving into more - if you're up for it, ping me on Slack and we can find a time to dive in further, eh?! HTH! Denny

skhurana333 · 2022-03-22T15:11:33Z

Following .

hoffrocket · 2022-06-17T18:34:07Z

@dennyglee I see that you have CLONE support as a Q3 priority. I'm wondering if that's still on track? and is the goal to fully migrate the closed source databricks implementation into this code base?

dennyglee · 2022-06-21T03:56:48Z

Hey @hoffrocket - let me get back to you on the timing of this - thanks!

Qiuzhuang · 2022-06-21T04:01:38Z

@dennyglee I see that you have CLONE support as a Q3 priority. I'm wondering if that's still on track? and is the goal to fully migrate the closed source databricks implementation into this code base?

This feature is very important for data DR.

p2bauer · 2022-06-30T22:03:54Z

Not to pile on, but also very keen on the CLONE functionality :-). It's the last blocker for us to guild out our full DR plan and it would really be great to keep this one on track for Q3 if possible.

dennyglee · 2022-07-11T22:29:58Z

We're trying our best @p2bauer - we're going to send out the proposed priorities in the next week or so for all of us to review and help prioritize. Thanks!

dennyglee · 2022-08-02T04:58:08Z

Closing this issue as we're working on #1307

dennyglee added the enhancement New feature or request label Feb 3, 2022

dennyglee self-assigned this Feb 3, 2022

dennyglee mentioned this issue Feb 3, 2022

Roadmap 2021 H2 (discussion) #748

Closed

dennyglee pinned this issue Feb 3, 2022

scottsand-db mentioned this issue Feb 3, 2022

Support for per-column statistics generation on Delta Lake #923

Closed

vkorukanti mentioned this issue Feb 7, 2022

Support for Optimizing (file compaction) Delta Lake tables #927

Closed

scottsand-db mentioned this issue Feb 8, 2022

Support for file skipping using column stats on Delta Lake #931

Closed

This was referenced Feb 23, 2022

Support arbitrary column names #957

Closed

Support column rename #958

Closed

vkorukanti mentioned this issue Mar 22, 2022

Add Scala API for optimize #961

Closed

liwensun mentioned this issue Apr 11, 2022

Support dropping columns from a Delta table #1064

Closed

allisonport-db mentioned this issue Apr 15, 2022

Add support for GENERATED ALWAYS AS IDENTITY in DeltaTableBuilder #1072

Closed

scottsand-db mentioned this issue Apr 29, 2022

Support for Change Data Feed in Delta Lake #1105

Closed

dennyglee mentioned this issue May 10, 2022

Update Delta Lake roadmap per 920 delta-io/website#78

Closed

vkorukanti mentioned this issue May 16, 2022

Support multi-dimensional clustering of data using Z-Order curves #1134

Closed

This was referenced May 31, 2022

Delta Sharing API for easy cloning as Delta delta-io/delta-sharing#143

Closed

[Feature Request] Allow generated IDENTITY column to be reset #1157

Open

dennyglee mentioned this issue Aug 2, 2022

Roadmap 2022 H2 (discussion) #1307

Open

dennyglee closed this as completed Aug 2, 2022

dennyglee unpinned this issue Aug 2, 2022

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Roadmap 2022 H1 (discussion) #920

Roadmap 2022 H1 (discussion) #920

dennyglee commented Feb 3, 2022 •

edited by scottsand-db

Loading

subramanianmohan commented Feb 3, 2022

Kimahriman commented Feb 18, 2022

vinijaiswal commented Feb 22, 2022

novemberdude commented Feb 26, 2022

Shadlezzz commented Mar 4, 2022

dennyglee commented Mar 6, 2022

sliu4 commented Mar 8, 2022

dennyglee commented Mar 9, 2022

dennyglee commented Mar 9, 2022

sliu4 commented Mar 14, 2022

dennyglee commented Mar 15, 2022 •

edited

Loading

skhurana333 commented Mar 22, 2022

hoffrocket commented Jun 17, 2022

dennyglee commented Jun 21, 2022

Qiuzhuang commented Jun 21, 2022

p2bauer commented Jun 30, 2022

dennyglee commented Jul 11, 2022

dennyglee commented Aug 2, 2022

Roadmap 2022 H1 (discussion) #920

Roadmap 2022 H1 (discussion) #920

Comments

dennyglee commented Feb 3, 2022 • edited by scottsand-db Loading

Performance Optimizations

Schema Operations

Integrations

Operations Enhancements

Updates

subramanianmohan commented Feb 3, 2022

Kimahriman commented Feb 18, 2022

vinijaiswal commented Feb 22, 2022

novemberdude commented Feb 26, 2022

Shadlezzz commented Mar 4, 2022

dennyglee commented Mar 6, 2022

sliu4 commented Mar 8, 2022

dennyglee commented Mar 9, 2022

dennyglee commented Mar 9, 2022

sliu4 commented Mar 14, 2022

dennyglee commented Mar 15, 2022 • edited Loading

skhurana333 commented Mar 22, 2022

hoffrocket commented Jun 17, 2022

dennyglee commented Jun 21, 2022

Qiuzhuang commented Jun 21, 2022

p2bauer commented Jun 30, 2022

dennyglee commented Jul 11, 2022

dennyglee commented Aug 2, 2022

dennyglee commented Feb 3, 2022 •

edited by scottsand-db

Loading

dennyglee commented Mar 15, 2022 •

edited

Loading