From 6ecd3152da0a464b6ddfd875258c1505e11dfb52 Mon Sep 17 00:00:00 2001 From: Dhruv Arya Date: Mon, 16 Dec 2024 23:36:13 +0530 Subject: [PATCH 1/6] doc changes for ICT --- docs/source/delta-batch.md | 7 +++++++ docs/source/delta-drop-feature.md | 2 +- docs/source/table-properties.md | 12 +++++++++++- docs/source/versioning.md | 2 ++ 4 files changed, 21 insertions(+), 2 deletions(-) diff --git a/docs/source/delta-batch.md b/docs/source/delta-batch.md index b3168f05307..34db66f69cf 100644 --- a/docs/source/delta-batch.md +++ b/docs/source/delta-batch.md @@ -742,6 +742,13 @@ Each time a checkpoint is written, Delta automatically cleans up log entries old .. note:: Due to log entry cleanup, instances can arise where you cannot time travel to a version that is less than the retention interval. requires all consecutive log entries since the previous checkpoint to time travel to a particular version. For example, with a table initially consisting of log entries for versions [0, 19] and a checkpoint at verison 10, if the log entry for version 0 is cleaned up, then you cannot time travel to versions [1, 9]. Increasing the table property `delta.logRetentionDuration` can help avoid these situations. +### In-Commit Timestamps + +Historically, Delta has relied on file modification timetamps to be the source of truth for when +the table was modified. This becomes problematic when tables are moved from one storage location to another since the file modification timestamps change in such scenarios. To ensure that the timestamps +used for time travel don't change in such scenarios and that timestamp-based time travel queries produce +consistent results, the [In-Commit Timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps) table feature was introduced in Delta 3.3. This feature can be enabled by setting the table property `delta.enableInCommitTimestamps` to `true`. See the [Versioning](./versioning) section for more details around compatibility. 
+ ## Write to a table diff --git a/docs/source/delta-drop-feature.md b/docs/source/delta-drop-feature.md index 1189199c1f3..343ef7be819 100644 --- a/docs/source/delta-drop-feature.md +++ b/docs/source/delta-drop-feature.md @@ -27,7 +27,7 @@ You can drop the following Delta table features: - `deletionVectors`. See [_](delta-deletion-vectors.md). - `typeWidening-preview`. See [_](delta-type-widening.md). Type widening is available in preview in 3.2.0 and above. - `v2Checkpoint`. See [V2 Checkpoint Spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#v2-spec). Drop support for V2 Checkpoints is available in 3.1.0 and above. - +- `inCommitTimestamp`. See [In-Commit Timestamps Spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps). You cannot drop other [Delta table features](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#valid-feature-names-in-table-features). ## How are Delta table features dropped? diff --git a/docs/source/table-properties.md b/docs/source/table-properties.md index 01a58269d94..43377d41916 100644 --- a/docs/source/table-properties.md +++ b/docs/source/table-properties.md @@ -169,6 +169,16 @@ properties are set. Available Delta table properties include: | | | Default: `classic` | +-------------------------------------------------------------------------------------------+ - +| `delta.enableInCommitTimestamps` | +| | +| `true` for enabling the InCommitTimestamps table feature. | +| | +| | +| See [_](/presto-integration.md#step-3-update-manifests). | +| | +| Data type: `Boolean` | +| | +| Default: `false` | ++-------------------------------------------------------------------------------------------+ .. replace:: Delta Lake .. 
replace:: Apache Spark \ No newline at end of file diff --git a/docs/source/versioning.md b/docs/source/versioning.md index 2135a6d5b0d..bf6df741a0a 100644 --- a/docs/source/versioning.md +++ b/docs/source/versioning.md @@ -29,6 +29,7 @@ The following features break forward compatibility. Features are enabled Row Tracking, [Delta Lake 3.2.0](https://github.com/delta-io/delta/releases/tag/v3.2.0),[_](/delta-row-tracking.md) Type widening (Preview),[Delta Lake 3.2.0](https://github.com/delta-io/delta/releases/tag/v3.2.0),[_](/delta-type-widening.md) Identity columns, [Delta Lake 3.3.0](https://github.com/delta-io/delta/releases/tag/v3.3.0),[_](/delta-batch.md#use-identity-columns) + In-Commit Timestamps, [Delta Lake 3.3.0](https://github.com/delta-io/delta/releases/tag/v3.3.0),[_](/delta-batch.md#in-commit-timestamps) @@ -113,6 +114,7 @@ The following table shows minimum protocol versions required for feature Vacuum Protocol Check,7,3,[Vacuum Protocol Check Spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#vacuum-protocol-check) Row Tracking,7,3,[_](/delta-row-tracking.md) Type widening (Preview),7,3,[_](/delta-type-widening.md) + In-Commit Timestamps,7,3,[In-Commit Timestamps Spec](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps) From e6fdda3f14764286d7e57c98978b4ce7101a88bd Mon Sep 17 00:00:00 2001 From: Dhruv Arya Date: Mon, 16 Dec 2024 23:50:27 +0530 Subject: [PATCH 2/6] improvements --- docs/source/delta-batch.md | 33 +++++++++++++++++++++++++++++---- 1 file changed, 29 insertions(+), 4 deletions(-) diff --git a/docs/source/delta-batch.md b/docs/source/delta-batch.md index 34db66f69cf..a9353635aec 100644 --- a/docs/source/delta-batch.md +++ b/docs/source/delta-batch.md @@ -744,10 +744,35 @@ Each time a checkpoint is written, Delta automatically cleans up log entries old ### In-Commit Timestamps -Historically, Delta has relied on file modification timetamps to be the source of truth for when -the table was modified. 
This becomes problematic when tables are moved from one storage location to another since the file modification timestamps change in such scenarios. To ensure that the timestamps -used for time travel don't change in such scenarios and that timestamp-based time travel queries produce -consistent results, the [In-Commit Timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps) table feature was introduced in Delta 3.3. This feature can be enabled by setting the table property `delta.enableInCommitTimestamps` to `true`. See the [Versioning](./versioning) section for more details around compatibility. +#### Overview +Delta Lake 3.3 introduced [In-Commit Timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps) to provide a more reliable and consistent way to track table modifications. This feature addresses limitations of the traditional approach that relied on file modification timestamps, particularly in scenarios involving data migration or replication. + +#### Background +Previously, Delta Lake used file modification timestamps as the source of truth for table modifications. This approach presented several challenges: + +1. Data Migration Issues: When tables were moved between storage locations, file modification timestamps would change, potentially disrupting historical tracking +2. Replication Scenarios: Timestamp inconsistencies could arise when replicating data across different environments +3. Time Travel Reliability: These timestamp changes could affect the accuracy and consistency of time travel queries + +#### Feature Details +In-Commit Timestamps stores modification timestamps within the commit itself, ensuring they remain unchanged regardless of file system operations. 
This provides several benefits: + +- **Immutable History**: Timestamps become part of the table's permanent commit history +- **Consistent Time Travel**: Queries using timestamp-based time travel produce reliable results even after table migration + +### Enabling the Feature +This feature can be enabled by setting the table property `delta.enableInCommitTimestamps` to `true`: + +```sql +ALTER TABLE +SET TBLPROPERTIES ('delta.enableInCommitTimestamps' = 'true'); +``` + +After enabling In-Commit Timestamps: +- Only new write operations will include the embedded timestamps +- File modification timestamps will continue to be used for historical commits performed before enablement + +See the [Versioning](./versioning) section for more details around compatibility. From e65f4491064f46e0aa85c7097d323dc9ebd9a80a Mon Sep 17 00:00:00 2001 From: Dhruv Arya Date: Mon, 16 Dec 2024 23:51:05 +0530 Subject: [PATCH 3/6] fix heading --- docs/source/delta-batch.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/docs/source/delta-batch.md b/docs/source/delta-batch.md index a9353635aec..7479408376a 100644 --- a/docs/source/delta-batch.md +++ b/docs/source/delta-batch.md @@ -760,7 +760,7 @@ In-Commit Timestamps stores modification timestamps within the commit itself, en - **Immutable History**: Timestamps become part of the table's permanent commit history - **Consistent Time Travel**: Queries using timestamp-based time travel produce reliable results even after table migration -### Enabling the Feature +#### Enabling the Feature This feature can be enabled by setting the table property `delta.enableInCommitTimestamps` to `true`: ```sql From f32f9ed28ccbb08c0178ba0779195f16cd489018 Mon Sep 17 00:00:00 2001 From: Dhruv Arya Date: Mon, 16 Dec 2024 23:52:46 +0530 Subject: [PATCH 4/6] fix spacing --- docs/source/table-properties.md | 1 + 1 file changed, 1 insertion(+) diff --git a/docs/source/table-properties.md b/docs/source/table-properties.md index 
43377d41916..35c20da29b2 100644 --- a/docs/source/table-properties.md +++ b/docs/source/table-properties.md @@ -180,5 +180,6 @@ properties are set. Available Delta table properties include: | | | Default: `false` | +-------------------------------------------------------------------------------------------+ + .. replace:: Delta Lake .. replace:: Apache Spark \ No newline at end of file From 4ae240594ab27878669712fd60cb54dddfe50296 Mon Sep 17 00:00:00 2001 From: Dhruv Arya Date: Mon, 16 Dec 2024 23:56:04 +0530 Subject: [PATCH 5/6] fix --- docs/source/delta-batch.md | 4 ++-- docs/source/table-properties.md | 2 +- 2 files changed, 3 insertions(+), 3 deletions(-) diff --git a/docs/source/delta-batch.md b/docs/source/delta-batch.md index 7479408376a..dcdfbb3b9e0 100644 --- a/docs/source/delta-batch.md +++ b/docs/source/delta-batch.md @@ -745,10 +745,10 @@ Each time a checkpoint is written, Delta automatically cleans up log entries old ### In-Commit Timestamps #### Overview -Delta Lake 3.3 introduced [In-Commit Timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps) to provide a more reliable and consistent way to track table modifications. This feature addresses limitations of the traditional approach that relied on file modification timestamps, particularly in scenarios involving data migration or replication. + 3.3 introduced [In-Commit Timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps) to provide a more reliable and consistent way to track table modifications. This feature addresses limitations of the traditional approach that relied on file modification timestamps, particularly in scenarios involving data migration or replication. #### Background -Previously, Delta Lake used file modification timestamps as the source of truth for table modifications. This approach presented several challenges: +Previously, used file modification timestamps as the source of truth for table modifications. 
This approach presented several challenges: 1. Data Migration Issues: When tables were moved between storage locations, file modification timestamps would change, potentially disrupting historical tracking 2. Replication Scenarios: Timestamp inconsistencies could arise when replicating data across different environments diff --git a/docs/source/table-properties.md b/docs/source/table-properties.md index 35c20da29b2..318173d3cdf 100644 --- a/docs/source/table-properties.md +++ b/docs/source/table-properties.md @@ -174,7 +174,7 @@ properties are set. Available Delta table properties include: | `true` for enabling the InCommitTimestamps table feature. | | | | | -| See [_](/presto-integration.md#step-3-update-manifests). | +| See [_](delta-batch.md#in-commit-timestamps). | | | | Data type: `Boolean` | | | From 6bc530af6306adc5fd8961286a420c37fc9a501a Mon Sep 17 00:00:00 2001 From: Dhruv Arya Date: Tue, 17 Dec 2024 08:48:58 +0530 Subject: [PATCH 6/6] update as per feedback --- docs/source/delta-batch.md | 15 +++++++-------- 1 file changed, 7 insertions(+), 8 deletions(-) diff --git a/docs/source/delta-batch.md b/docs/source/delta-batch.md index dcdfbb3b9e0..bed70f5e7e8 100644 --- a/docs/source/delta-batch.md +++ b/docs/source/delta-batch.md @@ -745,14 +745,7 @@ Each time a checkpoint is written, Delta automatically cleans up log entries old ### In-Commit Timestamps #### Overview - 3.3 introduced [In-Commit Timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps) to provide a more reliable and consistent way to track table modifications. This feature addresses limitations of the traditional approach that relied on file modification timestamps, particularly in scenarios involving data migration or replication. - -#### Background -Previously, used file modification timestamps as the source of truth for table modifications. This approach presented several challenges: - -1. 
Data Migration Issues: When tables were moved between storage locations, file modification timestamps would change, potentially disrupting historical tracking -2. Replication Scenarios: Timestamp inconsistencies could arise when replicating data across different environments -3. Time Travel Reliability: These timestamp changes could affect the accuracy and consistency of time travel queries + 3.3 introduced [In-Commit Timestamps](https://github.com/delta-io/delta/blob/master/PROTOCOL.md#in-commit-timestamps) to provide a more reliable and consistent way to track table modification timestamps. These modification timestamps are needed for various use cases, e.g., time travel to a specific time in the past. This feature addresses limitations of the traditional approach that relied on file modification timestamps, particularly in scenarios involving data migration or replication. #### Feature Details In-Commit Timestamps stores modification timestamps within the commit itself, ensuring they remain unchanged regardless of file system operations. This provides several benefits: @@ -760,6 +753,12 @@ In-Commit Timestamps stores modification timestamps within the commit itself, en - **Immutable History**: Timestamps become part of the table's permanent commit history - **Consistent Time Travel**: Queries using timestamp-based time travel produce reliable results even after table migration +Without the In-Commit Timestamps feature, uses file modification timestamps as the commit timestamp. This approach has various limitations: + +1. Data Migration Issues: When tables are moved between storage locations, file modification timestamps change, potentially disrupting historical tracking +2. Replication Scenarios: Timestamp inconsistencies could arise when replicating data across different environments +3. 
Time Travel Reliability: These timestamp changes could affect the accuracy and consistency of time travel queries + #### Enabling the Feature This feature can be enabled by setting the table property `delta.enableInCommitTimestamps` to `true`: