
Populate stats when missing in transaction log #16743

Merged: 5 commits merged into trinodb:master on Aug 21, 2023

Conversation

@pajaks (Member) commented Mar 27, 2023

Description

Relates to #15967
In case the transaction log does not have statistics for some files, we want to add this information.
After this change, statistics are collected during ANALYZE for each file, and a new transaction log entry is created with the results.
For now, collection includes:

  1. row_count
  2. null value counts for all columns
  3. min/max for all columns except VARCHAR types

Right now it works only for the initial ANALYZE.
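
As a rough illustration, here is a minimal sketch, assuming Jackson for JSON serialization, of the per-file stats object that the Delta protocol stores in the "stats" field of an add action. The field names (numRecords, minValues, maxValues, nullCount) follow the Delta protocol, but this helper is hypothetical and is not Trino's implementation:

import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Map;

public class FileStatsSketch
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical per-file statistics, mirroring what ANALYZE collects:
        Map<String, Object> stats = Map.of(
                "numRecords", 100,                               // row_count
                "minValues", Map.of("id", 1, "price", 9.99),     // min per column; VARCHAR columns skipped
                "maxValues", Map.of("id", 100, "price", 199.99), // max per column; VARCHAR columns skipped
                "nullCount", Map.of("id", 0, "price", 3));       // null values per column

        // Serialized JSON as stored on the file's "add" entry in the transaction log
        System.out.println(new ObjectMapper().writeValueAsString(stats));
    }
}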

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Delta Lake
* Populate file-level statistics when missing in the transaction log

@cla-bot cla-bot bot added the cla-signed label Mar 27, 2023
@github-actions github-actions bot added the delta-lake Delta Lake connector label Mar 27, 2023
@pajaks pajaks marked this pull request as ready for review March 28, 2023 08:46
@pajaks pajaks requested review from findepi, ebyhr, findinpath and alexjo2144 and removed request for findepi March 28, 2023 08:46
@findinpath (Contributor) commented:

general question: Is this comment still valid?

@pajaks (Member, Author) commented Mar 31, 2023

1st push: comments addressed, plus partition handling and handling for various types
2nd push: rebase to resolve conflicts

@findinpath (Contributor) left a comment

LGTM % comments

@findepi (Member) left a comment

"Add handling for grouped statistics in delta lake"

return computedStatistics.stream()
        .map(ComputedStatistics::getColumnStatistics)
        .map(Map::entrySet)
        .flatMap(Collection::stream)
        .filter(entry -> entry.getKey().getColumnName().equals(FILE_MODIFIED_TIME_COLUMN_NAME))
Member:

after filtering this is FILE_MODIFIED_TIME_COLUMN_NAME, verify that singleStatistics.getGroupingColumns() is empty

@pajaks (Member, Author):

Grouping is defined for the whole table, so each column will have grouping (including FILE_MODIFIED_TIME_COLUMN_NAME). When grouping by $path in the following commit, we receive $file_modified_time for each file and calculate the max value.
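
A hypothetical sketch of that grouping idea (the FileStat record and helper names are made up for illustration): group computed values by $path and keep the max $file_modified_time per file.

import java.time.Instant;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// One computed value: a file's $path and an observed $file_modified_time
record FileStat(String path, Instant fileModifiedTime) {}

class GroupedStatsSketch
{
    // Group by $path, keeping the latest $file_modified_time for each file
    static Map<String, Instant> maxModifiedTimePerFile(Stream<FileStat> stats)
    {
        return stats.collect(Collectors.toMap(
                FileStat::path,
                FileStat::fileModifiedTime,
                (a, b) -> a.isAfter(b) ? a : b));
    }
}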

Comment on lines 2268 to 2567
// Collect file statistics only when performing ANALYZE on a table without extended statistics
boolean collectFileStatistics = !areExtendedStatisticsPresent && !isCollectionOnWrite;
Member:

Why?
Someone could have extended statistics (created by Trino 413) and want to ANALYZE the table to also collect file-level stats.

Let's discuss and improve explanation in the code.

@pajaks (Member, Author) commented Apr 14, 2023

The idea in this PR was to collect file-level statistics only for the initial ANALYZE. Checking whether extended statistics are empty is currently used to determine whether it's the initial statistics collection.

@pajaks (Member, Author):

For the mentioned case, maybe drop_extended_stats before ANALYZE (or force_recalculate_statistics with #16634) would be the easiest solution?

@findepi (Member) left a comment

didn't review the main commit yet

Comment on lines 2332 to 2984
if (analyzeHandle.isInitialAnalyze()) {
    generateMissingFileStatistics(session, tableHandle, computedStatistics);
Member:

Why the condition?
With incremental analyze we could do this as well. It's just that we would fill min/max for a subset of files only.

(We need to revisit

if (tableHandle.getAnalyzeHandle().isPresent() && !tableHandle.getAnalyzeHandle().get().isInitialAnalyze() && !addAction.isDataChange()) {
    // skip files which do not introduce data change on non-initial ANALYZE
    return Stream.empty();

as well. That code assumes ANALYZE covers data only, but now it has become aware of file boundaries.)

@pajaks (Member, Author):

I would like to leave incremental ANALYZE for a separate PR, if that's OK.

Member:

Can we leave a TODO comment?

    columnStatistics.add(new ColumnStatisticMetadata(columnMetadata.getName(), MAX_VALUE));
    columnStatistics.add(new ColumnStatisticMetadata(columnMetadata.getName(), MIN_VALUE));
}
columnStatistics.add(new ColumnStatisticMetadata(columnMetadata.getName(), NUMBER_OF_NON_NULL_VALUES));
Member:

We collect those stats for all columns in the table and then write them back to the transaction log.
This will inflate metadata for wide tables and affect query planning times and coordinator memory. I think we should follow Databricks's approach of analyzing only the first few columns.

cc @alexjo2144

@pajaks (Member, Author):

How can we know which columns are the initial ones? Is it related to the delta.dataSkippingNumIndexedCols property?
https://docs.delta.io/latest/optimizations-oss.html#data-skipping
The idea of this PR was to generate stats regardless of this property (#15135).
I also cannot find any check for this property in the code, so Trino collects statistics regardless during write.
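
For reference, a hypothetical sketch of the Databricks-style limit being discussed, assuming delta.dataSkippingNumIndexedCols semantics (default 32; a negative value means collect for all columns); this is not what the PR implements:

import java.util.List;

class DataSkippingColumnsSketch
{
    // Pick the leading columns that would get min/max/null-count statistics
    static List<String> columnsToAnalyze(List<String> allColumns, int dataSkippingNumIndexedCols)
    {
        if (dataSkippingNumIndexedCols < 0) {
            // Negative value (e.g. -1) means: collect statistics for all columns
            return allColumns;
        }
        return allColumns.subList(0, Math.min(dataSkippingNumIndexedCols, allColumns.size()));
    }
}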

@pajaks (Member, Author):

This is a preexisting issue, as Trino currently analyzes all columns during write. Issue for improvement: #17057

@ebyhr (Member) commented Apr 26, 2023

findinpath added the release-notes label 4 hours ago

@findinpath We use the release-notes label for PRs that add release note documentation, like #17002

@ebyhr (Member) left a comment

Still reviewing the last commit.

@pajaks (Member, Author) commented May 9, 2023

Rebased to resolve conflicts.


@pajaks (Member, Author) commented Jul 3, 2023

First push is a rebase; the second addresses comments.

@ebyhr (Member) commented Jul 11, 2023

Could you rebase on master to resolve conflicts?

@pajaks (Member, Author) commented Jul 11, 2023

First push to resolve conflicts, second with addressed comments.

@findepi (Member) commented Jul 21, 2023

@alexjo2144 can you ptal?

@pajaks (Member, Author) commented Aug 9, 2023

First push: rebase
Second push: adaptation for case sensitivity

@ebyhr (Member) commented Aug 10, 2023

/test-with-secrets sha=1422b6a4102e9cc424432fcc37e4ab4698af8aa9

@findepi (Member) left a comment

(fmt)

@github-actions

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/5820602635

@ebyhr ebyhr merged commit 16a730c into trinodb:master Aug 21, 2023
@github-actions github-actions bot added this to the 425 milestone Aug 21, 2023
Labels
cla-signed, delta-lake (Delta Lake connector)

6 participants