
Populate stats when missing in transaction log #16743

Merged: 5 commits merged into trinodb:master on Aug 21, 2023

Conversation

@pajaks (Member) commented Mar 27, 2023

Description

Relates to #15967
In case the transaction log does not have statistics for some files, we want to add this information.
After this change, statistics are collected during ANALYZE for each file, and a new transaction log entry is created with the results.
For now, collection includes:

  1. row_count
  2. null value counts for all columns
  3. min/max for all columns except VARCHAR types

Right now it works only for the initial ANALYZE.
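
As a rough illustration, here is a minimal sketch, assuming Jackson for JSON serialization, of the per-file stats object that the Delta protocol stores in the "stats" field of an add action. The field names (numRecords, minValues, maxValues, nullCount) follow the Delta protocol, but this helper is hypothetical and is not Trino's implementation:

import com.fasterxml.jackson.databind.ObjectMapper;

import java.util.Map;

public class FileStatsSketch
{
    public static void main(String[] args) throws Exception
    {
        // Hypothetical per-file statistics, mirroring what ANALYZE collects:
        Map<String, Object> stats = Map.of(
                "numRecords", 100,                               // row_count
                "minValues", Map.of("id", 1, "price", 9.99),     // min per column; VARCHAR columns skipped
                "maxValues", Map.of("id", 100, "price", 199.99), // max per column; VARCHAR columns skipped
                "nullCount", Map.of("id", 0, "price", 3));       // null values per column

        // Serialized JSON as stored on the file's "add" entry in the transaction log
        System.out.println(new ObjectMapper().writeValueAsString(stats));
    }
}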

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
(x) Release notes are required, with the following suggested text:

# Delta Lake
* Populate file-level statistics when missing in the transaction log

@cla-bot cla-bot bot added the cla-signed label Mar 27, 2023
@github-actions github-actions bot added the delta-lake Delta Lake connector label Mar 27, 2023
@pajaks pajaks marked this pull request as ready for review March 28, 2023 08:46
@pajaks pajaks requested review from findepi, ebyhr, findinpath and alexjo2144 and removed request for findepi March 28, 2023 08:46
@findinpath (Contributor) commented:

general question: Is this comment still valid?

@pajaks (Member, Author) commented Mar 31, 2023

1st push: comments addressed, plus partition handling and handling for various types
2nd push: rebase to resolve conflicts

@findinpath (Contributor) left a comment

LGTM % comments

@findepi (Member) left a comment

"Add handling for grouped statistics in delta lake"

return computedStatistics.stream()
        .map(ComputedStatistics::getColumnStatistics)
        .map(Map::entrySet)
        .flatMap(Collection::stream)
        .filter(entry -> entry.getKey().getColumnName().equals(FILE_MODIFIED_TIME_COLUMN_NAME))
Member:

after filtering this is FILE_MODIFIED_TIME_COLUMN_NAME, verify that singleStatistics.getGroupingColumns() is empty

@pajaks (Member, Author):

Grouping is defined for the whole table, so each column will have grouping (including FILE_MODIFIED_TIME_COLUMN_NAME). When grouping by $path in the following commit, we receive $file_modified_time for each file and calculate the max value.
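
A hypothetical sketch of that grouping idea (the FileStat record and helper names are made up for illustration): group computed values by $path and keep the max $file_modified_time per file.

import java.time.Instant;
import java.util.Map;
import java.util.stream.Collectors;
import java.util.stream.Stream;

// One computed value: a file's $path and an observed $file_modified_time
record FileStat(String path, Instant fileModifiedTime) {}

class GroupedStatsSketch
{
    // Group by $path, keeping the latest $file_modified_time for each file
    static Map<String, Instant> maxModifiedTimePerFile(Stream<FileStat> stats)
    {
        return stats.collect(Collectors.toMap(
                FileStat::path,
                FileStat::fileModifiedTime,
                (a, b) -> a.isAfter(b) ? a : b));
    }
}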

Comment on lines 2268 to 2567
// Collect file statistics only when performing ANALYZE on a table without extended statistics
boolean collectFileStatistics = !areExtendedStatisticsPresent && !isCollectionOnWrite;
Member:

Why?
Someone could have extended statistics (created by Trino 413) and want to ANALYZE the table to also collect file-level stats.

Let's discuss and improve explanation in the code.

@pajaks (Member, Author) commented Apr 14, 2023

The idea in this PR was to collect file-level statistics only for the initial ANALYZE. Checking whether extended statistics are empty is currently used to determine whether it's the initial statistics collection.

@pajaks (Member, Author):

For the mentioned case, maybe drop_extended_stats before ANALYZE (or force_recalculate_statistics with #16634) would be the easiest solution?

@findepi (Member) left a comment

didn't review the main commit yet

Comment on lines 2332 to 2984
if (analyzeHandle.isInitialAnalyze()) {
    generateMissingFileStatistics(session, tableHandle, computedStatistics);
Member:

Why the condition?
With incremental analyze we could do this as well. It's just that we would fill min/max for a subset of files only.

(We need to revisit

if (tableHandle.getAnalyzeHandle().isPresent() && !tableHandle.getAnalyzeHandle().get().isInitialAnalyze() && !addAction.isDataChange()) {
    // skip files which do not introduce data change on non-initial ANALYZE
    return Stream.empty();

as well. That code assumes ANALYZE covers data only, but now it has become aware of file boundaries.)

@pajaks (Member, Author):

I would like to leave incremental ANALYZE for a separate PR, if that's OK.

Member:

Can we leave a TODO comment?

    columnStatistics.add(new ColumnStatisticMetadata(columnMetadata.getName(), MAX_VALUE));
    columnStatistics.add(new ColumnStatisticMetadata(columnMetadata.getName(), MIN_VALUE));
}
columnStatistics.add(new ColumnStatisticMetadata(columnMetadata.getName(), NUMBER_OF_NON_NULL_VALUES));
Member:

We collect those stats for all columns in the table and then write them back to the transaction log.
This will inflate metadata for wide tables and affect query planning times and coordinator memory. I think we should follow Databricks's approach of analyzing only the first few columns.

cc @alexjo2144

@pajaks (Member, Author):

How can we know which columns are the initial ones? Is it related to the delta.dataSkippingNumIndexedCols property?
https://docs.delta.io/latest/optimizations-oss.html#data-skipping
The idea of this PR was to generate stats regardless of this property (#15135).
I also cannot find any check for this property in the code, so Trino collects statistics regardless during write.
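
For reference, a hypothetical sketch of the Databricks-style limit being discussed, assuming delta.dataSkippingNumIndexedCols semantics (default 32; a negative value means collect for all columns); this is not what the PR implements:

import java.util.List;

class DataSkippingColumnsSketch
{
    // Pick the leading columns that would get min/max/null-count statistics
    static List<String> columnsToAnalyze(List<String> allColumns, int dataSkippingNumIndexedCols)
    {
        if (dataSkippingNumIndexedCols < 0) {
            // Negative value (e.g. -1) means: collect statistics for all columns
            return allColumns;
        }
        return allColumns.subList(0, Math.min(dataSkippingNumIndexedCols, allColumns.size()));
    }
}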

@pajaks (Member, Author):

This is a preexisting issue, as Trino currently analyzes all columns during write. Issue for improvement: #17057

@ebyhr (Member) commented Apr 26, 2023

findinpath added the release-notes label 4 hours ago

@findinpath We use the release-notes label for PRs that add release note documentation, like #17002

@ebyhr (Member) left a comment

Still reviewing the last commit.

@pajaks (Member, Author) commented May 9, 2023

Rebased to resolve conflicts.


@pajaks (Member, Author) commented Jul 3, 2023

First push is a rebase; the second addresses comments.

@ebyhr (Member) commented Jul 11, 2023

Could you rebase on master to resolve conflicts?

@pajaks (Member, Author) commented Jul 11, 2023

First push to resolve conflicts, second with addressed comments.

@findepi (Member) commented Jul 21, 2023

@alexjo2144 can you ptal?

@pajaks (Member, Author) commented Aug 9, 2023

First push: rebase
Second push: adaptation for case sensitivity

@ebyhr (Member) commented Aug 10, 2023

/test-with-secrets sha=1422b6a4102e9cc424432fcc37e4ab4698af8aa9

@findepi (Member) left a comment

(fmt)

@github-actions

The CI workflow run with tests that require additional secrets has been started: https://github.com/trinodb/trino/actions/runs/5820602635

@ebyhr ebyhr merged commit 16a730c into trinodb:master Aug 21, 2023
@github-actions github-actions bot added this to the 425 milestone Aug 21, 2023
Labels
cla-signed, delta-lake (Delta Lake connector)

6 participants