Add statistics collection for added column #16109

Merged
merged 2 commits into from
Mar 20, 2023

Conversation

pajaks
Member

@pajaks pajaks commented Feb 14, 2023

Description

Currently, when a column is added to a table that already has statistics, total size is not collected for the newly added column.
NDV statistics work fine.

Currently:

create table test (varchar_1 VARCHAR);
insert into test values ('a');
analyze test;
alter table test add column varchar_2 varchar;
insert into test values ('b','c');
analyze test;
show stats for test;

 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 varchar_1   |       2.0 |                   2.0 |            0.0 |      NULL | NULL      | NULL
 varchar_2   |      NULL |                   1.0 |            0.1 |      NULL | NULL      | NULL
 NULL        |      NULL |                  NULL |           NULL |       2.0 | NULL      | NULL

After change:

 column_name | data_size | distinct_values_count | nulls_fraction | row_count | low_value | high_value
-------------+-----------+-----------------------+----------------+-----------+-----------+------------
 varchar_1   |       2.0 |                   2.0 |            0.0 |      NULL | NULL      | NULL
 varchar_2   |       1.0 |                   1.0 |            0.1 |      NULL | NULL      | NULL
 NULL        |      NULL |                  NULL |           NULL |       2.0 | NULL      | NULL

Additional context and related issues

Release notes

( ) This is not user-visible or docs only and no release notes are required.
( ) Release notes are required, please propose a release note for me.
( ) Release notes are required, with the following suggested text:

# Delta Lake
* Collect statistics for newly added column

@cla-bot cla-bot bot added the cla-signed label Feb 14, 2023
@@ -2215,9 +2212,7 @@ private TableStatisticsMetadata getStatisticsCollectionMetadata(
                .filter(columnMetadata -> analyzeColumnNames.contains(columnMetadata.getName()))
                .forEach(columnMetadata -> {
                    if (!(columnMetadata.getType() instanceof FixedWidthType)) {
                        if (existingStatistics.isEmpty() || totalSizeStatisticsExists(existingStatistics.get().getColumnStatistics(), columnMetadata.getName())) {
Member Author

@pajaks pajaks Feb 14, 2023

The check for previously collected total size statistics is probably redundant. If a column was not selected in an ANALYZE ... WITH query, it should already be filtered out by .filter(columnMetadata -> analyzeColumnNames.contains(columnMetadata.getName())).
I could not find any other case in which those statistics are missing, except when a new column is added.
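To make the effect of the guard concrete, here is a minimal standalone sketch, not the connector code. The class name, both method names, and the map-based model of existing statistics are hypothetical, the real totalSizeStatisticsExists may be implemented differently, and the FixedWidthType check is left out for brevity.

```java
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;

public class TotalSizeGuardSketch
{
    // Hypothetical model: existing statistics map a column name to its
    // previously collected total size, if one was collected.

    // Behavior with the guard: total size is requested only when there are no
    // existing statistics at all, or when the column already has a total size.
    static Set<String> columnsWithTotalSizeOld(
            List<String> analyzedColumns,
            Optional<Map<String, Optional<Long>>> existingStatistics)
    {
        Set<String> result = new LinkedHashSet<>();
        for (String column : analyzedColumns) {
            if (existingStatistics.isEmpty() || totalSizeExists(existingStatistics.get(), column)) {
                result.add(column);
            }
        }
        return result;
    }

    // Behavior without the guard: every analyzed column gets a total size.
    static Set<String> columnsWithTotalSizeNew(List<String> analyzedColumns)
    {
        return new LinkedHashSet<>(analyzedColumns);
    }

    // Assumed semantics of the existence check, for illustration only.
    static boolean totalSizeExists(Map<String, Optional<Long>> statistics, String column)
    {
        return statistics.containsKey(column) && statistics.get(column).isPresent();
    }

    public static void main(String[] args)
    {
        // varchar_1 was analyzed before; varchar_2 was added afterwards.
        Map<String, Optional<Long>> previous = Map.of("varchar_1", Optional.of(1L));
        List<String> analyzed = List.of("varchar_1", "varchar_2");

        System.out.println(columnsWithTotalSizeOld(analyzed, Optional.of(previous)));
        System.out.println(columnsWithTotalSizeNew(analyzed));
    }
}
```

In this model the guarded version skips varchar_2 because it has no earlier total size, which matches the NULL data_size shown in the description.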

Member

@alexjo2144 alexjo2144 Mar 16, 2023

The other situation would be if you analyzed a table before we collected TOTAL_SIZE_IN_BYTES, then upgraded Trino, and then analyzed the table again/did an insert.

That was added in Trino 387 (June 2022). @ebyhr @findepi, do you think that's long enough?

Member

good point, Alex. let's assume this is fine.

@pajaks pajaks marked this pull request as ready for review February 15, 2023 06:54
@pajaks pajaks self-assigned this Feb 15, 2023
@pajaks pajaks added the delta-lake Delta Lake connector label Feb 22, 2023
@alexjo2144
Member

alexjo2144 commented Mar 2, 2023

What do you think about instead modifying the addColumn code to add the new column to the extended stats file with a data size of zero?

@pajaks
Member Author

pajaks commented Mar 3, 2023

> What do you think about instead modifying the addColumn code to add the new column to the extended stats file with a data size of zero?

That's an option. But if there is no scenario in which totalSizeStatisticsExists is useful, removing it leaves the code less complex.

It will also start showing 0 instead of null when using ANALYZE x WITH (columns = ...). I'm not sure it's a big issue, but it can be misleading.
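The concern about 0 versus null can be illustrated with a small hypothetical model. It is a sketch of the proposed alternative only, not of any existing Trino code. The class, the method names, and the map standing in for the extended stats file are all assumptions.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.Set;

public class SeededDataSizeSketch
{
    // Hypothetical model of the extended stats file: column name -> total data size.
    // A missing entry stands for "no size collected", rendered as NULL.

    // The proposed alternative: seed the new column with a data size of zero.
    static Map<String, Long> addColumnSeedingZero(Map<String, Long> stats, String newColumn)
    {
        Map<String, Long> updated = new HashMap<>(stats);
        updated.put(newColumn, 0L);
        return updated;
    }

    // Models ANALYZE with an explicit column list: only the listed columns are refreshed.
    static Map<String, Long> analyze(Map<String, Long> stats, Set<String> columns, Map<String, Long> measuredSizes)
    {
        Map<String, Long> updated = new HashMap<>(stats);
        for (String column : columns) {
            updated.put(column, measuredSizes.get(column));
        }
        return updated;
    }

    // Renders a data size the way a stats listing would: NULL when nothing was collected.
    static String dataSize(Map<String, Long> stats, String column)
    {
        Long size = stats.get(column);
        return size == null ? "NULL" : String.valueOf(size);
    }

    public static void main(String[] args)
    {
        Map<String, Long> stats = Map.of("varchar_1", 1L);
        Map<String, Long> measured = Map.of("varchar_1", 2L, "varchar_2", 1L);

        // The new column is seeded with zero, then only varchar_1 is analyzed.
        Map<String, Long> seeded = addColumnSeedingZero(stats, "varchar_2");
        Map<String, Long> afterAnalyze = analyze(seeded, Set.of("varchar_1"), measured);

        // varchar_2 holds data but still reports the seeded zero.
        System.out.println(dataSize(afterAnalyze, "varchar_2"));
    }
}
```

Under these assumptions, a column that was never analyzed reports a size of 0 even though it holds data, whereas without seeding it would report NULL.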

Member

@findepi findepi left a comment

Member

@alexjo2144 alexjo2144 left a comment

If Yuya and Piotr are okay with this: #16109 (comment) then this looks good to me.

@pajaks
Member Author

pajaks commented Mar 17, 2023

Rebased to include fix for skipped tests in CI

@ebyhr ebyhr merged commit 842b7b2 into trinodb:master Mar 20, 2023
@github-actions github-actions bot added this to the 411 milestone Mar 20, 2023