
Feature Request - Auto Analyze Table #581

Open
devlucasc opened this issue Jan 13, 2021 · 17 comments
Labels
acknowledged (This issue has been read and acknowledged by Delta admins) · enhancement (New feature or request)

Comments

@devlucasc

devlucasc commented Jan 13, 2021

Is it possible to enable an option to auto-analyze a Delta table?

For example: ANALYZE TABLE x COMPUTE STATISTICS FOR ALL COLUMNS
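
(A minimal sketch of the pattern this request would automate, assuming a Spark session and a table registered as x; the wrapper itself is hypothetical:)

    // Hypothetical helper: keep catalog stats fresh by re-running
    // ANALYZE TABLE after every write to the table `x`.
    def writeAndAnalyze(df: org.apache.spark.sql.DataFrame): Unit = {
      df.write.format("delta").mode("append").saveAsTable("x")
      df.sparkSession.sql("ANALYZE TABLE x COMPUTE STATISTICS FOR ALL COLUMNS")
    }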

@rubenssoto

It would be very good if Delta Lake OSS had column statistics for Parquet file pruning.

@scottsand-db scottsand-db added the enhancement New feature or request label Oct 7, 2021
@scottsand-db
Collaborator

We are currently reviewing this issue and will follow up shortly.

@scottsand-db scottsand-db added the acknowledged This issue has been read and acknowledged by Delta admins label Oct 7, 2021
@felipepessoto
Contributor

@scottsand-db do you have any updates on this? Is it expected for the next release?
Thanks

@scottsand-db
Collaborator

Hi @felipepessoto, thanks for following up. Delta Lake 1.1 included per-column file stats collection + data skipping. Does this meet your needs?
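
(A minimal sketch of the kind of query those per-column file stats can accelerate; the path and column are hypothetical:)

    // Delta 1.2 records min/max/nullCount per column for each file it writes.
    // A selective filter lets the scan skip files whose [min, max] range
    // cannot contain matching rows.
    val events = spark.read.format("delta").load("/data/events")
    events.where("event_date = '2022-05-01'").count()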

@felipepessoto
Contributor

I need to test it. In my experiments with Parquet and Delta, running ANALYZE TABLE made the queries ~40% faster than both Parquet without ANALYZE TABLE and Delta.

@felipepessoto
Contributor

BTW, do you mean Delta 1.2? I don't see these changes in the 1.1 changelog.

@scottsand-db
Collaborator

Yup, my bad. I meant 1.2.

@felipepessoto
Contributor

@scottsand-db in my test with 1.2 it didn't improve performance. Looking at the query plans, they are the same as in 1.1, except for PreparedDeltaFileIndex in place of TahoeLogFileIndex.

Are stats expected to improve performance for queries like this? https://github.com/Agirish/tpcds/blob/master/query93.sql
(with inner queries and joins)

@zsxwing
Member

zsxwing commented May 27, 2022

@felipepessoto did you re-generate the data for your tests? Stats are only written by Delta 1.2, so you would need to re-generate the data in order to leverage the data skipping improvements added by Delta 1.2.

@felipepessoto
Contributor

felipepessoto commented May 27, 2022

UPDATE: I found the stats (min, max, null count) in the Delta log, but I'm not sure why they are not being used during the query.

Yes, I regenerated it. Do you know how I can check the stats?

DESCRIBE EXTENDED doesn't show any:

[screenshot: DESCRIBE EXTENDED output showing no column statistics]

If I do the same using Parquet, after running ANALYZE TABLE:

[screenshot: column statistics shown for the Parquet table after ANALYZE TABLE]

BTW, do you know the differences between the two approaches, Delta stats and ANALYZE TABLE?

Thanks
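
(Related to the UPDATE above, a sketch of reading the per-file stats straight out of the transaction log, since DESCRIBE EXTENDED only surfaces catalog-level stats; the path is hypothetical:)

    // Each commit JSON contains `add` actions; `add.stats` is a JSON string
    // holding numRecords, minValues, maxValues, and nullCount for that file.
    val log = spark.read.json("/data/events/_delta_log/*.json")
    log.where("add IS NOT NULL")
      .select("add.path", "add.stats")
      .show(truncate = false)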

@scottsand-db
Collaborator

Hi @felipepessoto. The main advantage of ANALYZE is that it computes different stats than "column stats + data skipping". It computes stats on unique/distinct values in a table. These stats are useful for improving the performance of JOINs.

Column stats, on the other hand, are used to help skip files during scans. So, if you perform a filter and then a JOIN, it is the filter where column stats will help.

The reason Delta Lake currently doesn't support ANALYZE is that ANALYZE looks directly at the Parquet files. We would instead need to read the transaction log and use that metadata to tell us which Parquet files to read.

If I'm missing any details I'm sure that @zsxwing can fill them in.
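
(To make the distinction concrete, a small sketch; the table name and predicate are hypothetical:)

    // Table-wise stats: row counts and distinct values per column, used by
    // the optimizer for join planning (e.g., choosing broadcast joins).
    spark.sql("ANALYZE TABLE x COMPUTE STATISTICS FOR ALL COLUMNS")

    // File-wise stats: at scan time, Delta compares the predicate against
    // each file's min/max and skips files that cannot match.
    spark.sql("SELECT * FROM x WHERE id BETWEEN 100 AND 200").show()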

@felipepessoto
Contributor

Some more questions, please:

  1. Would it be correct to say that Delta stats are file-wise while ANALYZE stats are table-wise?
  2. Is ANALYZE a Spark feature, while Delta stats are part of the Delta spec?
  3. Is there any work in progress to support ANALYZE for Delta?

I'm wondering how much effort it would be to change ANALYZE to work with Delta, reading from the transaction log like you said,
or alternatively to change the Delta spec to add the other stats (if it makes sense).

@zsxwing
Member

zsxwing commented May 31, 2022

  1. Would it be correct to say that Delta stats are file-wise while ANALYZE stats are table-wise?

Yep.

2. Is ANALYZE a Spark feature, while Delta stats are part of the Delta spec?

Yep. ANALYZE is a Spark feature, and technically it only supports built-in file formats; today Spark doesn't provide a stats interface for non-built-in data sources. In addition, it requires users to run ANALYZE themselves, and it's easy to get incorrect answers when users forget to run ANALYZE.

3. Is there any work in progress to support ANALYZE for Delta?

IIRC, you are basically asking whether Delta can support table-wise stats. If so, this is not on our roadmap, and it won't be an easy project. Unlike Parquet, Delta Lake needs to provide ACID, and updating the table-wise stats on each write with ACID guarantees is challenging.

An alternative, though not ideal, solution is computing the table-wise stats from the file-wise stats when reading a Delta table. It would take a bit of time to compute the table-wise stats and wouldn't be as fast as Parquet, but it would preserve ACID, which is critical for Delta Lake users. This is just a brainstorm. Feel free to post your suggestions!
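
(A rough sketch of that brainstorm, folding the file-wise stats into table-wise ones at read time. The path, the single id column, and reading the raw log without replaying remove actions are all simplifications:)

    import org.apache.spark.sql.functions._

    val adds = spark.read.json("/data/events/_delta_log/*.json")
      .where("add IS NOT NULL")

    // `add.stats` is a JSON string; parse only the fields we need.
    val statsSchema =
      "numRecords BIGINT, minValues STRUCT<id: BIGINT>, maxValues STRUCT<id: BIGINT>"
    adds.select(from_json(col("add.stats"), statsSchema, Map.empty[String, String]).as("s"))
      .agg(
        sum("s.numRecords").as("rowCount"), // table-wise row count
        min("s.minValues.id").as("minId"),  // min of per-file mins
        max("s.maxValues.id").as("maxId"))  // max of per-file maxes
      .show()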

@felipepessoto
Contributor

felipepessoto commented Jun 23, 2022

@zsxwing, I was looking at this PR: #840, which falls back to v1 to implement ANALYZE TABLE.

It seems the only problem with analyzeTable is calculating the total size, because the calculateTotalSize method relies on catalogTable.storage.locationUri and scans everything inside it (or does the equivalent when the data is partitioned).

For count it seems fine to me:

sparkSession.table(tableIdentWithDB).count()

And computeColumnStats also looks good.

Do you think the same?
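
(If it helps, a sketch of computing the total size from the transaction log instead of listing locationUri; the path is hypothetical, and it's naive in that it doesn't subtract files removed by later commits:)

    import org.apache.spark.sql.functions._

    // Sum the per-file sizes recorded in the `add` actions instead of
    // walking catalogTable.storage.locationUri.
    spark.read.json("/data/events/_delta_log/*.json")
      .where("add IS NOT NULL")
      .agg(sum("add.size").as("totalSizeInBytes"))
      .show()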

@zsxwing
Member

zsxwing commented Jun 24, 2022

@felipepessoto yep, you are right. Do you know how Spark uses these stats? Only for optimization, or does it also use the stats to return answers?

@felipepessoto
Contributor

From my experiment, it seems they are used only for optimization.
I executed a SELECT COUNT(*) FROM table query, and Spark still scans the Parquet files instead of using the data available in the stats to return the result.
The same goes for MAX/MIN.
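
(A sketch of how one can check this; the table name is hypothetical:)

    // The physical plan still shows a full scan of the Parquet files,
    // i.e. the stats are not used to answer the aggregate directly.
    spark.sql("SELECT COUNT(*) FROM x").explain()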

@felipepessoto
Contributor

apache/spark#41111
