Delta Lake analyze improvements #15967

Open
alexjo2144 opened this issue Feb 3, 2023 · 1 comment
Labels: enhancement (New feature or request)

Comments

@alexjo2144 (Member)

The Delta connector currently only collects statistics for NDVs (using an HLL) and column sizes. This was done because, in most situations, the transaction log already contains the rest of the statistics we need on a per-file basis. We could improve on that in a few ways.
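
For context, those per-file statistics live in the JSON `stats` string carried on each `add` action in the transaction log. Below is a minimal sketch of that payload and how it can be read; the field names follow the Delta protocol's per-file statistics, but the column names and values are made up, and Jackson is used purely for illustration rather than as the connector's actual parsing path.

```java
import com.fasterxml.jackson.databind.JsonNode;
import com.fasterxml.jackson.databind.ObjectMapper;

public class DeltaFileStatsExample
{
    public static void main(String[] args)
            throws Exception
    {
        // Example "stats" payload as it appears on an "add" action in the
        // Delta transaction log; column names and values are invented here.
        String stats = "{"
                + "\"numRecords\": 1000,"
                + "\"minValues\": {\"order_date\": \"2023-01-01\", \"total\": 4.5},"
                + "\"maxValues\": {\"order_date\": \"2023-01-31\", \"total\": 995.0},"
                + "\"nullCount\": {\"order_date\": 0, \"total\": 3}"
                + "}";

        JsonNode node = new ObjectMapper().readTree(stats);
        System.out.println("numRecords:   " + node.get("numRecords").asLong());
        System.out.println("min(total):   " + node.get("minValues").get("total").asDouble());
        System.out.println("max(total):   " + node.get("maxValues").get("total").asDouble());
        System.out.println("nulls(total): " + node.get("nullCount").get("total").asLong());
    }
}
```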

1: Add file-level stats to files without them

Files written by older versions of Delta's Spark writer may not have min/max/null count stats attached to their entries in the transaction log. We could compute and add those stats when analyzing the table.
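
As a rough sketch of the per-column accumulation an ANALYZE pass would need for such files, the hypothetical class below tracks min, max, and null count over a file's values; writing the result back into the transaction log (or wherever backfilled stats would live) is intentionally left out, and none of this is connector code.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical accumulator for the min/max/null count an ANALYZE pass would
// compute per column for files whose "add" actions carry no stats.
public class ColumnStatsAccumulator<T>
{
    private final Comparator<T> comparator;
    private T min;
    private T max;
    private long nullCount;
    private long numRecords;

    public ColumnStatsAccumulator(Comparator<T> comparator)
    {
        this.comparator = comparator;
    }

    public void add(T value)
    {
        numRecords++;
        if (value == null) {
            nullCount++;
            return;
        }
        if (min == null || comparator.compare(value, min) < 0) {
            min = value;
        }
        if (max == null || comparator.compare(value, max) > 0) {
            max = value;
        }
    }

    @Override
    public String toString()
    {
        return "min=" + min + ", max=" + max + ", nullCount=" + nullCount + ", numRecords=" + numRecords;
    }

    public static void main(String[] args)
    {
        ColumnStatsAccumulator<Long> total = new ColumnStatsAccumulator<>(Comparator.naturalOrder());
        List<Long> values = new ArrayList<>(List.of(42L, 7L, 99L));
        values.add(null);
        values.forEach(total::add);
        System.out.println(total); // min=7, max=99, nullCount=1, numRecords=4
    }
}
```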

2: Collect partition-level min/max/null count stats

Accumulating statistics from all data files during planning is relatively expensive, but partition-level stats could provide a good-enough estimate while reducing the amount of statistics that has to be read.
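
One way to picture this: roll the per-file min/max/null counts up into a single entry per partition, so planning reads one record per partition instead of one per data file. The types (`FileStats`, `PartitionStats`) and values below are hypothetical stand-ins, not connector classes.

```java
import java.util.HashMap;
import java.util.Map;

// Sketch of rolling per-file stats up to the partition level.
public class PartitionStatsRollup
{
    record FileStats(String partition, long min, long max, long nullCount) {}

    record PartitionStats(long min, long max, long nullCount)
    {
        // Combine this partition's running stats with one more file's stats.
        PartitionStats merge(FileStats file)
        {
            return new PartitionStats(
                    Math.min(min, file.min()),
                    Math.max(max, file.max()),
                    nullCount + file.nullCount());
        }
    }

    public static void main(String[] args)
    {
        FileStats[] files = {
                new FileStats("date=2023-02-01", 5, 90, 1),
                new FileStats("date=2023-02-01", 12, 300, 0),
                new FileStats("date=2023-02-02", 7, 40, 2),
        };

        Map<String, PartitionStats> byPartition = new HashMap<>();
        for (FileStats file : files) {
            byPartition.merge(
                    file.partition(),
                    new PartitionStats(file.min(), file.max(), file.nullCount()),
                    (existing, incoming) -> existing.merge(file));
        }
        byPartition.forEach((partition, stats) -> System.out.println(partition + " -> " + stats));
    }
}
```

Null counts add across files and min/max merge trivially; NDVs would still need a mergeable sketch such as the HLL the connector already uses.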

3: Reconcile the Trino stats storage with Spark's

Trino and Spark have two completely separate stats storage solutions. It would be nice if analyzing a table in one engine produced stats readable by the other, similar to Iceberg's Puffin stats files.

cc: @findepi @pajaks @findinpath

alexjo2144 added the enhancement (New feature or request) label on Feb 3, 2023
@alexjo2144 (Member, Author)

Relates to: #15135
