Cache collected file statistics #3649
FYI, collecting statistics is disabled when creating a table from SQL ( https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/context.rs#L507 -> https://github.com/apache/arrow-datafusion/blob/master/datafusion/core/src/execution/options.rs#L195 ). I'm not sure if it's intended.
Codecov Report
@@ Coverage Diff @@
## master #3649 +/- ##
=======================================
Coverage 86.01% 86.01%
=======================================
Files 300 300
Lines 56543 56595 +52
=======================================
+ Hits 48633 48682 +49
- Misses 7910 7913 +3
Looks good to me -- thank you @mateuszkj
Benchmark runs are scheduled for baseline = f706902 and contender = 85c11c1. 85c11c1 is a master commit associated with this PR. Results will be available as each benchmark for each run completes.
Which issue does this PR close?
Closes #871 (but I'm not sure about that).
Rationale for this change
Reduce I/O by collecting statistics for files (parquet) only once in `ListingTable`.
What changes are included in this PR?
Store collected statistics in cache per file location.
Cache is invalidated when:
Are there any user-facing changes?
No.
Or maybe mention that when `collect_stats` is enabled, the first query can be much slower due to the increased I/O of collecting statistics. Cached statistics are invalidated on the next query when a table file has changed.
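To illustrate the idea of caching statistics per file location, here is a minimal sketch and not the code from this PR. The names `FileMeta`, `FileStats`, and `StatsCache` are hypothetical, and treating a change in size or modification time as "the file has changed" is an assumption made for this example.

```rust
use std::collections::HashMap;
use std::time::SystemTime;

/// Hypothetical stand-in for the statistics collected for one file.
#[derive(Debug, Clone, PartialEq)]
pub struct FileStats {
    pub num_rows: usize,
    pub total_byte_size: usize,
}

/// Hypothetical file metadata used to detect that a file has changed.
#[derive(Debug, Clone, PartialEq)]
pub struct FileMeta {
    pub location: String,
    pub size: u64,
    pub last_modified: SystemTime,
}

/// Cache keyed by file location. Each entry remembers the metadata
/// that was current when the statistics were collected.
#[derive(Default)]
pub struct StatsCache {
    entries: HashMap<String, (FileMeta, FileStats)>,
}

impl StatsCache {
    /// Returns cached statistics only if the stored metadata still matches
    /// (assumption: a different size or mtime means the file changed).
    pub fn get(&self, meta: &FileMeta) -> Option<FileStats> {
        match self.entries.get(&meta.location) {
            Some((cached_meta, stats)) if cached_meta == meta => Some(stats.clone()),
            _ => None,
        }
    }

    /// Stores statistics, replacing any stale entry for the same location.
    pub fn put(&mut self, meta: FileMeta, stats: FileStats) {
        self.entries.insert(meta.location.clone(), (meta, stats));
    }

    /// Returns cached statistics, or runs the expensive `collect` closure
    /// once and caches its result.
    pub fn get_or_collect<F>(&mut self, meta: &FileMeta, collect: F) -> FileStats
    where
        F: FnOnce(&FileMeta) -> FileStats,
    {
        if let Some(stats) = self.get(meta) {
            return stats;
        }
        let stats = collect(meta);
        self.put(meta.clone(), stats.clone());
        stats
    }
}
```

With this shape, the first query pays the I/O cost inside `collect`, later queries on an unchanged file hit the cache, and a file whose metadata differs is re-collected on the next query.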