Make StatisticsCache shared at the session level #7556
Comments
@alamb @Dandandan @yahoNanJing PTAL. It has been a long time since I was involved in the community 🤣
https://duckdb.org/docs/sql/configuration
Another bottleneck we found is listing thousands of files under a remote storage path.
I think keeping a metadata cache on the RuntimeEnv is reasonable as long as […]
The rationale for something simple built in but with a configurable API is that the exact caching strategy is likely to vary tremendously from system to system (for example, if there is a local file-based parquet cache, storing metadata in memory might not make sense; likewise for how to do cache eviction, enforce limits, etc.). Therefore it is unlikely that anything in DataFusion will cover all use cases, so what is built in should be simple and allow users to add whatever specific caching policy they want. Does that make sense @Ted-Jiang ?
This suggestion is very important @alamb
@alamb Thanks for the suggestions! This makes sense. I will extract a trait like […]. As for where we should call […]
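As a rough illustration, a pluggable cache trait might look something like the sketch below. The names (CacheAccessor, MemoryCache) and the method set are made up for this sketch, not DataFusion's actual API:

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::Mutex;

/// Hypothetical accessor trait: callers only see get/put/remove, so the
/// policy behind it (in-memory, file-backed, eviction, size limits, ...)
/// can be swapped by the user.
pub trait CacheAccessor<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: K, value: V) -> Option<V>;
    fn remove(&self, key: &K) -> Option<V>;
    fn len(&self) -> usize;
}

/// A simple built-in default: an unbounded in-memory map.
pub struct MemoryCache<K, V> {
    inner: Mutex<HashMap<K, V>>,
}

impl<K, V> Default for MemoryCache<K, V> {
    fn default() -> Self {
        Self { inner: Mutex::new(HashMap::new()) }
    }
}

impl<K, V> CacheAccessor<K, V> for MemoryCache<K, V>
where
    K: Hash + Eq + Send,
    V: Clone + Send,
{
    fn get(&self, key: &K) -> Option<V> {
        self.inner.lock().unwrap().get(key).cloned()
    }
    fn put(&self, key: K, value: V) -> Option<V> {
        self.inner.lock().unwrap().insert(key, value)
    }
    fn remove(&self, key: &K) -> Option<V> {
        self.inner.lock().unwrap().remove(key)
    }
    fn len(&self) -> usize {
        self.inner.lock().unwrap().len()
    }
}
```

A file-backed or bounded LRU implementation could implement the same trait without changing any of the planning code that consults the cache.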
IOx caches (effectively) the list of files and a (very small) subset of the statistics (in our case just the min/max timestamp values). Our metadata catalog (see below) did not have space to store the entire parquet file metadata (with per-row-group statistics). Also, at the moment IOx has an in-memory cache of the actual parquet data, which means it effectively always reads the entire objects from storage (though we may change this at some point).
IOx has its own, separate metadata catalog that stores information about the schema and which files store data for each table, as well as which partition (IOx typically keeps the data segregated in daily partitions). Thus we never use […]. The high level architecture is described here: https://www.influxdata.com/blog/influxdb-3-0-system-architecture/
Thanks for your detailed comments, I will take a look at this blog.
Is your feature request related to a problem or challenge?
In our system we pass a logical plan to DataFusion with statistics collection enabled. The source table is on remote storage, and sometimes it takes a few seconds to read the parquet metadata to collect statistics.
From the log: […]
So I checked the code and found there is a cache called StatisticsCache, constructed here: https://github.com/apache/arrow-datafusion/blob/abea8938b571a4aecddc7185b3acacadcc7dd854/datafusion/core/src/datasource/listing/table.rs#L656
It seems that every time we build a plan, a fresh empty cache is inserted, so only statistics inference for the same file within the same plan benefits from it.
So I want to share the statistics cache at the session level 😄 to mitigate the instability of fetching remote file statistics. I think many other query engines do this too.
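To make the difference concrete, here is a minimal standalone sketch (not DataFusion code; collect_statistics, SharedStatsCache, and the paths are made up) contrasting a per-plan cache with a session-scoped one:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Clone)]
struct Statistics {
    num_rows: usize,
}

type SharedStatsCache = Arc<Mutex<HashMap<String, Statistics>>>;

/// Return cached statistics if present, otherwise simulate the slow
/// remote parquet-metadata read and remember the result.
fn collect_statistics(path: &str, cache: &SharedStatsCache) -> Statistics {
    if let Some(hit) = cache.lock().unwrap().get(path).cloned() {
        return hit; // cache hit: no remote fetch
    }
    let stats = Statistics { num_rows: 42 }; // stand-in for the remote read
    cache.lock().unwrap().insert(path.to_string(), stats.clone());
    stats
}

fn main() {
    // Today (per plan): a fresh, empty cache is built for every plan,
    // so the second plan over the same file pays the remote read again.
    collect_statistics("s3://bucket/a.parquet", &SharedStatsCache::default());
    collect_statistics("s3://bucket/a.parquet", &SharedStatsCache::default());

    // Proposed (per session): one cache shared across plans, so only
    // the first plan pays the remote metadata read.
    let session_cache = SharedStatsCache::default();
    collect_statistics("s3://bucket/a.parquet", &session_cache);
    let stats = collect_statistics("s3://bucket/a.parquet", &session_cache); // hit
    println!("cached num_rows = {}", stats.num_rows);
}
```

In a real implementation the cache key would presumably also include something like the file's size or last-modified timestamp, so that entries for files that changed on remote storage are not reused.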
Describe the solution you'd like
Add a cache manager to handle all caches during the session lifetime.
https://github.com/apache/arrow-datafusion/blob/a38480951f40abce7ee2d5919251a1d1607f1dee/datafusion/execution/src/runtime_env.rs#L44-L50
Use the SessionState to pass cached results to each plan.
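A rough sketch of the proposed wiring, assuming a cache manager hangs off the runtime environment and is consulted during planning (RuntimeEnvSketch, CacheManager, and build_plan below are placeholders, not the actual RuntimeEnv/SessionState API):

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

#[derive(Clone, Default)]
struct Statistics {
    num_rows: Option<usize>,
}

/// One place that owns every session-lifetime cache
/// (file statistics, listed files, parquet metadata, ...).
#[derive(Default)]
struct CacheManager {
    file_statistics: Mutex<HashMap<String, Statistics>>,
    listed_files: Mutex<HashMap<String, Vec<String>>>,
}

/// Placeholder for RuntimeEnv: lives as long as the session and is
/// shared by every plan built against it.
struct RuntimeEnvSketch {
    cache_manager: Arc<CacheManager>,
}

fn build_plan(env: &RuntimeEnvSketch, table_path: &str) {
    // Instead of constructing an empty StatisticsCache per plan,
    // planning consults the shared caches on the runtime env.
    let files = env
        .cache_manager
        .listed_files
        .lock()
        .unwrap()
        .get(table_path)
        .cloned()
        .unwrap_or_default();
    let stats = env
        .cache_manager
        .file_statistics
        .lock()
        .unwrap()
        .get(table_path)
        .cloned()
        .unwrap_or_default();
    println!(
        "{table_path}: {} cached file entries, cached num_rows = {:?}",
        files.len(),
        stats.num_rows
    );
}

fn main() {
    let env = RuntimeEnvSketch {
        cache_manager: Arc::new(CacheManager::default()),
    };
    build_plan(&env, "s3://bucket/table/");
    build_plan(&env, "s3://bucket/table/"); // second plan reuses the same caches
}
```

The existing per-plan code path could then fall back to this shared cache, and users could disable it or swap in their own policy (e.g. via runtime configuration), in line with the configurable-API point above.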
Describe alternatives you've considered
No response
Additional context
No response