
Make StatisticsCache share in session level #7556

Closed · Ted-Jiang opened this issue Sep 14, 2023 · 8 comments · Fixed by #7570

Labels: enhancement (New feature or request)

Comments

@Ted-Jiang (Member) commented Sep 14, 2023

Is your feature request related to a problem or challenge?

In our system we pass logical plans to DataFusion with statistics collection enabled. The source tables live on remote storage, and it sometimes takes a few seconds to read the parquet metadata to collect statistics.
From the log:

 datafusion::datasource::listing::table: Not hit cache infer_stats ObjectMeta { location: Path { raw: "working-dir/..-examples-test_case_data-reusemeta-metadata/ddltest/parquet/17d6373b-57ef-f370-34a6-1bd37d156a76/fa2ccb1e-5470-88a2-2fcb-b19779597e96/1/part-00000-f0bfae88-e929-4af1-99be-2599f2b51b3c-c000.snappy.parquet" }, last_modified: 2023-05-18T09:53:04.716427232Z, e_tag: None }, cost 1.5161s 

So I checked the code and saw there is a cache called StatisticsCache constructed here:
https://github.com/apache/arrow-datafusion/blob/abea8938b571a4aecddc7185b3acacadcc7dd854/datafusion/core/src/datasource/listing/table.rs#L656
It seems an empty cache is inserted every time a plan is built, so only statistics inference for the same file within the same plan gets any benefit.

So I want to share the statistics cache at the session level 😄 to fix the unstable cost of fetching remote file statistics. I think many other query engines do this too.

Describe the solution you'd like

Add a cache manager to handle all caches during the session lifetime:
https://github.com/apache/arrow-datafusion/blob/a38480951f40abce7ee2d5919251a1d1607f1dee/datafusion/execution/src/runtime_env.rs#L44-L50

Use the SessionState to pass cached results to each plan.
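
To make this concrete, here is a minimal sketch of the shape such a cache manager could take. All names here (`CacheAccessor`, `DefaultFileStatisticsCache`, `CacheManager`, the `Statistics` placeholder) are illustrative assumptions, not the final API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Placeholder for `datafusion_common::Statistics`
/// (row count, per-column min/max, ...).
pub struct Statistics;

/// Hypothetical extension point: users supply their own eviction or
/// size-limit policy by implementing get/put.
pub trait CacheAccessor<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: K, value: V);
}

/// Simple built-in default: an unbounded in-memory map keyed by the
/// object-store path of the file.
#[derive(Default)]
pub struct DefaultFileStatisticsCache {
    inner: Mutex<HashMap<String, Arc<Statistics>>>,
}

impl CacheAccessor<String, Arc<Statistics>> for DefaultFileStatisticsCache {
    fn get(&self, key: &String) -> Option<Arc<Statistics>> {
        self.inner.lock().unwrap().get(key).cloned()
    }
    fn put(&self, key: String, value: Arc<Statistics>) {
        self.inner.lock().unwrap().insert(key, value);
    }
}

/// Held by the RuntimeEnv so every plan built during the session sees
/// the same cache, instead of each plan starting from an empty one.
pub struct CacheManager {
    pub file_statistics_cache: Arc<dyn CacheAccessor<String, Arc<Statistics>>>,
}
```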

Describe alternatives you've considered

No response

Additional context

No response

@Ted-Jiang Ted-Jiang added the enhancement New feature or request label Sep 14, 2023
@Ted-Jiang (Member, Author) commented:

@alamb @Dandandan @yahoNanJing PTAL. It's been a long time since I was involved in the community 🤣

@Ted-Jiang (Member, Author) commented Sep 14, 2023

https://duckdb.org/docs/sql/configuration
It seems DuckDB supports this too; its object cache option is described as "Whether or not object cache is used to cache e.g., Parquet metadata".

@Ted-Jiang (Member, Author) commented:

Another bottleneck we found is listing thousands of files under a remote storage path.
One example: a query that normally runs in 3 seconds sometimes takes 7 seconds because of the long time spent listing files on remote storage.

@alamb (Contributor) commented Sep 14, 2023

I think keeping a metadata cache on the RuntimeEnv is reasonable as long as

  1. There is a way to extend / disable the default behavior (as there is with the DiskManager and MemoryPool).
  2. The default implementation in DataFusion is simple

The rationale for something simple built in but a configurable API is that the exact caching strategy is likely to vary tremendously from system to system (for example, if there is a local file based parquet cache, storing metadata in memory might not make sense, or how to do cache eviction or enforce limits, etc).

Therefore it is unlikely that anything in DataFusion will cover all use cases, so what is built in should be simple and allow users to add whatever specific caching policy they want.

Does that make sense @Ted-Jiang?
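
To make point 1 concrete, here is a hedged sketch of what the extend/disable knob could look like, assuming a builder pattern analogous to the existing `RuntimeConfig::with_memory_pool`; `CacheManagerConfig` and its method are hypothetical names, not a committed design:

```rust
use std::sync::Arc;

/// Same illustrative trait and placeholder type as in the sketch in the
/// issue description, repeated so this snippet stands alone.
pub struct Statistics;
pub trait CacheAccessor<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: K, value: V);
}

/// Hypothetical config struct, following the way DiskManager and
/// MemoryPool are configured today: `None` disables statistics caching
/// entirely, `Some(...)` installs either the simple built-in default
/// or a user-supplied policy (LRU, TTL, disk-backed, ...).
#[derive(Default)]
pub struct CacheManagerConfig {
    pub file_statistics_cache:
        Option<Arc<dyn CacheAccessor<String, Arc<Statistics>>>>,
}

impl CacheManagerConfig {
    /// Builder-style override, mirroring `RuntimeConfig::with_memory_pool`.
    pub fn with_file_statistics_cache(
        mut self,
        cache: Option<Arc<dyn CacheAccessor<String, Arc<Statistics>>>>,
    ) -> Self {
        self.file_statistics_cache = cache;
        self
    }
}
```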

@liukun4515 (Contributor) commented:

> I think keeping a metadata cache on the RuntimeEnv is reasonable as long as
>
>   1. There is a way to extend / disable the default behavior (as there is with the DiskManager and MemoryPool).
>   2. The default implementation in DataFusion is simple
>
> The rationale for something simple built in but a configurable API is that the exact caching strategy is likely to vary tremendously from system to system (for example, if there is a local file based parquet cache, storing metadata in memory might not make sense, or how to do cache eviction or enforce limits, etc).

This suggestion is very important.

> Therefore it is unlikely that anything in DataFusion will cover all use cases, so what is built in should be simple and allow users to add whatever specific caching policy they want.
>
> Does that make sense @Ted-Jiang?

@alamb
Does InfluxDB IOx have a file statistics cache or a list-files cache?
How does IOx avoid having nodes visit remote storage when generating the execution plan?

@Ted-Jiang Ted-Jiang self-assigned this Sep 15, 2023
@Ted-Jiang (Member, Author) commented Sep 15, 2023

@alamb Thanks for the suggestions! That makes sense.

I will extract a trait like MemoryPool so users can extend it with their own caching policy.

As for where get or set should be called during execution: I think the only option is to hard-code the call sites in the physical plan, which means users cannot customize where the cache is read or written.
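
For illustration, a minimal sketch of what that hard-coded call site could look like; `get_or_infer_statistics` and `infer_stats` are hypothetical names standing in for the existing statistics-collection path:

```rust
use std::sync::Arc;

/// Stubs so the sketch stands alone; in DataFusion these would be the
/// real statistics type and the existing (expensive) metadata read.
pub struct Statistics;
pub trait CacheAccessor<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: K, value: V);
}
fn infer_stats(_path: &str) -> Statistics {
    // stands in for the multi-second parquet-metadata fetch from remote storage
    Statistics
}

/// The fixed call site inside statistics collection: consult the
/// session-level cache first, and only pay the remote read on a miss.
fn get_or_infer_statistics(
    cache: &dyn CacheAccessor<String, Arc<Statistics>>,
    path: &str,
) -> Arc<Statistics> {
    if let Some(stats) = cache.get(&path.to_string()) {
        return stats; // hit: no remote I/O at all
    }
    let stats = Arc::new(infer_stats(path)); // miss: read metadata once
    cache.put(path.to_string(), Arc::clone(&stats));
    stats
}
```

A real implementation would probably key on more than the raw path (for example the `last_modified` or e-tag from the `ObjectMeta` shown in the log above), so stale entries are invalidated when a file is rewritten.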

@alamb (Contributor) commented Sep 15, 2023

@liukun4515

> Does InfluxDB IOx have a file statistics cache or a list-files cache?

IOx caches (effectively) the list of files and a (very small) subset of the statistics (in our case just the min/max timestamp values). Our metadata catalog (see below) does not have space to store the entire parquet file metadata (with per-row-group statistics).

Also, at the moment IOx has an in-memory cache of the actual parquet data, which means it effectively always reads entire objects from storage (though we may change this at some point).

> How does IOx avoid having nodes visit remote storage when generating the execution plan?

IOx has its own separate metadata catalog that stores the schema, which files hold data for each table, and which partition each file belongs to (IOx typically keeps data segregated into daily partitions).

Thus we never use LIST operations on object store.

The high-level architecture is described here: https://www.influxdata.com/blog/influxdb-3-0-system-architecture/

@liukun4515 (Contributor) commented:

> (quoting @alamb's reply above in full)

Thanks for your detailed comments, I will take a look at this blog.
