
Make StatisticsCache share in session level #7556

Closed · Ted-Jiang opened this issue Sep 14, 2023 · 8 comments · Fixed by #7570

Labels: enhancement (New feature or request)

Comments

@Ted-Jiang (Member) commented Sep 14, 2023

Is your feature request related to a problem or challenge?

In our system we pass logical plans to DataFusion with statistics collection enabled. The source tables live on remote storage, and it sometimes takes a few seconds to read the parquet metadata to collect statistics.
From the log:

 datafusion::datasource::listing::table: Not hit cache infer_stats ObjectMeta { location: Path { raw: "working-dir/..-examples-test_case_data-reusemeta-metadata/ddltest/parquet/17d6373b-57ef-f370-34a6-1bd37d156a76/fa2ccb1e-5470-88a2-2fcb-b19779597e96/1/part-00000-f0bfae88-e929-4af1-99be-2599f2b51b3c-c000.snappy.parquet" }, last_modified: 2023-05-18T09:53:04.716427232Z, e_tag: None }, cost 1.5161s 

So I checked the code and saw there is a cache called StatisticsCache constructed here:
https://github.com/apache/arrow-datafusion/blob/abea8938b571a4aecddc7185b3acacadcc7dd854/datafusion/core/src/datasource/listing/table.rs#L656
It seems an empty cache is inserted every time a plan is built, so only statistics inference for the same file within the same plan gets any benefit.

So I want to share the statistics cache at the session level 😄 to fix the unstable cost of fetching remote file statistics. I think many other query engines do this too.

Describe the solution you'd like

Add a cache manager to handle all caches during the session lifetime:
https://github.com/apache/arrow-datafusion/blob/a38480951f40abce7ee2d5919251a1d1607f1dee/datafusion/execution/src/runtime_env.rs#L44-L50

Use the SessionState to pass cached results to each plan.
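
To make this concrete, here is a minimal sketch of the shape such a cache manager could take. All names here (`CacheAccessor`, `DefaultFileStatisticsCache`, `CacheManager`, the `Statistics` placeholder) are illustrative assumptions, not the final API:

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Placeholder for `datafusion_common::Statistics`
/// (row count, per-column min/max, ...).
pub struct Statistics;

/// Hypothetical extension point: users supply their own eviction or
/// size-limit policy by implementing get/put.
pub trait CacheAccessor<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: K, value: V);
}

/// Simple built-in default: an unbounded in-memory map keyed by the
/// object-store path of the file.
#[derive(Default)]
pub struct DefaultFileStatisticsCache {
    inner: Mutex<HashMap<String, Arc<Statistics>>>,
}

impl CacheAccessor<String, Arc<Statistics>> for DefaultFileStatisticsCache {
    fn get(&self, key: &String) -> Option<Arc<Statistics>> {
        self.inner.lock().unwrap().get(key).cloned()
    }
    fn put(&self, key: String, value: Arc<Statistics>) {
        self.inner.lock().unwrap().insert(key, value);
    }
}

/// Held by the RuntimeEnv so every plan built during the session sees
/// the same cache, instead of each plan starting from an empty one.
pub struct CacheManager {
    pub file_statistics_cache: Arc<dyn CacheAccessor<String, Arc<Statistics>>>,
}
```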

Describe alternatives you've considered

No response

Additional context

No response

@Ted-Jiang Ted-Jiang added the enhancement New feature or request label Sep 14, 2023
@Ted-Jiang (Member, Author) commented:

@alamb @Dandandan @yahoNanJing PTAL. It's been a long time since I was involved in the community 🤣

@Ted-Jiang (Member, Author) commented Sep 14, 2023

https://duckdb.org/docs/sql/configuration
It seems DuckDB supports this too; its object cache option is described as "Whether or not object cache is used to cache e.g., Parquet metadata".

@Ted-Jiang (Member, Author) commented:

Another bottleneck we found is listing thousands of files under a remote storage path.
One example: a query that normally runs in 3 seconds sometimes takes 7 seconds because of the long time spent listing files on remote storage.

@alamb (Contributor) commented Sep 14, 2023

I think keeping a metadata cache on the RuntimeEnv is reasonable as long as

  1. There is a way to extend / disable the default behavior (as there is with the DiskManager and MemoryPool).
  2. The default implementation in DataFusion is simple

The rationale for something simple built in but a configurable API is that the exact caching strategy is likely to vary tremendously from system to system (for example, if there is a local file based parquet cache, storing metadata in memory might not make sense, or how to do cache eviction or enforce limits, etc).

Therefore it is unlikely that anything in DataFusion will cover all use cases, so what is built in should be simple and allow users to add whatever specific caching policy they want.

Does that make sense @Ted-Jiang?
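
To make point 1 concrete, here is a hedged sketch of what the extend/disable knob could look like, assuming a builder pattern analogous to the existing `RuntimeConfig::with_memory_pool`; `CacheManagerConfig` and its method are hypothetical names, not a committed design:

```rust
use std::sync::Arc;

/// Same illustrative trait and placeholder type as in the sketch in the
/// issue description, repeated so this snippet stands alone.
pub struct Statistics;
pub trait CacheAccessor<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: K, value: V);
}

/// Hypothetical config struct, following the way DiskManager and
/// MemoryPool are configured today: `None` disables statistics caching
/// entirely, `Some(...)` installs either the simple built-in default
/// or a user-supplied policy (LRU, TTL, disk-backed, ...).
#[derive(Default)]
pub struct CacheManagerConfig {
    pub file_statistics_cache:
        Option<Arc<dyn CacheAccessor<String, Arc<Statistics>>>>,
}

impl CacheManagerConfig {
    /// Builder-style override, mirroring `RuntimeConfig::with_memory_pool`.
    pub fn with_file_statistics_cache(
        mut self,
        cache: Option<Arc<dyn CacheAccessor<String, Arc<Statistics>>>>,
    ) -> Self {
        self.file_statistics_cache = cache;
        self
    }
}
```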

@liukun4515 (Contributor) commented:

> I think keeping a metadata cache on the RuntimeEnv is reasonable as long as
>
>   1. There is a way to extend / disable the default behavior (as there is with the DiskManager and MemoryPool).
>   2. The default implementation in DataFusion is simple
>
> The rationale for something simple built in but a configurable API is that the exact caching strategy is likely to vary tremendously from system to system (for example, if there is a local file based parquet cache, storing metadata in memory might not make sense, or how to do cache eviction or enforce limits, etc).

This suggestion is very important.

> Therefore it is unlikely that anything in DataFusion will cover all use cases, so what is built in should be simple and allow users to add whatever specific caching policy they want.
>
> Does that make sense @Ted-Jiang?

@alamb
Does InfluxDB IOx have a file statistics cache or a list-files cache?
How does IOx avoid having nodes visit remote storage when generating the execution plan?

@Ted-Jiang Ted-Jiang self-assigned this Sep 15, 2023
@Ted-Jiang (Member, Author) commented Sep 15, 2023

@alamb Thanks for the suggestions! That makes sense.

I will extract a trait like MemoryPool so users can extend it with their own caching policy.

As for where get or set should be called during execution: I think the only option is to hard-code the call sites in the physical plan, which means users cannot customize where the cache is read or written.
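
For illustration, a minimal sketch of what that hard-coded call site could look like; `get_or_infer_statistics` and `infer_stats` are hypothetical names standing in for the existing statistics-collection path:

```rust
use std::sync::Arc;

/// Stubs so the sketch stands alone; in DataFusion these would be the
/// real statistics type and the existing (expensive) metadata read.
pub struct Statistics;
pub trait CacheAccessor<K, V>: Send + Sync {
    fn get(&self, key: &K) -> Option<V>;
    fn put(&self, key: K, value: V);
}
fn infer_stats(_path: &str) -> Statistics {
    // stands in for the multi-second parquet-metadata fetch from remote storage
    Statistics
}

/// The fixed call site inside statistics collection: consult the
/// session-level cache first, and only pay the remote read on a miss.
fn get_or_infer_statistics(
    cache: &dyn CacheAccessor<String, Arc<Statistics>>,
    path: &str,
) -> Arc<Statistics> {
    if let Some(stats) = cache.get(&path.to_string()) {
        return stats; // hit: no remote I/O at all
    }
    let stats = Arc::new(infer_stats(path)); // miss: read metadata once
    cache.put(path.to_string(), Arc::clone(&stats));
    stats
}
```

A real implementation would probably key on more than the raw path (for example the `last_modified` or e-tag from the `ObjectMeta` shown in the log above), so stale entries are invalidated when a file is rewritten.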

@alamb (Contributor) commented Sep 15, 2023

@liukun4515

> Does InfluxDB IOx have a file statistics cache or a list-files cache?

IOx caches (effectively) the list of files and a (very small) subset of the statistics (in our case just the min/max timestamp values). Our metadata catalog (see below) does not have space to store the entire parquet file metadata (with per-row-group statistics).

Also, at the moment IOx has an in-memory cache of the actual parquet data, which means it effectively always reads entire objects from storage (though we may change this at some point).

> How does IOx avoid having nodes visit remote storage when generating the execution plan?

IOx has its own separate metadata catalog that stores the schema, which files hold data for each table, and which partition each file belongs to (IOx typically keeps data segregated into daily partitions).

Thus we never use LIST operations on object store.

The high-level architecture is described here: https://www.influxdata.com/blog/influxdb-3-0-system-architecture/

@liukun4515 (Contributor) commented:

> (quoting @alamb's reply above in full)

Thanks for your detailed comments, I will take a look at this blog.
