[feat] Introduce cacheManager in session ctx and make StatisticsCache share in session #7570

Merged
merged 15 commits into from
Sep 18, 2023

Conversation

Ted-Jiang
Member

Which issue does this PR close?

Closes #7556 .

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions bot added the core Core DataFusion crate label Sep 15, 2023
@@ -1092,6 +1092,95 @@ mod tests {
        Ok(())
    }

    #[tokio::test]
    async fn load_table_stats_with_session_level_cache() -> Result<()> {
        let testdata = crate::test_util::parquet_test_data();
Member Author

Added a test to check that the cache is shared at the session level.

Contributor

Nice -- thank you. Since this is an end to end test, I recommend moving it to somewhere in core_integration: datafusion/core/tests/core_integration.rs perhaps

datafusion/execution/src/cache/cache_manager.rs (outdated, resolved)
datafusion/execution/src/cache/cache_unit.rs (resolved)
Contributor

@alamb alamb left a comment

Thank you @Ted-Jiang -- this is looking really nice. I left some various clean up comments but I think this PR looks very nice and is well commented and structured 🏆

datafusion/execution/src/cache/mod.rs (outdated, resolved)
// The cache accessor, users usually working on this interface while manipulating caches
pub trait CacheAccessor<K, V>: Send + Sync {
    // Extra info but not part of the cache key or cache value.
    type Extra: Clone;
Contributor

Can you explain what the use case for Extra is? Specifically, I wonder why such information could not be added as a field to the Value.

Member Author

For example, in the default FileStatisticsCache, the get function needs to check last_modified from ObjectMeta. ObjectMeta does not implement Hash, so it cannot be part of the key; we need to pass this information in Extra instead.
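
The idea of passing validation data alongside the key can be sketched as follows. This is a minimal, self-contained illustration and not the code from this PR: `FileMeta`, `FileStats`, and `SimpleFileStatsCache` are hypothetical stand-ins for `ObjectMeta`, `Statistics`, and the default file statistics cache, and the trait is cut down to two methods.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

/// Hypothetical stand-in for `ObjectMeta`; only the fields this sketch needs.
#[derive(Clone, Debug, PartialEq)]
pub struct FileMeta {
    pub size: u64,
    pub last_modified: u64, // seconds since epoch (a simplification)
}

/// Hypothetical stand-in for `Statistics`.
#[derive(Clone, Debug, PartialEq)]
pub struct FileStats {
    pub num_rows: usize,
}

/// Cut-down version of the accessor trait discussed in this thread.
pub trait CacheAccessor<K, V>: Send + Sync {
    /// Extra info that is neither the cache key nor the cache value.
    type Extra: Clone;
    fn get_with_extra(&self, k: &K, e: &Self::Extra) -> Option<V>;
    fn put_with_extra(&self, key: &K, value: V, e: &Self::Extra) -> Option<V>;
}

/// Keyed by file path; the metadata seen at insert time is stored next to
/// the value and compared on lookup.
#[derive(Default)]
pub struct SimpleFileStatsCache {
    entries: RwLock<HashMap<String, (FileMeta, FileStats)>>,
}

impl CacheAccessor<String, FileStats> for SimpleFileStatsCache {
    type Extra = FileMeta;

    /// Returns `None` if the path is unknown or the file metadata changed.
    fn get_with_extra(&self, k: &String, e: &FileMeta) -> Option<FileStats> {
        let entries = self.entries.read().unwrap();
        entries.get(k).and_then(|(meta, stats)| {
            if meta == e {
                Some(stats.clone())
            } else {
                None
            }
        })
    }

    /// Stores the value with the caller's metadata; returns the old value.
    fn put_with_extra(&self, key: &String, value: FileStats, e: &FileMeta) -> Option<FileStats> {
        let mut entries = self.entries.write().unwrap();
        entries.insert(key.clone(), (e.clone(), value)).map(|(_, old)| old)
    }
}
```

In this sketch a lookup with a newer `last_modified` misses, so stale statistics are never returned, while the key itself stays a plain hashable path.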

Contributor

I see -- that makes sense -- it might help to document the rationale in statistics

    }

    /// Get `Statistics` for file location. Returns None if file has changed or not found.
    fn get_with_extra(&self, k: &Path, e: &Self::Extra) -> Option<Statistics> {
Contributor

As written, this is going to copy the statistics (though I realize that is what this PR did previously) -- maybe we could use something like Arc<Statistics> to store the statistics.
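
A rough sketch of the `Arc` suggestion, under stated assumptions: `FileStats` and `SharedStatsCache` are hypothetical names used only for illustration, not types from DataFusion. Storing `Arc<FileStats>` means a lookup clones a pointer and bumps a reference count instead of deep-copying the statistics.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical stand-in for `Statistics`; the vector makes a deep copy costly.
#[derive(Debug, PartialEq)]
pub struct FileStats {
    pub num_rows: usize,
    pub column_min_max: Vec<(i64, i64)>,
}

/// Cache that hands out shared pointers rather than copies.
#[derive(Default)]
pub struct SharedStatsCache {
    entries: RwLock<HashMap<String, Arc<FileStats>>>,
}

impl SharedStatsCache {
    /// Inserts statistics for `path`, returning any previous entry.
    pub fn put(&self, path: &str, stats: Arc<FileStats>) -> Option<Arc<FileStats>> {
        self.entries.write().unwrap().insert(path.to_string(), stats)
    }

    /// Returns a new handle to the cached statistics; no `FileStats` is copied.
    pub fn get(&self, path: &str) -> Option<Arc<FileStats>> {
        self.entries.read().unwrap().get(path).map(Arc::clone)
    }
}
```

Two calls to `get` for the same path return handles to the same allocation, which `Arc::ptr_eq` can confirm.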

Member Author

Yes, thanks for pointing this out.

datafusion/core/src/datasource/listing/table.rs (outdated, resolved)
datafusion/execution/src/cache/cache_manager.rs (outdated, resolved)
}

impl CacheManagerConfig {
    pub fn enable_table_files_statistics_cache(mut self, cache: FileStaticCache) -> Self {
Contributor

Can we possibly have the field names match -- here it is called table_files_statistics_cache but on the CacheManager it is called file_statistics_cache -- I think they should be the same in both places (I like file_statistics_cache best as it matches the type name)

Member Author

Oops, this makes sense.

pub mod cache_unit;

// The cache accessor, users usually working on this interface while manipulating caches
pub trait CacheAccessor<K, V>: Send + Sync {
Contributor

Another classic API to add here would be "clear()" to clear all the values

Member Author

Agreed, we need this.
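
One possible shape for the suggested `clear()` method, as a hedged sketch rather than the API that was merged: `ClearableCache` is a hypothetical trait name, and it is implemented here directly on `RwLock<HashMap<K, V>>` to keep the example short.

```rust
use std::collections::HashMap;
use std::hash::Hash;
use std::sync::RwLock;

/// Hypothetical, cut-down accessor trait that includes `clear`.
pub trait ClearableCache<K, V>: Send + Sync {
    /// Inserts a value, returning the previous one if present.
    fn put(&self, key: K, value: V) -> Option<V>;
    /// Number of cached entries.
    fn len(&self) -> usize;
    /// Removes every entry from the cache.
    fn clear(&self);
}

impl<K, V> ClearableCache<K, V> for RwLock<HashMap<K, V>>
where
    K: Eq + Hash + Send + Sync,
    V: Send + Sync,
{
    fn put(&self, key: K, value: V) -> Option<V> {
        self.write().unwrap().insert(key, value)
    }

    fn len(&self) -> usize {
        self.read().unwrap().len()
    }

    fn clear(&self) {
        // Takes the write lock once and drops all entries together.
        self.write().unwrap().clear();
    }
}
```

Because every method takes `&self`, the cache can be cleared through a shared reference, which fits the interior-locking design described in the trait's doc comment.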

datafusion/execution/src/cache/mod.rs (outdated, resolved)
@@ -229,8 +229,14 @@ impl TableProviderFactory for ListingTableFactory {
        let config = ListingTableConfig::new(table_path)
            .with_listing_options(options)
            .with_schema(resolved_schema);
        let table =
            ListingTable::try_new(config)?.with_definition(cmd.definition.clone());
        let provider;
Contributor

If you implement the with_cache API, this can look like

let provider = ListingTable::try_new(config)?
  .with_cache(state.runtime_env().cache_manager.get_file_statistic_cache())
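
To show how a builder-style `with_cache` lets several tables share one session-level cache, here is a small sketch under stated assumptions. `Table`, `StatsCache`, `record_row_count`, and `cached_row_count` are hypothetical names for illustration; only `try_new` and `with_cache` mirror names used in this thread, and their signatures here are assumed.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

/// Hypothetical shared cache: file path -> row count.
pub type StatsCache = Arc<RwLock<HashMap<String, usize>>>;

/// Hypothetical stand-in for `ListingTable`.
pub struct Table {
    path: String,
    stats_cache: StatsCache,
}

impl Table {
    /// Creates a table with its own private cache.
    pub fn try_new(path: &str) -> Result<Self, String> {
        if path.is_empty() {
            return Err("table path must not be empty".to_string());
        }
        Ok(Self {
            path: path.to_string(),
            stats_cache: StatsCache::default(),
        })
    }

    /// Swaps in a shared cache if one is given; otherwise keeps the private one.
    pub fn with_cache(mut self, cache: Option<StatsCache>) -> Self {
        if let Some(cache) = cache {
            self.stats_cache = cache;
        }
        self
    }

    /// Stores a row count for this table's path.
    pub fn record_row_count(&self, rows: usize) {
        self.stats_cache
            .write()
            .unwrap()
            .insert(self.path.clone(), rows);
    }

    /// Reads the row count for this table's path, if cached.
    pub fn cached_row_count(&self) -> Option<usize> {
        self.stats_cache.read().unwrap().get(&self.path).copied()
    }
}
```

In this sketch, two tables built with the same `Some(cache)` see each other's entries, while a table built with `None` keeps an isolated cache; that is the behavior the session-level test in this PR is meant to check.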

Member Author

@Ted-Jiang Ted-Jiang Sep 16, 2023

That is a clearer way! 👍

@Ted-Jiang
Member Author

@alamb PTAL

Contributor

@alamb alamb left a comment

Thanks @Ted-Jiang -- I think this looks great -- I had a suggestion to improve the comments but I think we can do that as a follow on PR as well. Nice work!

datafusion/core/src/datasource/listing/table.rs (outdated, resolved)
@@ -40,6 +40,7 @@ use std::sync::Arc;
use tempfile::NamedTempFile;

mod custom_reader;
mod file_statistics;
Contributor

👍

@alamb
Contributor

alamb commented Sep 17, 2023

cc @Dandandan / @thinkharderdev @liukun4515 and @mateuszkj who appears to have added this feature originally in 85c11c1 / #3649

/// locking via internal mutability. It can be accessed via multiple concurrent queries
/// during planning and execution.

pub trait CacheAccessor<K, V>: Send + Sync {
Contributor

Nice abstraction/trait.

Contributor

@liukun4515 liukun4515 left a comment

LGTM for the file statistics trait API.

@Ted-Jiang Ted-Jiang merged commit 678d27a into apache:main Sep 18, 2023
@Ted-Jiang
Member Author

@alamb @liukun4515 Thanks for the review.

Labels
core Core DataFusion crate
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Make StatisticsCache share in session level
3 participants