Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Aggressive S3 metadata caching #556

Merged
2 commits merged into from
Mar 20, 2024
Merged

Aggressive S3 metadata caching #556

2 commits merged into from
Mar 20, 2024

Conversation

sergiimk
Copy link
Member

@sergiimk sergiimk commented Mar 19, 2024

Description

Related to: kamu-data/kamu-node#61

Adds two layers of caching:

  • ObjectRepositoryCachingLocalFS - a read-through cache that stores objects in a local directory, which I use for metadata blocks
  • S3RegistryCache - a singleton in-memory cache of datasets contained in S3 bucket

Together they turned some of my sample queries from taking 4.5s to taking just 9ms.

Checklist before requesting a review

  • CHANGELOG.md updated
  • API changes are backwards-compatible
  • Workspace layout changes include a migration
  • Documentation update PR: <link or N/A>
  • Dataset pipelines update scheduled if needed
  • Unit-tests added

@sergiimk
Copy link
Member Author

I will release API server based on this branch as I need good perf for tomorrow's demo.

@@ -82,11 +82,25 @@ impl DatasetRepositoryLocalFs {
))
}

fn build_dataset(layout: DatasetLayout, event_bus: Arc<EventBus>) -> Arc<dyn Dataset> {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Notice that I have severed ties between DatasetRepositorys and DatasetFactory.

After making a bunch of changes to make DatasetFactory super configurable, I realized that DatasetFactory should be used only when talking to remote repositories, and DatasetRepository implementations should have independent low-level control over how they create dataset instances. Attempting to combine the two just results in configuration and dependency nightmare.

@@ -415,3 +564,27 @@ impl DatasetRepository for DatasetRepositoryS3 {
}

/////////////////////////////////////////////////////////////////////////////////////////

pub struct S3RegistryCache {
state: Arc<Mutex<State>>,
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is far from elegant but I think we will factor it out into a proper registry component in future.
I'd rather save this refactoring until #342.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note to #342

@@ -415,3 +564,27 @@ impl DatasetRepository for DatasetRepositoryS3 {
}

/////////////////////////////////////////////////////////////////////////////////////////

pub struct S3RegistryCache {
state: Arc<Mutex<State>>,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a note to #342

@sergiimk sergiimk closed this pull request by merging all changes into master in 318ee8b Mar 20, 2024
@sergiimk sergiimk deleted the feature/lfs-metadata-cache branch March 20, 2024 00:32
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants