Persistent storage of dataset dependencies graph #854

zaychenko-sergei · 2024-09-26T08:24:47Z

Currently we use an in-memory representation of the dataset dependency graph in DependencyGraphService.
The model is based on an explicit graph representation implemented with the petgraph crate.

The dependency data for this graph is lazily loaded exactly once through the DependencyGraphRepository trait, implemented by the DependencyGraphRepositoryInMemory component. The load is triggered by the first graph query, which in the API server case happens when the first user after server startup opens the search page, since that page shows the count of downstream datasets.

Later, the consistency of the in-memory dependencies is maintained via dataset events (created, deleted, dependencies updated).
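
To make the current behavior concrete, here is a minimal sketch of a lazily loaded graph that is kept up to date by dataset events. It is not the actual DependencyGraphService: the names LazyGraphService, DependencyGraph and DatasetEvent are hypothetical, the scan is a plain closure, and a std HashMap stands in for petgraph so the example has no external dependencies.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical stand-in for a dataset identifier.
type DatasetId = String;

/// The three kinds of dataset events mentioned above (names are assumptions).
pub enum DatasetEvent {
    Created { id: DatasetId },
    Deleted { id: DatasetId },
    DependenciesUpdated { id: DatasetId, upstream: Vec<DatasetId> },
}

/// Adjacency-set graph; a simplification, the real service uses petgraph.
#[derive(Default)]
pub struct DependencyGraph {
    /// dataset -> datasets it reads from
    upstream: HashMap<DatasetId, HashSet<DatasetId>>,
}

impl DependencyGraph {
    fn add_edge(&mut self, downstream: DatasetId, upstream: DatasetId) {
        self.upstream.entry(downstream).or_default().insert(upstream);
    }

    /// Applies one event so the graph stays consistent without a rescan.
    pub fn apply(&mut self, event: DatasetEvent) {
        match event {
            DatasetEvent::Created { id } => {
                self.upstream.entry(id).or_default();
            }
            DatasetEvent::Deleted { id } => {
                self.upstream.remove(&id);
                for ups in self.upstream.values_mut() {
                    ups.remove(&id);
                }
            }
            DatasetEvent::DependenciesUpdated { id, upstream } => {
                self.upstream.insert(id, upstream.into_iter().collect());
            }
        }
    }

    /// Number of datasets that directly depend on `id`.
    pub fn downstream_count(&self, id: &str) -> usize {
        self.upstream.values().filter(|ups| ups.contains(id)).count()
    }
}

/// Runs the expensive scan only on first use, then serves from memory.
pub struct LazyGraphService<F: Fn() -> Vec<(DatasetId, DatasetId)>> {
    scan: F,
    graph: Option<DependencyGraph>,
    pub scans_performed: usize,
}

impl<F: Fn() -> Vec<(DatasetId, DatasetId)>> LazyGraphService<F> {
    pub fn new(scan: F) -> Self {
        Self { scan, graph: None, scans_performed: 0 }
    }

    fn graph_mut(&mut self) -> &mut DependencyGraph {
        if self.graph.is_none() {
            // First use: pay the full scan cost once.
            self.scans_performed += 1;
            let mut graph = DependencyGraph::default();
            for (downstream, upstream) in (self.scan)() {
                graph.add_edge(downstream, upstream);
            }
            self.graph = Some(graph);
        }
        self.graph.as_mut().unwrap()
    }

    pub fn downstream_count(&mut self, id: &str) -> usize {
        self.graph_mut().downstream_count(id)
    }

    pub fn on_event(&mut self, event: DatasetEvent) {
        self.graph_mut().apply(event);
    }
}
```

In this sketch the first call to downstream_count triggers the scan, which mirrors why the first visitor of the search page pays the initialization cost.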

The measurements show that for ~120 datasets hosted in S3, this initialization takes over 3s.

This can be explained by the full scan of the dataset repository in DependencyGraphRepositoryInMemory, which does:

full listing of dataset keys in S3 bucket (ListObjectsV2)
resolving alias for each dataset (GetObject)
reading summary for each dataset (GetObject)

In total, we make about 250 S3 calls in the demo environment just for this operation, and this happens every time the API server restarts.
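
As a rough cross-check of that figure, the call count can be estimated from the three steps listed above. This is only a back-of-the-envelope sketch: the function name is hypothetical, and it assumes exactly one GetObject for the alias and one for the summary per dataset, with the number of ListObjectsV2 pages passed in by the caller.

```rust
/// Estimates S3 requests for one full repository scan.
/// Assumption: two GetObject calls per dataset (alias + summary),
/// plus `list_pages` ListObjectsV2 requests for the key listing.
pub fn estimated_scan_calls(datasets: u64, list_pages: u64) -> u64 {
    list_pages + 2 * datasets
}
```

For 120 datasets and a single listing page this gives 241 requests, which is in line with the figure above.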

This should be improved by:

storing dependencies in the database, and loading them into the graph within a single SQL transaction
scanning the S3 bucket as we do now only to initially populate the database, and for recovery purposes in the future
making dependency updates (dataset events) maintain both the in-memory graph model and the persistent dependency records
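
The proposal above could be sketched as follows. This is an illustration under stated assumptions, not the planned implementation: DependencyStore, FakeStore and GraphService are hypothetical names, the store is an in-memory fake where a real one would issue SQL, and a std HashMap again stands in for petgraph.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical persistence boundary; a real one would be backed by SQL.
pub trait DependencyStore {
    /// Returns every (downstream, upstream) pair, ideally in one transaction.
    fn load_all(&self) -> Vec<(String, String)>;
    /// Replaces the stored upstream set of `dataset`.
    fn replace_upstream(&mut self, dataset: &str, upstream: &[String]);
    /// Removes every record that mentions `dataset`.
    fn remove_dataset(&mut self, dataset: &str);
}

/// In-memory fake, used only to keep the sketch runnable.
#[derive(Default)]
pub struct FakeStore {
    rows: Vec<(String, String)>,
}

impl DependencyStore for FakeStore {
    fn load_all(&self) -> Vec<(String, String)> {
        self.rows.clone()
    }

    fn replace_upstream(&mut self, dataset: &str, upstream: &[String]) {
        self.rows.retain(|(down, _)| down.as_str() != dataset);
        self.rows
            .extend(upstream.iter().map(|up| (dataset.to_string(), up.clone())));
    }

    fn remove_dataset(&mut self, dataset: &str) {
        self.rows
            .retain(|(down, up)| down.as_str() != dataset && up.as_str() != dataset);
    }
}

/// Keeps the in-memory graph and the persistent records in step.
pub struct GraphService<S: DependencyStore> {
    store: S,
    upstream: HashMap<String, HashSet<String>>,
}

impl<S: DependencyStore> GraphService<S> {
    /// Builds the graph from persisted rows, with no bucket scan.
    pub fn load(store: S) -> Self {
        let mut upstream: HashMap<String, HashSet<String>> = HashMap::new();
        for (down, up) in store.load_all() {
            upstream.entry(down).or_default().insert(up);
        }
        Self { store, upstream }
    }

    /// Event handler: writes to the store first, then updates memory.
    pub fn on_dependencies_updated(&mut self, dataset: &str, new_upstream: Vec<String>) {
        self.store.replace_upstream(dataset, &new_upstream);
        self.upstream
            .insert(dataset.to_string(), new_upstream.into_iter().collect());
    }

    /// Event handler: drops the dataset from both representations.
    pub fn on_deleted(&mut self, dataset: &str) {
        self.store.remove_dataset(dataset);
        self.upstream.remove(dataset);
        for ups in self.upstream.values_mut() {
            ups.remove(dataset);
        }
    }

    /// Number of datasets that directly depend on `dataset`.
    pub fn downstream_count(&self, dataset: &str) -> usize {
        self.upstream.values().filter(|ups| ups.contains(dataset)).count()
    }

    /// Hands the store back, e.g. to simulate a server restart.
    pub fn into_store(self) -> S {
        self.store
    }
}
```

Because every event is written through to the store, a restarted service can rebuild the same graph from load_all() alone, leaving the S3 scan for the initial fill and for recovery.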

Consider moving the graph initialization to the pre_run() phase as well. Postponing the load makes little sense, since all users start from the Search page.


zaychenko-sergei added the enhancement, rust, and performance labels Sep 26, 2024

zaychenko-sergei self-assigned this Sep 26, 2024

zaychenko-sergei mentioned this issue Sep 26, 2024

Metadata scanning performance kamu-data/kamu-node#61

Open

zaychenko-sergei linked a pull request Nov 27, 2024 that will close this issue

854 persistent storage of dataset dependencies graph #973

Open

16 tasks
