Persistent storage of dataset dependencies graph #854

Open
zaychenko-sergei opened this issue Sep 26, 2024 · 0 comments · May be fixed by #973
Labels: enhancement, performance, rust

Currently, DependencyGraphService keeps the dataset dependency graph in memory only.
The model is an explicit graph representation implemented with the petgraph crate.

The dependency data for this graph is lazily loaded just once via the repository trait DependencyGraphRepository, implemented by the DependencyGraphRepositoryInMemory component. The load is triggered by the first graph query, which in the API server case happens when the first user after server startup opens the search page, since that page shows the count of downstream datasets.

Later, the consistency of the in-memory dependencies is maintained via dataset events (created, deleted, dependencies updated).
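
To illustrate the event-driven maintenance described above, here is a minimal sketch. It uses plain adjacency sets from std instead of petgraph so that it runs without dependencies. `DatasetEvent`, `DependencyGraph` and `downstream_count` are hypothetical names, not the actual service API, and only direct downstream datasets are counted.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical dataset identifier; the real service has its own ID type.
type DatasetId = String;

/// Hypothetical event shapes mirroring "created, deleted, dependencies updated".
enum DatasetEvent {
    Created { id: DatasetId },
    Deleted { id: DatasetId },
    DependenciesUpdated { id: DatasetId, upstream: Vec<DatasetId> },
}

/// Adjacency-set stand-in for the petgraph-based model.
#[derive(Default)]
struct DependencyGraph {
    /// dataset -> datasets it reads from
    upstream: HashMap<DatasetId, HashSet<DatasetId>>,
    /// dataset -> datasets that read from it
    downstream: HashMap<DatasetId, HashSet<DatasetId>>,
}

impl DependencyGraph {
    fn apply(&mut self, event: DatasetEvent) {
        match event {
            DatasetEvent::Created { id } => {
                self.upstream.entry(id.clone()).or_default();
                self.downstream.entry(id).or_default();
            }
            DatasetEvent::Deleted { id } => {
                // Detach the dataset from everything it read from.
                self.replace_upstream(&id, Vec::new());
                self.upstream.remove(&id);
                // Drop edges from datasets that depended on the deleted one.
                if let Some(children) = self.downstream.remove(&id) {
                    for child in children {
                        if let Some(ups) = self.upstream.get_mut(&child) {
                            ups.remove(&id);
                        }
                    }
                }
            }
            DatasetEvent::DependenciesUpdated { id, upstream } => {
                self.replace_upstream(&id, upstream);
            }
        }
    }

    fn replace_upstream(&mut self, id: &DatasetId, new_upstream: Vec<DatasetId>) {
        // Remove the reverse edges that belonged to the old upstream set.
        let old = self.upstream.insert(id.clone(), HashSet::new()).unwrap_or_default();
        for up in old {
            if let Some(children) = self.downstream.get_mut(&up) {
                children.remove(id);
            }
        }
        // Record the new edges in both directions.
        for up in new_upstream {
            self.downstream.entry(up.clone()).or_default().insert(id.clone());
            self.upstream.entry(id.clone()).or_default().insert(up);
        }
    }

    /// Number of datasets that directly depend on `id`.
    fn downstream_count(&self, id: &str) -> usize {
        self.downstream.get(id).map_or(0, |s| s.len())
    }
}

fn main() {
    let mut graph = DependencyGraph::default();
    for id in ["root", "derived-a", "derived-b"] {
        graph.apply(DatasetEvent::Created { id: id.to_string() });
    }
    graph.apply(DatasetEvent::DependenciesUpdated {
        id: "derived-a".into(),
        upstream: vec!["root".into()],
    });
    graph.apply(DatasetEvent::DependenciesUpdated {
        id: "derived-b".into(),
        upstream: vec!["root".into()],
    });
    println!("downstream of root: {}", graph.downstream_count("root"));
    graph.apply(DatasetEvent::Deleted { id: "derived-b".into() });
    println!("after delete: {}", graph.downstream_count("root"));
}
```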

The measurements show that for ~120 datasets hosted in S3, this initialization takes over 3s.

This is explained by the full scan of the dataset repository in DependencyGraphRepositoryInMemory, which does:

  • a full listing of dataset keys in the S3 bucket (ListObjectsV2)
  • alias resolution for each dataset (GetObject)
  • a read of the summary for each dataset (GetObject)

In total, this is about 250 S3 calls on the demo environment (the listing plus two GetObject calls per dataset) just for this operation, and it happens each time the API server restarts.
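
As a back-of-the-envelope check of that figure, the call count can be estimated as below. The function name is hypothetical, and the number of ListObjectsV2 pages is an assumption (a single page here).

```rust
/// Rough number of S3 requests for one full repository scan:
/// the listing pages plus two GetObject calls (alias and summary) per dataset.
fn estimated_scan_calls(dataset_count: u64, listing_pages: u64) -> u64 {
    const GET_OBJECT_CALLS_PER_DATASET: u64 = 2;
    listing_pages + dataset_count * GET_OBJECT_CALLS_PER_DATASET
}

fn main() {
    // ~120 datasets, assuming the listing fits into one page.
    println!("{}", estimated_scan_calls(120, 1)); // prints 241
}
```

This is in line with the roughly 250 calls mentioned above, and it grows linearly with the number of datasets.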

This should be improved by:

  • storing dependencies in the database and loading them into the graph within a single SQL transaction
  • scanning the S3 bucket as we do now only to initially fill the database, and for recovery purposes in the future
  • having dependency updates (dataset events) maintain both the in-memory graph model and the persistent dependency records
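
The three points above could be sketched roughly as follows. Everything here is an assumption for illustration: `DependencyStore`, `FakeStore`, `InMemoryGraph`, `init_graph` and `on_dependencies_updated` are hypothetical names, and the store is an in-memory stand-in for what would be a SQL-backed implementation in practice.

```rust
use std::collections::{BTreeSet, HashMap};

type DatasetId = String;
/// (downstream, upstream) pair.
type Edge = (DatasetId, DatasetId);

/// Hypothetical persistence boundary; a real one would sit on top of the database.
trait DependencyStore {
    /// Returns `None` if the store has never been seeded.
    fn load_all(&self) -> Option<Vec<Edge>>;
    fn seed(&mut self, edges: &[Edge]);
    fn replace_upstream(&mut self, dataset: &str, upstream: &[DatasetId]);
}

/// In-memory stand-in for the database, for illustration only.
#[derive(Default)]
struct FakeStore {
    seeded: bool,
    edges: BTreeSet<Edge>,
}

impl DependencyStore for FakeStore {
    fn load_all(&self) -> Option<Vec<Edge>> {
        if self.seeded { Some(self.edges.iter().cloned().collect()) } else { None }
    }
    fn seed(&mut self, edges: &[Edge]) {
        self.edges = edges.iter().cloned().collect();
        self.seeded = true;
    }
    fn replace_upstream(&mut self, dataset: &str, upstream: &[DatasetId]) {
        self.edges.retain(|(down, _)| down.as_str() != dataset);
        for up in upstream {
            self.edges.insert((dataset.to_string(), up.clone()));
        }
    }
}

#[derive(Default)]
struct InMemoryGraph {
    upstream: HashMap<DatasetId, BTreeSet<DatasetId>>,
}

impl InMemoryGraph {
    fn from_edges(edges: Vec<Edge>) -> Self {
        let mut graph = Self::default();
        for (down, up) in edges {
            graph.upstream.entry(down).or_default().insert(up);
        }
        graph
    }
    fn set_upstream(&mut self, dataset: &str, upstream: &[DatasetId]) {
        self.upstream.insert(dataset.to_string(), upstream.iter().cloned().collect());
    }
    fn downstream_count(&self, dataset: &str) -> usize {
        self.upstream.values().filter(|ups| ups.contains(dataset)).count()
    }
}

/// Loads from the store; falls back to the expensive scan only when it is unseeded.
fn init_graph<S: DependencyStore>(
    store: &mut S,
    full_scan: impl FnOnce() -> Vec<Edge>,
) -> InMemoryGraph {
    let edges = match store.load_all() {
        Some(edges) => edges,
        None => {
            let scanned = full_scan();
            store.seed(&scanned);
            scanned
        }
    };
    InMemoryGraph::from_edges(edges)
}

/// An event handler keeps the persistent records and the in-memory model in sync.
fn on_dependencies_updated<S: DependencyStore>(
    store: &mut S,
    graph: &mut InMemoryGraph,
    dataset: &str,
    upstream: &[DatasetId],
) {
    store.replace_upstream(dataset, upstream);
    graph.set_upstream(dataset, upstream);
}

fn main() {
    let mut store = FakeStore::default();
    // First start: the store is unseeded, so the scan runs once and fills it.
    let mut graph = init_graph(&mut store, || vec![("b".to_string(), "a".to_string())]);
    on_dependencies_updated(&mut store, &mut graph, "c", &["a".to_string()]);
    // Restart: the graph is rebuilt from the store and the scan is skipped.
    let restarted = init_graph(&mut store, || unreachable!("scan should be skipped"));
    println!("downstream of a: {}", restarted.downstream_count("a"));
}
```

In a real implementation the store update and the graph update would also need a defined ordering and failure handling, which this sketch leaves out.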

Consider also moving graph initialization to the pre_run() phase: postponing the load makes little sense, since all users start from the Search page.
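
A minimal sketch of the difference between the current lazy load and an eager load in pre_run(), using `std::sync::OnceLock`. `GraphService` and its methods are hypothetical, and the actual signature of pre_run() in the server is not assumed here.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::OnceLock;

/// (downstream, upstream) pair; a stand-in for the real graph model.
type Edge = (String, String);

struct GraphService {
    cell: OnceLock<Vec<Edge>>,
    /// Counts how many times the (expensive) load actually ran.
    loads: AtomicUsize,
}

impl GraphService {
    fn new() -> Self {
        GraphService { cell: OnceLock::new(), loads: AtomicUsize::new(0) }
    }

    /// Lazy path: the first caller pays for the load.
    fn edges(&self) -> &[Edge] {
        self.cell.get_or_init(|| {
            self.loads.fetch_add(1, Ordering::SeqCst);
            // Placeholder for loading the dependency records.
            vec![("derived".to_string(), "root".to_string())]
        })
    }

    /// Eager path: called once at startup, before any request is served.
    fn pre_run(&self) {
        let _ = self.edges();
    }

    fn is_loaded(&self) -> bool {
        self.cell.get().is_some()
    }

    fn load_count(&self) -> usize {
        self.loads.load(Ordering::SeqCst)
    }
}

fn main() {
    let service = GraphService::new();
    service.pre_run();
    println!("loaded before first request: {}", service.is_loaded());
    let _ = service.edges();
    println!("loads performed: {}", service.load_count());
}
```

With the eager call in place, the first search page request no longer pays the initialization cost, and later queries reuse the already loaded data.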

zaychenko-sergei added the enhancement, rust, and performance labels Sep 26, 2024
zaychenko-sergei self-assigned this Sep 26, 2024
zaychenko-sergei linked a pull request Nov 27, 2024 that will close this issue (16 tasks)