Currently, `DependencyGraphService` keeps an in-memory representation of the dataset dependency graph.
The model is an explicit graph implemented with the `petgraph` crate.
The dependency data for this graph is lazily loaded exactly once via the repository trait `DependencyGraphRepository`, implemented by the `DependencyGraphRepositoryInMemory` component. The load is triggered by the first graph query, which in the API server case happens when the first user after server startup opens the search page, because that page shows the count of downstream datasets.
Afterwards, the in-memory dependencies are kept consistent via dataset events (created, deleted, dependencies updated).
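To make the event-driven maintenance concrete, here is a minimal sketch of how the three event kinds could be applied to an in-memory graph. It uses plain `std` collections rather than `petgraph` to stay self-contained, and every name in it (`DatasetEvent`, `DependencyGraph`, `apply`, `downstream_count`) is a hypothetical stand-in, not the actual service API:

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical dataset identifier; the real service has its own ID type.
type DatasetId = String;

/// Hypothetical event shapes mirroring "created, deleted, dependencies updated".
enum DatasetEvent {
    Created { id: DatasetId },
    Deleted { id: DatasetId },
    DependenciesUpdated { id: DatasetId, upstream: Vec<DatasetId> },
}

#[derive(Default)]
struct DependencyGraph {
    /// dataset -> datasets it reads from
    upstream: HashMap<DatasetId, HashSet<DatasetId>>,
    /// dataset -> datasets that read from it
    downstream: HashMap<DatasetId, HashSet<DatasetId>>,
}

impl DependencyGraph {
    fn apply(&mut self, event: DatasetEvent) {
        match event {
            DatasetEvent::Created { id } => {
                self.upstream.entry(id.clone()).or_default();
                self.downstream.entry(id).or_default();
            }
            DatasetEvent::Deleted { id } => {
                // Detach the dataset from everything it read from.
                self.set_upstream(&id, Vec::new());
                self.upstream.remove(&id);
                // Detach every dataset that still read from it.
                if let Some(children) = self.downstream.remove(&id) {
                    for child in children {
                        if let Some(ups) = self.upstream.get_mut(&child) {
                            ups.remove(&id);
                        }
                    }
                }
            }
            DatasetEvent::DependenciesUpdated { id, upstream } => {
                self.set_upstream(&id, upstream);
            }
        }
    }

    /// Replaces the full upstream set of `id`, keeping both maps in sync.
    fn set_upstream(&mut self, id: &DatasetId, new_upstream: Vec<DatasetId>) {
        if let Some(old) = self.upstream.remove(id) {
            for up in old {
                if let Some(down) = self.downstream.get_mut(&up) {
                    down.remove(id);
                }
            }
        }
        for up in &new_upstream {
            self.downstream.entry(up.clone()).or_default().insert(id.clone());
        }
        self.upstream.insert(id.clone(), new_upstream.into_iter().collect());
    }

    /// Number of direct downstream datasets (not transitive).
    fn downstream_count(&self, id: &str) -> usize {
        self.downstream.get(id).map_or(0, |s| s.len())
    }
}
```

Keeping both directions indexed makes the downstream count a map lookup, at the cost of updating two maps per event.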
Measurements show that for ~120 datasets hosted in S3, this initialization takes over 3 s.
This is explained by the full scan of the dataset repository in `DependencyGraphRepositoryInMemory`, which does:
- a full listing of dataset keys in the S3 bucket (`ListObjectsV2`)
- alias resolution for each dataset (`GetObject`)
- a summary read for each dataset (`GetObject`)
In total, that is about 250 S3 calls on the `demo` environment for this operation alone (roughly two `GetObject` calls for each of the ~120 datasets, plus the listing), and it happens every time the API server restarts.
This should be improved by:
- storing dependencies in the database and loading them into the graph in a single SQL transaction
- scanning the S3 bucket, as is done now, only to initially populate the database and for recovery purposes in the future
- making dependency updates (dataset events) maintain both the in-memory graph model and the persistent dependency records
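The proposal above could look roughly like the following sketch. It assumes a hypothetical `DependencyStore` trait with one bulk read and one per-dataset write; `FakeStore` is an in-memory stand-in for the database so the example stays self-contained, and a real implementation would presumably be async and SQL-backed. None of these names come from the existing code:

```rust
use std::collections::HashMap;

/// One persisted edge: `downstream` reads from `upstream`. Hypothetical row shape.
#[derive(Clone, Debug, PartialEq)]
struct DependencyRow {
    downstream: String,
    upstream: String,
}

/// Hypothetical persistence interface for dependency records.
trait DependencyStore {
    /// Intended to map to a single query over the dependencies table.
    fn load_all(&self) -> Vec<DependencyRow>;
    /// Replaces the upstream set of one dataset.
    fn replace_upstream(&mut self, dataset: &str, upstream: &[String]);
}

/// In-memory stand-in for the database, used only for this sketch.
#[derive(Default)]
struct FakeStore {
    rows: Vec<DependencyRow>,
}

impl DependencyStore for FakeStore {
    fn load_all(&self) -> Vec<DependencyRow> {
        self.rows.clone()
    }

    fn replace_upstream(&mut self, dataset: &str, upstream: &[String]) {
        self.rows.retain(|r| r.downstream.as_str() != dataset);
        self.rows.extend(upstream.iter().map(|u| DependencyRow {
            downstream: dataset.to_string(),
            upstream: u.clone(),
        }));
    }
}

/// upstream dataset -> its direct downstream datasets
type DownstreamIndex = HashMap<String, Vec<String>>;

/// Builds the in-memory model from rows fetched in one bulk read.
fn build_index(rows: &[DependencyRow]) -> DownstreamIndex {
    let mut index = DownstreamIndex::new();
    for row in rows {
        index
            .entry(row.upstream.clone())
            .or_default()
            .push(row.downstream.clone());
    }
    index
}

/// Handles a "dependencies updated" event by updating both the
/// persistent records and the in-memory model.
fn on_dependencies_updated(
    store: &mut impl DependencyStore,
    index: &mut DownstreamIndex,
    dataset: &str,
    upstream: &[String],
) {
    store.replace_upstream(dataset, upstream);

    // Drop the old edges of `dataset`, then add the new ones.
    for downs in index.values_mut() {
        downs.retain(|d| d.as_str() != dataset);
    }
    index.retain(|_, downs| !downs.is_empty());
    for up in upstream {
        index.entry(up.clone()).or_default().push(dataset.to_string());
    }
}
```

At startup, `build_index(&store.load_all())` replaces the per-dataset S3 reads with one bulk read, and the S3 scan is needed only when the table is empty or has to be rebuilt.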
Also consider moving the graph initialization to the `pre_run()` phase: postponing the load makes little sense, since all users start from the Search page.
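The difference between the current lazy load and an eager load at startup can be sketched with `std::sync::OnceLock`. The `LazyGraph` type, its `pre_run` method, and `load_edges` are hypothetical illustrations; the real `pre_run()` hook and loading code may look quite different:

```rust
use std::sync::OnceLock;

/// Hypothetical edge list standing in for the real graph model.
type Edges = Vec<(String, String)>;

/// Placeholder for the expensive load (database query or S3 scan).
fn load_edges() -> Edges {
    vec![("a".to_string(), "b".to_string())]
}

struct LazyGraph {
    edges: OnceLock<Edges>,
}

impl LazyGraph {
    fn new() -> Self {
        Self { edges: OnceLock::new() }
    }

    /// Lazy path: the first caller pays the loading cost.
    fn edges(&self) -> &Edges {
        self.edges.get_or_init(load_edges)
    }

    /// Eager path: called once during startup so that no user request
    /// has to wait for the load.
    fn pre_run(&self) {
        let _ = self.edges();
    }

    fn is_loaded(&self) -> bool {
        self.edges.get().is_some()
    }
}
```

With this shape, queries keep calling `edges()` unchanged; invoking `pre_run()` during startup only moves the cost from the first request to server boot.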