Handle rapid deletion and recreation of subgraphs more gracefully #4044
Conversation
This also fixes #4005
async fn stop_subgraph(&self, loc: &DeploymentLocator) -> Result<(), StoreError> {
    // Remove the writable from the cache and stop it
    let deployment = loc.id.into();
    let writable = self.writables.lock().unwrap().remove(&deployment);
If I read the docs right, this would only panic due to a programming error, so it is relatively safe to unwrap here.
Is that correct?
It would panic if the lock is poisoned, i.e., if some other thread panicked while holding the lock. In that case, it's anybody's guess what's happening, and panicking here is really the only thing we can do.
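To make that concrete, here is a minimal standalone sketch (not graph-node code) of the behavior being discussed: lock() only returns an error if another thread panicked while holding the mutex, which is what poisoning means.

use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let cache: Arc<Mutex<Vec<u32>>> = Arc::new(Mutex::new(Vec::new()));

    // Deliberately poison the mutex: this thread panics while holding the guard.
    let poisoner = Arc::clone(&cache);
    let _ = thread::spawn(move || {
        let _guard = poisoner.lock().unwrap();
        panic!("thread panics while holding the lock");
    })
    .join();

    // The lock is now poisoned, so unwrapping here would panic too. In normal
    // operation (no thread ever panicked while holding the lock), lock() cannot fail.
    match cache.lock() {
        Ok(guard) => println!("got {} entries", guard.len()),
        Err(poisoned) => println!("lock poisoned: {}", poisoned),
    }
}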
enum ExecResult {
    Continue,
    Stop,
Just a suggestion, so feel free to ignore this comment.
This enum is somewhat similar to ControlFlow from std.
While we don't need to return any data out of this, maybe we could redefine it as
type ExecResult = ControlFlow<()>
I know this won't change anything, but it would make room if we want to return data from those contexts in the future.
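For illustration, a small self-contained sketch of what that alias could look like; the step function and the driving loop below are invented for the example and are not taken from graph-node.

use std::ops::ControlFlow;

// The two-variant enum maps onto std's ControlFlow, with () as the
// Break payload until there is actual data to return.
type ExecResult = ControlFlow<()>;

// Hypothetical worker step showing how callers would produce and match on it.
fn step(remaining: usize) -> ExecResult {
    if remaining == 0 {
        ControlFlow::Break(()) // was ExecResult::Stop
    } else {
        ControlFlow::Continue(()) // was ExecResult::Continue
    }
}

fn main() {
    let mut remaining = 3;
    loop {
        match step(remaining) {
            ControlFlow::Continue(()) => remaining -= 1,
            ControlFlow::Break(()) => break,
        }
    }
}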
@@ -499,6 +513,8 @@ impl SubgraphStoreInner {
        #[cfg(not(debug_assertions))]
        assert!(!replace);

        self.evict(&schema.id)?;
Is the reason we call evict on deployment creation so we have more confidence that the cache will be cleared?
Yes, to make sure that we don't have a stale entry for the same hash in the cache (which could happen if a previous deployment of the same hash was stopped and then redeployed). This is really redundant with the eviction in stop_subgraph, but since it's a cheap operation it seemed safer to do it in both places.
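A hypothetical sketch of that "evict in both places" pattern, for illustration only; the struct, field, and method names below are invented and are not the actual SubgraphStoreInner API.

use std::collections::HashMap;
use std::sync::Mutex;

// Hypothetical store with a cache keyed by deployment hash.
struct Store {
    cache: Mutex<HashMap<String, u32>>,
}

impl Store {
    // Called when a subgraph is stopped: evict its cache entry.
    fn stop_subgraph(&self, hash: &str) {
        self.evict(hash);
        // ... stop the writable, flush pending work, etc.
    }

    // Called when a deployment is created: evicting here is redundant with
    // stop_subgraph, but cheap, and guards against a stale entry when the
    // same hash is deleted and quickly redeployed.
    fn create_deployment(&self, hash: &str) {
        self.evict(hash);
        // ... set up the new deployment ...
    }

    fn evict(&self, hash: &str) {
        self.cache.lock().unwrap().remove(hash);
    }
}

fn main() {
    let store = Store { cache: Mutex::new(HashMap::new()) };
    store.stop_subgraph("QmExampleHash");
    store.create_deployment("QmExampleHash");
}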
LGTM.
I left a few questions to help me understand a little more about how deployment handling works.
…oyment

Previously, the query would treat a non-existent deployment like one that had the block pointer set already.
Evict deployments from internal caches when

* a new one is created, which covers the case where a deployment is deleted and then quickly recreated
* a subgraph is stopped

Note that these evictions will only affect index nodes; query nodes will keep cache entries until the entry expires.

Fixes #4005
In development and similar settings, a subgraph often gets deleted, completely removed, and then immediately redeployed. This usually fails because of some internal caches that graph-node keeps. This PR removes these failures on indexing nodes. If there are dedicated query nodes, they will still not pick up the deployment until their internal caches get updated (by default, in at most 5 minutes). Installations that use a single node, or where the main concern is the index node, will no longer be affected by this cache issue.