
Handle rapid deletion and recreation of subgraphs more gracefully #4044

Merged 4 commits from lutter/stop into master on Oct 14, 2022

Conversation

@lutter (Collaborator) commented Oct 7, 2022

In development and similar settings, a subgraph often gets deleted, completely removed, and then immediately redeployed. This usually fails because of internal caches that graph-node keeps. This PR removes these failures on index nodes. If there are dedicated query nodes, they will still not pick up the deployment until their internal caches get updated (by default, within at most 5 minutes). Installations that use a single node, or where the main concern is the index node, will no longer be affected by this cache issue.

@lutter (Collaborator, Author) commented Oct 13, 2022

This also fixes #4005

async fn stop_subgraph(&self, loc: &DeploymentLocator) -> Result<(), StoreError> {
    // Remove the writable from the cache and stop it
    let deployment = loc.id.into();
    let writable = self.writables.lock().unwrap().remove(&deployment);
Contributor

If I read the docs right, this would only panic due to a programming error, so it is relatively safe to unwrap here.
Is that correct?

@lutter (Collaborator, Author)

It would panic if the lock is poisoned, i.e., if some other thread panicked while holding the lock. In that case, it's anybody's guess what's happening, and panicking here is really the only thing we can do.
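For context, a minimal sketch (plain std, not graph-node code) of the behavior under discussion: `Mutex::lock` returns an `Err` only when the mutex is poisoned, that is, when another thread panicked while holding the guard, so the `.unwrap()` in `stop_subgraph` panics exactly in that situation.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

fn main() {
    let m = Arc::new(Mutex::new(0u32));

    // Poison the mutex by panicking while the guard is held.
    let m2 = Arc::clone(&m);
    let _ = thread::spawn(move || {
        let _guard = m2.lock().unwrap();
        panic!("thread panics while holding the lock");
    })
    .join();

    // From here on, lock() returns Err(PoisonError); calling unwrap() on it
    // would panic, which is the case the review comment is about.
    match m.lock() {
        Ok(guard) => println!("value: {}", *guard),
        Err(poisoned) => println!("poisoned, inner value: {}", *poisoned.into_inner()),
    }
}
```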

Comment on lines +445 to +447

enum ExecResult {
    Continue,
    Stop,
Contributor

Just a suggestion, so feel free to ignore this comment.

This enum is somewhat similar to ControlFlow from std.

While we don't need to return any data out of this, maybe we could redefine it as

type ExecResult = ControlFlow<()>

I know this won't change anything now, but it would make room if we want to return data from those contexts in the future.
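For illustration, a small sketch of what that alias could look like; the `step` function and its `done` flag are made up for the example, only `ControlFlow` itself is from std:

```rust
use std::ops::ControlFlow;

// The existing two-variant enum could become a type alias.
// ControlFlow<B, C = ()> has Continue(C) and Break(B) variants; using
// ControlFlow<()> today leaves room to carry data in Break(...) later
// without changing callers that only match on the variant.
type ExecResult = ControlFlow<()>;

fn step(done: bool) -> ExecResult {
    if done {
        ControlFlow::Break(())    // was ExecResult::Stop
    } else {
        ControlFlow::Continue(()) // was ExecResult::Continue
    }
}

fn main() {
    assert!(step(true).is_break());
    assert!(step(false).is_continue());
}
```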

@@ -499,6 +513,8 @@ impl SubgraphStoreInner {
        #[cfg(not(debug_assertions))]
        assert!(!replace);

        self.evict(&schema.id)?;
Contributor

Is the reason we call evict on deployment creation that it gives us more confidence the cache will be cleared?

@lutter (Collaborator, Author)

Yes, to make sure that we don't have a stale entry for the same hash in the cache (which could happen if a previous deployment of the same hash was stopped and then redeployed). This is really redundant with the eviction in stop_subgraph, but since it's a cheap operation, it seemed safer to do it in both places.
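To make the double eviction concrete, here is a hedged sketch; `SiteCache` and its methods below are simplified stand-ins for illustration, not the actual graph-node types:

```rust
use std::collections::HashMap;
use std::sync::Mutex;

// Simplified stand-in for a cache keyed by deployment hash.
struct SiteCache {
    by_hash: Mutex<HashMap<String, u64>>, // hash -> deployment id
}

impl SiteCache {
    // Eviction is cheap and idempotent, so calling it from both paths is safe.
    fn evict(&self, hash: &str) {
        self.by_hash.lock().unwrap().remove(hash);
    }

    fn create_deployment(&self, hash: &str, id: u64) {
        // Drop any stale entry left over from an earlier deployment of the
        // same hash that was stopped and is now being redeployed.
        self.evict(hash);
        self.by_hash.lock().unwrap().insert(hash.to_string(), id);
    }

    fn stop_subgraph(&self, hash: &str) {
        // The common stop/delete path also clears the entry.
        self.evict(hash);
    }
}
```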

@tilacog (Contributor) left a review

LGTM.
I left a few questions to help me understand a little more about how deployment handling works.

…oyment

Previously, the query would treat a non-existent deployment like one that
had the block pointer set already.

Evict deployments from internal caches when

* a new one is created, which covers the case where a deployment is deleted
  and then quickly recreated
* a subgraph is stopped

Note that these evictions will only affect index nodes; query nodes will
keep cache entries until the entry expires.

Fixes #4005
lutter merged commit 4c25b79 into master on Oct 14, 2022
lutter deleted the lutter/stop branch on October 14, 2022 at 17:54