Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

deployment watcher: fix goroutine leak when job is purged #20348

Merged
merged 2 commits into from
Apr 11, 2024

Conversation

tgross
Copy link
Member

@tgross tgross commented Apr 10, 2024

The deployment watcher on the leader makes blocking queries to detect when the set of active deployments changes. It takes the resulting list of deployments and adds or removes watchers based on whether the deployment is active. But when a job is purged, the deployment will be deleted. This unblocks the query but the query result only shows the remaining deployments.

When the query unblocks, ensure that all active watchers have a corresponding deployment in state. If not, remove the watcher so that the goroutine stops.

Fixes: #19988

The deployment watcher on the leader makes blocking queries to detect when the
set of active deployments changes. It takes the resulting list of deployments
and adds or removes watchers based on whether the deployment is active. But when
a job is purged, the deployment will be deleted. This unblocks the query but
the query result only shows the remaining deployments.

When the query unblocks, ensure that all active watchers have a corresponding
deployment in state. If not, remove the watcher so that the goroutine stops.

Fixes: #19988
@tgross tgross force-pushed the 19988-deploymentwatcher-goroutine-leak branch from 0c69e47 to 004fb71 Compare April 10, 2024 15:33
@tgross tgross added this to the 1.7.x milestone Apr 10, 2024
@tgross tgross added backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line backport/1.7.x backport to 1.7.x release line hcc/cst Admin - internal labels Apr 10, 2024
@tgross tgross marked this pull request as ready for review April 10, 2024 16:35
@@ -191,6 +190,21 @@ func (w *Watcher) watchDeployments(ctx context.Context) {
w.remove(d)
}
}

// Ensure we're not tracking deployments that have been deleted because
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If you check for the number of watchers and number of deployments, you could avoid always checking even when there are no changes

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch. That also made me realize that we're not locking the read on w.watchers here, so I need to fix that.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we can't optimize this with a length check after all. Imagine the following scenario:

  • watchers for deployment A and B exist, len(deployments) == 2, len(watchers) == 2
  • A gets purged, B completes (is no longer active), C gets added
  • after we call add/remove, we have len(deployments) == 2, len(watchers) == 2
  • so now checking the lengths would be equal and we'd skip removing A

I've pulled this block of code out to a function to make the locking nicer, and I've added a note about why we can't optimize it with the length check.

Copy link
Member

@shoenig shoenig left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM!

@tgross tgross merged commit a13e455 into main Apr 11, 2024
19 checks passed
@tgross tgross deleted the 19988-deploymentwatcher-goroutine-leak branch April 11, 2024 18:51
tgross added a commit that referenced this pull request Apr 11, 2024
The deployment watcher on the leader makes blocking queries to detect when the
set of active deployments changes. It takes the resulting list of deployments
and adds or removes watchers based on whether the deployment is active. But when
a job is purged, the deployment will be deleted. This unblocks the query but
the query result only shows the remaining deployments.

When the query unblocks, ensure that all active watchers have a corresponding
deployment in state. If not, remove the watcher so that the goroutine stops.

Fixes: #19988
tgross added a commit that referenced this pull request Apr 11, 2024
…#20348) (#20359)

The deployment watcher on the leader makes blocking queries to detect when the
set of active deployments changes. It takes the resulting list of deployments
and adds or removes watchers based on whether the deployment is active. But when
a job is purged, the deployment will be deleted. This unblocks the query but
the query result only shows the remaining deployments.

When the query unblocks, ensure that all active watchers have a corresponding
deployment in state. If not, remove the watcher so that the goroutine stops.

Fixes: #19988

Co-authored-by: Tim Gross <[email protected]>
tgross added a commit that referenced this pull request Apr 12, 2024
The test added in #20348 for the deployment watcher was backported to 1.5.x but
is hitting a compilation error because the signature of a state store method was
changed.
tgross added a commit that referenced this pull request Apr 12, 2024
The test added in #20348 for the deployment watcher was backported to 1.5.x but
is hitting a compilation error because the signature of a state store method was
changed.
philrenaud pushed a commit that referenced this pull request Apr 18, 2024
The deployment watcher on the leader makes blocking queries to detect when the
set of active deployments changes. It takes the resulting list of deployments
and adds or removes watchers based on whether the deployment is active. But when
a job is purged, the deployment will be deleted. This unblocks the query but
the query result only shows the remaining deployments.

When the query unblocks, ensure that all active watchers have a corresponding
deployment in state. If not, remove the watcher so that the goroutine stops.

Fixes: #19988
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport/1.5.x backport to 1.5.x release line backport/1.6.x backport to 1.6.x release line backport/1.7.x backport to 1.7.x release line hcc/cst Admin - internal theme/deployments type/bug
Projects
None yet
Development

Successfully merging this pull request may close these issues.

use of -purge leads to leak of goroutines
3 participants