-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix GC'd alloc tracking #3445
Fix GC'd alloc tracking #3445
Conversation
client/alloc_runner.go
Outdated
// runner shutting down. Since handleDestroy can be called by Run() we | ||
// can't block shutdown here as it would cause a deadlock. | ||
go r.updater(alloc) | ||
r.updater(alloc) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you explain why the updater is now a blocking call, given the deleted comment about it causing a deadlock if its a blocking call
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A great question that took me a long time to figure out!
The short answer is: because I couldn't find a reason for it, and it works without it being in a goroutine.
The longer answer is much more fulfilling though:
When I first made this hack in #2852, the Client.updateAllocStatus
method that this line calls could trigger a blocking GC of this alloc. However that was fixed in #3007 -- Client.updateAllocStatus
was greatly simplified and can no longer trigger a blocking GC.
It can still however cause a concurrent GC, so I should actually move this update
call after the final saveAllocRunnerState()
call to ensure we don't try to save state after a GC. Thanks to proper locking the worst case scenario should be a noop, but in the past state-saving-after-GC was a source of state corruption errors. So the more defensive we can be the better!
tl;dr - minor updates incoming
client/client.go
Outdated
@@ -1205,6 +1211,7 @@ func (c *Client) updateNodeStatus() error { | |||
for _, s := range resp.Servers { | |||
addr, err := resolveServer(s.RPCAdvertiseAddr) | |||
if err != nil { | |||
c.logger.Printf("[DEBUG] client: ignoring invalid server %q: %v", s.RPCAdvertiseAddr, err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
WARN?
client/client.go
Outdated
@@ -1234,8 +1241,14 @@ func (c *Client) updateNodeStatus() error { | |||
// updateAllocStatus is used to update the status of an allocation | |||
func (c *Client) updateAllocStatus(alloc *structs.Allocation) { | |||
if alloc.Terminated() { | |||
// Terminated, mark for GC | |||
if ar, ok := c.getAllocRunners()[alloc.ID]; ok { | |||
// Terminated, mark for GC iff we're still tracking this alloc |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
iff -> if
client/client.go
Outdated
go c.garbageCollector.Collect(alloc.ID) | ||
|
||
return nil | ||
return |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Remove this?
client/gc.go
Outdated
a.destroyAllocRunner(gcAlloc.allocRunner, "forced collection") | ||
return nil | ||
|
||
a.logger.Printf("[DEBUG] client.gc: alloc %s already garbage collected", allocID) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't necessarily true right? I could do Collect("hello-world")
and this would log it was already collected. I would log that it isn't be tracked by the garbage collector
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, true. This is in the call chain of the GC Alloc API so users could pass in arbitrary data. Will clarify.
client/gc.go
Outdated
func (a *AllocGarbageCollector) numAllocs() int { | ||
return a.allocRunners.Length() + a.allocCounter.NumAllocs() | ||
return a.allocCounter.NumAllocs() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just delete method and use the allocCounter?
job.TaskGroups[0].Tasks[0].Driver = "mock_driver" | ||
job.TaskGroups[0].Tasks[0].Config["run_for"] = "30s" | ||
nodeID := client.Node().ID | ||
if err := state.UpsertJob(98, job); err != nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Use the assert library
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I did at first and then got really confusing test failures. I didn't realize The assert library uses t.Error
instead of t.Fatal
and you need to check its return bool to know whether or not to fail!
So I just skipped the assert library as it didn't seem to add anything.
0d4ec58
to
307a1e5
Compare
@schmichael Actually the test appears to be flaky:
|
TestHTTP_AllocGC_ACL seems to be broken in this branch only as well |
54b3c03
to
4eeb673
Compare
The Client.allocs map now contains all AllocRunners again, not just un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs allocs. Also stops logging "marked for GC" twice.
GC much more aggressively by triggering GCs when allocations become terminal as well as after new allocations are added.
Fixed Merging to 0.7.1. Yell at me if it proves flaky. |
I'm going to lock this pull request because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active contributions. |
The two important changes:
Removes error returns from lots of methods that never actually returned errors. Actual GC'ing by the AllocRunner is a best-effort, so there's rarely a meaningful error to return in those call chains.
Testing the GC in a meaningful way is very difficult. A couple of the existing unit tests made incorrect assumptions about how the units they were testing (the GC's max allocs calculation) behaved, and therefore weren't actually testing any realistic behavior. This is what allowed bugs to persist in the face of what would appear to be reasonable test coverage. I replaced a couple of those small unit tests with a a larger integration test (integration in that it spins up a server & client and black box tests the gc).
Also stops logging "marked for GC" twice.