
Fix GC'd alloc tracking #3445

Merged: 4 commits merged into b-nomad-0.7.1 on Oct 30, 2017
Conversation

@schmichael (Member) commented Oct 25, 2017

The two important changes:

  • The Client.allocs map now contains all AllocRunners again, not just un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs allocs.
  • Client.NumAllocs - which is called by the GC to determine if the node is over the max allocs - properly counts all non-GC'd allocs (see the sketch below).
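
As an illustration of the counting rule above, here is a minimal, self-contained Go sketch. It is not the actual Nomad code: the `client`/`allocRunner` types and the `destroyed` flag are invented stand-ins for the real AllocRunner bookkeeping.

```go
// Minimal sketch only: not the actual Nomad client implementation.
package main

import (
	"fmt"
	"sync"
)

type allocRunner struct {
	id        string
	destroyed bool // assumed flag: set once the runner has been GC'd locally
}

type client struct {
	allocLock sync.RWMutex
	allocs    map[string]*allocRunner // all runners, GC'd or not; pruned only when the server GCs
}

// NumAllocs mirrors the rule described above: iterate every tracked runner
// but count only the ones that have not been destroyed.
func (c *client) NumAllocs() int {
	c.allocLock.RLock()
	defer c.allocLock.RUnlock()
	n := 0
	for _, ar := range c.allocs {
		if !ar.destroyed {
			n++
		}
	}
	return n
}

func main() {
	c := &client{allocs: map[string]*allocRunner{
		"alloc-1": {id: "alloc-1"},
		"alloc-2": {id: "alloc-2", destroyed: true}, // GC'd locally but still tracked
	}}
	fmt.Println(c.NumAllocs()) // prints 1
}
```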

Removes error returns from lots of methods that never actually returned errors. Actual GC'ing by the AllocRunner is best-effort, so there's rarely a meaningful error to return in those call chains.

Testing the GC in a meaningful way is very difficult. A couple of the existing unit tests made incorrect assumptions about how the units they were testing (the GC's max allocs calculation) behaved, and therefore weren't actually testing any realistic behavior. This is what allowed bugs to persist in the face of what would appear to be reasonable test coverage. I replaced a couple of those small unit tests with a larger integration test (integration in that it spins up a server & client and black-box tests the GC).

Also stops logging "marked for GC" twice.

-	// runner shutting down. Since handleDestroy can be called by Run() we
-	// can't block shutdown here as it would cause a deadlock.
-	go r.updater(alloc)
+	r.updater(alloc)
@preetapan (Contributor) commented Oct 26, 2017

Can you explain why the updater is now a blocking call, given the deleted comment about it causing a deadlock if it's a blocking call?

@schmichael (Member, Author) replied

A great question that took me a long time to figure out!

The short answer is: because I couldn't find a reason for it, and it works without it being in a goroutine.

The longer answer is much more fulfilling though:

When I first made this hack in #2852, the Client.updateAllocStatus method that this line calls could trigger a blocking GC of this alloc. However that was fixed in #3007 -- Client.updateAllocStatus was greatly simplified and can no longer trigger a blocking GC.

It can still, however, cause a concurrent GC, so I should actually move this update call after the final saveAllocRunnerState() call to ensure we don't try to save state after a GC. Thanks to proper locking the worst-case scenario should be a noop, but in the past state-saving-after-GC was a source of state corruption errors, so the more defensive we can be the better!

tl;dr - minor updates incoming
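
For illustration only, a minimal sketch of the ordering described above. The `runner` type and its `saveState`/`updater` fields are hypothetical stand-ins for `saveAllocRunnerState()` and `r.updater(alloc)`; this is not the actual AllocRunner code.

```go
// Minimal sketch only: hypothetical stand-ins, not the actual AllocRunner.
package main

import "fmt"

type runner struct {
	saveState func() error        // stand-in for saveAllocRunnerState()
	updater   func(status string) // stand-in for r.updater(alloc)
}

// finish persists state one last time and only then delivers the terminal
// update synchronously. Any GC triggered by the update therefore cannot be
// followed by another state write for this alloc.
func (r *runner) finish(status string) {
	if err := r.saveState(); err != nil {
		fmt.Println("[WARN] failed to save final state:", err)
	}
	// A blocking call is fine here: per the discussion above, the update
	// path no longer triggers a blocking GC, and nothing is saved after
	// this point.
	r.updater(status)
}

func main() {
	r := &runner{
		saveState: func() error { fmt.Println("state saved"); return nil },
		updater:   func(s string) { fmt.Println("status updated:", s) },
	}
	r.finish("complete")
}
```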

client/client.go Outdated
@@ -1205,6 +1211,7 @@ func (c *Client) updateNodeStatus() error {
 	for _, s := range resp.Servers {
 		addr, err := resolveServer(s.RPCAdvertiseAddr)
 		if err != nil {
+			c.logger.Printf("[DEBUG] client: ignoring invalid server %q: %v", s.RPCAdvertiseAddr, err)
Contributor commented

WARN?

client/client.go Outdated
@@ -1234,8 +1241,14 @@ func (c *Client) updateNodeStatus() error {
 // updateAllocStatus is used to update the status of an allocation
 func (c *Client) updateAllocStatus(alloc *structs.Allocation) {
 	if alloc.Terminated() {
-		// Terminated, mark for GC
+		if ar, ok := c.getAllocRunners()[alloc.ID]; ok {
+			// Terminated, mark for GC iff we're still tracking this alloc
Contributor commented

iff -> if

client/client.go Outdated
 	go c.garbageCollector.Collect(alloc.ID)

-	return nil
+	return
Contributor commented

Remove this?

client/gc.go Outdated
 	a.destroyAllocRunner(gcAlloc.allocRunner, "forced collection")
 	return nil

+	a.logger.Printf("[DEBUG] client.gc: alloc %s already garbage collected", allocID)
Contributor commented

This isn't necessarily true, right? I could do Collect("hello-world") and this would log that it was already collected. I would log that it isn't being tracked by the garbage collector.

@schmichael (Member, Author) replied

Ah, true. This is in the call chain of the GC Alloc API so users could pass in arbitrary data. Will clarify.
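
As an illustration of the log-message distinction the reviewer is asking for, here is a small, self-contained sketch. The `garbageCollector` type and its `tracked` map are hypothetical stand-ins, not the actual client/gc.go implementation.

```go
// Minimal sketch only: hypothetical stand-ins, not the actual client/gc.go.
package main

import "log"

type garbageCollector struct {
	tracked map[string]bool // alloc IDs this collector currently tracks
}

// Collect destroys a tracked alloc; for an unknown ID it logs that the alloc
// is not tracked rather than claiming it was "already garbage collected",
// since the GC Alloc API lets users pass in arbitrary IDs.
func (gc *garbageCollector) Collect(allocID string) {
	if gc.tracked[allocID] {
		log.Printf("[INFO] client.gc: garbage collecting alloc %s (forced collection)", allocID)
		delete(gc.tracked, allocID)
		return
	}
	log.Printf("[DEBUG] client.gc: alloc %s is not tracked by the garbage collector", allocID)
}

func main() {
	gc := &garbageCollector{tracked: map[string]bool{"alloc-1": true}}
	gc.Collect("alloc-1")     // tracked: collected
	gc.Collect("hello-world") // unknown ID: logged as not tracked
}
```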

client/gc.go Outdated
 func (a *AllocGarbageCollector) numAllocs() int {
-	return a.allocRunners.Length() + a.allocCounter.NumAllocs()
+	return a.allocCounter.NumAllocs()
Contributor commented

Just delete the method and use the allocCounter?

job.TaskGroups[0].Tasks[0].Driver = "mock_driver"
job.TaskGroups[0].Tasks[0].Config["run_for"] = "30s"
nodeID := client.Node().ID
if err := state.UpsertJob(98, job); err != nil {
Contributor commented

Use the assert library

@schmichael (Member, Author) replied

I did at first and then got really confusing test failures. I didn't realize the assert library uses t.Error instead of t.Fatal, so you need to check its return bool to know whether or not to bail out!

So I just skipped the assert library as it didn't seem to add anything.
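
For reference, a minimal sketch of the assert-vs-require behavior described above, assuming "the assert library" refers to github.com/stretchr/testify (an assumption; the comment does not name the package): assert.* reports failures via t.Error and returns a bool, while require.* stops the test immediately.

```go
// Minimal sketch of the assert-vs-require difference, assuming testify.
package example

import (
	"testing"

	"github.com/stretchr/testify/assert"
	"github.com/stretchr/testify/require"
)

func TestAssertVsRequire(t *testing.T) {
	got := compute() // hypothetical function under test

	// assert.Equal reports failures via t.Error and returns a bool, so the
	// test keeps running unless the caller checks that bool and bails out.
	if !assert.Equal(t, 1, got) {
		return // roughly what t.Fatal would have done
	}

	// require.Equal fails and stops the test immediately (t.FailNow).
	require.Equal(t, 1, got)
}

func compute() int { return 1 }
```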

@schmichael schmichael changed the title Fix GC'd alloc tracking [WIP] Fix GC'd alloc tracking Oct 26, 2017
@schmichael schmichael changed the base branch from b-nomad-0.7.1 to master October 26, 2017 21:47
@schmichael schmichael force-pushed the b-gc branch 2 times, most recently from 0d4ec58 to 307a1e5 on October 26, 2017 23:41
@schmichael schmichael changed the title [WIP] Fix GC'd alloc tracking Fix GC'd alloc tracking Oct 26, 2017
@dadgar (Contributor) commented Oct 27, 2017

@schmichael Actually the test appears to be flaky:

--- FAIL: TestAllocGarbageCollector_MakeRoomForAllocations_MaxAllocs (140.67s)
	gc_test.go:335: 2 alloc state: expected 7 allocs (found 7); expected 1 destroy (found 4)

@dadgar (Contributor) commented Oct 27, 2017

TestHTTP_AllocGC_ACL seems to be broken only in this branch as well

@schmichael schmichael force-pushed the b-gc branch 3 times, most recently from 54b3c03 to 4eeb673 on October 30, 2017 16:43
The Client.allocs map now contains all AllocRunners again, not just
un-GC'd AllocRunners. Client.allocs is only pruned when the server GCs
allocs.

Also stops logging "marked for GC" twice.
GC much more aggressively by triggering GCs when allocations become
terminal as well as after new allocations are added.
@schmichael schmichael changed the base branch from master to b-nomad-0.7.1 October 30, 2017 22:19
@schmichael (Member, Author) commented

Fixed TestHTTP_AllocGC_ACL, and TestAllocGarbageCollector_MakeRoomForAllocations_MaxAllocs has remained passing for at least 3 Travis builds after I made some tweaks.

Merging to 0.7.1. Yell at me if it proves flaky.

@schmichael schmichael merged commit 84b9c3e into b-nomad-0.7.1 Oct 30, 2017
@schmichael schmichael deleted the b-gc branch October 30, 2017 22:21
schmichael added a commit that referenced this pull request Nov 2, 2017
@schmichael schmichael mentioned this pull request Nov 2, 2017