-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Read latest configuration independently from main loop #379
Conversation
Another pretty important issue this caused for future reference is when restoring a large snapshot the leader/follower loop might be blocked for seconds or minutes resulting in Stats() calls hanging "forever". Thanks for getting to this @i0rek. @catsby FYI I think you were looking into this at some point so just flagging the PR in case you missed it and are interested. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think using an atomic value has lower risk for blocking raft loop as noted inline. That said, I'm open to objections if I missed something.
api.go
Outdated
@@ -138,6 +138,11 @@ type Raft struct { | |||
// the log/snapshot. | |||
configurations configurations | |||
|
|||
// Holds a copy of the latest configuration which can be read | |||
// independently from main loop. | |||
latestConfiguration Configuration |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should we be returning the latest configuration or the latest comitted configuration? I may well be misremembering but I feel like we shouldn't be using/returning the config until it's committed right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nm. GetConfiguration
docs state:
GetConfiguration returns the latest configuration. This may not yet be currently in use. This may not yet be committed...
SO this preserves that behaviour although frankly that seems strange to me that Consul or an application would "see" a config that tis not the one actually in use and may never get committed by a quorum 🤔.
I guess that is an edge case issue we can solve separately though this preserves same behaviour while fixing the performance issue.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I was wondering about the same thing, but wanted to preserve current behaviour.
} | ||
configReq.configurations = configurations{latest: r.getLatestConfiguration()} | ||
configReq.respond(nil) | ||
return configReq |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this make case future := <-r.configurationsCh:
in leader/follower loop dead code now? Should we remove that?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We are still using it in our tests and adding an alternative would be more work. I decided to leave it as is so that this PR doesn't get out of hand. But it is essentially dead code now.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In theory configurations
could be an atomic value, and we only ever load and store. That way I could get rid of the channel and fix the tests. But that would be quite some places to change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 hmm. Making part of the internal workings atomic and accessing it directly in tests could introduce other issues though. Atomics prevent data races but not logical ones. I think the usage here is fine as we are limiting to only a single writer and wrapping concurrent readers so they don't have direct access but making internal state atomic and then tweaking it from tests could get brittle fast if we ever update or make assumptions about whether then atomic value.
Does leaving tests using a different code path to real callers cause issues? If not I'm inclined to leave it to.
Oh we also still use it in takeSnapshot
. We also rely in those places on being able to get both the latest and the committed configs which means the GetConfiguration can't do the same job.
So I say it's OK to leave it for now as it is as those call sites need to lower level and more accurate answer, this just optimises for external callers.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
👍
api.go
Outdated
// Holds a copy of the latest configuration which can be read | ||
// independently from main loop. | ||
latestConfiguration Configuration | ||
latestConfigurationLock sync.RWMutex |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this would probably work well for most use cases but there is a risk that very heavy reads of GetConfiguration
can starve the main loop on a config change - i.e. cause raft to block indefinitely.
For example in Consul if there were some script calling agent info
on a server that accidentally got into a hot loop and called it constantly, it could bring the whole of raft grinding to a halt. While that's hopefully not super likely it feels like a pretty bad outcome.
I think in this case atomic.Value
might be a better way to syncronize Load and Store of this copy. Then readers can read without blocking using latestConfiguration.Load()
and high contention might slow the leaderloop a little due to cross-processor core contention on the cache line but it can't starve it indefinitely.
What do you think?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that it would be better to use atomic and I will change it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great job!
@@ -138,6 +139,10 @@ type Raft struct { | |||
// the log/snapshot. | |||
configurations configurations | |||
|
|||
// Holds a copy of the latest configuration which can be read | |||
// independently from main loop. | |||
latestConfiguration atomic.Value |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record.. I looked it up and atomic.Value
is itself a struct which means even as a non-pointer field like this it will be aligned to 64 bit boundaries and the atomically accessed pointer within it won't have alignment issues.
Before reading the raft configuration required the main loop to work off the
configurationsCh
. This is problematic in different scenarios:The main reason why that was the case is that the
configurations
attribute must not be accessed in parallel which was guaranteed because it was only ever read or written from theleaderLoop
.By adding a copy of the latest configuration (which is ultimately what was returned) and protecting it with its own mutex, it is now possible to access the latest configuration from outside of the leader loop.
Fixes #302, #356, and improves hashicorp/consul#6852.