Fixes races with Raft configurations member, adds GetConfiguration accessor. #134
Haven't looked through this patch yet, but I was sort of thinking GetConfiguration would be treated like the other application calls that get shoved into a channel as some future type for leaderLoop to process without additional locking. Will read through and think more on this later.
We could definitely do it with channels, I think. I'll take a crack at that and see if it simplifies things.
1e18bcb to 264bfc2
@ongardie this is ready for a look. Using a channel was definitely more Go-like and kept a lot of lock cruft out of the main thread, where we can just work with the configuration directly.
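For readers following along, the channel approach discussed above can be sketched roughly as follows. This is a minimal illustration, not the code from this PR: the names node, configurationFuture, configCh, and newNode are hypothetical, and Server and Configuration are simplified stand-ins for the library's types. The point is that only the main loop ever touches the configuration, so no lock is needed.

```go
package main

import "fmt"

// Server and Configuration are simplified stand-ins (assumed shapes).
type Server struct {
	ID      string
	Address string
}

type Configuration struct {
	Servers []Server
}

// Clone copies the Servers slice so callers never share memory with the owner.
func (c Configuration) Clone() Configuration {
	return Configuration{Servers: append([]Server(nil), c.Servers...)}
}

// configurationFuture is a hypothetical request object. The caller sends it
// to the main loop and waits on respCh for the reply.
type configurationFuture struct {
	respCh chan Configuration
}

// node is a hypothetical stand-in for the Raft struct.
type node struct {
	configCh   chan *configurationFuture
	shutdownCh chan struct{}
}

func newNode(initial Configuration) *node {
	n := &node{
		configCh:   make(chan *configurationFuture),
		shutdownCh: make(chan struct{}),
	}
	go n.run(initial)
	return n
}

// run is the "main thread": it is the only goroutine that reads or writes
// the current configuration, so access is serialized without a mutex.
func (n *node) run(current Configuration) {
	for {
		select {
		case f := <-n.configCh:
			f.respCh <- current.Clone()
		case <-n.shutdownCh:
			return
		}
	}
}

// GetConfiguration asks the main loop for a copy of the configuration.
// In this sketch it must not be called after shutdown, or it would block.
func (n *node) GetConfiguration() Configuration {
	f := &configurationFuture{respCh: make(chan Configuration, 1)}
	n.configCh <- f
	return <-f.respCh
}

func main() {
	n := newNode(Configuration{Servers: []Server{{ID: "a", Address: "10.0.0.1:8300"}}})
	defer close(n.shutdownCh)

	cfg := n.GetConfiguration()
	fmt.Println(len(cfg.Servers), cfg.Servers[0].ID)
}
```

A real implementation would also need to handle shutdown and errors on the caller side; that is left out here to keep the ownership idea visible.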
This isn't strictly related to this change but it's nice to reduce the scope of raft.go if we can, especially since we tried to tag "runs in the main thread" or not as part of this change (this code doesn't care).
cloned.Servers[1].ID = "scribble"
if sampleConfiguration.Servers[1].ID == "scribble" {
	t.Fatalf("cloned configuration shouldn't alias Servers")
}
Split here into two separate test functions?
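As background for the aliasing check in the snippet above: copying a Go struct copies only the slice header, so a clone that does not copy the backing array would let writes leak back into the original. A small sketch, assuming simplified Server and Configuration types (not the library's full definitions):

```go
package main

import "fmt"

// Simplified stand-ins for the types under test (assumed shapes).
type Server struct{ ID string }

type Configuration struct{ Servers []Server }

// Clone copies the backing array so the result shares no memory with c.
func (c Configuration) Clone() Configuration {
	servers := make([]Server, len(c.Servers))
	copy(servers, c.Servers)
	return Configuration{Servers: servers}
}

func main() {
	sample := Configuration{Servers: []Server{{ID: "id0"}, {ID: "id1"}}}

	// Plain assignment copies the slice header only, so both values
	// point at the same backing array.
	shallow := sample
	shallow.Servers[1].ID = "scribble"
	fmt.Println(sample.Servers[1].ID) // prints "scribble"

	// Reset, then repeat with Clone: the original is left untouched.
	sample.Servers[1].ID = "id1"
	cloned := sample.Clone()
	cloned.Servers[1].ID = "scribble"
	fmt.Println(sample.Servers[1].ID) // prints "id1"
}
```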
I'm still trying to wrap my head around how snapshots are taken...
// the configuration changes. It should be ok in practice with normal
// application traffic flowing through the FSM. If there's none of that
// then it's not crucial that we snapshot, since there's not much going
// on Raft-wise.
Why would little/no normal traffic cause the state machine to fall behind on applying entries?
It wouldn't cause the state machine to fall behind. Imagine that the last commit at index X was a config change. The FSM won't see this log, so when you try to snapshot the FSM it will be at X-1 so this will hit the check above and not write the snapshot. Once X+1 comes along and the FSM ingests it then we will be happy to snapshot again with the config change at index X.
This depends on the FSM having seen some index past the config change in order to take a snapshot, which relies on some application-related logs to come through and hit the FSM.
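The check being described can be sketched like this. The function name checkSnapshotAllowed and the error value are hypothetical, chosen only for illustration; the logic follows the explanation above, where a snapshot is refused while the FSM's last applied index is behind the index of the committed configuration change.

```go
package main

import (
	"errors"
	"fmt"
)

// errConfigNotApplied is a hypothetical error for this sketch.
var errConfigNotApplied = errors.New("snapshot skipped: FSM has not passed the committed configuration index")

// checkSnapshotAllowed refuses a snapshot while the FSM is behind the
// committed configuration change. fsmLastApplied is the last index the FSM
// has ingested; configIndex is the index X of the configuration entry.
func checkSnapshotAllowed(fsmLastApplied, configIndex uint64) error {
	if fsmLastApplied < configIndex {
		return errConfigNotApplied
	}
	return nil
}

func main() {
	const configIndex = 10 // the config change committed at index X

	// The FSM never sees the config entry itself, so it sits at X-1 until
	// an application log at X+1 arrives.
	for _, applied := range []uint64{configIndex - 1, configIndex + 1} {
		fmt.Println(applied, checkSnapshotAllowed(applied, configIndex))
	}
}
```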
Ah. In LogCabin, the state machine is fed no-op entries to increment its "last applied" index.
Interesting - that would prevent this wrinkle.
It seems like processLog() could just write to fsmCommitCh for any type of log entry, and the FSM already filters out uninteresting ones. We'd just have to remove the switch statement out of processLog().
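A rough sketch of that suggestion, under the assumption that the FSM side can cheaply skip non-command entries. The types and the fsmRunner and applyAll helpers are hypothetical and much simpler than the library's; the idea is only that every committed entry advances the FSM's last applied index, while only command entries reach the application.

```go
package main

import "fmt"

type LogType int

const (
	LogCommand LogType = iota
	LogConfiguration
	LogNoop
)

// Log is a simplified stand-in for a Raft log entry.
type Log struct {
	Index uint64
	Type  LogType
	Data  []byte
}

// fsmRunner is a hypothetical wrapper around the application's FSM.
type fsmRunner struct {
	lastApplied uint64
	applied     [][]byte
}

// run consumes every committed entry. Only commands are handed to the
// application, but lastApplied advances for all entry types.
func (f *fsmRunner) run(commitCh <-chan Log, done chan<- struct{}) {
	for l := range commitCh {
		if l.Type == LogCommand {
			f.applied = append(f.applied, l.Data)
		}
		f.lastApplied = l.Index
	}
	close(done)
}

// processLog forwards every entry, with no switch on the entry type.
func processLog(l Log, fsmCommitCh chan<- Log) {
	fsmCommitCh <- l
}

// applyAll feeds the given logs through processLog and waits for the FSM.
func applyAll(logs []Log) *fsmRunner {
	f := &fsmRunner{}
	commitCh := make(chan Log)
	done := make(chan struct{})
	go f.run(commitCh, done)
	for _, l := range logs {
		processLog(l, commitCh)
	}
	close(commitCh)
	<-done
	return f
}

func main() {
	f := applyAll([]Log{
		{Index: 1, Type: LogCommand, Data: []byte("set x=1")},
		{Index: 2, Type: LogConfiguration},
	})
	// The config change at index 2 is the last entry, yet lastApplied is 2.
	fmt.Println(f.lastApplied, len(f.applied))
}
```

With this shape, a configuration change as the most recent commit would no longer hold the FSM at X-1.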
I added item 21 to #84 to track this since it's probably better done in a separate, small PR.
There's the We used to have the config stuff on the FSM side, but that was one level deeper than it needed to be, and the FSM doesn't have anything to do with the configuration changes. I suspect that this division of things might be a bit of a leftover from an earlier version of the library, but the one nice thing about this architecture is that the FSM is either taking a snapshot, restoring a snapshot, or applying changes - all the serialization happens by virtue of the single |
r.logger.Printf("[INFO] raft: Starting snapshot up to %d", req.index)
// Make a request for the configurations and extract the committed info.
// We have to use the future here to safely get this information since
// it is owned by the main thread.
Good, safer. Who knew we had all these races? :/
I did it by inspection but ran go test -race and it saw some of these, too :-)
@slackpad, thanks, after reading through, that makes sense to me now.
Well, lgtm. We can split those files in a later PR too if that's easier.
It was easy enough I rolled in the split as the last change. Thanks for taking a look!
Also, thanks for so eagerly porting over to use channels. Didn't expect that to happen so fast last night :)
The locks did make me feel gross, and like 95% of them were almost always useless b/c it's all in a single thread - channels were dramatically better in this case!
This adds GetConfiguration(), which is item 7 from #84 (comment). This also resolves item 12.
While adding that I realized that we should move the "no snapshots during config changes" check outside the FSM snapshot thread, both to fix racy access to the configurations and because that's not really a concern of the FSM. That thread just exists to let the FSM do its thing, but the main snapshot thread is the one that should know about the configuration.
To add an external GetConfiguration() API I added an RW mutex, which I'm not super thrilled about, but it lets us correctly manipulate this state from the outside and among the different Raft threads.
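For contrast with the channel version adopted later in the review, the mutex approach described here looks roughly like the following. This is a sketch with hypothetical names (raftState, setConfiguration) and simplified types, not the code from the original commit.

```go
package main

import (
	"fmt"
	"sync"
)

// Simplified stand-ins for the library's types (assumed shapes).
type Server struct{ ID string }

type Configuration struct{ Servers []Server }

// raftState is a hypothetical holder for the shared configuration.
type raftState struct {
	mu     sync.RWMutex
	latest Configuration
}

// setConfiguration is called by whichever Raft goroutine changes membership.
func (r *raftState) setConfiguration(c Configuration) {
	r.mu.Lock()
	defer r.mu.Unlock()
	r.latest = Configuration{Servers: append([]Server(nil), c.Servers...)}
}

// GetConfiguration takes the read lock and returns a copy, so external
// callers cannot mutate state that Raft goroutines are using.
func (r *raftState) GetConfiguration() Configuration {
	r.mu.RLock()
	defer r.mu.RUnlock()
	return Configuration{Servers: append([]Server(nil), r.latest.Servers...)}
}

func main() {
	var r raftState
	r.setConfiguration(Configuration{Servers: []Server{{ID: "a"}, {ID: "b"}}})
	fmt.Println(len(r.GetConfiguration().Servers))
}
```

The cost is that every internal read and write of the configuration has to take the lock, even when it all happens on a single goroutine.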