Fixes races with Raft configurations member, adds GetConfiguration accessor. #134

Merged
11 commits merged into issue-84-integration on Jul 9, 2016

Conversation

@slackpad (Contributor) commented Jul 8, 2016

This adds GetConfiguration(), which is item 7 from #84 (comment). It also resolves item 12.

While adding that I realized that we should move the "no snapshots during config changes" check outside the FSM snapshot thread, both to fix racy access to the configurations and because that's not really a concern of the FSM. That thread just exists to let the FSM do its thing, but the main snapshot thread is the one that should know about the configuration.

To add an external GetConfiguration() API I added an RW mutex, which I'm not super thrilled about, but it lets us correctly manipulate this state from the outside and among the different Raft threads.

@slackpad changed the title from "Fixes races with Raft configuration." to "Fixes races with Raft configuration member, adds GetConfiguration accessor." on Jul 8, 2016
@ongardie (Contributor) commented Jul 8, 2016

Haven't looked through this patch yet, but I was sort of thinking GetConfiguration would be treated like the other application calls that get shoved into a channel as some future type for leaderLoop to process without additional locking. Will read through and think more on this later.

@slackpad (Contributor, Author) commented Jul 8, 2016

We could definitely do it with channels, I think. I'll take a crack at that and see if it simplifies things.

@slackpad force-pushed the b-racy-configuration branch from 1e18bcb to 264bfc2 on July 8, 2016 07:13
@slackpad (Contributor, Author) commented Jul 8, 2016

@ongardie this is ready for a look. Using a channel was definitely more Go-like and kept a lot of lock cruft out of the main thread, where we can just work with the configuration directly.
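Roughly, the pattern looks like this - a minimal, self-contained sketch of a future handed over a channel, where every name is an illustrative stand-in rather than the library's actual identifier. The caller builds a future, hands it to the goroutine that owns the configuration, and blocks until that goroutine fills it in, so no extra locking is needed:

package main

import "fmt"

// Illustrative stand-in for the configuration type.
type Configuration struct {
	Servers []string
}

// configFuture carries the result back to the caller.
type configFuture struct {
	config Configuration
	done   chan error
}

type node struct {
	configuration Configuration      // owned exclusively by run()
	configCh      chan *configFuture // requests from other goroutines
	shutdownCh    chan struct{}
}

// run is the single goroutine that owns n.configuration; servicing
// requests here is what makes the accessor race-free.
func (n *node) run() {
	for {
		select {
		case f := <-n.configCh:
			// Hand back a copy so callers can't alias the owner's slice.
			f.config = Configuration{Servers: append([]string(nil), n.configuration.Servers...)}
			f.done <- nil
		case <-n.shutdownCh:
			return
		}
	}
}

// GetConfiguration hands the request to the owning goroutine and waits.
func (n *node) GetConfiguration() (Configuration, error) {
	f := &configFuture{done: make(chan error, 1)}
	select {
	case n.configCh <- f:
	case <-n.shutdownCh:
		return Configuration{}, fmt.Errorf("raft is shutdown")
	}
	if err := <-f.done; err != nil {
		return Configuration{}, err
	}
	return f.config, nil
}

func main() {
	n := &node{
		configuration: Configuration{Servers: []string{"a", "b", "c"}},
		configCh:      make(chan *configFuture),
		shutdownCh:    make(chan struct{}),
	}
	go n.run()
	cfg, err := n.GetConfiguration()
	fmt.Println(cfg.Servers, err) // [a b c] <nil>
	close(n.shutdownCh)
}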

@slackpad mentioned this pull request on Jul 8, 2016
This isn't strictly related to this change, but it's nice to reduce the scope
of raft.go if we can, especially since we tried to tag functions as "runs in the main thread"
or not as part of this change (this code doesn't care).
@slackpad changed the title from "Fixes races with Raft configuration member, adds GetConfiguration accessor." to "Fixes races with Raft configurations member, adds GetConfiguration accessor." on Jul 8, 2016
cloned.Servers[1].ID = "scribble"
if sampleConfiguration.Servers[1].ID == "scribble" {
	t.Fatalf("cloned configuration shouldn't alias Servers")
}


Split here into two separate test functions?

@ongardie-sfdc

I'm still trying to wrap my head around how snapshots are taken...

// the configuration changes. It should be ok in practice with normal
// application traffic flowing through the FSM. If there's none of that
// then it's not crucial that we snapshot, since there's not much going
// on Raft-wise.


Why would little/no normal traffic cause the state machine to fall behind on applying entries?

slackpad (Contributor, Author):

It wouldn't cause the state machine to fall behind. Imagine that the last committed entry at index X was a config change. The FSM won't see that log, so when you try to snapshot the FSM it will be at X-1, which hits the check above and skips writing the snapshot. Once X+1 comes along and the FSM ingests it, we'll be happy to snapshot again with the config change at index X.
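As a rough sketch of that guard (the function and field names here are illustrative, not the exact identifiers in the patch): the snapshot is skipped until the FSM's last applied index has caught up to the index of the latest committed configuration, so a snapshot never pairs FSM state at X-1 with a configuration committed at X.

import "fmt"

// canSnapshot refuses to snapshot while the FSM's last applied index is
// still behind the index of the latest committed configuration entry.
func canSnapshot(fsmIndex, committedConfigIndex uint64) error {
	if fsmIndex < committedConfigIndex {
		return fmt.Errorf("cannot take snapshot: FSM has applied up to index %d, but the configuration entry is at index %d",
			fsmIndex, committedConfigIndex)
	}
	return nil
}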

slackpad (Contributor, Author):

This depends on the FSM having seen some index past the config change in order to take a snapshot, which relies on some application-related logs to come through and hit the FSM.


Ah. In LogCabin, the state machine is fed no-op entries to increment its "last applied" index.

slackpad (Contributor, Author):

Interesting - that would prevent this wrinkle.


It seems like processLog() could just write to fsmCommitCh for any type of log entry, since the FSM already filters out uninteresting ones. We'd just have to remove the switch statement from processLog().
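A minimal sketch of that suggestion (type and channel names are illustrative stand-ins): processLog() forwards every committed entry and the FSM loop does the filtering, so its last-applied index advances even when the entry is a configuration change.

type logType int

const (
	logCommand logType = iota
	logConfiguration
	logNoop
)

type logEntry struct {
	Index uint64
	Type  logType
	Data  []byte
}

type applier interface{ Apply(*logEntry) }

type raftNode struct {
	fsm         applier
	fsmCommitCh chan *logEntry
	lastApplied uint64 // only touched by the FSM goroutine
}

// processLog no longer filters by type: every committed entry goes to
// the FSM goroutine.
func (r *raftNode) processLog(l *logEntry) {
	r.fsmCommitCh <- l
}

// runFSM filters out the uninteresting entries itself, but advances
// lastApplied for every entry it sees, command or not.
func (r *raftNode) runFSM() {
	for l := range r.fsmCommitCh {
		if l.Type == logCommand {
			r.fsm.Apply(l)
		}
		r.lastApplied = l.Index
	}
}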

slackpad (Contributor, Author):

I added item 21 to #84 to track this since it's probably better done in a separate, small PR.

@slackpad (Contributor, Author) commented Jul 9, 2016

There's the runSnapshots() goroutine that calls takeSnapshot() at the right intervals. I think the confusing thing is that takeSnapshot() does an async ping to runFSM() (another goroutine) asking it to take a snapshot of the FSM, then waits on the result of that via a future. That future has the snapshot of the FSM, and now I use another async request / future to pull the configuration and do the check.

We used to have the config stuff on the FSM side, but that was one level deeper than it needed to be, and the FSM doesn't have anything to do with the configuration changes.

I suspect that this division of things might be a bit of a leftover from an earlier version of the library, but the one nice thing about this architecture is that the FSM is either taking a snapshot, restoring a snapshot, or applying changes - all the serialization happens by virtue of the single runFSM() loop reading from channels.
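A rough sketch of that serialization property (every name here is an illustrative stand-in, not the library's actual API): one runFSM-style goroutine owns the FSM state and selects over its work channels, and takeSnapshot only talks to it by sending a request and waiting for the reply, so applies and snapshots can never interleave.

// fsmSnapshotReq is the "future" that takeSnapshot waits on.
type fsmSnapshotReq struct {
	respCh chan []byte // carries a copy of the FSM state back
}

type fsmRunner struct {
	state         []byte // owned exclusively by runFSM
	fsmCommitCh   chan []byte
	fsmSnapshotCh chan *fsmSnapshotReq
	shutdownCh    chan struct{}
}

// runFSM is the only goroutine that touches state, so applying entries
// and capturing snapshots are serialized by construction.
func (f *fsmRunner) runFSM() {
	for {
		select {
		case entry := <-f.fsmCommitCh:
			f.state = append(f.state, entry...) // stand-in for Apply()
		case req := <-f.fsmSnapshotCh:
			req.respCh <- append([]byte(nil), f.state...) // snapshot copy
		case <-f.shutdownCh:
			return
		}
	}
}

// takeSnapshot asks runFSM for a snapshot and blocks on the reply; the
// configuration check now happens out here, via a separate request to
// the main thread, rather than inside the FSM goroutine.
func (f *fsmRunner) takeSnapshot() []byte {
	req := &fsmSnapshotReq{respCh: make(chan []byte, 1)}
	f.fsmSnapshotCh <- req
	return <-req.respCh
}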

r.logger.Printf("[INFO] raft: Starting snapshot up to %d", req.index)
// Make a request for the configurations and extract the committed info.
// We have to use the future here to safely get this information since
// it is owned by the main thread.


Good, safer. Who knew we had all these races? :/

slackpad (Contributor, Author):

I found these by inspection, but I ran go test -race and it caught some of them too :-)

@ongardie-sfdc

@slackpad, thanks, after reading through, that makes sense to me now.

@ongardie-sfdc

Well, lgtm. We can split those files in a later PR too if that's easier.

@slackpad (Contributor, Author) commented Jul 9, 2016

It was easy enough that I rolled the split in as the last change. Thanks for taking a look!

@slackpad merged commit 78e25a2 into issue-84-integration on Jul 9, 2016
@slackpad deleted the b-racy-configuration branch on July 9, 2016 03:28
@ongardie-sfdc

Also, thanks for so eagerly porting over to use channels. Didn't expect that to happen so fast last night :)

@slackpad (Contributor, Author) commented Jul 9, 2016

The locks did make me feel gross, and like 95% of them were almost always useless because it's all in a single thread - channels were dramatically better in this case!
