Fixes races with Raft configurations member, adds GetConfiguration accessor. #134

Merged
11 commits merged into issue-84-integration on Jul 9, 2016

Conversation

@slackpad (Contributor) commented Jul 8, 2016

This adds GetConfiguration(), which is item 7 from #84 (comment). It also resolves item 12.

While adding that I realized that we should move the "no snapshots during config changes" check outside the FSM snapshot thread, both to fix racy access to the configurations and because that's not really a concern of the FSM. That thread just exists to let the FSM do its thing, but the main snapshot thread is the one that should know about the configuration.

To add an external GetConfiguration() API I added an RW mutex, which I'm not super thrilled about, but it lets us correctly manipulate this state from the outside and among the different Raft threads.

@slackpad changed the title from "Fixes races with Raft configuration." to "Fixes races with Raft configuration member, adds GetConfiguration accessor." on Jul 8, 2016
@ongardie (Contributor) commented Jul 8, 2016

Haven't looked through this patch yet, but I was sort of thinking GetConfiguration would be treated like the other application calls that get shoved into a channel as some future type for leaderLoop to process without additional locking. Will read through and think more on this later.

@slackpad (Contributor, Author) commented Jul 8, 2016

We could definitely do it with channels, I think. I'll take a crack at that and see if it simplifies things.

@slackpad force-pushed the b-racy-configuration branch from 1e18bcb to 264bfc2 on July 8, 2016 07:13
@slackpad (Contributor, Author) commented Jul 8, 2016

@ongardie this is ready for a look. Using a channel was definitely more Go-like and kept a lot of lock cruft out of the main thread, where we can just work with the configuration directly.
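Roughly, the pattern looks like this - a minimal, self-contained sketch of a future handed over a channel, where every name is an illustrative stand-in rather than the library's actual identifier. The caller builds a future, hands it to the goroutine that owns the configuration, and blocks until that goroutine fills it in, so no extra locking is needed:

package main

import "fmt"

// Illustrative stand-in for the configuration type.
type Configuration struct {
	Servers []string
}

// configFuture carries the result back to the caller.
type configFuture struct {
	config Configuration
	done   chan error
}

type node struct {
	configuration Configuration      // owned exclusively by run()
	configCh      chan *configFuture // requests from other goroutines
	shutdownCh    chan struct{}
}

// run is the single goroutine that owns n.configuration; servicing
// requests here is what makes the accessor race-free.
func (n *node) run() {
	for {
		select {
		case f := <-n.configCh:
			// Hand back a copy so callers can't alias the owner's slice.
			f.config = Configuration{Servers: append([]string(nil), n.configuration.Servers...)}
			f.done <- nil
		case <-n.shutdownCh:
			return
		}
	}
}

// GetConfiguration hands the request to the owning goroutine and waits.
func (n *node) GetConfiguration() (Configuration, error) {
	f := &configFuture{done: make(chan error, 1)}
	select {
	case n.configCh <- f:
	case <-n.shutdownCh:
		return Configuration{}, fmt.Errorf("raft is shutdown")
	}
	if err := <-f.done; err != nil {
		return Configuration{}, err
	}
	return f.config, nil
}

func main() {
	n := &node{
		configuration: Configuration{Servers: []string{"a", "b", "c"}},
		configCh:      make(chan *configFuture),
		shutdownCh:    make(chan struct{}),
	}
	go n.run()
	cfg, err := n.GetConfiguration()
	fmt.Println(cfg.Servers, err) // [a b c] <nil>
	close(n.shutdownCh)
}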

@slackpad mentioned this pull request on Jul 8, 2016
This isn't strictly related to this change, but it's nice to reduce the scope
of raft.go if we can, especially since we tried to tag functions as "runs in the main thread"
or not as part of this change (this code doesn't care).
@slackpad changed the title from "Fixes races with Raft configuration member, adds GetConfiguration accessor." to "Fixes races with Raft configurations member, adds GetConfiguration accessor." on Jul 8, 2016
cloned.Servers[1].ID = "scribble"
if sampleConfiguration.Servers[1].ID == "scribble" {
	t.Fatalf("cloned configuration shouldn't alias Servers")
}


Split here into two separate test functions?

@ongardie-sfdc

I'm still trying to wrap my head around how snapshots are taken...

// the configuration changes. It should be ok in practice with normal
// application traffic flowing through the FSM. If there's none of that
// then it's not crucial that we snapshot, since there's not much going
// on Raft-wise.


Why would little/no normal traffic cause the state machine to fall behind on applying entries?

slackpad (Contributor, Author):

It wouldn't cause the state machine to fall behind. Imagine that the last committed entry at index X was a config change. The FSM won't see that log, so when you try to snapshot the FSM it will be at X-1, which hits the check above and skips writing the snapshot. Once X+1 comes along and the FSM ingests it, we'll be happy to snapshot again with the config change at index X.
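As a rough sketch of that guard (the function and field names here are illustrative, not the exact identifiers in the patch): the snapshot is skipped until the FSM's last applied index has caught up to the index of the latest committed configuration, so a snapshot never pairs FSM state at X-1 with a configuration committed at X.

import "fmt"

// canSnapshot refuses to snapshot while the FSM's last applied index is
// still behind the index of the latest committed configuration entry.
func canSnapshot(fsmIndex, committedConfigIndex uint64) error {
	if fsmIndex < committedConfigIndex {
		return fmt.Errorf("cannot take snapshot: FSM has applied up to index %d, but the configuration entry is at index %d",
			fsmIndex, committedConfigIndex)
	}
	return nil
}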

slackpad (Contributor, Author):

This depends on the FSM having seen some index past the config change in order to take a snapshot, which relies on some application-related logs to come through and hit the FSM.


Ah. In LogCabin, the state machine is fed no-op entries to increment its "last applied" index.

slackpad (Contributor, Author):

Interesting - that would prevent this wrinkle.


It seems like processLog() could just write to fsmCommitCh for any type of log entry, since the FSM already filters out uninteresting ones. We'd just have to remove the switch statement from processLog().
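A minimal sketch of that suggestion (type and channel names are illustrative stand-ins): processLog() forwards every committed entry and the FSM loop does the filtering, so its last-applied index advances even when the entry is a configuration change.

type logType int

const (
	logCommand logType = iota
	logConfiguration
	logNoop
)

type logEntry struct {
	Index uint64
	Type  logType
	Data  []byte
}

type applier interface{ Apply(*logEntry) }

type raftNode struct {
	fsm         applier
	fsmCommitCh chan *logEntry
	lastApplied uint64 // only touched by the FSM goroutine
}

// processLog no longer filters by type: every committed entry goes to
// the FSM goroutine.
func (r *raftNode) processLog(l *logEntry) {
	r.fsmCommitCh <- l
}

// runFSM filters out the uninteresting entries itself, but advances
// lastApplied for every entry it sees, command or not.
func (r *raftNode) runFSM() {
	for l := range r.fsmCommitCh {
		if l.Type == logCommand {
			r.fsm.Apply(l)
		}
		r.lastApplied = l.Index
	}
}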

slackpad (Contributor, Author):

I added item 21 to #84 to track this since it's probably better done in a separate, small PR.

@slackpad (Contributor, Author) commented Jul 9, 2016

There's the runSnapshots() goroutine that calls takeSnapshot() at the right intervals. I think the confusing thing is that takeSnapshot() does an async ping to runFSM() (another goroutine) asking it to take a snapshot of the FSM, then waits on the result of that via a future. That future has the snapshot of the FSM, and now I use another async request / future to pull the configuration and do the check.

We used to have the config stuff on the FSM side, but that was one level deeper than it needed to be, and the FSM doesn't have anything to do with the configuration changes.

I suspect that this division of things might be a bit of a leftover from an earlier version of the library, but the one nice thing about this architecture is that the FSM is either taking a snapshot, restoring a snapshot, or applying changes - all the serialization happens by virtue of the single runFSM() loop reading from channels.
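A rough sketch of that serialization property (every name here is an illustrative stand-in, not the library's actual API): one runFSM-style goroutine owns the FSM state and selects over its work channels, and takeSnapshot only talks to it by sending a request and waiting for the reply, so applies and snapshots can never interleave.

// fsmSnapshotReq is the "future" that takeSnapshot waits on.
type fsmSnapshotReq struct {
	respCh chan []byte // carries a copy of the FSM state back
}

type fsmRunner struct {
	state         []byte // owned exclusively by runFSM
	fsmCommitCh   chan []byte
	fsmSnapshotCh chan *fsmSnapshotReq
	shutdownCh    chan struct{}
}

// runFSM is the only goroutine that touches state, so applying entries
// and capturing snapshots are serialized by construction.
func (f *fsmRunner) runFSM() {
	for {
		select {
		case entry := <-f.fsmCommitCh:
			f.state = append(f.state, entry...) // stand-in for Apply()
		case req := <-f.fsmSnapshotCh:
			req.respCh <- append([]byte(nil), f.state...) // snapshot copy
		case <-f.shutdownCh:
			return
		}
	}
}

// takeSnapshot asks runFSM for a snapshot and blocks on the reply; the
// configuration check now happens out here, via a separate request to
// the main thread, rather than inside the FSM goroutine.
func (f *fsmRunner) takeSnapshot() []byte {
	req := &fsmSnapshotReq{respCh: make(chan []byte, 1)}
	f.fsmSnapshotCh <- req
	return <-req.respCh
}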

r.logger.Printf("[INFO] raft: Starting snapshot up to %d", req.index)
// Make a request for the configurations and extract the committed info.
// We have to use the future here to safely get this information since
// it is owned by the main thread.


Good, safer. Who knew we had all these races? :/

slackpad (Contributor, Author):

I found these by inspection, but I ran go test -race and it caught some of them too :-)

@ongardie-sfdc

@slackpad, thanks, after reading through, that makes sense to me now.

@ongardie-sfdc

Well, lgtm. We can split those files in a later PR too if that's easier.

@slackpad (Contributor, Author) commented Jul 9, 2016

It was easy enough that I rolled the split in as the last change. Thanks for taking a look!

@slackpad merged commit 78e25a2 into issue-84-integration on Jul 9, 2016
@slackpad deleted the b-racy-configuration branch on July 9, 2016 03:28
@ongardie-sfdc

Also, thanks for so eagerly porting over to use channels. Didn't expect that to happen so fast last night :)

@slackpad (Contributor, Author) commented Jul 9, 2016

The locks did make me feel gross, and like 95% of them were almost always useless because it's all in a single thread - channels were dramatically better in this case!
