
storage: evaluate proposals without lock #10012

Merged
1 commit merged into cockroachdb:master from the evaluate-refactor branch on Oct 25, 2016

Conversation

tbg
Member

@tbg tbg commented Oct 16, 2016

The main objective of this change is to (later) enable proposer-evaluated KV to
compute the WriteBatch when evaluating a proposal (i.e. when turning
a BatchRequest into a *ProposalData). The previous code assumes that making
a proposal is cheap, and thus it is done inside a large critical section.

MaxLeaseIndex is assigned at the latest possible moment before submitting to
Raft. Multiple inbound requests can pass the command queue without interacting,
and it's important to assign the lease index only right before the proposal is
submitted to Raft.

Note that the command queue currently doesn't cover all the data accessed by
Raft commands (#10084) and that precisely how command evaluation will work is
still under discussion (#6290). Nevertheless, this change should be safe and a
step in the right direction.
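
To make the intended ordering concrete, here is a minimal, self-contained Go sketch (the type layout and names such as lastAssignedLeaseIndex and insertProposalLocked are illustrative stand-ins, not the actual CockroachDB code): evaluation happens outside the lock, and the lease index is only assigned inside the short critical section that registers the proposal for submission to Raft.

```go
package main

import (
	"fmt"
	"sync"
)

// ProposalData is a minimal stand-in for the real proposal type.
type ProposalData struct {
	idKey         string
	MaxLeaseIndex uint64
}

// Replica is a minimal stand-in holding only what the sketch needs.
type Replica struct {
	mu struct {
		sync.Mutex
		lastAssignedLeaseIndex uint64
		proposals              map[string]*ProposalData
	}
}

// insertProposalLocked assigns MaxLeaseIndex at the last possible moment,
// inside the short critical section that registers the already-evaluated
// proposal for submission to Raft. Commands that passed the command queue
// without interacting may reach this point in any order; assigning the index
// here keeps it consistent with submission order rather than evaluation order.
func (r *Replica) insertProposalLocked(p *ProposalData) {
	r.mu.lastAssignedLeaseIndex++
	p.MaxLeaseIndex = r.mu.lastAssignedLeaseIndex
	r.mu.proposals[p.idKey] = p
}

func main() {
	var r Replica
	r.mu.proposals = map[string]*ProposalData{}

	p := &ProposalData{idKey: "cmd-1"} // expensive evaluation would happen here, without the lock
	r.mu.Lock()
	r.insertProposalLocked(p)
	r.mu.Unlock()
	fmt.Println(p.MaxLeaseIndex) // prints 1
}
```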



@tbg tbg added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Oct 16, 2016
@tbg tbg force-pushed the evaluate-refactor branch 3 times, most recently from 324433d to 1a01c28 on October 16, 2016 15:01
@bdarnell
Contributor

Reviewed 5 of 5 files at r1.
Review status: all files reviewed at latest revision, 9 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 1597 at r1 (raw file):

// into Replica's proposal map and subsequently passed to submitProposalLocked.
//
// Replica.mu must not be held.

To be clear, it's no longer synchronized by Replica.mu, but it should only be called from the scheduler (which acts as a kind of mutex), right?


pkg/storage/replica.go, line 1682 at r1 (raw file):

  r.mu.Lock()
  if r.mu.destroyed != nil {
      return nil, nil, r.mu.destroyed

Missing r.mu.Unlock()
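
For reference, a sketch of the fix being requested (not the actual diff): the early-return path needs to release the mutex before returning.

```go
r.mu.Lock()
if r.mu.destroyed != nil {
	err := r.mu.destroyed
	r.mu.Unlock() // this unlock was missing in the reviewed revision
	return nil, nil, err
}
```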


pkg/storage/replica.go, line 2302 at r1 (raw file):

// refreshPendingCmds with the ElectionTimeoutTicks as parameter.
func (r *Replica) queueRefreshStalePendingCmds(reason refreshRaftReason) {
  log.Warningf(r.ctx, "queueing refresh of stale pending commands: %s", reason)

Are these warnings temporary?


pkg/storage/replica.go, line 2313 at r1 (raw file):

}

// refreshPendingCmds goes through the pending proposals, refurbishing what

These names are (still) really difficult to remember/distinguish. Does the move to proposer-evaluated KV suggest any better names? For example, could we now s/refurbish/reevaluate/?


pkg/storage/replica.go, line 2316 at r1 (raw file):

// it can and reproposing the rest. refreshAtDelta specifies how old (in ticks)
// a command must be for it to be inspected; usually this is called with zero (affect everything) or the number of ticks of an election timeout (affect only
// proposals that have had ample time to apply but didn't).

Proposer-evaluated KV makes it much more expensive to refresh a command, so we may need to be more conservative about how often we do it.


pkg/storage/replica.go, line 2361 at r1 (raw file):

      if err := r.submitProposalLocked(p); err != nil {
          log.Warning(p.ctx, err)
          // TODO(tschottdorf): should notify the client and remove from

This will be fixed as a part of this PR, right?


pkg/storage/replica.go, line 2370 at r1 (raw file):

  r.mu.Unlock()

  for _, p := range refurbish {

This reverses the order of reproposals and refurbishments. Why?


pkg/storage/replica.go, line 2671 at r1 (raw file):

              // refurbish the command. We do this asynchronously because
              // it's expensive and shouldn't block this goroutine.
              cmdCopy := *cmd // take copy so we can bend the done channel below

"bend"?


pkg/storage/replica.go, line 2688 at r1 (raw file):

                      }
                  }); pErr != nil {
                      log.Info(r.ctx, pErr)

Include more context in this error message (or silently drop it? Does it matter that we give up on refurbishments during shutdown?)


Comments from Reviewable

@petermattis
Collaborator

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 1597 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

To be clear, it's no longer synchronized by Replica.mu, but it should only be called from the scheduler (which acts as a kind of mutex), right?

`Replica.Send` which calls down to `Replica.propose` is not run on the scheduler (unless something changed in this PR).

pkg/storage/replica.go, line 2043 at r1 (raw file):

  }
  if refreshReason != noReason {
      r.queueRefreshAllPendingCmds(refreshReason)

Why are you queueing the refresh vs executing it here? You're already running on a raftScheduler worker.


pkg/storage/replica.go, line 2107 at r1 (raw file):

      // through. In effect, proposals will be refreshed when they have
      // been pending for 1 to 2 election cycles.
      r.queueRefreshStalePendingCmds(reasonTicks)

Similar to my other comment, you're already running on a raftScheduler worker goroutine here. Why are you enqueuing which will just cause another scheduler loop before processing the refresh?


Comments from Reviewable

@tbg tbg force-pushed the evaluate-refactor branch from 1a01c28 to 16c896e on October 17, 2016 11:54
@tbg
Member Author

tbg commented Oct 17, 2016

Review status: all files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 1597 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Replica.Send which calls down to Replica.propose is not run on the scheduler (unless something changed in this PR).

The scheduler isn't involved here at all. I don't think this was ever spelled out anywhere (so we should do it in this change, especially now that guarantees are relaxed), but here's my understanding:

The replica mutex does not at all serialize proposals, but the command queue serializes proposals which may "conflict" in some manner. For simple key-based commands, it makes intuitive sense what that means (we should double-check what can go on with commands which skip the command queue, and look at leases, splits, ... once more).
Essentially, anywhere between beginCmd and the proposal applying, you can interleave arbitrarily with other "non-conflicting" commands, and they can pass through evaluation and various critical sections in some order (including reproposals, refurbishments, ...).
That needs to be correct in order for this change to be correct (and also for the old code to have been correct).
It also needs to be efficient, and I (just) changed one thing that could have bitten us, namely we now assign the lease counter IDs (MaxLeaseIndex) just before proposing to Raft. We used to do it when creating the command, but that is before the (soon to be) long evaluate section.


pkg/storage/replica.go, line 1682 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Missing r.mu.Unlock()

Done.

pkg/storage/replica.go, line 2043 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Why are you queueing the refresh vs executing it here? You're already running on a raftScheduler worker.

I agree, but only because we're doing it at the end of the work (for example, it would seem wrong to let one command pay for the latency that comes with evaluating a potentially large number of proposals anew). For `tick` the same is true. I can change those two, though I think it would be simpler if there were one straightforward and safe way to make these calls (i.e. pay the scheduler round trip).

The more interesting invocation is that in setReplicaID, which is called (with some indirection) from getOrCreateReplica, so that is less tied to a goroutine (though in practice we only call it from two places, both of which are - again - on the scheduler). We definitely don't want to insert arbitrary delays there as those calls are early in the invocation.

And then there's the awkward one of the bunch, in which we can't go through the scheduler due to only wanting to target one specific command. How do you feel about that goroutine? Now that we have the scheduler, I feel pretty bad about offloading work in an uncontrolled way. But then again, we want some of that work done in parallel with standard Raft processing - by occupying the scheduler with reproposals, am I not also delaying the handling of "regular" Raft processing (since no other worker will process outstanding requests for a Range, which btw isn't documented).

So ISTM that I may not be doing performance a favor here by using the scheduler, and that I should always delegate?


pkg/storage/replica.go, line 2313 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

These names are (still) really difficult to remember/distinguish. Does the move to proposer-evaluated KV suggest any better names? For example, could we now s/refurbish/reevaluate/?

Yes, especially `refreshPendingCmds` should already have been renamed. I like `s/refurbish/reevaluate`, though I'd prefer doing another rename-only PR and keep this one as small as possible. I added TODOs for both renames.

pkg/storage/replica.go, line 2316 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Proposer-evaluated KV makes it much more expensive to refresh a command, so we may need to be more conservative about how often we do it.

I agree. Anything you want me to do here?

pkg/storage/replica.go, line 2361 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This will be fixed as a part of this PR, right?

Done.

pkg/storage/replica.go, line 2370 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This reverses the order of reproposals and refurbishments. Why?

Concretely because of what makes sense with the critical sections here, but also because I think that's the right order. Added a comment, PTAL and LMK if I'm confusing something there.

pkg/storage/replica.go, line 2671 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

"bend"?

Done.

pkg/storage/replica.go, line 2688 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Include more context in this error message (or silently drop it? Does it matter that we give up on refurbishments during shutdown?)

In general it shouldn't matter that we don't refurbish, though I can think of at least one situation in which it leads to deadlock: a client is waiting on its done channel at shutdown and can't abandon the command while it is being applied, so it can't quit unless the command applies successfully; but if that requires a refurbishment, it never will. I'm now sending the error directly to the client to avoid that ever happening.

Comments from Reviewable

@tbg
Member Author

tbg commented Oct 17, 2016

Review status: 0 of 49 files reviewed at latest revision, 11 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 2302 at r1 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Are these warnings temporary?

I'll clean these up when the way forward becomes clear.

Comments from Reviewable

@tbg tbg changed the base branch from tschottdorf/origin-replica to master October 17, 2016 11:55
@tbg
Member Author

tbg commented Oct 17, 2016

Addressed most comments, but some discussion remains. I changed the base branch of this PR, so please excuse the weird diffs Reviewable might throw at you.


Review status: 3 of 5 files reviewed at latest revision, 11 unresolved discussions, some commit checks pending.


Comments from Reviewable

@petermattis
Collaborator

Review status: 3 of 5 files reviewed at latest revision, 11 unresolved discussions, all commit checks successful.


pkg/storage/replica.go, line 1597 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The scheduler isn't involved here at all. I don't think this was ever spelled out anywhere (so we should do it in this change, especially now that guarantees are relaxed), but here's my understanding:

The replica mutex does not at all serialize proposals, but the command queue serializes proposals which may "conflict" in some manner. For simple key-based commands, it makes intuitive sense what that means (we should double-check what can go on with commands which skip the command queue, and look at leases, splits, ... once more).
Essentially, anywhere between beginCmd and the proposal applying, you can interleave arbitrarily with other "non-conflicting" commands, and they can pass through evaluation and various critical sections in some order (including reproposals, refurbishments, ...).
That needs to be correct in order for this change to be correct (and also for the old code to have been correct).
It also needs to be efficient, and I (just) changed one thing that could have bitten us, namely we now assign the lease counter IDs (MaxLeaseIndex) just before proposing to Raft. We used to do it when creating the command, but that is before the (soon to be) long evaluate section.

You should convert some of the above into a comment in the code describing the correctness requirements. Something I'm not clear on is when a proposal needs to be re-evaluated vs re-proposed. With proposer-evaluated KV, can we ever simply re-propose a Raft command without reevaluating it?

pkg/storage/replica.go, line 2043 at r1 (raw file):

And then there's the awkward one of the bunch, in which we can't go through the scheduler due to only wanting to target one specific command. How do you feel about that goroutine?

Can you elaborate on this? I'm not quite sure what you're referring to.

Ok, if refreshing pending commands is going to be expensive, then I can see the value in not imposing that cost on the current operation being performed. But see my other comment about needing to clarify when reevaluation is necessary. Are there scenarios (perhaps common) where we can re-propose instead of reevaluate (and then re-propose)?


Comments from Reviewable

@tbg tbg force-pushed the evaluate-refactor branch 2 times, most recently from b2d3e64 to 50770af on October 18, 2016 08:56
@tbg
Member Author

tbg commented Oct 18, 2016

Review status: 1 of 5 files reviewed at latest revision, 12 unresolved discussions.


pkg/storage/replica.go, line 1597 at r1 (raw file):

You should convert some of the above into a comment in the code describing the correctness requirements.

Done (addWriteCmd).

Something I'm not clear on is when a proposal needs to be re-evaluated vs re-proposed

This is documented inside of refreshPendingCmds, but not very well. I added it more explicitly:

// Refurbishing roughly equates being allowed to assume that the previously
// submitted command can no more be applied successfully (due to the applied
// lease index having surpassed the proposal's MaxLeaseIndex). Reproposing
// refers to simply taking the in-flight proposal and submitting it to Raft
// again; this is what we have to do for everything that couldn't be
// refurbished as refurbishing anyway would lead to doubly-applying some of
// the proposals.

On your other question:

With proposer-evaluated KV, can we ever simply re-propose a Raft command without reevaluating it?

Nothing there should have changed. Two commands which are in flight (i.e. past command queue) concurrently should not be able to influence each other. Of course that is one of the big assumptions that we should thoroughly try to poke holes into.
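
A minimal sketch of that distinction (illustrative only; the exact comparison against the applied lease index is an assumption based on the comment above, not the actual code):

```go
// classify decides what to do with a pending proposal during a refresh.
// appliedLeaseIndex is the highest lease index the range has applied.
func classify(maxLeaseIndex, appliedLeaseIndex uint64) string {
	if maxLeaseIndex <= appliedLeaseIndex {
		// The submitted command can no longer apply (its MaxLeaseIndex has
		// been surpassed), so it is safe to evaluate the request again and
		// submit a fresh proposal: refurbish.
		return "refurbish"
	}
	// The original submission may still apply; evaluating and submitting it
	// again could apply its effects twice, so only re-submit it to Raft.
	return "repropose"
}
```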


pkg/storage/replica.go, line 2043 at r1 (raw file):

Can you elaborate on this? I'm not quite sure what you're referring to.

Pinged you from a comment there.

Could you get back to me on this:

But then again, we want some of that work done in parallel with standard Raft processing - by occupying the scheduler with reproposals, am I not also delaying the handling of "regular" Raft processing (since no other worker will process outstanding requests for a Range, which btw isn't documented).

It's my biggest concern right now. Doesn't seem right to use the scheduler.


pkg/storage/replica.go, line 2687 at r2 (raw file):

              defer func() {
                  if err := r.store.Stopper().RunAsyncTask(ctx, func(ctx context.Context) {
                      if pErr := r.refurbishPendingCmd(&cmdCopy); pErr != nil {

@petermattis this one does part of the work that refreshPendingCmds does, but only for a single command.


Comments from Reviewable

@bdarnell
Contributor

Reviewed 4 of 4 files at r3.
Review status: all files reviewed at latest revision, 8 unresolved discussions, all commit checks successful.


pkg/storage/replica.go, line 1597 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

You should convert some of the above into a comment in the code describing the correctness requirements.

Done (addWriteCmd).

Something I'm not clear on is when a proposal needs to be re-evaluated vs re-proposed

This is documented inside of refreshPendingCmds, but not very well. I added it more explicitly:

// Refurbishing roughly equates being allowed to assume that the previously
// submitted command can no more be applied successfully (due to the applied
// lease index having surpassed the proposal's MaxLeaseIndex). Reproposing
// refers to simply taking the in-flight proposal and submitting it to Raft
// again; this is what we have to do for everything that couldn't be
// refurbished as refurbishing anyway would lead to doubly-applying some of
// the proposals.

On your other question:

With proposer-evaluated KV, can we ever simply re-propose a Raft command without reevaluating it?

Nothing there should have changed. Two commands which are in flight (i.e. past command queue) concurrently should not be able to influence each other. Of course that is one of the big assumptions that we should thoroughly try to poke holes into.

The new comment looks good, thanks.

pkg/storage/replica.go, line 2313 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Yes, especially refreshPendingCmds should already have been renamed. I like s/refurbish/reevaluate, though I'd prefer doing another rename-only PR and keep this one as small as possible. I added TODOs for both renames.

Yeah, fine to do the renaming separately.

pkg/storage/replica.go, line 2316 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I agree. Anything you want me to do here?

Nothing specific; just pointing it out. The only tuneable variable here is how many ticks a command must be outstanding to repropose, but I don't think it makes sense to extend that beyond an election timeout unless and until we see real problems.

pkg/storage/replica.go, line 2370 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Concretely because of what makes sense with the critical sections here, but also because I think that's the right order. Added a comment, PTAL and LMK if I'm confusing something there.

Yeah, makes sense.

Comments from Reviewable

@petermattis
Collaborator

Review status: all files reviewed at latest revision, 13 unresolved discussions, some commit checks failed.


pkg/storage/replica.go, line 2043 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Can you elaborate on this? I'm not quite sure what you're referring to.

Pinged you from a comment there.

Could you get back to me on this:

But then again, we want some of that work done in parallel with standard Raft processing - by occupying the scheduler with reproposals, am I not also delaying the handling of "regular" Raft processing (since no other worker will process outstanding requests for a Range, which btw isn't documented).

It's my biggest concern right now. Doesn't seem right to use the scheduler.

`processRaftCommand` is currently serialized on Ranges. As discussed earlier today, it would be great to eventually parallelize processing within a Range, but I don't think proposer-evaluated KV should be gated on that.

pkg/storage/replica.go, line 2687 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

@petermattis this one does part of the work that refreshPendingCmds does, but only for a single command.

Using a Raft scheduler worker wouldn't be much better than using the current goroutine (which is a Raft scheduler worker). My preference would be to do the refurbishment inline here with a TODO to investigate the benefit of using a separate goroutine.

pkg/storage/replica.go, line 1531 at r3 (raw file):

// reorderings. For example, a batch of writes and a split could be in flight
// in parallel without overlap, but if the writes hit the RHS, something must
// prevent them from writing outside of the key range when they apply.

Per #10084, a split should lock-out the RHS of the split. I suppose you'll be updating this comment in that PR.


pkg/storage/replica.go, line 1532 at r3 (raw file):

// in parallel without overlap, but if the writes hit the RHS, something must
// prevent them from writing outside of the key range when they apply.
// Similarly, a command proposed under lease A must not apply under lease B.

I'm not clear on what check you're going to perform to prevent this. We can't check the lease validity when Raft has committed the proposal and is asking for it to be applied as Raft can't depend on the wall time. Perhaps what you meant is a command evaluated under lease A must not be proposed under lease B. Or I could just be confused.


pkg/storage/replica.go, line 1697 at r3 (raw file):

  // has a tiny (~1%) performance hit for single-node block_writer testing.
  r.raftMu.Lock()
  defer r.raftMu.Unlock()

I think you could delay this locking to just before r.evaluateProposal.


pkg/storage/replica.go, line 2371 at r3 (raw file):

// can and reproposing the rest.
//
// Refurbishing roughly equates being allowed to assume that the previously

s/equates being/equates to being/g


pkg/storage/replica.go, line 2379 at r3 (raw file):

// the proposals.
//
// refreshAtDelta specifies how old (in ticks)

Nit: rewrap this paragraph.


Comments from Reviewable

@tbg
Member Author

tbg commented Oct 21, 2016

I want to get #10131 in first. It will remove some of the complications here.

@tbg tbg force-pushed the evaluate-refactor branch from 50770af to 94cc2b7 on October 24, 2016 11:02
@tbg
Member Author

tbg commented Oct 24, 2016

PTAL at the new changes (fair bit of churn between the revisions due to reworking this after #10131, perhaps just review from scratch).


Review status: 0 of 2 files reviewed at latest revision, 13 unresolved discussions, some commit checks pending.


pkg/storage/replica.go, line 2043 at r1 (raw file):

Previously, petermattis (Peter Mattis) wrote…

processRaftCommand is currently serialized on Ranges. As discussed earlier today, it would be great to eventually parallelize processing within a Range, but I don't think proposer-evaluated KV should be gated on that.

Thanks to #10131 (which removed refurbishments), this question has been resolved (by not needing to be addressed in the first place).

pkg/storage/replica.go, line 2687 at r2 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Using a Raft scheduler worker wouldn't be much better than using the current goroutine (which is a Raft scheduler worker). My preference would be to do the refurbishment inline here with a TODO to investigate the benefit of using a separate goroutine.

Absent in newer revision.

pkg/storage/replica.go, line 1531 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Per #10084, a split should lock-out the RHS of the split. I suppose you'll be updating this comment in that PR.

Yep.

pkg/storage/replica.go, line 1532 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'm not clear on what check you're going to perform to prevent this. We can't check the lease validity when Raft has committed the proposal and is asking for it to be applied as Raft can't depend on the wall time. Perhaps what you meant is a command evaluated under lease A must not be proposed under lease B. Or I could just be confused.

No, it's meant as written. Checking downstream of Raft is done today and is always possible: it only depends on the command's timestamp and the lease under which it commits. I'm pointing it out here because we need more than what's currently done (for example, lease A and lease B could both be held by the same node, with a restart in between that lost the command queue; that would allow a write to pop up out of nowhere and apply successfully, invalidating reads which were already served).
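
To illustrate what a below-Raft check can look like (a sketch under assumptions; the types and fields here, including the lease identifier, are hypothetical and not the actual replica.go check): the decision uses only data carried with the command and the lease in force when it commits, never the wall clock, so every replica reaches the same answer. The restart scenario above is exactly why comparing the lease holder alone is not enough and some lease identifier would have to be compared as well.

```go
// Lease is a simplified stand-in: a holder, an identifier that distinguishes
// lease A from lease B even when the same node holds both, and the timestamp
// interval the lease covers.
type Lease struct {
	Holder     int    // store holding the lease
	ID         uint64 // hypothetical lease identifier
	Start, End int64  // logical timestamps covered by the lease
}

// mayApply runs below Raft, identically on every replica: it depends only on
// the command's timestamp, its proposer, and the lease under which the
// command commits.
func mayApply(cmdTimestamp int64, proposerStore int, proposedLeaseID uint64, applying Lease) bool {
	return proposerStore == applying.Holder &&
		proposedLeaseID == applying.ID && // guards against the restart scenario above
		applying.Start <= cmdTimestamp && cmdTimestamp < applying.End
}
```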

pkg/storage/replica.go, line 1697 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I think you could delay this locking to just before r.evaluateProposal.

Done.

pkg/storage/replica.go, line 2371 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

s/equates being/equates to being/g

Absent in newer revision.

pkg/storage/replica.go, line 2379 at r3 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Nit: rewrap this paragraph.

Done.

Comments from Reviewable

@tbg tbg force-pushed the evaluate-refactor branch from 94cc2b7 to d5beaf3 on October 24, 2016 11:19
@petermattis
Collaborator

:lgtm: Have you run `make stress PKG=./storage`?


Review status: 0 of 2 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


pkg/storage/replica.go, line 294 at r4 (raw file):

      // The *ProposalData in the map are "owned" by it. Elements from the
      // map must only be referenced while Replica.mu is held, except if the
      // element is removed from the map first.

Did you audit the code to make sure this is true?
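
A sketch of the ownership rule being quoted (illustrative; it reuses the ProposalData stand-in from the sketch near the top of this thread, not the actual code): a proposal is either read while the mutex is held, or removed from the map under the mutex and only then used without the lock.

```go
// removeProposal illustrates the quoted rule: delete the element from the
// map while holding the mutex so that the caller afterwards owns it
// exclusively and may use it without the lock.
func removeProposal(mu *sync.Mutex, proposals map[string]*ProposalData, idKey string) (*ProposalData, bool) {
	mu.Lock()
	defer mu.Unlock()
	p, ok := proposals[idKey]
	if ok {
		delete(proposals, idKey) // the caller now owns p exclusively
	}
	return p, ok
}
```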


pkg/storage/replica.go, line 296 at r4 (raw file):

      // element is removed from the map first.
      //
      // TODO(tschottdorf): evaluate whether this should be a list/slice.

Is this TODO still valid?


pkg/storage/replica.go, line 1743 at r4 (raw file):

  //
  // TODO(tschottdorf): holding raftMu during evaluateProposal may not be the
  // right thing once we actually do evaluation work there. See #10084.

I'd rephrase this to say that holding raftMu during evaluation will limit concurrency at the range level and is something we will eventually need to address.


Comments from Reviewable

@tbg tbg force-pushed the evaluate-refactor branch 2 times, most recently from 9d153b7 to 856485c on October 24, 2016 19:18
@tbg
Member Author

tbg commented Oct 24, 2016

TFTR. `make stress` is running: 866 runs so far, 0 failures, over 10m50s. I'll leave it on for a few more hours.


Review status: 0 of 2 files reviewed at latest revision, 9 unresolved discussions, all commit checks successful.


pkg/storage/replica.go, line 294 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Did you audit the code to make sure this is true?

Yes. This was mostly relevant during refurbishment, which has now changed to deleting from the map and notifying the client, in that order (which is also the order in which it always happens, as audited).

pkg/storage/replica.go, line 296 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

Is this TODO still valid?

Yes, but it's in an odd place and has no context. Moved to `refreshProposalsLocked` where it belongs.

pkg/storage/replica.go, line 1743 at r4 (raw file):

Previously, petermattis (Peter Mattis) wrote…

I'd rephrase this to say that holding raftMu during evaluation will limit concurrency at the range level and is something we will eventually need to address.

Done.

Comments from Reviewable

@tbg
Member Author

tbg commented Oct 24, 2016

Did get some stress failures, but nothing that seems related:

--- FAIL: TestStoreScanIntents (0.12s)
        store_test.go:1814: 0: expected scan count 2; got 3

--- FAIL: TestRaftRemoveRace (0.05s)
        client_test.go:926: [n1,s1,r1/1:/M{in-ax}]: change replicas aborted due to failed preemptive snapshot: range=1: remote declined snapshot:

The latter is tracked in #10166, but I'll have to double check on the former.

@tbg
Member Author

tbg commented Oct 24, 2016

Indeed seems newly flaky:

10241 runs completed, 1 failures, over 1m36s

(ran on master for >9min)

I'll take a look tomorrow.

@tbg tbg force-pushed the evaluate-refactor branch from 856485c to 41526e2 on October 25, 2016 10:51
@tbg
Member Author

tbg commented Oct 25, 2016

Upon further testing, it's also flaky on master. Filed #10198.

@tbg tbg merged commit c214896 into cockroachdb:master Oct 25, 2016
@tbg tbg deleted the evaluate-refactor branch October 25, 2016 11:02
@tamird
Contributor

tamird commented Oct 25, 2016

Reviewed 3 of 3 files at r6.
Review status: all files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


pkg/storage/replica.go, line 2500 at r6 (raw file):

  // definitely required, however.
  //
  // TODO(tschottdorf): evaluate whether `proposals` should be a list/slice.

proposals meaning r.mu.proposals?


pkg/storage/replica_test.go, line 5789 at r6 (raw file):

  // Disable ticks to avoid automatic reproposals after a timeout, which
  // would pass this test.
  cfg.RaftTickInterval = time.Hour

If you're disabling ticks, why not use math.MaxInt64? time.Hour invites speculation.
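
For what it's worth, a sketch of the suggested change (assuming RaftTickInterval is a time.Duration, as in the quoted test config):

```go
// Effectively disable ticks; math.MaxInt64 nanoseconds (~292 years) leaves no
// room for speculation about a tick firing during the test.
cfg.RaftTickInterval = time.Duration(math.MaxInt64)
```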


pkg/storage/replica_test.go, line 5932 at r6 (raw file):

  const num = 10

  var chs []chan proposalResult

you might want a comment here that explains chs is protected by rng.mu.


pkg/storage/replica_test.go, line 5953 at r6 (raw file):

              t.Error(err)
          } else {

NYC but random whitespace here =/


pkg/storage/replica_test.go, line 6065 at r6 (raw file):

  var tc testContext
  cfg := TestStoreConfig()
  cfg.RaftTickInterval = 24 * time.Hour // disable ticks

again, but this time you're using 24 hours? This invites speculation.


Comments from Reviewable

@tbg
Member Author

tbg commented Oct 26, 2016

Review status: all files reviewed at latest revision, 10 unresolved discussions, all commit checks successful.


pkg/storage/replica.go, line 2500 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

proposals meaning r.mu.proposals?

Yep.

pkg/storage/replica_test.go, line 5789 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

If you're disabling ticks, why not use math.MaxInt64? time.Hour invites speculation.

No good reason.

pkg/storage/replica_test.go, line 5932 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

you might want a comment here that explains chs is protected by rng.mu.

SGTM

pkg/storage/replica_test.go, line 5953 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

NYC but random whitespace here =/

Kill it! :)

pkg/storage/replica_test.go, line 6065 at r6 (raw file):

Previously, tamird (Tamir Duberstein) wrote…

again, yet this time you're using 24 hours? this invites speculation.

MaxInt64 SGTM

Comments from Reviewable
