
stability: implement node liveness; first step towards new range leases #9530

Merged (1 commit, Oct 12, 2016)

Conversation

@spencerkimball (Member) commented Sep 25, 2016:

This change adds a node liveness table as a global system table.
Nodes periodically write updates to their liveness record by doing
a conditional put to the liveness table. The leader of the range
containing the node liveness table gossips the latest information
to the rest of the system.

Each node has a NodeLiveness object which can be used to query
the status of any other node: a node is considered live if the time
since it last successfully heartbeated its liveness record is within
the liveness threshold duration.
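
For illustration, a minimal sketch of the heartbeat loop described above, with hypothetical names and fields (the real implementation lives in storage/node_liveness.go and differs in detail):

```go
// Hypothetical sketch: each node periodically renews its own liveness
// record; the renewal itself is a conditional put against the
// previously written record (performed inside nl.heartbeat).
func (nl *NodeLiveness) startHeartbeat(ctx context.Context, stopper *stop.Stopper) {
	stopper.RunWorker(func() {
		ticker := time.NewTicker(nl.heartbeatInterval) // assumed field
		defer ticker.Stop()
		for {
			if err := nl.heartbeat(ctx); err != nil { // assumed method
				log.Errorf(ctx, "failed liveness heartbeat: %s", err)
			}
			select {
			case <-ticker.C:
			case <-stopper.ShouldStop():
				return
			}
		}
	})
}
```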

The as-yet-unused IncrementEpoch mechanism is also added in this
PR, for eventual use with the planned epoch-based range leader leases.

Updated the range leader lease RFC to reflect current thinking.



@@ -353,7 +344,8 @@ func (ds *DistSender) sendRPC(
 	ba roachpb.BatchRequest,
 ) (*roachpb.BatchResponse, error) {
 	if len(replicas) == 0 {
-		return nil, noNodeAddrsAvailError{}
+		return nil, roachpb.NewSendError(
@spencerkimball (Member, Author):

This change allows the system to properly retry such errors instead of failing the request to the distributed sender.

@spencerkimball force-pushed the spencerkimball/node-liveness branch from 01f8b9e to 6467402 on September 25, 2016 21:26
@bdarnell (Contributor) left a comment:

Generally looks like a good change to me, although I still don't think we should include this under the code yellow. To my mind it's out of scope for the stability crisis so we should either end the code yellow first or revise our policies to make them more sustainable for a code yellow that's going to go on for a longer time (i.e. in either case we should merge the master and develop branches before this change goes in).

node lease record to ensure that the epoch counter is consistent and
the start time is greater than the prior range lease holder’s node
lease expiration (plus the maximum clock offset).
and a node heartbeat timestamp.
@bdarnell (Contributor) commented Sep 26, 2016:

Storing expiration timestamps rather than last-heartbeat timestamps is generally better because it makes it possible to change the heartbeat interval and lease timeout. (The expiration timestamp is a promise by that node not to serve any commands after expiration, so it relies only on that node's timing parameters. If each node independently computes the expiration based on the last heartbeat, then all nodes must agree about the lease timeout and it becomes very difficult to ever change).
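
To make the contrast concrete, a hedged sketch (the types shown are assumptions, not the actual protos):

```go
// With an expiration stored in the record, liveness is self-describing:
// the writer promises not to serve past this timestamp, so readers need
// no knowledge of the writer's timing parameters.
func isLiveByExpiration(expiration, now hlc.Timestamp) bool {
	return now.Less(expiration)
}

// With only a last-heartbeat stored, every reader must apply the same
// timeout as the writer, so the timeout can never safely change.
func isLiveByLastHeartbeat(lastHeartbeat, now hlc.Timestamp, agreedTimeout time.Duration) bool {
	return now.Less(lastHeartbeat.Add(agreedTimeout.Nanoseconds(), 0))
}
```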

@spencerkimball (Member, Author):

I'm amending this to include a duration as well as the last heartbeat.

liveness table. They do not consult their outlook on the liveness
table and can even be disconnected from gossip.

[NB: previously this RFC recommended a distributed transaction to
Contributor:

The RFC template has an "alternatives" section for this kind of thing.

@spencerkimball (Member, Author):

This doesn't rise to the level of an alternative to the proposed design. I just wanted to provide a complete description of why a distributed txn with the liveness record isn't necessary for posterity. I've moved the bulk of this comment down into a subsection of Alternatives and added a nota bene link.

an HLC clock time corresponding to the now-old epoch at which it
acquired the lease.]

If gossip cannot be reached by a node, then it will lose the ability
Contributor:

Why does gossip matter? Isn't it enough that the node was able to execute its conditional put to increase its timestamp?

@spencerkimball (Member, Author):

It actually doesn't. As long as the node can heartbeat, it's set. I removed this paragraph.

range lease is being removed. `AdminTransferLease` will be enhanced to
perform transfers correctly using node lease style range leases.

Nodes which propose or transfer an epoch-based leader lease must
Contributor:

Throughout: we call it the "range lease", not the "leader lease" now, to avoid confusion with raft leadership.

@spencerkimball (Member, Author):

Done. Changed in the code now too.

3| 3| 1| -
4| 2| 19| -

On write to range 4, leader lease is invalid. To get a new lease, a
Contributor:

What does this first sentence mean?

@spencerkimball (Member, Author):

I removed this section now that I added the liveness table section at Vivek's request.

verifyLiveness(t, mtc)
mtc.stopStore(0)

// Create a new gossip instance and connect it.
Contributor:

Could you just make two stores in the multiTestContext instead of creating a second gossip instance by hand?

@spencerkimball (Member, Author):

Rewritten to avoid creating a new gossip instance by hand; the test now uses two nodes to verify we're gossiping more than just the first record.

// must hold the lease to a range which contains some or all of the
// node liveness records. After scanning the records, it checks
// against what's already in gossip and only gossips records which
// are out of date.
Contributor:

It might be simpler to have gossip do this check itself.

@spencerkimball (Member, Author):

Which check?

Contributor:

Gossip.AddInfo could check and see if the new value is already present, and if so do nothing.
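
A rough sketch of the suggested check; only the AddInfo signature matches the real API, while mu.infos and addInfoLocked are hypothetical internals:

```go
// Skip propagation when an identical value is already stored.
func (g *Gossip) AddInfo(key string, val []byte, ttl time.Duration) error {
	g.mu.Lock()
	defer g.mu.Unlock()
	if existing, ok := g.mu.infos[key]; ok && bytes.Equal(existing.Value, val) {
		return nil // value unchanged; nothing to re-gossip
	}
	return g.addInfoLocked(key, val, ttl)
}
```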

@spencerkimball (Member, Author):

OK, moved this check into gossip and added a test.

@spencerkimball (Member, Author):

Actually I had to back out this change. Too many tests rely on the existing behavior that we re-gossip identical data. This just isn't worth figuring out for this PR.

ba.Timestamp = r.store.Clock().Now()
ba.Add(&roachpb.ScanRequest{Span: span})
br, trigger, pErr :=
r.executeBatch(r.ctx, storagebase.CmdIDKey(""), r.store.Engine(), nil, ba)
Contributor:

Can this use r.Send instead of going directly to executeBatch?

@spencerkimball (Member, Author):

This is the pattern used in gossiping the system span and @tschottdorf encouraged that I continue following it.

Contributor:

I agree this should be consistent between the two, but I can't think of why it should be this way. Maybe add a TODO to change both of them to use Replica.Send (or document why we need this more complicated process).

@spencerkimball (Member, Author) commented Sep 28, 2016:

Turns out we need to do this to avoid reentry on the same key in the command queue, which blocks this from completion. Added comments in both locations.

br, trigger, pErr :=
r.executeBatch(r.ctx, storagebase.CmdIDKey(""), r.store.Engine(), nil, ba)
if pErr != nil {
log.Errorf(r.ctx, "couldn't scan node liveness records in span %s: %s", span, pErr.GoError())
Contributor:

These should probably return instead of logging and continuing.

@spencerkimball (Member, Author):

Good point. Done.

// UserDataSpan is the non-meta and non-structured portion of the key space.
UserDataSpan = roachpb.Span{Key: SystemMax, EndKey: TableDataMin}

// SystemDataSpans are spans which contain system data which needs to be
Contributor:

s/SystemDataSpans/GossipedSystemSpans/g

@spencerkimball (Member, Author):

done.

timestamp greater than the latest expiration timestamp it has written
to the node lease table.
timestamp greater than the latest heartbeat timestamp plus a liveness
threshold duration. Note that the heartbeat is specified according to
Contributor:

I think you are trying to express something like spanner's "disjointed invariant" here. It will be better to say that the lease expires after the heartbeat-timestamp + liveness-threshold-duration . Not clear what this liveness-threshold-duration is?

@spencerkimball (Member, Author):

Ok, reworded slightly. The liveness threshold duration is computed based on the raft election timeout and a constant multiple, defined in the code.
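
In code terms, something like the following, where the multiple of 3 is illustrative; the actual constant is defined in the storage package:

```go
// Hypothetical: liveness threshold derived from the raft election
// timeout by a constant multiple.
const livenessThresholdMultiple = 3 // illustrative value

func livenessThreshold(raftElectionTimeout time.Duration) time.Duration {
	return livenessThresholdMultiple * raftElectionTimeout
}
```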

@@ -47,28 +47,20 @@ its replicas holding range leases at once?

We introduce a new node lease table at the beginning of the keyspace
Contributor:

Can you specify the columns in this table and describe them.

@spencerkimball (Member, Author):

Done

@spencerkimball force-pushed the spencerkimball/node-liveness branch 2 times, most recently from 8f62737 to fafaa3c on September 26, 2016 22:27
@petermattis (Collaborator) left a comment:

I've only reviewed the change to the RFC so far. I'll get to the code tomorrow.

to set the new leaseholder or else set the leaseholder to 0. This is
necessary in the case of rebalancing when the node that holds the
range lease is being removed. `AdminTransferLease` will be enhanced to
perform transfers correctly using node lease style range leases.
@petermattis (Collaborator):

s/node lease style/epoch-based/g

@spencerkimball (Member, Author):

Done

themselves be live according to the liveness table. Keep in mind that
a node considers itself live according to whether it has successfully
written a recent liveness record which proves its liveness measured
by current time vs the record's expiration module the maximum clock
@petermattis (Collaborator):

s/module/modulo/g

@spencerkimball (Member, Author):

Done

details on why that's unnecessary.]

In addition to nodes updating their own liveness entry with ongoing
updates via conditional puts, non-leaseholder nodes may increment
@petermattis (Collaborator):

"updating...with ongoing updates" is one too many "updates".

@spencerkimball (Member, Author):

Done

In addition to nodes updating their own liveness entry with ongoing
updates via conditional puts, non-leaseholder nodes may increment
the epoch of a node which has failed to update its heartbeat in time
to keep it younger than the threshold liveness duration.
@petermattis (Collaborator):

"...keep it younger than the expiration time".

@spencerkimball (Member, Author):

Done

With 1,000 nodes and a 9s liveness duration threshold, we expect every
node to do a conditional put to update the heartbeat timestamp every
7.2s. That would correspond to ~140 reqs/second, a not-unreasonable
load for this function.
@petermattis (Collaborator):

Elsewhere in the document we talk about a 10,000 range table requiring 1,388 Raft commits per second. It would be nice to reiterate how many Raft ops/sec the old (non-epoch-based) range leases would require for a 1,000 node cluster with 10,000 replicas per node (3.3 million ranges).
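
For scale, extrapolating from the figures already quoted: epoch-based liveness costs one conditional put per node per 7.2s, i.e. ~140 Raft commits/sec for 1,000 nodes, whereas per-range expiration-based leases renewed on the same cadence would cost roughly 3,300,000 / 7.2 ≈ 458,000 Raft commits/sec for 3.3 million ranges, a difference of more than three orders of magnitude.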

@spencerkimball (Member, Author):

Updated.

We still require the traditional expiration-based range leases for any
ranges located at or before the liveness table's range. This might be
problematic in the case of meta2 address record ranges, which are
expected to proliferate in a large cluster. This lease traffic could
@petermattis (Collaborator):

Would be nice to put a number on the ranges which couldn't use epoch-based range leases. How many meta2 ranges will there be for the 3.3 million range cluster example I mention? I'm not sure how big RangeDescriptors currently are (we should measure this), but let's assume 1KB per range of meta2 data. A single meta2 range could hold 64K entries, which would translate into 50 meta2 ranges. That doesn't seem worth optimizing.

Update: just checked on a local cluster and the size of a meta2 key+value averaged 184 bytes. That would translate into 364K ranges per meta2 range and we'd need 10 meta2 ranges to support 3.3 million ranges. You should probably double check my math here.
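
(Checking that arithmetic under the assumption of the 64 MiB target range size: 67,108,864 B / 184 B ≈ 365K meta2 entries per range, and 3,300,000 / 365,000 ≈ 9, so ~10 meta2 ranges, which matches the update above.)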

@spencerkimball (Member, Author):

Well you end up with some historical versions and other dross, so I agree somewhere between 10 and 50 meta2 ranges. Doesn't seem worth the effort. I added a note.

## Use of distributed txn for updating liveness records

The original proposal mentioned: "The range lease record is always
updated in a distributed transaction with the -node lease record to
@petermattis (Collaborator):

s/-node lease/node-lease/g

@spencerkimball (Member, Author):

Changed to node liveness table


The original proposal mentioned: "The range lease record is always
updated in a distributed transaction with the -node lease record to
ensure that the epoch counter is consistent and -the start time is
@petermattis (Collaborator):

s/-the/the/g

@spencerkimball (Member, Author):

Done

TODO(peter): What is the motivation for gossipping the node lease
table? Gossipping means the node's will have out of date info for the
TODO(peter): What is the motivation for gossiping the node lease
table? Gossiping means the node's will have out of date info for the
@petermattis (Collaborator):

Mind addressing this TODO? @bdarnell gave justification elsewhere.

@spencerkimball (Member, Author):

Done.

// expiration-based range leases instead of the more efficient
// node-liveness epoch-based range leases (see
// https://github.com/cockroachdb/cockroach/blob/develop/docs/RFCS/range_leases.md)
NodeLivenessPrefix = roachpb.Key(makeKey(SystemPrefix, roachpb.RKey("\x00liveness-")))
@petermattis (Collaborator):

Given that we almost never see raw keys anymore, is there a reason for the key to be so long? Probably doesn't matter (as there is only one liveness key per node), but it grates on me a little.

@spencerkimball (Member, Author):

It's only as long as the others.

@spencerkimball force-pushed the spencerkimball/node-liveness branch from fafaa3c to 97ad64b on September 27, 2016 13:14
record to increase the expiration timestamp and ensure that the epoch
has not changed. If the epoch does change, *all* of the range leases
held by this node are revoked. A node *must not* propose any commands
with a timestamp greater than its expiration timestamp modulo the
Contributor:

s/modulo/minus/

@spencerkimball (Member, Author):

Done

Each node periodically performs a conditional put to its node liveness
record to increase the expiration timestamp and ensure that the epoch
has not changed. If the epoch does change, *all* of the range leases
held by this node are revoked. A node *must not* propose any commands
Contributor:

Add "or serve any reads" at the end of this line.

@spencerkimball (Member, Author):

Done

@@ -152,6 +152,7 @@ message ChangeReplicasTrigger {
// This can be used to trigger scan-and-gossip for the given span.
message ModifiedSpanTrigger {
optional bool system_config_span = 1 [(gogoproto.nullable) = false];
optional Span node_liveness_span = 2;
Contributor:

OK, add a comment here about the difference.

// than a threshold liveness duration.
int64 epoch = 2;
// The timestamp at which this liveness record expires.
util.hlc.Timestamp expiration = 3 [(gogoproto.nullable) = false];
Contributor:

Store both the expiration and the stasis time (or the max offset), to make it possible to change the max offset safely.

@spencerkimball (Member, Author) commented Sep 28, 2016:

I noticed this same pattern in the range lease proto. It seems like a bandaid on a gunshot wound. The reality is you simply will not be able to jigger max offsets without a stop-the-world cluster restart and a delay of at least the maximum lease expiration where max offset might matter. Just because the writer of this record and the reader(s) agree on the stasis timestamp doesn't protect from the inconsistencies which could arise if max offset were changed. To see why, consider the following scenario:

  • Node A has a max offset = 10ns, accurate clock
  • Node A sets liveness record expiration = @100ns, stasis = @90ns
  • Node B, with max offset = 20ns and a fast clock by 20ns, bumps Node A's epoch @101ns (which is @81ns, according to Node A's clock)
  • Node A & B can both believe they have leases

I think this "stasis" timestamp concept should be abolished in light of it being confusing, cognitively complex, and ultimately nothing but a Potemkin protection. We should be writing code with the assumption that all nodes agree on max offset. I don't believe we can do otherwise without kidding ourselves about guarantees.

Member:

I agree that we should avoid relying on the stasis timestamp as much as we can. On current master, it's necessary to use it and it was deemed more explicit to have it in the protos than pulling it from configuration, where it's more obvious that it does matter. I agree that shortening the MaxOffset is a can of worms we don't want to open now. Instead I think we should improve the detection of an obvious misconfiguration by storing the stasis timestamp and letting nodes verify that it agrees with what they believe the MaxOffset to be.

I'm just starting to look at this change, so excuse any lack of global context.

@spencerkimball (Member, Author):

Actually, my analysis is wrong as long as both Node A and Node B use their respective clock offsets in relation to the stasis time (i.e. Node B should consider expiration to be @110ns). But that means we only need one timestamp. Node A stops serving reads at Expiration; Node B only considers a lease or liveness record expired at Expiration + MaxOffset.

@spencerkimball (Member, Author):

Ah, actually I think we do need to send the MaxOffset because the case I just mentioned only holds if Node B's MaxOffset is greater than or equal to Node A's. What we need is Node A to stop serving reads at Expiration and Node B to consider a lease or liveness record expired only after Expiration + Max(MaxOffset[Node B], MaxOffset[Node A]).

But again, this just is begging the question of whether we ought to support changing max offset without requiring a cluster freeze. Is this kind of complexity healthy when you weigh costs and benefits?
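
A minimal sketch of the single-timestamp rule converged on here (names are hypothetical):

```go
// The record's owner must stop serving strictly before its own
// expiration; any other node may treat the record as expired only
// after expiration plus the larger of the two nodes' max offsets.
func ownerMayServe(now, expiration hlc.Timestamp) bool {
	return now.Less(expiration)
}

func mayTreatAsExpired(now, expiration hlc.Timestamp, ownOffset, ownerOffset time.Duration) bool {
	maxOffset := ownOffset
	if ownerOffset > maxOffset {
		maxOffset = ownerOffset
	}
	return expiration.Add(maxOffset.Nanoseconds(), 0).Less(now)
}
```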

// computed from the specified tick interval and the default number of
// election timeout ticks.
func RaftElectionTimeout(raftTickInterval time.Duration) time.Duration {
return time.Duration(defaultRaftElectionTimeoutTicks) * raftTickInterval
Contributor:

RaftElectionTimeoutTicks can be changed; we shouldn't hard-code the default here.

@spencerkimball (Member, Author):

OK.
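
A sketch of the parameterized form (the body is hypothetical, but it is consistent with the two-argument call to storage.RaftElectionTimeout that appears later in server.go):

```go
// RaftElectionTimeout computes the raft election timeout. Zero values
// select the defaults, so callers can override either parameter.
func RaftElectionTimeout(raftTickInterval time.Duration, raftElectionTimeoutTicks int) time.Duration {
	if raftTickInterval == 0 {
		raftTickInterval = base.DefaultRaftTickInterval
	}
	if raftElectionTimeoutTicks == 0 {
		raftElectionTimeoutTicks = defaultRaftElectionTimeoutTicks
	}
	return time.Duration(raftElectionTimeoutTicks) * raftTickInterval
}
```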

@spencerkimball force-pushed the spencerkimball/node-liveness branch from 97ad64b to e8dd5da on September 28, 2016 17:21
@tbg (Member) commented Sep 28, 2016:

Reviewed 1 of 9 files at r5.
Review status: 1 of 36 files reviewed at latest revision, 44 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 107 at r3 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done

The last part of the sentence isn't easy to parse, and what is the "heartbeat"? It's not in the schema.

docs/RFCS/range_leases.md, line 66 at r5 (raw file):

record to increase the expiration timestamp and ensure that the epoch
has not changed. If the epoch does change, *all* of the range leases
held by this node are revoked. A node *must not* propose any commands

Isn't this all about reads? What does it matter if the node proposes anything without a lease? It should try to avoid it, but since there is still a lease persisted in the replica state, all of those proposals would simply error out. Or am I missing something here?


docs/RFCS/range_leases.md, line 73 at r5 (raw file):

range which contains it. Gossip is used in order to minimize fanout
and make distribution efficient. The best-effort nature of gossip is
acceptable here because timely delivery of node livesness updates are

/lives/live/


docs/RFCS/range_leases.md, line 94 at r5 (raw file):

Nodes which propose or transfer an epoch-based range lease must
themselves be live according to the liveness table. Keep in mind that

What exactly does this mean? I could be live at propose time but not live at the time the proposal actually goes through. Explain what exactly is meant and what is required for correctness.


docs/RFCS/range_leases.md, line 100 at r5 (raw file):

offset.

To propose an epoch-based range lease, the existing lease must either

This sentence is hard to read. Please make that a little more digestible as it's pretty central.


docs/RFCS/range_leases.md, line 204 at r5 (raw file):

succeed at the range lease, but then fail immediately on attempting to
use the range lease, as it could not possibly still have an HLC clock
time corresponding to the now-old epoch at which it acquired the lease

Missing period.



@spencerkimball (Member, Author):

Review status: 1 of 36 files reviewed at latest revision, 44 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 107 at r3 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The last part of the sentence isn't easy to parse, and what is the "heartbeat"? It's not in the schema.

Removed this paragraph. All info here is contained elsewhere.

docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Isn't this all about reads? What does it matter if the node proposes anything without a lease? It should try to avoid it, but since there is still a lease persisted in the replica state, all of those proposals would simply error out. Or am I missing something here?

s/propose/accept/

docs/RFCS/range_leases.md, line 73 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

/lives/live/

Done.

docs/RFCS/range_leases.md, line 94 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

What exactly does this mean? I could be live at propose time but not live at the time the proposal actually goes through. Explain what exactly is meant and what is required for correctness.

Good catch. This was old verbiage and not required. Obviously a node will want to consider itself live when doing anything, but in the event its liveness update is delayed for any reason, not to worry. It's not relevant. The only way for an epoch-based range lease to be acquired is if the leader is set to 0, leader == acquirer & lease epoch == liveness epoch, or lease epoch is being incremented such that lease epoch == liveness epoch. I removed this paragraph.
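
A hedged sketch of the three acquisition cases enumerated above (types and names are illustrative only, not the actual protos):

```go
// mayAcquireEpochLease mirrors the three cases: no leaseholder,
// re-acquisition by the current holder at its live epoch, or the prior
// holder's liveness epoch having been incremented past the lease epoch,
// which invalidates the old lease.
func mayAcquireEpochLease(holder, acquirer int32, leaseEpoch int64, livenessEpoch func(int32) int64) bool {
	switch {
	case holder == 0:
		return true
	case holder == acquirer && leaseEpoch == livenessEpoch(acquirer):
		return true
	default:
		return livenessEpoch(holder) > leaseEpoch
	}
}
```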

docs/RFCS/range_leases.md, line 100 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

This sentence is hard to read. Please make that a little more digestible as it's pretty central.

Improved.

docs/RFCS/range_leases.md, line 204 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Missing period.

Done.

storage/node_liveness.go, line 143 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Then should we have a separate field in NodeLiveness for our own liveness, instead of using this map for both values received from gossip and the special one we generate for ourselves?

Hmm, good point. We probably don't want gossip overwriting the truth. Changed.


@tbg (Member) commented Sep 28, 2016:

Review status: 1 of 36 files reviewed at latest revision, 43 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

s/propose/accept/

?

docs/RFCS/range_leases.md, line 73 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

Publish-before-push

docs/RFCS/range_leases.md, line 100 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Improved.

Mind pinging when this is pushed? We usually do push-before-publish so that when I see your comments, I can also see your changes.


@spencerkimball force-pushed the spencerkimball/node-liveness branch from e8dd5da to bb05979 on September 28, 2016 21:23
@spencerkimball (Member, Author):

Review status: 1 of 36 files reviewed at latest revision, 43 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

?

I meant that I changed the sentence to read "A node _must not_ accept any commands or serve any reads with a timestamp greater than its expiration minus the maximum clock offset."

docs/RFCS/range_leases.md, line 73 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Publish-before-push

Done.

docs/RFCS/range_leases.md, line 100 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Mind pinging when this is pushed? We usually do push-before-publish so that when I see your comments, I can also see your changes.

Done.


@tbg (Member) commented Sep 28, 2016:

Reviewed 16 of 34 files at r1, 2 of 6 files at r2, 9 of 17 files at r3, 8 of 9 files at r5, 3 of 3 files at r6.
Review status: all files reviewed at latest revision, 58 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 61 at r5 (raw file):

NodeID|     node identifier
Epoch|      monotonically increasing liveness epoch
Expiration| timestamp at which the liveness record expires

Heartbeats are not mentioned here nor explained anywhere else though they are referenced and implemented. Please clarify throughout.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I meant that I changed the sentence to read "A node must not accept any commands or serve any reads with a timestamp greater than its expiration minus the maximum clock offset."

Does that make sense? I assume that by "accept" you mean "apply" commands. But what is the effective "expiration timestamp"? That's not part of the state machine, so I can't use it for decision making. I would think that commands not proposed by the lease holder (which is a part of the replicated state, but doesn't have an expiration timestamp any more for your new type of lease) apply as a no-op (i.e. they error out). Is that what you mean? It would be good to "prove" that that covers all bases.

docs/RFCS/range_leases.md, line 97 at r6 (raw file):

is the leaseholder or the lease is expired.

An existing lease which uses the epoch-based mechanism may be acquired

When would that happen? Incrementing the epoch is covered in the third case, so here this would be the lease holder proposing an identical lease? Why?


keys/constants.go, line 201 at r5 (raw file):

  // NodeLivenessKeyMax is the maximum value for any node liveness key.
  NodeLivenessKeyMax = roachpb.Key(roachpb.RKey(NodeLivenessPrefix).PrefixEnd())

PrefixEnd() exists on Key as well.


kv/dist_sender.go, line 347 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

This change allows the system to properly retry such errors instead of failing the request to the distributed sender.

I believe this makes `noNodeAddrsAvailError` unused.

roachpb/data.proto, line 155 at r5 (raw file):

message ModifiedSpanTrigger {
  optional bool system_config_span = 1 [(gogoproto.nullable) = false];
  // The node liveness span can be used to indicate that node liveness

s/The node liveness span can be used/node_liveness_span is set/


roachpb/data.proto, line 156 at r5 (raw file):

  optional bool system_config_span = 1 [(gogoproto.nullable) = false];
  // The node liveness span can be used to indicate that node liveness
  // records need re-gossiping after modification or range leadership

avoid the word "leadership".


roachpb/data.proto, line 157 at r5 (raw file):

  // The node liveness span can be used to indicate that node liveness
  // records need re-gossiping after modification or range leadership
  // change. In the case of modification, the span is typically set to

s/In the case of modification,/When set,/


roachpb/data.proto, line 158 at r5 (raw file):

  // records need re-gossiping after modification or range leadership
  // change. In the case of modification, the span is typically set to
  // a single key. For range leadership change, the span is set to the

What does it mean when it's set to a single key? When does that happen and when the other case? Also same comment about word "leadership".


storage/client_test.go, line 490 at r5 (raw file):

      // querying all stores; it may not be present on all stores, but the
      // current version is guaranteed to be present on one of them.
      if str == nil {

What prompted this?


storage/liveness.proto, line 29 at r5 (raw file):

      (gogoproto.casttype) = "github.com/cockroachdb/cockroach/roachpb.NodeID"];
  // Epoch is a monotonically-increasing value for node liveness. It
  // may be incremented every time a node's last heartbeat is older

What is the heartbeat?


storage/node_liveness.go, line 65 at r2 (raw file):

      expiration = lr.received.Add(LivenessThreshold.Nanoseconds(), 0)
  }
  return nl.clock.Now().Less(expiration)

Looks like you could add LivenessThreshold here.


storage/node_liveness.go, line 117 at r2 (raw file):

// last heartbeat in the canonical node liveness table.
func (nl *NodeLiveness) StartHeartbeat(ctx context.Context, stopper *stop.Stopper) {
  log.VEventf(1, ctx, "starting liveness heartbeat for node %d", nl.gossip.GetNodeID())

NodeID should already be in the context.


storage/node_liveness.go, line 123 at r2 (raw file):

      for {
          if err := nl.heartbeat(ctx); err != nil {
              log.Errorf(ctx, "failed liveness heartbeat for node %d: %s", nl.gossip.GetNodeID(), err)

Ditto.


storage/node_liveness.go, line 212 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I'm adding a TODO for this – I think it's better measured in the context of more metrics related to the higher level concept of range leases, which I'll add with next PR.

Why kick that can down the road? We certainly do want to track this and it's worth seeing how this behaves without waiting for the more complicated PRs that try to do much more at once.

storage/node_liveness.go, line 226 at r5 (raw file):

// for the node specified by nodeID. In the event that the conditional
// put fails, the handleCondFailed callback is invoked with the actual
// node liveness record. The conditional put is done as a 1PC

... and nil returned (which is an odd interface; consider returning a special error instead).


storage/node_liveness.go, line 241 at r5 (raw file):

      if oldLiveness == nil {
          b.CPut(key, newLiveness, nil)
      } else {

The two branches are equivalent.


storage/node_liveness_test.go, line 121 at r5 (raw file):

  }

  // Verify that the epoch has been advanced

nit: dot.


storage/node_liveness_test.go, line 134 at r5 (raw file):

      }
      if live, err := mtc.nodeLivenesses[0].IsLive(deadNodeID); live || err != nil {
          return errors.Errorf("expected dead node to remain dead after epoch increment: %s", err)

print live as well.


storage/node_liveness_test.go, line 165 at r5 (raw file):

  // Register a callback to gossip in order to verify liveness records
  // are re-gossiped.
  var mu syncutil.Mutex

nit:

var keysMu struct {
   sync.Mutex
   keys []string
}

storage/node_liveness_test.go, line 167 at r5 (raw file):

  var mu syncutil.Mutex
  var keys []string
  livenessRegex := gossip.MakePrefixPattern(gossip.KeyNodeLivenessPrefix)

Weird that KeyNodeLivenessPrefix is in gossip.



@spencerkimball force-pushed the spencerkimball/node-liveness branch from bb05979 to 1ee4a91 on September 28, 2016 22:36
@spencerkimball (Member, Author):

Review status: all files reviewed at latest revision, 58 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 61 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Heartbeats are not mentioned here nor explained anywhere else though they are referenced and implemented. Please clarify throughout.

I now mention heartbeat at the start.

docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Does that make sense? I assume that by "accept" you mean "apply" commands. But what is the effective "expiration timestamp"? That's not part of the state machine, so I can't use it for decision making. I would think that commands not proposed by the lease holder (which is a part of the replicated state, but doesn't have an expiration timestamp any more for your new type of lease) apply as a no-op (i.e. it error out). Is that what you mean? It would be good to "prove" that that covers all bases.

Clarified further.

docs/RFCS/range_leases.md, line 97 at r6 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

When would that happen? Incrementing the epoch is covered in the third case, so here this would be the lease holder proposing an identical lease? Why?

I removed that case as it doesn't make sense.

keys/constants.go, line 201 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

PrefixEnd() exists on Key as well.

Done.

kv/dist_sender.go, line 347 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I believe this makes noNodeAddrsAvailError unused.

Yes, and it's been removed from the codebase entirely. Maybe I'm missing what your comment was meant to say?

roachpb/data.proto, line 155 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/The node liveness span can be used/node_liveness_span is set/

Done.

roachpb/data.proto, line 156 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

avoid the word "leadership".

Done.

roachpb/data.proto, line 157 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/In the case of modification,/When set,/

Done.

roachpb/data.proto, line 158 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

What does it mean when it's set to a single key? When does that happen and when the other case? Also same comment about word "leadership".

Removed leadership, clarified comment.

storage/client_test.go, line 490 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

What prompted this?

Node liveness heartbeats go out pretty quickly, sometimes before we've gossiped the first range, which causes the multi test context to hit the case of stores which haven't yet been initialized.

storage/liveness.proto, line 29 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

What is the heartbeat?

updated to "expiration"

storage/node_liveness.go, line 65 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Looks like you could add LivenessThreshold here.

Not sure what you mean?

storage/node_liveness.go, line 117 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

NodeID should already be in the context.

Done.

storage/node_liveness.go, line 123 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ditto.

Done.

storage/node_liveness.go, line 212 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Why kick that can down the road? We certainly do want to track this and it's worth seeing how this behaves without waiting for the more complicated PRs that try to do much more at once.

This isn't being used yet. It's just part of the API for the next change. I'll add metrics in that change which will make sense in the context of range leases.

storage/node_liveness.go, line 226 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

... and nil returned (which is odd interface. Consider returning a special error instead).

The point of the callback is to avoid getting any error. If I've specified the callback to handle a `ConditionalFailedError`, I don't want to then do a lot of the same work to verify the error type for the return.

storage/node_liveness.go, line 241 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The two branches are equivalent.

They're not. It's the difference between `interface{}(nil)` and `*NodeLiveness(nil)` and it matters to the `Batch` API.
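
For readers unfamiliar with this Go gotcha, a tiny self-contained illustration:

```go
package main

import "fmt"

type Liveness struct{}

func main() {
	// A typed nil pointer stored in an interface is not an untyped
	// nil interface, which is why the two CPut branches differ.
	var typedNil *Liveness
	var asInterface interface{} = typedNil
	fmt.Println(asInterface == nil) // false: interface holds (*Liveness)(nil)

	var untypedNil interface{}
	fmt.Println(untypedNil == nil) // true
}
```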

storage/node_liveness_test.go, line 121 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit: dot.

Done.

storage/node_liveness_test.go, line 134 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

print live as well.

Done.

storage/node_liveness_test.go, line 165 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

nit:

var keysMu struct {
   sync.Mutex
   keys []string
}
Done.

storage/node_liveness_test.go, line 167 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Weird that KeyNodeLivenessPrefix is in gossip.

...as are our other gossip keys. Doesn't seem weird.


@bdarnell (Contributor):

:lgtm:


Reviewed 16 of 34 files at r1, 2 of 6 files at r2, 4 of 17 files at r3, 6 of 9 files at r5, 8 of 8 files at r7.
Review status: all files reviewed at latest revision, 37 unresolved discussions, all commit checks successful.


server/server.go, line 210 at r7 (raw file):

  // liveness expiration and heartbeat interval.
  active, renewal := storage.RangeLeaseDurations(
      storage.RaftElectionTimeout(base.DefaultRaftTickInterval, 0 /* use default */))

This needs to use srvCtx.RaftTickInterval, not base.DefaultRaftTickInterval. And hard-coding zero here seems like a trap if we ever allow plumbing the election timeout through server.Context, so I'd rather go ahead and do that now so we can use it here. (or at least export the default constant and use it in place of 0 so we have something to grep for).


storage/node_liveness.go, line 75 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I see this type more as the API into node liveness in the cluster. A node publishes its liveness through it via the periodic heartbeat, queries other nodes' liveness, and increments other nodes' epochs to prove they were non-live. The tasks are all so closely related I don't see benefit in separating them.

And I don't see the benefit in keeping them together. The difference with the `self` liveness record and the entry for our node ID in gossip is an example of the traps that are possible from combining the two sides of this operation. It looks to me like there's very little cost in making this two separate types (in the same file).

storage/node_liveness.go, line 212 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

This isn't being used yet. It's just part of the API for the next change. I'll add metrics in that change which will make sense in the context of range leases.

I think we want to deploy this change on its own first to make sure the liveness table is working before we start relying on it for leases. Adding metrics here will be useful for that stage of deployment.

storage/node_liveness.go, line 226 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

The point of the callback is to avoid getting any error. If I've specified the callback to handle a ConditionalFailedError, I don't want to then do a lot of the same work to verify the error type for the return.

I understand that you don't want to both run the callback and return an error, but this is still an odd interface. A callback that is only called just before return seems like it would be more naturally handled via a special return value (although it does get ugly - I wouldn't want to return a special error type here for the client to handle, but returning a `Liveness` that is only present on conditional put failure would be strange). I can't come up with a design that I clearly like better than this one so it's not worth obsessing over.

storage/node_liveness.go, line 241 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

They're not. It's the difference between interface{}(nil) and *NodeLiveness(nil) and it matters to the Batch API.

This could use a comment then.

storage/replica.go, line 3404 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Turns out we need to do this to avoid reentry on the same key in the command queue, which blocks this from completion. Added comments in both locations.

Hmm, seems strange. But not worth worrying about for this PR.


@tbg (Member) commented Sep 29, 2016:

Reviewed 8 of 8 files at r7.
Review status: all files reviewed at latest revision, 41 unresolved discussions, all commit checks successful.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Clarified further.

Define what it means to be the "range lease holder" before this comment (in particular, is it a property tied to an HLC timestamp of a command, or a real-time measure, or a mixture of both?). Also, explain what happens at _application_ time of proposed commands. Surely they will error out when the node in the lease isn't equal to the proposing node, but what about the epochs when the node _does_ match? Is it at all a correctness issue to propose at any time not "covered" by the lease (whatever "covered" means since now there is a "real-time duration notion" and a command's HLC timestamp)? I also still would like to see some examples here. Seeing that we're about to update the design doc, the work will be well worth it.

docs/RFCS/range_leases.md, line 68 at r7 (raw file):

does change, *all* of the range leases held by this node are
revoked. A node can only execute commands (propose writes to Raft or
serve reads) if it's the range leaseholder, the range lease epoch is

s/leaseholder/lease holder/ (I think).


kv/dist_sender.go, line 347 at r1 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Yes, and it's been removed from the codebase entirely. Maybe I'm missing what you're comment was meant to say?

Sorry, I didn't notice that it had been removed, I thought there was cruft left you could tear out.

roachpb/data.proto, line 157 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Done.

s/node's/nodes/

roachpb/data.proto, line 158 at r5 (raw file):

For range lease updates, the span is set to the entire node liveness key range.

"range lease updates" is not very specific. Maybe

On lease activity on the range containing the liveness span, all of the liveness information is (re-)gossiped by means of specifying the entire node liveness key range.

though that's also a bit unwieldy.


storage/client_test.go, line 490 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Node liveness heartbeats go out pretty quickly, sometimes before we've gossiped the first range, which causes the multi test context to hit the case of stores which haven't yet been initialized.

Add a TODO to avoid the next person wasting their life away trying to figure out what you just said.

storage/liveness.proto, line 33 at r4 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Ah, actually I think we do need to send the MaxOffset because the case I just mentioned only holds if Node B's MaxOffset is greater than or equal to Node A's. What we need is Node A to stop serving reads at Expiration and Node B to consider a lease or liveness record expired only after Expiration + Max(MaxOffset[Node B], MaxOffset[Node A]).

But again, this just is begging the question of whether we ought to support changing max offset without requiring a cluster freeze. Is this kind of complexity healthy when you weigh costs and benefits?

I don't think changing MaxOffset should be a goal for anytime in the near future.

storage/liveness.proto, line 29 at r7 (raw file):

      (gogoproto.casttype) = "github.com/cockroachdb/cockroach/roachpb.NodeID"];
  // Epoch is a monotonically-increasing value for node liveness. It
  // may be incremented every time a node's expiration is older than a

s/every time/when/

What's a threshold liveness duration? I would think that the condition for updating is that the HLC of the incrementor is ahead of the old expiration timestamp (at the time the action is taken), and (correspondingly, of course) that the new expiration timestamp is larger than the old one.


storage/node_liveness.go, line 65 at r2 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

Not sure what you mean?

The comment referred to code that has since been removed.

storage/node_liveness.go, line 226 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I understand that you don't want to both run the callback and return an error, but this is still an odd interface. A callback that is only called just before return seems like it would be more naturally handled via a special return value (although it does get ugly - I wouldn't want to return a special error type here for the client to handle, but returning a Liveness that is only present on conditional put failure would be strange). I can't come up with a design that I clearly like better than this one so it's not worth obsessing over.

If this interface stays, comment about the nil return please.

storage/node_liveness_test.go, line 167 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

...as are our other gossip keys. Doesn't seem weird.

Still strikes me as odd, but not worth changing here.

storage/replica.go, line 3404 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

Hmm, seems strange. But not worth worrying about for this PR.

Why strange? We call in here synchronously from a commit trigger.


@bdarnell (Contributor):

Review status: all files reviewed at latest revision, 41 unresolved discussions, all commit checks successful.


storage/replica.go, line 3404 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Why strange? We call in here synchronously from a commit trigger.

From a commit trigger we should be using low-level (MVCC) APIs, or kicking off a goroutine to use the high-level (Send) API. Starting in the middle (executeBatch) is weird and seems dangerous - it doesn't matter for the way these tables are being used now, but if we're not blocking in the command queue then we don't interact with other transactions in the same way as normal reads.


@tbg (Member) commented Sep 29, 2016:

Review status: all files reviewed at latest revision, 41 unresolved discussions, all commit checks successful.


storage/replica.go, line 3404 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

From a commit trigger we should be using low-level (MVCC) APIs, or kicking off a goroutine to use the high-level (Send) API. Starting in the middle (executeBatch) is weird and seems dangerous - it doesn't matter for the way these tables are being used now, but if we're not blocking in the command queue then we don't interact with other transactions in the same way as normal reads.

I agree. I meant that it's not strange that this deadlocks with `r.Send` as-is.


circular dependencies. This table maps node IDs to an epoch counter,
and an expiration timestamp.

## Liveness table
@vivekmenezes (Contributor):

Node liveness table

liveness updates will simply resort to a conditional put to increment
a seemingly not-live node's liveness epoch. The conditional put will
fail because the expected value is out of date and the correct liveness
info is returned to the caller.
@vivekmenezes (Contributor):

This paragraph is better placed further down. Perhaps after the next two paragraphs

epoch. If a node is down (and its node lease has expired), another
node may revoke its lease(s) by incrementing the node lease
epoch. If a node is down (and its node liveness has expired), another
node may revoke its lease(s) by incrementing the node liveness
epoch. Once this is done the old range lease is invalidated and a new
node may claim the range lease.
@vivekmenezes (Contributor):

I think it's nice to be explicit here about the disjointed invariant. A range lease can move from node1 to node2 only after the node1's liveness record has expired, and node2 has a valid unexpired liveness epoch.

epoch. If a node is down (and its node lease has expired), another
node may revoke its lease(s) by incrementing the node lease
epoch. If a node is down (and its node liveness has expired), another
node may revoke its lease(s) by incrementing the node liveness
@vivekmenezes (Contributor):

which node liveness?

@spencerkimball (Member, Author):

Review status: all files reviewed at latest revision, 45 unresolved discussions, all commit checks successful.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Define what it means to be the "range lease holder" before this comment (in particular, is it a property tied to an HLC timestamp of a command, or a real-time measure, or a mixture of both?). Also, explain what happens at application time of proposed commands. Surely they will error out when the node in the lease isn't equal to the proposing node, but what about the epochs when the node does match? Is it at all a correctness issue to propose at any time not "covered" by the lease (whatever "covered" means since now there is a "real-time duration notion" and a command's HLC timestamp)?
I also still would like to see some examples here. Seeing that we're about to update the design doc, the work will be well worth it.

I added another couple of paragraphs. Not sure what you're looking for with examples. Perhaps you could suggest some?

docs/RFCS/range_leases.md, line 56 at r7 (raw file):

Previously, vivekmenezes wrote…

Node liveness table

Done.

docs/RFCS/range_leases.md, line 68 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/leaseholder/lease holder/ (I think).

leaseholder is a word.

docs/RFCS/range_leases.md, line 80 at r7 (raw file):

Previously, vivekmenezes wrote…

This paragraph is better placed further down. Perhaps after the next two paragraphs

Done.

docs/RFCS/range_leases.md, line 84 at r7 (raw file):

Previously, vivekmenezes wrote…

which node liveness?

Done.

docs/RFCS/range_leases.md, line 86 at r7 (raw file):

Previously, vivekmenezes wrote…

I think it's nice to be explicit here about the disjointness invariant. A range lease can move from node1 to node2 only after node1's liveness record has expired, and node2 has a valid unexpired liveness epoch.

Done.

kv/dist_sender.go, line 347 at r1 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Sorry, I didn't notice that it had been removed, I thought there was cruft left you could tear out.

Done.

roachpb/data.proto, line 157 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/node's/nodes/

Done.

roachpb/data.proto, line 158 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

For range lease updates, the span is set to the entire node liveness key range.

"range lease updates" is not very specific. Maybe

On lease activity on the range containing the liveness span, all of the liveness information is (re-)gossiped by means of specifying the entire node liveness key range.

though that's also a bit unwieldy.

Done.

server/server.go, line 210 at r7 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This needs to use srvCtx.RaftTickInterval, not base.DefaultRaftTickInterval. And hard-coding zero here seems like a trap if we ever allow plumbing the election timeout through server.Context, so I'd rather go ahead and do that now so we can use it here. (or at least export the default constant and use it in place of 0 so we have something to grep for).

OK, it's in `server.Context` now instead of using 0, though it's currently not settable via command line in any way.

storage/client_test.go, line 490 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Add a TODO to avoid the next person wasting their life away trying to figure out what you just said.

Done.

storage/liveness.proto, line 33 at r4 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

I don't think changing MaxOffset should be a goal for anytime in the near future.

Right. Sent PR #9612

storage/liveness.proto, line 29 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/every time/when/

What's a threshold liveness duration? I would think that the condition for updating is that the HLC of the incrementor is ahead of the old expiration timestamp (at the time the action is taken), and (correspondingly, of course) that the new expiration timestamp is larger than the old one.

Done.

storage/node_liveness.go, line 65 at r2 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

The comment referred to code that has since been removed.

Done.

storage/node_liveness.go, line 75 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

And I don't see the benefit in keeping them together. The difference with the self liveness record and the entry for our node ID in gossip is an example of the traps that are possible from combining the two sides of this operation. It looks to me like there's very little cost in making this two separate types (in the same file).

I dislike the idea of making two objects related to the contents of the node liveness table. Two objects for server, two objects to pass into stores, two objects for all of the unittests.

storage/node_liveness.go, line 212 at r2 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

I think we want to deploy this change on its own first to make sure the liveness table is working before we start relying on it for leases. Adding metrics here will be useful for that stage of deployment.

OK, I added metrics.

storage/node_liveness.go, line 226 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

If this interface stays, comment about the nil return please.

Done.

storage/node_liveness.go, line 241 at r5 (raw file):

Previously, bdarnell (Ben Darnell) wrote…

This could use a comment then.

Done.


@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch 3 times, most recently from 6eb0dc1 to 4b42c26 Compare September 30, 2016 14:09
@tbg
Member

tbg commented Oct 3, 2016

Reviewed 13 of 13 files at r8.
Review status: all files reviewed at latest revision, 44 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

I added another couple of paragraphs. Not sure what you're looking for with examples. Perhaps you could suggest some?

My comment on your new paragraphs has one such example.

docs/RFCS/range_leases.md, line 68 at r7 (raw file):

Previously, spencerkimball (Spencer Kimball) wrote…

leaseholder is a word.

That was also what I was thinking in the PR that introduced the rename, but I was told to spell it `LeaseHolder` instead of `Leaseholder`, indicating that it is in fact two words (at least in this codebase, by precedent).

docs/RFCS/range_leases.md, line 67 at r8 (raw file):

specify an epoch in addition to the owner, instead of using a
timestamp expiration. The lease is valid for as long as the epoch for
the lease holder is valid according to the node liveness table.  To

Double space


docs/RFCS/range_leases.md, line 73 at r8 (raw file):

least the maximum clock offset further in the future than the command
timestamp. If any of these conditions are not true, commands are
rejected before being execute (in the case of read-only commands) or

executed


docs/RFCS/range_leases.md, line 82 at r8 (raw file):

available, or may have changed at command-apply time. It should be
noted that the verification for expiration-based leases is a sanity
check, not a correctness requirement. The replica holding the lease

That's not correct. Checking upstream of Raft is a sanity check. It's necessary downstream.

  • Replica 1 receives a client request for a write
  • Replica 1 checks the lease; the write is permitted
  • Replica 1 proposes the command
  • time passes, Replica 2 commits a new lease
  • the command applies on Replica 1
  • Replica 2 serves anomalous reads which don't see the write
  • the command applies on Replica 2

I'm fairly convinced that anything without some downstream-of-Raft check is incorrect seeing that proposals can arbitrarily delay and reorder.
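For concreteness, a rough sketch of the downstream-of-Raft check this scenario calls for; the types here are hypothetical simplifications, not the actual replica code.

import "fmt"

// lease is a simplified stand-in for the replicated lease record.
type lease struct {
    replica int32
    epoch   int64
}

// command carries the lease the proposer verified upstream of Raft.
type command struct {
    originLease lease
}

// apply rejects a command whose proposer-time lease no longer matches
// the current lease, closing the delay/reorder window sketched in the
// bullets above: the command proposed by Replica 1 would fail to apply
// once Replica 2's lease has committed.
func apply(current lease, cmd command) error {
    if cmd.originLease != current {
        return fmt.Errorf("lease changed since proposal: had %+v, now %+v",
            cmd.originLease, current)
    }
    // ... apply the command to the state machine ...
    return nil
}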


roachpb/data.proto, line 158 at r8 (raw file):

  // need re-gossiping after modification or range lease updates. The
  // span is set to a single key when nodes update their liveness records
  // with heartbeats to extend the expiration timestamp. Changes to then

s/then/the/


storage/node_liveness_test.go, line 147 at r8 (raw file):

  })

  // Verify epoch increment metric count.

Mind briefly stress-testing your two tests?


storage/store.go, line 103 at r8 (raw file):

// defaultRaftElectionTimeoutTicks.
func RaftElectionTimeout(raftTickInterval time.Duration, raftElectionTimeoutTicks int) time.Duration {
  if raftTickInterval == 0 {

Who would pass 0 here?



@tbg tbg added the S-1-stability Severe stability issues that can be fixed by upgrading, but usually don’t resolve by restarting label Oct 4, 2016
@tbg
Member

tbg commented Oct 4, 2016

Assigned stability label to mark for supervised merge.

@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 44 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 66 at r5 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

My comment on your new paragraphs has one such example.

I added your example.

docs/RFCS/range_leases.md, line 68 at r7 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

That was also what I was thinking in the PR that introduced the rename, but I was told to spell it LeaseHolder instead of Leaseholder, indicating that it is in fact two words (at least in this codebase, by precedent).

Done.

docs/RFCS/range_leases.md, line 67 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Double space

Done.

docs/RFCS/range_leases.md, line 73 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

executed

Done.

docs/RFCS/range_leases.md, line 82 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

That's not correct. Checking upstream of Raft is a sanity check. It's necessary downstream.

  • Replica 1 receives a client request for a write
  • Replica 1 checks the lease; the write is permitted
  • Replica 1 proposes the command
  • time passes, Replica 2 commits a new lease
  • the command applies on Replica 1
  • Replica 2 serves anomalous reads which don't see the write
  • the command applies on Replica 2

I'm fairly convinced that anything without some downstream-of-Raft check is incorrect seeing that proposals can arbitrarily delay and reorder.

You're correct. I realized this problem during range lease implementation and have updated the design here.

roachpb/data.proto, line 158 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

s/then/the/

Done.

storage/node_liveness_test.go, line 147 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Mind briefly stress-testing your two tests?

Done.

storage/store.go, line 103 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Who would pass 0 here?

The server does when it creates the node liveness (the normal case). It pulls the raft tick interval from the server context, where it is just set to `0`.
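The zero-substitution presumably looks something like this (a sketch filling in the truncated snippet quoted earlier; `base.DefaultRaftTickInterval` is named elsewhere in this review, while the election-ticks fallback constant is an assumption):

// RaftElectionTimeout returns the raft election timeout, substituting
// defaults for zero-valued arguments; the server passes 0 for values it
// never set on its context.
func RaftElectionTimeout(raftTickInterval time.Duration, raftElectionTimeoutTicks int) time.Duration {
    if raftTickInterval == 0 {
        raftTickInterval = base.DefaultRaftTickInterval // assumed fallback
    }
    if raftElectionTimeoutTicks == 0 {
        raftElectionTimeoutTicks = defaultRaftElectionTimeoutTicks // assumed fallback
    }
    return time.Duration(raftElectionTimeoutTicks) * raftTickInterval
}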


@spencerkimball
Member Author

@tschottdorf sorry for responding with no push. Damn plane wifi won't let me. Will push in CA.

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from 4b42c26 to ce9806a Compare October 6, 2016 16:35
@spencerkimball
Member Author

@tschottdorf ptal

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch 2 times, most recently from 6837ae6 to 1f36e2f Compare October 6, 2016 20:37
@spencerkimball
Member Author

@tschottdorf ping

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from 1f36e2f to 64b45f4 Compare October 11, 2016 05:42
@bdarnell
Contributor

Reviewed 2 of 13 files at r8, 28 of 28 files at r9.
Review status: all files reviewed at latest revision, 26 unresolved discussions, all commit checks successful.



@spencerkimball
Member Author

@cockroachdb/stability, @petermattis I'd like to merge this PR. Speak now or forever hold your peace.

@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from 64b45f4 to ab5882d Compare October 12, 2016 05:39
@tbg
Member

tbg commented Oct 12, 2016

:lgtm: and sorry for letting this sit. Note that this is still a stability label PR so please don't merge it without confirming with @petermattis first.


Reviewed 17 of 28 files at r9, 11 of 11 files at r10.
Review status: all files reviewed at latest revision, 37 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 87 at r10 (raw file):

Raft commands as `OriginLease`. At command-apply time, each node
verifies that the lease in the FSM is equal to the lease verified
upstream of Raft by the proposer.

👍 It might be worth including only a subset of data that identifies the lease uniquely instead of the full proto but let's talk about it when that change actually happens.


server/server.go, line 289 at r10 (raw file):

      RangeLeaseActiveDuration:  active,
      RangeLeaseRenewalDuration: renewal,
      Locality:                  srvCtx.Locality,

Where's this coming from?


storage/node_liveness_test.go, line 216 at r8 (raw file):

  // Verify liveness of all nodes for all nodes.
  verifyLiveness(t, mtc)

Why'd this go? Flaky? Why?


storage/node_liveness_test.go, line 236 at r10 (raw file):

          t.Fatal(err)
      }
      if atomic.LoadInt32(&count) < 2 {

Can't the callback fire a couple of times, making this potentially flaky?


storage/store.go, line 561 at r10 (raw file):

  // Locality is a description of the topography of the store.
  Locality roachpb.Locality

Ditto.



@spencerkimball
Member Author

Review status: all files reviewed at latest revision, 37 unresolved discussions, some commit checks failed.


docs/RFCS/range_leases.md, line 87 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

👍 It might be worth including only a subset of data that identifies the lease uniquely instead of the full proto but let's talk about it when that change actually happens.

Yeah, perhaps. We're going to have to see what kind of performance effect sending the lease has. It's pretty small potatoes when encoded, though.

server/server.go, line 289 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Where's this coming from?

From a rebase. Strange...removed.

storage/node_liveness_test.go, line 216 at r8 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Why'd this go? Flaky? Why?

I don't recall why. Changing it back didn't affect the determinism of the test. I'm just going to leave it like this. Six of one, half a dozen of the other.

storage/node_liveness_test.go, line 236 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Can't the callback fire a couple of times, making this potentially flaky?

Well, that's why we're comparing it as `< 2` instead of `==`

storage/store.go, line 561 at r10 (raw file):

Previously, tschottdorf (Tobias Schottdorf) wrote…

Ditto.

Done.


This change adds a node liveness table as a global system table.
Nodes periodically write updates to their liveness record by doing
a conditional put to the liveness table. The leader of the range
containing the node liveness table gossips the latest information
to the rest of the system.

Each node has a `NodeLiveness` object which can be used to query
the status of any other node to find out if it's live or non-live
according to the liveness threshold duration compared to the last
time it successfully heartbeat its liveness record.

The as-yet-unused `IncrementEpoch` mechanism is also added in this
PR, for eventual use with the planned epoch-based range leader leases.

Updated the range leader lease RFC to reflect current thinking.
@spencerkimball spencerkimball force-pushed the spencerkimball/node-liveness branch from ab5882d to 9e38221 Compare October 12, 2016 09:32
@spencerkimball spencerkimball merged commit 73b6458 into master Oct 12, 2016
@spencerkimball spencerkimball deleted the spencerkimball/node-liveness branch October 12, 2016 09:45
@petermattis
Collaborator

Eh? I guess that is one way to sequence risky PRs.


@spencerkimball
Member Author

This is not a risky PR. But what is the process?


@petermattis
Collaborator

The process was to wait for a thumbs up, merge, deploy the immediately preceding SHA and then deploy this PR. We were holding off merging any of these "stability" changes until this week's beta was baked as it contains all of the develop->master merge changes. Even if this wasn't clear because you're remote this week, @tschottdorf's comment should have alerted you:

Note that this is still a stability label PR so please don't merge it without confirming with @petermattis first.

@spencerkimball
Member Author

Happy to revert it.


@bdarnell
Contributor

I'd say leave it in. I don't think this is risky enough to need a supervised merge (the change to actually use node liveness for leases will be, but this part seems pretty safe).

@tamird
Contributor

tamird commented Oct 12, 2016

Reviewed 6 of 34 files at r1, 1 of 17 files at r3, 1 of 8 files at r7, 2 of 13 files at r8, 16 of 28 files at r9, 9 of 11 files at r10, 2 of 2 files at r11.
Review status: all files reviewed at latest revision, 50 unresolved discussions, all commit checks successful.


docs/RFCS/range_leases.md, line 48 at r11 (raw file):

# Detailed design

We introduce a new node liveness table at the beginning of the keyspace

didn't we just update the design doc to avoid the word "table" in such contexts?


keys/constants.go, line 197 at r11 (raw file):

  // expiration-based range leases instead of the more efficient
  // node-liveness epoch-based range leases (see
  // https://github.com/cockroachdb/cockroach/blob/develop/docs/RFCS/range_leases.md)

this link is broken. there is no develop branch.


server/context.go, line 117 at r11 (raw file):

  RaftTickInterval time.Duration

  // RaftElectionTimeoutTicks is the number of raft ticks before the

It'd be helpful for this comment to explain how this field relates to storage.StoreConfig.RaftElectionTimeoutTicks


storage/node_liveness.go, line 151 at r11 (raw file):

}

// ManualHeartbeat triggers a heartbeat outside of the normal periodic

this can be in helpers_test.go, can't it?


storage/node_liveness.go, line 157 at r11 (raw file):

}

// StopHeartbeat ends the heartbeat loop. Used for unittesting.

ditto


storage/node_liveness.go, line 313 at r11 (raw file):

          var actualLiveness Liveness
          if err := cErr.ActualValue.GetProto(&actualLiveness); err != nil {
              return errors.Errorf("couldn't update node liveness from CPut actual value: %s", err)

errors.Wrapf
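That is, chain the cause with github.com/pkg/errors rather than flattening it into the message, roughly:

return errors.Wrapf(err, "couldn't update node liveness from CPut actual value")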


storage/node_liveness.go, line 328 at r11 (raw file):

  var liveness Liveness
  if err := content.GetProto(&liveness); err != nil {
      log.Error(context.Background(), err)

this is an inappropriate use of context.Background().
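The usual fix is to plumb the caller's context through instead; a sketch with a hypothetical callback signature (the real gossip callback may not accept a ctx directly, in which case one would be captured or derived where the callback is registered):

func (nl *NodeLiveness) livenessGossipUpdate(ctx context.Context, content roachpb.Value) {
    var liveness Liveness
    if err := content.GetProto(&liveness); err != nil {
        log.Error(ctx, err) // caller's ctx, not context.Background()
        return
    }
    // ...
}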


storage/node_liveness_test.go, line 43 at r11 (raw file):

          for _, g := range mtc.gossips {
              live, err := nl.IsLive(g.GetNodeID())
              if !live {

seems like the error should take precedence; won't this always be true while err is non-nil?
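That is, check the error first, along the lines of (the error-branch bodies here are assumptions):

live, err := nl.IsLive(g.GetNodeID())
if err != nil {
    return err
}
if !live {
    return errors.Errorf("node %d not live", g.GetNodeID())
}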


storage/node_liveness_test.go, line 62 at r11 (raw file):

func TestNodeLiveness(t *testing.T) {
  defer leaktest.AfterTest(t)()
  mtc := startMultiTestContext(t, 3)

shouldn't we be moving to testcluster?


storage/node_liveness_test.go, line 76 at r11 (raw file):

      nodeID := mtc.gossips[idx].GetNodeID()
      live, err := nl.IsLive(nodeID)
      if live {

ditto: shouldn't the error take precedence?


storage/node_liveness_test.go, line 114 at r11 (raw file):

  if err := mtc.nodeLivenesses[0].IncrementEpoch(
      context.Background(), deadNodeID); !testutils.IsError(err, "cannot increment epoch on live node") {
      t.Fatalf("expected error incrementing a live node")

include the value of err
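e.g. something like:

t.Fatalf("expected error incrementing a live node; got %v", err)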


storage/node_liveness_test.go, line 142 at r11 (raw file):

      }
      if live, err := mtc.nodeLivenesses[0].IsLive(deadNodeID); live || err != nil {
          return errors.Errorf("expected dead node to remain dead after epoch increment %t: %s", live, err)

%v because err might be nil.


storage/node_liveness_test.go, line 166 at r11 (raw file):

  // Clear the liveness records in store 1's gossip to make sure we're
  // seeing the liveness record properly gossiped at store startup.
  var expKeys []string

seems like using a map[string]struct{} here would have been simpler.
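For reference, the suggested idiom — a set keyed on string with zero-byte values (livenessKey is a hypothetical stand-in for however the test derives its keys):

expKeys := make(map[string]struct{})
for _, nodeID := range nodeIDs {
    expKeys[livenessKey(nodeID)] = struct{}{}
}
// Membership then becomes a single lookup:
if _, ok := expKeys[key]; !ok {
    t.Errorf("unexpected gossip key %s", key)
}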



@petermattis
Collaborator

Leave it in, though you did make a couple of tests flaky (e.g. TestConcurrentRaftSnapshots).

@spencerkimball
Member Author

@tamird I made most of your suggested changes (sending another PR). ManualHeartbeat is still in because it's actually not just for unit testing after all, and the upcoming range lease change requires it.


