stability: investigate indigo issues #13687

Closed
mberhault opened this issue Feb 21, 2017 · 36 comments

@mberhault
Contributor

Indigo keeps breaking, throwing everything from zero-qps to liveness != nodes_up to leader-not-leaseholder alerts.

Alerts are currently silenced.

@BramGruneir
Member

So indigo was running with its .default zone config set to:

constraints: [-centralus]

I've removed this, but I keep seeing the same issue in the logs:
E170221 15:45:51.901299 110 storage/queue.go:628 [raftsnapshot,n1,s1,r6882/2:/System/tsd/cr.node.{cl…-di…},@c434d5d880] [n1,s1,r6882/2:/System/tsd/cr.node.{cl…-di…}]: change replicas aborted due to failed preemptive snapshot: r6882: remote couldn't accept snapshot with error: [n8,s8],r6882: cannot apply snapshot: snapshot intersects existing range; initiated GC: [n8,s8,r1/8:/{Min-System/tsd/c…}]

Which is currently being tracked in #13042

petermattis added this to the 1.0 milestone Feb 23, 2017
@petermattis
Collaborator

I wiped and restarted indigo with a binary built at 602e690. I then ran block_writer and let it run for a few minutes. Almost immediately, I saw the "Raft log behind" metric start to grow:

[screenshot: 2017-02-24 1:27:12 PM]

This corresponded with 1 range being reported as under-replicated:

[screenshot: 2017-02-24 1:27:17 PM]

I stopped block_writer at 18:24 to give the system time to recover. Poking around, I determined that indigo 8 was the source of the "raft log self behind" spike. The logs show we applied a Raft snapshot to r22 17 times between 18:17 and 18:25. This isn't supposed to happen. When we're sending a snapshot we're supposed to hold on to the portion of the Raft log that the snapshot requires.

@petermattis
Collaborator

Started the block_writer load again for a few minutes and this time the under-replicated range did not recover when load was stopped:

[screenshot: 2017-02-24 1:40:57 PM]

[screenshot: 2017-02-24 1:41:23 PM]

Seems like there is a stuck snapshot operation:

[screenshot: 2017-02-24 1:41:08 PM]

Looking at the running goroutines I see:

goroutine 101221 [select]:
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport.(*recvBufferReader).Read(0xc420460c00, 0xc42970ff15, 0x44148, 0x44148, 0x0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport/transport.go:141 +0x556
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport.(*Stream).Read(0xc422854600, 0xc42970ff15, 0x44148, 0x44148, 0x3f23, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/transport/transport.go:326 +0x5c
io.ReadAtLeast(0x2876b60, 0xc422854600, 0xc429654000, 0x10005d, 0x10005d, 0x10005d, 0x5, 0x0, 0x0)
	/usr/local/go/src/io/io.go:307 +0xa9
io.ReadFull(0x2876b60, 0xc422854600, 0xc429654000, 0x10005d, 0x10005d, 0x10005d, 0x0, 0x0)
	/usr/local/go/src/io/io.go:325 +0x58
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*parser).recvMsg(0xc420833040, 0x7fffffff, 0x10007f, 0x10007f, 0x1b70160, 0x6, 0xb, 0x6)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/rpc_util.go:259 +0x117
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.recv(0xc420833040, 0x2889320, 0x2da8070, 0xc422854600, 0x0, 0x0, 0x1b70160, 0xc420c08dc0, 0x7fffffff, 0x0, ...)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/rpc_util.go:339 +0x4d
github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc.(*serverStream).RecvMsg(0xc422e4a5a0, 0x1b70160, 0xc420c08dc0, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/vendor/google.golang.org/grpc/stream.go:612 +0x11d
github.com/cockroachdb/cockroach/pkg/storage.(*multiRaftRaftSnapshotServer).Recv(0xc424a28910, 0xc421121c00, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft.pb.go:377 +0x62
github.com/cockroachdb/cockroach/pkg/storage.(*Store).HandleSnapshot(0xc420404a80, 0xc42096a000, 0x7fb8f48f4348, 0xc424a28910, 0x0, 0x0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/store.go:2716 +0x396
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1.1(0x2893480, 0xc424a28910, 0xc420070300, 0x7fb8f485ee18, 0xc4243035c0, 0x10, 0xd0320b)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:362 +0x196
github.com/cockroachdb/cockroach/pkg/storage.(*RaftTransport).RaftSnapshot.func1(0x7fb8f485ee18, 0xc4243035c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/storage/raft_transport.go:363 +0x5d
github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask.func1(0xc4202a1f40, 0x1c3bce4, 0x3, 0x1, 0x0, 0x0, 0xc4243035f0, 0x7fb8f485ee18, 0xc4243035c0)
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:283 +0xdf
created by github.com/cockroachdb/cockroach/pkg/util/stop.(*Stopper).RunAsyncTask
	/go/src/github.com/cockroachdb/cockroach/pkg/util/stop/stopper.go:284 +0x14a

The above is on indigo 8. On the sender side (indigo 2) the logs show:

E170224 18:42:44.314091 147 storage/queue.go:628  [raftsnapshot,n2,s2,r22/2:/Table/51{-/1/46120…},@c4208e7180] snapshot failed: breaker open
I170224 18:42:45.314044 147 storage/replica_raftstorage.go:414  [raftsnapshot,n2,s2,r22/2:/Table/51{-/1/46120…},@c4208e7180] generated Raft snapshot b66e76e3 at index 672130
W170224 18:43:45.314307 147 storage/raft_transport.go:649  [raftsnapshot,n2,s2,r22/2:/Table/51{-/1/46120…},@c4208e7180] failed to close snapshot stream: rpc error: code = 4 desc = context deadline exceeded
E170224 18:43:45.314823 147 storage/queue.go:628  [raftsnapshot,n2,s2,r22/2:/Table/51{-/1/46120…},@c4208e7180] snapshot failed: rpc error: code = 4 desc = context deadline exceeded
I170224 18:43:45.315803 147 storage/replica_raftstorage.go:414  [raftsnapshot,n2,s2,r23/2:/{Table/51/1/4…-Max},@c425ee0000] generated Raft snapshot 5584b13a at index 192328
E170224 18:43:45.315897 147 storage/queue.go:628  [raftsnapshot,n2,s2,r23/2:/{Table/51/1/4…-Max},@c425ee0000] snapshot failed: breaker open
I170224 18:43:46.314426 147 storage/replica_raftstorage.go:414  [raftsnapshot,n2,s2,r22/2:/Table/51{-/1/46120…},@c4208e7180] generated Raft snapshot 8a13e907 at index 672130
E170224 18:43:46.314537 147 storage/queue.go:628  [raftsnapshot,n2,s2,r22/2:/Table/51{-/1/46120…},@c4208e7180] snapshot failed: breaker open
I170224 18:43:46.514302 147 storage/replica_raftstorage.go:414  [raftsnapshot,n2,s2,r23/2:/{Table/51/1/4…-Max},@c425ee0000] generated Raft snapshot 8f3a5d4d at index 192328

I'm not sure what to make of all this yet.

@petermattis
Collaborator

I restarted indigo 8. After restarting, another snapshot wedged up almost immediately. On the sender side (indigo 2) I see:

I170224 18:58:27.403762 996675 storage/raft_transport.go:437  [n2] raft transport stream to node 8 established
I170224 18:58:27.796160 147 storage/replica_raftstorage.go:414  [raftsnapshot,n2,s2,r22/2:/Table/51{-/1/46120…},@c4208e7180] generated Raft snapshot 401b9a6d at index 672130
...
W170224 18:59:27.795907 147 storage/raft_transport.go:649  [raftsnapshot,n2,s2,r22/2:/Table/51{-/1/46120…},@c4208e7180] failed to close snapshot stream: rpc error: code = 4 desc = context deadline exceeded
E170224 18:59:27.796480 147 storage/queue.go:628  [raftsnapshot,n2,s2,r22/2:/Table/51{-/1/46120…},@c4208e7180] snapshot failed: rpc error: code = 4 desc = context deadline exceeded
I170224 18:59:27.797346 147 storage/replica_raftstorage.go:414  [raftsnapshot,n2,s2,r23/2:/{Table/51/1/4…-Max},@c425ee0000] generated Raft snapshot d3a89d06 at index 192328
E170224 18:59:27.797697 147 storage/queue.go:628  [raftsnapshot,n2,s2,r23/2:/{Table/51/1/4…-Max},@c425ee0000] snapshot failed: breaker open

So indigo 2 tried to send a snapshot for r22 to indigo 8 at 18:58:27. At 18:59:27 the 1m deadline for that operation expired. Indigo 2 then tried to send another snapshot to indigo 8, but that failed immediately because the breaker was open.

Anyone have any thoughts on why we can't send the snapshot in 1m? I'm going to try and add some more log messages to pinpoint what is getting horked up here.

@petermattis
Collaborator

Also, why does indigo 8 seem to never get notification that the snapshot operation timed out? @tamird Is there something funky going on with context cancellation here?

@tamird
Contributor

tamird commented Feb 24, 2017

I wouldn't normally expect this to have anything to do with context cancellation; indigo 8 should be noticing if the sender goes away when it calls Recv or Send on the stream.

That said, if the sender has a timeout (why does it have one, btw?), maybe we should be watching the stream context (which will have a propagated timeout) and bail from the RPC method.

@petermattis
Collaborator

The sender has a 1m timeout because it was started from the Raft snapshot queue. Queues have a default 1m timeout.

The recipient is stuck on Recv. What else can it be watching for?

@tamird
Contributor

tamird commented Feb 24, 2017

diff --git a/pkg/storage/raft_transport.go b/pkg/storage/raft_transport.go
index e228fb0a0..f51bca5bf 100644
--- a/pkg/storage/raft_transport.go
+++ b/pkg/storage/raft_transport.go
@@ -337,8 +337,9 @@ func (t *RaftTransport) RaftMessageBatch(stream MultiRaft_RaftMessageBatchServer
 
 // RaftSnapshot handles incoming streaming snapshot requests.
 func (t *RaftTransport) RaftSnapshot(stream MultiRaft_RaftSnapshotServer) error {
+	ctx := stream.Context()
 	errCh := make(chan error, 1)
-	if err := t.rpcContext.Stopper.RunAsyncTask(stream.Context(), func(ctx context.Context) {
+	if err := t.rpcContext.Stopper.RunAsyncTask(ctx, func(ctx context.Context) {
 		errCh <- func() error {
 			req, err := stream.Recv()
 			if err != nil {
@@ -365,6 +366,8 @@ func (t *RaftTransport) RaftSnapshot(stream MultiRaft_RaftSnapshotServer) error
 		return err
 	}
 	select {
+	case <-ctx.Done():
+		return ctx.Err()
 	case <-t.rpcContext.Stopper.ShouldStop():
 		return nil
 	case err := <-errCh:

Breaking out of the function should unwedge the Recv call.
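
For context on why watching the stream context helps: gRPC propagates the caller's deadline across the wire, so once the sender's 1m queue timeout fires, the receiver's stream.Context() is cancelled and the new select case returns. A minimal sender-side sketch of that propagation (the function and stream-opening callback below are illustrative, not cockroach's actual API):

package snapshotsketch

import (
	"context"
	"time"

	"google.golang.org/grpc"
)

// sendSnapshotWithDeadline shows how a queue-derived deadline travels with the
// RPC: the server's stream.Context() carries the same deadline and is
// cancelled when it expires.
func sendSnapshotWithDeadline(
	ctx context.Context, open func(context.Context) (grpc.ClientStream, error),
) error {
	ctx, cancel := context.WithTimeout(ctx, time.Minute) // the queue's 1m default
	defer cancel()
	stream, err := open(ctx) // e.g. client.RaftSnapshot(ctx); deadline is propagated
	if err != nil {
		return err
	}
	// ... send snapshot batches; Send/Recv return an error once the deadline expires.
	return stream.CloseSend()
}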

@petermattis
Collaborator

Oh, yeah, that's not good. That explains why the snapshot is getting wedged on the recipient. And I think it is timing out because we're only able to send ~1MB/sec and the range is larger than 60MB, so the transfer takes longer than the 1m deadline.

@petermattis
Collaborator

Hmm, a simple bandwidth test using nc shows we can send 40MB/sec between indigo 2 and indigo 8.

@tamird
Contributor

tamird commented Feb 24, 2017 via email

@petermattis
Collaborator

I added log messages in sendSnapshot for every batch being sent. The log messages were output roughly once a second on average, and for a given sendSnapshot invocation I saw 60 of them before we hit the deadline.

@petermattis
Collaborator

I'm suspicious that indigo 8 is falling behind simply because it has higher latency. Ping times between indigo 2 and indigo 5 (the other node with a replica) are 40ms; between indigo 2 and indigo 8 they are 60ms. The Raft log is likely growing to a size where we're truncating it and forcing a snapshot. But there really isn't another choice here. Seems like we need flow control à la #8659.

Cc @bdarnell, @spencerkimball

petermattis added a commit to petermattis/cockroach that referenced this issue Feb 24, 2017
Bump the snapshot operation timeout from 1m to 5m. We've seen in
production that sending O(64MB) can take longer than 1m on a
cross-region link. (Need to investigate why this is happening, though).

Properly notice cancellation of snapshot streams on the
recipient. Failing to do this would cause a reception of a snapshot to
wedge permanently if the associated operation timed out.

See cockroachdb#13687
@bdarnell
Contributor

We do ultimately need flow control, but in this case we should really be able to send the snapshot in under a minute. I suspect that we're hitting some sort of pathological case in grpc or its http/2 layer that might explain both this and #13722.

@petermattis
Collaborator

Even if we can send snapshots faster, the fact that one Replica is "slower" than the other 2 means it will fall farther and farther behind until a snapshot is needed. Hence the need for flow control.

@bdarnell
Contributor

A higher-latency replica is not automatically doomed to fall behind; it should be able to keep up as long as it has sufficient bandwidth (of course, the two tend to be correlated).

@petermattis
Collaborator

Regardless of the reason a replica is continually falling behind (latency, bandwidth, slowness on the node itself), without flow control we'll get into these snapshot loops.

But before flow control, I'm going to do a bit of experimentation to try and figure out why our bandwidth via gRPC seems dramatically lower than what nc sees (1-2MB/sec vs 30-40MB/sec).

@petermattis
Collaborator

I wrote a simple gRPC program that implements the following service:

message PingRequest {
  optional bytes payload = 1;
}

message PingResponse {
  optional bytes payload = 1;
}

service Pinger {
  rpc Ping (PingRequest) returns (PingResponse) {}
  rpc PingStream (stream PingRequest) returns (stream PingResponse) {}
}

And used some of the looping logic from loadgen/kv. On indigo 8 I ran:

~$ ./pinger -l :2222 -s 0

And on indigo 2:

~$ ./pinger -c $[1024*1024] cockroach-indigo-0008.crdb.io:2222
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
      1s      0.0      0.0      0.0      0.0      0.0      0.0
      2s      1.0      1.0   1208.0   1208.0   1208.0   1208.0
      3s      1.0      1.0   1140.9   1140.9   1140.9   1140.9
      4s      1.0      1.0   1140.9   1140.9   1140.9   1140.9
      5s      1.0      1.0   1140.9   1140.9   1140.9   1140.9
      6s      1.0      1.0   1140.9   1140.9   1140.9   1140.9
      7s      1.0      1.0   1140.9   1140.9   1140.9   1140.9
      8s      1.0      1.0   1140.9   1140.9   1140.9   1140.9
      9s      1.0      1.0   1140.9   1140.9   1140.9   1140.9
     10s      1.0      1.0   1140.9   1140.9   1140.9   1140.9

Wtf?

For comparison, an nc-based bandwidth test shows:

~$ dd if=/dev/zero bs=1024K count=512 | nc -v cockroach-indigo-0008.crdb.io 2222
Connection to cockroach-indigo-0008.crdb.io 2222 port [tcp/*] succeeded!
512+0 records in
512+0 records out
536870912 bytes (537 MB, 512 MiB) copied, 11.7852 s, 45.6 MB/s

And the pinger program between indigo 2 and indigo 3:

~$ ./pinger -c $[1024*1024] cockroach-indigo-0003.crdb.io:2222
_elapsed____ops/s_____MB/s__p50(ms)__p95(ms)__p99(ms)_pMax(ms)
      1s     56.0     56.0     15.2     22.0     26.2     29.4
      2s     56.0     56.0     16.8     25.2     27.3     37.7
      3s     51.0     51.0     18.9     28.3     35.7     37.7
      4s     66.0     66.0     14.7     19.9     22.0     30.4
      5s     62.0     62.0     15.7     23.1     25.2     25.2
      6s     56.0     56.0     17.8     26.2     31.5     33.6
      7s     58.0     58.0     16.3     23.1     25.2     25.2
      8s     58.0     58.0     16.3     25.2     27.3     32.5
      9s     60.0     60.0     16.3     23.1     25.2     28.3
     10s     63.0     63.0     15.7     21.0     24.1     25.2

But even this is sad as nc reports it can send 242 MB/s between indigo 2 and 3.
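
For reference, a rough sketch of what the pinger's streaming loop might look like. The actual program isn't in this thread, so the example.com/pinger/pingerpb import path, the generated PingerClient, and the reporting format are assumptions; only the proto above and the 1MB-per-RPC payload size come from the runs shown here.

package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"google.golang.org/grpc"

	pingerpb "example.com/pinger/pingerpb" // hypothetical package generated from the proto above
)

func main() {
	conn, err := grpc.Dial("cockroach-indigo-0008.crdb.io:2222", grpc.WithInsecure())
	if err != nil {
		log.Fatal(err)
	}
	defer conn.Close()

	stream, err := pingerpb.NewPingerClient(conn).PingStream(context.Background())
	if err != nil {
		log.Fatal(err)
	}

	payload := make([]byte, 1<<20) // -c $[1024*1024]: 1MB per RPC
	start := time.Now()
	for sent := 0; ; sent++ {
		if err := stream.Send(&pingerpb.PingRequest{Payload: payload}); err != nil {
			log.Fatal(err)
		}
		if _, err := stream.Recv(); err != nil {
			log.Fatal(err)
		}
		// With a 1MB payload, cumulative ops/s is numerically the same as MB/s.
		elapsed := time.Since(start).Seconds()
		fmt.Printf("%6.0fs %8.1f MB/s\n", elapsed, float64(sent+1)/elapsed)
	}
}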

@petermattis
Collaborator

gRPC has a default "window size" of 64KB. This is the maximum amount of data that can be sent on a stream before blocking until the data is acknowledged. With a 60ms RTT, sending 64KB at a time results in a bandwidth of 64KB * 1000 / 60 == ~1MB/s. I believe gRPC is supposed to grow the size of the window, but perhaps that isn't working correctly.
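
As a sanity check on that arithmetic, here's a throw-away calculation (nothing cockroach-specific) of window-limited throughput at a 60ms RTT:

package main

import (
	"fmt"
	"time"
)

// Per-stream throughput is bounded by window size / RTT when the sender must
// stall until the receiver acknowledges the window.
func main() {
	rtt := 60 * time.Millisecond
	for _, window := range []int{64 << 10, 1 << 20, 2 << 20} {
		mbps := float64(window) / rtt.Seconds() / (1 << 20)
		fmt.Printf("window %4d KB -> ~%4.1f MB/s\n", window>>10, mbps)
	}
	// Output:
	//   window   64 KB -> ~ 1.0 MB/s
	//   window 1024 KB -> ~16.7 MB/s
	//   window 2048 KB -> ~33.3 MB/s
}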

@bdarnell
Contributor

I've been looking at GRPC's HTTP/2 window logic. It does not attempt to grow the window; it just uses a fixed 64KB per stream and 1024KB per connection. Looks like we need to either make it configurable or make it grow dynamically.

@petermattis
Collaborator

Yeah, that's what I'm seeing too. I also found grpc/grpc-go#760.

Manually bumping the window size by 10x allows my pinger program to get 7MB/s.

@petermattis
Collaborator

Java gRPC uses a default flow control window size of 1MB (16x the Go gRPC default): https://github.com/grpc/grpc-java/blob/master/netty/src/main/java/io/grpc/netty/NettyChannelBuilder.java#L72

@spencerkimball
Member

spencerkimball commented Feb 25, 2017 via email

@petermattis
Collaborator

> Does pinger do better than 7MB/s with 16x?

Yes, but you have to set the size of each RPC larger than 1MB. Even doing that, I've only seen the bandwidth hit 10MB/s.

@bdarnell
Contributor

There are two window size settings, one for the stream and one for the connection. Are you increasing both? (The linked Java code sets the stream window; I can't find whether the same value is used for the connection window.)

@petermattis
Collaborator

grpc/grpc-go#1043

While this is being fixed upstream, I wonder if we should have a local gRPC fork where we bump initialWindowSize and initialConnWindowSize.
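
For illustration, here's roughly what that bump looks like. The WithInitialWindowSize/InitialWindowSize options shown below are my assumption of how later grpc-go releases expose these knobs (the upstream work is what grpc/grpc-go#1043 tracks); at the time of this thread the equivalent change meant editing the initialWindowSize and initialConnWindowSize constants in a fork.

package rpcwindow

import "google.golang.org/grpc"

const windowSize = 2 << 20 // 2MB, the value used in the experiment below

// dialWithLargeWindows opens a client connection with enlarged HTTP/2
// flow-control windows (per-stream and per-connection). Transport security is
// omitted here for brevity.
func dialWithLargeWindows(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr,
		grpc.WithInsecure(),
		grpc.WithInitialWindowSize(windowSize),
		grpc.WithInitialConnWindowSize(windowSize),
	)
}

// newServerWithLargeWindows applies matching settings on the server side.
func newServerWithLargeWindows() *grpc.Server {
	return grpc.NewServer(
		grpc.InitialWindowSize(windowSize),
		grpc.InitialConnWindowSize(windowSize),
	)
}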

Also, we don't have compression enabled. At the time we switched to gRPC, I don't think there were hooks for compression, but now there are.

@petermattis
Collaborator

[screenshot: 2017-02-27 12:14:03 PM]

The region on the left is with the default window size settings. On the right is a manual edit to grpc where I bumped initialWindowSize and initialConnWindowSize to 2MB each. Much improved, but note that this only improves the speed of snapshots. We're still generating snapshots frequently because of Raft log truncation, and we're truncating the Raft log (and thereby forcing Raft snapshots) because one of the replicas cannot keep up.

@spencerkimball
Member

spencerkimball commented Feb 27, 2017 via email

@petermattis
Collaborator

> I'm happy to look at flow control next.

I'm already looking at it.

@petermattis
Collaborator

I wonder if our default raft.Config.MaxInflightMsgs setting of 4 is problematic for high-latency networks. That setting was lowered from 256 to reduce the burden of applying a large number of messages at once. At the time, it was noted that writing the log messages wasn't problematic, but Replica.handleRaftReady will then apply all of the messages in one go, starving other Raft traffic for the Range (such as heartbeats).

@bdarnell
Contributor

Yeah, the combination of MaxInflightMsgs and MaxSizePerMsg (currently 4 and 16KB) gives us the same 1MB/sec limit per range when latency is 60ms. We should increase one or both of these values (preferably MaxInflightMsgs, I think) to the same size we're using for HTTP/2 flow control.
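
To make those knobs concrete, here is a sketch of the relevant etcd/raft config fields with the numbers from this thread; the field names are etcd/raft's, and the wiring into cockroach's store is omitted.

package raftsketch

import "github.com/coreos/etcd/raft"

// With a 60ms RTT, per-range append throughput is roughly
// MaxInflightMsgs * MaxSizePerMsg / RTT:
//   4  * 16KB / 60ms ≈  1 MB/s  (the old default, matching the observed limit)
//   64 * 16KB / 60ms ≈ 17 MB/s  (the default adopted below)
func tuneRaftConfig(cfg *raft.Config) {
	cfg.MaxSizePerMsg = 16 << 10 // 16KB per MsgApp batch
	cfg.MaxInflightMsgs = 64     // raised from 4
}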

@petermattis
Collaborator

I'm testing various values of COCKROACH_RAFT_LOG_MAX_SIZE (1MB, 2MB, 4MB and 8MB) and COCKROACH_RAFT_MAX_INFLIGHT_MSGS (4, 16, 64 and 256) on indigo.
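
A hypothetical sketch of how those knobs could be read from the environment; the variable names and defaults come from this thread, but the parsing helper is illustrative rather than how cockroach actually wires them up.

package raftenv

import (
	"os"
	"strconv"
)

// envInt64 returns the parsed value of the named environment variable, or def
// if it is unset or unparsable.
func envInt64(name string, def int64) int64 {
	if s := os.Getenv(name); s != "" {
		if v, err := strconv.ParseInt(s, 10, 64); err == nil {
			return v
		}
	}
	return def
}

var (
	// Defaults reflect the values in use before this investigation.
	raftLogMaxSize      = envInt64("COCKROACH_RAFT_LOG_MAX_SIZE", 1<<20) // 1MB
	raftMaxInflightMsgs = int(envInt64("COCKROACH_RAFT_MAX_INFLIGHT_MSGS", 4))
)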

@petermattis
Collaborator

[screenshot: 2017-02-28 6:47:58 PM]

The above shows 16 experiments varying max-inflight-msgs across 4, 16, 64 and 256 and Raft log max-size across 1MB, 2MB, 4MB and 8MB. The short story is that setting max-inflight-msgs to 4 is too low: 3 of the 4 runs with that setting show problematic Raft log growth. Curiously, setting max-inflight-msgs to 256 also seems mildly problematic, and I have no explanation for that.

Given this experiment, I'm going to default max-inflight-msgs to 64 and Raft log max-size to 4MB. We'll want to pay attention to what this does to chaos testing on blue.

petermattis added a commit to petermattis/cockroach that referenced this issue Feb 28, 2017
Increase Raft MaxInFlightMsgs default from 4 to 64, which allows deeper
pipelining of Raft messages on high latency links.

Increase Raft log max-size from 1MB to 4MB. The 1MB setting was overly
aggressive on high latency links and seems to contribute to increased
likelihood of Raft snapshots.

See cockroachdb#13687
@spencerkimball
Member

spencerkimball commented Mar 1, 2017 via email

@petermattis
Collaborator

Yes, exactly.

@mberhault
Contributor Author

I think this is resolved. Please re-open if not.
