TestStoreRangeRebalance failed under stress #10193

cockroach-teamcity · 2016-10-25T07:01:48Z

SHA: https://github.com/cockroachdb/cockroach/commits/ca89f456766a8f0381815e58aa7abfe5d3ece741

Stress build found a failed test:

I161025 07:00:18.636488 20970 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161025 07:00:18.637936 20970 gossip/gossip.go:237  [n?] initial resolvers: []
W161025 07:00:18.638332 20970 gossip/gossip.go:1055  [n?] no resolvers found; use --join to specify a connected node
I161025 07:00:18.638661 20970 base/node_id.go:62  NodeID set to 1
I161025 07:00:18.656447 20970 storage/store.go:1151  [n1,s1]: failed initial metrics computation: [n1,s1]: system config not yet available
I161025 07:00:18.656762 20970 gossip/gossip.go:280  [n1] NodeDescriptor set to node_id:1 address:<network_field:"tcp" address_field:"127.0.0.1:37344" > attrs:<> locality:<> 
I161025 07:00:18.671578 20990 storage/replica_proposal.go:292  [s1,r1/1:/M{in-ax}] new range lease replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 900ms following replica {0 0 0} 1970-01-01 00:00:00 +0000 UTC 0s [physicalTime=1970-01-01 00:00:00 +0000 UTC]
I161025 07:00:18.681440 20970 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161025 07:00:18.683173 20970 gossip/gossip.go:237  [n?] initial resolvers: [127.0.0.1:37344]
W161025 07:00:18.687152 20970 gossip/gossip.go:1057  [n?] no incoming or outgoing connections
I161025 07:00:18.687465 20970 base/node_id.go:62  NodeID set to 2
I161025 07:00:18.720593 20970 storage/store.go:1151  [n2,s2]: failed initial metrics computation: [n2,s2]: system config not yet available
I161025 07:00:18.720903 20970 gossip/gossip.go:280  [n2] NodeDescriptor set to node_id:2 address:<network_field:"tcp" address_field:"127.0.0.1:35894" > attrs:<> locality:<> 
I161025 07:00:18.722132 20970 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161025 07:00:18.728701 20970 gossip/gossip.go:237  [n?] initial resolvers: [127.0.0.1:37344]
I161025 07:00:18.729262 21128 gossip/client.go:126  [n2] node 2: started gossip client to 127.0.0.1:37344
W161025 07:00:18.751634 20970 gossip/gossip.go:1057  [n?] no incoming or outgoing connections
I161025 07:00:18.751951 20970 base/node_id.go:62  NodeID set to 3
I161025 07:00:18.769635 21167 gossip/client.go:126  [n3] node 3: started gossip client to 127.0.0.1:37344
I161025 07:00:18.772127 20970 gossip/gossip.go:280  [n3] NodeDescriptor set to node_id:3 address:<network_field:"tcp" address_field:"127.0.0.1:36810" > attrs:<> locality:<> 
I161025 07:00:18.777182 20970 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161025 07:00:18.799070 20970 gossip/gossip.go:237  [n?] initial resolvers: [127.0.0.1:37344]
W161025 07:00:18.799571 20970 gossip/gossip.go:1057  [n?] no incoming or outgoing connections
I161025 07:00:18.800114 20970 base/node_id.go:62  NodeID set to 4
I161025 07:00:18.833857 20970 storage/store.go:1151  [n4,s4]: failed initial metrics computation: [n4,s4]: system config not yet available
I161025 07:00:18.834179 20970 gossip/gossip.go:280  [n4] NodeDescriptor set to node_id:4 address:<network_field:"tcp" address_field:"127.0.0.1:60110" > attrs:<> locality:<> 
I161025 07:00:18.835642 21262 gossip/client.go:126  [n4] node 4: started gossip client to 127.0.0.1:37344
I161025 07:00:18.837467 20970 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161025 07:00:18.839250 20970 gossip/gossip.go:237  [n?] initial resolvers: [127.0.0.1:37344]
W161025 07:00:18.839843 20970 gossip/gossip.go:1057  [n?] no incoming or outgoing connections
I161025 07:00:18.843654 20970 base/node_id.go:62  NodeID set to 5
I161025 07:00:18.888068 20970 storage/store.go:1151  [n5,s5]: failed initial metrics computation: [n5,s5]: system config not yet available
I161025 07:00:18.888382 20970 gossip/gossip.go:280  [n5] NodeDescriptor set to node_id:5 address:<network_field:"tcp" address_field:"127.0.0.1:40370" > attrs:<> locality:<> 
I161025 07:00:18.896468 20970 storage/engine/rocksdb.go:340  opening in memory rocksdb instance
I161025 07:00:18.907972 21219 gossip/client.go:126  [n5] node 5: started gossip client to 127.0.0.1:37344
I161025 07:00:18.928990 21346 gossip/server.go:263  [n1] refusing gossip from node 5 (max 3 conns); forwarding to 4 ({tcp 127.0.0.1:60110})
I161025 07:00:18.929375 21346 gossip/server.go:263  [n1] refusing gossip from node 5 (max 3 conns); forwarding to 2 ({tcp 127.0.0.1:35894})
I161025 07:00:18.930816 20970 gossip/gossip.go:237  [n?] initial resolvers: [127.0.0.1:37344]
W161025 07:00:18.931213 20970 gossip/gossip.go:1057  [n?] no incoming or outgoing connections
I161025 07:00:18.931512 20970 base/node_id.go:62  NodeID set to 6
I161025 07:00:18.937475 21346 gossip/server.go:263  [n1] refusing gossip from node 5 (max 3 conns); forwarding to 4 ({tcp 127.0.0.1:60110})
I161025 07:00:18.973249 21219 gossip/client.go:131  [n5] node 5: closing client to node 1 (127.0.0.1:37344): received forward from node 1 to 4 (127.0.0.1:60110)
I161025 07:00:18.973794 21402 gossip/client.go:126  [n5] node 5: started gossip client to 127.0.0.1:60110
I161025 07:00:18.984323 20970 storage/store.go:1151  [n6,s6]: failed initial metrics computation: [n6,s6]: system config not yet available
I161025 07:00:18.984643 20970 gossip/gossip.go:280  [n6] NodeDescriptor set to node_id:6 address:<network_field:"tcp" address_field:"127.0.0.1:47909" > attrs:<> locality:<> 
I161025 07:00:18.986792 21465 gossip/client.go:126  [n6] node 6: started gossip client to 127.0.0.1:37344
I161025 07:00:18.999127 21473 gossip/server.go:263  [n1] refusing gossip from node 6 (max 3 conns); forwarding to 2 ({tcp 127.0.0.1:35894})
I161025 07:00:19.010596 21473 gossip/server.go:263  [n1] refusing gossip from node 6 (max 3 conns); forwarding to 2 ({tcp 127.0.0.1:35894})
I161025 07:00:19.011183 21473 gossip/server.go:263  [n1] refusing gossip from node 6 (max 3 conns); forwarding to 2 ({tcp 127.0.0.1:35894})
I161025 07:00:19.011537 21465 gossip/client.go:131  [n6] node 6: closing client to node 1 (127.0.0.1:37344): received forward from node 1 to 2 (127.0.0.1:35894)
I161025 07:00:19.024425 21380 gossip/client.go:126  [n6] node 6: started gossip client to 127.0.0.1:35894
I161025 07:00:19.105553 20970 storage/replica_command.go:2354  [s1,r1/1:/M{in-ax}] initiating a split of this range at key "split" [r2]
E161025 07:00:19.203456 21021 storage/queue.go:569  [replicate] (purgatory) on [n1,s1,r1/1:{/Min-"split"}]: 0 of 0 stores with an attribute matching []; likely not enough nodes in cluster
E161025 07:00:19.225795 21021 storage/queue.go:569  [replicate] (purgatory) on [n1,s1,r2/1:{"split"-/Max}]: 0 of 0 stores with an attribute matching []; likely not enough nodes in cluster
I161025 07:00:19.245841 20970 storage/replica_raftstorage.go:446  [s1,r1/1:{/Min-"split"}] generated snapshot 129a40ae for range 1 at index 31 in 98.195µs.
I161025 07:00:19.252897 20970 storage/store.go:3032  streamed snapshot: kv pairs: 33, log entries: 21
I161025 07:00:19.254664 21603 storage/replica_raftstorage.go:577  [s2] [n2,s2,r1/?:{-}]: with replicaID [?], applying preemptive snapshot at index 31 (id=129a40ae, encoded size=16, 1 rocksdb batches, 21 log entries)
I161025 07:00:19.272327 21603 storage/replica_raftstorage.go:580  [s2] [n2,s2,r1/?:{/Min-"split"}]: with replicaID [?], applied preemptive snapshot in 0.017s
I161025 07:00:19.276623 20970 storage/replica_command.go:3232  change replicas: read existing descriptor range_id:1 start_key:"" end_key:"split" replicas:<node_id:1 store_id:1 replica_id:1 > next_replica_id:2 
I161025 07:00:19.328255 21569 storage/replica.go:1793  [s1,r1/1:{/Min-"split"}] proposing ADD_REPLICA {NodeID:2 StoreID:2 ReplicaID:2} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2}]
I161025 07:00:19.368421 20970 storage/replica_raftstorage.go:446  [s1,r1/1:{/Min-"split"}] generated snapshot 6c22622e for range 1 at index 36 in 114.095µs.
I161025 07:00:19.370365 21666 storage/raft_transport.go:423  raft transport stream to node 1 established
I161025 07:00:19.374031 20970 storage/store.go:3032  streamed snapshot: kv pairs: 39, log entries: 26
I161025 07:00:19.375836 21598 storage/replica_raftstorage.go:577  [s3] [n3,s3,r1/?:{-}]: with replicaID [?], applying preemptive snapshot at index 36 (id=6c22622e, encoded size=16, 1 rocksdb batches, 26 log entries)
I161025 07:00:19.380724 21598 storage/replica_raftstorage.go:580  [s3] [n3,s3,r1/?:{/Min-"split"}]: with replicaID [?], applied preemptive snapshot in 0.005s
I161025 07:00:19.404399 20970 storage/replica_command.go:3232  change replicas: read existing descriptor range_id:1 start_key:"" end_key:"split" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > next_replica_id:3 
I161025 07:00:19.660711 21624 storage/replica.go:1793  [s1,r1/1:{/Min-"split"}] proposing ADD_REPLICA {NodeID:3 StoreID:3 ReplicaID:3} for range 1: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:3 StoreID:3 ReplicaID:3}]
I161025 07:00:19.896606 20970 storage/replica_raftstorage.go:446  [s1,r2/1:{"split"-/Max}] generated snapshot 0b5f3204 for range 2 at index 11 in 131.194µs.
I161025 07:00:19.900139 20970 storage/store.go:3032  streamed snapshot: kv pairs: 28, log entries: 1
I161025 07:00:19.949931 21759 storage/replica_raftstorage.go:577  [s2] [n2,s2,r2/?:{-}]: with replicaID [?], applying preemptive snapshot at index 11 (id=0b5f3204, encoded size=16, 1 rocksdb batches, 1 log entries)
I161025 07:00:19.951620 21759 storage/replica_raftstorage.go:580  [s2] [n2,s2,r2/?:{"split"-/Max}]: with replicaID [?], applied preemptive snapshot in 0.001s
I161025 07:00:19.960562 21725 storage/raft_transport.go:423  raft transport stream to node 1 established
I161025 07:00:19.967837 20970 storage/replica_command.go:3232  change replicas: read existing descriptor range_id:2 start_key:"split" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > next_replica_id:2 
W161025 07:00:20.246771 21842 storage/intent_resolver.go:313  [n2,s2,r1/2:{/Min-"split"}]: failed to push during intent resolution: failed to push "change-replica" id=f5f1f3b5 key=/Local/Range/"split"/RangeDescriptor rw=true pri=0.02533519 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000000,578 orig=0.000000000,578 max=0.000000000,578 wto=false rop=false
W161025 07:00:20.382662 21677 storage/intent_resolver.go:313  [n2,s2,r1/2:{/Min-"split"}]: failed to push during intent resolution: failed to push "change-replica" id=f5f1f3b5 key=/Local/Range/"split"/RangeDescriptor rw=true pri=0.02533519 iso=SERIALIZABLE stat=PENDING epo=0 ts=0.000000000,578 orig=0.000000000,578 max=0.000000000,578 wto=false rop=false
I161025 07:00:20.400857 21676 storage/replica.go:1793  [s1,r2/1:{"split"-/Max}] proposing ADD_REPLICA {NodeID:2 StoreID:2 ReplicaID:2} for range 2: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2}]
I161025 07:00:20.636311 20970 storage/replica_raftstorage.go:446  [s1,r2/1:{"split"-/Max}] generated snapshot 327c3d5a for range 2 at index 16 in 100.495µs.
I161025 07:00:20.647627 20970 storage/store.go:3032  streamed snapshot: kv pairs: 30, log entries: 6
I161025 07:00:20.654484 21888 storage/replica_raftstorage.go:577  [s4] [n4,s4,r2/?:{-}]: with replicaID [?], applying preemptive snapshot at index 16 (id=327c3d5a, encoded size=16, 1 rocksdb batches, 6 log entries)
I161025 07:00:20.697362 21888 storage/replica_raftstorage.go:580  [s4] [n4,s4,r2/?:{"split"-/Max}]: with replicaID [?], applied preemptive snapshot in 0.043s
I161025 07:00:20.727191 20970 storage/replica_command.go:3232  change replicas: read existing descriptor range_id:2 start_key:"split" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > next_replica_id:3 
I161025 07:00:21.679896 21840 storage/raft_transport.go:423  raft transport stream to node 2 established
I161025 07:00:21.757827 22023 storage/replica.go:1793  [s1,r2/1:{"split"-/Max}] proposing ADD_REPLICA {NodeID:4 StoreID:4 ReplicaID:3} for range 2: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:4 StoreID:4 ReplicaID:3}]
I161025 07:00:21.991116 20970 storage/replica_raftstorage.go:446  [s1,r2/1:{"split"-/Max}] generated snapshot 58779c29 for range 2 at index 19 in 101.796µs.
I161025 07:00:22.020959 20970 storage/store.go:3032  streamed snapshot: kv pairs: 31, log entries: 9
I161025 07:00:22.022787 22066 storage/replica_raftstorage.go:577  [s5] [n5,s5,r2/?:{-}]: with replicaID [?], applying preemptive snapshot at index 19 (id=58779c29, encoded size=16, 1 rocksdb batches, 9 log entries)
I161025 07:00:22.032897 22066 storage/replica_raftstorage.go:580  [s5] [n5,s5,r2/?:{"split"-/Max}]: with replicaID [?], applied preemptive snapshot in 0.010s
I161025 07:00:22.038836 20970 storage/replica_command.go:3232  change replicas: read existing descriptor range_id:2 start_key:"split" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:4 store_id:4 replica_id:3 > next_replica_id:4 
I161025 07:00:22.042480 22057 storage/raft_transport.go:423  raft transport stream to node 1 established
I161025 07:00:22.140359 22041 storage/raft_transport.go:423  raft transport stream to node 3 established
I161025 07:00:23.476766 22181 storage/replica.go:1793  [s1,r2/1:{"split"-/Max}] proposing ADD_REPLICA {NodeID:5 StoreID:5 ReplicaID:4} for range 2: [{NodeID:1 StoreID:1 ReplicaID:1} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:4 StoreID:4 ReplicaID:3} {NodeID:5 StoreID:5 ReplicaID:4}]
I161025 07:00:23.747140 22193 storage/raft_transport.go:423  raft transport stream to node 1 established
I161025 07:00:23.784601 20970 storage/replica_command.go:3232  change replicas: read existing descriptor range_id:2 start_key:"split" end_key:"\377\377" replicas:<node_id:1 store_id:1 replica_id:1 > replicas:<node_id:2 store_id:2 replica_id:2 > replicas:<node_id:4 store_id:4 replica_id:3 > replicas:<node_id:5 store_id:5 replica_id:4 > next_replica_id:5 
I161025 07:00:24.700070 21028 storage/replica.go:1841  [s1,r1/1:{/Min-"split"}] not quiescing: 7 pending commands
I161025 07:00:25.190026 22101 storage/replica.go:1793  [s1,r2/1:{"split"-/Max}] proposing REMOVE_REPLICA {NodeID:1 StoreID:1 ReplicaID:1} for range 2: [{NodeID:5 StoreID:5 ReplicaID:4} {NodeID:2 StoreID:2 ReplicaID:2} {NodeID:4 StoreID:4 ReplicaID:3}]
I161025 07:00:25.484946 21584 storage/store.go:2893  [s1] [n1,s1,r2/1:{"split"-/Max}]: added to replica GC queue (peer suggestion)
I161025 07:00:25.672147 22159 storage/raft_transport.go:423  raft transport stream to node 5 established
I161025 07:00:25.714677 22158 storage/raft_transport.go:423  raft transport stream to node 4 established
I161025 07:00:25.770179 22378 storage/raft_transport.go:423  raft transport stream to node 2 established
I161025 07:00:25.783129 22161 storage/raft_transport.go:423  raft transport stream to node 2 established
I161025 07:00:25.936934 21111 storage/replica_proposal.go:292  [s2,r2/2:{"split"-/Max}] new range lease replica {2 2 2} 1970-01-01 00:00:00.9 +0000 UTC 1.8s following replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 900ms [physicalTime=1970-01-01 00:00:01.8 +0000 UTC]
W161025 07:00:26.181004 21032 raft/raft.go:696  [s1,r2/1:{"split"-/Max}] 1 stepped down to follower since quorum is not active
I161025 07:00:26.740749 21112 storage/replica_proposal.go:292  [s2,r1/2:{/Min-"split"}] new range lease replica {2 2 2} 1970-01-01 00:00:00.9 +0000 UTC 1.8s following replica {1 1 1} 1970-01-01 00:00:00 +0000 UTC 900ms [physicalTime=1970-01-01 00:00:12.6 +0000 UTC]
I161025 07:00:27.044633 21067 storage/replica_proposal.go:292  [s3,r1/3:{/Min-"split"}] new range lease replica {3 3 3} 1970-01-01 00:00:02.7 +0000 UTC 10.8s following replica {2 2 2} 1970-01-01 00:00:00.9 +0000 UTC 1.8s [physicalTime=1970-01-01 00:00:12.6 +0000 UTC]
E161025 07:00:27.234029 21075 storage/node_liveness.go:141  [hb] failed liveness heartbeat: transaction commit result is ambiguous
I161025 07:00:27.386609 20970 storage/client_test.go:419  gossip network initialized
I161025 07:00:27.761160 21243 storage/replica_proposal.go:292  [s4,r2/3:{"split"-/Max}] new range lease replica {4 4 3} 1970-01-01 00:00:13.5 +0000 UTC 5.4s following replica {2 2 2} 1970-01-01 00:00:00.9 +0000 UTC 12.6s [physicalTime=1970-01-01 00:00:27 +0000 UTC]
I161025 07:00:27.792376 22528 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/client_test.go:506
I161025 07:00:27.792603 22527 util/stop/stopper.go:425  quiescing; tasks left:
9      storage/client_test.go:506
2      storage/replica_range_lease.go:150
1      storage/intent_resolver.go:316
I161025 07:00:27.802436 22526 util/stop/stopper.go:425  quiescing; tasks left:
8      storage/client_test.go:506
1      storage/replica_range_lease.go:150
1      storage/queue.go:614
1      storage/queue.go:467
1      storage/intent_resolver.go:381
I161025 07:00:27.844849 22526 util/stop/stopper.go:425  quiescing; tasks left:
7      storage/client_test.go:506
1      storage/replica_range_lease.go:150
1      storage/queue.go:614
1      storage/queue.go:467
1      storage/intent_resolver.go:381
I161025 07:00:27.845245 22526 util/stop/stopper.go:425  quiescing; tasks left:
6      storage/client_test.go:506
1      storage/replica_range_lease.go:150
1      storage/queue.go:614
1      storage/queue.go:467
1      storage/intent_resolver.go:381
I161025 07:00:27.858331 22526 util/stop/stopper.go:425  quiescing; tasks left:
4      storage/client_test.go:506
1      storage/replica_range_lease.go:150
1      storage/queue.go:614
1      storage/queue.go:467
1      storage/intent_resolver.go:381
I161025 07:00:27.858684 22526 util/stop/stopper.go:425  quiescing; tasks left:
4      storage/client_test.go:506
1      storage/queue.go:614
1      storage/queue.go:467
1      storage/intent_resolver.go:381
E161025 07:00:27.882063 21558 storage/queue.go:558  [replicate] on [n1,s1,r1/1:{/Min-"split"}]: [n1,s1,r1/1:{/Min-"split"}]: could not obtain lease: node unavailable; try another peer
I161025 07:00:27.883430 22527 util/stop/stopper.go:425  quiescing; tasks left:
8      storage/client_test.go:506
2      storage/replica_range_lease.go:150
1      storage/intent_resolver.go:316
E161025 07:00:27.884194 21075 storage/node_liveness.go:141  [hb] failed liveness heartbeat: transaction commit result is ambiguous
I161025 07:00:27.885079 22529 util/stop/stopper.go:425  quiescing; tasks left:
2      storage/client_test.go:506
1      storage/replica_range_lease.go:150
I161025 07:00:27.885220 22526 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/queue.go:467
1      storage/intent_resolver.go:381
1      storage/client_test.go:506
I161025 07:00:27.885336 22527 util/stop/stopper.go:425  quiescing; tasks left:
6      storage/client_test.go:506
1      storage/replica_range_lease.go:150
1      storage/intent_resolver.go:316
I161025 07:00:27.904724 22527 util/stop/stopper.go:425  quiescing; tasks left:
5      storage/client_test.go:506
1      storage/replica_range_lease.go:150
1      storage/intent_resolver.go:316
W161025 07:00:27.908162 22281 storage/intent_resolver.go:378  could not GC completed transaction anchored at /Local/Range/"split"/RangeDescriptor: failed to send RPC: sending to all 3 replicas failed; last error: node unavailable; try another peer
I161025 07:00:27.908462 22526 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/queue.go:467
1      storage/client_test.go:506
E161025 07:00:27.908856 21075 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161025 07:00:27.911705 22529 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/replica_range_lease.go:150
1      storage/client_test.go:506
I161025 07:00:27.912033 22529 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/client_test.go:506
W161025 07:00:27.925653 22289 storage/intent_resolver.go:313  [n2,s2,r1/2:{/Min-"split"}]: failed to push during intent resolution: failed to send RPC: sending to all 3 replicas failed; last error: node unavailable; try another peer
I161025 07:00:27.925950 22527 util/stop/stopper.go:425  quiescing; tasks left:
5      storage/client_test.go:506
1      storage/replica_range_lease.go:150
E161025 07:00:27.931982 21472 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: node unavailable; try another peer
E161025 07:00:27.932892 21472 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161025 07:00:27.935261 22527 util/stop/stopper.go:425  quiescing; tasks left:
4      storage/client_test.go:506
1      storage/replica_range_lease.go:150
E161025 07:00:27.937759 21098 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: node unavailable; try another peer
I161025 07:00:27.940387 22527 util/stop/stopper.go:425  quiescing; tasks left:
3      storage/client_test.go:506
1      storage/replica_range_lease.go:150
I161025 07:00:27.942966 22527 util/stop/stopper.go:425  quiescing; tasks left:
2      storage/client_test.go:506
1      storage/replica_range_lease.go:150
E161025 07:00:27.943153 21266 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: node unavailable; try another peer
E161025 07:00:27.944163 21004 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: node unavailable; try another peer
E161025 07:00:27.945248 21004 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161025 07:00:27.945707 22527 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/replica_range_lease.go:150
1      storage/client_test.go:506
I161025 07:00:27.945971 22527 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/client_test.go:506
W161025 07:00:27.946420 22103 storage/intent_resolver.go:101  [s1] asynchronous resolveIntents failed: failed to send RPC: sending to all 3 replicas failed; last error: node unavailable; try another peer
I161025 07:00:27.947446 22526 util/stop/stopper.go:425  quiescing; tasks left:
1      storage/queue.go:467
E161025 07:00:27.968215 21266 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161025 07:00:27.968453 21022 storage/queue.go:558  [replicaGC] on [n1,s1,r2/1:{"split"-/Max}]: failed to send RPC: sending to all 3 replicas failed; last error: range 1: replica {1 1 1} not lease holder; <nil> is
I161025 07:00:27.968736 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
E161025 07:00:27.969749 21098 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161025 07:00:27.971575 21004 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161025 07:00:27.977287 21007 storage/node_liveness.go:141  [hb] failed liveness heartbeat: failed to send RPC: sending to all 3 replicas failed; last error: node unavailable; try another peer
E161025 07:00:27.978663 21007 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
E161025 07:00:27.999359 21472 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161025 07:00:28.002023 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
W161025 07:00:28.017954 22419 storage/store.go:2897  [s5] got error from range 2, replica {2 2 2}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:5 StoreID:5 ReplicaID:4}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}
W161025 07:00:28.018330 22419 storage/store.go:2897  [s5] got error from range 2, replica {2 2 2}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:5 StoreID:5 ReplicaID:4}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}
W161025 07:00:28.020196 21768 storage/store.go:2897  [s3] got error from range 0, replica {1 1 0}: storage/raft_transport.go:249: unable to accept Raft message from {NodeID:3 StoreID:3 ReplicaID:0}: no handler registered for {NodeID:1 StoreID:1 ReplicaID:0}
I161025 07:00:28.022101 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
W161025 07:00:28.032963 21901 storage/raft_transport.go:463  no handler found for store 3 in response range_id:0 from_replica:<node_id:2 store_id:2 replica_id:0 > to_replica:<node_id:3 store_id:3 replica_id:0 > union:<error:<message:"storage/raft_transport.go:249: unable to accept Raft message from {NodeID:3 StoreID:3 ReplicaID:0}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:0}" transaction_restart:NONE origin_node:0 now:<wall_time:0 logical:0 > > > 
E161025 07:00:28.042742 21266 storage/node_liveness.go:141  [hb] failed liveness heartbeat: node unavailable; try another peer
I161025 07:00:28.043794 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
W161025 07:00:28.046586 22379 storage/raft_transport.go:463  no handler found for store 4 in response range_id:2 from_replica:<node_id:2 store_id:2 replica_id:2 > to_replica:<node_id:4 store_id:4 replica_id:3 > union:<error:<message:"storage/raft_transport.go:249: unable to accept Raft message from {NodeID:4 StoreID:4 ReplicaID:3}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}" transaction_restart:NONE origin_node:0 now:<wall_time:0 logical:0 > > > 
W161025 07:00:28.046822 22379 storage/raft_transport.go:463  no handler found for store 4 in response range_id:2 from_replica:<node_id:2 store_id:2 replica_id:2 > to_replica:<node_id:4 store_id:4 replica_id:3 > union:<error:<message:"storage/raft_transport.go:249: unable to accept Raft message from {NodeID:4 StoreID:4 ReplicaID:3}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}" transaction_restart:NONE origin_node:0 now:<wall_time:0 logical:0 > > > 
W161025 07:00:28.047029 22379 storage/raft_transport.go:463  no handler found for store 4 in response range_id:2 from_replica:<node_id:2 store_id:2 replica_id:2 > to_replica:<node_id:4 store_id:4 replica_id:3 > union:<error:<message:"storage/raft_transport.go:249: unable to accept Raft message from {NodeID:4 StoreID:4 ReplicaID:3}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}" transaction_restart:NONE origin_node:0 now:<wall_time:0 logical:0 > > > 
W161025 07:00:28.047230 22379 storage/raft_transport.go:463  no handler found for store 4 in response range_id:2 from_replica:<node_id:2 store_id:2 replica_id:2 > to_replica:<node_id:4 store_id:4 replica_id:3 > union:<error:<message:"storage/raft_transport.go:249: unable to accept Raft message from {NodeID:4 StoreID:4 ReplicaID:3}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}" transaction_restart:NONE origin_node:0 now:<wall_time:0 logical:0 > > > 
W161025 07:00:28.047429 22379 storage/raft_transport.go:463  no handler found for store 4 in response range_id:2 from_replica:<node_id:2 store_id:2 replica_id:2 > to_replica:<node_id:4 store_id:4 replica_id:3 > union:<error:<message:"storage/raft_transport.go:249: unable to accept Raft message from {NodeID:4 StoreID:4 ReplicaID:3}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}" transaction_restart:NONE origin_node:0 now:<wall_time:0 logical:0 > > > 
W161025 07:00:28.047652 22379 storage/raft_transport.go:463  no handler found for store 4 in response range_id:2 from_replica:<node_id:2 store_id:2 replica_id:2 > to_replica:<node_id:4 store_id:4 replica_id:3 > union:<error:<message:"storage/raft_transport.go:249: unable to accept Raft message from {NodeID:4 StoreID:4 ReplicaID:3}: no handler registered for {NodeID:2 StoreID:2 ReplicaID:2}" transaction_restart:NONE origin_node:0 now:<wall_time:0 logical:0 > > > 
I161025 07:00:28.049444 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161025 07:00:28.051472 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161025 07:00:28.053885 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161025 07:00:28.054825 21009 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:47909->127.0.0.1:52355: use of closed network connection
I161025 07:00:28.055190 21306 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:40370->127.0.0.1:54737: use of closed network connection
I161025 07:00:28.055526 21181 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:60110->127.0.0.1:49491: use of closed network connection
I161025 07:00:28.055861 21063 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:36810->127.0.0.1:36186: use of closed network connection
I161025 07:00:28.056569 20558 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161025 07:00:28.056901 21285 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161025 07:00:28.057225 21211 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
W161025 07:00:28.057590 22191 storage/raft_transport.go:428  raft transport stream to node 5 failed: EOF
W161025 07:00:28.057802 22193 storage/raft_transport.go:428  raft transport stream to node 1 failed: EOF
W161025 07:00:28.058006 22057 storage/raft_transport.go:428  raft transport stream to node 1 failed: EOF
I161025 07:00:28.058773 20987 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:37344->127.0.0.1:42248: use of closed network connection
I161025 07:00:28.059001 21011 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
W161025 07:00:28.059428 22054 storage/raft_transport.go:428  raft transport stream to node 4 failed: EOF
I161025 07:00:28.059660 21213 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:60110: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:60110" <nil>}
I161025 07:00:28.059826 21213 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161025 07:00:28.060056 21013 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:37344: getsockopt: connection refused"; Reconnecting to {"127.0.0.1:37344" <nil>}
I161025 07:00:28.060249 20945 http2_server.go:276  transport: http2Server.HandleStreams failed to read frame: read tcp 127.0.0.1:35894->127.0.0.1:38481: use of closed network connection
I161025 07:00:28.060887 21130 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken EOF.
I161025 07:00:28.061184 21132 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:36810: operation was canceled"; Reconnecting to {"127.0.0.1:36810" <nil>}
I161025 07:00:28.061282 21083 http2_client.go:1053  transport: http2Client.notifyError got notified that the client transport was broken read tcp 127.0.0.1:38481->127.0.0.1:35894: read: connection reset by peer.
I161025 07:00:28.061954 21287 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:40370: operation was canceled"; Reconnecting to {"127.0.0.1:40370" <nil>}
I161025 07:00:28.062112 21287 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161025 07:00:28.062248 20560 /go/src/google.golang.org/grpc/clientconn.go:667  grpc: addrConn.resetTransport failed to create client transport: connection error: desc = "transport: dial tcp 127.0.0.1:47909: operation was canceled"; Reconnecting to {"127.0.0.1:47909" <nil>}
I161025 07:00:28.062402 20560 /go/src/google.golang.org/grpc/clientconn.go:767  grpc: addrConn.transportMonitor exits due to: grpc: the connection is closing
I161025 07:00:28.062547 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161025 07:00:28.062917 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161025 07:00:28.063290 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161025 07:00:28.063606 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161025 07:00:28.063998 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
I161025 07:00:28.064325 22525 util/stop/stopper.go:353  stop has been called, stopping or quiescing all running tasks
    client_raft_test.go:1903: range 1: replica {3 3 3} not lease holder; node_id:3 store_id:3 replica_id:3  is


tamird · 2016-10-26T20:26:04Z

#10156.

tamird · 2016-10-26T20:26:24Z

Actually, this is not #10156.

Rather than the somewhat complicated rebalancing scenario, use a simple scenario in which we perform up-replication of range 1 from 1 to 3 nodes. We check that this up-replication is performed using preemptive snapshots. The more complicated scenario was very fragile, frequently being broken by innocuous changes. Fixes cockroachdb#10497 Fixes cockroachdb#10193 Fixes cockroachdb#10156 Fixes cockroachdb#9395

petermattis · 2016-11-07T22:32:28Z

This error message is certainly confusing. It is being generated by Replica.processRaftCommand() when the isLeaseError() condition is true. In particular, this occurs when the current lease owner is the same as when the raft command was proposed, but the lease is no longer valid (i.e. raftCmd.Timestamp >= lease.StartStasis).

We see this on every node processing the raft command. That's good. Here is the log message I added:

    isLeaseError := func() bool {
        l, origin := r.mu.state.Lease, raftCmd.OriginReplica
        if l.Replica != origin && !raftCmd.IsLeaseRequest {
            return true
        }
        notCovered := !l.OwnedBy(origin.StoreID) || !l.Covers(raftCmd.Timestamp)
        if notCovered && !raftCmd.IsFreeze && !raftCmd.IsLeaseRequest {
            // Verify the range lease is held, unless this command is trying
            // to obtain it or is a freeze change (which can be proposed by any
            // Replica). Any other Raft command has had the range lease held
            // by the replica at proposal time, but this may no longer be the
            // case. Corruption aside, the most likely reason is a lease
            // change (the most recent lease holder assumes responsibility for all
            // past timestamps as well). In that case, it's not valid to go
            // ahead with the execution: Writes must be aware of the last time
            // the mutated key was read, and since reads are served locally by
            // the lease holder without going through Raft, a read which was
            // not taken into account may have been served. Hence, we must
            // retry at the current lease holder.
            log.Infof(ctx, "%x: isLeaseError: leaseOwner=%d leaseStartStasis=%s origin.StoreID=%d raftCmd.Timestamp=%s: %s",
                idKey, l.Replica.StoreID, l.StartStasis, origin.StoreID, raftCmd.Timestamp, raftCmd.Cmd.Summary())
            return true
        }
        return false
    }

And the output:

I161107 17:26:16.479335 226 storage/replica.go:2943  [s3,r1/3:{/Min-"split"}] 14b2273ba43d1c6a: isLeaseError: leaseOwner=2 leaseStartStasis=4.500000127,13 origin.StoreID=2 raftCmd.Timestamp=7.200000131,2: 1 TransferLease

@andreimatei Can you take a look at this? It seems like we proposed a raft command when the lease wasn't valid. Or perhaps we proposed it but later transferred the lease, making the in-flight proposal invalid. Somewhat curiously, raftCmd.IsLeaseRequest is false, but the log message is showing that we have a TransferLease operation. The problem is reproducible on 0b4aad8 with the following patch:

diff --git a/pkg/storage/client_raft_test.go b/pkg/storage/client_raft_test.go
index cee381c..e8e8c89 100644
--- a/pkg/storage/client_raft_test.go
+++ b/pkg/storage/client_raft_test.go
@@ -2054,6 +2054,7 @@ func TestStoreRangeRebalance(t *testing.T) {

        mtc.Start(t, 6)
        defer mtc.Stop()
+       stopNodeLivenessHeartbeats(mtc)

        splitKey := roachpb.Key("split")
        splitArgs := adminSplitArgs(roachpb.KeyMin, splitKey)

And the stress invocation:

make stress PKG=./storage/ TESTS=TestStoreRangeRebalance TESTFLAGS="--vmodule=replica=4" STRESSFLAGS="-maxfails 1 -stderr -p 16"

andreimatei · 2016-11-07T22:47:07Z

Will look. Just to verify - you're saying that the problem is reproducible both before and after my main change in #10420, right?

petermattis · 2016-11-08T01:17:02Z

No, this isn't reproducible after #10420 (specifically, not after 3d508a1) because it seems to be masked by another problem. Or perhaps 3d508a1 does fix the problem. I'm not sure.

The question is whether this error (replica {3 3 3} not lease holder; node_id:3 store_id:3 replica_id:3 is) is indicative of a real problem, just a confusing error message or an artifact of something the test is doing. Note that this test (TestStoreRangeRebalance) is the only place that multiTestContext.transferLease() is used. I can easily believe that test or that method is doing something funky.

petermattis · 2016-11-08T01:30:34Z

I added some more logging and I can see that we're proposing the TransferLease when the current lease is invalid:

I161107 20:27:26.501065 836 storage/replica.go:2054  [s2,r1/2:{/Min-"split"}] proposing command 15c61e6c9e1ffdcc at 10.800000135,7 (lease=4.500000127,22): 1 TransferLease

10.800000135,7 is the command's timestamp. 4.500000127,22 is the lease's start stasis timestamp (the last time the lease is valid).
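
To make the comparison in that log line concrete, here is a minimal sketch of the coverage check, assuming a simplified hybrid timestamp (wall time in nanoseconds plus a logical counter). The `Timestamp` and `Lease` types below are hypothetical stand-ins written for illustration, not the actual CockroachDB types; only the rule "a command at or after the start of stasis is not covered" is taken from the discussion above.

```go
package main

import "fmt"

// Timestamp is a simplified, hypothetical hybrid timestamp: wall time in
// nanoseconds plus a logical counter used to break ties.
type Timestamp struct {
	WallTime int64
	Logical  int32
}

// Less reports whether t orders strictly before o.
func (t Timestamp) Less(o Timestamp) bool {
	return t.WallTime < o.WallTime ||
		(t.WallTime == o.WallTime && t.Logical < o.Logical)
}

// Lease is a reduced, hypothetical lease record holding only the fields
// needed for this check.
type Lease struct {
	StoreID     int
	StartStasis Timestamp
}

// Covers reports whether ts falls strictly before the start of stasis,
// i.e. whether a command at ts may be executed under this lease.
func (l Lease) Covers(ts Timestamp) bool {
	return ts.Less(l.StartStasis)
}

func main() {
	// Values from the log line: lease=4.500000127,22 and a command
	// proposed at 10.800000135,7.
	lease := Lease{StoreID: 2, StartStasis: Timestamp{WallTime: 4500000127, Logical: 22}}
	cmd := Timestamp{WallTime: 10800000135, Logical: 7}
	fmt.Println("lease covers command:", lease.Covers(cmd))
}
```

With the logged values the command timestamp is well past the start of stasis, so the check reports that the lease does not cover the command.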

petermattis · 2016-11-08T02:54:52Z

I think the problem might be that multiTestContext.transferLease() calls Replica.AdminTransferLease directly without first making sure that the receiver holds the lease.

It looks possible to fix multiTestContext.transferLease(), but given the only usage is from TestStoreRangeRebalance and that test is flaky for other reasons, I'd prefer to take the approach of #10515, get rid of that usage of transferLease() and remove that method.

Rather than the somewhat complicated rebalancing scenario, use a simple scenario in which we perform up-replication of range 1 from 1 to 3 nodes. We check that this up-replication is performed using preemptive snapshots. The more complicated scenario was very fragile, frequently being broken by innocuous changes. Fixes cockroachdb#10193 Fixes cockroachdb#10156 Fixes cockroachdb#9395

andreimatei · 2016-11-08T03:17:09Z

Right, I was just typing the same. mtc.TransferLease is calling AdminTransferLease() directly on a replica, namely the replica it believes to be the current lease holder. This belief is based on rather weak indications - the mtc looks at the view another replica has of the lease, without checking if that lease is valid. There are two things wrong: a) that view of a lease might be stale and b) even if it's not stale when checked, the lease might move or expire by the time we send the command (this test uses a real clock and the timestamp used when sending the command can be higher than when we looked at the lease).
But I'm still unsure about the specifics of how that observed lease gets to be wrong or expired exactly... I think a race with a split has something to do with it... And also not sure how #10420 fixes it. Looking more.
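
The two failure modes described above, (a) a stale view and (b) expiry between the check and the send, can be sketched as follows. This is only an illustration under stated assumptions: the `lease` type, the `transferLease` function, and the integer timestamps are hypothetical and do not mirror the real multiTestContext or Replica code. The point is that the validity check uses the timestamp the command will actually carry, rather than an earlier observation.

```go
package main

import (
	"errors"
	"fmt"
)

// lease is a hypothetical, reduced lease record: who holds it and the first
// timestamp at which it is no longer valid.
type lease struct {
	holder     int
	expiration int64
}

var errNotLeaseHolder = errors.New("sender does not hold a valid lease")

// transferLease re-validates the lease at the timestamp the command will
// carry, instead of trusting an earlier observation of the lease.
func transferLease(current lease, sender, target int, cmdTS int64) (lease, error) {
	if current.holder != sender {
		return current, errNotLeaseHolder // (a) the observed view was stale
	}
	if cmdTS >= current.expiration {
		return current, errNotLeaseHolder // (b) the lease expired before sending
	}
	// 900 is an arbitrary illustrative lease duration.
	return lease{holder: target, expiration: cmdTS + 900}, nil
}

func main() {
	l := lease{holder: 2, expiration: 4500}

	// The lease looked fine when observed earlier, but the command is only
	// sent at t=10800, after the lease has expired.
	_, err := transferLease(l, 2, 3, 10800)
	fmt.Println("late transfer:", err)

	// Sent while the lease is still valid, the transfer goes through.
	next, err := transferLease(l, 2, 3, 4000)
	fmt.Println("timely transfer:", next.holder, err)
}
```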

andreimatei · 2016-11-08T04:30:01Z

Well I bet that this line has something to do with the flakiness:
https://github.com/cockroachdb/cockroach/blob/b30142d/pkg/storage/client_test.go#L497
The multiTestContextKVTransport advances the clock whenever it gets a NotLeaseHolderError without a known lease holder. Lol; does this seem sane?

                // stores has the range, is *not* the lease holder, but the
                // lease holder is not known; this can happen if the lease
                // holder is removed from the group. Move the manual clock
                // forward in an attempt to expire the lease.
                t.mtc.expireLeases()

Does anybody know why we need such things in this transport? What would happen if we didn't have any of this code around here that advances the clock?

bdarnell · 2016-11-08T05:14:24Z

In multiTestContext, if a node becomes unavailable or is removed from the group while holding the lease, no other node can acquire the lease until we advance the clock. This is a crude attempt to ensure that we never become stuck. However, it's probably better to move the clock advancement from the transport error handling to the points where the nodes become unavailable. stopStore and unreplicateRange are the main ones; I don't know if there are others. Many tests that call these methods already do their own lease expiration.
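
A rough sketch of that suggestion, assuming a manual clock shared by all stores: the clock is advanced at the point where the lease holder is stopped, not in the transport's error path. The `testCluster` type and its methods are hypothetical and written only for illustration; they are not the real multiTestContext API.

```go
package main

import "fmt"

// leaseDuration is an arbitrary illustrative value, in clock ticks.
const leaseDuration = 900

// testCluster is a hypothetical stand-in for a test harness with a shared
// manual clock, a single lease, and a set of stopped stores.
type testCluster struct {
	now             int64
	leaseHolder     int
	leaseExpiration int64
	stopped         map[int]bool
}

// acquireLease succeeds only for a running store, and only if that store
// already holds the lease or the existing lease has expired.
func (c *testCluster) acquireLease(store int) bool {
	if c.stopped[store] {
		return false
	}
	if c.leaseHolder != store && c.now < c.leaseExpiration {
		return false
	}
	c.leaseHolder, c.leaseExpiration = store, c.now+leaseDuration
	return true
}

// stopStore marks the store unavailable and, if it held a live lease,
// advances the manual clock to the expiration right here, rather than
// waiting for a transport error to trigger the advance.
func (c *testCluster) stopStore(store int) {
	c.stopped[store] = true
	if c.leaseHolder == store && c.now < c.leaseExpiration {
		c.now = c.leaseExpiration
	}
}

func main() {
	c := &testCluster{stopped: map[int]bool{}}
	fmt.Println(c.acquireLease(1)) // store 1 takes the initial lease
	fmt.Println(c.acquireLease(2)) // blocked while store 1's lease is live
	c.stopStore(1)
	fmt.Println(c.acquireLease(2)) // store 1 is stopped and its lease has expired
}
```

Tying the clock advance to the event that makes the lease holder unavailable keeps the time jump deterministic and visible in the test, instead of having it happen as a side effect of a failed RPC.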

Add the check that preemptive snapshots are being used to TestStoreRangeUpReplicate. Add TestReplicateQueueRebalance for testing that basic rebalancing is working. Fixes cockroachdb#10193 Fixes cockroachdb#10156 Fixes cockroachdb#9395

cockroach-teamcity added the O-robot and C-test-failure labels Oct 25, 2016

tamird closed this as completed Oct 26, 2016

tamird reopened this Oct 26, 2016

tamird assigned BramGruneir Oct 26, 2016

This was referenced Oct 27, 2016

: TestStoreRangeRebalance failed under stress #10256

Closed

teamcity: failed tests on master: testrace/TestStoreRangeRebalance #9694

Closed

jordanlewis mentioned this issue Nov 2, 2016

: TestStoreRangeRebalance failed under stress #10386

Closed

tamird mentioned this issue Nov 7, 2016

storage: add Allocator.TransferLease{Source,Target} #10464

Merged

petermattis assigned petermattis and unassigned BramGruneir Nov 7, 2016

petermattis mentioned this issue Nov 7, 2016

: TestStoreRangeRebalance failed under stress #10497

Closed

petermattis mentioned this issue Nov 7, 2016

storage: rewrite TestStoreRangeRebalance #10515

Merged

petermattis closed this as completed in #10515 Nov 8, 2016

petermattis mentioned this issue Dec 15, 2016

storage: deflake TestRefreshPendingCommands #12425

Merged
