This repository has been archived by the owner on May 10, 2022. It is now read-only.

Query meta when a number of ERR_TIMEOUT errors occur without an ERR_SESSION_RESET #25

Closed
neverchanje opened this issue Dec 21, 2018 · 0 comments · Fixed by #32
Labels
bug Something isn't working

Comments

neverchanje commented Dec 21, 2018

Reproduction

  1. All nodes are healthy at start.
>>> nodes -d
address               status              replica_count       primary_count       secondary_count     
172.21.0.21:34801     ALIVE               3                   1                   2                   
172.21.0.22:34801     ALIVE               5                   2                   3                   
172.21.0.23:34801     ALIVE               4                   2                   2                   
172.21.0.24:34801     ALIVE               5                   1                   4                   
172.21.0.25:34801     ALIVE               5                   2                   3                 
  1. Run ycsb.
./bin/ycsb load pegasus -s -P workloads/workload_pegasus -p "pegasus.config=file://./pegasus/conf/pegasus.properties" > outputLoad.txt
  1. Partition replica1 (172.21.0.21) from the rest of the nodes and from the ycsb client (100% packet loss).
docker run -it --rm -v /var/run/docker.sock:/var/run/docker.sock gaiaadm/pumba netem --duration 1h --tc-image gaiadocker/iproute2 loss --percent 100 pegasus_replica1_1
  1. After a while, this node becomes unhealthy and is kicked out by meta.
>>> nodes -d
address               status              replica_count       primary_count       secondary_count     
172.21.0.21:34801     UNALIVE             0                   0                   0                   
172.21.0.22:34801     ALIVE               5                   2                   3                   
172.21.0.23:34801     ALIVE               4                   2                   2                   
172.21.0.24:34801     ALIVE               5                   2                   3                   
172.21.0.25:34801     ALIVE               5                   2                   3                   

total_node_count   : 5
alive_node_count   : 4
unalive_node_count : 1
  1. However, for a long time the Java client remains unaware of the failover. It keeps retrying until TCP's maximum retransmission time (usually about 15 minutes) is reached and it finally gets an ERR_SESSION_RESET error, which tells the client to fetch the latest route table from meta.
2019-01-16 15:40:10:104 20 sec: 50889 operations; 1531.1 current ops/sec; est completion in 10 hours 54 minutes [INSERT: Count=15308, Max=3016703, Min=192, Avg=600.1, 90=552, 99=658, 99.9=4387, 99.99=254079] 
Retrying insertion, retry count: 1
2019-01-16 15:40:20:104 30 sec: 50892 operations; 0.3 current ops/sec; est completion in 16 hours 21 minutes [INSERT: Count=3, Max=3026943, Min=1519, Avg=1009738.67, 90=3026943, 99=3026943, 99.9=3026943, 99.99=3026943] [INSERT-FAILED: Count=1, Max=5009407, Min=5005312, Avg=5007360, 90=5009407, 99=5009407, 99.9=5009407, 99.99=5009407] 
Retrying insertion, retry count: 2

...

2019-01-16 15:55:40:104 950 sec: 50892 operations; 0 current ops/sec; est completion in 21 days 14 hours [INSERT: Count=0, Max=0, Min=9223372036854775807, Avg=NaN, 90=0, 99=0, 99.9=0, 99.99=0] [INSERT-FAILED: Count=1, Max=5001215, Min=4997120, Avg=4999168, 90=5001215, 99=5001215, 99.9=5001215, 99.99=5001215] 
Retrying insertion, retry count: 117
2019-01-16 15:55:50:103 960 sec: 57665 operations; 677.3 current ops/sec; est completion in 19 days 6 hours [INSERT: Count=6773, Max=1847295, Min=194, Avg=557.27, 90=333, 99=905, 99.9=3303, 99.99=17455] [INSERT-FAILED: Count=1, Max=5001215, Min=4997120, Avg=4999168, 90=5001215, 99=5001215, 99.9=5001215, 99.99=5001215] 
2019-01-16 15:56:00:103 970 sec: 78438 operations; 2077.3 current ops/sec; est completion in 14 days 7 hours [INSERT: Count=20773, Max=59487, Min=205, Avg=477.57, 90=580, 99=1158, 99.9=3153, 99.99=14071] [INSERT-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=NaN, 90=0, 99=0, 99.9=0, 99.99=0] 
2019-01-16 15:56:10:103 980 sec: 99983 operations; 2154.5 current ops/sec; est completion in 11 days 7 hours [INSERT: Count=21545, Max=13647, Min=352, Avg=460.73, 90=509, 99=917, 99.9=2807, 99.99=6699] [INSERT-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=NaN, 90=0, 99=0, 99.9=0, 99.99=0] 
2019-01-16 15:56:20:103 990 sec: 126495 operations; 2651.2 current ops/sec; est completion in 9 days 1 hours [INSERT: Count=26512, Max=30943, Min=250, Avg=374.47, 90=519, 99=777, 99.9=3103, 99.99=8051] [INSERT-FAILED: Count=0, Max=0, Min=9223372036854775807, Avg=NaN, 90=0, 99=0, 99.9=0, 99.99=0] 
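
The behavior asked for in the title could be sketched roughly as below: count consecutive ERR_TIMEOUT replies per replica session, and once the count reaches a threshold, ask the caller to re-query meta instead of waiting for ERR_SESSION_RESET. This is only an illustration. `TimeoutTracker`, `onReply`, and the threshold value are hypothetical names invented here and are not taken from the Java client or from the fix in #32; error codes are passed as plain strings to keep the sketch self-contained.

```java
/**
 * Hypothetical helper (not part of the Pegasus Java client): tracks
 * consecutive ERR_TIMEOUT replies for a single replica session and
 * reports when the route table should be refreshed from meta.
 */
class TimeoutTracker {
    private final int threshold;      // assumed tunable; no value is given in the issue
    private int consecutiveTimeouts;  // guarded by "this"

    TimeoutTracker(int threshold) {
        if (threshold <= 0) {
            throw new IllegalArgumentException("threshold must be positive");
        }
        this.threshold = threshold;
    }

    /**
     * Records the error code of one reply.
     *
     * @return true if the caller should query meta for the latest route table
     */
    synchronized boolean onReply(String errorCode) {
        switch (errorCode) {
            case "ERR_OK":
                // A successful reply proves the replica is reachable.
                consecutiveTimeouts = 0;
                return false;
            case "ERR_SESSION_RESET":
                // Existing behavior described above: always refresh the route table.
                consecutiveTimeouts = 0;
                return true;
            case "ERR_TIMEOUT":
                consecutiveTimeouts++;
                if (consecutiveTimeouts >= threshold) {
                    // Too many timeouts in a row: suspect a failover and refresh.
                    consecutiveTimeouts = 0;
                    return true;
                }
                return false;
            default:
                // Other errors neither count as timeouts nor reset the counter.
                return false;
        }
    }
}
```

With a threshold of 3, for example, the third consecutive ERR_TIMEOUT would trigger a meta query, so the client would notice the failover after a few request timeouts rather than after the roughly 15 minutes seen in the log above. Resetting the counter on success keeps occasional, isolated timeouts from causing unnecessary meta queries.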