Skip to content
This repository has been archived by the owner on May 10, 2022. It is now read-only.

bug: client can't recover from a replica-server failure. #53

Open
neverchanje opened this issue Sep 18, 2019 · 1 comment
Open

bug: client can't recover from a replica-server failure. #53

neverchanje opened this issue Sep 18, 2019 · 1 comment
Labels
bug Something isn't working

Comments

@neverchanje
Copy link

neverchanje commented Sep 18, 2019

  • 2019/9/11 17:00. Our SRE stopped one instance of replica-server in our staging environment
    trying to simulate the problem java-client can't recover.

  • 2019/9/11 17:00. Some of our clients recovered right away while replica-server restarted, but some couldn't reconnect and kept retrying with ERR_TIMEOUT error.

2019-09-11 17:12:48,659 ERROR [nioEventLoopGroup-4-1] [com.xiaomi.message.mixin.fsm.ruleset.SaveAckRuleSet$3$1.operationComplete(SaveAckRuleSet.java:169)] - SaveB2CAckInfo fail, appId=2882303761517479657 msgId=sms-1c33f956-7dad-4327-8986-aa400cb4a8b2, ex:
com.xiaomi.infra.pegasus.client.PException: com.xiaomi.infra.pegasus.rpc.ReplicationException: ERR_TIMEOUT
	at com.xiaomi.infra.pegasus.client.PegasusTable$8.onCompletion(PegasusTable.java:376)
	at com.xiaomi.infra.pegasus.rpc.async.ClientRequestRound.thisRoundCompletion(ClientRequestRound.java:51)
	at com.xiaomi.infra.pegasus.rpc.async.TableHandler.tryDelayCall(TableHandler.java:314)
	at com.xiaomi.infra.pegasus.rpc.async.TableHandler.onRpcReply(TableHandler.java:295)
	at com.xiaomi.infra.pegasus.rpc.async.TableHandler$3.run(TableHandler.java:326)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.tryNotifyWithSequenceID(ReplicaSession.java:226)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession.access$300(ReplicaSession.java:32)
	at com.xiaomi.infra.pegasus.rpc.async.ReplicaSession$4.run(ReplicaSession.java:270)
	at io.netty.util.concurrent.PromiseTask$RunnableAdapter.call(PromiseTask.java:38)
	at io.netty.util.concurrent.ScheduledFutureTask.run(ScheduledFutureTask.java:123)
	at io.netty.util.concurrent.SingleThreadEventExecutor.runAllTasks(SingleThreadEventExecutor.java:380)
	at io.netty.channel.nio.NioEventLoop.run(NioEventLoop.java:357)
	at io.netty.util.concurrent.SingleThreadEventExecutor$2.run(SingleThreadEventExecutor.java:116)
	at io.netty.util.concurrent.DefaultThreadFactory$DefaultRunnableDecorator.run(DefaultThreadFactory.java:137)
	at java.lang.Thread.run(Thread.java:748)
Caused by: com.xiaomi.infra.pegasus.rpc.ReplicationException: ERR_TIMEOUT
	... 15 more
  • 2019-09-11 17:12:48, the errors remained, we stopped the test.

Client Version

1.11.5-thrift-0.11.0-inlined-release

@neverchanje neverchanje added the bug Something isn't working label Sep 18, 2019
@neverchanje
Copy link
Author

neverchanje commented Sep 20, 2019

2019-09-11 17:01:18,424 WARN ReplicaSession.tryNotifyWithSequenceID: actively close the session because it's not responding for 10 seconds
...
2019-09-11 17:12:39,607 WARN ReplicaSession.tryNotifyWithSequenceID: actively close the session because it's not responding for 10 seconds

Bugfix #32 was working in the situation, but the session didn't close. The last ERR_SESSION_RESET is at 17:01.

2019-09-11 17:01:06,910 WARN TableHandler.onRpcReply: replica server(rpc_address(10.38.161.207:32801)) doesn't serve gpid(gpid(13.3)), operator(com.xiaomi.infra.pegasus.operator.rrdb_multi_put_operator@41e12d6f), try(3), error_code(ERR_SESSION_RESET), need query meta

The root cause is that the connection was not even established when it tried to close the session.

2019-09-11 17:12:19,575 WARN ReplicaSession.tryNotifyWithSequenceID: actively close the session because it's not responding for 10 seconds
2019-09-11 17:12:19,575 INFO ReplicaSession.closeSession: channel rpc_address(10.38.161.207:32801) not connected, skip the close

The client repeatedly tried to reconnect to this server, but it didn't succeed.

2019-09-11 17:01:05,830 WARN ReplicaSession$2.operationComplete(ReplicaSession.java:153) - rpc_address(10.38.161.207:32801): try to connect to target failed
...
2019-09-11 17:01:06,909 WARN ReplicaSession$2.operationComplete(ReplicaSession.java:153) - rpc_address(10.38.161.207:32801): try to connect to target failed

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

1 participant