Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add ping/pong heartbeating to WSClient, and fix concurrent map on config #420

Merged
merged 5 commits into from
Jan 21, 2022

Conversation

peterbroadhurst
Copy link
Contributor

@peterbroadhurst peterbroadhurst commented Jan 20, 2022

Primary fix in this PR, is to add a heartbeat to the WSClient connections, to allow quicker detection of a case where the websocket between FireFly Core and a connector (EthConnect, FabConnect, DX) fails silently.

The default is 30 seconds between sending ping packets, with a 30 second timeout.

The feature can be disabled by setting the heartbeatInterval to zero on a given websocket client configuration.

Example log output showing operation of the heartbeating:

$ docker-compose logs -f firefly_core_0 | grep WS
firefly_core_0_1  | [2022-01-20T03:18:14.047Z]  INFO WS ws://ethconnect_0:8080/ws connected pid=1
firefly_core_0_1  | [2022-01-20T03:18:14.092Z]  INFO WS ws://dataexchange_0:3000 connected pid=1
firefly_core_0_1  | [2022-01-20T03:18:14.101Z]  INFO WS ws://tokens_0:3000/api/ws connected pid=1
firefly_core_0_1  | [2022-01-20T03:18:44.014Z] DEBUG WS ws://ethconnect_0:8080/ws heartbeat timer popped (ping) after 30000.57ms pid=1
firefly_core_0_1  | [2022-01-20T03:18:44.015Z] DEBUG WS ws://ethconnect_0:8080/ws heartbeat completed (pong) after 0.51ms pid=1
firefly_core_0_1  | [2022-01-20T03:18:44.059Z] DEBUG WS ws://dataexchange_0:3000 heartbeat timer popped (ping) after 30000.65ms pid=1
firefly_core_0_1  | [2022-01-20T03:18:44.062Z] DEBUG WS ws://dataexchange_0:3000 heartbeat completed (pong) after 2.76ms pid=1
firefly_core_0_1  | [2022-01-20T03:18:44.068Z] DEBUG WS ws://tokens_0:3000/api/ws heartbeat timer popped (ping) after 30001.13ms pid=1
firefly_core_0_1  | [2022-01-20T03:18:44.070Z] DEBUG WS ws://tokens_0:3000/api/ws heartbeat completed (pong) after 1.25ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:13.982Z] DEBUG WS ws://ethconnect_0:8080/ws heartbeat timer popped (ping) after 30001.21ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:13.983Z] DEBUG WS ws://ethconnect_0:8080/ws heartbeat completed (pong) after 0.92ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:14.029Z] DEBUG WS ws://dataexchange_0:3000 heartbeat timer popped (ping) after 30001.20ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:14.031Z] DEBUG WS ws://dataexchange_0:3000 heartbeat completed (pong) after 1.04ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:14.037Z] DEBUG WS ws://tokens_0:3000/api/ws heartbeat timer popped (ping) after 30001.35ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:14.039Z] DEBUG WS ws://tokens_0:3000/api/ws heartbeat completed (pong) after 1.23ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:43.948Z] DEBUG WS ws://ethconnect_0:8080/ws heartbeat timer popped (ping) after 29999.30ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:43.949Z] DEBUG WS ws://ethconnect_0:8080/ws heartbeat completed (pong) after 0.65ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:43.998Z] DEBUG WS ws://dataexchange_0:3000 heartbeat timer popped (ping) after 30001.06ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:44.001Z] DEBUG WS ws://dataexchange_0:3000 heartbeat completed (pong) after 1.99ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:44.006Z] DEBUG WS ws://tokens_0:3000/api/ws heartbeat timer popped (ping) after 30001.08ms pid=1
firefly_core_0_1  | [2022-01-20T03:19:44.007Z] DEBUG WS ws://tokens_0:3000/api/ws heartbeat completed (pong) after 0.72ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:13.916Z] DEBUG WS ws://ethconnect_0:8080/ws heartbeat timer popped (ping) after 30000.55ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:13.917Z] DEBUG WS ws://ethconnect_0:8080/ws heartbeat completed (pong) after 0.63ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:13.967Z] DEBUG WS ws://dataexchange_0:3000 heartbeat timer popped (ping) after 29999.39ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:13.970Z] DEBUG WS ws://dataexchange_0:3000 heartbeat completed (pong) after 0.83ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:13.974Z] DEBUG WS ws://tokens_0:3000/api/ws heartbeat timer popped (ping) after 30001.31ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:13.975Z] DEBUG WS ws://tokens_0:3000/api/ws heartbeat completed (pong) after 0.79ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:43.884Z] DEBUG WS ws://ethconnect_0:8080/ws heartbeat timer popped (ping) after 30001.25ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:43.886Z] DEBUG WS ws://ethconnect_0:8080/ws heartbeat completed (pong) after 1.18ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:43.936Z] DEBUG WS ws://dataexchange_0:3000 heartbeat timer popped (ping) after 30000.20ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:43.936Z] DEBUG WS ws://dataexchange_0:3000 heartbeat completed (pong) after 0.48ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:43.942Z] DEBUG WS ws://tokens_0:3000/api/ws heartbeat timer popped (ping) after 30000.69ms pid=1
firefly_core_0_1  | [2022-01-20T03:20:43.943Z] DEBUG WS ws://tokens_0:3000/api/ws heartbeat completed (pong) after 0.65ms pid=1

Also added a fix for this intermittent UT failure, which could hit real environments - by changing the locking in config to cover all cases of get/set concurrency:

goroutine 71 [running]:
runtime.throw(0x4ca0b0f, 0x21)
        /usr/local/Cellar/go/1.16.4/libexec/src/runtime/panic.go:1117 +0x72 fp=0xc0006bf9c8 sp=0xc0006bf998 pc=0x403b3b2
runtime.mapaccess2_faststr(0x4b64e00, 0xc0003df200, 0xc000680b10, 0x4, 0x53977c0, 0xc0003df100)
        /usr/local/Cellar/go/1.16.4/libexec/src/runtime/map_faststr.go:116 +0x4a5 fp=0xc0006bfa38 sp=0xc0006bf9c8 pc=0x4017625
github.com/spf13/viper.(*Viper).searchMap(0xc0000a61a0, 0xc0003df200, 0xc00068b340, 0x1, 0x2, 0x53977c0, 0x0)
        /Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/viper.go:613 +0x5f fp=0xc0006bfa80 sp=0xc0006bfa38 pc=0x454ba5f
github.com/spf13/viper.(*Viper).isPathShadowedInDeepMap(0xc0000a61a0, 0xc00068b340, 0x2, 0x2, 0xc0003df200, 0x0, 0x0)
        /Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/viper.go:749 +0x7f fp=0xc0006bfad0 sp=0xc0006bfa80 pc=0x454c3df
github.com/spf13/viper.(*Viper).find(0xc0000a61a0, 0xc000680b10, 0xe, 0x1, 0xc0006bfc40, 0x2)
        /Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/viper.go:1267 +0xafd fp=0xc0006bfbe0 sp=0xc0006bfad0 pc=0x454e23d
github.com/spf13/viper.(*Viper).Get(0xc0000a61a0, 0xc000680b00, 0xe, 0x203000, 0x53958d0)
        /Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/viper.go:843 +0x85 fp=0xc0006bfc78 sp=0xc0006bfbe0 pc=0x454c825
github.com/spf13/viper.(*Viper).GetString(0xc0000a61a0, 0xc000680b00, 0xe, 0xc000680b00, 0xe)
        /Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/viper.go:910 +0x3f fp=0xc0006bfcb8 sp=0xc0006bfc78 pc=0x454d01f
github.com/spf13/viper.GetString(...)
        /Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/viper.go:907
github.com/hyperledger/firefly/internal/config.(*configPrefix).GetString(0xc0001371f0, 0x4c86f9f, 0x9, 0x4c1a240, 0x30)
        /Users/peterbroadhurst/dev/photic/firefly/internal/config/config.go:552 +0x65 fp=0xc0006bfcf0 sp=0xc0006bfcb8 pc=0x455b3c5
github.com/hyperledger/firefly/internal/apiserver.(*apiServer).getPublicURL(0xc000032d80, 0x4de2e98, 0xc0001371f0, 0x0, 0x0, 0x5395780, 0x0)
        /Users/peterbroadhurst/dev/photic/firefly/internal/apiserver/server.go:417 +0x5e fp=0xc0006bfda8 sp=0xc0006bfcf0 pc=0x48b8bbe
github.com/hyperledger/firefly/internal/apiserver.(*apiServer).createMuxRouter(0xc000032d80, 0x4dd4f50, 0xc000032d40, 0x4de7ab0, 0xc00013a3c0, 0x0)
        /Users/peterbroadhurst/dev/photic/firefly/internal/apiserver/server.go:493 +0x158 fp=0xc0006bfe98 sp=0xc0006bfda8 pc=0x48b91d8
github.com/hyperledger/firefly/internal/apiserver.(*apiServer).Serve(0xc000032d80, 0x4dd4f50, 0xc000032d40, 0x4de7ab0, 0xc00013a3c0, 0x0, 0x1)
        /Users/peterbroadhurst/dev/photic/firefly/internal/apiserver/server.go:107 +0x35d fp=0xc0006bff18 sp=0xc0006bfe98 pc=0x48b6c7d
github.com/hyperledger/firefly/cmd.startFirefly(0x4dd4f50, 0xc000032d40, 0xc0004099d0, 0x4de7ab0, 0xc00013a3c0, 0x4dc0d00, 0xc000032d80, 0xc000046720)
        /Users/peterbroadhurst/dev/photic/firefly/cmd/firefly.go:163 +0x176 fp=0xc0006bffa0 sp=0xc0006bff18 pc=0x4964116
runtime.goexit()
        /usr/local/Cellar/go/1.16.4/libexec/src/runtime/asm_amd64.s:1371 +0x1 fp=0xc0006bffa8 sp=0xc0006bffa0 pc=0x4075181
created by github.com/hyperledger/firefly/cmd.run
        /Users/peterbroadhurst/dev/photic/firefly/cmd/firefly.go:118 +0x49e

goroutine 72 [runnable]:
github.com/spf13/viper.deepSearch(0xc0003dff50, 0xc0000331c0, 0x3, 0x4, 0x0)
				/Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/util.go:189 +0x19c
github.com/spf13/viper.(*Viper).SetDefault(0xc0000a61a0, 0x4c9bd0c, 0x1d, 0x4b1b7e0, 0x4dab720)
				/Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/viper.go:1437 +0x19d
github.com/spf13/viper.SetDefault(...)
				/Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/viper.go:1428
github.com/hyperledger/firefly/internal/config.Reset()
				/Users/peterbroadhurst/dev/photic/firefly/internal/config/config.go:351 +0x1434
github.com/hyperledger/firefly/cmd.run(0x0, 0x0)
				/Users/peterbroadhurst/dev/photic/firefly/cmd/firefly.go:93 +0x47
github.com/hyperledger/firefly/cmd.glob..func1(0x5354fa0, 0x5395c48, 0x0, 0x0, 0x0, 0x0)
				/Users/peterbroadhurst/dev/photic/firefly/cmd/firefly.go:48 +0x32
github.com/spf13/cobra.(*Command).execute(0x5354fa0, 0x5395c48, 0x0, 0x0, 0x5354fa0, 0x5395c48)
				/Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/command.go:856 +0x472
github.com/spf13/cobra.(*Command).ExecuteC(0x5354fa0, 0x4dc3120, 0xc00013a601, 0xc000409c90)
				/Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/command.go:974 +0x375
github.com/spf13/cobra.(*Command).Execute(...)
				/Users/peterbroadhurst/dev/go/pkg/mod/github.com/spf13/[email protected]/command.go:902
github.com/hyperledger/firefly/cmd.Execute(0x4c9086f, 0x13)
				/Users/peterbroadhurst/dev/photic/firefly/cmd/firefly.go:87 +0x45
github.com/hyperledger/firefly/cmd.TestExecOkRestartThenExit(0xc000103b00)
				/Users/peterbroadhurst/dev/photic/firefly/cmd/firefly_test.go:119 +0x3b0
testing.tRunner(0xc000103b00, 0x4cde8f0)
				/usr/local/Cellar/go/1.16.4/libexec/src/testing/testing.go:1193 +0xef
created by testing.(*T).Run
				/usr/local/Cellar/go/1.16.4/libexec/src/testing/testing.go:1238 +0x2b3

@codecov-commenter
Copy link

codecov-commenter commented Jan 20, 2022

Codecov Report

Merging #420 (1d534ce) into main (f56d701) will not change coverage.
The diff coverage is 100.00%.

Impacted file tree graph

@@            Coverage Diff            @@
##              main      #420   +/-   ##
=========================================
  Coverage   100.00%   100.00%           
=========================================
  Files          277       277           
  Lines        14858     14939   +81     
=========================================
+ Hits         14858     14939   +81     
Impacted Files Coverage Δ
pkg/wsclient/wstestserver.go 100.00% <ø> (ø)
internal/config/config.go 100.00% <100.00%> (ø)
internal/config/wsconfig/wsconfig.go 100.00% <100.00%> (ø)
internal/events/websockets/websocket_connection.go 100.00% <100.00%> (ø)
pkg/wsclient/wsclient.go 100.00% <100.00%> (ø)
internal/database/sqlcommon/tokenbalance_sql.go 100.00% <0.00%> (ø)

Continue to review full report at Codecov.

Legend - Click here to learn more
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update f56d701...1d534ce. Read the comment docs.

@peterbroadhurst
Copy link
Contributor Author

E2e failed due to #423

pkg/wsclient/wsclient.go Outdated Show resolved Hide resolved
@peterbroadhurst
Copy link
Contributor Author

peterbroadhurst commented Jan 20, 2022

I added a fix to #298 into this PR, as I saw it pop on an e2e test.

I introduced a separate receiverDone channel, to wait for the receiver to close, that we never send on.

defer w.heartbeatMux.Unlock()

if isPong && w.activePingSent != nil {
log.L(w.ctx).Debugf("WS %s heartbeat completed (pong) after %.2fms", w.url, float64(time.Since(*w.activePingSent))/float64(time.Millisecond))
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems like this will get very chatty... But I guess it's at debug level.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants