fix: mutex deadlocks in PeerClient #223

miparnisari · 2024-02-20T03:40:36Z

Description

Close #222.

The mutex deadlock is proven by the flaky TestPeerClientShutdown:

[25/02/24 7:09:39] ~/GitHub/gubernator (master) $ go test ./... -run=TestPeerClientShutdown -count=1 -race -timeout=10s
?   	github.com/mailgun/gubernator/v2/cmd/gubernator	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/gubernator-cli	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/gubernator-cluster	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/healthcheck	[no test files]
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9990"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9990 ..." instance="127.0.0.1:9990"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9980 ..." instance="127.0.0.1:9990"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9991"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9991 ..." instance="127.0.0.1:9991"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9981 ..." instance="127.0.0.1:9991"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9992"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9992 ..." instance="127.0.0.1:9992"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9982 ..." instance="127.0.0.1:9992"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9993"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9993 ..." instance="127.0.0.1:9993"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9983 ..." instance="127.0.0.1:9993"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9994"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9994 ..." instance="127.0.0.1:9994"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9984 ..." instance="127.0.0.1:9994"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9995"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9995 ..." instance="127.0.0.1:9995"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9985 ..." instance="127.0.0.1:9995"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9890"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9890 ..." instance="127.0.0.1:9890"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9880 ..." instance="127.0.0.1:9890"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9891"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9891 ..." instance="127.0.0.1:9891"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9881 ..." instance="127.0.0.1:9891"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9892"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9892 ..." instance="127.0.0.1:9892"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9882 ..." instance="127.0.0.1:9892"
time="2024-02-25T19:09:44-03:00" level=info msg="Starting 12 Gubernator workers..." instance="127.0.0.1:9893"
time="2024-02-25T19:09:44-03:00" level=info msg="GRPC Listening on 127.0.0.1:9893 ..." instance="127.0.0.1:9893"
time="2024-02-25T19:09:44-03:00" level=info msg="HTTP Gateway Listening on 127.0.0.1:9883 ..." instance="127.0.0.1:9893"
panic: test timed out after 10s
running tests:
	TestPeerClientShutdown (10s)
	TestPeerClientShutdown/Global (10s)


goroutine 566 [sync.RWMutex.Lock]:
sync.runtime_SemacquireRWMutex(0xc000fefedc?, 0xbc?, 0xc000fefec8?)
	/opt/homebrew/Cellar/go/1.21.6/libexec/src/runtime/sema.go:87 +0x28
sync.(*RWMutex).Lock(0xc000fefec8)
	/opt/homebrew/Cellar/go/1.21.6/libexec/src/sync/rwmutex.go:152 +0x88
github.com/mailgun/gubernator/v2.(*PeerClient).connect(0xc000fefe00, {0xc00150a1a0?, 0x1?})
	/Users/maria.inesparnisari/GitHub/gubernator/peer_client.go:123 +0x3b4
github.com/mailgun/gubernator/v2.(*PeerClient).getPeerRateLimitsBatch(0xc000fefe00, {0x1054e7e50, 0x1063ddd60}, 0xc00086a200)
	/Users/maria.inesparnisari/GitHub/gubernator/peer_client.go:299 +0x170
github.com/mailgun/gubernator/v2.(*PeerClient).GetPeerRateLimit(0xc000fefe00, {0x1054e7e50, 0x1063ddd60}, 0xc00086a200)
	/Users/maria.inesparnisari/GitHub/gubernator/peer_client.go:199 +0x4fc
github.com/mailgun/gubernator/v2_test.TestPeerClientShutdown.func1.1()
	/Users/maria.inesparnisari/GitHub/gubernator/peer_client_test.go:71 +0x12c
created by github.com/mailgun/gubernator/v2_test.TestPeerClientShutdown.func1 in goroutine 565
	/Users/maria.inesparnisari/GitHub/gubernator/peer_client_test.go:68 +0x3cc

Sometimes this test also gives "panic: send on close channel". It happens because when you call shutdown, we are closing the channel, but we are not preventing any further requests, which will then try to write to that channel.

Testing

With this PR:

25/02/24 7:11:21] ~/GitHub/gubernator (fix-peer-client) $ go test ./... -run=TestPeerClientShutdown -count=10 -race -timeout=5s
?   	github.com/mailgun/gubernator/v2/cmd/gubernator	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/gubernator-cli	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/gubernator-cluster	[no test files]
?   	github.com/mailgun/gubernator/v2/cmd/healthcheck	[no test files]
ok  	github.com/mailgun/gubernator/v2	7.453s
ok  	github.com/mailgun/gubernator/v2/cluster	2.966s [no tests to run]
[25/02/24 7:11:37] ~/GitHub/gubernator (fix-peer-client) $

peer_client.go

thrawn01 · 2024-02-21T03:02:34Z

This code is overly complex and I loath changing it. 😭

I'm probably missing something obvious, but where are we demonstrating there is a a deadlock and where is the deadlock?

thrawn01 · 2024-02-21T03:09:56Z

peer_client.go

-	// Take the write lock since we're going to modify the closing state
-	c.mutex.Lock()
+// Shutdown waits until outstanding requests have finished
+func (c *PeerClient) Shutdown() {


Shutdown/Close methods should always accept and context.Cancel as there may be some unknown or unforeseen reason the close cannot complete and it then hangs forever. This has happened in production on other services and it's always good to put a limit on how long you will wait for a shutdown to complete. This protects the service from having to be killed by the orchestration system without successfully saving it's internal state. Shutting down a connection is a low priority thing, if for some reason the workgroup Wait() hangs forever, then we can cancel the shutdown wait, and not interfere with normal shutdown of any service that uses gubernator as a library.

I've never seen a shutdown function accept a context. E.g. sql.Close, or connection.Close.

The context object is meant to be used when there is the potential chance that the request is cancelled by the user. This is not the case, Shutdown is invoked internally, not by a client. We should not use the context as a crutch to fix deadlock issues.
I can refactor this to avoid mutex altogether, but I would like to know if the work would be useful (given the v3 release)

There are a few things in the works. We are moving development to a new repo which will not be associated with the mailgun GitHub organization. After that, the plan is to bump to a v3 release, drop GRPC support and continue development there.

But first we want to merge all these open PRs.

I had hoped we could finish merging last week, as I'm traveling this week, but hopefully we can get it all done!

And to add, Mailgun plans to abandon this repo in favor of @thrawn01's new repo. I suppose you can continue using the /v2 flavor with the new repo until ready to migrate to /v3.

Either way, it looks like /v2 will be deprecated very soon and gRPC support along with it. I suggest not investing too much time into it.

Thanks for the update! I can lend a hand in the new repo 😄

Out of interest, what was the biggest factor for the decision to drop gRPC?

Out of interest, what was the biggest factor for the decision to drop gRPC?

@thrawn01 and I had some big conversations about this. Reduced complexity and improved performance. HTTP can perform more efficiently at scale than gRPC if done right.

This would be a good time for @thrawn01 to mention his work on Duh-RPC.

HTTP can perform more efficiently at scale than gRPC if done right.

What's the source for this? My understanding was this this in fact the opposite 😄 https://aws.amazon.com/compare/the-difference-between-grpc-and-rest/

A gRPC API is better for these use cases:
High-performance systems
High data loads
Real-time or streaming applications

Another source: https://github.com/grpc-ecosystem/grpc-gateway#background

gRPC is great -- it generates API clients and server stubs in many programming languages, it is fast, easy-to-use, bandwidth-efficient and its design is combat-proven by Google. However, you might still want to provide a traditional RESTful JSON API as well. Reasons can range from maintaining backward-compatibility, supporting languages or clients that are not well supported by gRPC, to simply maintaining the aesthetics and tooling involved with a RESTful JSON architecture.

I intend to publish some more formal information about our GRPC journey at Mailgun. But as a quick overview, we are doing some very high performance and large scale stuff with GRPC and it hasn't lived up to all the hype.

In fact we have discovered that GRPC is slightly slower than using plan old HTTP/1 !!!!!!

As a result, I started working on a project called DUH-RPC, which is still in the works, but you can see the benchmarks comparing HTTP/1 with GRPC here https://github.com/duh-rpc/duh-go-benchmarks

The spec I'm working on is in progress but is here, feed back is welcome.
https://github.com/duh-rpc/duh-go

I'll be traveling for the next 2 weeks, and I'm not always near my laptop, so I don't expect much movement on these projects until after. I still get notifications for these conversations, so I'll try to reply as I can.

miparnisari · 2024-02-21T03:20:53Z

This code is overly complex and I loath changing it. 😭

I'm probably missing something obvious, but where are we demonstrating there is a a deadlock and where is the deadlock?

The deadlock is proven by the flaky test. (If you run go test with -count=100 you will see it)

Sometimes I get "panic: send on close channel". It happens because when you call shutdown, we are closing the channel, but we are not preventing any further requests, which will then try to write to that channel

And sometimes the test times out, it's because one of the lines of code never finishes acquiring the mutex. I forget which line it was

Baliedge · 2024-02-23T14:35:32Z

@miparnisari Is this PR ready for merge? Not concerned about the go-bench error. But the PR is still in draft mode.

miparnisari · 2024-02-25T21:18:43Z

peer_client_test.go

 					}
-
-					assert.True(t, true, isExpectedErr)


this is asserting that true == true 😅 it's also bad practice to do assertions inside goroutines (https://github.com/Antonboom/testifylint#go-require)

miparnisari · 2024-02-25T21:19:35Z

peer_client_test.go

-					switch err.(type) {
-					case *gubernator.PeerErr:


this is not unwrapping errors, so it will never enter the 1st case

peer_client.go

miparnisari · 2024-02-26T15:37:17Z

@Baliedge i don't know what to do about the newly failing TestLeakyBucketDivBug. It seems to be failing because of the immediately preceding test TestHealthCheck. When I run it in isolation, it passes:

[26/02/24 12:36:35] ~/GitHub/gubernator (fix-peer-client) $ go test ./... -run=TestLeakyBucketDivBug -race -count=1 -p=1
ok      github.com/mailgun/gubernator/v2        2.411s
ok      github.com/mailgun/gubernator/v2/cluster        2.014s [no tests to run]
?       github.com/mailgun/gubernator/v2/cmd/gubernator [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cli     [no test files]
?       github.com/mailgun/gubernator/v2/cmd/gubernator-cluster [no test files]
?       github.com/mailgun/gubernator/v2/cmd/healthcheck        [no test files]
[26/02/24 12:36:47] ~/GitHub/gubernator (fix-peer-client) $

I have the suspicion that TestHealthCheck is setting lastErr and TestLeakyBucketDivBug is regurgitating that.

Baliedge · 2024-02-26T19:12:38Z

@Baliedge i don't know what to do about the newly failing TestLeakyBucketDivBug. It seems to be failing because of the immediately preceding test TestHealthCheck.

Since TestHealthCheck causes a server side shutdown and restart, I wonder if it's not ready in time for the next test.

functional_test.go

peer_client.go

Baliedge

Nice

miparnisari · 2024-02-29T04:55:13Z

@Baliedge make test still fails locally 😭

Baliedge · 2024-02-29T14:17:56Z

@miparnisari I noticed after merging. It does work on retry. If you have an idea how to make it better, let's try it.

fix race issuers in PeerClient

498acea

miparnisari mentioned this pull request Feb 20, 2024

fix: goroutine leaks #221

Merged

miparnisari commented Feb 20, 2024

View reviewed changes

peer_client.go Outdated Show resolved Hide resolved

miparnisari changed the title ~~fix race issuers in PeerClient~~ fix race issueS in PeerClient Feb 20, 2024

miparnisari changed the title ~~fix race issueS in PeerClient~~ fix mutex deadlocks in PeerClient Feb 20, 2024

miparnisari marked this pull request as ready for review February 20, 2024 03:49

miparnisari requested a review from Baliedge as a code owner February 20, 2024 03:49

miparnisari added 3 commits February 20, 2024 10:32

undo changes in comments

5c0e54e

more fixes

2b940a8

final fixes

747d3cd

miparnisari marked this pull request as draft February 20, 2024 16:35

thrawn01 reviewed Feb 21, 2024

View reviewed changes

miparnisari added 2 commits February 25, 2024 16:10

Merge remote-tracking branch 'origin/master' into fix-peer-client

33098ee

do not create 1 connection every time

53a1627

miparnisari commented Feb 25, 2024

View reviewed changes

peer_client.go Outdated Show resolved Hide resolved

miparnisari added 10 commits February 25, 2024 18:23

add comment

4b701fb

remove comment

0eccd37

fix lint

42c8e72

add ctx back to shutdown

27e91fd

move conn.Close out of goroutine

dc64a09

fix lint

045f826

oops

9ce0eb6

make it blocking

e7eabb0

put cap of 1 second to NewPeerClient connect

f132176

undo put cap of 1 second to NewPeerClient connect

f6102d9

miparnisari changed the title ~~fix mutex deadlocks in PeerClient~~ fix: mutex deadlocks in PeerClient Feb 26, 2024

miparnisari added 2 commits February 26, 2024 12:08

Merge remote-tracking branch 'origin/master' into fix-peer-client

e37a3fe

fix another leak

b218cbe

miparnisari marked this pull request as ready for review February 26, 2024 18:33

miparnisari added 2 commits February 28, 2024 11:13

on peer shutdown, clear all errors

fcb7b15

add time.sleep to end of TestHealthCheck

6a53c43

Baliedge suggested changes Feb 28, 2024

View reviewed changes

functional_test.go Outdated Show resolved Hide resolved

miparnisari added 2 commits February 28, 2024 14:20

use testutil.UntilPass

e4d42b7

check health of every instance

795af06

Baliedge suggested changes Feb 28, 2024

View reviewed changes

functional_test.go Outdated Show resolved Hide resolved

use cluster.GetAllPeers

db541c0

miparnisari force-pushed the fix-peer-client branch from 294c616 to db541c0 Compare February 28, 2024 22:36

miparnisari commented Feb 28, 2024

View reviewed changes

peer_client.go Show resolved Hide resolved

Baliedge approved these changes Feb 28, 2024

View reviewed changes

Baliedge merged commit 63fd9ff into mailgun:master Feb 28, 2024
4 checks passed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

fix: mutex deadlocks in PeerClient #223

fix: mutex deadlocks in PeerClient #223

miparnisari commented Feb 20, 2024 •

edited

Loading

thrawn01 commented Feb 21, 2024

thrawn01 Feb 21, 2024

miparnisari Feb 21, 2024 •

edited

Loading

thrawn01 Feb 21, 2024

Baliedge Feb 21, 2024

miparnisari Feb 21, 2024

philipgough Feb 21, 2024

Baliedge Feb 21, 2024

miparnisari Feb 21, 2024

miparnisari Feb 21, 2024

thrawn01 Feb 23, 2024

miparnisari commented Feb 21, 2024 •

edited

Loading

Baliedge commented Feb 23, 2024

miparnisari Feb 25, 2024 •

edited

Loading

miparnisari Feb 25, 2024

miparnisari commented Feb 26, 2024 •

edited

Loading

Baliedge commented Feb 26, 2024

Baliedge left a comment

miparnisari commented Feb 29, 2024 •

edited

Loading

Baliedge commented Feb 29, 2024

fix: mutex deadlocks in PeerClient #223

fix: mutex deadlocks in PeerClient #223

Conversation

miparnisari commented Feb 20, 2024 • edited Loading

Description

Testing

thrawn01 commented Feb 21, 2024

Choose a reason for hiding this comment

miparnisari Feb 21, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

miparnisari commented Feb 21, 2024 • edited Loading

Baliedge commented Feb 23, 2024

miparnisari Feb 25, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

miparnisari commented Feb 26, 2024 • edited Loading

Baliedge commented Feb 26, 2024

Baliedge left a comment

Choose a reason for hiding this comment

miparnisari commented Feb 29, 2024 • edited Loading

Baliedge commented Feb 29, 2024

miparnisari commented Feb 20, 2024 •

edited

Loading

miparnisari Feb 21, 2024 •

edited

Loading

miparnisari commented Feb 21, 2024 •

edited

Loading

miparnisari Feb 25, 2024 •

edited

Loading

miparnisari commented Feb 26, 2024 •

edited

Loading

miparnisari commented Feb 29, 2024 •

edited

Loading