Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

sync: add Mutex.TryLock #45435

Closed
TyeMcQueen opened this issue Apr 7, 2021 · 32 comments
Closed

sync: add Mutex.TryLock #45435

TyeMcQueen opened this issue Apr 7, 2021 · 32 comments

Comments

@TyeMcQueen
Copy link

What version of Go are you using (go version)?

$ go version
go version go1.16.3 darwin/amd64

Does this issue reproduce with the latest release?

Yes

What operating system and processor architecture are you using (go env)?

go env Output
$ go env
GO111MODULE=""
GOARCH="amd64"
GOBIN=""
GOCACHE="/Users/tyem/Library/Caches/go-build"
GOENV="/Users/tyem/Library/Application Support/go/env"
GOEXE=""
GOFLAGS=""
GOHOSTARCH="amd64"
GOHOSTOS="darwin"
GOINSECURE=""
GOMODCACHE="/Users/tyem/go/pkg/mod"
GONOPROXY=""
GONOSUMDB=""
GOOS="darwin"
GOPATH="/Users/tyem/go"
GOPRIVATE=""
GOPROXY="https://proxy.golang.org,direct"
GOROOT="/usr/local/go"
GOSUMDB="sum.golang.org"
GOTMPDIR=""
GOTOOLDIR="/usr/local/go/pkg/tool/darwin_amd64"
GOVCS=""
GOVERSION="go1.16.3"
GCCGO="gccgo"
AR="ar"
CC="clang"
CXX="clang++"
CGO_ENABLED="1"
GOMOD="/dev/null"
CGO_CFLAGS="-g -O2"
CGO_CPPFLAGS=""
CGO_CXXFLAGS="-g -O2"
CGO_FFLAGS="-g -O2"
CGO_LDFLAGS="-g -O2"
PKG_CONFIG="pkg-config"
GOGCCFLAGS="-fPIC -arch x86_64 -m64 -pthread -fno-caret-diagnostics -Qunused-arguments -fmessage-length=0 -fdebug-prefix-map=/var/folders/4w/l3sttpnj00xctgf981g1t2cr0000gq/T/go-build590339154=/tmp/go-build -gno-record-gcc-switches -fno-common"

What did you do?

Try to fix several bugs with locking contention in net/http/h2_bundle.go.

What did you expect to see?

A non-blocking way to test if a mutex is locked.

What did you see instead?

A years-old proposal that was eventually closed due to nobody laying out a convincing enough case to provide such a feature, #6123. I was unable to find a way to re-open an old, locked issue so I am raising it fresh now that I believe I can make a convincing case for the need for this feature.

I think fixing several years-old serious lock-up problems with h2 provides a good justification for this functionality. I believe I have a quite small and simple fix for the following open issues:

x/net/http2: client can hang forever if headers' size exceeds connection's buffer size and server hangs past request time
#23559
x/net/http2: pool deadlock
#32388
net/http: a "bad" https connection which uses Header.Set could potentially block all https requests more than 15 mins in production environments
#33006
x/net/http2: Blocked Write on single connection causes all future calls to block indefinitely
#33425

To lay out the justification for readers not familiar with any of these bugs...

You have a pool of network connections.
You have a sync.Mutex for preventing races when accessing this pool struct.
Each network connection also has a sync.Mutex for preventing races when accessing the connection struct.
Several operations on the pool need to query the state of one or more connections.
In particular, consider a function for making a new connection that needs to see if any existing connections can be reused.
This function must:
Lock the pool mutex.
Look up existing connections to the same endpoint.
Ask each connection if it is in a state where it can be used for a new connection.

That last question requires looking inside of the connection struct which would require locking the connection's mutex.
Despite a great deal of effort over years, the complexity of a network connection is such that it is still possible for the connection mutex to be locked when an operation that is usually fast is done and then that operation takes a long time, leaving the connection mutex locked for a non-trivial duration. It is desirable to prevent such situations but there have been so many cases of this and several of them could not be eliminated after years. So the risk of such is always going to be present.

While having such a problem impact a single connection object is often not much of a problem, having it make it impossible for any new network connections to be made is a very serious problem and we should prevent that. The problem is so serious that h2_bundle has been disabled in important situations just to avoid this.

If I am holding a lock on the pool's mutex and need to ask the question, "Is this particular connection available to take this request?", then, I believe, the connection's mutex being locked means that I should be able to immediately answer with "No, this connection is not available to take a new request, because it is busy doing something that requires its mutex to be locked. We hope this situation will progress very quickly to a state where the mutex is no longer locked and at that point we could ask the question again, but, no, right now, the connection is not ready and available."

But sync.Mutex does not currently allow me to code that, and I think it is completely reasonable to allow such. This can be easily done via the simple addition of:

// LockIfNoWait either locks m immediately if it is not in use, returning
// true, or, when a lock on m is already held, immediately returns false.
func (m *Mutex) LockIfNoWait() bool {
    if atomic.CompareAndSwapInt32(&m.state, 0, mutexLocked) {
        if race.Enabled {
            race.Acquire(unsafe.Pointer(m))
        }
        return true
    }
    return false
}

Such an addition then allows fixes to h2_bundle.go that are also very simple.

Now, even if we do a great deal of complicated work to eliminate all of the places where a connection mutex is held during a call that has a rare chance of taking a bit of time and even convince ourselves that we won't ever let such bugs ever creep in again, I would still prefer that "Is this connection available?" be answered immediately with "no" if the question is asked while the connection's mutex is held. Having the "open a new connection" function have to potentially wait for any number of connection mutexes to be unlocked just does not sound like a good strategy for making a pool that provides high performance when there are tons of connections being managed.

One of the places this has happened is holding the connection mutex when the connection is Close()d. It is completely reasonable for the connection struct to remain locked during such an operation as nothing useful can be done with a connection that is in the process of closing... except, of course, if you need to ask the connection a question and you are in a context where waiting for the answer would tie up all progress for anybody wanting to deal with network connections.

And the cases that cause this to happen are often sudden, ungraceful loss of network connectivity to a service. In such a case, it doesn't matter much if the mutex for connections to such a server are locked or not because you won't be able to make useful connections to the service anyway (better to try a new connection where that gets done in a way that doesn't hold the new connection's mutex locked). But having one service failing slowly block all other network connections is terrible.

Another approach would be to have two layers to a connection object and 2 mutexes. And simple state information would be handled by one mutex and complex information would be handled by another. You'd have to be very careful to always lock the slow mutex before locking the fast one. And be very careful about which things are protected by which mutex. What are the trade-offs of such an approach? Cons: A great increase in complexity, lots of ways to easily make mistakes about what is fast vs slow, pool operations may have to wait for any number of fast connection mutexes to be unlocked, and fixing the above long-lived serious bugs will take other people spending a long time working on them (vs. me already having a simple fix). Pros: You don't have to add a basic feature of mutexes that is provided by most other implementations and has been requested by several people and that I've seen lots of reasonable uses for. (I would have piled on to the request for this myself a few times before this but the issue was already there and locked so I just did a more disruptive design change to work around this basic lack.)

So I'm hoping that case is convincing so I can make more progress on fixing those bugs.

Thanks!

@gopherbot gopherbot added this to the Proposal milestone Apr 7, 2021
@ianlancetaylor
Copy link
Member

See also #27544.

Personally I think TryLock is a better name than LockIfNoWait.

@ianlancetaylor ianlancetaylor changed the title proposal: func (m *sync.Mutex) LockIfNoWait() bool proposal: sync: add Mutex.LockIfNoWait method Apr 8, 2021
@TyeMcQueen
Copy link
Author

TryLock appears to be a pretty common name for this. There were objections to that name in #6123 so I endeavored to pick a name more in keeping with Go traditions. But I'm not picky about the name.

I suspect we don't want net/http depending on a 3rd-party module, a solution offered in #27544. Though I suppose an alternate mutex implementation (using a channel?) could be bundled into h2_bundle.go (and not exported). That would be better than not fixing h2 anyway.

@bcmills
Copy link
Contributor

bcmills commented Apr 13, 2021

You have a pool of network connections.
You have a sync.Mutex for preventing races when accessing this pool struct.

As in #27544, why not replace the sync.Mutex with a 1-buffered channel?

You could even store the pool itself in the channel's buffer, so that the pool isn't even in scope when it isn't locked.
To lock the pool, receive it from the channel.
To unlock the pool, send it back to the channel.
To TryLock, receive in a select with a default case.

(I examined this pattern in much more detail starting around slide 52 of my talk on Rethinking Classical Concurrency Patterns. See in particular the queue implementation around slide 61.)

@neild
Copy link
Contributor

neild commented Apr 13, 2021

As in #27544, why not replace the sync.Mutex with a 1-buffered channel?

By the same argument, why not replace all sync.Mutexes with channels?

@bcmills
Copy link
Contributor

bcmills commented Apr 13, 2021

By the same argument, why not replace all sync.Mutexes with channels?

Honestly? I think that's a great idea. 😁

(But we can't remove sync.Mutex because of Go 1 compatibility, and because it would be a gratuitous breaking change. If we we starting Go over from scratch, I would certainly argue for its omission.)

@TyeMcQueen
Copy link
Author

To avoid a lot of thrash in h2 code, I'd just bundle a replacement implementation of a Mutex using a 1-buffer channel and replace the sync.Mutex on each member with such and use TryLock()...

Why not include such an implementation in sync as a differently named Mutex? Having a Mutex implemented using a channel sounded like something desirable. Or just replace the guts of Mutex with a channel-based implementation?

@bcmills
Copy link
Contributor

bcmills commented Apr 15, 2021

@TyeMcQueen, I would love to see a generic channel-based Mutex implementation. You could even imagine one using typed callbacks, based on the sync.Once signature:

package sync

type TypedMutex[T any] …

func (mu *TypedMutex[T]) Do(func(T) T)
func (mu *TypedMutex[T]) Try(context.Context, func(T) (T, error)) error

puzpuzpuz added a commit to puzpuzpuz/go that referenced this issue Apr 27, 2021
Adds a way to lock a Mutex optimistically, without blocking.

```
var mu Mutex
if mu.TryLock() {
	// lock attempt succeeded
	mu.Unlock()
} else {
	// attempt failed - do something else
}
```

Refs: golang#45435
Refs: golang#6123
@rsc
Copy link
Contributor

rsc commented May 5, 2021

Locks are for protecting invariants. If the lock is held by someone else, there is nothing you can say about the invariant.

TryLock encourages imprecise thinking about locks; it encourages making assumptions about the invariants that may or may not be true. That ends up being its own source of races.

There are definitely locking issues in http2. Adding TryLock would let us paper over them to some extent, but even that would not be a real fix. It would be more like the better your 4-wheel-drive the farther out you get stuck.

I don't believe http2 makes a compelling case for TryLock.

@rsc rsc changed the title proposal: sync: add Mutex.LockIfNoWait method proposal: sync: add Mutex.TryLock May 5, 2021
@rsc
Copy link
Contributor

rsc commented May 5, 2021

This proposal has been added to the active column of the proposals project
and will now be reviewed at the weekly proposal review meetings.
— rsc for the proposal review group

@TyeMcQueen
Copy link
Author

When the question is "Is it ready to be used for another connection?" when it is locked, the precise answer is actually "No, it is not ready".

And I believe that is the better approach even if we eliminated all cases of locks for individual connections being held around blocking operations. In theory, we could achieve that. In practice, it hasn't happened for years and I don't see it happening for years more, and once it is achieved, I expect that success to be short-lived (due to the complexity of an h2 connection).

Building a system where a minor mistake leads to serious lock-up is not good design. In some ways it may be purer; it is not better.

And at least many of the cases of "locking problems" for h2 connections are not really problems when dealing with a single connection (see late in my original, long proposal for more on this) and so I'm not convinced that eliminating them is worthwhile, or at least worth the effort required.

In my experience, preventing the use of techniques because you believe that coders with less skill, care, experience, or judgement may misuse them often leads to even worse approaches being chosen to work around this lack. So I prefer to document the better approaches and the reasons why a feature should not be used in some way, rather than just prohibit it. It leads to more learning, not just less frustration.

Shifting gears...

Having spent some time thinking about the several ways we could get a TryLock() available to h2, I think several of the approaches are reasonable. I think it would be cool to create a channel-based mutex implementation that prevents you from even accessing the object before you obtain the lock and allows you to use a context to even wait "for a while" when trying to obtain a lock. And, if that proves valuable enough, it would be cool to include that in the standard go library (perhaps so it can be used by other things in the standard go library). But I also think such would need time to prove that it deserves such inclusion. And I probably wouldn't replace sync.Mutex with that implementation. The simpler (and probably more efficient version) is probably worth having for when you don't need more.

But I now really do believe that even with alternatives out there in other libraries, this very specific, very simple addition to sync.Mutex is really something that should be added. (In the beginning, I was less convinced of that, more just wanting to fix serious problems in h2.)

It is quite easy to start using sync.Mutex because it is a perfect fit for your situation and then come to a place where "I want to do X, but if I can't do it right now, I don't want to do anything, and waiting would be a terrible choice". And it is a sign of respect for the users of our language to give them the option to use TryLock() to get their work done while they file a "technical debt" ticket to rework how locking is done because the design has shifted to the point that a simple Mutex is not the best fit any longer. Will this lead to that technical debt never getting fixed? Surely it sometimes will. But the better way to encourage such design improvements is documentation, not prohibition.

I've personally run into this situation twice already that I can remember. Forcing the more significant redesign just to get past this hurdle was not helpful to me. And it didn't even always lead to a better design, just one that didn't need TryLock().

There really are situations where Lock, Unlock, and TryLock is just exactly what you need. Those situations can also be satisfied with channels and semaphores and other more complex and powerful things. If you need to ask "Are you still busy?", that's a perfectly reasonable question to be able to ask. And if the thing you need to ask that of is just one thing and you also need a mutex around it and that mutex already perfectly captures "Yes, I'm busy", let people do that.

Just TryLock() is not enough to warrant a whole separate abstraction. This is reinforced by how easy it is to implement for sync.Mutex.

So I like having Lock, Unlock, and TryLock as one set of features over having Lock+Unlock in one object and Lock+Unlock+TryLock in another. And I like having such a simple thing, not just an all-singing, all-dancing thing that can also be used as a mutex.

@rsc
Copy link
Contributor

rsc commented May 12, 2021

It is easy to build a TryLock-able Mutex out of a Mutex and an atomic word. It is probably worth doing that in http2 to see if it solves your problem.

Adding TryLock just for http2 does not seem worthwhile. We would need a more compelling case.

@DmitriyMV
Copy link
Contributor

@rsc I've seen variations of this across several commercial projects I've been involved in. And while I agree that TryLock is not an example of "good code" maybe it can be added just for the sake of preventing people from using such hacks. Although

It is easy to build a TryLock-able Mutex out of a Mutex and an atomic word

Maybe it would be a good idea to implement reference implementation inside x/sync?

@rsc
Copy link
Contributor

rsc commented May 13, 2021

Thinking more about this, there is one important benefit to building TryLock into Mutex compared to a wrapper:
failed TryLock calls wouldn't create spurious happens-before edges to confuse the race detector.

I'm still not convinced we should do this, but I put together a CL to see what it would look like at https://golang.org/cl/319769.

/cc @dvyukov for thoughts

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/319769 mentions this issue: sync: add Mutex.TryLock, RWMutex.TryLock, RWMutex.TryRLock

@prattmic
Copy link
Member

FWIW, the gVisor project uses a modified sync package with TryLock and came up with an implementation almost 100% identical to @rsc's: https://cs.opensource.google/gvisor/gvisor/+/master:pkg/sync/rwmutex_unsafe.go;l=109?ss=gvisor%2Fgvisor

@neild
Copy link
Contributor

neild commented May 13, 2021

I think that TryLock is almost always an indication of sloppy thinking about invariants, but I'm not convinced that it's always a mistake. And I'd rather see a single TryLock implementation that can be reasoned about rather than many custom ones.

A channel-based implementation as @bcmills suggests is possible, but performs poorly in comparison. There's a reason we have sync.Mutex rather than just using channel for locking.

@dvyukov
Copy link
Member

dvyukov commented May 14, 2021

I don't have much to add. Yes, TryLock may be needed in some cases; yes, it's needed very rarely in Go; yes, part of the current uses is papering over sub-optimal design; yes, an implementation in std lib can cooperate with race detector, with starvation prevention logic in Mutex, with contention profiler; yes, adding TryLock to std lib will increase number of unjustified uses.

I've searched our internal code base for "TryLock() bool" and found:

@networkimprov
Copy link

I need TryRLock() for the mnm client app. I can explain the use case if asked.

It's not trivial to implement with a wrapper.

@rsc
Copy link
Contributor

rsc commented May 19, 2021

OK, I retract my objections. Everyone agrees it's unfortunate but sometimes necessary.

Does anyone else object to adding it?

@rsc
Copy link
Contributor

rsc commented May 26, 2021

Based on the discussion above, this proposal seems like a likely accept.
— rsc for the proposal review group

@rsc rsc changed the title proposal: sync: add Mutex.TryLock sync: add Mutex.TryLock Jun 2, 2021
@rsc rsc modified the milestones: Proposal, Backlog Jun 2, 2021
@kaey
Copy link

kaey commented Jun 3, 2021

Can it be exposed in x/sync instead?

@cristaloleg
Copy link

@kaey what is the point for that?

@tandr
Copy link

tandr commented Jun 5, 2021

Re: https://golang.org/cl/319769

Will this make to 1.17, or it is too late, sorry?

@ianlancetaylor
Copy link
Member

We are in the feature freeze for 1.17. This will be for 1.18.

@bradfitz bradfitz modified the milestones: Backlog, Go1.18 Sep 3, 2021
@alecrajeev
Copy link

This is a good idea. I definitely can see my team using this for our internal tooling.

@DmitriyMV
Copy link
Contributor

Kindly ping @rsc and @ianlancetaylor about https://golang.org/cl/319769

@tbal999
Copy link

tbal999 commented Sep 26, 2021

I like this idea.

@seebs
Copy link
Contributor

seebs commented Oct 13, 2021

Another potential advantage of TryLock is: If you TryLock something, and get false, you can report the circumstances and reasoning that led you to believe that a lock should have succeeded, even if there's nothing else you can do.

@neild
Copy link
Contributor

neild commented Oct 13, 2021

FYI, I believe the HTTP/2 client bugs listed in the original issue (#23559, #32388, #33006, #33425) have all been fixed now, without using TryLock. This isn't an argument against adding TryLock, just a data point that it wasn't needed here.

@as
Copy link
Contributor

as commented Oct 14, 2021

I predict that the conditions under which you find this feature useful are not the conditions under which other users will use it. The primitive being non-trivial to implement correctly should not be justification for adding it to the standard library and encouraging others to use it.

The only utility of this feature is when you have a long-lived operation and a small set of workers infrequently racing to acquire the capability to do that operation. Not only is this trivial to implement with a buffered channel it is also extremely uncommon in a system that uses concurrency effectively.

@gopherbot
Copy link
Contributor

Change https://golang.org/cl/363672 mentions this issue: sync: in TryLock try to acquire mutex even if state is not 0

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests