Improve multi-threaded repo locking #2344

Closed
dbnicholson opened this issue Apr 15, 2021 · 3 comments · Fixed by #2348
Comments

@dbnicholson
Member

Currently the repo locking is entirely thread local, which is very safe but makes it difficult to use in a multi-threaded application. The canonical issue would be if you took an exclusive lock and then spawned some threads that tried to take a shared lock. This would deadlock because the threads could never acquire the lock even though the application already holds an exclusive lock on the repo.

This isn't theoretical. Back when I was working on #1292 I wanted to make ostree_repo_prune_object take an exclusive lock. However, it's actually used deep within the pull code to try to clean up a tombstone commit (IIRC). The main thread acquires a shared repo lock when it starts a transaction and then the pull code spawns a bunch of threads. When a pull thread hits this path it deadlocks because it has no connection to the main thread's lock.

If we're to make the repo locking public, it's pretty likely that someone would take an exclusive lock and then run the pull code. I know I've done that many times in our tooling when doing some maintenance task.

@dbnicholson
Member Author

dbnicholson commented Apr 15, 2021

I'm sure there's probably a data structure that's more appropriate for this, but this is the idea I came up with.

The actual lock file descriptor is stored in OstreeRepo with a mutex guarding it. When threads want to change the lock state, they acquire the mutex and then change the flock. The reason I didn't do this initially was the recursive nature of the locking. For a single thread, it's easy to have a single linear stack that represents the locking recursion.

For multiple threads that's not possible since they can act in any order. The way I thought to handle this is to maintain a stack per thread in thread-local storage, as now, so that when an individual thread pops a lock state it knows what type of lock state it's dropping. In the shared mutex-guarded code, there would just be 2 counters for how many shared and exclusive states are on the stack. This maintains the latching aspect of it. I.e., the lock stays exclusive until all exclusive states have been popped. When all shared and exclusive states have been popped, the lock is unlocked.
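A rough sketch of that scheme, using plain pthreads and flock rather than ostree's actual code (all names here are illustrative, `lock_fd` stands in for the repo's lock fd, and calling flock() while holding the mutex is a simplification a real implementation would want to avoid):

```c
#include <pthread.h>
#include <sys/file.h>

typedef enum { LOCK_SHARED, LOCK_EXCLUSIVE } LockState;

#define MAX_DEPTH 16

/* Per-thread recursion stack: each thread remembers which states it pushed. */
static __thread LockState tl_stack[MAX_DEPTH];
static __thread int tl_depth = 0;

/* Shared state, guarded by the mutex. */
static pthread_mutex_t lock_mutex = PTHREAD_MUTEX_INITIALIZER;
static int shared_count = 0;
static int exclusive_count = 0;
static int lock_fd = -1; /* one flock fd per repo, not per thread; -1 here so
                            the sketch runs without a real repo file */

static void push_lock(LockState state) {
    pthread_mutex_lock(&lock_mutex);
    if (state == LOCK_EXCLUSIVE) {
        exclusive_count++;
        /* Latch: any outstanding exclusive state makes the flock exclusive. */
        if (lock_fd >= 0) flock(lock_fd, LOCK_EX);
    } else {
        shared_count++;
        /* Only take the shared flock if nothing holds it exclusive. */
        if (exclusive_count == 0 && lock_fd >= 0) flock(lock_fd, LOCK_SH);
    }
    tl_stack[tl_depth++] = state;
    pthread_mutex_unlock(&lock_mutex);
}

static void pop_lock(void) {
    LockState state = tl_stack[--tl_depth]; /* thread-local: no mutex needed */
    pthread_mutex_lock(&lock_mutex);
    if (state == LOCK_EXCLUSIVE) {
        exclusive_count--;
        /* Stay exclusive until the last exclusive state is popped, then
           downgrade to shared if shared states remain. */
        if (exclusive_count == 0 && shared_count > 0 && lock_fd >= 0)
            flock(lock_fd, LOCK_SH);
    } else {
        shared_count--;
    }
    if (shared_count == 0 && exclusive_count == 0 && lock_fd >= 0)
        flock(lock_fd, LOCK_UN);
    pthread_mutex_unlock(&lock_mutex);
}
```

The two counters are what preserve the latching behavior across threads: the flock only drops back to shared once every thread's exclusive states have been popped.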

@cgwalters
Member

OK, from your original commit:

The lock also attempts to be thread safe by storing the lock state in
thread local storage with GPrivate. This means that each thread will
have an independent lock for each repository it opens. There are some
drawbacks to that, but it seemed impossible to manage the lock state
coherently in the face of multithreaded access.

But...the patch Alexander had here originally https://bugzilla.gnome.org/show_bug.cgi?id=759442#c2 was way simpler and, by virtue of being file based (like the sysroot lock), also works across processes, which is I think what we actually want.

Then you proposed this: #813 (comment)

And we merged it but...it seems to me we really want a simple read/write lock, i.e. what you get in-process with a GRWLock and out of process with flock() (as Alex's original patch did). Everything except prune grabs a read lock (even though they're doing writes; due to how ostree works those writes are nondestructive, we aren't deleting anything that could be read later).

In other words, I think the flaw originally here was thinking of the "read" lock as truly restricting to read operations, whereas I think we should interpret it to mean "can write new data, just not delete existing data".
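For reference, a minimal standalone demonstration of the flock() semantics being discussed. Each separately-opened file descriptor counts as an independent lock holder, which is also why opening one fd per thread gives per-thread locking:

```c
#define _DEFAULT_SOURCE
#include <errno.h>
#include <fcntl.h>
#include <stdlib.h>
#include <sys/file.h>
#include <unistd.h>

/* Returns 0 if flock() behaves as expected: many shared holders may
   coexist, an exclusive lock is refused while any shared holder remains,
   and it succeeds once they all unlock. */
int flock_demo(void) {
    char path[] = "/tmp/flock-demo-XXXXXX";
    int fd1 = mkstemp(path);
    int fd2 = open(path, O_RDONLY);
    int fd3 = open(path, O_RDONLY);
    if (fd1 < 0 || fd2 < 0 || fd3 < 0)
        return -1;

    /* Two holders can take the shared ("read") lock at once... */
    if (flock(fd1, LOCK_SH) != 0) return -1;
    if (flock(fd2, LOCK_SH) != 0) return -1;

    /* ...but a non-blocking exclusive lock fails while they hold it. */
    if (flock(fd3, LOCK_EX | LOCK_NB) != -1 || errno != EWOULDBLOCK)
        return -1;

    /* Once the shared holders drop, the exclusive lock succeeds. */
    flock(fd1, LOCK_UN);
    flock(fd2, LOCK_UN);
    if (flock(fd3, LOCK_EX | LOCK_NB) != 0) return -1;

    close(fd1); close(fd2); close(fd3);
    unlink(path);
    return 0;
}
```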

@dbnicholson
Member Author

But...the patch Alexander had here originally https://bugzilla.gnome.org/show_bug.cgi?id=759442#c2 was way simpler and, by virtue of being file based (like the sysroot lock), also works across processes, which is I think what we actually want.

Then you proposed this: #813 (comment)

And we merged it but...it seems to me we really want a simple read/write lock, i.e. what you get in-process with a GRWLock and out of process with flock() (as Alex's original patch did). Everything except prune grabs a read lock (even though they're doing writes; due to how ostree works those writes are nondestructive, we aren't deleting anything that could be read later).

The lock I implemented is flock at the core. All the complexity comes from managing it within a process or thread. In particular, I wanted something that was going to be easy to use both inside and outside ostree. I wanted to be able to liberally take the lock within the ostree code and not have to reason about whether something else had already taken the lock. This is why it's recursive. You can take the lock deep within the ostree code even if it was taken somewhere else in the ostree code or even outside the ostree code by the application. This is basically the source of the complexity. I initially was thinking about something like what Alex wrote until I realized that it would be hard to use correctly, even within ostree itself.

Likewise, since we want to use 2 states, there needs to be some tracking of what state the lock is in. This is particularly important when unwinding the recursion and figuring out what state to go back to. This is the other source of the complexity.

Now let's consider some of the locking types out there.

  • flock - Provides inter-process locking with 2 states. You can even make this provide inter-thread locking by opening a separate file descriptor per thread (this is what currently happens in ostree). However, it's not recursive, so on its own you end up having to use it in a global way like the sysroot lock or very carefully in a local way by reasoning about the current state of the lock.

  • GMutex - Provides inter-thread locking with a single state. This is missing several of the features I wanted - no inter-process locking, doesn't have 2 states, not recursive.

  • GRWLock - Like GMutex but has 2 states. No inter-process locking or recursion, though.

  • GRecMutex - Like GMutex but recursive. No inter-process locking and only single state, though.

So, what I want is the combination of flock and GRecMutex. That's how I ended up with what's there. However, while it is MT-safe by having a per-thread flock file descriptor, I actually want to have more coordination between the threads. I'd like the lock to be per-OstreeRepo but maintain the inter-process locking, 2 states and recursion.

I hadn't given the GMutex family of locks much thought at the time since I knew a mutex wouldn't be sufficient. However, now I think I'll read through them for inspiration. I'm actually really curious to see how GRecMutex is implemented.
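For what it's worth, a GRecMutex-style recursive mutex can be sketched on top of a plain mutex by tracking the owning thread and a recursion depth (a simplified illustration, not GLib's actual implementation, and it glosses over the memory-ordering details a real one handles):

```c
#include <assert.h>
#include <pthread.h>

typedef struct {
    pthread_mutex_t mutex;
    pthread_t owner;  /* valid only while depth > 0 */
    int depth;        /* how many times the owner has locked */
} RecMutex;

void rec_mutex_init(RecMutex *m) {
    pthread_mutex_init(&m->mutex, NULL);
    m->depth = 0;
}

void rec_mutex_lock(RecMutex *m) {
    pthread_t self = pthread_self();
    /* If we already own it, just bump the depth; only the owner writes
       these fields while depth > 0, so this check is safe for the owner. */
    if (m->depth > 0 && pthread_equal(m->owner, self)) {
        m->depth++;
        return;
    }
    pthread_mutex_lock(&m->mutex);
    m->owner = self;
    m->depth = 1;
}

void rec_mutex_unlock(RecMutex *m) {
    assert(m->depth > 0 && pthread_equal(m->owner, pthread_self()));
    if (--m->depth == 0)
        pthread_mutex_unlock(&m->mutex);
}
```

The owner/depth bookkeeping here plays the same role as the per-thread stack plus counters described above: the underlying lock is only released once the recursion fully unwinds.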

In other words, I think the flaw originally here was thinking of the "read" lock as truly restricting to read operations, whereas I think we should interpret it to mean "can write new data, just not delete existing data".

The code always refers to it by the flock names - shared and exclusive. As did Alex's patch. But I agree there has been some confusion about what those mean. However, I don't think it matters what the states are called, just that you have more than 1 state and you know what they refer to. In ostree we could refer to this as a creator/destroyer lock or something.

dbnicholson added a commit to dbnicholson/ostree that referenced this issue Apr 16, 2021
Previously each thread maintained its own lock file descriptor
regardless of whether the thread was using the same `OstreeRepo` as
another thread. This was very safe but it made certain multithreaded
procedures difficult. For example, if a main thread took an exclusive
lock and then spawned worker threads, it would deadlock if one of the
worker threads tried to acquire the lock.

This moves the file descriptor from thread local storage to the
`OstreeRepo` structure so that threads using the same `OstreeRepo` can
share the lock. Some bookkeeping is needed to keep track of the lock
state when recursing and a mutex guards against threads altering the
state concurrently.

The required GLib is bumped to 2.44 to allow usage of `GMutexLocker`.

Fixes: ostreedev#2344