Split RW and RO Layer locks per method #473

saschagrunert · 2019-11-28T15:17:21Z

The new storage options ReadOnlyLayers indicates that the
layer store should be locked for read only access. This allows use cases
like listing container images while pushing an image to a remote
location in parallel.

Closes #420

saschagrunert · 2019-11-29T08:07:19Z

This is now ready for a first review, I tried to implement with as low overhead as possible.

rhatdan · 2019-11-29T12:15:48Z

Tests are failing?

saschagrunert · 2019-11-29T12:21:36Z

Tests are failing?

Yes, thanks for the hint.

My main concern right now is that we give the responsibility over "locking the right way" to the consumer of the API. But since the option is off per default and not serialized into the config, it should be fine...

rhatdan · 2019-11-29T12:34:23Z

LGTM
Need @nalind @vrothberg to review. Since they have played with locks a lot more then I have.
@mheon PTAL

vrothberg · 2019-12-02T09:04:29Z

Aren't we having a meeting for this topic on Friday?

saschagrunert · 2019-12-02T09:09:08Z

Aren't we having a meeting for this topic on Friday?

The intention of the meeting was more about locking enhancements per layer, see: #421

vrothberg

I am not yet convinced about a dedicated option, mostly because I am scared of the increasing complexity of locking and the locking paths. But also reading the commit message

"This allows use cases like listing container images while pushing an image to a remote location in parallel."

I don't completely understand why we need a dedicated option; I'd expect read-only functions (e.g., listing containers) to always acquire reader locks.

Considering #420 (comment) and the comments in this issue, I begin to think that new{RO,}LayerStore shouldn't lock at all - it doesn't add up to me to have this critical section conditionally as writer or reader since both execute the same code.

Maybe I just need a fresh coffee :^) In any case, @nalind is the expert of this code, so I want him to weigh in.

layers.go

store.go

saschagrunert · 2019-12-11T11:30:15Z

Ready for a new round of review, @vrothberg @nalind @rhatdan PTAL

vrothberg · 2019-12-11T13:35:38Z

Ready for a new round of review, @vrothberg @nalind @rhatdan PTAL

Can you explain what the code is intended to be doing now (and ideally update the commit message)? Why can't we drop the locks (see #473 (comment)) causing parts of the problem?

saschagrunert · 2019-12-12T12:50:24Z

Why can't we drop the locks (see #473 (comment)) causing parts of the problem?

We can, but we change the behavior of the cleanup on init. I don't think that this is harmful and created a dedicated PR for that, since it is not directly related to this topic: #493

rhatdan · 2020-01-15T11:21:15Z

@saschagrunert @vrothberg @nalind Bringing this back up for discussion. Where are we on this now? Should we go forward with this?

rhatdan · 2020-01-31T00:40:00Z

@saschagrunert @vrothberg PING...

vrothberg · 2020-01-31T07:21:38Z

We should reevaluate how far #493 brought us.

saschagrunert · 2020-01-31T09:03:30Z

Yes some details have change but the main approach would be the same. I ping you here when this is ready for review.

saschagrunert · 2020-02-06T11:15:22Z

containers/podman#4660 is now working, PTAL @rhatdan @vrothberg @nalind

rhatdan

LGTM

rhatdan · 2020-02-07T13:01:57Z

@vrothberg @nalind PTAL

vrothberg

I really miss a detailed explanation why we need to go this path but cannot optimize reading code paths. Does setting ForceReadOnlyLayers prevent us from writing? If I set ForceReadOnlyLayers=true, I should not be able to write data. Can we give this guarantee?

It somehow feels like a symptomatic fix for a deeper issue in c/storage. Given we have a database, I expect reading accesses to proceed unless the specific data is currently being written to. Following this analogy, we are now marking the database as RO entirely because for some reason it doesn't perform well. So if I want to write to it at a later point, I had to create a new object which allows for writes without data corruption.

rhatdan · 2020-02-07T16:27:44Z

My understanding is, right now if to users of the storage want to use it at the same time, the second one in, needs to wait til the first is done. In the new design, if both users want to READ, then they are allowed in, and only writers would block. So a writer can not write while a reader exists. And readers can not read when a writer holds the lock.

I guess there is a chance that writers would get frozen out if a continuous stream of readers come in.

vrothberg · 2020-02-07T16:44:43Z

My understanding is, right now if to users of the storage want to use it at the same time, the second one in, needs to wait til the first is done. In the new design, if both users want to READ, then they are allowed in, and only writers would block. So a writer can not write while a reader exists. And readers can not read when a writer holds the lock.

The main issue IMHO is that we decide the READ or WRITE access when creating the store and try to enforce that. So we are not fixing/optimizing the code paths but change the nature of the locks altogether. This only works for short-lived store objects (see https://github.com/containers/libpod/pull/4660/files) and is not generally useful for CRI-O (or the podman system service), for instance, where the store is not created for each request but is a long-lived object.

Assume I want to access the store as a reader: I have to create a new object, then do my work and subsequent writing accesses would require to create a new object. I would prefer to make c/storage smart enough to figure out when we need to acquire the locks as a reader or writer (and further localize the bottlenecks). It really feels like a symptomatic patch while the underlying bottlenecks in c/storage remain.

rhatdan · 2020-02-07T17:38:00Z

Ok so Nalin and I talked about this further, and the problem is each Runtime when it starts takes a Read/Write lock, even if they do not need it. So when you are pushing the image, you only have a read/lock, but when podman images (buildah images) starts it wants to create a read/write lock initially and then waits until podman push is done. Wouldn't a better solution be to take a read/only lock and then when I call into a storage method that requires read/write, the method checks if the lock is read/write and it takes it, if not it ReOpens the lock or Recreates the lock as a read/write lock and then locks it.

vrothberg · 2020-02-10T07:58:23Z

Wouldn't a better solution be to take a read/only lock and then when I call into a storage method that requires read/write, the method checks if the lock is read/write and it takes it, if not it ReOpens the lock or Recreates the lock as a read/write lock and then locks it.

I preferred this ^ approach, yes. It's transparent to consumers of c/storage and seems to tackle the root cause.

rhatdan · 2020-10-30T09:51:41Z

@kolyshkin I would love to have you review this PR.

store.go

saschagrunert · 2020-10-30T21:33:22Z

I'm kind of in a trial and error mode now with the Podman PR containers/podman#7603. I need the integration/system tests there to see if it works, because the issue is not reproducible on my machine.

I'll keep you posted, this can take some iterations.

saschagrunert · 2020-11-02T15:54:32Z

The Podman PR as well as this one is now green, but I have to verify that my changes logically make sense…
containers/podman#7603

rhatdan · 2020-11-03T10:41:55Z

@saschagrunert Just to move the boudary a little further, could you try a PR against CRI-O to see how that goes.

Signed-off-by: Sascha Grunert <[email protected]>

saschagrunert · 2020-11-03T10:48:56Z

@saschagrunert Just to move the boudary a little further, could you try a PR against CRI-O to see how that goes.

Yep, opened the PR there: cri-o/cri-o#4339

rhatdan · 2020-11-03T14:19:26Z

@vrothberg @nalind @TomSweeneyRedHat @mtrmac @mrunalp This PR is now passing tests in Podman and CRI-O, I believe it is time to give it another review.

akostadinov · 2020-12-18T16:26:56Z

@saschagrunert , is your work intended to fix issues like:
containers/podman#8667

layers.go

store.go

Accessing layers in RO or RW mode is now decided in the dedicated method which increases the flexibility in terms of locking. This allows use cases like listing container images while pushing an image to a remote location in parallel. Signed-off-by: Sascha Grunert <[email protected]> Signed-off-by: Sascha Grunert <[email protected]>

saschagrunert

@kolyshkin I fixed your mentioned review points. PTAL

akostadinov · 2021-01-25T20:47:39Z

@saschagrunert , sorry for double posting but I think it is important to know whether your PR is suppose to fix issues like containers/podman#8667

If not, then could you clarify what use cases should be considered supported?

mtrmac

Overall, as a stranger to the codebase, I’d appreciate fairly detailed comments about the design: What do the various locks protect, what is authoritative data vs. caches, and so on and so on.

The concurrent maintenance of two stores seems risky to me, see elsewhere. Wouldn’t it make sense to only have an one-directional “upgrade”, where the store either has a read-only layer store, or replaces it with a read-write, but never creates a read-write one again?
Having two stores means two caches = two reads of the data. Is that really worth it?
If there has to be a separate ShutdownRO and the caller is expected to call it to benefit from it (and the code breaks if the caller calls RO after updates!), this is not completely transparent anyway. How are the library callers going to use it, then? If the code that creates/shuts down the storage.Store must know whether the store is RO or RW, it would seem much cleaner to me to have that code create the storage.Store in a RO or RW mode (and then c/storage itself can be in control of its invariants, e.g. refuse to do any write operations on a RO store). (Yes, I have noticed that this was considered about a year ago; as I said at the start, I’m a stranger to the code base and my opinion should not be authoritative — I can’t really understand the CRI-O argument in there, for example.)

mtrmac · 2021-01-25T21:59:51Z

layers.go

@@ -1470,7 +1455,8 @@ func (r *layerStore) Modified() (bool, error) {
 }

 func (r *layerStore) IsReadWrite() bool {
-	return r.lockfile.IsReadWrite()
+	// the mountsLockfile is only set for read-write enabled stores.


(This is preexisting = non-blocking:) This looks like a important invariant; it should be documented with the structure field.

mtrmac · 2021-01-25T22:08:00Z

layers.go

@@ -524,27 +530,6 @@ func (s *store) newLayerStore(rundir string, layerdir string, driver drivers.Dri
 	return &rlstore, nil
 }

-func newROLayerStore(rundir string, layerdir string, driver drivers.Driver) (ROLayerStore, error) {
-	lockfile, err := GetROLockfile(filepath.Join(layerdir, "layers.lock"))


newLayerStore as a replacement for this will always create a RW lockfile. Is that desirable? It changes semantics of r.lockfile.RecursiveLock == r.RecursiveLock, which seems worrisome.

OTOH

storage/pkg/lockfile/lockfile.go

Lines 92 to 100 in 9f88983

if locker, ok := lockfiles[cleanPath]; ok {

if ro && locker.IsReadWrite() {

return nil, errors.Errorf("lock %q is not a read-only lock", cleanPath)

}

if !ro && !locker.IsReadWrite() {

return nil, errors.Errorf("lock %q is not a read-write lock", cleanPath)

}

return locker, nil

}

does not support having both kinds of lock for a single path.

mtrmac · 2021-01-25T22:09:07Z

layers.go

@@ -1470,7 +1455,8 @@ func (r *layerStore) Modified() (bool, error) {
 }

 func (r *layerStore) IsReadWrite() bool {
-	return r.lockfile.IsReadWrite()
+	// the mountsLockfile is only set for read-write enabled stores.
+	return r.lockfile.IsReadWrite() && r.mountsLockfile != nil


r.lockfile.IsReadWrite is always true now, AFAICS.

mtrmac · 2021-01-25T22:20:11Z

store.go

 func (s *store) LayerStore() (LayerStore, error) {
-	s.graphLock.Lock()
-	defer s.graphLock.Unlock()


Previously, this lock protected all of LayerStore, including s.getGraphDriver; now the lock is scoped to resetIfNeeded; why is that sufficient? (It might be, but it’s not at all obvious to me — the lock’s scope is not documented.)

mtrmac · 2021-01-25T22:28:58Z

layers.go

@@ -491,7 +491,10 @@ func (r *layerStore) saveMounts() error {
 	return r.loadMounts()
 }

-func (s *store) newLayerStore(rundir string, layerdir string, driver drivers.Driver) (LayerStore, error) {
+// newLayerStore creates a new layer storage based on the provided directories
+// and selected driver. The `withMountsLock` bool indicates if we want to


Is it really withMountsLock or readWrite? The latter seems semantically more useful, and might e.g. allow skipping the MkdirAll calls below.

mtrmac · 2021-01-25T22:36:23Z

store.go

@@ -582,6 +593,7 @@ type store struct {
 	autoNsMaxSize   uint32
 	graphDriver     drivers.Driver
 	layerStore      LayerStore


Name this rwLayerStore? Possibly document the fields, even if they are “a private cache for $method”.

mtrmac · 2021-01-25T22:42:52Z

store.go

@@ -582,6 +593,7 @@ type store struct {
 	autoNsMaxSize   uint32
 	graphDriver     drivers.Driver
 	layerStore      LayerStore
+	roLayerStore    ROLayerStore
 	roLayerStores   []ROLayerStore


Suppose the user calls:

s.Names(foo) // Populates roLayerStore, reads from disk

s.PutLayer(…) // Populates layerStore, reads + writes to disk

s.Names(foo)

What happens now? AFAICS:

s.graphLock.TouchedSince is false (no .Shutdown*), so the cached roLayerStore is used

The RO and RW layer stores share lock objects (

storage/pkg/lockfile/lockfile.go

Lines 92 to 100 in 9f88983

if locker, ok := lockfiles[cleanPath]; ok {

if ro && locker.IsReadWrite() {

return nil, errors.Errorf("lock %q is not a read-only lock", cleanPath)

}

if !ro && !locker.IsReadWrite() {

return nil, errors.Errorf("lock %q is not a read-write lock", cleanPath)

}

return locker, nil

}

), so regardless of any .Touch() calls on locks, .Modified() will be false

The RO layer store will continue reading from its cached data, regardless of modifications by the RW store.

(This is only by reading code I’m unfamiliar with, so I may well be missing something or making a stupid mistake.)

saschagrunert · 2021-01-26T08:06:36Z

@saschagrunert , sorry for double posting but I think it is important to know whether your PR is suppose to fix issues like containers/podman#8667

If not, then could you clarify what use cases should be considered supported?

The main use case was to be able to list container images while another push operation in in progress. I think containers/podman#8667 could be solved when locking by layer and not the whole storage. But as folks already mentioned, this would mean a architectural rewrite of containers/storage and I'm not sure if we will get to this point anytime soon.

saschagrunert · 2021-01-26T08:30:16Z

The concurrent maintenance of two stores seems risky to me, see elsewhere. Wouldn’t it make sense to only have an one-directional “upgrade”, where the store either has a read-only layer store, or replaces it with a read-write, but never creates a read-write one again?

I think if we put API compatibility aside for now, then we could create a store which is ready-only locked per default. Just some operations should then be able to lock the certain parts of the store. For example, storing a layer should only RW lock that part of the store. Other operations (read-only or read-write which do not affect the WIP layers) should be still possible.

The API compatibility is the main burden here. c/storage is designed to cleanup the layers after storage shutdown, which will always lock the complete layer store for example.

Having two stores means two caches = two reads of the data. Is that really worth it?

No I don't think so and agree with your concern. The only reason to implement it like this is to stay compatible to current consumers of this project.

If there has to be a separate ShutdownRO and the caller is expected to call it to benefit from it (and the code breaks if the caller calls RO after updates!), this is not completely transparent anyway. How are the library callers going to use it, then?

Manually, library consumers have to decide on their own if they want to shutdown with cleanup or without. For example a podman ps would not have a need for a layer cleanup. I know that is a major drawback of the proposal.

All in all I think we will have a hard time keeping the current API stable while improving the locking. We could probably start with a parallel implementation (which breaks the API) and integrate the enhancements mentioned by @mtrmac.

If we speak only about refactoring the layer store, then the main strategy would be:

Redesign a locker which is able to lock a single layer
Use only one layer store for c/storage which is using the new locking (read-only per default)
Decide which operations will automatically and partially rw lock the layer store
Cleanup the store based on its usage (aka track with a flag if the store needs cleanup).

What do you think?

vrothberg · 2021-01-26T08:46:37Z

Redesign a locker which is able to lock a single layer

Fine-grained locking is very difficult to get right and a common source of concurrency bugs. Having a per-layer lock, for instance, is subject to ABBA dead locks (assuming a process is allowed to lock multiple layers at once).

One thing we must consider as well is backward compatibility. If we fundamentally change how c/storage works, older and newer versions cannot run on the same machine.

I prefer to keep the coarse-grained locks unless there is no alternative.

There may still be more space for optimization (see #420 (comment)) which does not entirely rule out lock monopolization.

mtrmac · 2021-01-26T17:45:21Z

I think if we put API compatibility aside for now, then we could create a store which is ready-only locked per default. Just some operations should then be able to lock the certain parts of the store. For example, storing a layer should only RW lock that part of the store. Other operations (read-only or read-write which do not affect the WIP layers) should be still possible.

What happens with the current consistency fix-ups that write data on load (see shouldSave in layerStore.Load and imageStore.Load at least)? Assuming we don’t somehow modify the code to transparently read inconsistent data everywhere (which might be possible, but seems difficult to make reliable), any operation can obtain a write lock even if the caller “just wants to read something”. That suggests to me that all of this probably could be made API-transparent: keep the existing API, and just lock only for reading if necessary…

… except that, well, that’s what the code is already doing: store.go calls store.RLock() for read-only operations.

So… what’s the actual failure path this PR is trying to address? That layerStore.Load() uses r.lockfile.Lock(), which is read-write, and that always blocks on a concurrent push? Couldn’t that be a fairly local fix, where layerStore.Load() tries to load with a RO lock only, and on consistency failure it gives that lock up, and tries again with RW lock?

vrothberg · 2021-03-02T13:26:10Z

I opened containers/image#1167 which is tackling the issue with a different perspective.

Ultimately, solving the issue in storage is favorable but as shown here, it's very difficult. If callers need to worry about executions paths reading or writing data is likely causing future regressions as it's hard to get right.

containers/image#1167 makes pushing a bit slower but it solves the problem of being network bound until unlocking the storage.

saschagrunert mentioned this pull request Nov 28, 2019

Rootless runs execute storage.GetStore() twice (locking issue) containers/podman#4591

Closed

saschagrunert changed the title ~~WIP: Add ReadOnly storage option~~ Add ReadOnly storage option Nov 29, 2019

saschagrunert changed the title ~~Add ReadOnly storage option~~ Add ReadOnlyLayers storage option Nov 29, 2019

vrothberg reviewed Dec 2, 2019

View reviewed changes

layers.go Outdated Show resolved Hide resolved

store.go Outdated Show resolved Hide resolved

saschagrunert changed the title ~~Add ReadOnlyLayers storage option~~ WIPAdd ReadOnlyLayers storage option Dec 9, 2019

saschagrunert changed the title ~~WIPAdd ReadOnlyLayers storage option~~ WIP: Add ReadOnlyLayers storage option Dec 9, 2019

saschagrunert mentioned this pull request Dec 9, 2019

WIP: Add ability to list images on push containers/podman#4660

Closed

saschagrunert commented Dec 11, 2019

View reviewed changes

store.go Show resolved Hide resolved

saschagrunert changed the title ~~WIP: Add ReadOnlyLayers storage option~~ Add ReadOnlyLayers storage option Dec 11, 2019

rhatdan approved these changes Feb 6, 2020

View reviewed changes

vrothberg reviewed Feb 7, 2020

View reviewed changes

rhatdan approved these changes Oct 30, 2020

View reviewed changes

TomSweeneyRedHat reviewed Oct 30, 2020

View reviewed changes

store.go Show resolved Hide resolved

saschagrunert added a commit to saschagrunert/cri-o that referenced this pull request Nov 3, 2020

Vendor containers/storage#473 for testing

213d022

Signed-off-by: Sascha Grunert <[email protected]>

saschagrunert mentioned this pull request Nov 3, 2020

[WIP] Vendor https://github.com/containers/storage/pull/473 for testing cri-o/cri-o#4339

Closed