
open question about race with kernel that liburing had to fix #197

Closed · FrankReh opened this issue Feb 14, 2023 · 12 comments · Fixed by #281
Comments

@FrankReh
Contributor

I found this commit to liburing some time ago. It may have no bearing on this repo, because the atomic code is necessarily different anyway, but I didn't want to lose track of it before I or someone else had made sure.

axboe/liburing@744f415

@quininer
Member

quininer commented Feb 15, 2023

I suspect that using SeqCst is not the most reasonable solution, but I'm willing to follow liburing.

@HeroicKatora

This comment was marked as outdated.

@HeroicKatora
Contributor

After having another think about this and re-re-reading the diagrams in the original issue, I don't think I have really understood the cross-language atomic semantics properly yet. That might mean Rust does not have the proper tools to solve this. Sorry for the noise.

A newer kernel which I was reading does appear to be using an extra acquire fence. However, it establishes a different order than the one required for AcqRel to definitely be sufficient: https://github.com/torvalds/linux/blob/a93289b830ce783955b22fbe5d1274a464c05acf/io_uring/sqpoll.c#L334-L352

@almogkh

almogkh commented Apr 27, 2024

I had a look at the code here, and the same bug is present as in liburing/the kernel. The code here is almost the same as in liburing; the only difference is that liburing uses the C/C++11 memory model whereas Rust follows C++20, but the differences between the two are not relevant to this bug.

A newer kernel which I was reading does appear to be using an extra acquire fence. However, it establishes a different order than the one required for AcqRel to definitely be sufficient

The code you link to is using a full memory barrier, like the documentation mentioned in your previous comment describes, not an acquire barrier. That smp_mb__after_atomics wasn't there originally; I added it in a patch that I sent at the same time as the patch to liburing. That full barrier is meant to synchronize with a full barrier in the user application (in this case, a SeqCst fence).

As for why SeqCst is needed exactly, that requires diving a little into the C++20 standard and making some assumptions about the kernel's memory model (like assuming that smp_mb is equivalent to a SeqCst fence for our purposes). It's crucial that when the application submits new work to the kernel, it first writes to the tail field and only then reads from the flags field. Without a SeqCst fence, nothing in the C++20/Rust memory model prevents the write from being reordered with the read.
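To make that concrete, here is a minimal Rust sketch of the submission-side pattern; the function and field names are illustrative rather than this crate's actual internals, and the flag constant mirrors the kernel UAPI value:

    use std::sync::atomic::{fence, AtomicU32, Ordering};

    const IORING_SQ_NEED_WAKEUP: u32 = 1 << 0; // mirrors the kernel UAPI flag

    // Hypothetical submission-side sketch: publish the new tail, then check
    // whether the SQPOLL thread must be woken.
    fn publish_and_check(sq_tail: &AtomicU32, sq_flags: &AtomicU32, new_tail: u32) -> bool {
        // Write to the tail field first, making the new SQEs visible.
        sq_tail.store(new_tail, Ordering::Release);

        // Without this SeqCst fence, the store above and the load below target
        // different locations, and nothing in the C++20/Rust model forbids them
        // from being reordered.
        fence(Ordering::SeqCst);

        // Only now read the flags field to decide whether a wakeup is needed.
        sq_flags.load(Ordering::Relaxed) & IORING_SQ_NEED_WAKEUP != 0
    }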

@HeroicKatora
Contributor

HeroicKatora commented Apr 27, 2024

"making some assumptions" of the equivalence of these two operations is the confusing part (and this comment is the first I see that assumption spelled out as such precisely). Quite obviously Release and AcqRel only synchronize around a write-then-read pair but here we need to synchronize on the inverse. We care about the visibility of events that happened before the read: something similar to a non-existent load(release) order. There's no way to achive these with Acquire and Release semantics in combination.

Under the assumption that SeqCst interacts with smp_mb, the fix is indeed sufficient.

This might be owed to the kernel's different approach to that terminology. The descriptions of its ordering terminology focus more on _how is X observed by other threads_, whereas the C/C++/Rust model operates on _what other things can our thread observe_ (thus requiring the operations on the kernel thread to be described in order to make _any_ kind of statement about it). My confusion stems from the fact that [`smp_mb__after_atomics` ends up defaulting in the macro to an `acquire` fence](https://github.com/torvalds/linux/blob/5eb4573ea63d0c83bf58fb7c243fc2c2b6966c02/include/linux/atomic.h#L43).

To put it into the semantic model for Rust and visualize happens-before relationships:

    Kernel                   User
    K1: store(flags, WAKE)   U1: store(head)
    |                        |
    K2: smp_mb()             U2: fence(SeqCst)
    |                        |
    V                        V
    K3: load(head)           U3: load(flags)

If K2 precedes U2, then K1 > K2 > U2 > U3 and user space observes the flag, whereas if U2 precedes K2, then U1 > U2 > K2 > K3 and the kernel observes the updated head. Since there's a synchronizing ordering relationship between any pair of SeqCst operations, including fences, this provides the necessary correctness guarantee.

As an aside to vindicate AcqRel a little bit, replacing the load with flag.fetch_xor(0, AcqRel) is also sufficient, for similar reasons: the writes then generate one of the two happens-before sequences through the operations on the memory location of flags. But that actual write pessimizes codegen quite a bit, specifically on weakly ordered architectures such as ARM (amusingly, on x86 LLVM generates precisely the same mfence; mov sequence).

It's surprising that a SeqCst fence is sufficient here, but it is. That kind of non-deterministic correctness guarantee is quite unexpected on its own, no? (Do you happen to know of any good CS paper on the transfer of invariants in this manner? The responsibility of waking might as well be some other affine state. It's fascinating.)

Expanding on the red-herring solution of fetch_xor

As per your quoted section: there is an ordering relationship between two atomic writes to the same memory location ("in the modification order of M") in any case. The problem with applying it is that, in the code as written, each side performs only one atomic write, and to separate locations, so that portion can't be used for reasoning. This is what fetch_xor would solve: it introduces a second write, which allows a case distinction over the order of those writes, just like a pair of SeqCst fences. If the kernel's RMW K1 reads the modification made by the RMW U3, then there is a Release-Acquire pair through M = flags, and this brings the U1 > U3 chain with it, so the kernel will also read the head update. If, on the other hand, the other modification order occurs, the atomic xor already reads the wake bit.
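For completeness, the red-herring fetch_xor variant spelled out as a Rust sketch (illustrative names; correct for the reasons above, but as noted it pessimizes codegen on weakly ordered architectures):

    use std::sync::atomic::{AtomicU32, Ordering};

    const IORING_SQ_NEED_WAKEUP: u32 = 1 << 0; // mirrors the kernel UAPI flag

    // Hypothetical alternative: a no-op read-modify-write on the flags word.
    // Being a write to the same location the kernel writes, it participates in
    // the modification order of flags, which is what enables the case
    // distinction above.
    fn need_wakeup_via_rmw(sq_flags: &AtomicU32) -> bool {
        sq_flags.fetch_xor(0, Ordering::AcqRel) & IORING_SQ_NEED_WAKEUP != 0
    }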

@FrankReh
Contributor Author

I had a look at the code here and the same bug is present here as in liburing/the kernel.

Is there a fix for tokio-rs/io-uring that you think presents itself, or are you saying the API, as it stands now, should come with an unsafe method with a comment saying, in effect, that the application must perform the SeqCst fence between its last tail field write and its next flags field read?

I've been involved in other work since around the time this issue was raised in liburing, so I'm not in a position to test; plus, I wouldn't know how to test for the correctness of a memory issue like this. I wonder if anyone has been bitten by this (momentarily) but would not have noticed, because if their read was handled prematurely, they would have another read happening later anyway?

@almogkh

almogkh commented Apr 27, 2024

Is there a fix for tokio-rs/io-uring that you think presents itself, or are you saying the API, as it stands now, should come with an unsafe method with a comment saying, in effect, that the application must perform the SeqCst fence between its last tail field write and its next flags field read?

I think you can apply the same fix here that was used in liburing: in submit[_...], if SQPOLL is enabled, insert a SeqCst fence before calling sq_need_wakeup. Inside sq_need_wakeup itself, the load can be replaced with a relaxed load since the acquire is not necessary.
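A rough sketch of what that fix could look like (names are illustrative and do not necessarily match the code that was eventually merged):

    use std::sync::atomic::{fence, AtomicU32, Ordering};

    const IORING_SQ_NEED_WAKEUP: u32 = 1 << 0; // mirrors the kernel UAPI flag

    // Inside sq_need_wakeup, the load itself can be Relaxed.
    fn sq_need_wakeup(sq_flags: &AtomicU32) -> bool {
        sq_flags.load(Ordering::Relaxed) & IORING_SQ_NEED_WAKEUP != 0
    }

    // In the submit path, when SQPOLL is enabled: fence first, then check.
    // The fence is intended to pair with the kernel's full barrier (smp_mb)
    // between setting NEED_WAKEUP and re-reading the SQ tail.
    fn sqpoll_wakeup_required(sq_flags: &AtomicU32, sqpoll_enabled: bool) -> bool {
        if !sqpoll_enabled {
            return false;
        }
        fence(Ordering::SeqCst);
        sq_need_wakeup(sq_flags)
    }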

As an aside to vindicate AcqRel a little bit, replacing the load with flag.fetch_xor(0, AcqRel) is also sufficient for similar reasons[...]

AcqRel is not sufficient because acquire and release are used to create happens-before relations via the synchronizes-with relation, but as far as I can see there is no way to create a happens-before relation between the kernel's store to the flags field and the application's load of the flag, even if the load uses flag.fetch_xor(0, AcqRel). On x86, atomic read-modify-write operations always act like full barriers, but that's not the case on all architectures.

The actual reason SeqCst solves the problem is through the coherence-ordered before relation and the constraints on the single total order S of all SeqCst operations:

An atomic operation A on some atomic object M is coherence-ordered before another atomic operation B on M if [...]
— A and B are not the same atomic read-modify-write operation, and there exists an atomic modification X of M such that A reads the value stored by X and X precedes B in the modification order of M [...]

There is a single total order S on all memory_order::seq_cst operations, including fences, that satisfies the following constraints. [...] Second, for every pair of atomic operations A and B on an object M, where A is coherence-ordered before B, the following four conditions are required to be satisfied by S [...]
— if a memory_order::seq_cst fence X happens before A and B happens before a memory_order::seq_cst fence Y, then X precedes Y in S.

From the above, if neither side sees the other's write (the kernel doesn't see the store to tail and the application doesn't see the store to flags), then each side's read is coherence-ordered before the other side's write, which means each side's SeqCst fence must precede the other's in the total order S, which is impossible. This means that at least one side must see the other's write.

@HeroicKatora
Contributor

HeroicKatora commented Apr 27, 2024

Never mind the AcqRel; it's admittedly a stupid solution that shouldn't be necessary, and I've omitted the details on purpose. (Though: edited in above, discuss in another thread if you want¹) The kernel should just ensure that fence(SeqCst) and smp_mb are compatible, and in practice it'll only be sensible to do that. I don't think any of the rest of your comment disagrees with the explanation, already provided in the details comment above, of the two concrete possible causal-relationship chains ("coherence order" / "happens-before" is common in CS literature) between the annotated RMW operations. Thank you for providing the direct quotation of the relevant sections. Proof by contradiction appears to work just as well as the proof by case distinction ("There is a single total order S").

So yes, I've been convinced enough that this library should be using SeqCst when the wake flag must be observed reliably. I don't think the operation without a fence needs to be unsafe, though. There's no safety problem associated with this failure to synchronize; blocking forever is memory safe as far as Rust is concerned. I agree it shouldn't be the default behavior and deserves expert treatment; providing a separate sq_need_wakeup_with_intermittent_seqcst might work?

@FrankReh
Contributor Author

Big thanks to you for looking into this again. In the past year, I've gained a new appreciation for just how many crates are building their io_uring features off of this one. The Rust on Linux community really benefits from getting this right.

Can I reopen this just long enough to raise a related question given the work and PR you-all did yesterday?

There are these two reads now: https://github.com/tokio-rs/io-uring/blob/4e52bca1f2c3ee24b2a51f4b5d9f3d8652f26cab/src/submit.rs#L57C5-L72C1

    /// Whether the kernel thread has gone to sleep because it waited for too long without
    /// submission queue entries.
    #[inline]
    fn sq_need_wakeup(&self) -> bool {
        unsafe {
            (*self.sq_flags).load(atomic::Ordering::Relaxed) & sys::IORING_SQ_NEED_WAKEUP != 0
        }
    }

    /// CQ ring is overflown
    fn sq_cq_overflow(&self) -> bool {
        unsafe {
            (*self.sq_flags).load(atomic::Ordering::Acquire) & sys::IORING_SQ_CQ_OVERFLOW != 0
        }
    }

Without a comment explaining why, it is glaring that they perform the loads with different atomic orderings.

Fair to say the three of you know this stuff way better than I do, so I'm just in a position to ask the question: is the second one Acquire rather than Relaxed because no fence is required for it?

These two functions are not public, so their documentation or comments don't have to say much, but perhaps, while this context is still fresh for you, one of you could suggest a comment for one or the other that explains why their load instructions are different?

@almogkh

almogkh commented Apr 28, 2024

The load in sq_need_wakeup is relaxed because we are not acquiring anything and the kernel's setting of the NEED_WAKEUP flag is not done using a release operation. The SeqCst fence that precedes the call to sq_need_wakeup gives us all the synchronization we need since the only constraint on that code is that if we submitted new work right as the kernel's SQPOLL thread went to sleep, we need to be able to see that and wake it up.

I believe the load in sq_cq_overflow can also be relaxed since, again, we are not acquiring anything and the kernel doesn't set that flag with a release store. That flag indicates that the CQ overflowed, and in order to get the extra CQEs we need to enter the kernel with GETEVENTS. This will cause the kernel to flush the overflowed entries to the CQ, after which it will perform a release store to the CQ's tail. When we later try to get completions, we acquire the CQ's tail, resulting in correct synchronization.

Edit:
FWIW, liburing's sq_cq_overflow equivalent also uses a relaxed load.
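Spelled out as a Rust sketch (illustrative names; the flag value mirrors the kernel UAPI):

    use std::sync::atomic::{AtomicU32, Ordering};

    const IORING_SQ_CQ_OVERFLOW: u32 = 1 << 1; // mirrors the kernel UAPI flag

    // No Acquire needed here: the flag only tells us to enter the kernel with
    // IORING_ENTER_GETEVENTS. The kernel then flushes the overflowed CQEs and
    // performs a release store to the CQ tail; our later Acquire load of that
    // tail is what actually synchronizes the completion data.
    fn sq_cq_overflow(sq_flags: &AtomicU32) -> bool {
        sq_flags.load(Ordering::Relaxed) & IORING_SQ_CQ_OVERFLOW != 0
    }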

@FrankReh
Contributor Author

Thank you for these details. At least for now, it seems much clearer to me.

@quininer Would you be amenable to a PR for the sq_cq_overflow function being relaxed?

@quininer
Member

@FrankReh of course.

FrankReh added a commit to FrankReh/io-uring that referenced this issue Apr 29, 2024
Change the atomic read of the SQ flags field to be `relaxed`.

Refer to discussions in tokio-rs#197 and in particular, as pointed out in tokio-rs#197,
refer to the liburing library loads of this same field.

The liburing library names this field sq.kflags and all its atomic loads
are performed with their IO_URING_READ_ONCE macro, which it defines
to be the `relaxed` atomic load.
quininer pushed a commit that referenced this issue Apr 30, 2024
Change the atomic read of the SQ flags field to be `relaxed`.

Refer to discussions in #197 and in particular, as pointed out in #197,
refer to the liburing library loads of this same field.

The liburing library names this field sq.kflags and all its atomic loads
are performed with their IO_URING_READ_ONCE macro, which it defines
to be the `relaxed` atomic load.