This repository has been archived by the owner on Nov 5, 2018. It is now read-only.

Use optimistic reads in AtomicCell #39

Merged · 4 commits · Aug 9, 2018

Conversation

@ghost commented Jul 31, 2018

When using global locks, two concurrent load operations on the same AtomicCell will contend on the same lock, which severely hurts performance. This PR changes the load operations to use optimistic reads instead. The idea is basically the same as in the seqlock crate.

Benchmarks:

 name                    before ns/iter  after ns/iter  diff ns/iter   diff %  speedup
 compare_and_swap_u8     9               8                        -1  -11.11%   x 1.12
 compare_and_swap_usize  6               6                         0    0.00%   x 1.00
 concurrent_load_u8      25,239,685      716,611         -24,523,074  -97.16%  x 35.22
 concurrent_load_usize   355,220         346,668              -8,552   -2.41%   x 1.02
 fetch_add_u8            9               8                        -1  -11.11%   x 1.12
 fetch_add_usize         6               6                         0    0.00%   x 1.00
 load_u8                 8               1                        -7  -87.50%   x 8.00
 load_usize              0               0                         0     NaN%    x NaN
 store_u8                9               8                        -1  -11.11%   x 1.12
 store_usize             6               6                         0    0.00%   x 1.00

Note the difference in load_u8 and concurrent_load_u8 - that's what this PR is about.

@ghost ghost requested review from Amanieu and jeehoonkang July 31, 2018 09:17
@Amanieu (Contributor) left a comment:

This is great! Just a few minor issues.

}

if step < 5 {
// Just try again.
@Amanieu (Contributor):

Always use spin_loop_hint if you are in a spin loop.

#[inline]
fn validate_read(&self, stamp: usize) -> bool {
atomic::fence(Ordering::Acquire);
self.state.load(Ordering::SeqCst) == stamp
@Amanieu (Contributor):

The load here can be Relaxed: the ordering is already enforced by the fence.

In terms of ordering: we only need to ensure that state.load is not reordered before the volatile_read.

@ghost (Author):

The SeqCst is here just to make the operation SeqCst like the atomic one without locks.

@Amanieu (Contributor):

I'm not exactly sure if that's enough to make the operation equivalent to SeqCst.

cc @jeehoonkang

@ghost (Author):

My reasoning is:

If this SeqCst load returns a version that is not locked and is equal to the version before reading, then we've read a value that is still the same as the value present at the exact moment this SeqCst load happened. So this load operation is when the AtomicCell::load really "atomically" happens.

But yeah, I wonder what @jeehoonkang thinks, too. :)

@jeehoonkang (Collaborator) commented Jul 31, 2018:

Sequence lock (stamped lock) is... strange. I think the best introduction to the seqlock and its strangeness is Hans Boehm's slide. I think we should implement the "version 3" in this slide, as (1) it's correct, as far as Hans (and I) think, and (2) it's the most efficient scheme among those implementable in C/C++.

According to the slide, (1) this load can be Relaxed, and (2) the write-lock should issue a Release fence. (2) is necessary even if this load is SeqCst: otherwise, bad reorderings may happen on the writer side.

@ghost (Author):

@jeehoonkang

The high-level explanation is that two consecutive load(SeqCst) are not strictly ordered, while Mutex<Cell> guarantees two consecutive reads are strictly ordered.

Just to make sure I understood, can you also answer whether the following statements are true or false?

  1. Two consecutive load(SeqCst) are strictly ordered by the C++ standard.
  2. Two consecutive load(SeqCst) are strictly ordered on x86-64.
  3. Two consecutive load(SeqCst) are not strictly ordered on ARM.

The release fence should be issued at the end of write_lock(), not inside write_unlock().

So we just need to change the ordering of the swap operation inside write() from Acquire to AcqRel?

@jeehoonkang (Collaborator) commented Aug 2, 2018:

Oh, maybe what I said was not clear. Sorry. By "consecutive loads" I mean ordered by (extended) coherence, not by intra-thread program order. Suppose two threads A and B are reading the same value from an AtomicCell<T>. If it's lock-free, two reads are not ordered at all in C++, ARMv8, and ARMv7. (In x86, probably the order doesn't matter at the beginning.) But if it's based on a spinlock, then two loads are strictly ordered, either A's read before B's read or the other way around.

Regarding seqlock, acquire and release fences should be issued. AFAIK, in "version 3", W issues a release fence right after updating the stamp value in write_lock(), and R issues an acquire fence just before reading the stamp value in the validation phase. If R reads any value in the protected data written by W, then there's an acquire/release synchronization between those fences, forcing R to read the stamp value updated by W or later.

@ghost (Author):

> If it's lock-free, two reads are not ordered at all in C++, ARMv8, and ARMv7.

Doesn't at least the C++ standard guarantee that two SeqCst reads are always ordered (sequential consistency)?

> W issues a release fence right after updating the stamp value in write_lock()

Did you mean "right before"? (so that writes happen before the stamp value is updated)

> If R reads any value in the protected data written by W, then there's an acquire/release synchronization between those fences

Why not make the swap operation by W use AcqRel so that this operation establishes acquire/release synchronization with the fence issued by R? That might be slightly faster than issuing a fence by W. Or is that not going to work?

@jeehoonkang (Collaborator) commented Aug 3, 2018:

  • SeqCst loads and stores have very vague meaning in C++. As a theorem (not normative), C++ guarantees that if all atomic accesses are SeqCst and there's no data race for non-atomic locations in a program, then the program's semantics is sequentially consistent (so-called "DRF1" or "DRF-SC"; here "DRF" stands for data-race freedom). But when even a single access to an irrelevant location is e.g. Relaxed, then the program's semantics may no longer be sequentially consistent.

    In short, the semantics of SeqCst loads and stores is surprisingly weaker than expected. In particular, it's strictly weaker than spinlock-protected accesses. For the latest discussion on this subject, the RC11 paper is worth reading.

  • I meant "right after", not "right before". Also, release swap will not work. (Yes, seqlock is strange...) Because the synchronization occurs via data variables, not stamp variable: writer issues release fence and then writes a value to data, and if reader reads the new value from data and then issues acquire fence, then after that the reader should read the new stamp value written by the writer (or a later one).

@ghost (Author) commented Aug 8, 2018:

@jeehoonkang

> But when even a single access to an irrelevant location is e.g. Relaxed, then the program's semantics may no longer be sequentially consistent.

Whoa, that's really surprising!

By the way, in the previous pull request you said:

> I think SeqCst obscures the specification: what do we expect from the methods of AtomicCell? I think the word "sequentially consistent" doesn't really help, because practically none understand the term precisely. Even worse, using SeqCst may guarantee some ordering properties which users will rely on, after which we're stuck and cannot optimize it further.

I'd just like to point out that someone might rely on the fact that spinlock-protected AtomicCell has sequentially consistent semantics, but then we might switch from spinlock to e.g. AtomicU16 (once it gets into stable Rust) and thus relax the semantics behind the scenes. So that's a similar problem.

Anyway... I pushed some changes to this PR:

  • The load in the validation step now uses Relaxed rather than SeqCst.
  • Added a Release fence in write.

Does the code look good now?

@jeehoonkang (Collaborator) left a comment:

Thanks! I think it's good to go.

@Amanieu what do you think?


// Try doing an optimistic read first.
if let Some(stamp) = lock.optimistic_read() {
// We need a volatile read here because other threads might concurrently modify the
@jeehoonkang (Collaborator):

From an extremely pedantic point of view (which I don't agree with that much), a volatile read is still UB in C/C++. But everyone uses volatile reads/writes to mark an access as concurrent, including Linux for starters. Maybe a comment on this would be helpful to readers (or to sanitizers like ASAN)?

@ghost (Author):

I added a more elaborate comment explaining the issue.

@Amanieu (Contributor) commented Aug 9, 2018:

LGTM

@ghost ghost merged commit 5123ec4 into crossbeam-rs:master Aug 9, 2018
@ghost ghost deleted the optimistic-read branch August 9, 2018 17:23
@ghost ghost mentioned this pull request Sep 1, 2018