Bug in atomic operations on aarch64 with multi-threading #13010

Closed
straight-shoota opened this issue Jan 26, 2023 · 12 comments · Fixed by #13050
Labels
kind:bug A bug in the code. Does not apply to documentation, specs, etc. platform:aarch64 topic:multithreading topic:stdlib:concurrency

Comments

@straight-shoota
Member

straight-shoota commented Jan 26, 2023

There seems to be a bug with atomic operations in multi-threaded mode (-Dpreview_mt) on aarch64.

It was discovered through use of atomic operations in Crystal::SpinLock for guarding Channel access.
This is a reproducible (though not very minimal) example that should fail with an invalid memory access:

# build with `crystal build -Dpreview_mt` on aarch64
require "socket"
require "log"

NUMBER_OF_CLIENTS = 100

Log.setup_from_env(default_level: Log::Severity::Info, backend: Log::IOBackend.new(dispatcher: Log::AsyncDispatcher.new(buffer_size: 256)))

def handle_client(client)
  Fiber.current.name = "client #{client.remote_address.to_s}"
  Log.info { "new client connected" }
  loop do
    msg_length = UInt32.from_io(client, IO::ByteFormat::SystemEndian)
    msg = client.read_string(msg_length)
    Log.info { "Got: #{msg}" }
    response_msg = "Ok #{Random.rand}"
    response_msg.bytesize.to_io(client, IO::ByteFormat::SystemEndian)
    client.write_string(response_msg.to_slice)
  end
end

sync = Channel(Nil).new
server_fiber = spawn(name: "server") do
  Log.info { "creating server" }
  TCPServer.open("localhost", 12345) do |listener|
    sync.send nil
    while client = listener.accept?
      spawn handle_client(client)
    end
  end
end

sync.receive

NUMBER_OF_CLIENTS.times do |i|
  spawn do
    TCPSocket.open("localhost", 12345) do |server|
      Fiber.current.name = "server #{i}"
      loop do
        msg = "hello #{Random.rand}"
        msg.bytesize.to_io(server, IO::ByteFormat::SystemEndian)
        server.write_string(msg.to_slice)
        response_length = UInt32.from_io(server, IO::ByteFormat::SystemEndian)
        response_msg = server.read_string(response_length)
        Log.info { "Got: #{response_msg}" }
      end
    end
  end
end

sleep

This bug is confirmed to appear on aarch64-apple-darwin as well as aarch64-unknown-linux-gnu (although I have not been able to reproduce it on Linux myself). x86 architectures seem to be unaffected.

As a workaround we can use Thread::Mutex instead of SpinLock for Channel. This of course has an impact on performance.

diff --git i/src/channel.cr w/src/channel.cr
index 74010d9c2..15efda377 100644
--- i/src/channel.cr
+++ w/src/channel.cr
@@ -23,7 +23,11 @@ require "crystal/pointer_linked_list"
 # will be indistinguishable from a closed channel.
 #
 class Channel(T)
-  @lock = Crystal::SpinLock.new
+  {% if flag?(:preview_mt) && flag?(:aarch64) %}
+    @lock = Thread::Mutex.new
+  {% else %}
+    @lock = Crystal::SpinLock.new
+  {% end %}
   @queue : Deque(T)?

   # :nodoc:
diff --git i/src/crystal/system/thread_mutex.cr w/src/crystal/system/thread_mutex.cr
index e3cf9ffeb..1222d7834 100644
--- i/src/crystal/system/thread_mutex.cr
+++ w/src/crystal/system/thread_mutex.cr
@@ -14,6 +14,12 @@ class Thread

     # Locks the mutex, yields to the block and ensures it unlocks afterwards.
     # def synchronize(&block)
+
+    def sync(&)
+      synchronize do
+        yield self
+      end
+    end
   end
 end

A potential angle for a fix could be this: https://www.llvm.org/docs/Atomics.html

Atomic operations are represented in the SelectionDAG with ATOMIC_* opcodes. On architectures which use barrier instructions for all atomic ordering (like ARM), appropriate fences can be emitted by the AtomicExpand Codegen pass if setInsertFencesForAtomic() was used.

However, it's not clear if and what we would need to do for that as it's understood this would be handled implicitly by LLVM.

@straight-shoota straight-shoota added kind:bug A bug in the code. Does not apply to documentation, specs, etc. topic:stdlib:concurrency topic:multithreading platform:aarch64 labels Jan 26, 2023
@spuun
Contributor

spuun commented Jan 27, 2023

l = Crystal::SpinLock.new
q = Deque(Int32).new(initial_capacity: 10)

200.times do |i|
  spawn do
    loop do
      l.lock
      if q.size < 10
        q << i
      end
      if q.@capacity > 10
        raise "BUG: q.@capacity=#{q.@capacity}"
      end
      l.unlock
      Fiber.yield
    end
  end
end

spawn do
  loop do
    l.lock
    if i = q.shift?
      puts i
    end
    l.unlock
    Fiber.yield
  end
end
sleep

This also reproduces the bug. It won't crash with an invalid memory access, but it shows that we've managed to insert an extra item into the Deque, causing it to increase its capacity (which is what happens in Channel), even though we have the if q.size < 10 guard.

@spuun
Contributor

spuun commented Jan 27, 2023

Just for the record: this also affects e.g. Mutex, which uses Atomic and SpinLock internally. It's possible to use a Mutex in the above example to get the same result; a sketch follows.
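
For reference, a minimal sketch of such a Mutex variant (a hypothetical rewrite of the SpinLock repro above, not run here; on an affected aarch64 -Dpreview_mt build the capacity check should eventually trip):

m = Mutex.new
q = Deque(Int32).new(initial_capacity: 10)

200.times do |i|
  spawn do
    loop do
      m.synchronize do
        q << i if q.size < 10
        # Same invariant as above: the capacity must never grow past 10.
        raise "BUG: q.@capacity=#{q.@capacity}" if q.@capacity > 10
      end
      Fiber.yield
    end
  end
end

spawn do
  loop do
    m.synchronize do
      if i = q.shift?
        puts i
      end
    end
    Fiber.yield
  end
end
sleep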

@ggiraldez
Contributor

I took a look at this after a conversation with @bcardiff. From what I understand, the atomic implementation is fine, but the problem is that the compiler does not emit memory barriers at the lock/unlock points. This means the atomic operations themselves execute correctly and sequential consistency is maintained for them, but not for the rest of memory. Since ARM has a weak memory consistency model, nothing prevents the CPU from reordering the final write to the Deque object and executing it after the SpinLock was unlocked (thus allowing a second thread to read a stale state for the deque).

Warning: I don't have an ARM processor at hand to try this. All of what I'm saying is inferred from reading the linked LLVM document and these two additional posts, and from inspecting the assembly output for the given test program. Just saying I may be way off 😁

A blunt approach to fixing this would be to insert an acquire-release memory barrier right before the unlock. From what I can see, that generates a dmb ARM instruction, which should prevent reordering writes and ensure all threads have a consistent view of the Deque.

l = Crystal::SpinLock.new
q = Deque(Int32).new(initial_capacity: 10)

@[Primitive(:fence)]
def fence(ordering : LLVM::AtomicOrdering, singlethread : Bool) : Nil
end

200.times do |i|
  spawn do
    loop do
      l.lock
      if q.size < 10
        q << i
      end
      if q.@capacity > 10
        raise "BUG: q.@capacity=#{q.@capacity}"
      end
      fence :sequentially_consistent, false   # <---- HERE
      l.unlock
      Fiber.yield
    end
  end
end

spawn do
  loop do
    l.lock
    if i = q.shift?
      puts i
    end
    fence :sequentially_consistent, false    # <---- HERE
    l.unlock
    Fiber.yield
  end
end
sleep

A better approach would be to add memory ordering modifiers to the SpinLock class and allow the compiler to emit more fine-grained fences.
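
As a rough illustration of that direction (a sketch only, not the actual stdlib implementation; it assumes the swap-based lock loop of Crystal::SpinLock and the fence primitive declared above):

class Crystal::SpinLock
  @m = Atomic(Int32).new(0)

  def lock
    while @m.swap(1) == 1
      # spin until the current holder stores 0
    end
    # Acquire: loads in the critical section may not be reordered above this.
    fence :acquire, false
  end

  def unlock
    # Release: stores in the critical section may not be reordered below this.
    fence :release, false
    @m.lazy_set(0)
  end
end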

I'm not sure if there is anything to be tweaked in the LLVM backend. The emitted assembly looks sound from a semantic point of view regarding the atomic behavior. It uses ldaxr and stlxr instructions, which implement more granular memory barriers. I'm sure there are cases where issuing a global memory barrier on each atomic operation is the wrong thing to do.

@bcardiff
Member

bcardiff commented Feb 7, 2023

🎉 it seems to work here. A simpler monkey patch would be

class Crystal::SpinLock
  def unlock
    ::Atomic::Ops.fence :sequentially_consistent, false 
    previous_def
  end
end

Anyone else want to validate?

@ggiraldez
Contributor

Nice!

A couple more things to consider:

  • It may be necessary to add a memory barrier in SpinLock#lock as well. By the same reasoning, the CPU could otherwise reorder a read that in the code happens inside the synchronized block so that it happens before the block begins. That's another reason to use a more granular barrier: an acquire barrier in #lock and a release one in #unlock.
  • Since the barriers don't seem to be necessary on x86 (I assume due to its stronger memory model), it would be desirable to avoid emitting them for that architecture.
  • Should Atomic::Ops be exposed more publicly in the stdlib? Maybe wrapped in a slightly higher level API.

@bcardiff
Member

bcardiff commented Feb 7, 2023

It may be necessary to also add a memory barrier in SpinLock#lock as well.

Before the set, right? Or after it?

@ggiraldez
Contributor

After obtaining the lock, after the outer while in #lock.

@ysbaddaden
Contributor

ysbaddaden commented Feb 8, 2023

Reading the ARM and Lock-Free Programming article that @ggiraldez linked, I understand that std::atomic will set the memory barriers by itself when needed (i.e. on any CPU but x86 and x86_64).

Isn't Atomic::Flag (and maybe Atomic) unsafe on CPUs with weak memory models, making SpinLock only a symptom? Or am I missing something?

@ggiraldez
Contributor

Yeah, I thought about that too. But std::atomic allows you to specify the memory ordering guarantees that you want. That's not something we have in Atomic. There may be instances where you don't need the ordering guarantees and the current semantics of Atomic would be enough. That's the minimalist approach. But I agree we should be careful with making the semantics explicit (in the documentation at least) to avoid pitfalls for people coming from other languages. Either that or we choose different names for the concepts.
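
To make the contrast concrete: as of this discussion, Crystal's Atomic API takes no ordering argument at all, so every call is implicitly sequentially consistent, except for the lazy_* escape hatches:

a = Atomic(Int32).new(0)
a.add(1)                # seq_cst read-modify-write
a.compare_and_set(1, 2) # seq_cst CAS, returns {old_value, Bool}
a.lazy_set(3)           # plain, non-atomic store
a.lazy_get              # plain, non-atomic load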

@HertzDevil
Contributor

HertzDevil commented Feb 9, 2023

I believe all of our atomic operations are already sequentially consistent, since that is what we pass to LLVM; the culprit may be the use of Atomic#lazy_get and #lazy_set.

Also, fence seems to be the only operation that is not exposed by Atomic's public API. We could easily have that as a class method or something.
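
Something along these lines, perhaps (hypothetical; Atomic.fence is not part of the public API at the time of this discussion, and the name and default are illustrative):

struct Atomic(T)
  # Expose the existing Ops.fence primitive as a class method.
  def self.fence : Nil
    Ops.fence :sequentially_consistent, false
  end
end

Call sites like SpinLock#unlock could then write Atomic.fence instead of reaching into Atomic::Ops.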

@ggiraldez
Contributor

It's true that atomic operations are sequentially consistent. But on ARM at least, that only ensures sequential consistency for the atomics themselves. Memory barriers are required to synchronize the rest of memory. I don't think the use of Atomic#lazy_set in SpinLock is harmful because it's the release of the lock. If I understand correctly, it could only delay a pending lock a bit.

@ysbaddaden
Contributor

I'm wondering again whether Atomic::Flag shouldn't be safe by default, and whether we should expose fences. That's something important for MT on CPUs with weak memory models.

Reading the linked articles again, I don't understand the requirement for an explicit memory barrier when we already have the explicit sequentially consistent memory order on the atomics...

Both articles talk about MSVC, which does generate explicit memory barriers for AArch64. I built AArch64 binaries using stdatomic in C11 and std::atomic in C++11 with clang-12 (the one I have at hand) and I don't see explicit memory barriers being emitted. But this is only for the atomics; what about locks?

Exploring the internals of musl-libc: the atomics use the same set of instructions (LDAXR/STLXR/CBNZ), but musl does set an explicit memory barrier (DMB ISH) in its atomic CAS for AArch64.

The spinlock in the Linux kernel for ARM64 explicitly states:

The memory barriers are implicit with the load-acquire and store-release instructions.

The difference is, as @HertzDevil points out, the #lazy_set in Crystal::SpinLock (and RWLock, ...), which breaks the atomic contract.

That being said, it seems we should have a memory barrier on ARM (v6 and v7).
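
For the #lazy_set point, a minimal sketch of honoring the contract (illustrative only; assumes the stdlib's @m flag ivar as in the sketch further up):

class Crystal::SpinLock
  def unlock
    # A real seq_cst store instead of lazy_set: on AArch64 this compiles to
    # stlr, so the barrier is implicit in the store-release, as the kernel
    # comment above describes.
    @m.set(0)
  end
end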
