rcu_ptr

Introduction

Read-copy-update pointer (rcu_ptr) is a special smart pointer which can be used to exchange data between threads. Read-copy-update (RCU) is a general synchronization mechanism, similar to a readers-writer lock. It allows extremely low overhead for reads. However, RCU updates can be expensive, as they must leave the old versions of the data structure in place to accommodate pre-existing readers [1, 2]. RCU is useful when you have several readers and few writers. Depending on the size of the data you want to update, writes can be really slow, since they need to copy the data. Therefore, it's worth doing measurements and analyzing the characteristics of the inputs and environment of your system.

rcu_ptr implements the read-copy-update mechanism by wrapping a std::shared_ptr (in a way, it has some similarity to std::weak_ptr).

atomic_shared_ptr

rcu_ptr relies on the free atomic_... function overloads for std::shared_ptr. It would be nice to use an atomic_shared_ptr, but that is currently still experimental. We use atomic shared_ptr operations which are implemented in terms of a spin-lock (most probably that's how they are implemented in the currently available standard libraries). Having a lock-free atomic_shared_ptr would be really beneficial. However, implementing a lock-free atomic_shared_ptr in a portable way can be extremely difficult [3]. Though it might be easier on architectures where we have double-word CAS operations.
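
As a quick sanity check on a given platform, the standard's std::atomic_is_lock_free overload for shared_ptr reports which flavour you get; with current standard libraries it typically prints false. A minimal probe (the output is platform dependent):

#include <iostream>
#include <memory>

int main() {
    auto sp = std::make_shared<int>(42);
    // true if the atomic free functions for shared_ptr are lock-free on
    // this platform; with current standard libraries typically false
    std::cout << std::boolalpha << std::atomic_is_lock_free(&sp) << '\n';
}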

Why do we need RCU and rcu_ptr?

Imagine we have a collection with several reader and some writer threads operating on it. A common way to make the collection thread safe is to hold a lock (on the reader thread) until the iteration is finished. Example:

class X {
    std::vector<int> v;
    mutable std::mutex m;
public:
    int sum() const { // read operation
        std::lock_guard<std::mutex> lock{m};
        return std::accumulate(v.begin(), v.end(), 0);
    }
    void add(int i) { // write operation
        std::lock_guard<std::mutex> lock{m};
        v.push_back(i);
    }
};

This does not scale well and is prone to errors (e.g. deadlocks are possible exactly because we use a lock). The first idea to make it better is to hold a shared_ptr and keep the lock only until that is copied by the reader or updated by the writer:

class X {
    std::shared_ptr<std::vector<int>> v;
    mutable std::mutex m;
public:
    X() : v(std::make_shared<std::vector<int>>()) {}
    int sum() const { // read operation
        std::shared_ptr<std::vector<int>> local_copy;
        {
            std::lock_guard<std::mutex> lock{m};
            local_copy = v;
        }
        return std::accumulate(local_copy->begin(), local_copy->end(), 0);
    }
    void add(int i) { // write operation
        std::shared_ptr<std::vector<int>> local_copy;
        {
            std::lock_guard<std::mutex> lock{m};
            local_copy = v;
        }
        local_copy->push_back(i);
        {
            std::lock_guard<std::mutex> lock{m};
            v = local_copy;
        }
    }
};

Now we have a race on the pointee itself during the write, so we need a deep copy:

    void add(int i) { // write operation
        std::shared_ptr<std::vector<int>> local_copy;
        {
            std::lock_guard<std::mutex> lock{m};
            local_copy = v;
        }
        auto local_deep_copy = std::make_shared<std::vector<int>>(*local_copy);
        local_deep_copy->push_back(i);
        {
            std::lock_guard<std::mutex> lock{m};
            v = local_deep_copy;
        }
    }

The copy construction of the underlying data (the vector) is thread safe, since the copy ctor parameter is a const reference (const std::vector<int>&). Now, if there are two concurrent write operations, then we might miss one update: we'd need to check whether the other writer has done an update after the actual writer loaded its local copy. If it has, we should load the data again and retry the update. This leads to the general idea of using an atomic_compare_exchange in a while loop. We could use an atomic_shared_ptr (currently in the Concurrency TS), but until it is available we have to settle for the free function overloads for shared_ptr:

class X {
    std::shared_ptr<std::vector<int>> v;
public:
    X() : v(std::make_shared<std::vector<int>>()) {}
    int sum() const { // read operation
        auto local_copy = std::atomic_load(&v);
        return std::accumulate(local_copy->begin(), local_copy->end(), 0);
    }
    void add(int i) { // write operation
        auto local_copy = std::atomic_load(&v);
        auto exchange_result = false;
        while (!exchange_result) {
            // we need a deep copy
            auto local_deep_copy = std::make_shared<std::vector<int>>(
                    *local_copy);
            local_deep_copy->push_back(i);
            exchange_result =
                std::atomic_compare_exchange_strong(&v, &local_copy,
                                                    local_deep_copy);
        }
    }
};

Also (regarding the write operation), since we are already in a while loop, we can use atomic_compare_exchange_weak; that can result in a performance gain on some platforms, see http://stackoverflow.com/questions/25199838/understanding-stdatomiccompare-exchange-weak-in-c11. Note that we could move-construct the third parameter of atomic_compare_exchange_strong, thereby sparing a reference count increment and decrement:

            exchange_result =
                std::atomic_compare_exchange_strong(&v, &local_copy,
                                                    std::move(local_deep_copy));
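
Putting the two ideas together, here is a minimal sketch of the write operation using the weak exchange plus the move (same class X as above; not the library's code, just an illustration):

    void add(int i) { // write operation, weak + move variant
        auto local_copy = std::atomic_load(&v);
        auto exchange_result = false;
        while (!exchange_result) {
            auto local_deep_copy =
                std::make_shared<std::vector<int>>(*local_copy);
            local_deep_copy->push_back(i);
            // the weak exchange may fail spuriously, but the loop retries;
            // on failure local_copy is reloaded with the current value
            exchange_result = std::atomic_compare_exchange_weak(
                &v, &local_copy, std::move(local_deep_copy));
        }
    }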

In the current form of class X, nothing stops another programmer (e.g. a naive maintainer of the code years later) from adding a new reader operation like this:

    int another_sum() const {
        return std::accumulate(v->begin(), v->end(), 0);
    }

This is definitely a race condition and a problem, and it is the exact reason why rcu_ptr was created. The goal is to provide a general, higher-level abstraction above atomic_shared_ptr:

class X {
    rcu_ptr<std::vector<int>> v;

public:
    X() { v.reset(std::make_shared<std::vector<int>>()); }
    int sum() const { // read operation
        std::shared_ptr<const std::vector<int>> local_copy = v.read();
        return std::accumulate(local_copy->begin(), local_copy->end(), 0);
    }
    void add(int i) { // write operation
        v.copy_update([i](std::vector<int>* copy) {
            copy->push_back(i);
        });
    }
};

The read method of rcu_ptr returns a shared_ptr<const T> by value, therefore it is thread safe. The existence of this shared_ptr in the reader's scope ensures that the read object lives at least until the read operation finishes. Using the shared_ptr this way also keeps us free from ABA problems, see Anthony Williams - Why do we need atomic_shared_ptr?. The copy_update method receives a lambda, which is called whenever an update needs to be done, i.e. it is called repeatedly until the update succeeds. The lambda receives a T* pointing to a copy of the actual data, and we may modify that copy inside the lambda. The reset method receives a const shared_ptr<T>& with which we can overwrite the contained shared_ptr.
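
As an illustration of reset, a write operation that replaces the whole value rather than modifying a copy could look like this (a hypothetical clear member added to the class above):

    void clear() { // write operation: replace the whole value at once
        v.reset(std::make_shared<std::vector<int>>());
    }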

Design Rationale

Ordering

For reset and read we use memory_order_relaxed.
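
A sketch of what that means for these two operations, assuming sp is the wrapped std::shared_ptr<T> (an assumed shape, not the actual source):

    // assumed shape of the internals; sp is the wrapped std::shared_ptr<T>
    std::shared_ptr<const T> read() const {
        return std::atomic_load_explicit(&sp, std::memory_order_relaxed);
    }
    void reset(const std::shared_ptr<T>& r) {
        std::atomic_store_explicit(&sp, r, std::memory_order_relaxed);
    }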

For copy_update we can use consume-release semantics. There is a nice long data dependency chain here:

        std::shared_ptr<T> sp_l =
            std::atomic_load_explicit(&sp, std::memory_order_consume);
        auto exchange_result = false;
        while (!exchange_result) {

            // deep copy
            auto r = std::make_shared<T>(*sp_l);

            // update
            std::forward<R>(fun)(r.get());

            exchange_result = std::atomic_compare_exchange_strong_explicit(
                &sp, &sp_l, r, std::memory_order_release,
                // the failure ordering may not be release; consume matches
                // the dependency-ordered reload of sp_l
                std::memory_order_consume);

If we used relaxed ordering, and fun were inlined while being neither an ordering operation itself nor containing any fences, then the load or the compare_exchange might be reordered into the middle of fun. Still, there is a data dependency chain: sp->sp_l->r->compare_exchange(...,r). So if all architectures preserved data dependency ordering, then we would be fine with relaxed. But some architectures do not preserve data dependency ordering (e.g. DEC Alpha), therefore we need to state explicitly that we rely on the CPU not reordering data-dependent operations. This is what we express with the consume-release semantics. It is perfectly aligned with the Linux kernel RCU implementation, which uses consume-release too.

Of course, all the mentioned ordering constraints make sense only if we use a lock-free atomic_shared_ptr. A non-lock-free one enforces acquire-release semantics anyway, because it must internally use a spinlock for all of its member functions.
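
To make that concrete, a hypothetical sketch of a spinlock-based atomic_load for shared_ptr (illustration only; a single global lock stands in for the per-address lock table real implementations typically use):

#include <atomic>
#include <memory>

static std::atomic_flag lock = ATOMIC_FLAG_INIT; // one global lock, for brevity

template <typename T>
std::shared_ptr<T> atomic_load_with_spinlock(const std::shared_ptr<T>* p) {
    while (lock.test_and_set(std::memory_order_acquire))
        ; // spin; this acquire pairs with the release below
    std::shared_ptr<T> result = *p; // plain copy, protected by the lock
    lock.clear(std::memory_order_release);
    return result;
}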

Public interface

copy_update
The signature of the lambda we pass to copy_update could have different forms.

  • T(const T&) We receive a reference to the actual value held. The user of rcu_ptr has to do the copy and return it.

    • Pros:
      • The user must do the copy, but they will surely know there is a copy happening.
    • Cons:
      • More repetitive work in user code (the copy must always be written out)
  • shared_ptr<T>(shared_ptr<const T>) We receive a shared_ptr to const to the actual value held. The user has to do the copy with make_shared.

    • Pros:
      • It is consistent with the rest of the member functions. I.e. reset takes a shared_ptr too.
    • Cons:
      • We copy the shared_ptr (an extra reference count bump)
  • shared_ptr<T>(const shared_ptr<T>&) We receive a shared_ptr to the actual value held. The user has to do the copy with make_shared.

    • Pros:
      • It is consistent with the rest of the member functions. I.e. reset takes a shared_ptr too.
      • We do not copy the shared_ptr
    • Cons:
      • We hand out a non-const pointer to the actual data, which opens the possibility of a data race
  • void(T&) We receive a reference to the copied value.

    • Pros:
      • Simple
      • Fewer things to write in user code
    • Cons:
      • It might not be obvious that we do a copy in the background.
      • The signature is not consistent with the other member functions, which are all related to a shared_ptr (e.g. read returns with shared_ptr).
  • void(T*) We receive a pointer to the copied value.

    • Pros:
      • Simple
      • Fewer things to write in user code
      • The signature is quite consistent with the other member functions, since we pass a pointer.
      • Can handle the case when the contained shared_ptr is a nullptr.
    • Cons:
      • It might not be obvious that we do a copy in the background.

At the moment, the last one, void(T*), is the chosen form.
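
For comparison, here is the same push_back write with the chosen void(T*) form and, hypothetically, with the T(const T&) alternative; only the first matches the actual interface:

// chosen form, void(T*): rcu_ptr does the copy, we only mutate it
v.copy_update([i](std::vector<int>* copy) { copy->push_back(i); });

// hypothetical T(const T&) form: the copy would be written out by the
// user on every update
// v.copy_update([i](const std::vector<int>& current) {
//     auto copy = current;
//     copy.push_back(i);
//     return copy;
// });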

Usage

rcu_ptr depends on features of the C++11 standard. rcu_ptr has two default template parameters which make it possible to use an atomic_shared_ptr other than the default one. By default we use a wrapper class which uses the [free function overloads for std::shared_ptr](http://en.cppreference.com/w/cpp/memory/shared_ptr/atomic). Note, these overloads are not implemented in GCC/libstdc++ versions below 5.0 (https://gcc.gnu.org/bugzilla/show_bug.cgi?id=57250). We can use the lock-free atomic_shared_ptr from Anthony Williams like this:

#include <rcu_ptr.hpp>
#include <jss/atomic_shared_ptr>
#include <jss/atomic_shared_ptr_traits.hpp>

using asp_traits = jss::atomic_shared_ptr_traits<jss::atomic_shared_ptr>;

template <typename T>
using RcuPtr = rcu_ptr<T, jss::atomic_shared_ptr, asp_traits>;

void bar(asp_traits::shared_ptr<int> sp);
void foo() {
    auto i = asp_traits::make_shared<int>(42);
    bar(i);
}
void f() {
    RcuPtr<int> p;
    auto const new_ = asp_traits::make_shared<int>(42);
    p.reset(new_);
}

asp_traits provides the actual type of the shared_ptr (and make_shared) which is connected to the underlying atomic_shared_ptr. For extensive usage examples please check test/rcu_race.cpp.

Building

The library is header only (rcu_ptr.hpp), thus it requires no build step. The directory detail must be in the include path as well. If you want to use the lock-free implementation of atomic_shared_ptr (from Anthony Williams) as the underlying shared_ptr, then jss must be in the path too and rcu_ptr shall be instantiated with types from the jss namespace.

Tests and measurements are included to verify the concept and expected behaviour of rcu_ptr. We use [CMake](https://cmake.org/) and the tests may be built with [ThreadSanitizer](https://code.google.com/archive/p/data-race-test/wikis/ThreadSanitizer.wiki), introduced in GCC 4.8. The tests require generic (polymorphic) lambdas from C++14. To build the measurements as well, URCU must be installed on the system (http://liburcu.org/).

git clone git@github.com:martong/rcu_ptr.git
cd rcu_ptr
git submodule init
git submodule update
mkdir build
cd build
cmake -G Ninja ..
ninja
ctest

We tested rcu_ptr on Linux with gcc 5 and 7; our primary target is Linux/gcc7. On macOS we could not use the jss::atomic_shared_ptr, so the jss targets are not supported on macOS.

Acknowledgement

Many thanks to bucienator for the valuable discussions about the implementation and the public interface. Thanks to Máté Cserna for his really helpful comments. Thanks to Jeff Preshing for his wonderful blog, and to Anthony Williams for his book.
