Rewrite std::comm #10830
Conversation
This is very exciting.
I want to read this all carefully, but I've only gotten through the doc comments tonight.
Do you have a sense for why oneshot performance is 3x slower despite stream being faster?
I do indeed. Right now oneshots are actually super-optimized for their use case. An allocation of a oneshot channel is one tiny allocation of a box to share between the two ends. An allocation of a stream in this pull request, however, is three allocations (hence the ~3x slowdown). In profiling, the creation of the channel completely dominated the oneshot benchmark.
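To make the allocation argument concrete, here is a toy model of a oneshot in modern Rust (this is an illustrative sketch, not the `std::comm` implementation): both endpoints share a single heap allocation (the `Arc`'s block), which is why creating one is so much cheaper than a multi-allocation stream.

```rust
use std::sync::{Arc, Condvar, Mutex};

// One shared allocation holds both the value slot and the wakeup condvar.
struct OneshotInner<T> {
    slot: Mutex<Option<T>>,
    cond: Condvar,
}

// Creating the channel costs exactly one allocation, shared by both ends.
fn oneshot<T>() -> (Arc<OneshotInner<T>>, Arc<OneshotInner<T>>) {
    let inner = Arc::new(OneshotInner {
        slot: Mutex::new(None),
        cond: Condvar::new(),
    });
    (inner.clone(), inner) // (sender half, receiver half)
}

fn send<T>(chan: &OneshotInner<T>, value: T) {
    *chan.slot.lock().unwrap() = Some(value);
    chan.cond.notify_one();
}

fn recv<T>(port: &OneshotInner<T>) -> T {
    let mut guard = port.slot.lock().unwrap();
    loop {
        match guard.take() {
            Some(v) => return v,
            None => guard = port.cond.wait(guard).unwrap(),
        }
    }
}

fn main() {
    let (tx, rx) = oneshot::<i32>();
    send(&tx, 42);
    assert_eq!(recv(&rx), 42);
}
```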
//
// Rust channels come in two flavors: streams and shared channels. A stream has
// one sender and one receiver while a shared channel could have multiple
// receivers. This choice heavily influences the design of the protocol set |
a shared channel could have multiple senders, no?
Whoops, good catch!
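For reference, the stream vs. shared-channel distinction maps onto today's `std::sync::mpsc` API roughly as follows (a modern-Rust sketch; the PR-era names `Chan`/`SharedChan` differ, and multiple senders are obtained by cloning):

```rust
use std::sync::mpsc::channel;
use std::thread;

// Fan-in over a shared channel: `n` cloned senders feed one receiver.
fn fan_in(n: u64) -> u64 {
    let (tx, rx) = channel();
    for i in 0..n {
        let tx = tx.clone();
        thread::spawn(move || tx.send(i).unwrap());
    }
    drop(tx); // drop the original sender so rx.iter() terminates
    rx.iter().sum()
}

fn main() {
    // Stream: exactly one sender, one receiver.
    let (tx, rx) = channel();
    thread::spawn(move || tx.send(1).unwrap());
    assert_eq!(rx.recv().unwrap(), 1);

    // Shared channel: many senders, still one receiver.
    assert_eq!(fan_in(4), 0 + 1 + 2 + 3);
}
```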
@alexcrichton Are oneshots going to be added again in the future if folks want better performance for that use case?
//!
//! # Example
//!
//! // Create a simple streaming channel
Wouldn't it be better to enclose the code sample in `~~~rust` … `~~~` fences for syntax highlighting?
I would rather try not adding oneshots back to begin with. The use case in which they are more efficient is when the creation of the channel is far more common than the usage of the channel. We currently don't have much code that has its bottleneck in that area. Another snag would be implementing
@alexcrichton There's an issue that is still unsolved: SharedChan can be used to create cycles. It has always existed, but it might be the right time to consider it since we are rewriting it and changing the API anyway. I filed it as issue #10835 so the discussion can happen there instead of on this pull request, since it's somewhat orthogonal to implementation issues.
@alexcrichton Is access to the to_wake field properly synchronized? It doesn't seem to be protected by any locks, and doesn't seem to be accessed with atomics, but it's very possible I misunderstand something since I only glanced at the code.
@alexcrichton The SPSC queue's web page adds padding to put the consumer and producer sides of the queue on different cache lines, but this pull request forgets to do so.
@alexcrichton The SPSC code uses Relaxed, which seems wrong: it should use Release on most stores and Consume on most loads like the code in the linked web page does. In general, it seems that all the synchronization-related parts need careful expert review, although the high-level design might be correct.
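The Release/Acquire pairing the review asks for can be sketched in modern Rust as follows (Rust exposes the C++11-style orderings but not Consume, so Acquire is the conservative stand-in on the load side; this is an illustration, not the PR's queue):

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Publish a value from a producer to a consumer: a Release store on the
// flag pairs with Acquire loads so the payload write is guaranteed visible.
fn handoff() -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let (d, r) = (data.clone(), ready.clone());
    let producer = thread::spawn(move || {
        d.store(42, Ordering::Relaxed);   // the payload write itself
        r.store(true, Ordering::Release); // publish: all prior writes become visible...
    });

    // ...to any thread that observes the flag with an Acquire load.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    producer.join().unwrap();
    data.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(handoff(), 42);
}
```

With both orderings Relaxed, nothing forbids the consumer from seeing `ready == true` while still reading a stale `data`.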
To the best of my knowledge, yes.
Yes, I wasn't able to measure any difference and I wanted to make the struct smaller (while I was optimizing the oneshot case).
Whoops, I reverted everything away from
@alexcrichton It seems that the handling of Packet::cnt is wrong. Specifically, Packet::increment can do a fetch_add on a channel with cnt == DISCONNECTED, and the code then resets it to DISCONNECTED, but this means that in the race window any negative value can represent a disconnected channel. Hence Packet::decrement should consider any negative cnt as disconnected instead of asserting that if cnt != DISCONNECTED then cnt >= 0, which is false; likewise all other checks for DISCONNECTED need to be fixed. BTW, this assumes that the code never makes cnt negative for reasons other than disconnection, which seems true but isn't totally clear. Alternatively, you can use a cmpxchg in Packet::increment so that if cnt == DISCONNECTED, then cnt is never changed, and you can then use negative cnt for other things.
EDIT: ok, it seems that the cnt field is used to represent whether to_wake is valid. However, what if a producer thread stops indefinitely while in the middle of reading to_wake? Producers don't seem to block consumers writing to_wake, so it seems it could be overwritten while a producer reads it. In other words, I think you need to make to_wake a 1-word structure and manage it with atomics, or put a mutex around it. Also, anything that to_wake points to must live infinitely if you don't use a mutex (since a producer might stall forever while waking up). This seems to be the case already on 1:1, since the wakeup mutex is either in an Arc, or in the channel, but I'm not sure whether the M:N case is correct.
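The two increment strategies under discussion can be sketched side by side (a simplified model of the cnt field, with the sentinel and function names chosen here for illustration):

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

const DISCONNECTED: isize = isize::MIN;

// fetch_add-based increment: briefly leaves DISCONNECTED + 1 visible before
// the sentinel is restored, so during that window every reader must treat
// *any* such negative value as "disconnected", not just the exact sentinel.
fn increment_xadd(cnt: &AtomicIsize) -> isize {
    let prev = cnt.fetch_add(1, Ordering::SeqCst);
    if prev == DISCONNECTED {
        // restore the sentinel; other threads may have observed the race window
        cnt.store(DISCONNECTED, Ordering::SeqCst);
    }
    prev
}

// CAS-based increment: a disconnected channel is never modified at all, so
// DISCONNECTED stays exact and other negative values remain free for other uses.
fn increment_cas(cnt: &AtomicIsize) -> isize {
    let mut cur = cnt.load(Ordering::SeqCst);
    loop {
        if cur == DISCONNECTED {
            return DISCONNECTED;
        }
        match cnt.compare_exchange(cur, cur + 1, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(prev) => return prev,
            Err(actual) => cur = actual,
        }
    }
}

fn main() {
    let a = AtomicIsize::new(DISCONNECTED);
    increment_xadd(&a);
    assert_eq!(a.load(Ordering::SeqCst), DISCONNECTED);

    let b = AtomicIsize::new(DISCONNECTED);
    assert_eq!(increment_cas(&b), DISCONNECTED);
    assert_eq!(b.load(Ordering::SeqCst), DISCONNECTED);
}
```

The trade-off raised later in the thread: the CAS loop may retry under contention, while fetch_add always completes in one atomic instruction at the cost of the fuzzy sentinel.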
BTW, if you use Acquire and Consume there should be no code generation difference on x86 compared to Relaxed, because the x86 hardware provides those guarantees for all code (all stores are ordered, and dependent loads are too).
@alexcrichton What if the user specifies the same port multiple times in the select() array? It seems this will break all sorts of invariants in the current code, for instance by decrementing cnt multiple times. There should be a check for it, or maybe select should take &muts. BTW, select is inefficient because it is O(n^2), since reading from each of n pipes requires selecting on all n: it should be replaced by some sort of Select struct where ports can be added and removed and the actual select() call has no arguments.
% RESCHED_FREQ should be & RESCHED_FREQ_MASK since division is extremely slow, and power-of-two resched frequencies should be enough, assuming that this thing is needed at all. As the code stands, try() might well spend the majority of its execution time dividing unnecessarily on some architectures... HOWEVER, this algorithm is probably completely broken anyway, because with SharedChan a single unlucky producer might never resched, since other producers might hit all the cnt values that are multiples of RESCHED_FREQ... Likewise, if one sends on several different channels, it may also never reschedule. It should probably just call maybe_yield, which should do the & RESCHED_FREQ_MASK internally on a #[thread_local] variable that it also increments (#[thread_local] is very fast to access, certainly far faster than dividing...)
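The mask trick being suggested: for a power-of-two divisor, `x % n` and `x & (n - 1)` are identical, and the mask compiles to a single AND rather than a division. A quick sketch (the constant names here are hypothetical, echoing the review's suggestion):

```rust
fn main() {
    const RESCHED_FREQ: u64 = 256; // must be a power of two for the mask trick
    const RESCHED_MASK: u64 = RESCHED_FREQ - 1;

    // The two expressions agree for every input when the divisor is 2^k.
    for x in [0u64, 1, 255, 256, 257, 1_000_003] {
        assert_eq!(x % RESCHED_FREQ, x & RESCHED_MASK);
    }
}
```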
I heavily use oneshots in my application and feel they offer a really nice API for asynchronous method calls. But I'm basically using oneshots as promises or futures, so if a library along those lines is planned for the future I'd be fine with oneshots being removed. Also I'm not sure why @alexcrichton would think oneshots can't be selected over. Wouldn't any of the suggestions given in issue #10624 work?
Huh? At worst it will get the TLS offset from a global variable in the dynamic library (or from the GOT to be more precise on x86-64), which might take something like 5 instructions. Of course this assumes a good implementation like the one in Linux/glibc. |
It will end up making calls to the
Sorry, 5 instructions was optimistic, I guess it's more around 20 for dynamic libraries in the fast path. __tls_get_addr is this, where of course the ifs are not executed for accesses after the first if no libraries have been loaded in between:
It's perfectly normal to see a large negative value in cnt. The only significant value is -1.
That is not true. I explicitly mention in the documentation describing how the channels work that the count can be very negative.
This does not have the progress semantics I would want. While most senders would reasonably only execute a cmpxchg a few times, it's much nicer to guarantee that only one atomic instruction is run.
I do not see the problem. The producer owns
No, as you found out, this is only read when the value is -1. The value is only -1 when
No, acquire/release semantics will guarantee that all writes on behalf of the producer are visible to the consumer.
No, it just needs to be guaranteed to be alive for its use. The scheduler provides separate guarantees which enforce this. Additionally, see my XXX comment about how the
Ah I forgot that I forgot to handle this. My original redesign had all the methods take
Sounds like you're describing the select syscall, which has nothing to do with
I have seen this nowhere in my profiles. Compared to an atomic instruction, this is not a concern.
I do not believe that this is a matter of concern unless a concrete profile shows it to be.
This rescheduling is not used for correctness, it is used to prevent starvation. If a task only sends a few times on a few channels, then there's no manual rescheduling necessary. The starvation we're protecting against is when one task simply infinitely sends data without ever yielding to the scheduler (which this implementation is guaranteed to eventually check for a rescheduling).
That is not true. If the task exits after sending on channels, then there is most certainly a rescheduling. As mentioned above, this is just starvation prevention, not pre-emption.
Not in my profiling. This is exactly why I only added the "check to maybe reschedule every so often". The TLS accesses were slowing the benchmark down by about 50-80%. The executables I was working with were all statically linked, and from what @thestinger mentioned, something may be going wrong if they're so slow.
There is no magical "let's select over everything all at once" function to call. That has to be implemented by someone. Having implemented this iteration of select, I can tell you personally that it would be difficult. I'm not saying it's impossible. My biggest concern is the type signature of select. This rewrite changes it to
Of those methods, I prefer the third, but it still doesn't sit well with me. As a result, I have chosen to not reimplement oneshot ports. Remember though, the only semantic difference about a oneshot is that you can only use it once. A normal channel will suffice as a replacement in all circumstances (without having the same invariant of being used once). The only reason to prefer a oneshot (ignoring semantics) is if your application is dominated by the creation of channels (in which case allocating a stream is slower). I do not foresee this being very common (I could very well be wrong).
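The "normal channel as a oneshot" substitution looks like this with the modern `std::sync::mpsc` API (the PR-era constructors differ in name only; `async_square` is a made-up example function):

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

// Emulating a oneshot/promise with an ordinary channel: send exactly one
// value, then let the sender drop so the receiver observes disconnection.
fn async_square(x: u32) -> Receiver<u32> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        tx.send(x * x).ok(); // the single "promise" resolution
        // tx is dropped here, so a second recv() reports disconnection
    });
    rx
}

fn main() {
    let rx = async_square(6);
    assert_eq!(rx.recv().unwrap(), 36);
    assert!(rx.recv().is_err()); // already consumed and disconnected
}
```

The only invariant lost relative to a true oneshot is that nothing statically prevents a second send.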
Let's please keep this pull request on topic. Discussions of how
The code asserts in several places that n >= 0 in matches where the only other case is n == DISCONNECTED, as far as I can tell. This is wrong since a disconnected channel can have DISCONNECTED + 1 in the race window caused by your use of xadd instead of cmpxchg, and in fact DISCONNECTED + k where k < max_simultaneously_running_tasks.
And then the producer owns to_wake, but the receiver can overwrite it as it writes again, as far as I can tell.
If the producer does not block receivers while reading to_wake (such as by having both take a mutex), then the producer thread might stop indefinitely between reading to_wake and waking up the task while receivers select on it multiple times, which means to_wake must be valid forever (the only thing that is guaranteed is that the channel will stay alive, so storing stuff in the channel is OK).
If you are waiting on a million ports, you'll need to enqueue the task to wait on each of the million ports every time you receive a message. So to receive a message each from a million ports, it takes a trillion iterations of the select inner loop, which is clearly not good. This is exactly the same issue the OS select() and poll() syscalls have (since the poll() syscall does exactly the same thing as your code does in kernel), and it's why it has been replaced with epoll and kqueue in properly written software with unbounded amounts of fds to select over.
That's probably because you profiled on an out-of-order architecture with hardware division (like an x86-64 CPU), where the branch predictor allowed execution to continue speculatively despite the division not having finished. Try it on an in-order architecture without hardware division (low-end ARM CPUs) and it should be far more visible. In general, one should absolutely never divide (or use %, which is the same) unless it's unavoidable (btw, this is a 64-bit division on 64-bit machines, which is even worse). In this case, there's absolutely no justification to use a division, since the resched frequency is arbitrary anyway and can thus be made a power of two.
Sorry this was a little unclear. I've gone through a bunch of iterations and the relevant comment appears to have been removed. Regardless, you are correct in this description. The bug would occur if there were enough increments to bring the value to -1 from the disconnected state before the original task put disconnected back into the slot. On a 64-bit architecture, that will never happen. On a 32-bit architecture, I'm willing to bet money that will never happen.
I'm not understanding where you think the problem is. Can you provide me a trace which exposes the bug?
Ok, but it's also not helpful to say "select is slow". The select implementation is not slow at all, rather the interface inherently prevents an "efficient implementation" as you're expecting. This just means that we would need to evaluate whether we would need another abstraction. This abstraction would probably be something along the lines of:
And that would prevent having to re-sleep on ports all the time (possibly, I have not considered implementation details). Regardless, the speed of
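A registration-based abstraction of the kind being discussed might look roughly like this. This is purely a hypothetical API sketch (the struct and method names are invented here, and the `wait` below naively polls rather than sleeping); the point is that ports are registered once, so each wakeup costs O(n) instead of re-arming every port at every call site:

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};

// Hypothetical registration-based Select: add ports once, then call wait()
// with no arguments, epoll-style, rather than passing the full array each time.
struct Select<T> {
    ports: Vec<Receiver<T>>,
}

impl<T> Select<T> {
    fn new() -> Select<T> {
        Select { ports: Vec::new() }
    }

    // Returns a handle identifying this port in wait() results.
    fn add(&mut self, port: Receiver<T>) -> usize {
        self.ports.push(port);
        self.ports.len() - 1
    }

    // Returns (handle, value) for the first registered port with data.
    // A real implementation would block instead of spinning.
    fn wait(&self) -> (usize, T) {
        loop {
            for (i, port) in self.ports.iter().enumerate() {
                match port.try_recv() {
                    Ok(v) => return (i, v),
                    Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {}
                }
            }
            std::thread::yield_now();
        }
    }
}

fn main() {
    let (tx_a, rx_a) = channel();
    let (tx_b, rx_b) = channel::<&str>();
    let mut sel = Select::new();
    let a = sel.add(rx_a);
    let b = sel.add(rx_b);
    tx_b.send("hello").unwrap();
    let (idx, msg) = sel.wait();
    assert_eq!(idx, b);
    assert_ne!(idx, a);
    assert_eq!(msg, "hello");
    drop(tx_a);
}
```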
(@bill-myers you may already know this, but one can put comments on individual lines of the code on the "Files Changed" tab of pull requests, which gives better locality for a review.)
* Streams are now ~3x faster than before (fewer allocations and more optimized)
* Based on a single-producer single-consumer lock-free queue that doesn't always have to allocate on every send.
* Blocking via mutexes/cond vars outside the runtime
* Streams work in/out of the runtime seamlessly
* Select now works in/out of the runtime seamlessly
* Streams will now fail!() on send() if the other end has hung up
* try_send() will not fail
* PortOne/ChanOne removed
* SharedPort removed
* MegaPipe removed
* Generic select removed (only one kind of port now)
* API redesign
  * try_recv == never block
  * recv_opt == block, don't fail
  * iter() == Iterator<T> for Port<T>
  * removed peek
  * Type::new
* Removed rt::comm
This pull request completely rewrites std::comm and all associated users. Some major bullet points:

* Everything now works natively
* oneshots have been removed
* shared ports have been removed
* try_recv no longer blocks (recv_opt blocks)
* constructors are now Chan::new and SharedChan::new
* failure is propagated on send
* stream channels are 3x faster

I have acquired the following measurements on this patch. I compared against Go, but remember that Go's channels are fundamentally different than ours in that sends are by-default blocking. This means that it's not really a totally fair comparison, but it's good to see ballpark numbers anyway:

```
          oneshot   stream   shared1
std        2.111     3.073    1.730
my         6.639     1.037    1.238
native     5.748     1.017    1.250
go8        1.774     3.575    2.948
go8-inf    slow      0.837    1.376
go8-128    4.832     1.430    1.504
go1        1.528     1.439    1.251
go2        1.753     3.845    3.166
```

I had three benchmarks:

* oneshot - N times, create a "oneshot channel", send on it, then receive on it (no task spawning)
* stream - N times, send from one task to another task, wait for both to complete
* shared1 - create N threads, each of which sends M times, and a port receives N*M times

The rows are as follows:

* `std` - the current libstd implementation (before this pull request)
* `my` - this pull request's implementation (in M:N mode)
* `native` - this pull request's implementation (in 1:1 mode)
* `goN` - go's implementation with GOMAXPROCS=N. The only relevant value is 8 (I had 8 cores on this machine)
* `goN-X` - go's implementation where the channels in question were created with buffers of size `X` to behave more similarly to rust's channels.
This is awesome (and the docs are great)! I assume
Eventually I want