Rewrite std::comm #10830
Conversation
This is very exciting.
I want to read this all carefully, but I've only gotten through the doc comments tonight.
Do you have a sense for why oneshot performance is 3x slower despite stream being faster?
I do indeed. Right now oneshots are actually super-optimized for their use case. An allocation of a oneshot channel is one tiny allocation of a box to share between the two ends. An allocation of a stream in this pull request, however, is three allocations (hence the ~3x slowdown). In profiling, the creation of the channel completely dominated the oneshot benchmark.
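To make the allocation argument concrete, here is a toy model of a oneshot in modern Rust (this is an illustrative sketch, not the `std::comm` implementation): both endpoints share a single heap allocation (the `Arc`'s block), which is why creating one is so much cheaper than a multi-allocation stream.

```rust
use std::sync::{Arc, Condvar, Mutex};

// One shared allocation holds both the value slot and the wakeup condvar.
struct OneshotInner<T> {
    slot: Mutex<Option<T>>,
    cond: Condvar,
}

// Creating the channel costs exactly one allocation, shared by both ends.
fn oneshot<T>() -> (Arc<OneshotInner<T>>, Arc<OneshotInner<T>>) {
    let inner = Arc::new(OneshotInner {
        slot: Mutex::new(None),
        cond: Condvar::new(),
    });
    (inner.clone(), inner) // (sender half, receiver half)
}

fn send<T>(chan: &OneshotInner<T>, value: T) {
    *chan.slot.lock().unwrap() = Some(value);
    chan.cond.notify_one();
}

fn recv<T>(port: &OneshotInner<T>) -> T {
    let mut guard = port.slot.lock().unwrap();
    loop {
        match guard.take() {
            Some(v) => return v,
            None => guard = port.cond.wait(guard).unwrap(),
        }
    }
}

fn main() {
    let (tx, rx) = oneshot::<i32>();
    send(&tx, 42);
    assert_eq!(recv(&rx), 42);
}
```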
//
// Rust channels come in two flavors: streams and shared channels. A stream has
// one sender and one receiver while a shared channel could have multiple
// receivers. This choice heavily influences the design of the protocol set |
a shared channel could have multiple senders, no?
Whoops, good catch!
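For reference, the stream vs. shared-channel distinction maps onto today's `std::sync::mpsc` API roughly as follows (a modern-Rust sketch; the PR-era names `Chan`/`SharedChan` differ, and multiple senders are obtained by cloning):

```rust
use std::sync::mpsc::channel;
use std::thread;

// Fan-in over a shared channel: `n` cloned senders feed one receiver.
fn fan_in(n: u64) -> u64 {
    let (tx, rx) = channel();
    for i in 0..n {
        let tx = tx.clone();
        thread::spawn(move || tx.send(i).unwrap());
    }
    drop(tx); // drop the original sender so rx.iter() terminates
    rx.iter().sum()
}

fn main() {
    // Stream: exactly one sender, one receiver.
    let (tx, rx) = channel();
    thread::spawn(move || tx.send(1).unwrap());
    assert_eq!(rx.recv().unwrap(), 1);

    // Shared channel: many senders, still one receiver.
    assert_eq!(fan_in(4), 0 + 1 + 2 + 3);
}
```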
@alexcrichton Are oneshots going to be added again in the future if folks want better performance for that use case?
//!
//! # Example
//!
//! // Create a simple streaming channel
Wouldn't it be better to enclose the code sample in `~~~rust` … `~~~` fences for syntax highlighting?
I would rather try not adding oneshots back to begin with. The use case in which they are more efficient is when the creation of the channel is far more common than the usage of the channel. We currently don't have much code that has its bottleneck in that area. Another snag would be implementing
@alexcrichton There's an issue that is still unsolved: SharedChan can be used to create cycles. It has always existed, but it might be the right time to consider it since we are rewriting it and changing the API anyway. I filed it as issue #10835 so the discussion can happen there instead of on this pull request, since it's somewhat orthogonal to implementation issues.
@alexcrichton Is access to the to_wake field properly synchronized? It doesn't seem to be protected by any locks, and doesn't seem to be accessed with atomics, but it's very possible I misunderstand something since I only glanced at the code.
@alexcrichton The SPSC queue's web page adds padding to put the consumer and producer sides of the queue on different cache lines, but this pull request forgets to do so.
@alexcrichton The SPSC code uses Relaxed, which seems wrong: it should use Release on most stores and Consume on most loads like the code in the linked web page does. In general, it seems that all the synchronization-related parts need careful expert review, although the high-level design might be correct.
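The Release/Acquire pairing the review asks for can be sketched in modern Rust as follows (Rust exposes the C++11-style orderings but not Consume, so Acquire is the conservative stand-in on the load side; this is an illustration, not the PR's queue):

```rust
use std::sync::atomic::{AtomicBool, AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

// Publish a value from a producer to a consumer: a Release store on the
// flag pairs with Acquire loads so the payload write is guaranteed visible.
fn handoff() -> u32 {
    let data = Arc::new(AtomicU32::new(0));
    let ready = Arc::new(AtomicBool::new(false));

    let (d, r) = (data.clone(), ready.clone());
    let producer = thread::spawn(move || {
        d.store(42, Ordering::Relaxed);   // the payload write itself
        r.store(true, Ordering::Release); // publish: all prior writes become visible...
    });

    // ...to any thread that observes the flag with an Acquire load.
    while !ready.load(Ordering::Acquire) {
        std::hint::spin_loop();
    }
    producer.join().unwrap();
    data.load(Ordering::Relaxed)
}

fn main() {
    assert_eq!(handoff(), 42);
}
```

With both orderings Relaxed, nothing forbids the consumer from seeing `ready == true` while still reading a stale `data`.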
To the best of my knowledge, yes.
Yes, I wasn't able to measure any difference and I wanted to make the struct smaller (while I was optimizing the oneshot case).
Whoops, I reverted everything away from
@alexcrichton It seems that the handling of Packet::cnt is wrong. Specifically, Packet::increment can do a fetch_add on a channel with cnt == DISCONNECTED, and the code then resets it to DISCONNECTED, but this means that in the race window any negative value can represent a disconnected channel. Hence Packet::decrement should consider any negative cnt as disconnected instead of asserting that if cnt != DISCONNECTED then cnt >= 0, which is false; likewise all other checks for DISCONNECTED need to be fixed. BTW, this assumes that the code never makes cnt negative for reasons other than disconnection, which seems true but isn't totally clear. Alternatively, you can use a cmpxchg in Packet::increment so that if cnt == DISCONNECTED, then cnt is never changed, and you can then use negative cnt for other things.
EDIT: ok, it seems that the cnt field is used to represent whether to_wake is valid. However, what if a producer thread stops indefinitely while in the middle of reading to_wake? Producers don't seem to block consumers writing to_wake, so it seems it could be overwritten while a producer reads it. In other words, I think you need to make to_wake a 1-word structure and manage it with atomics, or put a mutex around it. Also, anything that to_wake points to must live infinitely if you don't use a mutex (since a producer might stall forever while waking up). This seems to be the case already on 1:1, since the wakeup mutex is either in an Arc, or in the channel, but I'm not sure whether the M:N case is correct.
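The two increment strategies under discussion can be sketched side by side (a simplified model of the cnt field, with the sentinel and function names chosen here for illustration):

```rust
use std::sync::atomic::{AtomicIsize, Ordering};

const DISCONNECTED: isize = isize::MIN;

// fetch_add-based increment: briefly leaves DISCONNECTED + 1 visible before
// the sentinel is restored, so during that window every reader must treat
// *any* such negative value as "disconnected", not just the exact sentinel.
fn increment_xadd(cnt: &AtomicIsize) -> isize {
    let prev = cnt.fetch_add(1, Ordering::SeqCst);
    if prev == DISCONNECTED {
        // restore the sentinel; other threads may have observed the race window
        cnt.store(DISCONNECTED, Ordering::SeqCst);
    }
    prev
}

// CAS-based increment: a disconnected channel is never modified at all, so
// DISCONNECTED stays exact and other negative values remain free for other uses.
fn increment_cas(cnt: &AtomicIsize) -> isize {
    let mut cur = cnt.load(Ordering::SeqCst);
    loop {
        if cur == DISCONNECTED {
            return DISCONNECTED;
        }
        match cnt.compare_exchange(cur, cur + 1, Ordering::SeqCst, Ordering::SeqCst) {
            Ok(prev) => return prev,
            Err(actual) => cur = actual,
        }
    }
}

fn main() {
    let a = AtomicIsize::new(DISCONNECTED);
    increment_xadd(&a);
    assert_eq!(a.load(Ordering::SeqCst), DISCONNECTED);

    let b = AtomicIsize::new(DISCONNECTED);
    assert_eq!(increment_cas(&b), DISCONNECTED);
    assert_eq!(b.load(Ordering::SeqCst), DISCONNECTED);
}
```

The trade-off raised later in the thread: the CAS loop may retry under contention, while fetch_add always completes in one atomic instruction at the cost of the fuzzy sentinel.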
BTW, if you use Acquire and Consume there should be no code generation difference on x86 compared to Relaxed, because the x86 hardware provides those guarantees for all code (all stores are ordered, and dependent loads are too).
@alexcrichton What if the user specifies the same port multiple times in the select() array? It seems this will break all sorts of invariants in the current code, for instance by decrementing cnt multiple times. There should be a check for it, or maybe select should take &muts. BTW, select is inefficient because it is O(n^2), since reading from each of n pipes requires selecting on all n: it should be replaced by some sort of Select struct where ports can be added and removed and the actual select() call has no arguments.
% RESCHED_FREQ should be & RESCHED_FREQ_MASK since division is extremely slow, and power-of-two resched frequencies should be enough, assuming that this thing is needed at all. As the code stands, try() might well spend the majority of its execution time dividing unnecessarily on some architectures... HOWEVER, this algorithm is probably completely broken anyway, because with SharedChan a single unlucky producer might never resched, since other producers might hit all the cnt values that are multiples of RESCHED_FREQ... Likewise, if one sends on several different channels, it may also never reschedule. It should probably just call maybe_yield, which should do the & RESCHED_FREQ_MASK internally on a #[thread_local] variable that it also increments (#[thread_local] is very fast to access, certainly far faster than dividing...)
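The mask trick being suggested: for a power-of-two divisor, `x % n` and `x & (n - 1)` are identical, and the mask compiles to a single AND rather than a division. A quick sketch (the constant names here are hypothetical, echoing the review's suggestion):

```rust
fn main() {
    const RESCHED_FREQ: u64 = 256; // must be a power of two for the mask trick
    const RESCHED_MASK: u64 = RESCHED_FREQ - 1;

    // The two expressions agree for every input when the divisor is 2^k.
    for x in [0u64, 1, 255, 256, 257, 1_000_003] {
        assert_eq!(x % RESCHED_FREQ, x & RESCHED_MASK);
    }
}
```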
I heavily use oneshots in my application and feel they offer a really nice API for asynchronous method calls. But I'm basically using oneshots as promises or futures, so if a library along those lines is planned for the future I'd be fine with oneshots being removed. Also I'm not sure why @alexcrichton would think oneshots can't be selected over. Wouldn't any of the suggestions given in issue #10624 work?
Huh? At worst it will get the TLS offset from a global variable in the dynamic library (or from the GOT to be more precise on x86-64), which might take something like 5 instructions. Of course this assumes a good implementation like the one in Linux/glibc. |
It will end up making calls to the
Sorry, 5 instructions was optimistic, I guess it's more around 20 for dynamic libraries in the fast path. __tls_get_addr is this, where of course the ifs are not executed for accesses after the first if no libraries have been loaded in between:
It's perfectly normal to see a large negative value in cnt. The only significant value is -1.
That is not true. I explicitly mention in the documentation describing how the channels work that the count can be very negative.
This does not have the progress semantics I would want. While most senders would reasonably only execute a cmpxchg a few times, it's much nicer to guarantee that only one atomic instruction is run.
I do not see the problem. The producer owns
No, as you found out, this is only read when the value is -1. The value is only -1 when
No, acquire/release semantics will guarantee that all writes on behalf of the producer are visible to the consumer.
No, it just needs to be guaranteed to be alive for its use. The scheduler provides separate guarantees which enforce this. Additionally, see my XXX comment about how the
Ah I forgot that I forgot to handle this. My original redesign had all the methods take
Sounds like you're describing the select syscall, which has nothing to do with
I have seen this nowhere in my profiles. Compared to an atomic instruction, this is not a concern.
I do not believe that this is a matter of concern unless a concrete profile shows it to be.
This rescheduling is not used for correctness, it is used to prevent starvation. If a task only sends a few times on a few channels, then there's no manual rescheduling necessary. The starvation we're protecting against is when one task simply infinitely sends data without ever yielding to the scheduler (which this implementation is guaranteed to eventually check for a rescheduling).
That is not true. If the task exits after sending on channels, then there is most certainly a rescheduling. As mentioned above, this is just starvation prevention, not pre-emption.
Not in my profiling. This is exactly why I only added the "check to maybe reschedule every so often". The TLS accesses were slowing the benchmark down by about 50-80%. The executables I was working with were all statically linked, and from what @thestinger mentioned, something may be going wrong if they're so slow.
There is no magical "let's select over everything all at once" function to call. That has to be implemented by someone. Having implemented this iteration of select, I can tell you personally that it would be difficult. I'm not saying it's impossible. My biggest concern is the type signature of select. This rewrite changes it to
Of those methods, I prefer the third, but it still doesn't sit well with me. As a result, I have chosen to not reimplement oneshot ports. Remember though, the only semantic difference about a oneshot is that you can only use it once. A normal channel will suffice as a replacement in all circumstances (without having the same invariant of being used once). The only reason to prefer a oneshot (ignoring semantics) is if your application is dominated by the creation of channels (in which case allocating a stream is slower). I do not foresee this being very common (I could very well be wrong).
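The "normal channel as a oneshot" substitution looks like this with the modern `std::sync::mpsc` API (the PR-era constructors differ in name only; `async_square` is a made-up example function):

```rust
use std::sync::mpsc::{channel, Receiver};
use std::thread;

// Emulating a oneshot/promise with an ordinary channel: send exactly one
// value, then let the sender drop so the receiver observes disconnection.
fn async_square(x: u32) -> Receiver<u32> {
    let (tx, rx) = channel();
    thread::spawn(move || {
        tx.send(x * x).ok(); // the single "promise" resolution
        // tx is dropped here, so a second recv() reports disconnection
    });
    rx
}

fn main() {
    let rx = async_square(6);
    assert_eq!(rx.recv().unwrap(), 36);
    assert!(rx.recv().is_err()); // already consumed and disconnected
}
```

The only invariant lost relative to a true oneshot is that nothing statically prevents a second send.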
Let's please keep this pull request on topic. Discussions of how
The code asserts in several places that n >= 0 in matches where the only other case is n == DISCONNECTED, as far as I can tell. This is wrong since a disconnected channel can have DISCONNECTED + 1 in the race window caused by your use of xadd instead of cmpxchg, and in fact DISCONNECTED + k where k < max_simultaneously_running_tasks.
And then the producer owns to_wake, but the receiver can overwrite it as it writes again, as far as I can tell.
If the producer does not block receivers while reading to_wake (such as by having both take a mutex), then the producer thread might stop indefinitely between reading to_wake and waking up the task while receivers select on it multiple times, which means to_wake must be valid forever (the only thing that is guaranteed is that the channel will stay alive, so storing stuff in the channel is OK).
If you are waiting on a million ports, you'll need to enqueue the task to wait on each of the million ports every time you receive a message. So to receive a message each from a million ports, it takes a trillion iterations of the select inner loop, which is clearly not good. This is exactly the same issue the OS select() and poll() syscalls have (since the poll() syscall does exactly the same thing as your code does in kernel), and it's why it has been replaced with epoll and kqueue in properly written software with unbounded amounts of fds to select over.
That's probably because you profiled on an out-of-order architecture with hardware division (like an x86-64 CPU), where the branch predictor allowed execution to continue speculatively despite the division not having finished. Try it on an in-order architecture without hardware division (low-end ARM CPUs) and it should be far more visible. In general, one should absolutely never divide (or use %, which is the same) unless it's unavoidable (btw, this is a 64-bit division on 64-bit machines, which is even worse). In this case, there's absolutely no justification to use a division, since the resched frequency is arbitrary anyway and can thus be made a power of two.
Sorry this was a little unclear. I've gone through a bunch of iterations and the relevant comment appears to have been removed. Regardless, you are correct in this description. The bug would occur if there were enough increments to bring the value to -1 from the disconnected state before the original task put disconnected back into the slot. On a 64-bit architecture, that will never happen. On a 32-bit architecture, I'm willing to bet money that will never happen.
I'm not understanding where you think the problem is. Can you provide me a trace which exposes the bug?
Ok, but it's also not helpful to say "select is slow". The select implementation is not slow at all, rather the interface inherently prevents an "efficient implementation" as you're expecting. This just means that we would need to evaluate whether we would need another abstraction. This abstraction would probably be something along the lines of:
And that would prevent having to re-sleep on ports all the time (possibly, I have not considered implementation details). Regardless, the speed of
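A registration-based abstraction of the kind being discussed might look roughly like this. This is purely a hypothetical API sketch (the struct and method names are invented here, and the `wait` below naively polls rather than sleeping); the point is that ports are registered once, so each wakeup costs O(n) instead of re-arming every port at every call site:

```rust
use std::sync::mpsc::{channel, Receiver, TryRecvError};

// Hypothetical registration-based Select: add ports once, then call wait()
// with no arguments, epoll-style, rather than passing the full array each time.
struct Select<T> {
    ports: Vec<Receiver<T>>,
}

impl<T> Select<T> {
    fn new() -> Select<T> {
        Select { ports: Vec::new() }
    }

    // Returns a handle identifying this port in wait() results.
    fn add(&mut self, port: Receiver<T>) -> usize {
        self.ports.push(port);
        self.ports.len() - 1
    }

    // Returns (handle, value) for the first registered port with data.
    // A real implementation would block instead of spinning.
    fn wait(&self) -> (usize, T) {
        loop {
            for (i, port) in self.ports.iter().enumerate() {
                match port.try_recv() {
                    Ok(v) => return (i, v),
                    Err(TryRecvError::Empty) | Err(TryRecvError::Disconnected) => {}
                }
            }
            std::thread::yield_now();
        }
    }
}

fn main() {
    let (tx_a, rx_a) = channel();
    let (tx_b, rx_b) = channel::<&str>();
    let mut sel = Select::new();
    let a = sel.add(rx_a);
    let b = sel.add(rx_b);
    tx_b.send("hello").unwrap();
    let (idx, msg) = sel.wait();
    assert_eq!(idx, b);
    assert_ne!(idx, a);
    assert_eq!(msg, "hello");
    drop(tx_a);
}
```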
(@bill-myers you may already know this, but one can put comments on individual lines of the code on the "Files Changed" tab of pull requests, which gives better locality for a review.)
* Streams are now ~3x faster than before (fewer allocations and more optimized)
* Based on a single-producer single-consumer lock-free queue that doesn't always have to allocate on every send.
* Blocking via mutexes/cond vars outside the runtime
* Streams work in/out of the runtime seamlessly
* Select now works in/out of the runtime seamlessly
* Streams will now fail!() on send() if the other end has hung up
* try_send() will not fail
* PortOne/ChanOne removed
* SharedPort removed
* MegaPipe removed
* Generic select removed (only one kind of port now)
* API redesign
  * try_recv == never block
  * recv_opt == block, don't fail
  * iter() == Iterator<T> for Port<T>
  * removed peek
  * Type::new
* Removed rt::comm
This pull request completely rewrites std::comm and all associated users. Some major bullet points:

* Everything now works natively
* oneshots have been removed
* shared ports have been removed
* try_recv no longer blocks (recv_opt blocks)
* constructors are now Chan::new and SharedChan::new
* failure is propagated on send
* stream channels are 3x faster

I have acquired the following measurements on this patch. I compared against Go, but remember that Go's channels are fundamentally different than ours in that sends are by-default blocking. This means that it's not really a totally fair comparison, but it's good to see ballpark numbers anyway:

```
          oneshot   stream   shared1
std        2.111     3.073    1.730
my         6.639     1.037    1.238
native     5.748     1.017    1.250
go8        1.774     3.575    2.948
go8-inf    slow      0.837    1.376
go8-128    4.832     1.430    1.504
go1        1.528     1.439    1.251
go2        1.753     3.845    3.166
```

I had three benchmarks:

* oneshot - N times, create a "oneshot channel", send on it, then receive on it (no task spawning)
* stream - N times, send from one task to another task, wait for both to complete
* shared1 - create N threads, each of which sends M times, and a port receives N*M times

The rows are as follows:

* `std` - the current libstd implementation (before this pull request)
* `my` - this pull request's implementation (in M:N mode)
* `native` - this pull request's implementation (in 1:1 mode)
* `goN` - go's implementation with GOMAXPROCS=N. The only relevant value is 8 (I had 8 cores on this machine)
* `goN-X` - go's implementation where the channels in question were created with buffers of size `X` to behave more similarly to rust's channels.
This is awesome (and the docs are great)! I assume
Eventually I want