Proposal: Event loop redesign #8224
Assuming we continue with the idea of having an "event loop" abstraction, I would like to see some form of consistency from it across different OSs, at least when it comes to the big 3. More specifically, I believe one important goal should be to ensure, as much as possible, that if an application doesn't deadlock on OS Y it also doesn't deadlock on OS X (hah pun kinda intended). I am not sure what the current state of things is, but one example that comes to mind is file I/O, which only simulates evented I/O by using a thread pool. Somewhat related to the above goal, there should be an API (or a clear pattern) for dealing with blocking I/O (and CPU-intensive operations). As a concrete example, in bork I used to steal one thread from the event loop when using BearSSL and, depending on the CPU configuration of the host (and other implementation details of the event loop that fortunately in this case didn't pose a problem), it could simply deadlock. If instead the event loop gets broken down into discrete components, then this question IMO still stands and would need to be readapted for the new context. |
I want one that prints "bruh" asynchronously |
I'll put my high-level thoughts down in points here since my thoughts are a little all over the place when it comes to Zig's async story. The perspective I'm coming from with my thoughts here is after spending the last few months implementing a few resource-constrained high-performance distributed systems and async-driven sensor applications in Zig, and spending the last month researching how async works in other programming languages.
I personally have a lot of frustrations over some of the design decisions in event loop amalgamations featured in the libraries and runtimes for Go/Rust/Nim/Pony/etc. At the very least, aiming towards modularizing the event loop is work that can be done with Zig today that would undoubtedly be reusable should the time come that we eventually come to a consensus on an all-in-one event loop design.
Given that async file I/O really is something that has only recently been made available in Linux via io_uring, the There should also be cross-platform support for async Ctrl+C handling in
This comes from my experience that abstracting a proactor-based I/O poller (e.g. IOCP) to follow a reactor-based design requires quite a bit of hackery and workarounds vs. abstracting a reactor-based I/O poller to follow a proactor-based design. Having such a design where either a callback may be provided, or an async frame may be created to wrap under a callback, allows for users to have flexibility in how they want to allocate the memory necessary to asynchronously drive an I/O primitive (i.e. on the heap, or on the stack).
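A minimal sketch of the "reactor wrapped to look like a proactor" direction described above, in Python for brevity rather than Zig. The `CompletionPoller` class and its `submit_recv` method are hypothetical names invented for this illustration; only `selectors` and `socket` are real standard-library modules. The caller submits an operation plus a callback, and the poller waits for readiness, performs the read itself, and reports the completed result.

```python
import selectors
import socket


class CompletionPoller:
    """Completion-style (proactor) API emulated on a readiness-based selector."""

    def __init__(self):
        self._selector = selectors.DefaultSelector()
        self._pending = 0

    def submit_recv(self, sock, max_bytes, callback):
        # Register interest in readability. The poller, not the caller,
        # performs the recv and hands the finished result to the callback.
        # This sketch allows only one outstanding operation per socket.
        self._selector.register(sock, selectors.EVENT_READ, (max_bytes, callback))
        self._pending += 1

    def run_once(self, timeout=None):
        for key, _events in self._selector.select(timeout):
            max_bytes, callback = key.data
            # One-shot semantics: drop the registration before completing.
            self._selector.unregister(key.fileobj)
            self._pending -= 1
            callback(key.fileobj.recv(max_bytes))

    def run(self):
        # Drive the poller until every submitted operation has completed.
        while self._pending > 0:
            self.run_once()


def demo():
    reader, writer = socket.socketpair()
    results = []
    poller = CompletionPoller()
    poller.submit_recv(reader, 1024, results.append)
    writer.sendall(b"ping")
    poller.run()
    reader.close()
    writer.close()
    return results
```

Going the other way (presenting a completion-based poller such as IOCP as readiness notifications) has no equally direct mapping, which is the asymmetry the comment points at.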
With a production-ready I/O poller and thread pool abstraction, either way, it is most likely the case that you wouldn't settle with general-purpose implementations of async I/O or synchronization primitives that are easy to use if you are in a situation where the smallest amount of overhead present in supporting cancellation and compositional operations matter to you.
Regarding a few of the points above, I didn't dig that deep into the specifics as I'd love to see other people's thoughts and feedback first given my one-sided point of view (i.e. how module-level configuration should look like, how async operations are to be facilitated by the standard library to be composed together, how I/O poller-related state should be separated and contained, etc.). That being said, let me know if you spotted any glaring holes in my thoughts, or if you'd like me to explain more of my rationale on any one of the points above. |
Any further opinions on this? |
Thanks @kprotty for setting this up.
No, and yes, depending on the plane: control plane or data plane. Having a clear distinction between these helps to inform the design decisions for each separately because the safety and performance profiles are different. For example, given Zig's approach to safety and performance, I would really want a single-threaded control plane, with this single-threaded control plane outsourcing to the kernel thread pool for I/O through io_uring (or kqueue or IOCP for other platforms) to reduce the performance cost of syscalls, context switches and locks, and to side-step or at least diminish the gap for data races for the most part. Backwards compatibility for older Linux platforms can be shimmed with an I/O threadpool, but io_uring should drive the design because it reflects where hardware performance is at today.
QoS is important. In addition to the single-threaded control plane (with the kernel threadpool for I/O) I really want additional multi-threaded data plane(s) for CPU-intensive work [xor] long-running work, with clear QoS, i.e. one data plane for user-defined CPU-intensive tasks (e.g. bulk crypto like checksumming 256 MiB chunks, or erasure coding, or deduplicating them), and another separate data plane for long-running background tasks (stuff like DNS requests that are not CPU-intensive, that could take minutes and that could be context-switched without a performance hit). So the data plane for CPU-intensive tasks should be sized according to the number of cores to reduce context switching. On the other hand, the data plane for long-running background tasks would be okay with context switching (being on a different order compared to the task time), and would be sized larger than the number of cores to support concurrency.
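A rough sketch of the two data planes described above, in Python for brevity. The pool names, the `4x` oversubscription factor, and the `checksum`/`slow_lookup` helpers are assumptions made for illustration, not anything proposed in the thread. Python threads will not run the checksums in parallel because of the interpreter lock, so the sketch only shows the sizing policy, not the performance.

```python
import os
import time
import zlib
from concurrent.futures import ThreadPoolExecutor

CORES = os.cpu_count() or 1

# Data plane for CPU-intensive work: one worker per core to limit context switching.
cpu_plane = ThreadPoolExecutor(max_workers=CORES, thread_name_prefix="cpu")

# Data plane for long-running, mostly idle work: deliberately larger than the
# core count, since these tasks spend their time waiting (factor 4 is arbitrary).
background_plane = ThreadPoolExecutor(max_workers=CORES * 4, thread_name_prefix="bg")


def checksum(chunk: bytes) -> int:
    # Stands in for bulk crypto or erasure coding on a large chunk.
    return zlib.crc32(chunk)


def slow_lookup(name: str) -> str:
    # Stands in for a slow, non-CPU-bound request such as DNS.
    time.sleep(0.01)
    return name.upper()


def demo():
    chunks = [bytes([i]) * 1024 for i in range(4)]
    sums = list(cpu_plane.map(checksum, chunks))
    names = list(background_plane.map(slow_lookup, ["a.example", "b.example"]))
    return chunks, sums, names
```

Keeping the two pools separate means a burst of slow background tasks cannot occupy the workers reserved for CPU-bound work.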
This could solve a lot of pain for people. Cancellation can also be racy and difficult to get right. Then again, it's hard at every layer and solving this lower down can help higher up. My gut feel is it's hard to do well. On the other hand, there could be cancellation for everything, but with a completion event for when the cancellation succeeds, so that this could shim over underlying abstractions that don't support cancellation, while not holding back the abstractions that do.
No. Although there might be a good argument to be made for something like registered buffers, with a user-provided allocator of course. The focus overall though should be on zero-copy into user-provided buffers. For TigerBeetle, everything is static allocation (even disk) and zero-copy. We would never use something that allocated dynamically, unless this was at initialization, where we have some kind of control over the parameters, and again for something like registered buffers. Allocation for stuff like internal queues etc. is also okay but it needs to be subject to the user's restrictions and at initialization only.
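One way to read "cancellation for everything, with a completion event" is sketched below in Python. The `Operation` class and its methods are hypothetical and assume a single-threaded loop: `cancel` only records the request, and the caller learns the outcome from the one completion that the backend always delivers. A backend that cannot abort the underlying work simply lets it finish, and the result is reported as cancelled.

```python
class Operation:
    """An in-flight operation that always completes exactly once."""

    def __init__(self, on_complete):
        self._on_complete = on_complete
        self._cancel_requested = False
        self._done = False

    def cancel(self):
        # Only records the request. The caller must still wait for the
        # completion before reusing any buffers tied to the operation.
        self._cancel_requested = True

    def finish(self, result):
        # Called by the backend exactly once, when the underlying work has
        # stopped (either because it ran to the end or because it was aborted).
        assert not self._done, "an operation completes only once"
        self._done = True
        if self._cancel_requested:
            self._on_complete(("cancelled", None))
        else:
            self._on_complete(("ok", result))


def demo():
    events = []

    first = Operation(events.append)
    first.finish(42)            # normal completion

    second = Operation(events.append)
    second.cancel()             # request arrives while the work is in flight
    second.finish(7)            # backend could not abort; result is discarded
    return events
```

Because the cancelled operation still produces a completion, the caller has a single, unambiguous point at which the operation's memory is safe to reuse.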
Zig's async is awesome of course. However, here's something surprising from our experience. We tried out Zig's async plus io_uring for TigerBeetle (https://github.com/coilhq/tigerbeetle/blob/main/src/io_async.zig) and then actually went back to explicit callbacks plus io_uring in the end (https://github.com/coilhq/tigerbeetle/blob/main/src/io.zig). The reason being that we were doing this for a distributed consensus protocol where we wanted to make sure our message handlers run to completion with the same system state, whereas coroutines increase dimensionality while a function is still running. We wanted clear jumping off points to I/O just because getting the consensus protocol right was hard enough. This is specific to our use-case for TigerBeetle only, it might not be relevant here, but wanted to share the anecdote if it helps.
The money question. We could start with the primitives. For example, @ifreund and I worked hard to get an event loop primitive going for io_uring that could mimic epoll's "poll for some amount of time": https://github.com/coilhq/tigerbeetle/blob/main/src/io.zig#L52-L83
io_uring from the 5.6 kernel on up has almost everything we need for the data plane intensive aspects of storage and network I/O, and even the 5.8 kernel is gaining support. io_uring is a beast and it's only accelerating from here. It's something that's hard to ignore. While we could plan according to current support for the 5.8 Linux kernel right now, perhaps a better approach would be to plot a course taking Zig's 1.0.0 trajectory into account. With that in mind, I really want to see the IO component of the event loop taking full advantage of io_uring as a first-class citizen, because that's where the puck is already at, just not evenly distributed. NVME devices have become so fast that the cost of a context switch is on the same order as an I/O operation, so io_uring makes sense to support the next generation of lockless thread-per-core designs, which is what we're doing for TigerBeetle to support a million financial transactions a second. It's also a perfect time for Zig to take advantage of this shift in API design. I would really want to start with io_uring and work backwards rather than the other way round. A worker threadpool or traditional approaches to event loops should be last-resort, not something driving our design. The bottom line: Zig should be "High-Definition" or "Dolby Atmos" where the machinery supports it. |
Thanks @jorangreef for the detailed response! This is an area where I strongly believe Zig should make some good decisions after having interacted with other async/IO environments, so any input/ideas are welcome.
Could you elaborate on the differences between the two? In regards to performance, particularly throughput, it sounds difficult to reach its optimum if there's only one thread which handles control flow (if this is what you're implying). Having multiple threads performing application logic and without locking sounds ideal for throughput in the absence of effective static work distribution. I do agree that io_uring should somewhat drive the design for the I/O subsection of the Event Loop. I say "somewhat" as SQEs are allocated internally on flush (at least currently) and it may not be what is desired for io_uring compatibility layers.
This has been my experience as well
This was my primary intent when posing the question. While I have working thread pools which only utilize intrusive memory, I have been interested if using dynamically growing but contiguous task queues would help in latency. This sort of memory allocation sounds contrary to your idea though.
By "clear", would this be a reference to a function call doing I/O without the caller's awareness? Or some other scenario?
This is pretty neat! Like the idea of intrusive SQE/CQE structs and the flushing/polling strategy. This would solve the "io_uring SQE allocation" issue at the abstraction level.
This was my initial experience too. Although having micro-benchmarked it against epoll, the decrease in syscalls and use of kernel thread-pool that sounds more efficient on the surface ended up being slower. Do you (or anyone else really) have any ideas why this might be the case?
Definitely. Although it's a bit rough trying to envision how this would manifest itself higher-level API-wise. Have personally been looking into libdispatch for scheduling and queuing insight/inspirations recently. |
Sure! In the abstract, the control plane is just anything that's not in the critical data path, it's safety critical but not performance critical, whereas the data plane has the inverse profile. For example, it would be fine to have (and one would want) plenty of assertions in the control plane for safety, whereas the data plane is in the critical path with huge volumes flowing through it. Here in the data plane one would want to optimize for performance: cache misses, context switches, branch mispredicts etc. The data plane would be like a water mains or oil pipeline, there's nothing inside to obstruct flow. The control plane would be all the safety checks you do outside the pipeline as an operator, the little control box that adjusts pressure and controls the pipeline. Having this clean split in the design provides both safety and performance without compromising either. It also obviates in large part the need for a borrow checker, although data races are still possible. As a concrete example of this technique, in TigerBeetle, we have around 10,000 financial transactions in a batch. Our single-threaded control plane is responsible for switching each of these batches through the consensus protocol, and we amortize all runtime bounds checks, assertions, syscalls and I/O across the batch, so that these become almost free, yet we have literally hundreds of assertions, and we're doing
I think there's also a third option. For example, in TigerBeetle, if we want to do crypto for a batch (we have a use-case for this for Interledger), we would then drop that into a multi-threaded data plane with QoS for CPU-intensive work, by dividing the batch up across threads. We wouldn't want to solve that kind of performance problem with a multi-threaded control plane, because that would introduce too much complexity into a safety-critical domain, and also be conflating the control plane and data plane. In terms of this approach, the performance-critical work should happen in the multi-threaded data plane because that's been optimized for it. If we do find any scalability limit, then we would simply move to multiple event loop abstractions per domain, in our case across networking, storage, and state machine, all connected by ring buffers, and all the control planes still single-threaded. This is also not new to TigerBeetle, we based this design on the work done by LMAX, a high-performance trading platform, which has single-threaded control planes for all event loops, and which has multiple event loops connected by ring buffers. They found the single-threaded control plane is also faster because it lets the CPU run unimpeded by context switches, like a sprinter doing the 100m in a straight line without zigzags. Martin Thompson calls this "mechanical sympathy" or "the single writer principle" (great blog post). He's done some awesome talks on this that we've documented here: https://github.com/coilhq/tigerbeetle/blob/main/docs/DESIGN.md#references Redpanda is another new database, from people who worked on ScyllaDB, also doing thread-per-core.
Just speaking from our point of view, it would be totally fine to allocate this at initialization, which coincidentally is also how io_uring does it for the most part. Contiguous task queues would definitely be great for latency to reduce cache misses and avoid pointer-chasing. We would really like fixed-size bounded queues with an error for overflow and back pressure (in our IO code we still need to set our upper bounds for SQ overflow, we're yet to profile some of our upper bounds). Explicit limiting is a good thing, especially for any kind of distributed system. This is probably also an area in event loops where developers have been used to "hidden allocation" and unbounded IO depth but I think we could be more explicit about this.
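A minimal sketch of a fixed-size queue that is allocated once at initialization and reports overflow as an error, in Python for brevity. `BoundedQueue` and `QueueFull` are hypothetical names for this illustration. The explicit error is what lets the caller apply back pressure instead of growing memory without bound.

```python
class QueueFull(Exception):
    """Raised when a push would exceed the capacity fixed at initialization."""


class BoundedQueue:
    def __init__(self, capacity):
        if capacity <= 0:
            raise ValueError("capacity must be positive")
        # The only allocation: a contiguous ring of slots, sized up front.
        self._slots = [None] * capacity
        self._head = 0
        self._len = 0

    def push(self, item):
        if self._len == len(self._slots):
            # Surface the limit to the caller rather than allocating more.
            raise QueueFull("queue is at capacity")
        tail = (self._head + self._len) % len(self._slots)
        self._slots[tail] = item
        self._len += 1

    def pop(self):
        # Returns None when the queue is empty.
        if self._len == 0:
            return None
        item = self._slots[self._head]
        self._slots[self._head] = None
        self._head = (self._head + 1) % len(self._slots)
        self._len -= 1
        return item
```

A producer that catches `QueueFull` can stop submitting work until the consumer has drained some entries, which makes the upper bound on in-flight I/O explicit.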
Yes, this would be where the function call ends up doing I/O that suspends so that when the caller is resumed, the rest of the system has moved on to a new state. For example, in the Viewstamped Replication consensus protocol that we use, it's critical that distributed messages are only processed when a replica process' status is
Thanks! We learned a lot from the event loop gists you did!
Yes, there was a known networking regression in 5.7.16 that has since been patched and @axboe has also been optimizing against epoll significantly since then so that io_uring should be ahead on most benchmarks. In terms of storage, io_uring performance on NVME can be more than double compared to blocking syscalls and more than quadruple compared to an I/O threadpool like libuv. Overall, what excites me is that io_uring offers a single unified interface for networking and storage and potentially more and more syscalls down the line. There are some rough edges, like what we had to do to implement
Thanks for the link! I'm definitely looking up to you when it comes to anything like this, and look forward to seeing your work on this. |
Just an experience from studying the Zig language: when I saw that all memory-allocating functions take an allocator as a parameter, I expected that IO functions would take an event loop as a parameter, or at least that one could switch seamlessly between different implementations or instances of the main loop. Same as with allocators, users might have different needs for it, so I'd expect the standard library to play in concert with many possible implementations. Also, given how minimal Zig tries to be, I was really surprised by the fact that making a program async suddenly starts a number of threads by itself, which clashes with the "no hidden control flow" slogan on the homepage. |
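The "event loop as a parameter, like an allocator" idea can be sketched as follows, in Python for brevity. `QueueLoop`, `InlineLoop`, and `write_lines` are hypothetical names for this illustration. The function doing the work never touches a global loop; the caller decides which implementation it runs on.

```python
from collections import deque


class QueueLoop:
    """Defers every task until run() is called."""

    def __init__(self):
        self._ready = deque()

    def schedule(self, fn, *args):
        self._ready.append((fn, args))

    def run(self):
        while self._ready:
            fn, args = self._ready.popleft()
            fn(*args)


class InlineLoop:
    """Runs every task immediately, i.e. plain blocking behaviour."""

    def schedule(self, fn, *args):
        fn(*args)

    def run(self):
        pass


def write_lines(loop, sink, lines):
    # The loop is an explicit argument, in the same way an allocator would be.
    for line in lines:
        loop.schedule(sink.append, line)


def demo():
    deferred_sink, inline_sink = [], []

    deferred = QueueLoop()
    write_lines(deferred, deferred_sink, ["a", "b"])
    before_run = list(deferred_sink)   # nothing has executed yet
    deferred.run()

    write_lines(InlineLoop(), inline_sink, ["a", "b"])
    return before_run, deferred_sink, inline_sink
```

The same `write_lines` works with either loop, and nothing starts threads or runs tasks unless the caller's chosen loop does so.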
@vlada-dudr The goal of the discussion is to shed more light on the "hows" and "whys" of asynchronous computing. This also leads to a position that an "event loop" may not even be the right abstraction. Not all IO can be done asynchronously (or at least efficiently with async) and IO done with and without an "event loop" could have very different semantics. For example
Zig is minimal, but the stdlib is there to do certain things for you. This includes setting up custom handles, looking up vdso functions, discovering tls, etc. Spawning threads could be considered another thing it does to help do other stuff (simplify IO async frame scheduling) for you.
I don't think so because the creation of threads doesn't disrupt the control flow of your program. If you use an stdlib function which schedules to said threadpool, it could switch threads between |
Thanks for the reaction. Maybe I should have posted it on IRC, not here, so this thread can be free of random gibberish and stay highly technical. Cheers! |
@vlada-dudr For the "hows" and "whys" part, I meant it as context to transition into why I don't think passing an "event loop" around could be equated to the Allocator approach, rather than a critique of missing impl. I believe your post did touch "how" (explicit loop ref) and "why" (similar to allocator).
I think this is where I'm at a fork. On one hand, I would like to have a super-customizable and near-optimal async building-blocks library since I enjoy polishing concepts. But on the other hand, there seem to be enough people prioritizing simplicity/easy-understanding (even in discussions in other Zig communities; come join #async!), which comes at a price to the former. Practically, both have optimization opportunities in mind, but I'm having trouble deciding if Zig (the stdlib rather) should target "efficient hardware utilization" as a high priority or "more efficient hw util than other environments" (or something else?). Have any thoughts on this meta issue?
I'm a bit confused about how slippery this slope should be for the stdlib. In regards to the other useful features I noted that happen in
Would be helpful to have an example to go over. I've recently played around with a bring your own blocking impl for synchronization primitives which allows the same algs which implement things like Channels and Mutexes to work generically for blocking and asynchronous environments. Would it be something similar to this?
I'm of the opinion that gibberish is fine so long as it's on topic to the discussion. An ok-ish but not so great example lol. I could try hopping on IRC later today to continue the discussion if needed. |
My input requires a bit of context, so I'll start with that. Firstly, I'd like to somewhat echo an earlier comment that said:
Later on, the thread has moved more in the direction of discussing more state of the art approaches, e.g.:
With those in mind, this is what I'd like to say: I think that there's a difference between a good/optimal/pick-your-word event loop[1] design, and one that is appropriate for a standard library. My experience so far has been that Zig works for a large number of targets, meaning both architectures and OSes (or lack thereof, but that's probably out of scope for stdlib discussions). Given that, opting for a Linux-centric event loop design could end up at least somewhat dangerous/counterproductive:
I should say that I agree with the earlier quote
[1] or any other component for that matter |
FWIW
I agree, but am of a different notion on which of the two approaches would be labeled "advanced". What I would like to argue is that while an efficient design of the "traditional" approach of a multi-threaded work-stealing event loop could be more familiar to set up/interact with for someone coming from say Go/Rust/Erlang, it would be much harder for an stdlib maintainer to change, understand, and modify. Designing an efficient scheduler for such an event loop is non-trivial, and when you look at advanced work-stealing schedulers like those of Go, tokio, and rayon, only a handful of contributors make feature/bug changes due to their large scope, multi-purpose algorithms, and sometimes complexity. This has been my experience as well, although I'm open to being convinced that the ramp-up time may not be as big of an issue in practice.
Once again, I think this rings true especially for efficient multi-threaded applications. Vetting them can be difficult and understanding atomics / memory ordering is known to be one of the "background specialization" areas IME.
For some insight, we (@lithdew and I) recently created a prototype of a
I agree as well. The thing is that the higher level API of
I know I've been shilling the io_uring abstraction recently, but I have reasons for doing so:
Even though I've changed my stance on the "traditional" approach recently, I'm still very much open to going back if the supporting arguments are technically sound or if such a design is just not worth it for the standard library. I just know that either abstraction has a ton of room for optimization opportunities. |
Firstly, I really appreciate the extensive and thoughtful response! And to be clear, I am not opposed to a
Does the Zig stdlib event loop need to be particularly high performance? I think there's value in an implementation that tries moderately hard to be correct (thus understandable) and consistent across platforms (to a feasible extent), but puts performance as a second/third priority. One of the things I find appealing about Zig is that a lot of things that'd often be implemented as language features in other langs are just code which can be plugged and reused as any other. So, in principle, it shouldn't be difficult to plug in a high performance event loop instead of the std one[1]. And if that is difficult, perhaps looking at how to make that easier is a worthwhile avenue to explore?
[1] Having infra like this outside of stdlib of course comes with its own sets of issues and advantages, and I think that in the medium-long term those might be just as (if not more) important than the technical merits! C++'s |
Darwin (macOS, iOS, etc.) uses kqueue, an event notification scheme similar to Linux epoll. The primary differences are that it supports batched "epoll_ctl"s via kevent(), it doesn't coalesce read & write events on the same fd, and it supports a higher variety of notifications without requiring an fd (timerfd, eventfd, inotify, etc.). Its integration into an io_uring emulation layer would be similar to epoll's in a lot of aspects.
It's one of those things that are implicitly in the Zig zen, particularly With this, however, is also implied somewhat performant code. This focus on memory/operational efficiency in many aspects of the stdlib (unicode, MultiArrayList, HashMap.Entry, Allocators) helps design APIs that are performant but also not too difficult to use. I think that is the direction the event loop should head towards as well, regardless of the abstraction-class chosen. This means avoiding interface restrictions like having to dynamically allocate memory to schedule a callback/
There's a difference between "correct" and "understandable": Goroutines and channels are "correct" and arguably "understandable" while Zig async is still "correct" but, from my experience, doesn't seem to be immediately "understandable". The good thing is that when Zig async is understood, it tends to be one of those "it's obvious in hindsight" things like As for platform consistency, this looks like one of the criteria that I considered essential by default for a standard library abstraction so I didn't give it much attention in my response: I also think it should be of higher priority than "performance".
This is the hard question that needs to be answered: what should such an API look like which allows a high performance implementation to be plugged into? I'm of the opinion that the API choice (in this case "traditional multi-threaded" vs "io_uring single-threaded") is what affects the performance ceiling of the implementation[1] and that a poor API affects the efficiency of the implementation [2].
Boost ASIO appears to be effectively the "traditional multi-threaded event loop" approach coupled with an MPSC channel called "strand" and no cross-platform support for file I/O (unless I'm missing something here). I would categorize it similarly to other "traditional" approaches (i.e. Go, Rust tokio).
[1] POSIX synchronous file I/O interface limits ssd/m.2 drive performance |
I see, thank you again for the detailed responses - I've learned tons and my mind is at ease, you have certainly thought about this a lot! |
I support message passing as the only communication method between tasks, like Erlang/OTP. Functions should take an event loop argument if they want to use the event loop. |
@locriacyber Message passing is an easy to grok, unified communication scheme. But it's not mechanically efficient given memory or serialization overhead it implies. Many times, shared memory is either more obvious or more resource efficient (e.g. Mutex) and I believe Zig should be explicit about those costs like it is with memory allocation and not provide only one high level abstraction. As for event loop by function argument (instead of global), it sounds interesting. Would things like File/Socket/Thread take an event loop? |
There is no cost if you only pass pointers around. The language user needs to make sure not to use a pointer after it has been sent. This calls for an event-loop-scoped memory allocator. OS threads are irrelevant to the event loop. File and Socket operations generally don't need an event loop since they don't spawn new parallel tasks. The current async/suspend feature in Zig is a language-level effect. A more elegant approach is to add an effect tracking system like the one in Koka, allowing user-space defined suspend/resume. (That's a huge language feature.) |
On Windows-based OSes, loop.accept could use ws32 AcceptEx, which supports overlapped I/O results and works with the current 0.10 loop design. Is someone working on this already? Currently the build only throws an "Os not Supported" compiler error. |
I don't really like |
@tauoverpi To better understand, would this make things like Other food for thought: should this pattern also extend to other things that could possibly block like synchronization primitives and timed sleeps/waits? Or should it be strictly for IO, or just File IO? |
@kprotty Methods taking the IO type would possibly be cleaner as the underlying type (e.g I've used the type generic approach with timers and serial IO to ease testing timeouts and various read failures which has worked well (with hindsight, on methods would have been easier to work with) so going beyond IO to sync primitives and more could work well for testing and alternate loop implementations. As a side note, this does start to feel like a capability/effects system. |
@tauoverpi One issue with IO types on the methods is that it makes the implementation less efficient: Things like epoll and kqueue 1) wouldn't know the socket is non-blocking, so it's a syscall to set it as such (and possibly reset it back) 2) would need to use oneshot event registration, which is a syscall on every wait, as they don't have any stable association for the fd's lifetime due to IO being method based instead of type based. This prevents the usage of the more efficient edge triggering mode, which requires extra state per handle/fd. For Windows, there wouldn't be a way to know that the handle supports overlapped IO without first trying the syscall and observing the error. This would reduce the performance for non-overlapped/sync access. Another potential issue is that one could use different IO types on the same file/socket/fd. And without a way to synchronize their accesses, one event registration or async routine could override the other. This is actually what happens when you try to |
hopefully it's not too late to reverse the decision on making it implicit and global.
async is extremely hard to get perfect for all use cases, so it would be very useful if competing async impls could exist. |
@aep After some thought, I'm up for removing
Could you elaborate on "async impls"? It's a language feature with currently one impl in zig stage1 that isn't tied to concepts like file descriptors or necessitating an event loop. |
For research purposes, this experiment from Nick Banks / Microsoft tries to explore a unified API for multiple IO models: https://github.com/nibanks/eventq This is possibly relevant research material. I'm no expert on this matter. (Don't wanna sidetrack the discussion, feel free to ignore 😆) |
I would like to see a symmetrical API similar to std.Thread.* that can be used interchangeably between threads and coroutines (maybe call it std.Strand). Then you could launch a thread using So functions would still be colored, but it would be bottom up instead of top down. (eg - what does the consumer want). So one function can be instantiated with multiple calling conventions, and |
asynchronous versions of I still believe something pluggable like this would be a good idea, although an |
There would still be an event loop to handle async scheduling. The event loop (with its allocator) could be an option to
I've had a scenario where I wanted a single executable to have both blocking and evented IO. I would like to be able to write a function once and then instantiate it as both blocking and evented in the same executable. This can be accomplished by replacing |
For 1. there can be a standard API for threads and event-loop specific coroutines, but the latter has more scheduling freedom than the former and requiring OS thread APIs can limit what's most efficient (e.g. concurrency without alloc, tail scheduling a woken up task). For 2.
Since |
Bottom up function coloring eliminates the need for |
The unified std.Strand.spawn idea seems very implicit. I'd prefer to explicitly choose how the task is going to be executed: as its own thread, or as async in the current thread. Especially in a language which tries to be low level and explicit. And thinking about locking/synchronization primitives, aren't they specific to the type of runtime? I mean that if the async scheduler never moves a task to another thread, it doesn't need atomic operations for dealing with other tasks on the same OS thread. |
It would be similar to the status quo. Currently |
This would mean that either the IO portions like read are comptime provided, or the callee is able to comptime inspect the calling convention used by the caller (of which idk how it would play out with comptime evaluation).
Wouldn't this syntax be similar to calling A with
This is a good catch. You'd solve this by checking for
@vlada-dudr I think the idea was just to have an upper-level API that chooses between std.Mutex = if (io_mode == .evented) std.event.Mutex else std.Thread.Mutex |
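The selection above is a compile-time choice in Zig. A loose runtime analogue in Python is sketched below; `select_mutex` and the mode strings are hypothetical names for this illustration, while `asyncio.Lock` and `threading.Lock` are the real standard-library primitives.

```python
import asyncio
import threading


def select_mutex(io_mode):
    # Pick the lock type that matches the runtime the caller is using.
    return asyncio.Lock if io_mode == "evented" else threading.Lock


def demo_blocking():
    Mutex = select_mutex("blocking")
    lock = Mutex()
    counter = 0
    with lock:          # thread-based lock: plain `with`
        counter += 1
    return counter
```

Note that the evented variant would have to be acquired with `async with` inside a coroutine, so the calling code still differs between the two modes. That difference is the function-coloring concern raised earlier in the thread, and it also reflects the earlier point that a scheduler which never moves tasks across threads does not need the atomic operations a thread-based mutex pays for.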
I would expect that the callee is able to just comptime inspect itself. In my example above, Yes, std.crypto.random can be fixed by adding a |
Take a look at https://github.com/dee0xeed/xjiss "everything in Unix is a file", right? ;) |
My 2 cents here aren't much on the technical side, but rather on the general Zig philosophy of being explicit, reducing magic, and making code easier to read over writing. A default, global, behind-the-scenes event loop seems to me a lot like the default, global, behind-the-scenes allocator many languages have and that Zig is better off for not having. If technically possible, I would prefer a user-selectable event loop(s) in the standard library versus the global one we have now. To me, that is just the more Zig-like way to do it. PS- This is quite the long thread so I apologize if I'm repeating someone else's prior ideas here. |
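As a rough illustration of the "user-selectable loop, passed explicitly like an allocator" idea above, here is a small sketch using Python's asyncio rather than Zig. `fetch_greeting` and `run_on` are made-up names; the point is only that the caller creates the loop, hands it over, and tears it down, with no hidden global involved.

```python
import asyncio


async def fetch_greeting():
    """Stand-in for some async work; yields to the loop once."""
    await asyncio.sleep(0)
    return "hello"


def run_on(loop, coro):
    """Run a coroutine on the loop the caller chose, not a global one."""
    return loop.run_until_complete(coro)


# The caller owns the loop's lifetime, much like owning an allocator.
loop = asyncio.new_event_loop()
try:
    result = run_on(loop, fetch_greeting())
finally:
    loop.close()
```

Swapping in a different loop implementation would then be a change at the call site rather than inside the library.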
Late to the game, but my 2 cents: TLDR: I think IO models are going to undergo a large shift in the next few years. Background: There are some exotic fabrics (InfiniBand, etc.) where it's faster to move a packet to another machine than to take a context switch on x86. This is only going to become more common on less exotic hardware. CPU clock speeds aren't going up much anymore, but networking clock speeds are. This type of zero-copy IO model is common in HPC environments and uses DPDK/efvi/vma/XDP etc. I think a standard zero-copy API will emerge (or maybe Zig should define one?). Implement that in If you are going to put an event loop in Extra points: Define a way for multiple event loops to coexist. Rust tried this but largely failed (IMHO). You need extra threads to combine loops in Rust, and I think having multiple event loops for a single-threaded app is not intended to be possible (it can be done, but it's using APIs in ways they aren't intended to be used). Counter argument: People who really care about IO models won't use whatever is in edit: CXL, not CLX. |
Exactly! Having things like hash maps or JSON parsing (i.e. purely CPU+RAM stuff) in a library (std or not, it doesn't matter) is definitely nice, but I/O is another story entirely. Look at libpq, libhiredis, etc. - they all contain connect, read, write, but why? That might be fine for a simple program that runs for a relatively short period of time, but for a complex program that interacts with many other programs on different hosts and runs 24/7 it is not good at all. Personally, I usually do not use things like libpq - instead I implement (part of) the protocol myself and do I/O the way I want, using the OS API directly (in a single event loop), without the multi-level superstructures/wrappers found in libraries. |
Do you mean CXL (Compute Express Link)? If not, can you give a reference for CLX? |
Yes. CXL. I'm dyslexic AF. Updated comment. Thanks for pointing that out. |
I'm a bit late to this thread, but I'll add my input: I don't think an event loop should be used. This is mainly because, from what I've seen in other languages, when event loops are used they often get so deeply embedded into the language that they're difficult to replace when you need to do so. So, if an event loop is genuinely something we need to have, it should be more of an "extra" than anything else; async/await should be minimal, like the rest of the language, and should not require one to function. (It also shouldn't be too complicated! I've seen how languages like Rust do it, and I'm definitely not impressed. I found the way Rust did async/await runtimes, async executors in particular, incredibly confusing and difficult to actually understand. The documentation I read claimed that Rust auto-inserted "yield points", whatever those are, but I had no way of detecting or interacting with these "yield points" or otherwise controlling when I awoke an async task, even in the executor, or that's how it seemed, which was really confusing and annoying for me.) My reason for wanting simplicity and control (e.g., like Zig does it at present) is that I'd like to take advantage of this in an embedded system I'm considering writing that relies heavily on this paradigm: I'd use it to suspend functions to wait for things like interrupts, instead of needing to do something like setting atomic flags. This way, the interrupt could just "resume" the frame and remove it from the suspended frame stack, for instance. I could see something similar being done in many other use cases. So, in sum, if you really do need an event loop, try to make it optional, and document it in a way that maximizes understandability. I wouldn't focus too heavily on efficiency. Use io_uring, IOCP, etc., but don't try to make efficiency a hill to die on. (And you can always optimize later.) 
If someone desires something that's more efficient than what the stdlib offers, they can and should always write their own. |
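The embedded pattern described above (suspend a function until an interrupt, then have the handler resume it) can be sketched without any event loop at all. The following uses Python generators as a stand-in for Zig async frames; `InterruptController`, `read_sensor`, and the IRQ number are all hypothetical and only illustrate the control flow.

```python
class InterruptController:
    """Keeps suspended tasks keyed by the interrupt they wait on."""

    def __init__(self):
        self._waiting = {}  # irq number -> list of suspended generators

    def spawn(self, task):
        # Run the task until its first suspension point.
        self._advance(task, None)

    def _advance(self, task, value):
        try:
            irq = task.send(value)  # resume; task yields the irq it awaits
        except StopIteration:
            return  # task finished, nothing left to park
        self._waiting.setdefault(irq, []).append(task)

    def fire(self, irq, payload=None):
        # Called from the (simulated) interrupt handler: resume every
        # task parked on this irq and drop it from the suspended set.
        for task in self._waiting.pop(irq, []):
            self._advance(task, payload)


log = []


def read_sensor(irq):
    log.append("waiting")
    sample = yield irq  # suspend until the interrupt fires
    log.append(f"got {sample}")


ctl = InterruptController()
ctl.spawn(read_sensor(irq=7))
ctl.fire(7, payload=42)
```

Nothing here polls or schedules; the "interrupt" is the only thing that resumes the suspended function.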
@ethindp Zig async is a language feature while the event loop is a library component which uses it. This means an event loop is not required for use in other environments like WASM/embedded as you've noted. Event loop just refers to the abstraction which starts and schedules coroutines (async frames). As for "too complicated", that's too subjective a criterion; it would be more helpful to list which areas or things you were trying to (or would like to) do that felt confusing instead. Regarding Rust, the "yield points" are explicit I'd disagree regarding not focusing on efficiency early: Event loops aren't something you can "optimize later" as they are limited by your designs. For instance, you can't optimize a select-based API to use io_uring/IOCP efficiently. Nor could you make something like libuv or libdispatch without internal heap allocation. I agree that people will write their own, but I think it's better to at least make doing so easier by providing high-performance smaller components like an IO queue, thread pool, timer queue, etc. |
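To give a feel for what one of the "smaller components" mentioned above could look like in isolation, here is a minimal timer queue sketch in Python. It is not a proposal for the Zig API; `TimerQueue` and its method names are invented for illustration, and a real implementation would care about allocation and intrusive data structures in ways this does not.

```python
import heapq
import itertools


class TimerQueue:
    """Min-heap of (deadline, callback) pairs, usable by any loop."""

    def __init__(self):
        self._heap = []
        # Sequence number breaks ties so callbacks are never compared.
        self._seq = itertools.count()

    def schedule(self, deadline, callback):
        heapq.heappush(self._heap, (deadline, next(self._seq), callback))

    def next_deadline(self):
        """Earliest pending deadline, or None if the queue is empty."""
        return self._heap[0][0] if self._heap else None

    def pop_expired(self, now):
        """Remove and return callbacks whose deadline is <= now."""
        expired = []
        while self._heap and self._heap[0][0] <= now:
            _, _, callback = heapq.heappop(self._heap)
            expired.append(callback)
        return expired


fired = []
timers = TimerQueue()
timers.schedule(30, lambda: fired.append("c"))
timers.schedule(10, lambda: fired.append("a"))
timers.schedule(20, lambda: fired.append("b"))

# A loop of the user's choosing decides when to check and what "now" is.
for callback in timers.pop_expired(now=20):
    callback()
```

The queue knows nothing about threads or I/O, so it could sit under a hand-written loop as easily as under a standard one.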
It's worth taking a look at cats-effect's design too. |
poking around DPDK right now and super anxious about using zig, because of the possibility that some day it will just start having an opinion about IO and then you end up with both things next to each other and having to sync them, like they do with rust now. |
The current event loop is not ready yet (relatively slow, Windows unfinished, bugs/races) and many wish for it to be. From the Discord communities at least, there seems to be enough interest to warrant addressing this, with some interested in helping out. My question then is: what do people really want out of the standard library's event loop? Toying with some designs locally produced incomplete implementations for lack of direction. In addition to what people want from an event loop, how do they also want it implemented? Here are some example properties:
async
should it be?
I'm open to adding more considerations if others have some. Just want to understand the requirements before making any sort of PR.