Refactor Lifetime Event Loop #14996

ysbaddaden · 2024-09-12T17:20:07Z

Almost identical to #14959 but with cleaner history and more documentation that led me to identify issues in the arena.

RFC: https://github.com/crystal-lang/rfcs/blob/main/text/0009-lifetime-event_loop.md

Overall design

The logic of the event loop doesn't change much from libevent: we try to execute an operation (e.g. read, write, connect, ...) on nonblocking file descriptors; if the operation would block (EAGAIN) we create an event that references the operation along with the current fiber; we eventually rely on the polling system (epoll or kqueue) to report when an fd is ready, which will dequeue a pending event and resume its associated fiber (one at a time).

Unlike libevent that will add and remove the fd to and from the polling system every time we'd block, this event loop adds it once and will only remove it when closing the fd. The theory is that being notified for readiness (once thanks to edge-triggered) is less expensive than always modifying the polling system. In practice a purely IO-bound benchmark with long running sockets, we notice up to 20% performance improvement. Actual applications should see less improvements.

Implementation details

Unlike the previous attempts to integrate epoll and kqueue directly that required a global events object and a global lock to protect it, this PR only needs fine-grained locks for each IO object and operation (read, write) to add the events to the waiting lists. In practice, the locks should never be contented (unless you share a fd for the same read or write operation in multiple fibers).

Caveat: timers are still global to the event loop, and we need a global lock to protect it. This means that IO with timeout will still see contention. Improving the timers data structure will come later (e.g. lock free, more precise, faster operations).

To avoid keeping pointers to the IO object that could prevent the GC from collecting lost IO objects, this PR introduces "poll descriptor" objects (the name comes from Go's netpoll) that keep the list of readers and writers and don't point back to the IO object. The GC collecting an IO object is fine: the finalizer will close the fd and tell the event loop to cleanup the ~~fd resources~~ associated poll descriptor (so we can safely reuse the fd).

To avoid pushing raw pointers into the kernel data structures, and to quickly retrieve the poll descriptor from a mere fd, but also to avoid programming errors that would segfault the program, this PR introduces a Generational Arena to store the "Poll Descriptors" (the name is inherited from Go's netpoll) so we only store an index into the polling system. Another benefit is that we can reuse the existing allocation when a fd is reused. If we try to retrieve an outdated index (the allocation was freed or reallocated) the arena will raise an explicit exception.

The poll descriptors associate a fd to an event loop instance, so we can still have multiple event loops per processes, yet make sure that an fd is only ever in one event loop. When a fd will block on another event loop instance, the fd will be transferred automatically (i.e. removed from the old one & added to the new one). The benefits are numerous: this avoids having multiple event loops being notified at the same time; this avoids having to close/remove the fd from each event loop instances; this avoids cross event loop enqueues that are much slower than local enqueues in RFC 2.

A limitation is that trying to move a fd from one evloop to another while there are pending waiters will raise an exception. We can't move timeout events along with the fd from one event loop instance to another one, but that would also break the "always local enqueues" benefit.

Most application shouldn't notice any impact because of this design choice, since a fd is usually not shared across fibers (concurrency issues), except maybe a server socket with multiple accepting fibers? In that case you'll need to make sure the fibers are on the same thread (preview_mt) or same context (RFC 2).

Availability

We may want to have a compile time flag to enable the new event loop before we merge? For the time being this PR uses the new shiny evloop by default (to have CI runs) and introduces the :evloop_libevent flag to fallback to libevent.

TODO

A couple changes before merge (but not blocking review):

Compile time flag(s):
- the epoll/kqueue evloop is enabled by default on FreeBSD, Linux and macOS (tested and working), but disabled on the other BSD (slow, broken or untested) and Solaris (untested).
- -Devloop=libevent to return to libevent (in case of regressions or to test/benchmark);
- -Devloop=epoll to use epoll (e.g. Solaris);
- -Devloop=kqueue to use kqueue (on *BSD);
EINTR: automatically retry epoll_wait or kevent with infinite timeout;
Remove the #try_run? and #try_lock? methods that are no longer needed.

REGRESSIONS/ISSUES

Timers are noticeably slower than libevent (especially Fiber.yield): we should consider a minheap (4-heap) or a skiplist;
DragonFlyBSD: running std_spec is eons slower than libevent; it regularly hangs on evloop.run until the stack pool collector timeout kicks in (i.e. loops on 5s pauses);
OpenBSD: running std_spec is noticeably slower (4:15 minutes) compared to libevent (1:16 minutes); it appears that the main fiber keeps re-enqueueing itself from the evloop run (10us on each run);
NetBSD: the evloop doesn't work with kevent returning ENOENT for the signal loop fd and EINVAL when trying to set an EVFILT_TIMER.

For the reasons above the kqueue evloop is disabled by default on DragonFly(BSD), OpenBSD and NetBSD.

ysbaddaden · 2024-09-19T14:22:06Z

While trying to integrate the evloop with the experimental shard for RFC 2, I noticed a race condition between #process_timers and #evented_close with MT since we can close a fd from any thread while another is processing timers.

I solved it with the requirement that to resume an IO event with a timeout we must successfully dequeue the event from both queues (poll descriptor waiters and timers) and a bias on timers: they always win. This allowed to keep the Event allocated on the stack. This involves some locks, but the race condition should be rare enough that I don't expect it to be a contention point (I could be wrong).

Notes:

the bias on timers only means that the timer wins and it must be the one to resume the fiber, it doesn't mean that we must timeout the event. If #process_timer fails to dequeue the IO event from the waiters list we could consider the IO event to be ready and not call event.timed_out!;
the fix allows to process timers independently from the event loop (which might give more weight to the above note); we may be able to find a way to run timers in a more timely manner in the future; for example from the schedulers' run loop?

crysbot · 2024-09-24T17:05:54Z

This pull request has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/new-event-loop-unix-call-for-reviews-tests/7207/1

ysbaddaden · 2024-09-26T12:01:05Z

I just noticed one difference in IO::Evented (for libevent): it's #evented_read and #evented_write methods will resume the pending readers, or writers, when something raised inside (e.g. IO::Error, IO::TimeoutError, ...).

I'm not sure if this has any impact on this evloop. If the fd has any error, then epoll_wait and kevent shall report the error and the evloop will resume the waiting fibers accordingly.

GeopJr · 2024-09-29T22:43:27Z

I'm unable to build this on Debian Sid with mmap failed with Cannot allocate memory (ENOMEM)

full log

$ make clean clean_cache clean_crystal crystal
rm -rf .build
rm -rf ./docs
rm -rf src/llvm/ext/llvm_ext.o
rm -rf man/*.gz
rm -rf /home/geopjr/.cache/crystal
make: Nothing to be done for 'clean_crystal'.
Using /usr/bin/llvm-config-17 [version=17.0.6]
g++ -c  -o src/llvm/ext/llvm_ext.o src/llvm/ext/llvm_ext.cc -I/usr/lib/llvm-17/include -std=c++17   -fno-exceptions -funwind-tables -D_GNU_SOURCE -D__STDC_CONSTANT_MACROS -D__STDC_FORMAT_MACROS -D__STDC_LIMIT_MACROS
CRYSTAL_CONFIG_BUILD_COMMIT="15093e1" CRYSTAL_CONFIG_PATH='$ORIGIN/../share/crystal/src' SOURCE_DATE_EPOCH="1727195906" CC="cc -fuse-ld=lld" CRYSTAL_CONFIG_LIBRARY_PATH='$ORIGIN/../lib/crystal' ./bin/crystal build -D strict_multi_assign -D preview_overload_order -Dwithout_interpreter  -o .build/crystal src/compiler/crystal.cr -D without_openssl -D without_zlib -D use_pcre2 --error-trace
In src/compiler/crystal.cr:11:18

 11 | Crystal::Command.run
                       ^--
Error: instantiating 'Crystal::Command.run()'


In src/compiler/crystal/command.cr:54:3

 54 | def self.run(options = ARGV)
      ^--
Error: instantiating 'run(Array(String))'


In src/compiler/crystal/command.cr:55:18

 55 | new(options).run
                   ^--
Error: instantiating 'Crystal::Command#run()'


In src/compiler/crystal/command.cr:75:7

 75 | init
      ^---
Error: instantiating 'init()'


In src/compiler/crystal/command.cr:214:10

 214 | Init.run(options)
            ^--
Error: instantiating 'Crystal::Init.run(Array(String))'


In src/compiler/crystal/tools/init.cr:25:31

 25 | InitProject.new(config).run
                              ^--
Error: instantiating 'Crystal::Init::InitProject#run()'


In src/compiler/crystal/tools/init.cr:251:22

 251 | views.each &.render
                    ^-----
Error: instantiating 'Crystal::Init::View+#render()'


In src/compiler/crystal/tools/init.cr:195:31

 195 | File.write(full_path, to_s)
                             ^---
Error: instantiating 'to_s()'


In src/object.cr:97:12

 97 | String.build do |io|
             ^----
Error: instantiating 'String.build()'


In src/string.cr:295:21

 295 | String::Builder.build(capacity) do |builder|
                       ^----
Error: instantiating 'String::Builder.build(Int32)'


In src/string.cr:295:21

 295 | String::Builder.build(capacity) do |builder|
                       ^----
Error: instantiating 'String::Builder.build(Int32)'


In src/object.cr:97:12

 97 | String.build do |io|
             ^----
Error: instantiating 'String.build()'


In src/object.cr:98:7

 98 | to_s io
      ^---
Error: instantiating 'to_s(String::Builder)'


There was a problem expanding macro 'def_to_s'

Called macro defined in src/ecr/macros.cr:35:3

 35 | macro def_to_s(filename)

Which expanded to:

 > 1 |     def to_s(__io__ : IO) : Nil
 > 2 |       ::ECR.embed "/home/geopjr/Projects/crystal/src/compiler/crystal/tools/init/template/example_spec.cr.ecr", "__io__"
 > 3 |     end
 > 4 |   
Error: expanding macro


There was a problem expanding macro 'embed'

Called macro defined in src/ecr/macros.cr:69:3

 69 | macro embed(filename, io_name)

Which expanded to:

 > 1 |     {{ run("ecr/process", "/home/geopjr/Projects/crystal/src/compiler/crystal/tools/init/template/example_spec.cr.ecr", "__io__") }}
 > 2 |   
Error: expanding macro


There was a problem expanding macro 'def_to_s'

Called macro defined in src/ecr/macros.cr:35:3

 35 | macro def_to_s(filename)

Which expanded to:

 > 1 |     def to_s(__io__ : IO) : Nil
 > 2 |       ::ECR.embed "/home/geopjr/Projects/crystal/src/compiler/crystal/tools/init/template/example_spec.cr.ecr", "__io__"
 > 3 |     end
 > 4 |   
Error: expanding macro


There was a problem expanding macro 'embed'

Called macro defined in src/ecr/macros.cr:69:3

 69 | macro embed(filename, io_name)

Which expanded to:

 > 1 |     {{ run("ecr/process", "/home/geopjr/Projects/crystal/src/compiler/crystal/tools/init/template/example_spec.cr.ecr", "__io__") }}
 > 2 |   
Error: Error executing run (exit code: 1): ecr/process /home/geopjr/Projects/crystal/src/compiler/crystal/tools/init/template/example_spec.cr.ecr __io__


stderr:

    mmap failed with Cannot allocate memory (ENOMEM)

make: *** [Makefile:227: .build/crystal] Error 1

FWIW, I can build up to 2821fbb just fine and I have plenty of available memory while building this branch

Crystal 1.12.1 (2024-05-17)

LLVM: 17.0.6
Default target: x86_64-pc-linux-gnu

libevent-dev 2.1.12-stable-10
libgc-dev 1:8.2.8-1

edit: (it builds successfully with -Devloop_libevent, as expected)

ysbaddaden · 2024-09-30T07:41:53Z

Sadly the log doesn't tell much because this is a runtime error in a macro run (to generate Crystal code from ECR).

The only new mmap is the one for the arena... which is set to the hardware limit of open fds. What does ulimit -Hn reports? On my Ubuntu it's 1048576 open files. Maybe it's infinite and I have an issue capping to Int32::MAX.

GeopJr · 2024-09-30T12:43:18Z

$ ulimit -Hn
1073741816

Can't remember if I set it that high or if it was the default :/

I did some debugging of the macros if it helps at all, new ev, evloop_libevent

edit: here's the diff (though it'd be easier to view with Meld)

ysbaddaden · 2024-09-30T13:19:40Z

The large number of fds is the problem. The new evloop is trying to virtually allocate ~40GB of memory and refuses to. What's weird is that this is only reserved memory, not allocated memory 🤔

It's also working on macos despite the hardware limit being infinite (capped to Int32::MAX).

What's the software limit for ulimit -Sn? It's 1024 on my system (ubuntu). Maybe for starters I should only consider the software limit, then add support for blocks to the arena and use the hardware limit (in a subsequent PR).

GeopJr · 2024-09-30T13:36:58Z

$ ulimit -Sn
1024

ysbaddaden · 2024-09-30T17:40:32Z

@GeopJr I made a change to select the soft limit instead of the hard limit. That should workaround the mmap issue —until we improve the arena to allocate individual blocks dynamically, instead of a single region at once.

GeopJr · 2024-10-01T00:02:08Z

Can confirm that it compiles successfully, thanks! (Can also confirm that I didn't face any issues with the new ev + GTK)

ysbaddaden · 2024-10-03T14:14:49Z

While integrating with the execution_context shard I identified a race condition in Waiters#add when called in parallel to Waiters#consume_each. We call the method on close, error or hup after which the fd will never block again; just setting @ready wouldn't be sufficient (we reset it) so I introduced an @closed variable so trying to add would never add. The actual syscall (read, write, connect, etc) will eventually report an error and the IO be properly closed.

Keeps information about the event that a fiber is waiting on, can be a time event and/or an IO event.

Keeps waiting reader and writer events for an IO. The event themselves keep the information about the event and the associated fiber.

A simple, unoptimized, data structure to keep a list of timed events (IO timeout, sleep or select timeout).

The foundation for the system specific epoll and kqueue event loops.

Specific to Linux and Android. It might be working on Solaris too through their Linux compatibility layer.

For BSD and Darwin.

This is now only required for the libevent event loop, and the wasi pseudo event loop.

Also includes the `:evloop_libevent` flag to fallback to libevent.

Avoids an issue with the timecop shard that overrides Time.monotonic.

We must cast the actual LibEvent types because the type signatures return the abstract Crystal::EventLoop interface that don't implement the necessary methods.

ysbaddaden · 2024-11-05T09:12:02Z

@straight-shoota the check format CI action fails, but I can't reproduce on my local (neither with crystal 1.14.0 nor a compiler from this branch) 😕

straight-shoota · 2024-11-05T10:42:14Z

The failure happens on the pull_request trigger which runs in the merge branch (pull/14996/merge), i.e. the PR branch merged with the target branch (master). The PR is currently based on an old commit from master and a compiler built from that misses #14718.
If you merge current master into this branch (or use a recent compiler build form master such as nightly), you should be able to reproduce.

src/crystal/system/event_loop.cr

crysbot · 2024-11-14T13:58:37Z

This pull request has been mentioned on Crystal Forum. There might be relevant details there:

https://forum.crystal-lang.org/t/new-event-loop-unix-call-for-reviews-tests/7207/22

Replaces the static `mmap` that must accommodate for as many file descriptors as allowed by ulimit/rlimit. Despite being virtual memory, not really allocated in practice, this led to out-of-memory errors in some situations. The arena now dynamically allocates individual blocks as needed (no more virtual memory). For simplicity reasons it will only ever grow, and won't shrink (we may think of a solution for this later). The original safety guarantees still hold: once an entry has been allocated in the arena, its pointer won't change. The event loop still limits the arena capacity to the hardware limit (ulimit: open files). **Side effect:** the arena don't need to remember the maximum fd/index anymore; that was only needed for `fork`; we can simply iterate the allocated blocks now. Co-authored-by: Johannes Müller <[email protected]>

Refactors the internals of the epoll/kqueue event loop to `yield` the fiber(s) to be resumed instead of blindly calling `Crystal::Scheduler.enqueue`, so the `#run` method becomes the one place responsible to enqueue the fibers. The current behavior doesn't change, the `#run` method still enqueues the fiber immediately, but it can now be changed in a single place. For example the [execution context shard](https://github.com/ysbaddaden/execution_context) monkey-patches an alternative `#run` method that collects and returns fibers to avoid parallel enqueues from an evloop run to interrupt the evloop run (:sob:). Note that the `#close` method still directly enqueues waiting fibers one by one, for now.

Related to [RFC #12](crystal-lang/rfcs#12). Replaces the `Deque` used in #14996 for a min [Pairing Heap] which is a kind of [Mergeable Heap] and is one of the best performing heap in practical tests when arbitrary deletions are required (think cancelling a timeout), otherwise a D-ary Heap (e.g. 4-heap) will usually perform better. See the [A Nearly-Tight Analysis of Multipass Pairing Heaps](https://epubs.siam.org/doi/epdf/10.1137/1.9781611973068.52) paper or the Wikipedia page for more details. The implementation itself is based on the [Pairing Heaps: Experiments and Analysis](https://dl.acm.org/doi/pdf/10.1145/214748.214759) paper, and merely implements a recursive twopass algorithm (the auxiliary twopass might perform even better). The `Crystal::PointerPairingList(T)` type is generic and relies on intrusive nodes (the links are into `T`) to avoid extra allocations for the nodes (same as `Crystal::PointerLinkedList(T)`). It also requires a `T#heap_compare` method, so we can use the same type for a min or max heap, or to build a more complex comparison. Note: I also tried a 4-heap, and while it performs very well and only needs a flat array, the arbitrary deletion (e.g. cancelling timeout) needs a linear scan and its performance quickly plummets, even at low occupancy, and becomes painfully slow at higher occupancy (tens of microseconds on _each_ delete, while the pairing heap does it in tens of nanoseconds). Follow up to #14996 [Mergeable Heap]: https://en.wikipedia.org/wiki/Mergeable_heap [Pairing Heap]: https://en.wikipedia.org/wiki/Pairing_heap [D-ary Heap]: https://en.wikipedia.org/wiki/D-ary_heap Co-authored-by: Linus Sellberg <[email protected]> Co-authored-by: Johannes Müller <[email protected]>

ysbaddaden added kind:refactor topic:stdlib:system labels Sep 12, 2024

ysbaddaden self-assigned this Sep 12, 2024

ysbaddaden requested a review from straight-shoota September 13, 2024 08:03

ysbaddaden added 15 commits October 3, 2024 16:15

Add :evloop to Crystal::Tracing

657dbcb

Add C bindings for epoll

abd0ce4

Add C bindings for eventfd

e6c722d

Add C bindings for timerfd + itimerspec

ceb7a6f

Add C bindings for kqueue

74f58d8

Add C bindings for getrlimit(RLIMIT_NOFILE)

2421008

Add Event and related FiberChannel objects

3f34458

Keeps information about the event that a fiber is waiting on, can be a time event and/or an IO event.

Add PollDescriptor and Waiters objects

95b64d9

Keeps waiting reader and writer events for an IO. The event themselves keep the information about the event and the associated fiber.

Add Timers object

dfa56a5

A simple, unoptimized, data structure to keep a list of timed events (IO timeout, sleep or select timeout).

Add generational arena

7a12a86

Add polling EventLoop (abstract base)

2363de1

The foundation for the system specific epoll and kqueue event loops.

Add epoll EventLoop (Linux, Android)

ff85b3a

Specific to Linux and Android. It might be working on Solaris too through their Linux compatibility layer.

Add kqueue event loop (BSD)

20adb88

For BSD and Darwin.

Conditionnaly load IO::Evented

73412be

This is now only required for the libevent event loop, and the wasi pseudo event loop.

Enable the epoll/kqueue event loop

dceb184

Also includes the `:evloop_libevent` flag to fallback to libevent.

Fix: prefer System::Time.monotonic over Time.monotonic

b14fa3c

Avoids an issue with the timecop shard that overrides Time.monotonic.

ysbaddaden mentioned this pull request Oct 24, 2024

One EventLoop per process ysbaddaden/execution_context#30

Closed

ysbaddaden added 7 commits November 4, 2024 14:50

Add :evloop_epoll and :evloop_kqueue flags + opt-in on some targets

b4c192e

Fix: compilation with -Devloop_libevent

36ef33b

We must cast the actual LibEvent types because the type signatures return the abstract Crystal::EventLoop interface that don't implement the necessary methods.

fixup! Fix: compilation with -Devloop_libevent

3777d73

Use -Devloop=[libevent|epoll|kqueue] flag(s)

cf6f508

Format + avoid formatter bug (crystal-lang#15112)

406d7d6

fixup! Format + avoid formatter bug (crystal-lang#15112)

4a66600

fixup! Use -Devloop=[libevent|epoll|kqueue] flag(s)

f782784

straight-shoota mentioned this pull request Nov 5, 2024

Improve compile-time flags #15157

Open

Fix: crystal tool format

a3a320c

straight-shoota reviewed Nov 5, 2024

View reviewed changes

src/crystal/system/event_loop.cr Outdated Show resolved Hide resolved

Fix: update todo

79bc334

straight-shoota approved these changes Nov 8, 2024

View reviewed changes

straight-shoota added this to the 1.15.0 milestone Nov 8, 2024

This was referenced Nov 12, 2024

RFC 0009: Lifetime Event Loop crystal-lang/rfcs#9

Merged

Add post: Lifetime Event Loop crystal-lang/crystal-website#869

Merged

straight-shoota changed the title ~~Event Loop~~ Refactor Lifetime Event Loop Nov 19, 2024

straight-shoota merged commit cc30da2 into crystal-lang:master Nov 19, 2024
69 checks passed

ysbaddaden deleted the feature/lifetime-evloop branch November 19, 2024 15:27

This was referenced Nov 19, 2024

Refactor Evented::Arena to allocate in blocks [fixup #14996] #15205

Merged

EventLoop: store Timers in min Pairing Heap [fixup #14996] #15206

Merged

BrewTestBot mentioned this pull request Jan 10, 2025

crystal 1.15.0 Homebrew/homebrew-core#203831

Merged

1 task

straight-shoota mentioned this pull request Jan 20, 2025

Crystal 1.15 seems to have symbol Index reserved #15357

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Refactor Lifetime Event Loop #14996

Refactor Lifetime Event Loop #14996

ysbaddaden commented Sep 12, 2024 •

edited by straight-shoota

Loading

ysbaddaden commented Sep 19, 2024 •

edited

Loading

crysbot commented Sep 24, 2024

ysbaddaden commented Sep 26, 2024

GeopJr commented Sep 29, 2024 •

edited

Loading

ysbaddaden commented Sep 30, 2024 •

edited

Loading

GeopJr commented Sep 30, 2024 •

edited

Loading

ysbaddaden commented Sep 30, 2024

GeopJr commented Sep 30, 2024

ysbaddaden commented Sep 30, 2024

GeopJr commented Oct 1, 2024

ysbaddaden commented Oct 3, 2024

ysbaddaden commented Nov 5, 2024

straight-shoota commented Nov 5, 2024

crysbot commented Nov 14, 2024

Refactor Lifetime Event Loop #14996

Refactor Lifetime Event Loop #14996

Conversation

ysbaddaden commented Sep 12, 2024 • edited by straight-shoota Loading

ysbaddaden commented Sep 19, 2024 • edited Loading

crysbot commented Sep 24, 2024

ysbaddaden commented Sep 26, 2024

GeopJr commented Sep 29, 2024 • edited Loading

ysbaddaden commented Sep 30, 2024 • edited Loading

GeopJr commented Sep 30, 2024 • edited Loading

ysbaddaden commented Sep 30, 2024

GeopJr commented Sep 30, 2024

ysbaddaden commented Sep 30, 2024

GeopJr commented Oct 1, 2024

ysbaddaden commented Oct 3, 2024

ysbaddaden commented Nov 5, 2024

straight-shoota commented Nov 5, 2024

crysbot commented Nov 14, 2024

ysbaddaden commented Sep 12, 2024 •

edited by straight-shoota

Loading

ysbaddaden commented Sep 19, 2024 •

edited

Loading

GeopJr commented Sep 29, 2024 •

edited

Loading

ysbaddaden commented Sep 30, 2024 •

edited

Loading

GeopJr commented Sep 30, 2024 •

edited

Loading