EventLoop: direct epoll/kqueue integration #14959
Conversation
We can't call EPOLL_CTL_MOD with EPOLLEXCLUSIVE. Let's disable it for now and see later if we can replace it with a pair of EPOLL_CTL_DEL and EPOLL_CTL_ADD.
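A minimal sketch of that workaround, assuming a hand-rolled binding (`LibEpoll` and `rearm_exclusive` are hypothetical names; the constants mirror `<sys/epoll.h>` on Linux):

```crystal
# Hypothetical minimal binding; values mirror <sys/epoll.h> on Linux.
lib LibEpoll
  EPOLL_CTL_ADD = 1
  EPOLL_CTL_DEL = 2
  EPOLL_CTL_MOD = 3

  EPOLLIN        = 0x001_u32
  EPOLLEXCLUSIVE = 0x10000000_u32 # 1 << 28

  @[Packed] # epoll_event is packed on x86-64
  struct EpollEvent
    events : UInt32
    data : UInt64
  end

  fun epoll_ctl(epfd : Int32, op : Int32, fd : Int32, event : EpollEvent*) : Int32
end

# The kernel rejects EPOLL_CTL_MOD on a fd registered with EPOLLEXCLUSIVE
# (EINVAL), so re-registering takes a DEL followed by an ADD.
def rearm_exclusive(epfd : Int32, fd : Int32, events : UInt32) : Nil
  LibEpoll.epoll_ctl(epfd, LibEpoll::EPOLL_CTL_DEL, fd, nil)
  event = LibEpoll::EpollEvent.new(
    events: events | LibEpoll::EPOLLEXCLUSIVE,
    data: fd.to_u64!)
  LibEpoll.epoll_ctl(epfd, LibEpoll::EPOLL_CTL_ADD, fd, pointerof(event))
end
```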
Process.run sometimes hangs forever after fork and before exec, because it tries to close a fd that requires taking a lock, but another thread may have already acquired the lock. Since `fork` only duplicates the current thread (the others are not duplicated), the forked process was left waiting for a mutex that would never be unlocked.
That required allocating a Node for the interrupt event, which ain't a bad idea.
Extracts the generic parts of the event loop into an intermediary class between Crystal::EventLoop and Crystal::Epoll::EventLoop so we can reuse it to implement the event loop on other similar syscalls (poll, kqueue).
Sometimes we only want a pair of fds, and not IO::FileDescriptor objects.
For some reason specs fail with a fiber failing to raise an exception because `pthread_mutex_unlock` failed with EPERM while trying to dequeue the `Fiber#resume_event` from the event loop. Re-creating the thread mutex after fork seems to fix the issue.
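A hedged sketch of the idea behind that fix (`EventLoopState` and `after_fork` are hypothetical names; `Thread::Mutex` is the stdlib's pthread mutex wrapper):

```crystal
# Sketch: after fork the child owns a single thread, and a pthread mutex
# locked by another (now nonexistent) thread can never be unlocked; unlock
# attempts fail with EPERM. Re-creating the mutex in the child resets it.
class EventLoopState
  def initialize
    @mutex = Thread::Mutex.new
  end

  # to be called in the child process right after fork
  def after_fork : Nil
    @mutex = Thread::Mutex.new
  end
end
```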
This allows keeping a file descriptor in the evloop for its whole lifetime (from open to close) instead of adding it every time it would block and removing it as soon as it unblocks. This brings over 20% performance improvement on a simple HTTP/1.1 server (with keepalive). Among the advantages: this allows removing the global mutex around handling IO events and instead having an almost never contended lock around the reader and writer waiting lists for each IO. We don't even have to keep a global list of events (epoll and kqueue will do it). The drawback is that the preview MT scheduler isn't compatible with this scheme.
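A simplified sketch of the per-IO state this enables; the names are modeled on the PR but are not the actual implementation, and `Crystal::SpinLock`/`Crystal::Scheduler.enqueue` are stdlib internals:

```crystal
# Each fd gets its own descriptor with an almost-never-contended lock
# around its reader/writer waiting lists, replacing the global mutex.
class PollDescriptorSketch
  @lock = Crystal::SpinLock.new
  @readers = Deque(Fiber).new
  @writers = Deque(Fiber).new

  # called when a read would block: park the fiber in the readers list
  def wait_readable(fiber : Fiber) : Nil
    @lock.sync { @readers << fiber }
  end

  # called from the run loop when epoll/kqueue reports the fd readable
  def ready_readable : Nil
    if fiber = @lock.sync { @readers.shift? }
      Crystal::Scheduler.enqueue(fiber)
    end
  end
end
```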
This should be renamed as Crystal::Evented::Arena, since it's not a generic generational arena (memory region). It takes advantage of the fact that the OS kernel handles the fd number (it's guaranteed unique) and always reuses closed fds instead of growing (until it's needed). An actual generational arena would keep a list of free indexes.

Note: the goal of the arena is to (see the sketch below):
- avoid repeated allocations;
- avoid polluting the IO object with the PollDescriptor (which doesn't exist in other evloops);
- avoid saving raw pointers into kernel data structures;
- safely detect allocation issues instead of segfaulting because of raw pointers.
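A minimal sketch of that idea (`ArenaSketch` is hypothetical, simplified, and not thread-safe): the fd is the slot index, and a generation counter turns a stale reference into a nil instead of a segfault.

```crystal
class ArenaSketch(T)
  record Index, index : Int32, generation : UInt32

  def initialize(capacity : Int32)
    @slots = Array(T?).new(capacity, nil)
    @generations = Array(UInt32).new(capacity, 0_u32)
  end

  # Allocate the slot for an fd (unique among open fds, courtesy of the
  # kernel) and hand out a versioned index.
  def allocate_at(fd : Int32, value : T) : Index
    @slots[fd] = value
    Index.new(fd, @generations[fd])
  end

  # Returns nil when the slot was freed (and possibly reused) since
  # `index` was handed out: a stale index is detected, not followed.
  def get?(index : Index) : T?
    return nil unless @generations[index.index] == index.generation
    @slots[index.index]
  end

  def free(index : Index) : Nil
    return unless @generations[index.index] == index.generation
    @slots[index.index] = nil
    @generations[index.index] &+= 1
  end
end

# Usage sketch:
arena = ArenaSketch(String).new(1024)
idx = arena.allocate_at(7, "poll descriptor for fd 7")
arena.get?(idx) # => "poll descriptor for fd 7"
arena.free(idx)
arena.get?(idx) # => nil (generation mismatch: stale index detected)
```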
Integrates the epoll (Linux) and kqueue (*BSD, macOS) syscalls to handle the event loop on UNIX platforms.
Benefits
Instead of adding a `fd` to the poll structure when the `fd` blocks and removing it when it's ready for read or write (then repeat), we now add it once and keep it there until we close the `fd`. This is the ideal scenario for epoll and kqueue.

Unlike the previous attempts to integrate epoll & kqueue directly, which followed libevent's logic, didn't bring any performance improvement, and required a big lock (contended with MT) to keep a list of events, this change allows up to a +20% performance boost in an ideal scenario (http/server with long lived connections) and only requires fine grained locks for MT (usually uncontended).

To nobody's surprise: this is how Go's netpoll works.
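A sketch of the add-once lifecycle under stated assumptions (hand-rolled `LibEpoll` binding, hypothetical `on_open`/`on_close` helpers; the actual integration lives in `Crystal::System::Epoll`):

```crystal
# Hypothetical minimal binding; values mirror <sys/epoll.h> on Linux.
lib LibEpoll
  EPOLL_CTL_ADD = 1
  EPOLL_CTL_DEL = 2

  EPOLLIN  = 0x001_u32
  EPOLLOUT = 0x004_u32
  EPOLLET  = 0x80000000_u32 # edge-triggered, pairs well with add-once

  @[Packed] # epoll_event is packed on x86-64
  struct EpollEvent
    events : UInt32
    data : UInt64
  end

  fun epoll_ctl(epfd : Int32, op : Int32, fd : Int32, event : EpollEvent*) : Int32
end

# Register once when the IO is opened...
def on_open(epfd : Int32, fd : Int32) : Nil
  event = LibEpoll::EpollEvent.new(
    events: LibEpoll::EPOLLIN | LibEpoll::EPOLLOUT | LibEpoll::EPOLLET,
    data: fd.to_u64!)
  LibEpoll.epoll_ctl(epfd, LibEpoll::EPOLL_CTL_ADD, fd, pointerof(event))
end

# ...and deregister only when it is closed: no add/del churn per blocking read.
def on_close(epfd : Int32, fd : Int32) : Nil
  LibEpoll.epoll_ctl(epfd, LibEpoll::EPOLL_CTL_DEL, fd, nil)
end
```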
Notes
The evloop supports `preview_mt`, still with one evloop instance per thread (scheduler requirement). Execution contexts (RFC #2) will have one evloop instance per context.

We transfer the `fd` from one evloop to another when it would block; that evloop becomes the sole "owner" of the `fd`. The transfer is automatic, there is nothing to do. This leads to a caveat: we can't have multiple fibers waiting for the same `fd` in different evloops (aka threads). Trying to transfer the `fd` will raise if there already is any waiting fiber. This is because an IO read/write can have a timeout, which is registered in the current evloop's timers, and timers aren't transferred. This also allows for future enhancements (e.g. evloop enqueues are always local).

This can be an issue for `preview_mt`, for example with multiple fibers waiting for connections on a server socket; this shall be mitigated by execution contexts from RFC #2, which will share an evloop instance per context: just don't share a `fd` across multiple contexts.

If you experience any issue, you can always recompile with the `-Devloop_libevent` compile-time flag to return to the regular libevent-based event loop instead of the shiny new one.
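For example, assuming your entry point is `app.cr`:

```console
$ crystal build -Devloop_libevent app.cr
```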
Review

The branch kept the whole history of commits from the previous epoll and kqueue branches and has far too many commits. Maybe a couple of them could be extracted on their own.
- Each syscall is abstracted in its own little struct: `Crystal::System::Epoll`, `Crystal::System::TimerFD`, etc. They could be simplified (possibly some dead code).
- The `Crystal::Evented` namespace (`src/crystal/system/unix/evented`) contains the base implementation that the system specific `Crystal::Epoll::EventLoop` (`src/crystal/system/unix/epoll`) and `Crystal::Kqueue::EventLoop` (`src/crystal/system/unix/kqueue`) are built on.
- `Crystal::Evented::Timers` is a basic data structure to keep a list of timers (one instance per evloop); it could be optimized (in follow-up pull requests).
- `Crystal::Evented::Event` holds the event, be it IO, sleep, select timeout, or IO with timeout, while `FiberEvent` wraps an `Event` for sleeps and select timeouts.
- `Crystal::Evented::PollDescriptor`s are allocated in a generational arena and keep the list of readers and writers (events/fibers waiting on IO).
- The run loop first waits on epoll/kqueue, canceling IO timeouts as it resumes fibers, then proceeds to process timers.
- The epoll/kqueue call doesn't wait until the next ready timer (it could without MT and with `preview_mt`, but can't for execution contexts). I instead rely on timerfd on Linux and EVFILT_TIMER on BSD to interrupt a blocking evloop wait (see the sketch after this list). This also allows circumventing the 1ms precision of `epoll_wait` on Linux.
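A hedged sketch of the timerfd mechanism on Linux (`LibTimerFD` is a hypothetical hand-rolled binding; field sizes assume a 64-bit target):

```crystal
lib LibTimerFD
  CLOCK_MONOTONIC = 1

  struct Timespec
    tv_sec : Int64  # time_t on 64-bit Linux
    tv_nsec : Int64 # long on 64-bit Linux
  end

  struct ITimerSpec
    it_interval : Timespec # zero => one-shot timer
    it_value : Timespec    # delay until expiration
  end

  fun timerfd_create(clock_id : Int32, flags : Int32) : Int32
  fun timerfd_settime(fd : Int32, flags : Int32, new_value : ITimerSpec*, old_value : ITimerSpec*) : Int32
end

# Create the timer fd once and register it in the epoll set like any
# other fd (registration not shown).
timer_fd = LibTimerFD.timerfd_create(LibTimerFD::CLOCK_MONOTONIC, 0)

# Arm it for the next timer's deadline, e.g. 1.5ms from now: epoll_wait
# then wakes with sub-millisecond precision instead of its 1ms timeout.
deadline = LibTimerFD::Timespec.new(tv_sec: 0, tv_nsec: 1_500_000)
spec = LibTimerFD::ITimerSpec.new(it_value: deadline)
LibTimerFD.timerfd_settime(timer_fd, 0, pointerof(spec), nil)
```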
References

Supersedes both #14814 and #14829.