Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Epoll event loop (linux) #14814

Closed

Conversation

ysbaddaden
Copy link
Contributor

@ysbaddaden ysbaddaden commented Jul 16, 2024

Implements a custom event loop on top of epoll for Linux (and Solaris?), following RFC #7. It's not particularly optimized yet (e.g. it still allocates), but seems to be working.

The EventQueue and Timers types aren't the best solutions (performance wise) but they work and abstract their own logic (keep a list of events per fd / keep a sorted list of timers). We should be able to improve (e.g. linked lists -> ordered linked lists -> skiplists, ...)

ISSUES

  • #after_fork must close & nullify every Fiber#timerfd & update pending events to use the new fd (not needed with one timerfd per eventloop, see below);
  • must reset mutex after fork before exec (preview mt);
  • must "close" fd in all eventloops (preview mt);
  • LibC bindings for aarch-linux-musl are wrong;
  • investigate segfault runs on CI;
  • support :preview_mt (broken CI);
  • can't modify a registered fd with EPOLLEXCLUSIVE set (disabled for now);
  • BUG: a timerfd file descriptor errored or got closed! on CI;
  • epoll_ctl(EPOLL_CTL_MOD): No such file or directory (RuntimeError) on CI;
  • Unhandled exception: pthread_mutex_unlock: Operation not permitted (RuntimeError) on CI and again
  • infinite loop in interpreter on CI (primitives spec);

OPTIMIZE

  • epoll_event.data should point to Node instead of fd, so we could skip searches;
  • use eventfd instead of pipe2 to interrupt the loop;
  • one timerfd per eventloop should be enough (wait until next timer => we can't enqueue another timer on that thread anymore => for MT a parallel enqueue will interrupt epoll_wait); we still need a timerfd because epoll_wait only has millisecond precision; we also need a list of timers to know the next timer, and which timers are ready.

TODO

  • don't raise BUG exceptions, print BUG warnings to STDERR;
  • more testing: please help to break it 🔨
  • extract fix for IO::FileDescriptor & Socket finalizers do far too much #14807 (distinct PR);
  • extract fix for double event dequeue in Fiber#cancel_timeout (distinct PR);
  • cleanup Crystal::Epoll::EventLoop (commented code, debug traces);
  • use :ev_epoll flag instead of forcing it for Linux (not yet: I want CI test runs);

@ysbaddaden
Copy link
Contributor Author

Urgh, supporting fork is painful. I must close the timerfds for each fiber (easy) but then I must mutate all pending events to use the new fd 😩

Copy link
Contributor

@yxhuvud yxhuvud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I trust you have read https://idea.popcount.org/2017-02-20-epoll-is-fundamentally-broken-12/ (and part 2: https://idea.popcount.org/2017-03-20-epoll-is-fundamentally-broken-22/? ) The answer to multitheaded issiues may be found in there.

Comment on lines +172 to +173
raise "BUG: #{node.fd} is ready for reading but no registered reader for #{node.fd}!\n" if readable
raise "BUG: #{node.fd} is ready for writing but no registered writer for #{node.fd}!\n" if writable
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it sufficent to do normal raise here? I imagine it can be a bit weird in the multithreaded scenario if things raise here. As this is called from the scheduler itself I imagine most places are not really build to handle this well.

Will it bubble up properly and take down the whole thing (as at least I think it should)? Or could we get strange situations with some crashed threads but the program as a whole remains (potentially waiting for the crashed stuff).

I suppose that goes for other exceptions in this vicinity too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not meant to stay, and definitely not safe to handle.

All the raised "BUG:" are meant to loudly detect errors. If they raise, there's a 99% probability that I made a mistake, though I discovered the double event dequeue in Fiber#cancel_timeout thanks to one of these.

src/crystal/system/unix/epoll/event_queue.cr Show resolved Hide resolved
src/crystal/system/unix/epoll/event_queue.cr Show resolved Hide resolved
src/fiber.cr Outdated Show resolved Hide resolved
src/fiber.cr Outdated Show resolved Hide resolved
@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jul 16, 2024

I took care to use EPOLLEXCLUSIVE (available since 2017). I'm not sure we need EPOLLONESHOT since we register the fd once then dequeue readers/writers one by one. Now, there may still be races, especially in a MT environment, but there's little we can do without blocking the other readers/writers (there's no telling if a reader/writer will keep reading/writing).

I didn't know the close quirk. That explains why we unregister the file descriptors before we actually close. Now, if a fd is in multiple epoll instances, oops, this is likely already broken today 😨 this is working today: we remove the events and the fd from all event bases that referenced the IO object.

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jul 17, 2024

The close quirk needs some experimental verification:

Q6 Will the close of an fd cause it to be removed from all epoll sets automatically?

A6 Yes.

https://linux.die.net/man/4/epoll

The one thing to take care of would be to resume the cached event/fibers from all epoll instances (tricky).

I also found an issue with EPOLLEXCLUSIVE: we can't modify, so either we don't use it or we must del/add when changing the set of events.

@yxhuvud
Copy link
Contributor

yxhuvud commented Jul 17, 2024

The close quirk

Could be they fixed it at some point, there has been a bunch of years since that post was written after all..

@ysbaddaden
Copy link
Contributor Author

Another man page, a longer answer:

Yes, but be aware of the following point. A file descriptor is a
reference to an open file description (see open(2)). Whenever a
descriptor is duplicated via dup(2), dup2(2), fcntl(2) F_DUPFD, or
fork(2), a new file descriptor referring to the same open file de-
scription is created. An open file description continues to exist
until all file descriptors referring to it have been closed. A
file descriptor is removed from an epoll set only after all the
file descriptors referring to the underlying open file description
have been closed (or before if the descriptor is explicitly removed
using epoll_ctl() EPOLL_CTL_DEL). This means that even after a
file descriptor that is part of an epoll set has been closed,
events may be reported for that file descriptor if other file de-
scriptors referring to the same underlying file description remain
open.

https://man.freebsd.org/cgi/man.cgi?query=epoll&apropos=0&sektion=0&manpath=SuSE

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jul 18, 2024

The preview_mt issues are caused by thread B closing a fd while thread A is waiting; since each thread has its own eventloop this is creating issues, I end up not cleaning up the fd from thread A' epoll instance.

I'm trying to think of an efficient way to iterate the event loops on (something lighter than ThreadLocalValue), but may resort to iterate all EL for starters (and check if it fixes the MT issues that fixes the issue).

We can't call EPOLL_CTL_MOD with EPOLLEXCLUSIVE. Let's disable it for
now and see later if we can replace it with a pair of EPOLL_CTL_DEL and
EPOLL_CTL_ADD.
@yxhuvud
Copy link
Contributor

yxhuvud commented Jul 18, 2024

Perhaps it is possible to have a lookup table fd -> loop (or probably - a bit less contentious in mt scenario - have a list in the FileDescriptor (well no, but you perhaps better understand what I mean compared to other choices of words) that keeps track of what epolls uses it). But it is an icky problem :(

Process.run sometimes hang forever after fork and before exec, because
it tries to close a fd that requires to lock, but another thread may
have already acquired the lock, while `fork` only duplicates the current
thread (the other ones are not, and the forked process was left waiting
for a mutex to be unlocked, which would never happen.
That required to allocate a Node for the interrupt event, which ain't a
bad idea.

byte = 1_u8
LibC.write(@pipe[1], pointerof(byte), 1)
# the atomic makes sure we only write once
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is writing more than once a problem when using eventfd? Could be an alternative to simply write multiple times and not have the atomic variable at all. The eventfd will just return the sum of values that has been written to it, after all (unless using the semaphore flag).

(BTW, this method is to interrupt the blocking wait, right? Sometimes the term notify is used for that)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're right. It's not needed as for the pipe (we can write 0xfffffffffffffffe times until it blocks) but it helps avoid pointless write syscalls, so ⚖️ ?

@ysbaddaden
Copy link
Contributor Author

It looks like it's starting to work... except for the interpreter that now hangs forever 😭

@ysbaddaden
Copy link
Contributor Author

Using a release compiler (with libevent), the interpreter segfaults 😭

Building a compiler with epoll (as CI does) in non release, the interpreter enters a kind of infinite loop (100% CPU, silent strace) after printing the first spec dot (.). The last strace logs are:

epoll_create1(EPOLL_CLOEXEC)            = 13
eventfd2(0, EFD_CLOEXEC)                = 14
timerfd_create(CLOCK_MONOTONIC, TFD_CLOEXEC) = 15
epoll_ctl(13, EPOLL_CTL_ADD, 14, {EPOLLIN, {u32=1958253120, u64=140490338505280}}) = 0
epoll_ctl(13, EPOLL_CTL_ADD, 15, {EPOLLIN, {u32=1958253072, u64=140490338505232}}) = 0
write(11, "\33[", 2
rite(11, "32", 232)                      = 2
write(11, "m", 1m)                       = 1
write(11, ".", 1.)                       = 1
write(11, "\33[", 2
rite(11, "0", 10)                       = 1
write(11, "m", 1m)                       = 1

It's creating the eventloop instance (epoll, eventfd, timerfd) then writes a colored . then nothing.

Any pointers to debug the interpreter itself?

@ysbaddaden
Copy link
Contributor Author

I can't get tracing to work in the interpreted code, so I hacked myself into Crystal.trace and I see an infinite list of sched.event_loop traces 🤨

@ysbaddaden
Copy link
Contributor Author

I only checked whether @events was empty and didn't check if @timers was, too 🤦

@ysbaddaden ysbaddaden marked this pull request as ready for review July 20, 2024 12:55
@ysbaddaden
Copy link
Contributor Author

CI is finally green and the implementation can be reviewed!

Copy link
Contributor

@yxhuvud yxhuvud left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What is the plan for rolling out this? Does it need some sort of opt-in flag for a version or two so that people can try and find issues? Or is it a 2.0 thing?

src/crystal/system/unix/epoll/event_loop.cr Show resolved Hide resolved
src/crystal/system/unix/epoll/event_loop.cr Show resolved Hide resolved
size = LibC.recvfrom(socket.fd, slice, slice.size, 0, sockaddr, pointerof(addrlen))
if size == -1
if Errno.value == Errno::EAGAIN
wait_readable(socket.fd, socket.@read_timeout)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So one potential issue here is the following potentially chain of confusing behavior:

-> recvfrom -> EAGAIN
-> waiting a while, say timeout half time span
-> recvfrom -> EAGAIN
.. etc, potentially waiting forever, never timing out.

But treating the timeout as a deadline is perhaps a different issue than this pr.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch.I'm 99% sure this problem is already in the libevent event loop. That doesn't mean we should reproduce it here.

@ysbaddaden
Copy link
Contributor Author

ysbaddaden commented Jul 22, 2024

how to roll out

Either opt-in with a flag... or we're bold and make it the default + a flag to opt-out. That will depend on how stable/performant it will get until the next release.

@ysbaddaden ysbaddaden force-pushed the feature/epoll-event-loop branch from 2fc2f99 to 4d4c068 Compare July 22, 2024 14:23
getter type : Type
getter fd : Int32

property! time : Time::Span?
Copy link
Contributor Author

@ysbaddaden ysbaddaden Jul 23, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TODO: IOCP uses wake_at and it's much better name than time!

@ysbaddaden ysbaddaden marked this pull request as draft July 29, 2024 20:39
@ysbaddaden
Copy link
Contributor Author

Moving back to draft. There are some fixes in #14829 and... I got great ideas from the Go implementation that would allow to skip most of the synchronizations and take full advantage of epoll/kqueue (see ysbaddaden/execution_context#30).

@straight-shoota
Copy link
Member

Superseded by #14959

@ysbaddaden ysbaddaden deleted the feature/epoll-event-loop branch September 6, 2024 10:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants