Time based fiber yield points #15

Open
ysbaddaden opened this issue Jun 11, 2024 · 5 comments
Labels: enhancement (New feature or request)

Comments

@ysbaddaden
Owner

CPU-bound fibers may never hit a cancellation point, and can thus monopolize a thread for an unbounded amount of time, blocking other fibers from running.

I'm not even talking about CPU-heavy algorithms (e.g. cryptography): a regular socket may prevent a fiber from yielding, for example when the socket is always ready for read or write (the fiber will never wait on the event loop); a buffered Channel won't suspend the current fiber until the buffer is full (on send) or empty (on receive), which may take a while to happen if the other end is consuming or pushing values quickly.
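
As a contrived illustration of the Channel case (made-up capacity and workload; assume the consumer runs on another thread, as in an MT execution context):

  channel = Channel(Int32).new(128)

  spawn do # fast consumer: keeps the buffer from ever filling up
    loop { channel.receive }
  end

  i = 0
  loop do
    channel.send(i += 1) # rarely finds the buffer full, so it rarely suspends
  end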

One goal of execution contexts is to limit this in practice, by taking advantage of the OS to preempt threads. Still, we should have more places that can yield the current fiber.

For example, Fiber#enqueue could check how long the current fiber has been running and decide to yield once it has run for N milliseconds. Maybe IO methods could do just that (inside the EventLoop). Instead of checking Time.monotonic over and over again, we could have the monitoring thread do the check and mark the fiber (see #5).
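
A minimal sketch of that first idea, assuming a hypothetical start timestamp that the scheduler would refresh every time it resumes a fiber (none of these names are the actual API):

  class Fiber
    # time budget before we ask a fiber to yield (arbitrary value)
    QUANTUM = 10.milliseconds

    # assumed to be reset by the scheduler on every resume
    property start : Time::Span = Time.monotonic

    def self.maybe_yield : Nil
      Fiber.yield if Time.monotonic - Fiber.current.start > QUANTUM
    end
  end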

ysbaddaden added the enhancement label Jun 11, 2024
@ysbaddaden
Owner Author

Note: more yield points means that cooperative shutdown, cooperative stop-the-world, or fiber cancellation could happen faster.

@ysbaddaden
Owner Author

I got the logic implemented:

  1. Fiber.maybe_yield will yield if told to;
  2. Fiber#enqueue always calls Fiber.maybe_yield (*);
  3. start a monitoring thread along with the default execution context;
  4. the monitoring thread wakes up every 10ms, iterates each thread/scheduler, collects the running fiber (with its current schedule tick) and marks it for yield if it didn't change since the previous iteration; this should tell a fiber to yield after it has run for 10ms to 30ms (see the sketch below);
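
Roughly, steps 1 and 4 could look like the following; the Scheduler stand-in, request_yield, and the tick accessor are assumptions derived from the description above, not the shard's actual API:

  # mark-based variant: the monitor sets a flag, the fiber checks it
  # at its next yield point (step 1)
  class Fiber
    @yield_requested = Atomic(Int32).new(0)

    def request_yield : Nil
      @yield_requested.set(1)
    end

    def self.maybe_yield : Nil
      Fiber.yield if Fiber.current.@yield_requested.swap(0) == 1
    end
  end

  # entirely hypothetical stand-in for the per-scheduler state the
  # monitoring thread samples
  class Scheduler
    property running_fiber : Fiber = Fiber.current
    property tick : UInt64 = 0_u64
  end

  # step 4 (NOTE: a real monitor runs on a dedicated thread; a plain
  # sleep keeps the sketch simple)
  def monitor_loop(schedulers : Array(Scheduler))
    previous = {} of Scheduler => {Fiber, UInt64}

    loop do
      sleep 10.milliseconds

      schedulers.each do |scheduler|
        sample = {scheduler.running_fiber, scheduler.tick}
        # same fiber with the same tick as 10ms ago: it has been running
        # for 10ms to 30ms, so mark it to yield at its next check point
        sample[0].request_yield if previous[scheduler]? == sample
        previous[scheduler] = sample
      end
    end
  end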

It works. A busy loop that manually calls Fiber.maybe_yield regularly yields 🎉
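
For illustration, the kind of busy loop in question, with the hypothetical Fiber.maybe_yield from the sketch above:

  counter = 0_u64
  spawn do
    loop do
      counter &+= 1     # CPU-bound work that never suspends on its own
      Fiber.maybe_yield # yields only when the monitor marked this fiber
    end
  end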

But I quickly get a libevent2 warning:

[warn] event_base_loop: reentrant invocation. Only one event_base_loop can run on each event_base at once.

And an eventual crash:

FATAL: can't resume a running fiber #<Fiber:0x7fa4608ebe40: DEFAULT:loop> (#<ExecutionContext::SingleThreaded:0x7fa4608f0f20 DEFAULT>)
  from src/single_threaded.cr:124:52 in 'resume'
  from src/single_threaded.cr:172:11 in 'run_loop'
  from src/single_threaded.cr:68:56 in '->'
  from src/core_ext/fiber.cr:151:11 in 'run'
  from src/core_ext/fiber.cr:57:34 in '->'
  from ???

(*) more places could eventually call Fiber.maybe_yield, for example uncontended IO methods, buffered channels, mutexes, and so on.

@ysbaddaden
Owner Author

ysbaddaden commented Jun 20, 2024

Note about the crash: this is the EC::Scheduler#run_loop fiber trying to resume itself 🤨

EDIT: this may happen if:

  1. the monitoring thread is trying to interrupt a thread's main fiber that is reserved to run the run loop, or the main thread's run loop fiber (in both cases: it mustn't);
  2. the scheduler ends up calling Fiber#enqueue (it mustn't);

NOTE: had it been an MT context, the thread would have deadlocked, waiting for the current fiber to become resumable.

@ysbaddaden
Owner Author

ysbaddaden commented Jun 20, 2024

I think the issue is the monitoring thread marking the run-loop fibers for yield, then the event loop calling Fiber#enqueue to enqueue the ready fibers.

  • Workaround: events should now call ExecutionContext::Scheduler.enqueue(fiber) instead of Fiber#enqueue;
  • Ideally: Crystal::EventLoop#run would return a list of fibers that the scheduler would enqueue in a single batch (instead of 1 by 1), minus one fiber to be resumed immediately, skipping the queue (sketched below).
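
A sketch of that ideal shape, with dummy types standing in for the real event loop and scheduler (Crystal::EventLoop#run does not return fibers today, so this signature is an assumption):

  class FakeEventLoop
    # stand-in for a Crystal::EventLoop#run that returns the woken fibers
    def run(blocking : Bool) : Array(Fiber)
      [] of Fiber
    end
  end

  class SketchScheduler
    @queue = Deque(Fiber).new
    @event_loop = FakeEventLoop.new

    def run_event_loop : Nil
      ready = @event_loop.run(blocking: true)
      return if ready.empty?

      direct = ready.pop   # resume this one immediately, skipping the queue
      @queue.concat(ready) # enqueue the rest in a single batch, not 1 by 1
      direct.resume
    end
  end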

@ysbaddaden
Owner Author

  • Alt workaround: the monitoring thread should skip a scheduler's main fiber.

That fixed the issue.
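
In the monitoring loop sketched earlier, the guard is a single check (main_fiber is a hypothetical accessor):

  # never mark the fiber reserved for a scheduler's run loop:
  # interrupting it is what produced the crash above
  next if scheduler.running_fiber == scheduler.main_fiber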
