Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Port concurrency features to Windows #6957

Closed
4 tasks done
straight-shoota opened this issue Oct 18, 2018 · 40 comments · Fixed by #13362
Closed
4 tasks done

Port concurrency features to Windows #6957

straight-shoota opened this issue Oct 18, 2018 · 40 comments · Fixed by #13362
Assignees
Labels
platform:windows Windows support based on the MSVC toolchain / Win32 API topic:stdlib:concurrency

Comments

@straight-shoota
Copy link
Member

straight-shoota commented Oct 18, 2018

This is a sub-task of porting the stdlib to Windows #5430

I suppose we can delay porting threads by implementing a mock API for Thread and Thread::Mutex for win32 which essentially doesn't to anything. That should work perfectly fine for single threaded process.

On windows, we should use the win32 API directly instead of libevent (quoting @RX14):

libevent uses select on windows, and select doesnt support pipes on windows, only sockets. Besides, it's an additional dependency, whereas using IOCP directly isn't that much harder and allows us to support sockets, pipes, files, etc.
Windows's nonblocking IO model is entirely different to how libevent works. On windows you submit entire IO operations and wait for them to complete. With epoll you're waiting for "readable/writable status" of a FD then you can perform an IO.

Since the API models are quite different, this will likely require some refactoring of Crystal::Event and Crystal::EventLoop.

@j8r
Copy link
Contributor

j8r commented Oct 18, 2018

I was asking in the parent issue if libuv could be used, with the thread-safe uv_async.

@RX14
Copy link
Contributor

RX14 commented Oct 18, 2018

uv_async isn't useful

@RX14
Copy link
Contributor

RX14 commented Oct 18, 2018

@straight-shoota Crystal::Event and Crystal::EventLoop should go, and be internal to Crystal::System. There needs to be a concept of starting the event loop, and resuming the event loop when there's no work to do, but read/write/other blocking operations that submit events to the event loop should be inside Crystal::System with no fixed API. Attempting to provide a unified event loop abstraction to the outside world doesn't seem worthwhile to me, instead just provide a blocking IO abstraction and make the suspension and resumption of fibers entirely internal and entirely different internally on each platform.

I'm not sure how this fits in with @ysbaddaden's work, and it shouldn't be neecesary to port fibers. Fibers on windows should be working with channels and merged into master before any of the nonblocking IO stuff is even thought about. They are seperate concerns.

@RX14
Copy link
Contributor

RX14 commented Oct 18, 2018

To be clear: I'm proposing Crystal::Event and Crystal::EventLoop and all their usages are moved into Crystal::System such that windows can pursue an entirely different solution, probably not based on callbacks at all, instead directly resuming fibers from the event loop.

But this very much depends on the scheduler design from @ysbaddaden which I haven't seen. I'm pretty in the dark on the design for multicore which @ysbaddaden is proposing and how that fits in with my and @bcardiff's work. This is why I propose not porting evented IO at all on windows yet. It may end up being counterproductive and chasing a moving target.

@straight-shoota
Copy link
Member Author

If we can port fibers without event, that's totally fine by me. Then we just need to further refactor Fiber and Scheduler so that a platform-specific implementation for doesn't depend on Crystal::Event?

@RX14
Copy link
Contributor

RX14 commented Oct 18, 2018

@straight-shoota you can stub it out for now, the only connection between the scheduler and the event loop is Crystal::EventLoop.resume, sleep which is unused on windows and yield. I'm not 100% sure but I think yield can be modeled as exactly the same as enqueue if the scheduler isn't present.

I advise you to do these tweaks via {% if flag in scheduler.cr, not to add any more Crystal::System for the scheduler.

@ysbaddaden
Copy link
Contributor

Some notes:

  • Fiber: only depends on anonymous memory maps (posix: mmap, win32: file mapping) for allocating stacks, and arch-specific context creation/swap functions —warning: context is arch+os specific on win32;

  • Crystal::EventLoop: the stdlib has few expections that can be changed for something more abstract (add a file read resume event, add a socket write resume event, ...). The monothreaded Crystal::Scheduler only cares about resuming the event loop fiber (blocking). My multithreaded Crystal::Scheduler expects a safe way to run the event loop (no fiber resume) once and nonblocking (with libevent: event_base_loop(base, EVLOOP_ONCE | EVLOOP_NONBLOCK));

  • Crystal::Scheduler: depends on Fiber to be correctly implemented, and resumes the event loop when its queue is empty; until win32 has a working event loop, maybe it could just exit (nothing to do);

  • Thread, Thread::Mutex and Thread::ConditionVariable: they're not required to implement Fiber, but I'd still encourage to have real or just skip implementations instead of adding more stubs.

I have a few more changes pending that I'd like to push:

  1. introduce a Fiber::Context struct for holding the stack top pointer and a resumable flag (preventing a double resume of a fiber), and change the context functions accordingly. Maybe later it could contain some bytes to save the current CPU registers —for a Crystal GC stopping the world;
  2. introduce a thread-safe Fiber::StackPool struct (mutex based).

@straight-shoota
Copy link
Member Author

I got it working so far. At least in theory. I still need to get the stack swap on win32...

@RX14
Copy link
Contributor

RX14 commented Oct 18, 2018

add a file read resume event, add a socket write resume event

That doesn't work on windows, because you're not waiting for an FD to become readable or writable, you're waiting for a specific read or write IO to finish. You then need to resume the specific fiber that sent that IO.

event_base_loop(base, EVLOOP_ONCE | EVLOOP_NONBLOCK)

Where is the blocking sleep when there's nothing to do performed then?

maybe it could just exit (nothing to do);

agreed, probably exit with a warning since with just fibers and channels it should never happen? I haven't proved it to myself but it seems logical.

I have a few more changes pending that I'd like to push

They look like good changes.

@ysbaddaden
Copy link
Contributor

That doesn't work on windows

Oh, then the event loop is target specific. That's wonderful. I still wish we could try libevent (at least for timers and sockets), until we dig for arch specifics (IOCP, kqueue and epoll)

Where is the blocking sleep when there's nothing to do performed then?

Threads will spin trying to steal fibers / run the event loop, then give up and park themselves, unless it's the last thread, which should run the event loop as blocking; along with a mechanism to wake parked threads when fibers are enqueued (i.e. mutex + condition variable).

@RX14
Copy link
Contributor

RX14 commented Oct 21, 2018

I still wish we could try libevent

I do too, but unfortunately pipes are pretty essential to everything from basic process spawning to signal handing. (yes, windows has signals two)

which should run the event loop as blocking;

ah, so blocking/nonblocking is optional, makes sense

@rdp
Copy link
Contributor

rdp commented Dec 6, 2018

FWIW libevent does mention "IOCP" though with scant documentation, seemingly: https://stackoverflow.com/questions/8042796/libev-on-windows ... hmm might not be enough...

@straight-shoota straight-shoota added topic:stdlib:concurrency platform:windows Windows support based on the MSVC toolchain / Win32 API labels Dec 30, 2018
@cfsamson
Copy link
Contributor

cfsamson commented Aug 5, 2019

@RX14 and @ysbaddaden Regarding the Windows event loop I think this is very interesting: https://github.com/piscisaureus/wepoll and just wanted to leave it here as a reference before I forget it.

In Rust mio is the de facto standard event loop library and they very recently switched over to a solution inspired by this work. The are several advantages besides a familiar api, one of them beeing performance since the only other way I know of needs extra allocations on Windows due to IOCP requiring a read/write buffer for each event.

It's at least worth considering if the alternative is to implement our own IOCP implementation since that will be a big task anyway.

@rdp
Copy link
Contributor

rdp commented Aug 9, 2019

Wonder if we could just start with select+libevent and then move to IOCP...to save time. But then again maybe too painful to do everythinig twice...or those others might be interesting as well.

@RX14
Copy link
Contributor

RX14 commented Aug 18, 2019

wepoll is limited only to sockets

which limits it to being just an optimization once there's an eventloop architecture which can handle the IOCP model.

I'd rather make something that works for the most general case of readable/writable handles then optimize it for sockets later, instead of make something which works for sockets then leave IO to block on every other kind of file until someone gets round to fixing it (which would mean refactoring the event loop, which probably means nobody will get around to it which means hell)

@cfsamson
Copy link
Contributor

cfsamson commented Aug 27, 2019

You're right, it's unfortunately only useful for sockets it seems. On investigating this closer I also realized that the use of IOCTL_AFD_POLL seems to be undocumented and might not work in the future which probably should raise concerns if used in Crystals stdlib.

@cfsamson
Copy link
Contributor

@rdp I think designing the event loop architecture with IOCP in mind from the start is the right thing to do. I would actually consider designing it with IOCP in mind first, getting the readiness based models like kqueue and epoll to work with that is probably easier than the other way around. However, everything is possible.

@RX14
Copy link
Contributor

RX14 commented Sep 8, 2019

@cfsamson yeah, I thought getting epoll to work like IOCB is easier than the other way around too

after all the "you need to allocate less buffers when using epoll" argument is moot when using crystal's IO model: you need to allocate them anyway since it emulates blocking IO with greenthreads.

Windows' IO model is essentially submitting a buffer and the OS tells you when it's done filling it with data and how much. This is easily mapped to Crystal, and epoll is easily mapped to that (we already do it, just at a higher layer).

@cfsamson
Copy link
Contributor

cfsamson commented Sep 10, 2019

@RX14 I'm going out on a limb here partly since it might contribute to the discussion, and partly out of curiosity. I made an extremely simplified model to just plot down how something like this could work (if the plan is to abstract at a higher level like sockets/files/pipes to hide the implementation details). If I understand you correctly the green thread model greatly simplifies the event loop implementation since you can easily prevent the buffer sent to IOCP from being touched while waiting for the event to complete and you will not have any "extra" allocations since this will be abstracted over in either case:

Crystal Eventloop

I apologize in advance for simplyfying this so much that the code is not valid anything really and skipping a ton of complexity.

@RX14
Copy link
Contributor

RX14 commented Sep 17, 2019

@cfsamson note that all reads go through the IO primitive read(), which already requires an explicit buffer to be passed in, so you'd register that buffer with the event loop directly. gets and co already allocate buffers and use read(). That means the amount of allocation does not change, and buffers must already be per-fiber.

@jan-zajic
Copy link
Contributor

@rdp
Copy link
Contributor

rdp commented Nov 20, 2019

libevent supports threaded IOCP (practically the only examples I could find here: https://github.com/libevent/libevent/blob/master/event_iocp.c https://github.com/libevent/libevent/blob/master/sample/http-server.c https://github.com/libevent/libevent/blob/master/test/regress_iocp.c#L304 (the *ptr passed in is a "basic_test_data" object with a socket pair established).
http://www.wangafu.net/~nickm/libevent-book/Ref6_bufferevent.html) but documentation is super scarce...it almost looks unmaintained...either that or it's bug free? :)

Also interesting is that https://github.com/libevent/libevent/blob/master/event_iocp.c (the entire "libevent IOCP implementation") isn't that long, maybe a good pattern. go's is seems reasonably small too: https://golang.org/src/runtime/netpoll_windows.go?h=iocp and https://golang.org/src/net/fd_windows.go#205 line 205 FWIW. Maybe that's all :)

libuv also supports IOCP but...I could hardly see any examples anywhere...also it seems libuv requires one "event loop" per thread, wasn't sure how that lined up with crystal's current use of libevent...

@cfsamson
Copy link
Contributor

cfsamson commented Dec 12, 2019

I've been thinking about this quite a lot (since I'm investigating something related). Creating our own event queue is doable. It's a handfull of syscalls to use on linux/bsd/windows, but this is only part of the problem and we should consider the next steps as well. Here are some of the questions I think needs some discussion and my initial thoughts as well:

How do we run the event queue?

We can implement a simple Reactor backed by this epoll/kqueue/IOCP based queue which is meant to run on a single separate reactor thread (OS thread), and then follow an epoll based approach from there (for a lack of a better description) where the Reactor wakes the green thread which is ready to progress with a task. This means the Reactor needs a way to communicate with the Scheduler to mark the relevant green thread as ready to progress.

How to register events?

The next part is how we register events. My initial thoughts here are that we implement a Registrator which can be sent to different OS threads which is tied to the epoll/kqueue/IOCP based event queue. This passes on a resource handle, a buffer and a flag to indicate Readable/Writable interests. It also means that we need to make this Registrator thread safe. Registrator would be an implementation detail used in the abstraction over for example Sockets where we register an interest with the Reactor and then suspends the green thread.

DNS lookup and File I/O

Since these are most often cached by the OS (and have poor cross platform API's AFAIK) these are most often sent to a thread pool. Anyway, I think we need a cross platform thread pool up and running as well to be able to actually use this in i.e. a web server.

How to wake a green thread

This is a bit tricky I think due to synchronization issues and performance. If we wont to avoid actively polling a queue we need a way to interact with the scheduler from the reactor thread. I don't know how well the current Scheduler implementation allows for this or the best way to solve this yet.

I'm just putting this thoughts here for now to see if it can contribute to a constructive discussion.

@RX14
Copy link
Contributor

RX14 commented Dec 12, 2019

The interface would be to submit a file descriptor for a read/write to the scheduler, and the scheduler would resume your fiber when it's done. The rest is a platform-specific black box. On existing platforms it'd use the same (refactored) libevent code it always has, just moved out of IO::Evented. On windows, it'd use IOCP. There's no need to share code at higher-granularity for an initial implementation, so KISS.

IIRC with IOCP you can register a void* of data with your read/write, this would simply be the Fiber pointer to resume, so the event loop would just block waiting for events, and directly receive the fiber pointer to resume from the OS.

That's why I proposed the custom event loop for windows - the only hard part is handling the sleep events.

@cfsamson
Copy link
Contributor

cfsamson commented Dec 13, 2019

Oh, I see.

Yes, you associate a token (or a pointer) when registering a resource with the completion port in CreateIoCompletionPort (you associate it with the resource and not as a part of the event WSARecv/WSASend). This token is "returned" when retrieving an event using either GetQueuedCompletionStatus (wait for one event at a time) or as part of the OVERLAPPED_ENTRY structure when using GetQueuedCompletionStatusEx to get multiple events.

You still need to actually do the blocking wait for events in a separate thread so that would be the Reactor part I suggested, or some variant of that inside the black box.

Sleep events is tricky. I've tried something like that before and kept an ordered queue of timers (I used a BTreeMap) which I checked every time the event queue receives an event (or times out) for expired timers.

Every blocking call to GetQueuedCompletionStatus(Ex) uses the closest timer as a timeout. If a new timer is registered which is earlier than the previous registered timers, I update the timeout by posting an empty completion packet using PostQueuedCompletionStatus thereby forcing the event queue to wake up and update its timeout with the new value.

I don't know if there is a better way to do this since it's not a pretty solution. There probably is.

@RX14
Copy link
Contributor

RX14 commented Dec 16, 2019

Yes, you associate a token (or a pointer) when registering a resource with the completion port in CreateIoCompletionPort (you associate it with the resource and not as a part of the event WSARecv/WSASend). This token is "returned" when retrieving an event using either GetQueuedCompletionStatus (wait for one event at a time) or as part of the OVERLAPPED_ENTRY structure when using GetQueuedCompletionStatusEx to get multiple events.

Ah yes, I'd forgotten the details. This is exactly the same as we currently have in libevent's implementation then, where we register one Event per read/write of a given FD. Then you receive a read/write event for a given file descriptor, and you're passed a handle to the IO::Evented instance. Then the IO::Evented instance works out exactly what fiber to resume, given there's been an event. Here is the impl.

You still need to actually do the blocking wait for events in a separate thread so that would be the Reactor part I suggested, or some variant of that inside the black box.

The interface is Crystal::EventLoop.run_once, in the same file i showed above. This would simply:

  • Work out the interval until the next sleep event expires
  • Call GetQueuedCompletionStatusEx with the calculated interval as the timeout
  • Use either the recieved event, or the timeout failure, to choose which fiber to wake up.

If a new timer is registered which is earlier than the previous registered timers, I update the timeout by posting an empty completion packet using PostQueuedCompletionStatus thereby forcing the event queue to wake up and update its timeout with the new value.

This is all abstracted a bit above the event loop in crystal, at the scheduler level. Event loops are per-thread, not per-program, and one thread can only have timers registered on it's event loop from that thread, meaning you never have to deal with the case of a timer being registered while you're sleeping. Fibers can then be passed between threads by pipes (which generate a read event). I'm glad this is solvable if that situation changes though. Might want to look through scheduler.cr to get an idea.

@cfsamson
Copy link
Contributor

cfsamson commented Dec 17, 2019

That actually simplifies things even more. I'll have to get to know the scheduler and EventLoop a bit more to make sure I understand correctly.

I would start by adding bindings for the relevant syscalls and provide some wrappers around them to make them easier to use. I suggest that Event essentially is just an alias for the OVERLAPPED_ENTRY on Windows which will return with a pointer to the relevant IO::Evented instance in the lp_completion_key field once an event has completed. This should allow us to set up some basic infrastructure in event_loop_iocp.cr.

I'll see if I have some time after my current project is done and see if I can help progress this. Is there a stdcall directive in Crystal for working with WinApi or is there another way that is solved?

Edit:

The above suggestion of using the lp_completion_key field might not work since it's registered on a per resource basis. I can't see a way to actually identify what event has occurred by using that to store a pointer to IO::evented.

Judging by the BOOST ASIO implementation they seem to not use CompletionKey at all except when posting "custom" completion packets to PostQueuedCompletionStatus like in the case of timers.

Instead they wrap the OVERLAPPED structure passed in to for example WSARecv in an Operation struct compatible with the expected memory layout for API. This lets them cast the pointer to the Operation struct to *OVERLAPPED when passing it in to WSARecv and then back to a Operation struct when the event has occurred thereby getting the rest of the context for that exact event.

@RX14
Copy link
Contributor

RX14 commented Dec 19, 2019

The above suggestion of using the lp_completion_key field might not work since it's registered on a per resource basis

I think this is fine, given that there's only one IO::FileDescriptor per resource. But, I wonder if we'd be able to make that guarantee.

I can't see where windows lets you see whether an event was a read or a write completing though...

I'll see if I have some time after my current project is done and see if I can help progress this. Is there a stdcall directive in Crystal for working with WinApi or is there another way that is solved?

There's plenty of windows API functions bound already. I think crystal already uses stdcall for all functions on windows.

Actually, since we're compiling for x86-64, everything on windows uses the microsoft x64 calling convention, which is not fastcall. This is WinAPI and crystal functions themselves. So on 64bit windows, there is only one calling convention which makes this all simple. If we ever port to 32bit windows we might have to sort this out.

@cfsamson
Copy link
Contributor

cfsamson commented Dec 19, 2019

I can't see where windows lets you see whether an event was a read or a write completing though...

I think this is exactly why they wrap OVERLAPPED to privide this exatra information about the event. I'll see if I can find one more references on how to solve this. I haven't found any information about this in the IOCP documentation, but might have missed something. I do have a POC using this technique in Rust and it seems to work fine.

@RX14
Copy link
Contributor

RX14 commented Dec 19, 2019

I think this is exactly why they wrap OVERLAPPED to privide this exatra information about the event.

If it works, it's more flexible, and it's what everyone else does, this is just fine to me!

@cfsamson
Copy link
Contributor

I just checked mio (a Rust implementation of epoll/kqueue/iocp event queue) and it does (well, did since they switched to wepoll recently) the same as I explained with regards to wrapping the OVERLAPPED structure above check here for the relevant lines of the source code.

It seems to be the a pretty normal technique.

@RX14
Copy link
Contributor

RX14 commented Dec 21, 2019

It seems to be the a pretty normal technique.

yeah I prefer this too now I know about it.

@neatorobito
Copy link
Contributor

I've done the plumbing work to get the IOCP functions wrapped into Crystal. I am somewhat familiar with win32 APIs but I've never worked on an event loop. @cfsamson Would you want to work together on this? I've been reading some of your stuff here to get up to speed.

@cfsamson
Copy link
Contributor

cfsamson commented Nov 9, 2020

@incognitorobito Great!

I've been wanting to take this on but have had (and still have) a limited bandwidth. If you can take lead on this I'll try to help push this forward. Great that you found that book. The event loop here should be pretty simple. If I remember correctly Crystal::EventLoop.run_once will pretty much wrap a call to GetQueuedCompletionStatusEx and resume the correct fiber on a completion event (or a timeout).

We'll need to wrap WSAOVERLAPPED in something like an "Operation" struct which gives us a handle to the fiber to resume on completion. We need to keep the memory layout of Operation compatible with WSAOVERLAPPED. I write about that all the way down at the end of this chapter. I'm not 100% sure how to do this in Crystal, though.

@neatorobito
Copy link
Contributor

I put together a basic implementation in #9957. Does it line up with what was discussed here?

@kubo
Copy link
Contributor

kubo commented Dec 20, 2020

#9957 is great. I tried it several months ago and it worked for small scripts. However tests failed with access violation randomly. I abandoned my code.

By the way, IOCP requires handles with FILE_FLAG_OVERLAPPED. However,

  1. Consoles such as CONIN$ and CONOUT$ cannot be opened with FILE_FLAG_OVERLAPPED.
    dwFlagsAndAttributes is ignored for Consoles according to this document.
  2. Handles associated with stdin, stdout and stderr at crystal process startup don't have FILE_FLAG_OVERLAPPED.
  3. The arguments of input, output and error of Process.run must be handles without FILE_FLAG_OVERLAPPED. Otherwise, process startup fails. I tried to convert a handle with FILE_FLAG_OVERLAPPED to a handle without FILE_FLAG_OVERLAPPED by ReOpenFile but it didn't work. See this comment instead of strikethrough text.

So event loop for Windows should support I/O without the overlapped flag.

Just an idea to resolve 1 and 2:

  • For a handle with FILE_FLAG_OVERLAPPED, (this is just for comparison with the next)
  • For a handle without FILE_FLAG_OVERLAPPED,
    • pass the handle, I/O buffer and an OVERLAPPED structure to a separate thread (thread pool?),
    • request blocking I/O in the separate thread,
    • call PostQueuedCompletionStatus with the OVERLAPPED structure when the I/O finishes in the separate thread
    • and get the result by using GetQueuedCompletionStatusEx.

As for 3, two ideas.
The first:

  • FIle.new and File.open take an argument overlapped : Bool = true.
  • Process.run raises an exception when any one of input, output and error arguments is opened with overlapped = true.
    Pros: No need to use a separate thread to request I/O when overlapped = true.
    Cons: Programmers must take care about the overlapped flag when using Process.run.

The second:

  • Files are opened always without FILE_FLAG_OVERLAPPED.
    Pros: Programmers have no need to take care about the overlapped flag when using Process.run.
    Cons: Any I/O request uses a separate thread.

@cfsamson
Copy link
Contributor

cfsamson commented Dec 23, 2020

@kubo Right now I would let file I/O be blocking. Most implementations I've seen uses a threadpool for file I/O (e.g. libuv) but with the advent of io_uring this might change. Since most OS cache frequently accessed files the performance impacts of leaving it blocking might not be that big depending on the concrete use case (for some uses it might be faster since you don't involve a lot of machinery to serve a cached file). However, I see that this might be insufficient for a long term solution, but IMHO we should focus on getting every other piece working first. An interesting article about the subject can be found here.

@incognitorobito I've added some comments in #9957.

@kubo
Copy link
Contributor

kubo commented Dec 25, 2020

@cfsamson
I agree with you if file I/O is only disk I/O. However the file I/O API is used for not only real files but also pipes and consoles.

@kubo
Copy link
Contributor

kubo commented Dec 31, 2020

I implemented experimental event loop support of IO::FileDescriptor#read based on #9957. It passed std_spec but not compiler_spec as #9957. It runs blocking read in the default thread pool in Win32 using TrySubmitThreadpoolCallback.
EDITED: This causes access violation when ReadFile() returns after the timeout specified here.

I checked the behavior by the following code.

puts("Hit enter to exit.")
spawn do
  loop do
    sleep(1)
    print('.')
  end
end
gets

It prints "Hit enter to exit" only with #9957. gets prevents the loop in the spawn block.
It prints dots periodically with my implementation.

@HertzDevil
Copy link
Contributor

#11647 is merged, but leaving this open since the comments about asynchronous file I/O here may prove valuable

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
platform:windows Windows support based on the MSVC toolchain / Win32 API topic:stdlib:concurrency
Projects
Status: Done
Development

Successfully merging a pull request may close this issue.

10 participants