-
Notifications
You must be signed in to change notification settings - Fork 6
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Support macOS via kqueue
#27
Conversation
Wow! |
if (LinktimeInfo.isLinux) | ||
socket.accept4(fd, null, null, SOCK_NONBLOCK) | ||
else | ||
socket.accept(fd, null, null) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the SN binding for accept
is still bugged with null
arguments, so I call our own binding here.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It seems the SN binding for accept is still bugged with null arguments,
That is both news and a disappointment. I will look into that in daylight.
This almost surely falls into the category of code that I got rid of for 0.5.0,
but that does us no good for 0.4.n.
I am surprised by the null
arguments. I can trace the code, but I would expect
the enclosing method to actually use the accepted sockaddr to fill in the
Socket structure returned. I need to look at more than a four line window.
All things in time.
CI is starting to look green, but there are still some suspicious behaviors to be investigated... Lee, I trust that you will scrutinize this on M1 😁 thanks! |
if ((event.flags.toLong & EV_ERROR) != 0) { | ||
|
||
// TODO it would be interesting to propagate this failure via the callback | ||
reportFailure(new RuntimeException(s"kevent64: ${event.data}")) | ||
|
||
} else if (callbacks.contains(event.ident.toLong)) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Suspicious behavior here: in the CI logs it seems that a failure here is being reported quite frequently. However, the value of event.data
is zero. This seems strange because the manpages say that
with EV_ERROR set in flags and the system error in data.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Btw, I do not see these on my local system, which is macOS 11 intel.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wait, I lied, I do see it on my local 😂
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah, I had an off-by-one, which is why data
was 0.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So for the record the error is a "bad file descriptor". What's more interesting is it only seems to happen when running multiple tests; I cannot reproduce it when running a single test in isolation. This suggests some kind of interference perhaps due to file descriptors being re-used?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hard to tell from the audit trail above, sorry. Is this soup enough for me to try on Apple M1?
Not rushing, just polling.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, please, I think it's better to try on M1 sooner rather than later :) All the tests should pass, but there will likely be some stack traces logged.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
OK, I am off to run this PR on M1. That will keep me out of your hair for awhile.
2022-09-06 12:20 -0400 testsJVM/test : all passed, establishing a baseline testsNative/test : [info] Passed: Total 12, Failed 0, Errors 0, Passed 12 Apple M1, Monterrey, most current os (12.6?). I will print full stack traceback for the first and I think -9 is EBADF (bad file descriptor) on macOS. Unlisted tests all pass cleanly. I will have to use my system for awhile to see if it is OK
|
Thanks for reporting those results! Just to confirm, if you run those tests in isolation (e.g.
Yes I believe it is.
Eh, I am not so concerned. I believe these errors are occurring because the file descriptors have not been deregistered from the |
I did not try one-at-a-time. I can try that later tonight when it cools down.
Yes, unix likes to las-in-first-out file descriptor it knows about. Thus if you open an fd, getting say fd4, then close 4, |
I am not following. I understand that rules are off/suspended for rapid devo, so |
Good question. The error is happening here: epollcat/core/src/main/scala/epollcat/unsafe/KqueueExecutorScheduler.scala Lines 81 to 89 in 8cdd71d
Currently there is no way to propagate this error back via the callback, so we resort to Part of my confusion is that supposedly calling
https://www.freebsd.org/cgi/man.cgi?query=kqueue&sektion=2 In an earlier draft of the epollcat/core/src/main/scala/epollcat/unsafe/KqueueExecutorScheduler.scala Lines 65 to 68 in 0cc2458
However, this was also causing kqueue to error, in this case because I had asked it to remove kevents that no longer existed, because it had removed them itself when I So I am a bit stumped about what exactly to do here. Note that it is standard convention to use |
I tried running a single test on Apple M1. As you surmised, it worked as expected, with no traceback. You seemed to indicate that there was an easy way to run one test within a Suite. I know how to run I usually use brute force and edit the file. Knowing how to do a single test would save I will standby. Let me know how I can be useful. It will be a long day or |
Thanks for confirming that! So now we think what to do ...
Unfortunately I also edit the file 😂 I usually do this: test("run this test".only) {
...
} There are also some hints on how to run a single test without editing it: |
Ok, I've now fixed the Unlike The easy fix is to add a Now that CI is a healthy green, I think this PR is ready for proper review 😁 |
I have only a moment now. I was looking at the most recent changes and I am offline for most of today. I will try to trace that when Not that I doubt your code, just that I like to do fault |
Yes please!! I'm counting on it 😁
Sorry, I couldn't follow this scenario. How do read and write events relate to closing the socket? |
I am working my way through the code. I'll look closely at the "2 event in poll interval". When I first read the Switching topics:
I am somewhat surprised that |
epollcat/core/src/main/scala/epollcat/internal/ch/SocketHelpers.scala Lines 50 to 53 in 1ac7627
But perhaps I should rename |
Good question. It is a bit of a hack, and I am afraid we have to keep it; otherwise our tests would break. I wrote a detailed explanation in Cats Effect. /**
* @param timeout
* the maximum duration for which to block. ''However'', if `timeout == Inf` and there are
* no remaining events to poll for, this method should return `false` immediately. This is
* unfortunate but necessary so that this `ExecutionContext` can yield to the Scala Native
* global `ExecutionContext` which is currently hard-coded into every test framework,
* including JUnit, MUnit, and specs2.
*
* @return
* whether poll should be called again (i.e., there are more events to be polled)
*/
protected def poll(timeout: Duration): Boolean |
Everything is "art of the possible'. Thank you for the explanation and for leaving something for me to alert on. Perhaps a one line comment in the file referencing the longer explanation? Aren't "Infinity - 1" or even a minute or 10 going to cause the same problem? |
Adding a comment is a good idea. I actually forgot about this myself and tried to remove this "weird" case in 902b939. Then I reverted in 8e07a9b. So ... yeah :)
They are not. If a non-infinite value is passed to When Cats Effect has no scheduled timers, it will pass an infinite timeout to The problem occurs when Cats Effect has no scheduled timers (so it passes an infinite timeout) and there are no outstanding I/O events to poll for. In essence, the user application has finished: there is nothing left to do. Specifically, if we are running in a test suite, the test has completed. Thus it is imperative that we immediately yield control back to the test runner, which is running on the Scala Native global |
Clever! Thank you for patiently explaining the contract. |
Not at all, I appreciate your detailed tracing of the implementation. Every line of code that you question let's me sleep better at night knowing that at least one person other than myself has stared at it! |
I can now confirm that this PR runs multiple tests on macOS using M1 hardware. I think this PR is ready for merge. I can continue to trace code, but I think I think the step after that is for me to see if I can break it by doing my IPv6 |
.github/workflows/ci.yml
Outdated
@@ -27,10 +27,13 @@ jobs: | |||
name: Build and Test | |||
strategy: | |||
matrix: | |||
os: [ubuntu-latest] | |||
os: [ubuntu-latest, macos-latest] |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A matter of personal style:
I have been hosed a number of times by using foo-latest
rather than, say ubuntu-20.04
.
foo-latest
is good for rapid prototyping but fragile going beyond that .
GitHub can and does change the -latest
. Things can go
bump in the night where the shrapnel leaves little direct clue.
Yes, this introduces the technical debt of having to manually
change all the ubuntu-18.04
to ubuntu-20.04
and to
figure out what the desired -mumble
is.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, absolutely agree. Let's get these pinned down!
@@ -9,6 +9,9 @@ ThisBuild / tlSonatypeUseLegacyHost := false | |||
ThisBuild / crossScalaVersions := Seq("3.1.3", "2.12.16", "2.13.8") | |||
|
|||
ThisBuild / githubWorkflowJavaVersions := Seq(JavaSpec.temurin("17")) | |||
ThisBuild / githubWorkflowOSes += "macos-latest" | |||
ThisBuild / githubWorkflowBuildMatrixExclusions += | |||
MatrixExclude(Map("os" -> "macos-latest", "project" -> "rootJVM")) | |||
|
|||
val catsEffectVersion = "3.4-519e5ce-SNAPSHOT" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Are catsEffectVerson
and the line below it still up to date.
There have been a couple of version
up/down(?) grades
recently. Good reasons to get this PR in.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You are right, they have fallen out of date. Fortunately CI runs on a merge commit so it will use the latest from main
.
.socket(posix.sys.socket.AF_INET, posix.sys.socket.SOCK_STREAM | SOCK_NONBLOCK, 0) | ||
if (fd == -1) | ||
throw new RuntimeException(s"socket: ${errno.errno}") | ||
val fd = SocketHelpers.mkNonBlocking() |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This will help my IPv6 work, thank you.
I pinned the OS versions, and expanded the matrix to cover ubuntu 20.04 and 22.04 and macOS 11 and 12. This should have us well covered. |
@extern | ||
@nowarn | ||
private[ch] object socket { | ||
final val SOCK_NONBLOCK = 2048 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps one line to state which operating systems use
this magic value, Linux, macOS, FreeBSD?
Somebody is going to port to an OS which does not.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, even I will forget and make a mistake.
I noted that SOCK_NONBLOCK
and also accept4
are supported on Linux and FreeBSD, however we are currently only using them on Linux. Because we do not have a FreeBSD environment in CI I am hesitant to add special conditions for it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
re: FreeBSD understood & concur.
@@ -43,7 +42,7 @@ private[unsafe] final class EpollExecutorScheduler private ( | |||
val noCallbacks = callbacks.isEmpty() | |||
|
|||
if ((timeoutIsInfinite || timeout == Duration.Zero) && noCallbacks) | |||
false // nothing to do here | |||
false // nothing to do here. refer to scaladoc on PollingExecutorScheduler#poll |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you. Given the "refer to", even I could probably find that
(and will chase it down for my own edification).
@extern | ||
@nowarn | ||
private[unsafe] object event { | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the provenance of these magic numbers?
Are they macOS only or expected to work on any *BSD,
say FreeBSD?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are from here:
https://opensource.apple.com/source/xnu/xnu-7195.81.3/bsd/sys/event.h.auto.html
I do not know of their support beyond macOS. epollcat does not officially support FreeBSD or other BSDs. Lacking both a local machine and CI environment for it makes it difficult to support at this time. I'm happy to wait for a request.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I suggest adding exactly that URL and a "Your mileage may vary on FreeBSD and others".
We have concurred before about "one thing done well" rather than "a skillion things done wrong".
I know the pain about doing primary development only on CI machines. I give my SN PRs
on Windows two attempts, one to generate to smoke, and, if only a little, one to see if I can hack a fix"
Meanwhile, I utter words which, sad to admit, increase my karmic debt.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the chuckle 😁 I added the URL in e24bd4a.
|
||
callbacks(fd.toLong) = cb | ||
|
||
() => { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The comment is well placed. It anticipates & answers
the question of a person tracing the code.
Can it be expanded just a bit? Are you saying that
at that point the fd has been closed (by somebody), and
that notified kexec to un-register it.
Or are you saying that callbacks.remove()
will close the fd?
I am now tracing callbacks
. It looks like a Set and it would
be mildly surprising for a Set.remove to close/transform the thing being removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah yes I see it is confusing as-written. It is the former: we do not need to anything special to unregister the fd with the kqueue, because it will be unregistered automatically when the fd is closed by its owner.
The purpose of callbacks
is to prevent them from being garbage-collected while the kqueue
holds the pointer, which the GC doesn't see. I will add a comment to clarify that as well.
There is a line in
The
|
LGTM. Job well done All I could find were low level concerns & slight improvements. Thank At this point, I think wall clock time of experience is needed to detect remaining |
Oh interesting, I actually did not know this was possible at all! So continuing this line: if we were to register the callback to be notified for |
I think what's confusing to me about subscribing to |
You are absolutely right! My mistake. |
Lee, thank you so much for your detailed review. I feel very lucky to have your wisdom, experience, and humor on the team. I learned a lot, and will continue to do so! :) |
Arman, thank you for the encouragement and kind words. I am enjoying the collaboration and am energized by both |
Closes #2.