epoll: add epoll-based pollq implementation #264
Conversation
Wow, I had planned to do this tomorrow actually. I will review and test your changes then at the latest! :-)
Codecov Report

@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
+ Coverage   78.77%   78.94%   +0.16%
==========================================
  Files         157      157
  Lines       13024    12993      -31
==========================================
- Hits        10260    10257       -3
+ Misses       2764     2736      -28
==========================================

Continue to review the full report at Codecov.
src/CMakeLists.txt (review comment on an outdated diff)
@@ -104,7 +104,11 @@ if (NNG_PLATFORM_POSIX)
        )
    endif()

    if (NNG_HAVE_KQUEUE)
    if (NNG_HAVE_EPOLL)
I actually think kqueue should have precedence. Some systems emulate epoll, and I wouldn't be surprised if the BSDs do that too to facilitate portability, but kqueue is definitely better. epoll() itself is rather defective under the hood, and we would prefer not to use it unless no other alternative is viable.
Done in 7d8e166.
nni_cv_init(&pq->cv, &pq->mtx);

if (((rv = nni_thr_init(&pq->thr, nni_posix_poll_thr, pq)) != 0) ||
    ((rv = nni_posix_pollq_add_wake_pipe(pq)) != 0)) {
Thinking about this: rather than using the clunky wakewfd/wakerfd pair, why not just go ahead and use eventfd() directly? We know it's here, because we're on Linux. (We shouldn't use epoll() anywhere else.)
👍 implemented in 7d8e166
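For context, here is a minimal sketch of the eventfd(2) wake-up approach discussed above. This is illustrative only, not NNG's actual code; the helper names are made up, and error handling is elided.

#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

/* Register a nonblocking eventfd with the epoll set so that other
 * threads can wake the polling thread; epfd is the epoll instance. */
static int
add_wake_event(int epfd)
{
	struct epoll_event ev;
	int evfd = eventfd(0, EFD_NONBLOCK);

	ev.events  = EPOLLIN;
	ev.data.fd = evfd;
	epoll_ctl(epfd, EPOLL_CTL_ADD, evfd, &ev);
	return (evfd);
}

/* Wake the poller: any nonzero add makes the eventfd readable. */
static void
wake_poller(int evfd)
{
	uint64_t one = 1;
	write(evfd, &one, sizeof(one));
}

/* In the poll loop, drain the counter once woken, so the eventfd
 * does not remain readable. */
static void
drain_wake_event(int evfd)
{
	uint64_t cnt;
	read(evfd, &cnt, sizeof(cnt));
}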
node->events &= ~events;
if (node->events == 0) {
        ev.events = 0;
I wonder if we actually need to make the system call in this case. We have the ONESHOT flag, so we shouldn't keep getting called here....
The only thing I can think of is the case in which an event has been armed, and disarm is called before any events have been delivered. This could result in the next event being delivered despite disarm having been called, since we wouldn't have actively disabled it. If callers can deal with that, then agreed, we should be able to avoid the call to epoll_ctl() here.

Existing tests pass with your proposed change - let me know which you prefer.
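To make the trade-off concrete, a hypothetical sketch of the disarm path (not NNG's actual code): requesting no events via EPOLL_CTL_MOD disables the descriptor. If the fd was armed with EPOLLONESHOT and an event has already been delivered, the kernel has disabled it and this call is redundant; skipping the call risks one stale event arriving after disarm returns.

#include <stddef.h>
#include <sys/epoll.h>

/* Hypothetical disarm helper; epfd is the epoll instance. */
static int
disarm_fd(int epfd, int fd)
{
	struct epoll_event ev;

	ev.events   = 0;    /* request no events: disables the fd */
	ev.data.ptr = NULL;
	return (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev));
}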
The http server test faulted on Linux. I'm not quite ready to sweep this under the rug, since we've never seen it anywhere else, and the code you changed is quite possibly relevant. I'll look and see if I can turn up anything, but it might be a day or two.
Makes sense. I pushed #270 to help capture more info for issues like this in the future. Interestingly, I cannot reproduce any failures when running the test binary directly (i.e. …); however, I periodically see failures in the httpserver test when running via ctest (which is invoked via …).
If I wait long enough between test runs, I don't see this failure, as expected. Maybe SO_REUSEADDR/SO_REUSEPORT could be of assistance? However, this doesn't appear to correspond to the case that occurred on that Travis build, since the printout in that case was …. I haven't been able to turn up anything conclusive on …. Not sure how that could be the case here unless there's something strange going on with Travis :/ So if some form of #270 gets merged, perhaps I'll just rebase on top of that and rebuild several more times to see if it reproduces.
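As a sketch of the SO_REUSEADDR idea (an assumption about the failure mode, not a confirmed fix): setting the option before bind() lets a test listener rebind an address still held in TIME_WAIT by a previous run. Illustrative only; error handling elided.

#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <sys/socket.h>

static int
make_listener(uint16_t port)
{
	struct sockaddr_in sa = { 0 };
	int fd  = socket(AF_INET, SOCK_STREAM, 0);
	int one = 1;

	/* Allow rebinding an address that is still in TIME_WAIT. */
	setsockopt(fd, SOL_SOCKET, SO_REUSEADDR, &one, sizeof(one));

	sa.sin_family      = AF_INET;
	sa.sin_port        = htons(port);
	sa.sin_addr.s_addr = htonl(INADDR_ANY);
	bind(fd, (struct sockaddr *) &sa, sizeof(sa));
	listen(fd, 128);
	return (fd);
}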
I've added that change... we'll see if it helps with diagnosis.
Hm, seems to have passed this time :/ From a scan of the log output on this passing run compared to this failing run, everything looks pretty much the same; two differences I noted: …
Probably not enough to blame it on Travis yet. Maybe once you get a chance to test it locally, we can go from there.
(force-pushed from fc52e8a to b874da2)
#908 is an interesting failure - when I run …

Edit: ah, I see it failed on this build on master as well.
@gdamore ping - thoughts on this PR? I haven't changed any code since responding to your initial feedback; I've just been rebasing periodically to trigger more CI runs.
I will take some time for it tomorrow. Stay tuned.
Looking over this, and doing some testing, things look reasonably good. I saw some problems, but I think I've verified they are in my Ubuntu setup (combined with the way we test for invalid URLs).

The failure in 908 is due to a race condition and a bogus scheduling assumption -- it's bad test code, and I believe I've fixed that now. (The test app was making the bad assumption: what was happening was that the receive callback executed before the send callback. Normally you wouldn't expect this, but there's nothing to say that a scheduler can't do that if the send/recv occurs fast enough.)

I think it's likely that the server fault in the earlier bug was indeed collisions in address use, so I'm discounting that now.

I'll do some performance tests to make sure this doesn't regress; then, assuming it's good, I'll merge it.
Fixes #33.

This is ultimately pretty similar to the kqueue PR; the main differences are:
- epoll_wait() …
- epfd …
- … doesn't work

Other than that, I think most of the integration effort for the kqueue PR made this one substantially more straightforward :)
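For readers unfamiliar with the overall shape, a skeleton of an epoll-based poll loop (illustrative only; the actual implementation in this PR dispatches per-node callbacks and also services the wake eventfd):

#include <stdint.h>
#include <sys/epoll.h>

#define MAX_EVENTS 64

/* Placeholder dispatch; a real pollq invokes the callback registered
 * for the node stored in the event's data pointer. */
static void
handle_event(void *node, uint32_t events)
{
	(void) node;
	(void) events;
}

/* Poll loop skeleton: wait on the single epoll descriptor and fan
 * events out to their owners. */
static void
poll_loop(int epfd)
{
	struct epoll_event evs[MAX_EVENTS];

	for (;;) {
		int n = epoll_wait(epfd, evs, MAX_EVENTS, -1);
		for (int i = 0; i < n; i++) {
			handle_event(evs[i].data.ptr, evs[i].events);
		}
	}
}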