
epoll: add epoll-based pollq implementation #264

Merged: 1 commit merged into nanomsg:master on Mar 15, 2018

Conversation

liamstask (Contributor)

fixes #33

This is ultimately pretty similar to the kqueue PR; the main differences are:

  • The EPOLLONESHOT flag lets us avoid an explicit call to disable events (as kqueue requires), since events are disabled as soon as they are reported via epoll_wait() (see the sketch below).
  • Using the pipe fds to wake the poll thread in destroy appears to be necessary, as just closing the epfd doesn't work.

Other than that, I think most of the integration effort for the kqueue PR made this one substantially more straightforward :)
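
For readers unfamiliar with the flag, here is a minimal sketch of the oneshot re-arm pattern described above (illustrative names, not the PR's actual code; it assumes the fd was already registered with EPOLL_CTL_ADD):

#include <stdint.h>
#include <sys/epoll.h>

/*
 * Re-arm an fd whose previous event was delivered (and therefore
 * auto-disabled) by epoll_wait().  With EPOLLONESHOT there is no separate
 * "disable" step: the kernel disables the fd as soon as it reports it.
 */
static int
pollq_arm(int epfd, int fd, uint32_t events, void *node)
{
	struct epoll_event ev;

	ev.events   = events | EPOLLONESHOT;
	ev.data.ptr = node; /* per-fd state handed back by epoll_wait() */
	return (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev));
}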

gdamore (Contributor) commented Mar 5, 2018

Wow, I had planned to do this tomorrow actually. I will review and test your changes at least then! :-)

codecov bot commented Mar 5, 2018

Codecov Report

Merging #264 into master will increase coverage by 0.16%.
The diff coverage is 88.65%.


@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
+ Coverage   78.77%   78.94%   +0.16%     
==========================================
  Files         157      157              
  Lines       13024    12993      -31     
==========================================
- Hits        10260    10257       -3     
+ Misses       2764     2736      -28
Impacted Files                            Coverage Δ
src/platform/posix/posix_pollq_epoll.c    88.65% <88.65%> (ø)
src/protocol/bus0/bus.c                   86.84% <0%> (-0.66%) ⬇️
src/platform/posix/posix_pollq_poll.c
src/nng.c                                 76.32% <0%> (+0.22%) ⬆️
src/transport/ipc/ipc.c                   74.43% <0%> (+0.28%) ⬆️
src/platform/posix/posix_file.c           63.97% <0%> (+0.73%) ⬆️
tests/convey.c                            88.99% <0%> (+0.97%) ⬆️
src/core/pipe.c                           96.24% <0%> (+1.5%) ⬆️
src/transport/tcp/tcp.c                   80.1% <0%> (+1.8%) ⬆️
... and 1 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e926c65...90c81cb.

@@ -104,7 +104,11 @@ if (NNG_PLATFORM_POSIX)
)
endif()

if (NNG_HAVE_KQUEUE)
if (NNG_HAVE_EPOLL)
gdamore (Contributor)

I actually think kqueue should have precedence. Some systems are emulating epoll, and I won't be surprised if BSDs do that too to facilitate portability, but kqueue is definitely better. epoll() itself is rather defective under the hood, and we would prefer not to use it unless no other alternative is viable.

liamstask (Author)

done in 7d8e166

nni_cv_init(&pq->cv, &pq->mtx);

if (((rv = nni_thr_init(&pq->thr, nni_posix_poll_thr, pq)) != 0) ||
((rv = nni_posix_pollq_add_wake_pipe(pq)) != 0)) {
gdamore (Contributor)

Thinking about this: rather than using the clunky wakewfd/wakerfd pair, why not just use eventfd() directly? We know it's here, because we're on Linux. (We shouldn't use epoll() anywhere else.)
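
A rough sketch of what that could look like, assuming a Linux-only build (names such as wake_fd and pollq_wake are illustrative, not taken from the PR):

#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int wake_fd = -1; /* registered with the epoll set at init time */

static int
pollq_wake_init(int epfd)
{
	struct epoll_event ev;

	if ((wake_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC)) < 0) {
		return (-1);
	}
	ev.events   = EPOLLIN;
	ev.data.ptr = NULL; /* NULL tags the wake event in the poll loop */
	return (epoll_ctl(epfd, EPOLL_CTL_ADD, wake_fd, &ev));
}

static void
pollq_wake(void)
{
	uint64_t one = 1;

	/* Writing to the eventfd makes epoll_wait() return in the poll thread. */
	(void) write(wake_fd, &one, sizeof(one));
}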

liamstask (Author)

👍 implemented in 7d8e166


node->events &= ~events;
if (node->events == 0) {
ev.events = 0;
gdamore (Contributor)

I wonder if we actually need to make the system call in this case. We have the ONESHOT flag, so we shouldn't keep getting called here....

liamstask (Author)

The only thing I can think of is the case in which an event has been armed and disarm is called before any events have been delivered. That could result in the next event being delivered despite disarm having been called, since we wouldn't have actively disabled it.

If callers can deal with that, then agreed, we should be able to avoid the call to epoll_ctl() here.

Existing tests pass with your proposed change - let me know which you prefer.
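
For context, a sketch of the disarm path being discussed (assumed shape, not the exact PR code); the epoll_ctl() call matters only for the window described above, where an event is armed and then disarmed before epoll_wait() has reported it:

#include <sys/epoll.h>

/*
 * Keep the fd registered but stop delivering events.  An fd that epoll_wait()
 * has already reported is disabled anyway by EPOLLONESHOT, so this call only
 * guards against a pending, not-yet-delivered event.
 */
static int
pollq_disarm(int epfd, int fd, void *node)
{
	struct epoll_event ev;

	ev.events   = 0;
	ev.data.ptr = node;
	return (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev));
}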

gdamore (Contributor) commented Mar 6, 2018

The http server test faulted on Linux. I'm not quite ready to sweep this under the table, since we've never seen it anywhere else, and the code you changed is quite possibly relevant. I'll look and see if I can turn up anything, but it might be a day or two.

liamstask (Author)

Makes sense. I pushed #270 to help capture more info for issues like this in the future.

Interestingly, I cannot reproduce any failures when running the test binary directly, i.e. ./tests/httpserver from the build dir.

However, I periodically see failures in the httpserver test when running via ctest (which is invoked by ninja test, which is what CI uses), returning NNG_EADDRINUSE from nng_http_server_start(), e.g.:

root@3321c08eb0fc:/nng/linux-build# env CTEST_OUTPUT_ON_FAILURE=1 ctest -R httpserver
Test project /nng/linux-build
    Start 8: httpserver
1/1 Test #8: httpserver .......................***Failed    0.54 sec

If I wait long enough between test runs, I don't see this failure, as expected. Maybe SO_REUSEADDR/SO_REUSEPORT could be of assistance (see the sketch at the end of this comment)?

However, this doesn't appear to correspond to the case that occurred on that Travis build, since the printout in that case was ***Exception: Other rather than the ***Failed I've seen above.

I haven't been able to turn up anything conclusive on Exception: Other/OTHER_FAULT, but the cases in which I've seen it mentioned are all related to dynamic linking issues: an exe not being able to find a .so, or something along those lines.

Not sure how that could be the case here unless there's something strange going on with Travis :/

So if some form of #270 gets merged, perhaps I'll just rebase on top of that and rebuild several more times to see if it reproduces.
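
For reference, the plain-socket shape of the SO_REUSEADDR idea mentioned above (illustrative only, not nng code); it lets a restarted listener bind while the previous socket still lingers in TIME_WAIT:

#include <sys/socket.h>

/*
 * Enable address reuse on a listening socket before bind(), so a quick
 * restart does not fail with EADDRINUSE while the old socket is in TIME_WAIT.
 */
static int
enable_reuseaddr(int listen_fd)
{
	int on = 1;

	return (setsockopt(
	    listen_fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)));
}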

gdamore (Contributor) commented Mar 7, 2018

I've added that change... we'll see if it helps with diagnosis.

liamstask (Author)

Hm, seems to have passed this time :/

From a scan of the log output, this passing run looks pretty much the same as this failing run; two differences I noted:

  • under the Build system information heading, the travis-build id is different (not sure what that corresponds to)
  • in the passing build it prints "Network availability confirmed." which is not printed in the failing build

Probably not enough to blame it on Travis yet. Maybe once you get a chance to test it locally, we can go from there.

liamstask force-pushed the epoll branch 2 times, most recently from fc52e8a to b874da2 on March 10, 2018 04:04
liamstask (Author)

Hm, 2 more passes (#882 and #883 on Travis) after rebasing on master a couple of times.

Given that the original error reported does not correspond to a test assertion failure or a segfault, but to some "OTHER_FAULT" exception, I'm starting to suspect something environmental on that run.

liamstask (Author) commented Mar 13, 2018

#908 is an interesting failure: when I run ./tests/aio locally, I don't see any failures, but I also notice that it does not add any nodes to the poller at all, suggesting that it might not be caused by the epoll changes. Any other ideas about how they might be related?

Edit: ah, I see it failed on this build on master as well.

liamstask (Author)

@gdamore ping - thoughts on this PR? I haven't changed any code since responding to your initial feedback; I've just been rebasing periodically since then to trigger more runs.

gdamore (Contributor) commented Mar 14, 2018 via email

gdamore (Contributor) commented Mar 15, 2018

Looking over this, and doing some testing, things look reasonably good.

I saw some problems, but I think I've verified they are due to my Ubuntu setup (combined with the way we test for invalid URLs).

The failure in 908 is due to a race condition and bogus scheduling assumption -- it's bad test code, and I believe I've fixed that now. (The test app was making the bad assumption; what was happening was that the receive callback was executed before the send callback. Normally you wouldn't expect this, but there's nothing to say that a scheduler can't do that if the send/recv occurs fast enough.)

I think it's likely that the server fault in the earlier bug was indeed collisions in the address use, so I'm discounting that now.

I'll do some performance tests, to make sure this doesn't regress, then assuming it's good, I'll merge it.

gdamore merged commit 91866ed into nanomsg:master on Mar 15, 2018
liamstask deleted the epoll branch on March 15, 2018 22:44
Successfully merging this pull request may close these issues: Add epoll() support on Linux (#33)