
epoll: add epoll-based pollq implementation #264

Merged: 1 commit merged into nanomsg:master on Mar 15, 2018

Conversation

liamstask (Contributor)

fixes #33

This is ultimately pretty similar to the kqueue PR; the main differences are:

  • The EPOLLONESHOT flag lets us avoid an explicit call to disable events (as kqueue requires), since events are disabled as soon as they are reported via epoll_wait() (see the sketch below).
  • Using the pipe fds to wake the poll thread in destroy appears to be necessary, as just closing the epfd doesn't work.

Other than that, I think most of the integration effort for the kqueue PR made this one substantially more straightforward :)
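
For readers unfamiliar with the flag, here is a minimal sketch of the oneshot re-arm pattern described above (illustrative names, not the PR's actual code; it assumes the fd was already registered with EPOLL_CTL_ADD):

#include <stdint.h>
#include <sys/epoll.h>

/*
 * Re-arm an fd whose previous event was delivered (and therefore
 * auto-disabled) by epoll_wait().  With EPOLLONESHOT there is no separate
 * "disable" step: the kernel disables the fd as soon as it reports it.
 */
static int
pollq_arm(int epfd, int fd, uint32_t events, void *node)
{
	struct epoll_event ev;

	ev.events   = events | EPOLLONESHOT;
	ev.data.ptr = node; /* per-fd state handed back by epoll_wait() */
	return (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev));
}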

gdamore (Contributor) commented Mar 5, 2018

Wow, I had planned to do this tomorrow actually. I will review and test your changes at least then! :-)

codecov bot commented Mar 5, 2018

Codecov Report

Merging #264 into master will increase coverage by 0.16%.
The diff coverage is 88.65%.


@@            Coverage Diff             @@
##           master     #264      +/-   ##
==========================================
+ Coverage   78.77%   78.94%   +0.16%     
==========================================
  Files         157      157              
  Lines       13024    12993      -31     
==========================================
- Hits        10260    10257       -3     
+ Misses       2764     2736      -28
Impacted Files                            Coverage Δ
src/platform/posix/posix_pollq_epoll.c    88.65% <88.65%> (ø)
src/protocol/bus0/bus.c                   86.84% <0%> (-0.66%) ⬇️
src/platform/posix/posix_pollq_poll.c
src/nng.c                                 76.32% <0%> (+0.22%) ⬆️
src/transport/ipc/ipc.c                   74.43% <0%> (+0.28%) ⬆️
src/platform/posix/posix_file.c           63.97% <0%> (+0.73%) ⬆️
tests/convey.c                            88.99% <0%> (+0.97%) ⬆️
src/core/pipe.c                           96.24% <0%> (+1.5%) ⬆️
src/transport/tcp/tcp.c                   80.1% <0%> (+1.8%) ⬆️
... and 1 more

Continue to review the full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update e926c65...90c81cb.

@@ -104,7 +104,11 @@ if (NNG_PLATFORM_POSIX)
)
endif()

if (NNG_HAVE_KQUEUE)
if (NNG_HAVE_EPOLL)
gdamore (Contributor)

I actually think kqueue should have precedence. Some systems are emulating epoll, and I won't be surprised if BSDs do that too to facilitate portability, but kqueue is definitely better. epoll() itself is rather defective under the hood, and we would prefer not to use it unless no other alternative is viable.

liamstask (Author)

done in 7d8e166

nni_cv_init(&pq->cv, &pq->mtx);

if (((rv = nni_thr_init(&pq->thr, nni_posix_poll_thr, pq)) != 0) ||
((rv = nni_posix_pollq_add_wake_pipe(pq)) != 0)) {
gdamore (Contributor)

Thinking about this: rather than using the clunky wakewfd/wakerfd pair, why not just use eventfd() directly? We know it's here, because we're on Linux. (We shouldn't use epoll() anywhere else.)
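
A rough sketch of what that could look like, assuming a Linux-only build (names such as wake_fd and pollq_wake are illustrative, not taken from the PR):

#include <stdint.h>
#include <sys/epoll.h>
#include <sys/eventfd.h>
#include <unistd.h>

static int wake_fd = -1; /* registered with the epoll set at init time */

static int
pollq_wake_init(int epfd)
{
	struct epoll_event ev;

	if ((wake_fd = eventfd(0, EFD_NONBLOCK | EFD_CLOEXEC)) < 0) {
		return (-1);
	}
	ev.events   = EPOLLIN;
	ev.data.ptr = NULL; /* NULL tags the wake event in the poll loop */
	return (epoll_ctl(epfd, EPOLL_CTL_ADD, wake_fd, &ev));
}

static void
pollq_wake(void)
{
	uint64_t one = 1;

	/* Writing to the eventfd makes epoll_wait() return in the poll thread. */
	(void) write(wake_fd, &one, sizeof(one));
}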

liamstask (Author)

👍 implemented in 7d8e166


node->events &= ~events;
if (node->events == 0) {
ev.events = 0;
gdamore (Contributor)

I wonder if we actually need to make the system call in this case. We have the ONESHOT flag, so we shouldn't keep getting called here....

liamstask (Author)

The only thing I can think of is the case in which an event has been armed and disarm is called before any events have been delivered. That could result in the next event being delivered despite disarm having been called, since we wouldn't have actively disabled it.

If callers can deal with that, then agreed, we should be able to avoid the call to epoll_ctl() here.

Existing tests pass with your proposed change - let me know which you prefer.
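
For context, a sketch of the disarm path being discussed (assumed shape, not the exact PR code); the epoll_ctl() call matters only for the window described above, where an event is armed and then disarmed before epoll_wait() has reported it:

#include <sys/epoll.h>

/*
 * Keep the fd registered but stop delivering events.  An fd that epoll_wait()
 * has already reported is disabled anyway by EPOLLONESHOT, so this call only
 * guards against a pending, not-yet-delivered event.
 */
static int
pollq_disarm(int epfd, int fd, void *node)
{
	struct epoll_event ev;

	ev.events   = 0;
	ev.data.ptr = node;
	return (epoll_ctl(epfd, EPOLL_CTL_MOD, fd, &ev));
}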

gdamore (Contributor) commented Mar 6, 2018

The http server test faulted on Linux. I'm not quite ready to sweep this under the table, since we've never seen it anywhere else, and the code you changed is quite possibly relevant. I'll look and see if I can turn up anything, but it might be a day or two.

liamstask (Author)

Makes sense. I pushed #270 to help capture more info for issues like this in the future.

Interestingly, I cannot reproduce any failures when running the test binary directly, i.e. ./tests/httpserver from the build dir.

However, I periodically see failures in the httpserver test when running via ctest (which is invoked by ninja test, which is what CI uses), returning NNG_EADDRINUSE from nng_http_server_start(), e.g.:

root@3321c08eb0fc:/nng/linux-build# env CTEST_OUTPUT_ON_FAILURE=1 ctest -R httpserver
Test project /nng/linux-build
    Start 8: httpserver
1/1 Test #8: httpserver .......................***Failed    0.54 sec

If I wait long enough between test runs, I don't see this failure, as expected. Maybe SO_REUSEADDR/SO_REUSEPORT could be of assistance (see the sketch at the end of this comment)?

However, this doesn't appear to correspond to the case that occurred on that Travis build, since the printout in that case was ***Exception: Other rather than the ***Failed I've seen above.

I haven't been able to turn up anything conclusive on Exception: Other/OTHER_FAULT, but the cases in which I've seen it mentioned are all related to dynamic linking issues: an exe not being able to find a .so, or something along those lines.

Not sure how that could be the case here unless there's something strange going on with Travis :/

So if some form of #270 gets merged, perhaps I'll just rebase on top of that and rebuild several more times to see if it reproduces.
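
For reference, the plain-socket shape of the SO_REUSEADDR idea mentioned above (illustrative only, not nng code); it lets a restarted listener bind while the previous socket still lingers in TIME_WAIT:

#include <sys/socket.h>

/*
 * Enable address reuse on a listening socket before bind(), so a quick
 * restart does not fail with EADDRINUSE while the old socket is in TIME_WAIT.
 */
static int
enable_reuseaddr(int listen_fd)
{
	int on = 1;

	return (setsockopt(
	    listen_fd, SOL_SOCKET, SO_REUSEADDR, &on, sizeof(on)));
}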

gdamore (Contributor) commented Mar 7, 2018

I've added that change... we'll see if it helps with diagnosis.

liamstask (Author)

Hm, seems to have passed this time :/

From a scan of the log output, this passing run looks pretty much the same as this failing run; two differences I noted:

  • under the Build system information heading, the travis-build id is different (not sure what that corresponds to)
  • in the passing build it prints "Network availability confirmed." which is not printed in the failing build

Probably not enough to blame it on Travis yet. Maybe once you get a chance to test it locally, we can go from there.

liamstask force-pushed the epoll branch 2 times, most recently from fc52e8a to b874da2 on March 10, 2018 04:04
liamstask (Author)

Hm, 2 more passes (#882 and #883 on Travis) after rebasing on master a couple of times.

Given that the original error reported does not correspond to a test assertion failure or a segfault, but to some "OTHER_FAULT" exception, I'm starting to suspect something environmental on that run.

liamstask (Author) commented Mar 13, 2018

#908 is an interesting failure: when I run ./tests/aio locally, I don't see any failures, but I also notice that it does not add any nodes to the poller at all, suggesting that it might not be caused by the epoll changes. Any other ideas about how they might be related?

Edit: ah, I see it failed on this build on master as well.

liamstask (Author)

@gdamore ping - thoughts on this PR? I haven't changed any code since responding to your initial feedback; I've just been rebasing periodically since then to trigger more runs.

gdamore (Contributor) commented Mar 14, 2018 via email

gdamore (Contributor) commented Mar 15, 2018

Looking over this, and doing some testing, things look reasonably good.

I saw some problems, but I think I've verified they are due to my Ubuntu setup (combined with the way we test for invalid URLs).

The failure in 908 is due to a race condition and bogus scheduling assumption -- it's bad test code, and I believe I've fixed that now. (The test app was making the bad assumption; what was happening was that the receive callback was executed before the send callback. Normally you wouldn't expect this, but there's nothing to say that a scheduler can't do that if the send/recv occurs fast enough.)

I think it's likely that the server fault in the earlier bug was indeed collisions in the address use, so I'm discounting that now.

I'll do some performance tests, to make sure this doesn't regress, then assuming it's good, I'll merge it.

gdamore merged commit 91866ed into nanomsg:master on Mar 15, 2018
liamstask deleted the epoll branch on March 15, 2018 22:44
Successfully merging this pull request may close these issues: Add epoll() support on Linux (#33)