Segfault in req0_ctx_cancel_send() #1702
Comments
I am pretty sure that the solution described under Additional context below (removing those two lines) works for me, as aios are always allocated for new messages and never reused (this works well for a dynamic high-level language like R, at least in the current implementation). It also works more generally to preface those lines with a NULL check on ctx->req_msg. Note: I've also amended the triggering condition in the report to make clear that the receive is done with a timeout that expires before attempting to free the aio.
I need to look at this -- the code was written on the assumption that the message would always be valid. So if that's not happening, then my assumptions are incorrect.
Ok, I think I understand. You're calling receive before the send has even completed. That's not something I was thinking about, and your workaround probably works, but I'm not entirely sure I want to depend on it yet. It requires some bad luck to hit this, I think -- or deliberately arranging to have your send operation "stuck" (because you don't have a connection to send it on). I want to think through this a bit.
So, thinking about this a bit more... there is definitely a semantic break here. The send operation has not executed, so your attempt to call receive on it is actually supposed to fail with NNG_ESTATE; but as the request was submitted asynchronously, we kind of didn't do the work to track and report this properly. (The normal flow is: on a given context you issue a send, then only issue a receive once the send completes, because if the send fails for any reason, the receive cannot succeed either.) That's not a big deal. A bigger concern, however, is that we also provide a guaranteed semantic that if send fails, we will not discard the user's message, so that they can try again with the same message (or perhaps look at the message to understand what didn't get sent). But since we canceled the receive (which shouldn't even exist in this state, except for the break in the state assumptions), we also clean up the whole context, which is bad juju. I'm torn between trying to remediate this by recovering the req message in this case, and just discarding the message and letting the user deal with the fact that it's gone. (If the user didn't try to use these two requests almost in parallel, it wouldn't have been an issue.)
(For the record, the normal way to drive REQ state is to use just a single aio: do the send, then submit the receive from the send's completion callback.)
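A rough sketch of that single-aio pattern (the req_state struct and req_cb callback are illustrative names, not part of nng):

```c
#include <stdbool.h>
#include <nng/nng.h>

// Single-aio REQ pattern: issue the send, and submit the receive only
// from the send's completion callback.
typedef struct {
	nng_ctx  ctx;
	nng_aio *aio;
	bool     sent; // false while the aio carries the send
} req_state;

static void
req_cb(void *arg)
{
	req_state *st = arg;

	if (nng_aio_result(st->aio) != 0) {
		if (!st->sent) {
			// A failed send leaves ownership of the message with
			// the caller; reclaim it so it is not leaked.
			nng_msg *m = nng_aio_get_msg(st->aio);
			if (m != NULL) {
				nng_msg_free(m);
			}
		}
		return;
	}
	if (!st->sent) {
		st->sent = true;
		nng_ctx_recv(st->ctx, st->aio); // send done; now receive
	} else {
		nng_msg *reply = nng_aio_get_msg(st->aio);
		// ... consume the reply ...
		nng_msg_free(reply);
	}
}
```

The exchange would be kicked off with nng_aio_set_msg() followed by nng_ctx_send() on the same aio.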
The above fix (PR) will prevent the segfault. But you're responsible for reclaiming the message from the failed send aio in this case -- failure to do so will leak the message. This is the semantic around message ownership: we only take ownership of the message once we have transmitted it -- an asynchronous transmit must report back to the caller with an indication of success (or of the request having been queued) before we will accept ownership.
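Concretely, a caller tearing down a send aio under that rule would reclaim the message first; a minimal sketch, assuming nothing about the binding's actual code:

```c
// Before freeing a send aio, make sure the operation has ended, then
// check whether the send actually succeeded; on failure the message
// still belongs to the caller and must be reclaimed or freed.
nng_aio_stop(aio); // cancels any outstanding operation and waits
if (nng_aio_result(aio) != 0) {
	nng_msg *msg = nng_aio_get_msg(aio);
	if (msg != NULL) {
		nng_msg_free(msg); // or keep it to retry the send
	}
}
nng_aio_free(aio);
```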
Describe the bug
When developing the R binding, I came across this and reported it (on Discord) a few months ago. I have now traced it with the help of valgrind.
To Reproduce
On a req socket, send a message using nng_send_aio() on one aio and receive using nng_recv_aio() on the other, with an expiry timeout. Once the timeout has expired, call nng_aio_free() on the send aio. The following segfault happens.
==213611== Invalid write of size 8
==213611== at 0x4A8B0E84: nni_msg_header_clear (message.c:623)
==213611== by 0x4A8C4C3C: req0_ctx_cancel_send (req.c:627)
==213611== by 0x4A8AC7B8: nni_aio_fini (aio.c:138)
==213611== by 0x4A8AC7EA: nni_aio_free (aio.c:161)
==213611== by 0x4A8AC7EA: nni_aio_free (aio.c:158)
==213611== by 0x4A89A194: saio_finalizer (aio.c:304)
==213611== by 0x4A89A194: saio_finalizer (aio.c:299)
==213611== by 0x49EF5DC: R_RunWeakRefFinalizer (in /usr/lib/R/lib/libR.so)
==213611== by 0x49EF83B: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x49F2A54: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x4993264: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x49ADCFF: Rf_eval (in /usr/lib/R/lib/libR.so)
==213611== by 0x49AFD95: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x49B0BC4: Rf_applyClosure (in /usr/lib/R/lib/libR.so)
==213611== Address 0x40 is not stack'd, malloc'd or (recently) free'd
==213611==
*** caught segfault ***
address 0x40, cause 'memory not mapped'
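For illustration, a minimal sketch of the scenario above (an assumption-laden reconstruction, not the actual reproducer: the send is left pending by never connecting a peer, and error checks are omitted):

```c
#include <nng/nng.h>
#include <nng/protocol/reqrep0/req.h>

int
main(void)
{
	nng_socket req;
	nng_aio   *saio, *raio;
	nng_msg   *msg;

	nng_req0_open(&req); // never dialed, so the send cannot complete
	nng_aio_alloc(&saio, NULL, NULL);
	nng_aio_alloc(&raio, NULL, NULL);
	nng_msg_alloc(&msg, 0);

	nng_aio_set_msg(saio, msg);
	nng_send_aio(req, saio); // send is now "stuck": no connection to use

	nng_aio_set_timeout(raio, 10); // short expiry for the receive
	nng_recv_aio(req, raio);
	nng_aio_wait(raio); // receive times out (NNG_ETIMEDOUT)

	nng_aio_free(saio); // pre-fix, this crashed in req0_ctx_cancel_send()
	nng_aio_free(raio);
	nng_close(req);
	return 0;
}
```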
Environment Details
Additional context
As it is clear that nni_msg_header_clear() was the culprit, I have removed these 2 lines in req0_ctx_cancel_send() (nng/src/sp/protocol/reqrep0/req.c, lines 626 to 627 in a54820f), which eliminates this segfault.
I do not see any impact on my unit tests, and valgrind does not show any memory leaks. Also, in real-world examples where the connection is lost, the req resend still happens.
So I am wondering whether it is generally safe to take those lines out, or in what context they would be needed (the comment says to restore the message back to the aio?).
From the binding's perspective, the aios are individually allocated per message and hence can be expected to be short-lived.
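For reference, the more general guard mentioned earlier (prefacing those lines with a NULL check) would look roughly like this -- a sketch, assuming the two lines are the nni_aio_set_msg()/nni_msg_header_clear() pair implied by the valgrind trace above:

```c
// Hypothetical NULL-check variant of the two lines in
// req0_ctx_cancel_send() (req.c, around lines 626-627 in a54820f):
// skip the restore when the context no longer holds a request message.
if (ctx->req_msg != NULL) {
	// restore the message back to the aio
	nni_aio_set_msg(aio, ctx->req_msg);
	nni_msg_header_clear(ctx->req_msg);
}
```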