Segfault in req0_ctx_cancel_send() #1702

Closed
shikokuchuo opened this issue Nov 18, 2023 · 6 comments

@shikokuchuo
Contributor

shikokuchuo commented Nov 18, 2023

Describe the bug
While developing the R binding, I came across this and reported it (on Discord) a few months ago. I have now traced it with the help of valgrind.

To Reproduce

  1. Start a req socket
  2. Start a listener on the socket (no connection)
  3. Allocate two aios; send a message on one using nng_send_aio() and receive on the other using nng_recv_aio() with an expiry timeout.
  4. Wait until the recv aio has expired.
  5. Call nng_aio_free() on the send aio (a minimal sketch of these steps follows).
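A minimal C sketch of these steps against the public NNG API (the URL, timeout value, and variable names are illustrative, and error checks are omitted for brevity):

#include <nng/nng.h>
#include <nng/protocol/reqrep0/req.h>

nng_socket   s;
nng_listener l;
nng_aio     *saio, *raio;
nng_msg     *msg;

nng_req0_open(&s);
nng_listen(s, "tcp://127.0.0.1:5555", &l, 0); /* listener only, nothing ever connects */

nng_aio_alloc(&saio, NULL, NULL);
nng_aio_alloc(&raio, NULL, NULL);

nng_msg_alloc(&msg, 0);
nng_aio_set_msg(saio, msg);
nng_send_aio(s, saio);            /* send stays pending: no pipe to send it on */

nng_aio_set_timeout(raio, 100);   /* short expiry so the recv times out */
nng_recv_aio(s, raio);
nng_aio_wait(raio);               /* wait until the recv aio has expired */

nng_aio_free(saio);               /* segfault reported here, in req0_ctx_cancel_send() */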

The following segfault happens.

==213611== Invalid write of size 8
==213611== at 0x4A8B0E84: nni_msg_header_clear (message.c:623)
==213611== by 0x4A8C4C3C: req0_ctx_cancel_send (req.c:627)
==213611== by 0x4A8AC7B8: nni_aio_fini (aio.c:138)
==213611== by 0x4A8AC7EA: nni_aio_free (aio.c:161)
==213611== by 0x4A8AC7EA: nni_aio_free (aio.c:158)
==213611== by 0x4A89A194: saio_finalizer (aio.c:304)
==213611== by 0x4A89A194: saio_finalizer (aio.c:299)
==213611== by 0x49EF5DC: R_RunWeakRefFinalizer (in /usr/lib/R/lib/libR.so)
==213611== by 0x49EF83B: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x49F2A54: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x4993264: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x49ADCFF: Rf_eval (in /usr/lib/R/lib/libR.so)
==213611== by 0x49AFD95: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x49B0BC4: Rf_applyClosure (in /usr/lib/R/lib/libR.so)
==213611== Address 0x40 is not stack'd, malloc'd or (recently) free'd
==213611==

*** caught segfault ***
address 0x40, cause 'memory not mapped'

Environment Details

  • NNG 1.6.0pre (a54820f)
  • Ubuntu 22.04
  • gcc 11.4.0
  • Static library

Additional context

As it is clear that nni_msg_header_clear() is the culprit, I have removed these two lines in req0_ctx_cancel_send():

nni_aio_set_msg(aio, ctx->req_msg);
nni_msg_header_clear(ctx->req_msg);

which eliminates this segfault.

I do not see any impact on my unit tests, and valgrind does not show any memory leaks. Also, in real-world examples where the connection is lost, the req resend still happens.

So I am wondering whether it is generally safe to take those out, or in what context they would be needed (the comment says to restore the message back to the aio?).

From a binding perspective, the aios are individually allocated for messages and hence can be expected to be short-lived.

@shikokuchuo
Contributor Author

I am pretty sure that the above solution works for me, as aios are always allocated for new messages and never reused (which works well for a dynamic, high-level language like R, at least in the current implementation).

It also works more generally to preface those lines with if (ctx->req_msg != NULL), although I'll leave it to you to see if that is the right place for this.
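For reference, a sketch of what that guard might look like inside req0_ctx_cancel_send() (surrounding code elided; whether this is the right place is exactly the open question):

if (ctx->req_msg != NULL) {
        nni_aio_set_msg(aio, ctx->req_msg);
        nni_msg_header_clear(ctx->req_msg);
}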

Note: I've also amended the triggering condition above to make clear that the receive is done with a timeout that expires, before attempting to free the aio.

@gdamore
Contributor

gdamore commented Nov 26, 2023

I need to look at this -- the code was written on the assumption that the message was always valid. So if that's not the case, then my assumptions are incorrect.

@gdamore
Contributor

gdamore commented Nov 26, 2023

Ok, I think I understand. You're calling receive before the send has even completed. That's not something I was thinking about - and your workaround probably works. But I'm not entirely sure I want to depend on it yet.

It requires some bad luck to hit this I think -- or planning to arrange to have your send operation "stuck" (because you don't have a connection to send it on). I want to think through this a bit.

@gdamore
Contributor

gdamore commented Nov 26, 2023

So thinking about this a bit more... there is definitely a semantic break here. The send operation has not executed, and so your attempt to call receive on it is actually supposed to fail with an NNG_ESTATE, but as the request was submitted asynchronously we kind of didn't do the work to track and report this properly. (The normal flow is on a given context you issue a send, then only issue a receive once the send completes. Because if the send fails for any reason, then the receive cannot succeed either.)

That's not a big deal.

A bigger concern however, is that we also provide a guaranteed semantic that if send fails, we will not discard the user's message, so that they can try again with the same message (or perhaps look at the message to understand what didn't get sent). But since we canceled the receive (which shouldn't even exist in this state except for the break in the state assumptions), we also clean up all the context, which is bad juju.

I'm torn between trying to remediate this by recovering the req message in this case, and just discarding the message and letting the user deal with the fact that it's gone. (If the user hadn't tried to use these two requests almost in parallel, it wouldn't have been an issue.)

@gdamore
Contributor

gdamore commented Nov 26, 2023

(For the record, the normal way to drive REQ state is to use just a single aio: do the send, then submit the receive from the send's completion callback.)
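A hedged sketch of that single-aio pattern using the public API (the socket, flag, and callback names are illustrative, and error handling is omitted):

#include <stdbool.h>
#include <nng/nng.h>

static nng_socket sock;    /* opened elsewhere with nng_req0_open() */
static nng_aio   *aio;     /* allocated with nng_aio_alloc(&aio, req_cb, NULL) */
static bool       sending;

static void req_cb(void *arg) {
        (void) arg;
        int rv = nng_aio_result(aio);
        if (sending) {
                if (rv != 0) {
                        /* send failed: the message is still ours to reclaim or free */
                        nng_msg *m = nng_aio_get_msg(aio);
                        if (m != NULL) {
                                nng_msg_free(m);
                        }
                        return;
                }
                /* send completed: reuse the same aio to submit the receive */
                sending = false;
                nng_recv_aio(sock, aio);
        } else if (rv == 0) {
                nng_msg *reply = nng_aio_get_msg(aio);
                /* ... consume the reply ... */
                nng_msg_free(reply);
        }
}

To issue a request: set sending = true, attach the message with nng_aio_set_msg(aio, msg), and call nng_send_aio(sock, aio); the callback then drives the rest.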

@gdamore
Contributor

gdamore commented Nov 26, 2023

The above fix (PR) will prevent the segfault. But you're responsible for reclaiming the message from the failed send aio in this case -- failure to do so will leak the message.

This is the semantic around message ownership -- we only "retain" ownership of the message once we have transmitted it; an asynchronous transmit must report back to the caller with an indication of success (or that the request has been queued) before we will accept ownership.
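So in a binding that hits this case, something along these lines (a sketch, not taken from the PR; send_aio is the stuck send aio from the reproduction above) would reclaim the message before freeing the aio:

nng_aio_cancel(send_aio);          /* abort the never-transmitted send */
nng_aio_wait(send_aio);            /* let the cancellation settle */
if (nng_aio_result(send_aio) != 0) {
        nng_msg *m = nng_aio_get_msg(send_aio);
        if (m != NULL) {
                nng_msg_free(m);   /* or keep it and resend later */
        }
}
nng_aio_free(send_aio);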
