Segfault in req0_ctx_cancel_send() #1702

Closed
shikokuchuo opened this issue Nov 18, 2023 · 6 comments

@shikokuchuo
Contributor

shikokuchuo commented Nov 18, 2023

Describe the bug
While developing the R binding, I came across this and reported it (on Discord) a few months ago. I have now traced it with the help of valgrind.

To Reproduce

  1. Start a req socket
  2. Start a listener on the socket (no connection)
  3. Allocate two aios; send a message on one using nng_send_aio() and receive on the other using nng_recv_aio() with an expiry timeout.
  4. Wait until the recv aio has expired.
  5. Call nng_aio_free() on the send aio (a minimal sketch of these steps follows).
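A minimal C sketch of these steps against the public NNG API (the URL, timeout value, and variable names are illustrative, and error checks are omitted for brevity):

#include <nng/nng.h>
#include <nng/protocol/reqrep0/req.h>

nng_socket   s;
nng_listener l;
nng_aio     *saio, *raio;
nng_msg     *msg;

nng_req0_open(&s);
nng_listen(s, "tcp://127.0.0.1:5555", &l, 0); /* listener only, nothing ever connects */

nng_aio_alloc(&saio, NULL, NULL);
nng_aio_alloc(&raio, NULL, NULL);

nng_msg_alloc(&msg, 0);
nng_aio_set_msg(saio, msg);
nng_send_aio(s, saio);            /* send stays pending: no pipe to send it on */

nng_aio_set_timeout(raio, 100);   /* short expiry so the recv times out */
nng_recv_aio(s, raio);
nng_aio_wait(raio);               /* wait until the recv aio has expired */

nng_aio_free(saio);               /* segfault reported here, in req0_ctx_cancel_send() */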

The following segfault happens.

==213611== Invalid write of size 8
==213611== at 0x4A8B0E84: nni_msg_header_clear (message.c:623)
==213611== by 0x4A8C4C3C: req0_ctx_cancel_send (req.c:627)
==213611== by 0x4A8AC7B8: nni_aio_fini (aio.c:138)
==213611== by 0x4A8AC7EA: nni_aio_free (aio.c:161)
==213611== by 0x4A8AC7EA: nni_aio_free (aio.c:158)
==213611== by 0x4A89A194: saio_finalizer (aio.c:304)
==213611== by 0x4A89A194: saio_finalizer (aio.c:299)
==213611== by 0x49EF5DC: R_RunWeakRefFinalizer (in /usr/lib/R/lib/libR.so)
==213611== by 0x49EF83B: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x49F2A54: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x4993264: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x49ADCFF: Rf_eval (in /usr/lib/R/lib/libR.so)
==213611== by 0x49AFD95: ??? (in /usr/lib/R/lib/libR.so)
==213611== by 0x49B0BC4: Rf_applyClosure (in /usr/lib/R/lib/libR.so)
==213611== Address 0x40 is not stack'd, malloc'd or (recently) free'd
==213611==

*** caught segfault ***
address 0x40, cause 'memory not mapped'

Environment Details

  • NNG 1.6.0pre (a54820f)
  • Ubuntu 22.04
  • gcc 11.4.0
  • Static library

Additional context

As it is clear that nni_msg_header_clear() is the culprit, I have removed these two lines in req0_ctx_cancel_send():

nni_aio_set_msg(aio, ctx->req_msg);
nni_msg_header_clear(ctx->req_msg);

which eliminates this segfault.

I do not see any impact on my unit tests, and valgrind does not show any memory leaks. Also, in real-world examples where the connection is lost, the req resend still happens.

So I am wondering whether it is generally safe to take those out, or in what context they would be needed (the comment says to restore the message back to the aio?).

From a binding perspective, the aios are individually allocated for messages and hence can be expected to be short-lived.

@shikokuchuo
Contributor Author

I am pretty sure that the above solution works for me, as aios are always allocated for new messages and never reused (which works well for a dynamic, high-level language like R, at least in the current implementation).

It also works more generally to preface those lines with if (ctx->req_msg != NULL), although I'll leave it to you to see if that is the right place for this.
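For reference, a sketch of what that guard might look like inside req0_ctx_cancel_send() (surrounding code elided; whether this is the right place is exactly the open question):

if (ctx->req_msg != NULL) {
        nni_aio_set_msg(aio, ctx->req_msg);
        nni_msg_header_clear(ctx->req_msg);
}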

Note: I've also amended the triggering condition above to make clear that the receive is done with a timeout that expires, before attempting to free the aio.

@gdamore
Contributor

gdamore commented Nov 26, 2023

I need to look at this -- the code was written on the assumption that the message was always valid. So if that's not the case, then my assumptions are incorrect.

@gdamore
Contributor

gdamore commented Nov 26, 2023

Ok, I think I understand. You're calling receive before the send has even completed. That's not something I was thinking about - and your workaround probably works. But I'm not entirely sure I want to depend on it yet.

It requires some bad luck to hit this I think -- or planning to arrange to have your send operation "stuck" (because you don't have a connection to send it on). I want to think through this a bit.

@gdamore
Contributor

gdamore commented Nov 26, 2023

So thinking about this a bit more... there is definitely a semantic break here. The send operation has not executed, and so your attempt to call receive on it is actually supposed to fail with an NNG_ESTATE, but as the request was submitted asynchronously we kind of didn't do the work to track and report this properly. (The normal flow is on a given context you issue a send, then only issue a receive once the send completes. Because if the send fails for any reason, then the receive cannot succeed either.)

That's not a big deal.

A bigger concern however, is that we also provide a guaranteed semantic that if send fails, we will not discard the user's message, so that they can try again with the same message (or perhaps look at the message to understand what didn't get sent). But since we canceled the receive (which shouldn't even exist in this state except for the break in the state assumptions), we also clean up all the context, which is bad juju.

I'm torn between trying to remediate this by recovering the req message in this case, and just discarding the message and letting the user deal with the fact that it's gone. (If the user hadn't tried to use these two requests almost in parallel, it wouldn't have been an issue.)

@gdamore
Contributor

gdamore commented Nov 26, 2023

(For the record, the normal way to drive REQ state is to use just a single aio: do the send, then submit the receive from the send's completion callback.)
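A hedged sketch of that single-aio pattern using the public API (the socket, flag, and callback names are illustrative, and error handling is omitted):

#include <stdbool.h>
#include <nng/nng.h>

static nng_socket sock;    /* opened elsewhere with nng_req0_open() */
static nng_aio   *aio;     /* allocated with nng_aio_alloc(&aio, req_cb, NULL) */
static bool       sending;

static void req_cb(void *arg) {
        (void) arg;
        int rv = nng_aio_result(aio);
        if (sending) {
                if (rv != 0) {
                        /* send failed: the message is still ours to reclaim or free */
                        nng_msg *m = nng_aio_get_msg(aio);
                        if (m != NULL) {
                                nng_msg_free(m);
                        }
                        return;
                }
                /* send completed: reuse the same aio to submit the receive */
                sending = false;
                nng_recv_aio(sock, aio);
        } else if (rv == 0) {
                nng_msg *reply = nng_aio_get_msg(aio);
                /* ... consume the reply ... */
                nng_msg_free(reply);
        }
}

To issue a request: set sending = true, attach the message with nng_aio_set_msg(aio, msg), and call nng_send_aio(sock, aio); the callback then drives the rest.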

@gdamore
Contributor

gdamore commented Nov 26, 2023

The above fix (PR) will prevent the segfault. But you're responsible for reclaiming the message from the failed send aio in this case -- failure to do so will leak the message.

This is the semantic around message ownership -- we only "retain" ownership of the message once we have transmitted it; an asynchronous transmit must report back to the caller with an indication of success (or that the request has been queued) before we will accept ownership.
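So in a binding that hits this case, something along these lines (a sketch, not taken from the PR; send_aio is the stuck send aio from the reproduction above) would reclaim the message before freeing the aio:

nng_aio_cancel(send_aio);          /* abort the never-transmitted send */
nng_aio_wait(send_aio);            /* let the cancellation settle */
if (nng_aio_result(send_aio) != 0) {
        nng_msg *m = nng_aio_get_msg(send_aio);
        if (m != NULL) {
                nng_msg_free(m);   /* or keep it and resend later */
        }
}
nng_aio_free(send_aio);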
