Race condition when sending messages from another thread #354

Open
benoithudson opened this issue Oct 18, 2019 · 15 comments

@benoithudson

We're seeing a race that occasionally causes a 30-second lag. The precise scenario:

  1. Run a ThreadedServer
  2. Connection arrives, thread A assigned to serve_all on that connection.
  3. Thread B sends a message on the connection then calls serve() to get the reply.
  4. Thread B waits for a recv_event while thread A is in poll().
  5. Thread A gets the reply.
  6. Thread A posts the recv_event and gives up the GIL while dispatching the reply, before setting _ready.
  7. Thread B wakes up, enters the poll() and gives up the GIL.
  8. Thread A finishes dispatching the reply and sets _ready on the AsyncResult.
  9. 30 seconds later, thread B's poll times out
  10. Thread B finds the reply is ready and all is well.

So there's no data loss, no unexpected exceptions, just 30 seconds wasted, once in a while.

Workaround: have the client ping every second, so that the waste is at most 1 second (because in step 9, thread B's poll doesn't time out; it just handles the ping).

We haven't minimized the example yet.
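For illustration, here is a rough sketch of the shape of code that can hit this (not our actual code and not a minimized repro; the service and port are made up). Thread A is the ThreadedServer worker running serve_all on the connection; thread B is a server-side thread invoking a client callback (a netref), which sends a request on the same connection and then serves while waiting for the reply.

```python
import threading
import rpyc
from rpyc.utils.server import ThreadedServer

class DemoService(rpyc.Service):  # illustrative service, not our real one
    def exposed_start(self, client_callback):
        # Thread B: calling the netref issues a synchronous request on the
        # shared connection while thread A (the worker) is still serving it.
        worker = threading.Thread(target=client_callback, args=(42,))
        worker.start()

if __name__ == "__main__":
    ThreadedServer(DemoService, port=18861).start()
```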

Environment
  • rpyc version: 4.0.2
  • python version: 2.7.x
  • operating system: Windows, macOS, Linux

This issue is presumably platform-independent.

@benoithudson
Copy link
Author

Fun fact: on further testing, you ideally send an async HANDLE_PING by hand, because the ping function blocks and can therefore itself hit this race condition (or you send a ping with a very short timeout).
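A hedged sketch of the keep-alive workaround (the interval and timeout values are arbitrary, and conn.ping's exact signature may vary between rpyc versions):

```python
import threading

def start_keepalive(conn, interval=1.0):
    """Ping roughly once a second so a thread stuck in poll() wakes up
    after ~1 second instead of the 30-second sync timeout."""
    stop = threading.Event()

    def _loop():
        while not stop.wait(interval):
            try:
                # a short timeout keeps the ping itself from sitting in this race
                conn.ping(timeout=0.5)
            except Exception:
                break  # connection closed or ping failed; stop pinging

    threading.Thread(target=_loop, daemon=True).start()
    return stop  # call stop.set() to shut the keep-alive down
```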

comrumino self-assigned this Nov 7, 2019
comrumino added the Triage (Investigation by a maintainer has started) label Nov 7, 2019
@comrumino
Collaborator

comrumino commented Nov 7, 2019

The scenario you described sounds related to https://github.com/tomerfiliba/rpyc/blob/master/rpyc/core/protocol.py#L415-L419
but step 10 indicates otherwise. Some other improvements might have fixed this. Could you see if the issue occurs in 4.1.2?

@benoithudson
Author

Sounds related indeed, though the serve_threaded function isn't involved in the race identified here (except for calling serve()).

I haven't tested the new version yet, but the relevant code is unchanged:

  • AsyncResult.wait
  • Connection.serve

Basically, the AsyncResult needs to check is_ready while holding the recv_lock and the dispatch needs to set is_ready before releasing it. Or there needs to be some other similar sync mechanism.
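To illustrate the ordering I mean, here is a toy stand-in (not rpyc's actual AsyncResult or Connection code; Python 3 Condition semantics):

```python
import threading

class ToyAsyncResult:
    """Readiness is checked and set under the same lock, so the waiting
    thread can never miss the wake-up between the check and the sleep."""

    def __init__(self):
        self._cond = threading.Condition()  # plays the role of the recv lock
        self._ready = False
        self._value = None

    def wait(self, timeout=None):
        with self._cond:
            while not self._ready:  # checked while holding the lock
                if not self._cond.wait(timeout):
                    raise TimeoutError("reply not ready in time")
        return self._value

    def set(self, value):
        with self._cond:
            self._value = value
            self._ready = True  # set before the lock is released
            self._cond.notify_all()
```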

comrumino added the Bug (Confirmed bug) and Diagnosed labels and removed the Triage (Investigation by a maintainer has started) label Nov 14, 2019
@comrumino
Collaborator

As an update, I added some debugging documentation. A PDF of the TCP stream in which the race condition occurs would help me create a unit test to prevent the issue in the future. If there is any concern about confidentiality, you could always encrypt it with my PGP key and email it to the address under my profile.

@benoithudson
Author

@comrumino : Do you have a CLA for us to sign, to assign rights? We've got a PR we could send up that might better illuminate what's going on (which depends on another PR to be viable).

@comrumino
Collaborator

comrumino commented Feb 21, 2020

Nothing to sign, but I use TLDRLegal. I think CLAs mostly apply to restrictive IP rights. Even so, the license agreement covers your rights.

I always welcome PRs or any other effort that makes my life easier 😄

@DemonOne

The exact same issue also happens when a BgServingThread steals the async response data and flags the result object just as the originating thread starts its poll().
This is a real issue, especially if you set the sync timeout to None, in which case your program will never recover.
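Roughly this pattern (the endpoint and remote method are placeholders):

```python
import rpyc
from rpyc.utils.helpers import BgServingThread

conn = rpyc.connect("localhost", 18861)  # placeholder endpoint
bg = BgServingThread(conn)               # background thread serves the connection
try:
    # The bg thread can receive and flag this call's reply right before the
    # calling thread starts poll()ing, so the caller sits in poll() until the
    # sync timeout expires (or forever, if the timeout is None).
    result = conn.root.do_work()         # hypothetical exposed method
finally:
    bg.stop()
    conn.close()
```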

@comrumino
Collaborator

DemonOne, are you using the conn passed to BgServingThread before you stop the background thread? I imagine the channel behavior is consistent with TCP. Once I have/create a concrete example, I would be able to better research solutions.

At the moment, serving connections could be made harder to use incorrectly: improve resource acquisition so that only one thread serves a given connection until it is released or the thread is closed. My initial impression is that implementing a solution to reroute or better propagate replies would have a measurable performance impact; letting RPyC streams/channels/connections handle response data would be more efficient and more consistent with typical socket usage.

@TI-AviBerko

@comrumino (it's me, DemonOne)

The problem as I see it:
I'm creating a connection and passing it to a BgServingThread, this connection is used to execute numerous commands.

  • In certain situations, when calling a function through the connection, the connection's async response is intercepted by the bg thread, handled, and signaled right before sync_request() starts waiting for the reply, which leads to it blocking on poll() until the default timeout elapses, even though the result is actually valid.

I managed to reproduce this at a rate of around 3-5 per 100 calls (in an internal tool I'm writing).
Lately I've also seen this happen on the server, where a callback call wound up stumbling into the same issue.

@TI-AviBerko

@wumb0 @comrumino
Regarding the fix in #455: in my use case the async timeout is infinite, so one of the threads can reach serve() before _is_ready is signaled and block forever.

@wumb0
Contributor

wumb0 commented Sep 20, 2021 via email

@TI-AviBerko

Thanks, but I don't think that's related.
I solved it by monkey patching sync_request() so that poll() cannot block indefinitely.

I do agree about the larger issue; maybe it would be easier to concentrate the networking in one context.
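Roughly along these lines (illustrative only; this isn't my exact patch, and it assumes rpyc 4.x internals where sync_request reads the "sync_request_timeout" config value):

```python
from rpyc.core.protocol import Connection

_orig_sync_request = Connection.sync_request

def _bounded_sync_request(self, handler, *args):
    # temporarily force a finite sync timeout so poll() cannot block forever
    old_timeout = self._config.get("sync_request_timeout")
    self._config["sync_request_timeout"] = 30
    try:
        return _orig_sync_request(self, handler, *args)
    finally:
        self._config["sync_request_timeout"] = old_timeout

Connection.sync_request = _bounded_sync_request
```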

@comrumino
Collaborator

@TI-AviBerko @wumb0 @benoithudson is everyone here using either serve_threaded or BgServingThread?

After another crack at this issue, here are my thoughts. I found the serve_threaded docstring to be a refresher on the topic:

    CAVEAT: using non-immutable types that require a netref to be constructed to serve a request,
    or invoking anything else that performs a sync_request, may timeout due to the sync_request reply being
    received by another thread serving the connection. A more conventional approach where each client thread
    opens a new connection would allow `ThreadedServer` to naturally avoid such multiplexing issues and
    is the preferred approach for threading procedures that invoke sync_request. See issue #345
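For reference, a minimal sketch of the one-connection-per-thread pattern that the caveat recommends (host and port are placeholders):

```python
import threading
import rpyc

_local = threading.local()

def get_conn(host="localhost", port=18861):
    """Return a connection private to the calling thread, so no two threads
    multiplex sync requests over the same Connection object."""
    conn = getattr(_local, "conn", None)
    if conn is None or conn.closed:
        conn = rpyc.connect(host, port)
        _local.conn = conn
    return conn
```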

If I were to rephrase/correct this, after my most recent commits, I would make the argument that the Connection class did not properly acquire and release the _recvlock at the time of writing this comment. To address this issue, I switched _recvlock to an rlock and acquired the lock for the duration of the synchronous call. This may fix the issues you all are seeing here.

I still need to check whether the Connection class safely handles async requests. An issue may exist for async requests during the close sequence of the protocol.

Comments?

@comrumino
Collaborator

This relates to/impacts #482, since multiprocessing is a more complex scenario. At a minimum, the documentation could be better.

@comrumino
Collaborator

Even with added support around threading in RPyC, there will still be limitations due to fork: https://www.evanjones.ca/fork-is-dangerous.html
