Race condition when sending messages from another thread #354

Open
benoithudson opened this issue Oct 18, 2019 · 15 comments

@benoithudson

We're seeing a race that occasionally causes a 30-second lag. The precise scenario:

  1. Run a ThreadedServer
  2. Connection arrives, thread A assigned to serve_all on that connection.
  3. Thread B sends a message on the connection then calls serve() to get the reply.
  4. Thread B waits for a recv_event while thread A is in poll().
  5. Thread A gets the reply.
  6. Thread A posts the recv_event and gives up the GIL while dispatching the reply, before setting _ready.
  7. Thread B wakes up, enters the poll() and gives up the GIL.
  8. Thread A finishes dispatching the reply and sets _ready on the AsyncResult.
  9. 30 seconds later, thread B's poll times out
  10. Thread B finds the reply is ready and all is well.

So there's no data loss, no unexpected exceptions, just 30 seconds wasted, once in a while.

Workaround: have the client ping every second, so that the waste is at most 1 second (because in step 9, thread B's poll doesn't time out; it just handles the ping).

We haven't minimized the example yet.
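For illustration, here is a rough sketch of the shape of code that can hit this (not our actual code and not a minimized repro; the service and port are made up). Thread A is the ThreadedServer worker running serve_all on the connection; thread B is a server-side thread invoking a client callback (a netref), which sends a request on the same connection and then serves while waiting for the reply.

```python
import threading
import rpyc
from rpyc.utils.server import ThreadedServer

class DemoService(rpyc.Service):  # illustrative service, not our real one
    def exposed_start(self, client_callback):
        # Thread B: calling the netref issues a synchronous request on the
        # shared connection while thread A (the worker) is still serving it.
        worker = threading.Thread(target=client_callback, args=(42,))
        worker.start()

if __name__ == "__main__":
    ThreadedServer(DemoService, port=18861).start()
```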

Environment
  • rpyc version: 4.0.2
  • python version: 2.7.x
  • operating system: Windows, macOS, Linux

This issue is presumably platform-independent.

@benoithudson
Copy link
Author

Fun fact: on further testing, you ideally send an async HANDLE_PING by hand, because the ping function blocks and can therefore itself hit this race condition (or you send a ping with a very short timeout).
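A hedged sketch of the keep-alive workaround (the interval and timeout values are arbitrary, and conn.ping's exact signature may vary between rpyc versions):

```python
import threading

def start_keepalive(conn, interval=1.0):
    """Ping roughly once a second so a thread stuck in poll() wakes up
    after ~1 second instead of the 30-second sync timeout."""
    stop = threading.Event()

    def _loop():
        while not stop.wait(interval):
            try:
                # a short timeout keeps the ping itself from sitting in this race
                conn.ping(timeout=0.5)
            except Exception:
                break  # connection closed or ping failed; stop pinging

    threading.Thread(target=_loop, daemon=True).start()
    return stop  # call stop.set() to shut the keep-alive down
```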

comrumino self-assigned this Nov 7, 2019
comrumino added the Triage (Investigation by a maintainer has started) label Nov 7, 2019
@comrumino
Collaborator

comrumino commented Nov 7, 2019

The scenario you described sounds related to https://github.com/tomerfiliba/rpyc/blob/master/rpyc/core/protocol.py#L415-L419
but step 10 indicates otherwise. Some other improvements might have fixed this. Could you see if the issue occurs in 4.1.2?

@benoithudson
Author

Sounds related indeed, though the serve_threaded function isn't involved in the race identified here (except for calling serve()).

I haven't tested the new version yet, but the relevant code is unchanged:

  • AsyncResult.wait
  • Connection.serve

Basically, the AsyncResult needs to check is_ready while holding the recv_lock and the dispatch needs to set is_ready before releasing it. Or there needs to be some other similar sync mechanism.
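To illustrate the ordering I mean, here is a toy stand-in (not rpyc's actual AsyncResult or Connection code; Python 3 Condition semantics):

```python
import threading

class ToyAsyncResult:
    """Readiness is checked and set under the same lock, so the waiting
    thread can never miss the wake-up between the check and the sleep."""

    def __init__(self):
        self._cond = threading.Condition()  # plays the role of the recv lock
        self._ready = False
        self._value = None

    def wait(self, timeout=None):
        with self._cond:
            while not self._ready:  # checked while holding the lock
                if not self._cond.wait(timeout):
                    raise TimeoutError("reply not ready in time")
        return self._value

    def set(self, value):
        with self._cond:
            self._value = value
            self._ready = True  # set before the lock is released
            self._cond.notify_all()
```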

comrumino added the Bug (Confirmed bug) and Diagnosed labels and removed the Triage (Investigation by a maintainer has started) label Nov 14, 2019
@comrumino
Collaborator

As an update, I added some debugging documentation. A PDF of the TCP stream in which the race condition occurs would help me create a unit test to prevent the issue in the future. If there is any concern about confidentiality, you could always encrypt it with my PGP key and email it to the address under my profile.

@benoithudson
Author

@comrumino : Do you have a CLA for us to sign, to assign rights? We've got a PR we could send up that might better illuminate what's going on (which depends on another PR to be viable).

@comrumino
Collaborator

comrumino commented Feb 21, 2020

Nothing to sign, but I use TLDRLegal. I think CLAs mostly apply to restrictive IP rights. Even so, the license agreement covers your rights.

I always welcome PRs or any other effort that makes my life easier 😄

@DemonOne

The exact same issue also happens when a BgServingThread steals the async response data and flags the result object just as the originating thread starts its poll().
This is a real issue, especially if you set the sync timeout to None, in which case your program will never recover.
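Roughly this pattern (the endpoint and remote method are placeholders):

```python
import rpyc
from rpyc.utils.helpers import BgServingThread

conn = rpyc.connect("localhost", 18861)  # placeholder endpoint
bg = BgServingThread(conn)               # background thread serves the connection
try:
    # The bg thread can receive and flag this call's reply right before the
    # calling thread starts poll()ing, so the caller sits in poll() until the
    # sync timeout expires (or forever, if the timeout is None).
    result = conn.root.do_work()         # hypothetical exposed method
finally:
    bg.stop()
    conn.close()
```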

@comrumino
Collaborator

DemonOne, are you using the conn passed to BgServingThread before you stop the background thread? I imagine the channel behavior is consistent with TCP. Once I have/create a concrete example, I would be able to better research solutions.

At the moment, serving connections could be made harder to use incorrectly: improve resource acquisition so that only one thread serves a given connection until it is released or the thread is closed. My initial impression is that implementing a solution to reroute or better propagate replies would have a measurable performance impact; letting RPyC streams/channels/connections handle response data would be more efficient and more consistent with typical socket usage.

@TI-AviBerko

@comrumino (it's me, DemonOne)

The problem as I see it:
I'm creating a connection and passing it to a BgServingThread, this connection is used to execute numerous commands.

  • In certain situations, when calling a function through the connection, the connection's async response is intercepted by the bg thread, handled, and signaled right before sync_request() starts waiting for the reply, which leads to it blocking on poll() until the default timeout elapses, even though the result is actually valid.

I managed to reproduce this at a rate of around 3-5 per 100 calls (in an internal tool I'm writing).
Lately I've also seen this happen on the server, where a callback call wound up stumbling into the same issue.

@TI-AviBerko

@wumb0 @comrumino
Regarding the fix in #455: in my use case the async timeout is infinite, so one of the threads can reach serve() before _is_ready is signaled and block forever.

@wumb0
Contributor

wumb0 commented Sep 20, 2021 via email

@TI-AviBerko

Thanks, but I don't think that's related.
I solved it by monkey patching sync_request() so that poll() cannot block indefinitely.

I do agree about the larger issue; maybe it would be easier to concentrate the networking in one context.
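Roughly along these lines (illustrative only; this isn't my exact patch, and it assumes rpyc 4.x internals where sync_request reads the "sync_request_timeout" config value):

```python
from rpyc.core.protocol import Connection

_orig_sync_request = Connection.sync_request

def _bounded_sync_request(self, handler, *args):
    # temporarily force a finite sync timeout so poll() cannot block forever
    old_timeout = self._config.get("sync_request_timeout")
    self._config["sync_request_timeout"] = 30
    try:
        return _orig_sync_request(self, handler, *args)
    finally:
        self._config["sync_request_timeout"] = old_timeout

Connection.sync_request = _bounded_sync_request
```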

@comrumino
Collaborator

@TI-AviBerko @wumb0 @benoithudson is everyone here using either serve_threaded or BgServingThread?

After another crack at this issue, here are my thoughts. I found the serve_threaded docstring to be a refresher on the topic:

    CAVEAT: using non-immutable types that require a netref to be constructed to serve a request,
    or invoking anything else that performs a sync_request, may timeout due to the sync_request reply being
    received by another thread serving the connection. A more conventional approach where each client thread
    opens a new connection would allow `ThreadedServer` to naturally avoid such multiplexing issues and
    is the preferred approach for threading procedures that invoke sync_request. See issue #345
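For reference, a minimal sketch of the one-connection-per-thread pattern that the caveat recommends (host and port are placeholders):

```python
import threading
import rpyc

_local = threading.local()

def get_conn(host="localhost", port=18861):
    """Return a connection private to the calling thread, so no two threads
    multiplex sync requests over the same Connection object."""
    conn = getattr(_local, "conn", None)
    if conn is None or conn.closed:
        conn = rpyc.connect(host, port)
        _local.conn = conn
    return conn
```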

If I were to rephrase/correct this, after my most recent commits, I would make the argument that the Connection class did not properly acquire and release the _recvlock at the time of writing this comment. To address this issue, I switched _recvlock to an rlock and acquired the lock for the duration of the synchronous call. This may fix the issues you all are seeing here.

I still need to check whether the Connection class safely handles async requests. An issue may exist for async requests during the close sequence of the protocol.

Comments?

@comrumino
Collaborator

This relates to/impacts #482, since multiprocessing is a more complex scenario. At a minimum, the documentation could be better.

@comrumino
Collaborator

Even with added support around threading in RPyC, there will still be limitations due to fork: https://www.evanjones.ca/fork-is-dangerous.html
