[core] Fix data race on grpc client context #49475

Merged

Conversation

dentiny
Contributor

@dentiny dentiny commented Dec 28, 2024

Resolves issue #49473

From the TSAN stacktrace, it's clear we have a data race on grpc::ClientContext:

  • One thread reconstructs the context with placement new while another thread is still accessing it from in-flight grpc calls.
  • The fix proposed here is to give each grpc call its own dedicated context, because for async operations we currently have no guarantee that two health checks won't interleave.

The same race exists for the health check response.
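To illustrate the direction of the fix, here is a minimal sketch only; `AsyncCheck` and `HealthCheckResponse` below are placeholders standing in for Ray's actual client wrapper and the generated health proto type, not real APIs. Each health check owns its own context/response, kept alive by the completion callback:

```cpp
#include <grpcpp/grpcpp.h>

#include <functional>
#include <memory>

// Placeholder standing in for ::grpc::health::v1::HealthCheckResponse.
struct HealthCheckResponse {};

// Placeholder for the async Check() call exposed by the client wrapper;
// NOT a real gRPC or Ray API.
void AsyncCheck(grpc::ClientContext *context, HealthCheckResponse *response,
                std::function<void(const grpc::Status &)> done);

void StartHealthCheck() {
  // Dedicated to this single async request, instead of reusing a member that
  // gets re-constructed in place with placement new between checks.
  auto context = std::make_shared<grpc::ClientContext>();
  auto response = std::make_shared<HealthCheckResponse>();

  // The callback owns the shared_ptrs, so both objects stay alive until the
  // RPC completes and cannot race with the next health check's context.
  AsyncCheck(context.get(), response.get(),
             [context, response](const grpc::Status &status) {
               // Handle status/response, then schedule the next check.
             });
}
```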

@dentiny dentiny added the go add ONLY when ready to merge, run all tests label Dec 28, 2024
@dentiny dentiny requested a review from a team as a code owner December 28, 2024 11:09
new (&context_) grpc::ClientContext();
response_.Clear();
// grpc context and health check response are dedicated to one single async request.
auto context = std::make_shared<grpc::ClientContext>();
Contributor Author

Our intent is to sequence the grpc call and the next health check, but there is no guarantee that the two calls don't overlap, and no control over how grpc uses the passed-in client context.

Contributor

Q: StartHealthCheck is only called either at the start of the manager or on completion of the last rpc. Then how can the context still be in use at the point where we destroy it?

Contributor Author

@dentiny dentiny Dec 29, 2024

Yeah, good question; I thought about it for a while.

First, the io context and the async grpc machinery run on two different threads, so accesses to the same variable have to be ordered.
Second, the grpc callback schedules the next io context call X ms later, and that call accesses the context variable.

So there's no guarantee that
(1) the next io context call, which reconstructs the context, and
(2) grpc's later access to the context
don't touch the same context concurrently.

It's a pretty rare situation; I only hit it twice in 1000 repeated test runs, but luckily the stacktrace is clear enough.
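For context, a rough reconstruction of the old, racy pattern; member and method names here are assumptions for illustration, not the exact Ray code:

```cpp
#include <grpcpp/grpcpp.h>

#include <new>

class HealthChecker {
 public:
  void StartHealthCheck() {
    // Runs on the io_context thread, scheduled some milliseconds after the
    // previous check: re-construct the shared member in place.
    new (&context_) grpc::ClientContext();
    // ... issue the async Check() against &context_ ...
  }

 private:
  // Reused across checks. The grpc completion thread can still be reading
  // this member for the previous in-flight RPC when StartHealthCheck()
  // reconstructs it, which is exactly the data race TSAN reports.
  grpc::ClientContext context_;
};
```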

auto context = std::make_shared<grpc::ClientContext>();
auto response = std::make_shared<::grpc::health::v1::HealthCheckResponse>();

// Get the context and response pointer before async call, since the order of function
Contributor

I don't understand this comment. But I get what you are trying to do. We can say:

Get the raw pointer addresses before the shared_ptrs are moved into the lambda.

Contributor Author

Example: suppose you have a function invocation f(a, b, c); the order in which a, b, and c are evaluated is unspecified and left to the implementation.

In this particular case, if the lambda captures are evaluated first, the shared_ptr is already moved-from (invalidated) by the time it is read for the first argument.
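A small illustration of the hazard, reusing the same placeholder `AsyncCheck` declaration and types as the sketch above (again, not the real Ray call):

```cpp
void Hazardous(std::shared_ptr<grpc::ClientContext> context,
               std::shared_ptr<HealthCheckResponse> response) {
  // Whether `context.get()` or the init-capture `std::move(context)` is
  // evaluated first is unspecified, so the first argument may be read from an
  // already moved-from (null) shared_ptr.
  AsyncCheck(context.get(), response.get(),
             [context = std::move(context),
              response = std::move(response)](const grpc::Status &status) {});
}

void Safe(std::shared_ptr<grpc::ClientContext> context,
          std::shared_ptr<HealthCheckResponse> response) {
  // Take the raw pointer addresses first, then move the shared_ptrs into the
  // lambda, so the arguments are computed from still-valid pointers.
  auto *context_ptr = context.get();
  auto *response_ptr = response.get();
  AsyncCheck(context_ptr, response_ptr,
             [context = std::move(context),
              response = std::move(response)](const grpc::Status &status) {});
}
```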

Contributor Author

@dentiny dentiny Dec 29, 2024

Let me add a comment to clarify; I think it's a tricky problem and we should document WHY we do it this way.

Contributor

I mean, we all know argument evaluation order is non-deterministic. But that's not the main point here (why do we care about argument order in this spot?). We need to answer that with:

// Get the raw pointer addresses for the Check() call, before the shared_ptrs are moved into the lambda.

Contributor Author

ok sure, I copied your words

Signed-off-by: dentiny <[email protected]>
@dentiny dentiny requested a review from rynewang December 29, 2024 03:52
Signed-off-by: dentiny <[email protected]>
@rynewang rynewang enabled auto-merge (squash) December 30, 2024 21:58
@rynewang rynewang merged commit dd946c5 into ray-project:master Dec 31, 2024
6 checks passed
srinathk10 pushed a commit that referenced this pull request Jan 3, 2025
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025