[core] Fix data race on grpc client context #49475
Conversation
Signed-off-by: dentiny <[email protected]>
new (&context_) grpc::ClientContext();
response_.Clear();
// grpc context and health check response are dedicated to one single async request.
auto context = std::make_shared<grpc::ClientContext>();
Our intent is to sequentialize the grpc call and the next health check, but there's no guarantee that the two calls don't overlap, nor about how the passed-in client context is used internally.
Q: StartHealthCheck is only called either at the start of the manager or on completion of the last rpc. So how can the context still be in use at the point where we destroy it?
Yeah, good question; I thought about it for a while.
First, the io context and the async grpc call run on two different threads, so we have to order their accesses to the same variable.
Second, what the grpc call does is schedule an io context callback X ms later (which accesses the context variable).
There's no guarantee that
(1) the next io context callback, which reconstructs the context, and
(2) grpc's later access to the context
don't touch the same context concurrently.
It's a pretty rare situation; I only hit it twice in 1000 repeated test runs, but luckily the stacktrace is clear enough.
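To make the race concrete, here's a minimal, self-contained sketch of the two patterns being discussed. Names are illustrative only; `ClientContext` below is a stand-in struct, not grpc's type, and the callback wiring is simplified:

```cpp
#include <memory>
#include <new>
#include <utility>

// Stand-in for grpc::ClientContext; illustrative only.
struct ClientContext { char state[64]; };

struct HealthCheckRacySketch {
  ClientContext context_;
  void StartHealthCheck() {
    // Racy pattern: the io-context thread reconstructs the member in place
    // while the grpc thread may still be reading it for the in-flight call.
    new (&context_) ClientContext();
  }
};

struct HealthCheckFixedSketch {
  void StartHealthCheck() {
    // Fixed pattern: a fresh context per request. Moving the shared_ptr into
    // the completion callback keeps it alive until the rpc layer is done, so
    // the next health check can never touch the previous call's context.
    auto context = std::make_shared<ClientContext>();
    auto on_done = [context = std::move(context)]() {
      // `context` is released here, after the rpc has finished with it.
    };
    on_done();  // in the real code, the rpc layer invokes this on completion
  }
};

int main() {
  HealthCheckFixedSketch{}.StartHealthCheck();
}
```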
auto context = std::make_shared<grpc::ClientContext>();
auto response = std::make_shared<::grpc::health::v1::HealthCheckResponse>();

// Get the context and response pointer before async call, since the order of function
I don't understand this comment, but I get what you're trying to do. We can say:
Get the raw pointer addresses before the shared_ptrs are moved into the lambda.
Example: suppose you have a function invocation f(a, b, c); the order in which a, b, and c are evaluated is unspecified, i.e. implementation-specific.
In this particular case, if the lambda is evaluated first, the shared pointers are already moved-from (invalidated) by the time they're evaluated as the first arguments.
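Here's a minimal sketch of the pitfall, assuming a hypothetical `AsyncCall` API that, like `Check()`, takes a raw pointer plus a callback that owns the object's lifetime:

```cpp
#include <functional>
#include <iostream>
#include <memory>
#include <utility>

// Hypothetical async API: uses the raw pointer, then invokes the callback.
void AsyncCall(int *ctx, std::function<void()> done) {
  *ctx += 1;
  done();
}

int main() {
  auto ctx = std::make_shared<int>(41);

  // Unsafe: `ctx.get()` and the move-capturing lambda are sibling arguments,
  // and C++ leaves their evaluation order unspecified. If the lambda is
  // evaluated first, `ctx` is already moved-from (null) when `ctx.get()`
  // runs, and AsyncCall dereferences nullptr.
  //
  //   AsyncCall(ctx.get(), [ctx = std::move(ctx)] { /* ... */ });

  // Safe: grab the raw pointer before the shared_ptr is moved into the lambda.
  int *ctx_ptr = ctx.get();
  AsyncCall(ctx_ptr, [ctx = std::move(ctx)] {
    std::cout << "context value: " << *ctx << "\n";  // prints 42
  });
}
```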
Let me add a comment to clarify; I think it's a tricky problem and we need to document WHY we do it this way.
I mean, we all know argument evaluation order is non-deterministic. But that's not the main point here (why do we care about argument order in this code?). We need to answer that with:
// Get the raw pointer addresses for the Check() call, before the shared_ptrs are moved into the lambda.
Ok sure, I copied your wording.
Resolves issue #49473 From the TSAN stacktrace, it's clear we have data race on `grpc::ClientContext`: - One thread is trying to reconstructing the context with placement new, another thread is accessing it from grpc calls - The fix proposed here is to make grpc context dedicated to each grpc call, because for async operations, we really don't have any guarantee that two health check won't happen with no interleave (at the moment) The thing happens for health check response as well. --------- Signed-off-by: dentiny <[email protected]>