[core] Fix data race on grpc client context #49475

Merged

Conversation

dentiny
Contributor

@dentiny dentiny commented Dec 28, 2024

Resolves issue #49473

From the TSAN stacktrace, it's clear we have a data race on grpc::ClientContext:

  • One thread reconstructs the context with placement new while another thread is still accessing it from in-flight grpc calls.
  • The fix proposed here is to give each grpc call its own dedicated context, because for async operations we currently have no guarantee that two health checks won't interleave.

The same race exists for the health check response.
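To illustrate the direction of the fix, here is a minimal sketch only; `AsyncCheck` and `HealthCheckResponse` below are placeholders standing in for Ray's actual client wrapper and the generated health proto type, not real APIs. Each health check owns its own context/response, kept alive by the completion callback:

```cpp
#include <grpcpp/grpcpp.h>

#include <functional>
#include <memory>

// Placeholder standing in for ::grpc::health::v1::HealthCheckResponse.
struct HealthCheckResponse {};

// Placeholder for the async Check() call exposed by the client wrapper;
// NOT a real gRPC or Ray API.
void AsyncCheck(grpc::ClientContext *context, HealthCheckResponse *response,
                std::function<void(const grpc::Status &)> done);

void StartHealthCheck() {
  // Dedicated to this single async request, instead of reusing a member that
  // gets re-constructed in place with placement new between checks.
  auto context = std::make_shared<grpc::ClientContext>();
  auto response = std::make_shared<HealthCheckResponse>();

  // The callback owns the shared_ptrs, so both objects stay alive until the
  // RPC completes and cannot race with the next health check's context.
  AsyncCheck(context.get(), response.get(),
             [context, response](const grpc::Status &status) {
               // Handle status/response, then schedule the next check.
             });
}
```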

@dentiny dentiny added the go add ONLY when ready to merge, run all tests label Dec 28, 2024
@dentiny dentiny requested a review from a team as a code owner December 28, 2024 11:09
new (&context_) grpc::ClientContext();
response_.Clear();
// grpc context and health check response are dedicated to one single async request.
auto context = std::make_shared<grpc::ClientContext>();
Contributor Author

Our intent is to sequence the grpc call and the next health check, but there is no guarantee that the two calls don't overlap, and no control over how grpc uses the passed-in client context.

Contributor

Q: StartHealthCheck is only called either at the start of the manager or on completion of the last rpc. Then how can the context still be in use at the point where we destroy it?

Contributor Author

@dentiny dentiny Dec 29, 2024

Yeah, good question; I thought about it for a while.

First, the io context and the async grpc machinery run on two different threads, so accesses to the same variable have to be ordered.
Second, the grpc callback schedules the next io context call X ms later, and that call accesses the context variable.

So there's no guarantee that
(1) the next io context call, which reconstructs the context, and
(2) grpc's later access to the context
don't touch the same context concurrently.

It's a pretty rare situation; I only hit it twice in 1000 repeated test runs, but luckily the stacktrace is clear enough.
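For context, a rough reconstruction of the old, racy pattern; member and method names here are assumptions for illustration, not the exact Ray code:

```cpp
#include <grpcpp/grpcpp.h>

#include <new>

class HealthChecker {
 public:
  void StartHealthCheck() {
    // Runs on the io_context thread, scheduled some milliseconds after the
    // previous check: re-construct the shared member in place.
    new (&context_) grpc::ClientContext();
    // ... issue the async Check() against &context_ ...
  }

 private:
  // Reused across checks. The grpc completion thread can still be reading
  // this member for the previous in-flight RPC when StartHealthCheck()
  // reconstructs it, which is exactly the data race TSAN reports.
  grpc::ClientContext context_;
};
```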

auto context = std::make_shared<grpc::ClientContext>();
auto response = std::make_shared<::grpc::health::v1::HealthCheckResponse>();

// Get the context and response pointer before async call, since the order of function
Contributor

I don't understand this comment. But I get what you are trying to do. We can say:

Get the raw pointer addresses before the shared_ptrs are moved into the lambda.

Contributor Author

Example: suppose you have a function invocation f(a, b, c); the order in which a, b, and c are evaluated is unspecified and left to the implementation.

In this particular case, if the lambda captures are evaluated first, the shared_ptr is already moved-from (invalidated) by the time it is read for the first argument.
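A small illustration of the hazard, reusing the same placeholder `AsyncCheck` declaration and types as the sketch above (again, not the real Ray call):

```cpp
void Hazardous(std::shared_ptr<grpc::ClientContext> context,
               std::shared_ptr<HealthCheckResponse> response) {
  // Whether `context.get()` or the init-capture `std::move(context)` is
  // evaluated first is unspecified, so the first argument may be read from an
  // already moved-from (null) shared_ptr.
  AsyncCheck(context.get(), response.get(),
             [context = std::move(context),
              response = std::move(response)](const grpc::Status &status) {});
}

void Safe(std::shared_ptr<grpc::ClientContext> context,
          std::shared_ptr<HealthCheckResponse> response) {
  // Take the raw pointer addresses first, then move the shared_ptrs into the
  // lambda, so the arguments are computed from still-valid pointers.
  auto *context_ptr = context.get();
  auto *response_ptr = response.get();
  AsyncCheck(context_ptr, response_ptr,
             [context = std::move(context),
              response = std::move(response)](const grpc::Status &status) {});
}
```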

Contributor Author

@dentiny dentiny Dec 29, 2024

Let me add a comment to clarify; I think it's a tricky problem and we should document WHY we do it this way.

Contributor

I mean, we all know argument evaluation order is non-deterministic. But that's not the main point here (why do we care about argument order in this spot?). We need to answer that with:

// Get the raw pointer addresses for the Check() call, before the shared_ptrs are moved into the lambda.

Contributor Author

ok sure, I copied your words

Signed-off-by: dentiny <[email protected]>
@dentiny dentiny requested a review from rynewang December 29, 2024 03:52
Signed-off-by: dentiny <[email protected]>
@rynewang rynewang enabled auto-merge (squash) December 30, 2024 21:58
@rynewang rynewang merged commit dd946c5 into ray-project:master Dec 31, 2024
6 checks passed
srinathk10 pushed a commit that referenced this pull request Jan 3, 2025
anyadontfly pushed a commit to anyadontfly/ray that referenced this pull request Feb 13, 2025