-
Notifications
You must be signed in to change notification settings - Fork 4.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix retry logic for triton client #46703
Conversation
cms-bot internal usage |
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46703/42656 |
A new Pull Request was created by @cjh1 for master. It involves the following packages:
@cmsbuild, @fwyzard, @makortel can you please review it and eventually sign? Thanks. cms-bot commands are listed here |
test parameters: |
please test |
-1 Failed Tests: RelVals RelVals
|
please test with #46701 |
+1 Size: This PR adds an extra 12KB to repository Comparison SummarySummary:
|
@@ -363,6 +369,9 @@ void TritonClient::getResults(const std::vector<std::shared_ptr<tc::InferResult> | |||
void TritonClient::evaluate() { | |||
//undo previous signal from TritonException | |||
if (tries_ > 0) { | |||
// Setup the service token for the current thread. So that we can access the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume (because I don't remember well) the "current thread" here means a thread outside of framework's TBB worker thread pool. I think it would be good to clarify here (and probably also the place where token_
is set) that the evaluate()
is being called outside if framework's control, and therefore the service token is needed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the record, the call chain is:
client_->dispatch(holder); |
virtual void dispatch(edm::WaitingTaskWithArenaHolder holder) { dispatcher_->dispatch(std::move(holder)); } |
client_->evaluate(); |
void TritonClient::evaluate() { |
success = handle_exception([&]() { |
finish(true); |
void SonicClientBase::finish(bool success, std::exception_ptr eptr) { |
evaluate(); |
finish()
is called from inside Triton's AsyncInferMulti()
function, and it can call evaluate()
again if a retry is needed, so that second call can occur outside of the TBB thread pool.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just to be clear, I don't think these comments (in the code) need to explain the full call chain (thanks @kpedro88 for it anyway!). But I think explaining the "current" or "different" thread being a thread outside of the TBB thread pool and the evaluate()
being called outside of the framework's control would be helpful for a future reader.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sure, I can update the comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have updated to comments to clarify how the evaluate
method can be called outside the TBB thread pool.
On retry the client was trying to access the TritonService through the ServiceRegistry. However, the thread calling the evalute method did not have the appropriate context setup to allow this. We now save the ServiceToken when the client is created, so the appropriate context can be setup before accessing the service.
b8e88f2
to
7130113
Compare
+code-checks Logs: https://cmssdt.cern.ch/SDT/code-checks/cms-sw-PR-46703/42719 |
@cmsbuild, please test |
+1 Size: This PR adds an extra 24KB to repository Comparison SummarySummary:
|
Comparison differences are related to #46416 |
+heterogeneous |
This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @mandrenguyen, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2) |
+1 |
PR description:
On retry the client was trying to access the TritonService through the ServiceRegistry. However, the thread calling the evalute method did not have the appropriate context setup to allow this. We now save the ServiceToken when the client is created, so the appropriate context can be setup before accessing the service.
PR validation:
We tested this at NERSC where we are seeing
GOAWAY
responses from our nginx ingress. This logic successfully retries the failed requests.