fix retry logic for triton client #46703

Merged: 1 commit, Nov 21, 2024

Conversation

cjh1
Contributor

@cjh1 cjh1 commented Nov 14, 2024

PR description:

On retry, the client was trying to access the TritonService through the ServiceRegistry. However, the thread calling the evaluate method did not have the appropriate context set up to allow this. We now save the ServiceToken when the client is created, so the appropriate context can be set up before accessing the service.
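The pattern described above can be illustrated with a minimal, self-contained sketch. It does not use the CMSSW headers: `Services`, `Token`, `Registry`, `Registry::Operate`, and `Client` are hypothetical stand-ins chosen for illustration, and the thread-local storage is an assumption about how a per-thread service context might be modelled, not a statement about the framework's internals. The sketch only shows the idea of capturing a token at construction time and installing it on whichever thread later needs the service.

```cpp
#include <cassert>
#include <memory>
#include <string>
#include <thread>
#include <utility>

// Hypothetical stand-in for the set of framework services (not the CMSSW API).
struct Services {
  std::string serverName;  // placeholder for state a service might own
};

// A token is a cheap, copyable handle to the services.
using Token = std::shared_ptr<const Services>;

// Hypothetical registry: a thread only sees services installed on that thread.
class Registry {
public:
  // Services visible on the calling thread, or nullptr if none are installed.
  static const Services* current() { return active_.get(); }
  // Handle to whatever is installed on the calling thread right now.
  static Token presentToken() { return active_; }

  // RAII guard: installs a token on this thread and restores the previous one.
  class Operate {
  public:
    explicit Operate(Token token) : previous_(std::move(active_)) { active_ = std::move(token); }
    ~Operate() { active_ = std::move(previous_); }
    Operate(const Operate&) = delete;
    Operate& operator=(const Operate&) = delete;

  private:
    Token previous_;
  };

private:
  static thread_local Token active_;
};
thread_local Token Registry::active_;

class Client {
public:
  // Constructed on a thread where the services are visible; save the token.
  Client() : token_(Registry::presentToken()) {}

  // May run on any thread, e.g. a callback thread owned by a client library.
  bool evaluate() const {
    Registry::Operate guard(token_);  // make the saved services visible here
    return Registry::current() != nullptr;
  }

private:
  Token token_;
};

// Returns true if the client can reach the services from a foreign thread.
bool demo_retry_from_foreign_thread() {
  Token services = std::make_shared<Services>(Services{"example-server"});
  Registry::Operate frameworkContext(services);  // simulated framework thread
  Client client;

  bool visibleWithoutToken = true;
  bool visibleWithToken = false;
  std::thread callback([&] {
    visibleWithoutToken = (Registry::current() != nullptr);  // nothing installed
    visibleWithToken = client.evaluate();                    // token installed
  });
  callback.join();

  assert(!visibleWithoutToken);
  return visibleWithToken;
}
```

Calling `demo_retry_from_foreign_thread()` from a `main` function exercises both cases: the bare callback thread sees no services, while `Client::evaluate()` sees them because it installs the saved token first.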

PR validation:

We tested this at NERSC where we are seeing GOAWAY responses from our nginx ingress. This logic successfully retries the failed requests.

@cjh1
Contributor Author

cjh1 commented Nov 14, 2024

@asnaylor

@cmsbuild
Contributor

cmsbuild commented Nov 14, 2024

cms-bot internal usage

@asnaylor

@kpedro88 I've tested @cjh1's patch at NERSC and it's working fine. This will fix the issues we were having connecting to the TritonServer running at NERSC.

@cmsbuild
Contributor

A new Pull Request was created by @cjh1 for master.

It involves the following packages:

  • HeterogeneousCore/SonicTriton (heterogeneous)

@cmsbuild, @fwyzard, @makortel can you please review it and eventually sign? Thanks.
@kpedro88, @makortel, @missirol, @riga, @rovere this is something you requested to watch as well.
@antoniovilela, @mandrenguyen, @rappoccio, @sextonkennedy you are the release manager for this.

cms-bot commands are listed here

@kpedro88
Contributor

@asnaylor @cjh1 thanks for this very useful contribution!

@kpedro88
Contributor

test parameters:
workflows = 10805.31,11634.9001,24834.9001
relvals_opt = --what cleanedupgrade,standard,highstats,pileup,generator,extendedgen,production,identity,ged,machine,premix,nano,gpu,2017,2026

@kpedro88
Contributor

please test

@cmsbuild
Contributor

-1

Failed Tests: RelVals
Size: This PR adds an extra 12KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ef1d15/42890/summary.html
COMMIT: b8e88f2
CMSSW: CMSSW_14_2_X_2024-11-15-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46703/42890/install.sh to create a dev area with all the needed externals and cmssw changes.

RelVals

ERROR importing file  relval_data_highstats name 'base_wf_number_2022' is not defined

@kpedro88
Contributor

please test with #46701

@cmsbuild
Contributor

+1

Size: This PR adds an extra 12KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ef1d15/42895/summary.html
COMMIT: b8e88f2
CMSSW: CMSSW_14_2_X_2024-11-15-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week1/cms-sw/cmssw/46703/42895/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • You potentially added 1 lines to the logs
  • Reco comparison results: 11 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3655511
  • DQMHistoTests: Total failures: 531
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3654960
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 218 log files, 186 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@@ -363,6 +369,9 @@ void TritonClient::getResults(const std::vector<std::shared_ptr<tc::InferResult>
 void TritonClient::evaluate() {
   //undo previous signal from TritonException
   if (tries_ > 0) {
     // Setup the service token for the current thread. So that we can access the
Contributor

I assume (because I don't remember well) the "current thread" here means a thread outside of the framework's TBB worker thread pool. I think it would be good to clarify here (and probably also at the place where token_ is set) that evaluate() is being called outside of the framework's control, and therefore the service token is needed.

Contributor

For the record, the call chain is:


virtual void dispatch(edm::WaitingTaskWithArenaHolder holder) { dispatcher_->dispatch(std::move(holder)); }


void TritonClient::evaluate() {

success = handle_exception([&]() {


void SonicClientBase::finish(bool success, std::exception_ptr eptr) {

finish() is called from inside Triton's AsyncInferMulti() function, and it can call evaluate() again if a retry is needed, so that second call can occur outside of the TBB thread pool.
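The retry path described above can be sketched in isolation. This is a simplified illustration, not the SonicTriton code: `asyncInfer` is a hypothetical stand-in for an asynchronous client call that runs its completion callback on a thread it owns, and `RetryingClient`, `allowedTries`, and `failuresBeforeSuccess` are names invented for the example. The sketch records which thread each `evaluate()` call ran on, to show that the retry happens on the callback thread rather than on the original caller's thread.

```cpp
#include <cassert>
#include <functional>
#include <thread>
#include <vector>

// Hypothetical async call: the completion callback runs on a thread owned by
// the "library", not on the thread that started the request.
void asyncInfer(bool succeed, std::function<void(bool)> onDone) {
  std::thread worker([succeed, onDone] { onDone(succeed); });
  worker.join();  // joined here only to keep the sketch deterministic
}

class RetryingClient {
public:
  RetryingClient(unsigned allowedTries, unsigned failuresBeforeSuccess)
      : allowedTries_(allowedTries), failuresBeforeSuccess_(failuresBeforeSuccess) {}

  void evaluate() {
    callerThreads_.push_back(std::this_thread::get_id());
    // Simulated server: fails until enough attempts have been made.
    const bool serverAnswers = tries_ >= failuresBeforeSuccess_;
    asyncInfer(serverAnswers, [this](bool success) { finish(success); });
  }

  // Invoked from the callback thread; may re-enter evaluate() for a retry.
  void finish(bool success) {
    if (!success) {
      ++tries_;
      if (tries_ < allowedTries_) {
        evaluate();  // the retry runs on the callback thread
        return;
      }
    }
    succeeded_ = success;
  }

  bool succeeded() const { return succeeded_; }
  const std::vector<std::thread::id>& callerThreads() const { return callerThreads_; }

private:
  unsigned allowedTries_;
  unsigned failuresBeforeSuccess_;
  unsigned tries_ = 0;
  bool succeeded_ = false;
  std::vector<std::thread::id> callerThreads_;
};

// Returns true if the retry succeeded and ran on a different thread.
bool demo_retry_runs_off_caller_thread() {
  RetryingClient client(/*allowedTries=*/3, /*failuresBeforeSuccess=*/1);
  client.evaluate();
  const auto& ids = client.callerThreads();
  assert(ids.size() == 2);  // first attempt plus one retry
  return client.succeeded() && ids[0] == std::this_thread::get_id() && ids[1] != ids[0];
}
```

In this sketch the first `evaluate()` runs on the caller's thread and the second runs on the thread that delivered the failure, which is the situation in which any per-thread context the caller had is no longer available.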

Contributor

Just to be clear, I don't think these comments (in the code) need to explain the full call chain (thanks @kpedro88 for it anyway!). But I think explaining the "current" or "different" thread being a thread outside of the TBB thread pool and the evaluate() being called outside of the framework's control would be helpful for a future reader.

Contributor Author

Sure, I can update the comment

Contributor Author

I have updated the comments to clarify how the evaluate method can be called outside the TBB thread pool.

On retry, the client was trying to access the TritonService through
the ServiceRegistry. However, the thread calling the evaluate method
did not have the appropriate context set up to allow this. We now
save the ServiceToken when the client is created, so the appropriate
context can be set up before accessing the service.

@cmsbuild
Contributor

Pull request #46703 was updated. @cmsbuild, @fwyzard, @makortel can you please check and sign again.

@makortel
Contributor

@cmsbuild, please test

@cmsbuild
Contributor

+1

Size: This PR adds an extra 24KB to repository
Summary: https://cmssdt.cern.ch/SDT/jenkins-artifacts/pull-request-integration/PR-ef1d15/42977/summary.html
COMMIT: 7130113
CMSSW: CMSSW_14_2_X_2024-11-20-1100/el8_amd64_gcc12
User test area: For local testing, you can use /cvmfs/cms-ci.cern.ch/week0/cms-sw/cmssw/46703/42977/install.sh to create a dev area with all the needed externals and cmssw changes.

Comparison Summary

Summary:

  • No significant changes to the logs found
  • Reco comparison results: 8 differences found in the comparisons
  • DQMHistoTests: Total files compared: 49
  • DQMHistoTests: Total histograms compared: 3812905
  • DQMHistoTests: Total failures: 519
  • DQMHistoTests: Total nulls: 0
  • DQMHistoTests: Total successes: 3812366
  • DQMHistoTests: Total skipped: 20
  • DQMHistoTests: Total Missing objects: 0
  • DQMHistoSizes: Histogram memory added: 0.0 KiB( 48 files compared)
  • Checked 218 log files, 186 edm output root files, 49 DQM output files
  • TriggerResults: no differences found

@makortel
Contributor

Comparison differences are related to #46416

@makortel
Contributor

+heterogeneous

@cmsbuild
Contributor

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @antoniovilela, @mandrenguyen, @sextonkennedy, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@mandrenguyen
Contributor

+1

@cmsbuild cmsbuild merged commit e455f1f into cms-sw:master Nov 21, 2024
12 checks passed
6 participants