Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ragged batching interface for SonicTriton #40814

Merged
merged 31 commits into from
Mar 2, 2023

Conversation

kpedro88
Copy link
Contributor

PR description:

Triton finally added a client-side friendly interface for ragged batching (processing requests with different shapes together). This PR adapts SonicTriton internally to switch between rectangular batching and ragged batching automatically, depending on how shapes are specified, without the user (module developer) having to worry about it. An option to switch manually is also provided. The changes are backwards-compatible: new functionality is available, but existing code that does not use the new functionality will continue to work as before. The README is updated to explain these different cases and the new features. The default Triton server image is also moved to a newer version. This PR may resolve several of the occasional IB failures (etc.) that have been noted in past months; these will be rechecked once it is merged.

Technical details: in Triton, a "request" consists of data that all have the same shape. Therefore, to send multiple inputs with different shapes at the same time, multiple requests must be created. Triton now provides InferMulti() and AsyncInferMulti() interfaces that take a vector of requests and loop over them, in order to handle this more general case.

Contingent details: this interface was actually implemented in Summer 2022, specifically intended to be used with ParticleNet (where we currently pad all inputs to achieve a uniform shape). However, because of various limitations in both PyTorch and ONNX, the performance in the ParticleNet case currently does not improve when using ragged batching. We expect this will eventually be resolved. In the meantime, we want to merge this updated interface so further development can continue on top of these rather involved changes.

PR validation:

  • SonicTriton unit tests pass (including new unit test specifically for ragged batching functionality)
  • DRN unit test passes
  • DRN matrix workflow 10805.31 passes

@makortel
Copy link
Contributor

makortel commented Mar 1, 2023

+heterogeneous

@cmsbuild
Copy link
Contributor

cmsbuild commented Mar 1, 2023

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

@rappoccio
Copy link
Contributor

+1

@cmsbuild cmsbuild merged commit b833bfe into cms-sw:master Mar 2, 2023
@makortel
Copy link
Contributor

makortel commented Mar 2, 2023

@rappoccio I suppose the related external and data PRs need to be merged too.

@kpedro88
Copy link
Contributor Author

kpedro88 commented Mar 2, 2023

Yes, all the PRs listed in #40814 (comment) need to be merged together or the next IB will fail.

@smuzaffar
Copy link
Contributor

@perrotta
Copy link
Contributor

perrotta commented Mar 2, 2023

@kpedro88 tests show a couple of modified user-floats changes for slimmed pat photons in the single gamma workflow 10805.31. Does it have anything to do with the modifications implemented with cms-data/RecoEgamma-EgammaPhotonProducers#2, which was supposed to be only a "minor syntactic change"?

@kpedro88
Copy link
Contributor Author

kpedro88 commented Mar 2, 2023

@perrotta thanks for pointing these out, I had missed them. It may be due to the data update. This special workflow 10805.31 isn't yet run in production. @ssrothman can you take a look?

@perrotta
Copy link
Contributor

perrotta commented Mar 2, 2023

Thank you @kpedro88.
Maintaining this PR without the additional external data will break the IBs (it is already breaking some tests indeed). Would you suggest to merge the externals anyhow, or revert this one till the end of your investigations and re-revert it afterward? I would rather revert it and merge only when we are sure about its outputs (so that possible mistakes do not remain unattended): but please let us know if you have a better plan, or if you think that your investigations can be quick enough,

@kpedro88
Copy link
Contributor Author

kpedro88 commented Mar 2, 2023

@perrotta let's merge the externals now if that's okay. These will just go into the first prerelease, and we will definitely fix any unintended regressions before 13_1_X is final. Other development needs to start on top of the code and external changes here.

@perrotta
Copy link
Contributor

perrotta commented Mar 2, 2023

@perrotta let's merge the externals now if that's okay. These will just go into the first prerelease, and we will definitely fix any unintended regressions before 13_1_X is final. Other development needs to start on top of the code and external changes here.

Uhm, wouldn't it be easier to identify such a possible regression if there is a "clean" baseline to compare with?

@kpedro88
Copy link
Contributor Author

kpedro88 commented Mar 2, 2023

For two photon userfloats, it's easy enough to run tests in an older release and compare by hand. (And actually I see that pre1 is already out, so this can be done with pre1 vs. pre2.)

@perrotta
Copy link
Contributor

perrotta commented Mar 2, 2023

For two photon userfloats, it's easy enough to run tests in an older release and compare by hand. (And actually I see that pre1 is already out, so this can be done with pre1 vs. pre2.)

Well, those two photon userfloats could point to some deeper unintended regression which is not limited to them...

I think it is better to have this reverted. A revert of the revert can be submitted at the same time, and it can be merged as soon as we are satisfied with the checks. We still have some time before pre2 for it.

@kpedro88
Copy link
Contributor Author

kpedro88 commented Mar 2, 2023

If you insist on reverting, it's ultimately your prerogative. But I have seen these reversion/unreversion cycles introduce their own confusion and regressions. Historically, our policy was that regressions in prereleases were allowed in order to facilitate ongoing development.

@perrotta
Copy link
Contributor

perrotta commented Mar 2, 2023

@kpedro88 we discussed it with Sal. Even if not optimal, let have this merged for now, together with all needed externals.
We're going to open a github issue at the same time, to be cleared by pre2 at most, No further development will be allowed in that packege till then.

@rappoccio
Copy link
Contributor

@kpedro88 we can do the merge, but will make an issue marked as "urgent" #40938. Can you please fix it ASAP (before the next pre release) so we can ensure that there is nothing awry?

Also, @cms-sw/reconstruction-l2, we will bypass signatures on
cms-data/RecoEcal-EgammaClusterProducers#3
cms-data/RecoEgamma-EgammaPhotonProducers#2

but assign to take a look in the issue.

@kpedro88
Copy link
Contributor Author

kpedro88 commented Mar 2, 2023

many thanks @perrotta @rappoccio ! we are actively working on fixing any regressions and should have another PR very soon.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants