Ragged batching interface for SonicTriton #40814

kpedro88 · 2023-02-17T19:31:58Z

PR description:

Triton finally added a client-side friendly interface for ragged batching (processing requests with different shapes together). This PR adapts SonicTriton internally to switch between rectangular batching and ragged batching automatically, depending on how shapes are specified, without the user (module developer) having to worry about it. An option to switch manually is also provided. The changes are backwards-compatible: new functionality is available, but existing code that does not use the new functionality will continue to work as before. The README is updated to explain these different cases and the new features. The default Triton server image is also moved to a newer version. This PR may resolve several of the occasional IB failures (etc.) that have been noted in past months; these will be rechecked once it is merged.

Technical details: in Triton, a "request" consists of data that all have the same shape. Therefore, to send multiple inputs with different shapes at the same time, multiple requests must be created. Triton now provides InferMulti() and AsyncInferMulti() interfaces that take a vector of requests and loop over them, in order to handle this more general case.
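As a rough illustration of the idea above (not SonicTriton's actual code), the sketch below groups entries by shape so that each group could form one same-shape request, and picks a batching mode from the shapes alone. The names `group_into_requests` and `batching_mode` are hypothetical; the real client may well build one request per entry before handing the vector to InferMulti() or AsyncInferMulti().

```python
from collections import OrderedDict
from typing import List, Sequence, Tuple

Shape = Tuple[int, ...]


def group_into_requests(shapes: Sequence[Shape]) -> "OrderedDict[Shape, List[int]]":
    """Group entry indices by shape (hypothetical helper).

    Every group holds entries of one shape, so each group could be sent
    as a single request; different shapes end up in different requests.
    """
    requests: "OrderedDict[Shape, List[int]]" = OrderedDict()
    for index, shape in enumerate(shapes):
        requests.setdefault(tuple(shape), []).append(index)
    return requests


def batching_mode(shapes: Sequence[Shape]) -> str:
    """Return 'rectangular' if all shapes agree, otherwise 'ragged' (assumed rule)."""
    distinct = {tuple(shape) for shape in shapes}
    return "rectangular" if len(distinct) <= 1 else "ragged"


if __name__ == "__main__":
    shapes = [(3, 4), (5, 4), (3, 4)]
    print(batching_mode(shapes))
    for shape, indices in group_into_requests(shapes).items():
        print(shape, indices)
```

With the three shapes in the example, two requests would be needed: one for the two `(3, 4)` entries and one for the `(5, 4)` entry.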

Contingent details: this interface was actually implemented in Summer 2022, specifically intended to be used with ParticleNet (where we currently pad all inputs to achieve a uniform shape). However, because of various limitations in both PyTorch and ONNX, the performance in the ParticleNet case currently does not improve when using ragged batching. We expect this will eventually be resolved. In the meantime, we want to merge this updated interface so further development can continue on top of these rather involved changes.
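To make the padding trade-off concrete, here is a minimal sketch, assuming each input is a flat list of floats and that padding uses zeros; it is not the ParticleNet producer code, and `pad_to_uniform` and `padding_overhead` are hypothetical names. It counts how many filler elements a rectangular batch carries that a ragged batch would not need to send.

```python
from typing import List


def pad_to_uniform(entries: List[List[float]], pad_value: float = 0.0) -> List[List[float]]:
    """Pad every entry to the length of the longest one (rectangular batch)."""
    width = max((len(entry) for entry in entries), default=0)
    return [entry + [pad_value] * (width - len(entry)) for entry in entries]


def padding_overhead(entries: List[List[float]]) -> int:
    """Number of filler elements added by padding compared with the ragged layout."""
    real = sum(len(entry) for entry in entries)
    width = max((len(entry) for entry in entries), default=0)
    return len(entries) * width - real


if __name__ == "__main__":
    entries = [[1.0, 2.0, 3.0], [4.0], [5.0, 6.0]]
    print(pad_to_uniform(entries))
    print(padding_overhead(entries))
```

Sending fewer elements does not by itself guarantee faster inference, which is consistent with the observation above that ParticleNet performance does not currently improve.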

PR validation:

SonicTriton unit tests pass (including new unit test specifically for ragged batching functionality)
DRN unit test passes
DRN matrix workflow 10805.31 passes

makortel · 2023-03-01T20:59:06Z

+heterogeneous

cmsbuild · 2023-03-01T20:59:34Z

This pull request is fully signed and it will be integrated in one of the next master IBs (tests are also fine). This pull request will now be reviewed by the release team before it's merged. @perrotta, @dpiparo, @rappoccio (and backports should be raised in the release meeting by the corresponding L2)

rappoccio · 2023-03-02T16:04:54Z

+1

makortel · 2023-03-02T16:23:08Z

@rappoccio I suppose the related external and data PRs need to be merged too.

kpedro88 · 2023-03-02T16:47:57Z

Yes, all the PRs listed in #40814 (comment) need to be merged together or the next IB will fail.

smuzaffar · 2023-03-02T17:02:47Z

@rappoccio @perrotta, this PR requires the following PRs too:
cms-sw/cmsdist#8324
cms-data/HeterogeneousCore-SonicTriton#2
cms-data/RecoEgamma-EgammaPhotonProducers#2
cms-data/RecoEcal-EgammaClusterProducers#3

perrotta · 2023-03-02T17:14:12Z

@kpedro88 tests show a couple of modified user-floats changes for slimmed pat photons in the single gamma workflow 10805.31. Does it have anything to do with the modifications implemented with cms-data/RecoEgamma-EgammaPhotonProducers#2, which was supposed to be only a "minor syntactic change"?

kpedro88 · 2023-03-02T17:16:49Z

@perrotta thanks for pointing these out, I had missed them. It may be due to the data update. This special workflow 10805.31 isn't yet run in production. @ssrothman can you take a look?

perrotta · 2023-03-02T17:25:37Z

Thank you @kpedro88.
Keeping this PR without the additional external data will break the IBs (it is already breaking some tests). Would you suggest merging the externals anyhow, or reverting this one until the end of your investigations and re-reverting it afterward? I would rather revert it and merge only when we are sure about its outputs (so that possible mistakes do not go unnoticed), but please let us know if you have a better plan, or if you think that your investigations can be quick enough.

kpedro88 · 2023-03-02T17:34:30Z

@perrotta let's merge the externals now if that's okay. These will just go into the first prerelease, and we will definitely fix any unintended regressions before 13_1_X is final. Other development needs to start on top of the code and external changes here.

perrotta · 2023-03-02T17:43:00Z

> @perrotta let's merge the externals now if that's okay. These will just go into the first prerelease, and we will definitely fix any unintended regressions before 13_1_X is final. Other development needs to start on top of the code and external changes here.

Uhm, wouldn't it be easier to identify such a possible regression if there is a "clean" baseline to compare with?

kpedro88 · 2023-03-02T17:45:25Z

For two photon userfloats, it's easy enough to run tests in an older release and compare by hand. (And actually I see that pre1 is already out, so this can be done with pre1 vs. pre2.)

perrotta · 2023-03-02T17:54:59Z

> For two photon userfloats, it's easy enough to run tests in an older release and compare by hand. (And actually I see that pre1 is already out, so this can be done with pre1 vs. pre2.)

Well, those two photon userfloats could point to some deeper unintended regression which is not limited to them...

I think it is better to have this reverted. A revert of the revert can be submitted at the same time, and it can be merged as soon as we are satisfied with the checks. We still have some time before pre2 for it.

kpedro88 · 2023-03-02T17:57:10Z

If you insist on reverting, it's ultimately your prerogative. But I have seen these reversion/unreversion cycles introduce their own confusion and regressions. Historically, our policy was that regressions in prereleases were allowed in order to facilitate ongoing development.

perrotta · 2023-03-02T18:23:03Z

@kpedro88 we discussed it with Sal. Even if not optimal, let's have this merged for now, together with all needed externals.
We're going to open a GitHub issue at the same time, to be cleared by pre2 at the latest. No further development will be allowed in that package until then.

rappoccio · 2023-03-02T18:31:36Z

@kpedro88 we can do the merge, but will make an issue marked as "urgent" #40938. Can you please fix it ASAP (before the next pre release) so we can ensure that there is nothing awry?

Also, @cms-sw/reconstruction-l2, we will bypass signatures on
cms-data/RecoEcal-EgammaClusterProducers#3
cms-data/RecoEgamma-EgammaPhotonProducers#2

but assign to take a look in the issue.

kpedro88 · 2023-03-02T19:57:49Z

many thanks @perrotta @rappoccio ! we are actively working on fixing any regressions and should have another PR very soon.

kpedro88 added 27 commits February 14, 2023 14:15

changes for new triton version

5c1d217

combine shape/request info into TritonDataEntry for multi-request ragged batching (WIP)

9d0ab56

finish initial propagation (still WIP)

e9f8c61

simplify synchronization of nEntries across inputs/outputs

f570f3c

fix various mistakes/typos

31db492

propagate to mem resources

899836d

fix off-by-one issues; unit tests now pass

8321b4f

some fixes for compatibility checks

c89b4b3

update server image to newest release

5e20b34

add a test for ragged inputs

71b4f1e

fix bug revealed by test

7f00e86

fix off-by-one

111d248

use simpler example, fix output printing

36a7ae9

simplify

0125089

fix offset error

e5f84dc

update test docs, fix model fetching

0ee76bb

update readme for ragged case

1ec9190

handle batch size zero w/ ragged (including test)

4c844ec

improved batching interface

0ca9b30

fix nEntries handling

c96bb07

handle ragged -> rectangular by removing entries

8b95069

update batching terminology in docs

733d151

try to handle empty batches and size zero inputs automatically

d5a708d

correct size check

a635221

update server version

4539240

fix counting bugs for new batching interface

e77801f

only create shared_ptr once (avoid double free)

4b12f67

cmsbuild added this to the CMSSW_13_1_X milestone Feb 17, 2023

cmsbuild added pending-signatures tests-pending labels Feb 17, 2023

cmsbuild added fully-signed heterogeneous-approved and removed pending-signatures heterogeneous-pending labels Mar 1, 2023

cmsbuild added orp-approved and removed orp-pending labels Mar 2, 2023

cmsbuild merged commit b833bfe into cms-sw:master Mar 2, 2023

perrotta mentioned this pull request Mar 2, 2023

Revert "Ragged batching interface for SonicTriton" #40936

Closed

rappoccio mentioned this pull request Mar 2, 2023

Fix user-floats that changed after updating SonicTriton service in #40814 #40938

Closed

perrotta mentioned this pull request Mar 3, 2023

Fix weights to be identical to original commit cms-data/RecoEgamma-EgammaPhotonProducers#3

Merged

iarspider mentioned this pull request Mar 4, 2023

[GPU] Unit test HeterogeneousCore/SonicTriton failed: CUDA driver version is insufficient for CUDA runtime version #40911

Closed

perrotta mentioned this pull request Mar 15, 2023

Non deterministic outputs of photonDRN from SonicTriton #41060

Open

kpedro88 mentioned this pull request Apr 20, 2023

TritonInputCpuShmResource::copyInput is passed nullptr #38560

Closed
