
Ragged batching interface for SonicTriton #40814

Merged 31 commits (Mar 2, 2023)

Commits:
5c1d217
changes for new triton version
kpedro88 Aug 4, 2022
9d0ab56
combine shape/request info into TritonDataEntry for multi-request rag…
kpedro88 Apr 16, 2022
e9f8c61
finish initial propagation (still WIP)
kpedro88 Apr 18, 2022
f570f3c
simplify synchronization of nEntries across inputs/outputs
kpedro88 Apr 18, 2022
31db492
fix various mistakes/typos
kpedro88 Apr 18, 2022
899836d
propagate to mem resources
kpedro88 Apr 18, 2022
8321b4f
fix off-by-one issues; unit tests now pass
kpedro88 Apr 19, 2022
c89b4b3
some fixes for compatibility checks
kpedro88 Apr 19, 2022
5e20b34
update server image to newest release
kpedro88 Apr 19, 2022
71b4f1e
add a test for ragged inputs
kpedro88 Apr 19, 2022
7f00e86
fix bug revealed by test
kpedro88 Apr 19, 2022
111d248
fix off-by-one
kpedro88 Apr 20, 2022
36a7ae9
use simpler example, fix output printing
kpedro88 Apr 20, 2022
0125089
simplify
kpedro88 Apr 20, 2022
e5f84dc
fix offset error
kpedro88 Apr 20, 2022
0ee76bb
update test docs, fix model fetching
kpedro88 Apr 21, 2022
1ec9190
update readme for ragged case
kpedro88 Apr 21, 2022
4c844ec
handle batch size zero w/ ragged (including test)
kpedro88 Jun 3, 2022
0ca9b30
improved batching interface
kpedro88 Jul 14, 2022
c96bb07
fix nEntries handling
kpedro88 Jul 14, 2022
8b95069
handle ragged -> rectangular by removing entries
kpedro88 Jul 14, 2022
733d151
update batching terminology in docs
kpedro88 Jul 14, 2022
d5a708d
try to handle empty batches and size zero inputs automatically
kpedro88 Jul 25, 2022
a635221
correct size check
kpedro88 Jun 30, 2022
4539240
update server version
kpedro88 Aug 8, 2022
e77801f
fix counting bugs for new batching interface
kpedro88 Aug 31, 2022
4b12f67
only create shared_ptr once (avoid double free)
kpedro88 Feb 17, 2023
2342a44
code format
kpedro88 Feb 18, 2023
b658adc
move image file
kpedro88 Feb 27, 2023
b088dfe
improve memory handling
kpedro88 Feb 28, 2023
570a966
remove unnecessary moves
kpedro88 Mar 1, 2023
50 changes: 36 additions & 14 deletions HeterogeneousCore/SonicTriton/README.md
@@ -9,7 +9,7 @@ Triton supports multiple named inputs and outputs with different types. The allowed types are
boolean, unsigned integer (8, 16, 32, or 64 bits), integer (8, 16, 32, or 64 bits), floating point (16, 32, or 64 bit), or string.

Triton additionally supports inputs and outputs with multiple dimensions, some of which might be variable (denoted by -1).
Concrete values for variable dimensions must be specified for each call (event).
Concrete values for variable dimensions must be specified for each entry (see [Batching](#batching) below).

## Client

@@ -34,22 +34,37 @@ The model information from the server can be printed by enabling `verbose` output.
* `useSharedMemory`: enable use of shared memory (see [below](#shared-memory)) with local servers (default: true)
* `compression`: enable compression of input and output data to reduce bandwidth (using gzip or deflate) (default: none)

The batch size should be set using the client accessor, in order to ensure a consistent value across all inputs:
### Batching

SonicTriton supports two types of batching, rectangular and ragged, depicted below:
![batching diagrams](./doc/batching_diagrams.png)
In the rectangular case, the inputs for each object in an event have the same shape, so they can be combined into a single entry.
(In this case, the batch size is specified as the "outer dimension" of the shape.)
In the ragged case, the inputs for each object in an event do not have the same shape, so they cannot be combined;
instead, they are represented internally as separate entries, each with its own shape specified explicitly.

The batch size is set and accessed using the client, in order to ensure a consistent value across all inputs.
The batch mode can also be changed manually, in order to allow optimizing the allocation of entries.
(If two entries with different shapes are specified, the batch mode will always automatically switch to ragged.)
* `setBatchSize()`: set a new batch size
* some models may not support batching
* `batchSize()`: return current batch size
* `setBatchMode()`: set the batch mode (`Rectangular` or `Ragged`)
* `batchMode()`: get the current batch mode
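
A hedged sketch of how these client calls might be combined (the `client_` handle and object count are assumptions for illustration, not part of the documented interface):

```cpp
// Rectangular: every object shares one shape, so one entry suffices.
client_->setBatchSize(nObjects);

// Ragged: request ragged mode explicitly; it would also be selected
// automatically once entries with different shapes are set.
client_->setBatchMode(TritonBatchMode::Ragged);
client_->setBatchSize(nObjects);
```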

Useful `TritonData` accessors include:
* `variableDims()`: return true if any variable dimensions
* `sizeDims()`: return product of dimensions (-1 if any variable dimensions)
* `shape()`: return actual shape (list of dimensions)
* `sizeShape()`: return product of shape dimensions (returns `sizeDims()` if no variable dimensions)
* `shape(unsigned entry=0)`: return actual shape (list of dimensions) for specified entry
* `sizeShape(unsigned entry=0)`: return product of shape dimensions (returns `sizeDims()` if no variable dimensions) for specified entry
* `byteSize()`: return number of bytes for data type
* `dname()`: return name of data type
* `batchSize()`: return current batch size

To update the `TritonData` shape in the variable-dimension case:
* `setShape(const std::vector<int64_t>& newShape)`: update all (variable) dimensions with values provided in `newShape`
* `setShape(unsigned loc, int64_t val)`: update variable dimension at `loc` with `val`
* `setShape(const std::vector<int64_t>& newShape, unsigned entry=0)`: update all (variable) dimensions with values provided in `newShape` for specified entry
* `setShape(unsigned loc, int64_t val, unsigned entry=0)`: update variable dimension at `loc` with `val` for specified entry
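
For example, in the ragged case each entry might get its own value for a variable dimension (the input name and per-object counts are assumptions for illustration):

```cpp
auto& input1 = iInput.at("input1");
for (unsigned i = 0; i < nObjects; ++i) {
  // variable dimension 0 differs per object, so each object is its own entry
  input1.setShape(0, nParticles[i], i);
}
```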

### I/O types

There are specific local input and output containers that should be used in producers.
Here, `T` is a primitive type, and the two aliases listed below are passed to `TritonInputData::toServer()`
@@ -58,7 +73,7 @@ and returned by `TritonOutputData::fromServer()`, respectively:
* `TritonOutput<T> = std::vector<edm::Span<const T*>>`

The `TritonInputContainer` object should be created using the helper function described below.
It expects one vector per batch entry (i.e. the size of the outer vector is the batch size).
It expects one vector per batch entry, i.e. the size of the outer vector is the batch size (rectangular case) or the number of entries (ragged case).
Therefore, it is best to call `TritonClient::setBatchSize()`, if necessary, before calling the helper.
It will also reserve the expected size of the input in each inner vector (by default),
if the concrete shape is available (i.e. `setShape()` was already called, if the input has variable dimensions).
@@ -100,11 +115,11 @@ In a SONIC Triton producer, the basic flow should follow this pattern:
a. access input object(s) from `TritonInputMap`
b. allocate input data using `allocate<T>()`
c. fill input data
d. set input shape(s) (optional, only if any variable dimensions)
d. set input shape(s) (optional for rectangular case, only if any variable dimensions; required for ragged case)
e. convert using `toServer()` function of input object(s)
2. `produce()`:
a. access output object(s) from `TritonOutputMap`
b. obtain output data as `TritonOutput<T>` using `fromServer()` function of output object(s) (sets output shape(s) if variable dimensions exist)
a. access output object(s) from `TritonOutputMap` (includes shapes)
b. obtain output data as `TritonOutput<T>` using `fromServer()` function of output object(s)
c. fill output products
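
A hedged end-to-end sketch of this acquire/produce pattern for a single rectangular entry (the producer class, input/output names, and member variables are assumptions for illustration, not part of the documented interface):

```cpp
void MyProducer::acquire(edm::Event const& iEvent, edm::EventSetup const&, Input& iInput) {
  client_->setBatchSize(1);
  auto& input1 = iInput.at("input1");              // 1a: access input object
  input1.setShape(0, nFeatures_);                  // 1d: only if variable dims
  auto data1 = input1.allocate<float>();           // 1b: TritonInputContainer<float>
  (*data1)[0].assign(features_.begin(), features_.end());  // 1c: fill input data
  input1.toServer(data1);                          // 1e: convert for the server
}

void MyProducer::produce(edm::Event& iEvent, edm::EventSetup const&, Output const& iOutput) {
  const auto& output1 = iOutput.at("output1");     // 2a: access output object
  TritonOutput<float> result = output1.fromServer<float>();  // 2b: get output data
  // 2c: fill event products from result[0][j] ...
}
```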

## Services
@@ -116,14 +131,14 @@ The script has two operations (`start` and `stop`) and the following options:
* `-d`: use Docker instead of Apptainer
* `-f`: force reuse of (possibly) existing container instance
* `-g`: use GPU instead of CPU
* `-i [name]`: server image name (default: fastml/triton-torchgeo:20.09-py3-geometric)
* `-i [name]`: server image name (default: fastml/triton-torchgeo:22.07-py3-geometric)
* `-M [dir]`: model repository (can be given more than once)
* `-m [dir]`: specific model directory (can be given more than once)
* `-n [name]`: name of container instance, also used for hidden temporary dir (default: triton_server_instance)
* `-P [port]`: base port number for services (-1: automatically find an unused port range) (default: 8000)
* `-p [pid]`: automatically shut down server when process w/ specified PID ends (-1: use parent process PID)
* `-r [num]`: number of retries when starting container (default: 3)
* `-s [dir]`: Apptainer sandbox directory (default: /cvmfs/unpacked.cern.ch/registry.hub.docker.com/fastml/triton-torchgeo:20.09-py3-geometric)
* `-s [dir]`: Apptainer sandbox directory (default: /cvmfs/unpacked.cern.ch/registry.hub.docker.com/fastml/triton-torchgeo:22.07-py3-geometric)
* `-t [dir]`: non-default hidden temporary dir
* `-v`: (verbose) start: activate server debugging info; stop: keep server logs
* `-w [time]`: maximum time to wait for server to start (default: 300 seconds)
@@ -172,4 +187,11 @@ The fallback server has a separate set of options, mostly related to the invocation.

## Examples

Several example producers (running image classification networks or Graph Attention Network) can be found in the [test](./test) directory.
Several example producers can be found in the [test](./test) directory.

## Legend

The SonicTriton documentation uses different terms than Triton itself for certain concepts.
The SonicTriton:Triton correspondence for those terms is given here:
* Entry : request
* Rectangular batching : Triton-supported batching
30 changes: 19 additions & 11 deletions HeterogeneousCore/SonicTriton/interface/TritonClient.h
@@ -16,6 +16,8 @@
#include "grpc_client.h"
#include "grpc_service.pb.h"

enum class TritonBatchMode { Rectangular = 1, Ragged = 2 };

class TritonClient : public SonicClient<TritonInputMap, TritonOutputMap> {
public:
struct ServerSideStats {
@@ -36,21 +38,26 @@ class TritonClient : public SonicClient<TritonInputMap, TritonOutputMap> {
~TritonClient() override;

//accessors
unsigned batchSize() const { return batchSize_; }
unsigned batchSize() const;
TritonBatchMode batchMode() const { return batchMode_; }
bool verbose() const { return verbose_; }
bool useSharedMemory() const { return useSharedMemory_; }
void setUseSharedMemory(bool useShm) { useSharedMemory_ = useShm; }
bool setBatchSize(unsigned bsize);
void setBatchMode(TritonBatchMode batchMode);
void resetBatchMode();
void reset() override;
bool noBatch() const { return noBatch_; }
TritonServerType serverType() const { return serverType_; }

//for fillDescriptions
static void fillPSetDescription(edm::ParameterSetDescription& iDesc);

protected:
//helpers
void getResults(std::shared_ptr<triton::client::InferResult> results);
bool noOuterDim() const { return noOuterDim_; }
unsigned outerDim() const { return outerDim_; }
unsigned nEntries() const;
void getResults(const std::vector<std::shared_ptr<triton::client::InferResult>>& results);
void evaluate() override;
template <typename F>
bool handle_exception(F&& call);
@@ -62,29 +69,30 @@ class TritonClient : public SonicClient<TritonInputMap, TritonOutputMap> {
inference::ModelStatistics getServerSideStatus() const;

//members
unsigned maxBatchSize_;
unsigned batchSize_;
bool noBatch_;
unsigned maxOuterDim_;
unsigned outerDim_;
bool noOuterDim_;
unsigned nEntries_;
TritonBatchMode batchMode_;
bool manualBatchMode_;
bool verbose_;
bool useSharedMemory_;
TritonServerType serverType_;
grpc_compression_algorithm compressionAlgo_;
triton::client::Headers headers_;

//IO pointers for triton
std::vector<triton::client::InferInput*> inputsTriton_;
std::vector<const triton::client::InferRequestedOutput*> outputsTriton_;

std::unique_ptr<triton::client::InferenceServerGrpcClient> client_;
//stores timeout, model name and version
triton::client::InferOptions options_;
std::vector<triton::client::InferOptions> options_;

private:
friend TritonInputData;
friend TritonOutputData;

//private accessors only used by data
auto client() { return client_.get(); }
void addEntry(unsigned entry);
void resizeEntries(unsigned entry);
};

#endif
105 changes: 77 additions & 28 deletions HeterogeneousCore/SonicTriton/interface/TritonData.h
@@ -55,8 +55,8 @@ class TritonData {
TritonData(const std::string& name, const TensorMetadata& model_info, TritonClient* client, const std::string& pid);

//some members can be modified
void setShape(const ShapeType& newShape);
void setShape(unsigned loc, int64_t val);
void setShape(const ShapeType& newShape, unsigned entry = 0);
void setShape(unsigned loc, int64_t val, unsigned entry = 0);

//io accessors
template <typename DT>
@@ -68,16 +68,17 @@
TritonOutput<DT> fromServer() const;

//const accessors
const ShapeView& shape() const { return shape_; }
const ShapeView& shape(unsigned entry = 0) const { return entries_.at(entry).shape_; }
int64_t byteSize() const { return byteSize_; }
const std::string& dname() const { return dname_; }
unsigned batchSize() const { return batchSize_; }

//utilities
bool variableDims() const { return variableDims_; }
int64_t sizeDims() const { return productDims_; }
//default to dims if shape isn't filled
int64_t sizeShape() const { return variableDims_ ? dimProduct(shape_) : sizeDims(); }
int64_t sizeShape(unsigned entry = 0) const {
return variableDims_ ? dimProduct(entries_.at(entry).shape_) : sizeDims();
}

private:
friend class TritonClient;
@@ -88,15 +89,65 @@
friend class TritonGpuShmResource<IO>;
#endif

//group together all relevant information for a single request
//helpful for organizing multi-request ragged batching case
class TritonDataEntry {
public:
//constructors
TritonDataEntry(const ShapeType& dims, bool noOuterDim, const std::string& name, const std::string& dname)
: fullShape_(dims),
shape_(fullShape_.begin() + (noOuterDim ? 0 : 1), fullShape_.end()),
sizeShape_(0),
byteSizePerBatch_(0),
totalByteSize_(0),
offset_(0),
output_(nullptr) {
//create input or output object
IO* iotmp;
createObject(&iotmp, name, dname);
data_.reset(iotmp);
}
//default needed to be able to use std::vector resize()
TritonDataEntry()
: shape_(fullShape_.begin(), fullShape_.end()),
sizeShape_(0),
byteSizePerBatch_(0),
totalByteSize_(0),
offset_(0),
output_(nullptr) {}

private:
friend class TritonData<IO>;
friend class TritonClient;
friend class TritonMemResource<IO>;
friend class TritonHeapResource<IO>;
friend class TritonCpuShmResource<IO>;
#ifdef TRITON_ENABLE_GPU
friend class TritonGpuShmResource<IO>;
#endif

//accessors
void createObject(IO** ioptr, const std::string& name, const std::string& dname);
void computeSizes(int64_t shapeSize, int64_t byteSize, int64_t batchSize);

//members
ShapeType fullShape_;
ShapeView shape_;
size_t sizeShape_, byteSizePerBatch_, totalByteSize_;
std::shared_ptr<IO> data_;
std::shared_ptr<Result> result_;
unsigned offset_;
const uint8_t* output_;
};

//private accessors only used internally or by client
unsigned fullLoc(unsigned loc) const { return loc + (noBatch_ ? 0 : 1); }
void setBatchSize(unsigned bsize);
void checkShm() {}
unsigned fullLoc(unsigned loc) const;
void reset();
void setResult(std::shared_ptr<Result> result) { result_ = result; }
IO* data() { return data_.get(); }
void setResult(std::shared_ptr<Result> result, unsigned entry = 0) { entries_[entry].result_ = result; }
IO* data(unsigned entry = 0) { return entries_[entry].data_.get(); }
void updateMem(size_t size);
void computeSizes();
void resetSizes();
triton::client::InferenceServerGrpcClient* client();
template <typename DT>
void checkType() const {
@@ -110,41 +161,37 @@
return std::any_of(vec.begin(), vec.end(), [](int64_t i) { return i < 0; });
}
int64_t dimProduct(const ShapeView& vec) const {
return std::accumulate(vec.begin(), vec.end(), 1, std::multiplies<int64_t>());
//lambda treats negative dimensions as 0 to avoid overflows
return std::accumulate(
vec.begin(), vec.end(), 1, [](int64_t dim1, int64_t dim2) { return dim1 * std::max(0l, dim2); });
}
void createObject(IO** ioptr);
//generates a unique id number for each instance of the class
unsigned uid() const {
static std::atomic<unsigned> uid{0};
return ++uid;
}
std::string xput() const;
void addEntry(unsigned entry);
void addEntryImpl(unsigned entry);

//members
std::string name_;
std::shared_ptr<IO> data_;
TritonClient* client_;
bool useShm_;
std::string shmName_;
const ShapeType dims_;
bool noBatch_;
unsigned batchSize_;
ShapeType fullShape_;
ShapeView shape_;
bool variableDims_;
int64_t productDims_;
std::string dname_;
inference::DataType dtype_;
int64_t byteSize_;
size_t sizeShape_;
size_t byteSizePerBatch_;
std::vector<TritonDataEntry> entries_;
size_t totalByteSize_;
//can be modified in otherwise-const fromServer() method in TritonMemResource::copyOutput():
//TritonMemResource holds a non-const pointer to an instance of this class
//so that TritonOutputGpuShmResource can store data here
std::shared_ptr<void> holder_;
std::shared_ptr<TritonMemResource<IO>> memResource_;
std::shared_ptr<Result> result_;
//can be modified in otherwise-const fromServer() method to prevent multiple calls
CMS_SA_ALLOW mutable bool done_{};
};
@@ -156,6 +203,16 @@ using TritonOutputMap = std::unordered_map<std::string, TritonOutputData>;

//avoid "explicit specialization after instantiation" error
template <>
void TritonInputData::TritonDataEntry::createObject(triton::client::InferInput** ioptr,
const std::string& name,
const std::string& dname);
template <>
void TritonOutputData::TritonDataEntry::createObject(triton::client::InferRequestedOutput** ioptr,
const std::string& name,
const std::string& dname);
template <>
void TritonOutputData::checkShm();
template <>
std::string TritonInputData::xput() const;
template <>
std::string TritonOutputData::xput() const;
@@ -170,14 +227,6 @@ void TritonOutputData::prepare();
template <>
template <typename DT>
TritonOutput<DT> TritonOutputData::fromServer() const;
template <>
void TritonInputData::reset();
template <>
void TritonOutputData::reset();
template <>
void TritonInputData::createObject(triton::client::InferInput** ioptr);
template <>
void TritonOutputData::createObject(triton::client::InferRequestedOutput** ioptr);

//explicit template instantiation declarations
extern template class TritonData<triton::client::InferInput>;