
Ragged batching interface for SonicTriton #40814

Merged 31 commits (Mar 2, 2023)

Commits:
5c1d217
changes for new triton version
kpedro88 Aug 4, 2022
9d0ab56
combine shape/request info into TritonDataEntry for multi-request rag…
kpedro88 Apr 16, 2022
e9f8c61
finish initial propagation (still WIP)
kpedro88 Apr 18, 2022
f570f3c
simplify synchronization of nEntries across inputs/outputs
kpedro88 Apr 18, 2022
31db492
fix various mistakes/typos
kpedro88 Apr 18, 2022
899836d
propagate to mem resources
kpedro88 Apr 18, 2022
8321b4f
fix off-by-one issues; unit tests now pass
kpedro88 Apr 19, 2022
c89b4b3
some fixes for compatibility checks
kpedro88 Apr 19, 2022
5e20b34
update server image to newest release
kpedro88 Apr 19, 2022
71b4f1e
add a test for ragged inputs
kpedro88 Apr 19, 2022
7f00e86
fix bug revealed by test
kpedro88 Apr 19, 2022
111d248
fix off-by-one
kpedro88 Apr 20, 2022
36a7ae9
use simpler example, fix output printing
kpedro88 Apr 20, 2022
0125089
simplify
kpedro88 Apr 20, 2022
e5f84dc
fix offset error
kpedro88 Apr 20, 2022
0ee76bb
update test docs, fix model fetching
kpedro88 Apr 21, 2022
1ec9190
update readme for ragged case
kpedro88 Apr 21, 2022
4c844ec
handle batch size zero w/ ragged (including test)
kpedro88 Jun 3, 2022
0ca9b30
improved batching interface
kpedro88 Jul 14, 2022
c96bb07
fix nEntries handling
kpedro88 Jul 14, 2022
8b95069
handle ragged -> rectangular by removing entries
kpedro88 Jul 14, 2022
733d151
update batching terminology in docs
kpedro88 Jul 14, 2022
d5a708d
try to handle empty batches and size zero inputs automatically
kpedro88 Jul 25, 2022
a635221
correct size check
kpedro88 Jun 30, 2022
4539240
update server version
kpedro88 Aug 8, 2022
e77801f
fix counting bugs for new batching interface
kpedro88 Aug 31, 2022
4b12f67
only create shared_ptr once (avoid double free)
kpedro88 Feb 17, 2023
2342a44
code format
kpedro88 Feb 18, 2023
b658adc
move image file
kpedro88 Feb 27, 2023
b088dfe
improve memory handling
kpedro88 Feb 28, 2023
570a966
remove unnecessary moves
kpedro88 Mar 1, 2023
50 changes: 36 additions & 14 deletions HeterogeneousCore/SonicTriton/README.md
@@ -9,7 +9,7 @@ Triton supports multiple named inputs and outputs with different types. The allowed types are
boolean, unsigned integer (8, 16, 32, or 64 bits), integer (8, 16, 32, or 64 bits), floating point (16, 32, or 64 bit), or string.

Triton additionally supports inputs and outputs with multiple dimensions, some of which might be variable (denoted by -1).
Concrete values for variable dimensions must be specified for each call (event).
Concrete values for variable dimensions must be specified for each entry (see [Batching](#batching) below).

## Client

@@ -34,22 +34,37 @@ The model information from the server can be printed by enabling `verbose` output.
* `useSharedMemory`: enable use of shared memory (see [below](#shared-memory)) with local servers (default: true)
* `compression`: enable compression of input and output data to reduce bandwidth (using gzip or deflate) (default: none)

The batch size should be set using the client accessor, in order to ensure a consistent value across all inputs:
### Batching

SonicTriton supports two types of batching, rectangular and ragged, depicted below:
![batching diagrams](./doc/batching_diagrams.png)
In the rectangular case, the inputs for each object in an event have the same shape, so they can be combined into a single entry.
(In this case, the batch size is specified as the "outer dimension" of the shape.)
In the ragged case, the inputs for each object in an event do not have the same shape, so they cannot be combined;
instead, they are represented internally as separate entries, each with its own shape specified explicitly.

The batch size is set and accessed using the client, in order to ensure a consistent value across all inputs.
The batch mode can also be changed manually, in order to allow optimizing the allocation of entries.
(If two entries with different shapes are specified, the batch mode will always automatically switch to ragged.)
* `setBatchSize()`: set a new batch size
* some models may not support batching
* `batchSize()`: return current batch size
* `setBatchMode()`: set the batch mode (`Rectangular` or `Ragged`)
* `batchMode()`: get the current batch mode
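
A hedged sketch of how these client calls might be combined (the `client_` handle and object count are assumptions for illustration, not part of the documented interface):

```cpp
// Rectangular: every object shares one shape, so one entry suffices.
client_->setBatchSize(nObjects);

// Ragged: request ragged mode explicitly; it would also be selected
// automatically once entries with different shapes are set.
client_->setBatchMode(TritonBatchMode::Ragged);
client_->setBatchSize(nObjects);
```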

Useful `TritonData` accessors include:
* `variableDims()`: return true if any variable dimensions
* `sizeDims()`: return product of dimensions (-1 if any variable dimensions)
* `shape()`: return actual shape (list of dimensions)
* `sizeShape()`: return product of shape dimensions (returns `sizeDims()` if no variable dimensions)
* `shape(unsigned entry=0)`: return actual shape (list of dimensions) for specified entry
* `sizeShape(unsigned entry=0)`: return product of shape dimensions (returns `sizeDims()` if no variable dimensions) for specified entry
* `byteSize()`: return number of bytes for data type
* `dname()`: return name of data type
* `batchSize()`: return current batch size

To update the `TritonData` shape in the variable-dimension case:
* `setShape(const std::vector<int64_t>& newShape)`: update all (variable) dimensions with values provided in `newShape`
* `setShape(unsigned loc, int64_t val)`: update variable dimension at `loc` with `val`
* `setShape(const std::vector<int64_t>& newShape, unsigned entry=0)`: update all (variable) dimensions with values provided in `newShape` for specified entry
* `setShape(unsigned loc, int64_t val, unsigned entry=0)`: update variable dimension at `loc` with `val` for specified entry
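
For example, in the ragged case each entry might get its own value for a variable dimension (the input name and per-object counts are assumptions for illustration):

```cpp
auto& input1 = iInput.at("input1");
for (unsigned i = 0; i < nObjects; ++i) {
  // variable dimension 0 differs per object, so each object is its own entry
  input1.setShape(0, nParticles[i], i);
}
```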

### I/O types

There are specific local input and output containers that should be used in producers.
Here, `T` is a primitive type, and the two aliases listed below are passed to `TritonInputData::toServer()`
@@ -58,7 +73,7 @@ and returned by `TritonOutputData::fromServer()`, respectively:
* `TritonOutput<T> = std::vector<edm::Span<const T*>>`

The `TritonInputContainer` object should be created using the helper function described below.
It expects one vector per batch entry (i.e. the size of the outer vector is the batch size).
It expects one vector per batch entry, i.e. the size of the outer vector is the batch size (rectangular case) or the number of entries (ragged case).
Therefore, it is best to call `TritonClient::setBatchSize()`, if necessary, before calling the helper.
It will also reserve the expected size of the input in each inner vector (by default),
if the concrete shape is available (i.e. `setShape()` was already called, if the input has variable dimensions).
@@ -100,11 +115,11 @@ In a SONIC Triton producer, the basic flow should follow this pattern:
a. access input object(s) from `TritonInputMap`
b. allocate input data using `allocate<T>()`
c. fill input data
d. set input shape(s) (optional, only if any variable dimensions)
d. set input shape(s) (optional for rectangular case, only if any variable dimensions; required for ragged case)
e. convert using `toServer()` function of input object(s)
2. `produce()`:
a. access output object(s) from `TritonOutputMap`
b. obtain output data as `TritonOutput<T>` using `fromServer()` function of output object(s) (sets output shape(s) if variable dimensions exist)
a. access output object(s) from `TritonOutputMap` (includes shapes)
b. obtain output data as `TritonOutput<T>` using `fromServer()` function of output object(s)
c. fill output products
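
A hedged end-to-end sketch of this acquire/produce pattern for a single rectangular entry (the producer class, input/output names, and member variables are assumptions for illustration, not part of the documented interface):

```cpp
void MyProducer::acquire(edm::Event const& iEvent, edm::EventSetup const&, Input& iInput) {
  client_->setBatchSize(1);
  auto& input1 = iInput.at("input1");              // 1a: access input object
  input1.setShape(0, nFeatures_);                  // 1d: only if variable dims
  auto data1 = input1.allocate<float>();           // 1b: TritonInputContainer<float>
  (*data1)[0].assign(features_.begin(), features_.end());  // 1c: fill input data
  input1.toServer(data1);                          // 1e: convert for the server
}

void MyProducer::produce(edm::Event& iEvent, edm::EventSetup const&, Output const& iOutput) {
  const auto& output1 = iOutput.at("output1");     // 2a: access output object
  TritonOutput<float> result = output1.fromServer<float>();  // 2b: get output data
  // 2c: fill event products from result[0][j] ...
}
```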

## Services
@@ -116,14 +131,14 @@ The script has two operations (`start` and `stop`) and the following options:
* `-d`: use Docker instead of Apptainer
* `-f`: force reuse of (possibly) existing container instance
* `-g`: use GPU instead of CPU
* `-i [name]`: server image name (default: fastml/triton-torchgeo:20.09-py3-geometric)
* `-i [name]`: server image name (default: fastml/triton-torchgeo:22.07-py3-geometric)
* `-M [dir]`: model repository (can be given more than once)
* `-m [dir]`: specific model directory (can be given more than once)
* `-n [name]`: name of container instance, also used for hidden temporary dir (default: triton_server_instance)
* `-P [port]`: base port number for services (-1: automatically find an unused port range) (default: 8000)
* `-p [pid]`: automatically shut down server when process w/ specified PID ends (-1: use parent process PID)
* `-r [num]`: number of retries when starting container (default: 3)
* `-s [dir]`: Apptainer sandbox directory (default: /cvmfs/unpacked.cern.ch/registry.hub.docker.com/fastml/triton-torchgeo:20.09-py3-geometric)
* `-s [dir]`: Apptainer sandbox directory (default: /cvmfs/unpacked.cern.ch/registry.hub.docker.com/fastml/triton-torchgeo:22.07-py3-geometric)
* `-t [dir]`: non-default hidden temporary dir
* `-v`: (verbose) start: activate server debugging info; stop: keep server logs
* `-w [time]`: maximum time to wait for server to start (default: 300 seconds)
@@ -172,4 +187,11 @@ The fallback server has a separate set of options, mostly related to the invocation.

## Examples

Several example producers (running image classification networks or Graph Attention Network) can be found in the [test](./test) directory.
Several example producers can be found in the [test](./test) directory.

## Legend

The SonicTriton documentation uses different terms than Triton itself for certain concepts.
The SonicTriton:Triton correspondence for those terms is given here:
* Entry : request
* Rectangular batching : Triton-supported batching
30 changes: 19 additions & 11 deletions HeterogeneousCore/SonicTriton/interface/TritonClient.h
@@ -16,6 +16,8 @@
#include "grpc_client.h"
#include "grpc_service.pb.h"

enum class TritonBatchMode { Rectangular = 1, Ragged = 2 };

class TritonClient : public SonicClient<TritonInputMap, TritonOutputMap> {
public:
struct ServerSideStats {
@@ -36,21 +38,26 @@ class TritonClient : public SonicClient<TritonInputMap, TritonOutputMap> {
~TritonClient() override;

//accessors
unsigned batchSize() const { return batchSize_; }
unsigned batchSize() const;
TritonBatchMode batchMode() const { return batchMode_; }
bool verbose() const { return verbose_; }
bool useSharedMemory() const { return useSharedMemory_; }
void setUseSharedMemory(bool useShm) { useSharedMemory_ = useShm; }
bool setBatchSize(unsigned bsize);
void setBatchMode(TritonBatchMode batchMode);
void resetBatchMode();
void reset() override;
bool noBatch() const { return noBatch_; }
TritonServerType serverType() const { return serverType_; }

//for fillDescriptions
static void fillPSetDescription(edm::ParameterSetDescription& iDesc);

protected:
//helpers
void getResults(std::shared_ptr<triton::client::InferResult> results);
bool noOuterDim() const { return noOuterDim_; }
unsigned outerDim() const { return outerDim_; }
unsigned nEntries() const;
void getResults(const std::vector<std::shared_ptr<triton::client::InferResult>>& results);
void evaluate() override;
template <typename F>
bool handle_exception(F&& call);
@@ -62,29 +69,30 @@ class TritonClient : public SonicClient<TritonInputMap, TritonOutputMap> {
inference::ModelStatistics getServerSideStatus() const;

//members
unsigned maxBatchSize_;
unsigned batchSize_;
bool noBatch_;
unsigned maxOuterDim_;
unsigned outerDim_;
bool noOuterDim_;
unsigned nEntries_;
TritonBatchMode batchMode_;
bool manualBatchMode_;
bool verbose_;
bool useSharedMemory_;
TritonServerType serverType_;
grpc_compression_algorithm compressionAlgo_;
triton::client::Headers headers_;

//IO pointers for triton
std::vector<triton::client::InferInput*> inputsTriton_;
std::vector<const triton::client::InferRequestedOutput*> outputsTriton_;

std::unique_ptr<triton::client::InferenceServerGrpcClient> client_;
//stores timeout, model name and version
triton::client::InferOptions options_;
std::vector<triton::client::InferOptions> options_;

private:
friend TritonInputData;
friend TritonOutputData;

//private accessors only used by data
auto client() { return client_.get(); }
void addEntry(unsigned entry);
void resizeEntries(unsigned entry);
};

#endif
105 changes: 77 additions & 28 deletions HeterogeneousCore/SonicTriton/interface/TritonData.h
@@ -55,8 +55,8 @@ class TritonData {
TritonData(const std::string& name, const TensorMetadata& model_info, TritonClient* client, const std::string& pid);

//some members can be modified
void setShape(const ShapeType& newShape);
void setShape(unsigned loc, int64_t val);
void setShape(const ShapeType& newShape, unsigned entry = 0);
void setShape(unsigned loc, int64_t val, unsigned entry = 0);

//io accessors
template <typename DT>
@@ -68,16 +68,17 @@
TritonOutput<DT> fromServer() const;

//const accessors
const ShapeView& shape() const { return shape_; }
const ShapeView& shape(unsigned entry = 0) const { return entries_.at(entry).shape_; }
int64_t byteSize() const { return byteSize_; }
const std::string& dname() const { return dname_; }
unsigned batchSize() const { return batchSize_; }

//utilities
bool variableDims() const { return variableDims_; }
int64_t sizeDims() const { return productDims_; }
//default to dims if shape isn't filled
int64_t sizeShape() const { return variableDims_ ? dimProduct(shape_) : sizeDims(); }
int64_t sizeShape(unsigned entry = 0) const {
return variableDims_ ? dimProduct(entries_.at(entry).shape_) : sizeDims();
}

private:
friend class TritonClient;
@@ -88,15 +89,65 @@
friend class TritonGpuShmResource<IO>;
#endif

//group together all relevant information for a single request
//helpful for organizing multi-request ragged batching case
class TritonDataEntry {
public:
//constructors
TritonDataEntry(const ShapeType& dims, bool noOuterDim, const std::string& name, const std::string& dname)
: fullShape_(dims),
shape_(fullShape_.begin() + (noOuterDim ? 0 : 1), fullShape_.end()),
sizeShape_(0),
byteSizePerBatch_(0),
totalByteSize_(0),
offset_(0),
output_(nullptr) {
//create input or output object
IO* iotmp;
createObject(&iotmp, name, dname);
data_.reset(iotmp);
}
//default needed to be able to use std::vector resize()
TritonDataEntry()
: shape_(fullShape_.begin(), fullShape_.end()),
sizeShape_(0),
byteSizePerBatch_(0),
totalByteSize_(0),
offset_(0),
output_(nullptr) {}

private:
friend class TritonData<IO>;
friend class TritonClient;
friend class TritonMemResource<IO>;
friend class TritonHeapResource<IO>;
friend class TritonCpuShmResource<IO>;
#ifdef TRITON_ENABLE_GPU
friend class TritonGpuShmResource<IO>;
#endif

//accessors
void createObject(IO** ioptr, const std::string& name, const std::string& dname);
void computeSizes(int64_t shapeSize, int64_t byteSize, int64_t batchSize);

//members
ShapeType fullShape_;
ShapeView shape_;
size_t sizeShape_, byteSizePerBatch_, totalByteSize_;
std::shared_ptr<IO> data_;
std::shared_ptr<Result> result_;
unsigned offset_;
const uint8_t* output_;
};

//private accessors only used internally or by client
unsigned fullLoc(unsigned loc) const { return loc + (noBatch_ ? 0 : 1); }
void setBatchSize(unsigned bsize);
void checkShm() {}
unsigned fullLoc(unsigned loc) const;
void reset();
void setResult(std::shared_ptr<Result> result) { result_ = result; }
IO* data() { return data_.get(); }
void setResult(std::shared_ptr<Result> result, unsigned entry = 0) { entries_[entry].result_ = result; }
IO* data(unsigned entry = 0) { return entries_[entry].data_.get(); }
void updateMem(size_t size);
void computeSizes();
void resetSizes();
triton::client::InferenceServerGrpcClient* client();
template <typename DT>
void checkType() const {
@@ -110,41 +161,37 @@
return std::any_of(vec.begin(), vec.end(), [](int64_t i) { return i < 0; });
}
int64_t dimProduct(const ShapeView& vec) const {
return std::accumulate(vec.begin(), vec.end(), 1, std::multiplies<int64_t>());
//lambda treats negative dimensions as 0 to avoid overflows
return std::accumulate(
vec.begin(), vec.end(), 1, [](int64_t dim1, int64_t dim2) { return dim1 * std::max(0l, dim2); });
}
void createObject(IO** ioptr);
//generates a unique id number for each instance of the class
unsigned uid() const {
static std::atomic<unsigned> uid{0};
return ++uid;
}
std::string xput() const;
void addEntry(unsigned entry);
void addEntryImpl(unsigned entry);

//members
std::string name_;
std::shared_ptr<IO> data_;
TritonClient* client_;
bool useShm_;
std::string shmName_;
const ShapeType dims_;
bool noBatch_;
unsigned batchSize_;
ShapeType fullShape_;
ShapeView shape_;
bool variableDims_;
int64_t productDims_;
std::string dname_;
inference::DataType dtype_;
int64_t byteSize_;
size_t sizeShape_;
size_t byteSizePerBatch_;
std::vector<TritonDataEntry> entries_;
size_t totalByteSize_;
//can be modified in otherwise-const fromServer() method in TritonMemResource::copyOutput():
//TritonMemResource holds a non-const pointer to an instance of this class
//so that TritonOutputGpuShmResource can store data here
std::shared_ptr<void> holder_;
std::shared_ptr<TritonMemResource<IO>> memResource_;
std::shared_ptr<Result> result_;
//can be modified in otherwise-const fromServer() method to prevent multiple calls
CMS_SA_ALLOW mutable bool done_{};
};
@@ -156,6 +203,16 @@ using TritonOutputMap = std::unordered_map<std::string, TritonOutputData>;

//avoid "explicit specialization after instantiation" error
template <>
void TritonInputData::TritonDataEntry::createObject(triton::client::InferInput** ioptr,
const std::string& name,
const std::string& dname);
template <>
void TritonOutputData::TritonDataEntry::createObject(triton::client::InferRequestedOutput** ioptr,
const std::string& name,
const std::string& dname);
template <>
void TritonOutputData::checkShm();
template <>
std::string TritonInputData::xput() const;
template <>
std::string TritonOutputData::xput() const;
@@ -170,14 +227,6 @@ void TritonOutputData::prepare();
template <>
template <typename DT>
TritonOutput<DT> TritonOutputData::fromServer() const;
template <>
void TritonInputData::reset();
template <>
void TritonOutputData::reset();
template <>
void TritonInputData::createObject(triton::client::InferInput** ioptr);
template <>
void TritonOutputData::createObject(triton::client::InferRequestedOutput** ioptr);

//explicit template instantiation declarations
extern template class TritonData<triton::client::InferInput>;