SONIC updates for site support #45182

Merged · 16 commits · Aug 6, 2024
12 changes: 6 additions & 6 deletions HeterogeneousCore/SonicTriton/README.md
```diff
@@ -132,19 +132,19 @@ The script has three operations (`start`, `stop`, `check`) and the following options:
 * `-c`: don't cleanup temporary dir (for debugging)
 * `-C [dir]`: directory containing Nvidia compatibility drivers (checks CMSSW_BASE by default if available)
 * `-D`: dry run: print container commands rather than executing them
-* `-d`: use Docker instead of Apptainer
+* `-d [exe]`: container choice: apptainer, docker, podman, podman-hpc (default: apptainer)
 * `-E [path]`: include extra path(s) for executables (default: /cvmfs/oasis.opensciencegrid.org/mis/apptainer/current/bin)
 * `-f`: force reuse of (possibly) existing container instance
-* `-g`: use GPU instead of CPU
-* `-i` [name]`: server image name (default: fastml/triton-torchgeo:22.07-py3-geometric)
+* `-g [device]`: device choice: auto (try to detect GPU), CPU, GPU (default: auto)
+* `-i [name]`: server image name (default: fastml/triton-torchgeo:22.07-py3-geometric)
 * `-I [num]`: number of model instances (default: 0 -> means no local editing of config files)
 * `-M [dir]`: model repository (can be given more than once)
 * `-m [dir]`: specific model directory (can be given more than once)
 * `-n [name]`: name of container instance, also used for hidden temporary dir (default: triton_server_instance)
 * `-P [port]`: base port number for services (-1: automatically find an unused port range) (default: 8000)
 * `-p [pid]`: automatically shut down server when process w/ specified PID ends (-1: use parent process PID)
 * `-r [num]`: number of retries when starting container (default: 3)
-* `-s [dir]`: Apptainer sandbox directory (default: /cvmfs/unpacked.cern.ch/registry.hub.docker.com/fastml/triton-torchgeo:22.07-py3-geometric)
+* `-s [dir]`: apptainer sandbox directory (default: /cvmfs/unpacked.cern.ch/registry.hub.docker.com/fastml/triton-torchgeo:22.07-py3-geometric)
 * `-t [dir]`: non-default hidden temporary dir
 * `-v`: (verbose) start: activate server debugging info; stop: keep server logs
 * `-w [time]`: maximum time to wait for server to start (default: 300 seconds)
```
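As an illustration of the updated options, a small Python wrapper could launch a server with the new `-d` (container) and `-g` (device) flags. This is a hypothetical sketch: the instance name `my_test` is made up, and `cmsTriton` is only available inside a CMSSW environment, so the wrapper degrades gracefully when the script is absent.

```python
import shutil
import subprocess

def start_triton_server(name="my_test", container="apptainer", device="auto"):
    """Launch cmsTriton with the -d (container) and -g (device) options.

    Returns the CompletedProcess, or None when cmsTriton is not on PATH
    (e.g. outside a CMSSW environment).
    """
    exe = shutil.which("cmsTriton")
    if exe is None:
        return None  # script not available; nothing to launch
    return subprocess.run(
        [exe, "-n", name, "-d", container, "-g", device, "start"],
        check=False,
    )
```

The matching shutdown would pass `stop` instead of `start`, with the same `-n` name.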
```diff
@@ -200,8 +200,8 @@ The fallback server has a separate set of options, mostly related to the invocation
 * `enable`: enable the fallback server
 * `debug`: enable debugging (equivalent to `-c` in `cmsTriton`)
 * `verbose`: enable verbose output in logs (equivalent to `-v` in `cmsTriton`)
-* `useDocker`: use Docker instead of Apptainer (equivalent to `-d` in `cmsTriton`)
-* `useGPU`: run on local GPU (equivalent to `-g` in `cmsTriton`)
+* `container`: container choice (equivalent to `-d` in `cmsTriton`)
+* `device`: device choice (equivalent to `-g` in `cmsTriton`)
 * `retries`: number of retries when starting container (passed to `-r [num]` in `cmsTriton` if >= 0; default: -1)
 * `wait`: maximum time to wait for server to start (passed to `-w [time]` in `cmsTriton` if >= 0; default: -1)
 * `instanceBaseName`: base name for server instance if random names are enabled (default: triton_server_instance)
```
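In a job configuration, the renamed fallback parameters are set as strings rather than booleans. A hedged sketch follows: the `process.load` path is inferred from the file location in this PR, the process name `"TEST"` and the `podman-hpc` value are illustrative choices, and this fragment only runs where CMSSW's `FWCore` is installed.

```python
import FWCore.ParameterSet.Config as cms

process = cms.Process("TEST")
process.load("HeterogeneousCore.SonicTriton.TritonService_cff")
# Replaces the old booleans useDocker/useGPU with string-valued choices:
process.TritonService.fallback.container = cms.untracked.string("podman-hpc")
process.TritonService.fallback.device = cms.untracked.string("auto")
```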
9 changes: 5 additions & 4 deletions HeterogeneousCore/SonicTriton/interface/TritonService.h
```diff
@@ -36,8 +36,8 @@ class TritonService {
         : enable(pset.getUntrackedParameter<bool>("enable")),
           debug(pset.getUntrackedParameter<bool>("debug")),
           verbose(pset.getUntrackedParameter<bool>("verbose")),
-          useDocker(pset.getUntrackedParameter<bool>("useDocker")),
-          useGPU(pset.getUntrackedParameter<bool>("useGPU")),
+          container(pset.getUntrackedParameter<std::string>("container")),
+          device(pset.getUntrackedParameter<std::string>("device")),
           retries(pset.getUntrackedParameter<int>("retries")),
           wait(pset.getUntrackedParameter<int>("wait")),
           instanceName(pset.getUntrackedParameter<std::string>("instanceName")),
@@ -54,8 +54,8 @@ class TritonService {
     bool enable;
     bool debug;
     bool verbose;
-    bool useDocker;
-    bool useGPU;
+    std::string container;
+    std::string device;
     int retries;
     int wait;
     std::string instanceName;
@@ -89,6 +89,7 @@ class TritonService {
     std::unordered_set<std::string> models;
     static const std::string fallbackName;
     static const std::string fallbackAddress;
+    static const std::string siteconfName;
   };
   struct Model {
     Model(const std::string& path_ = "") : path(path_) {}
```
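On the Python side, the untracked parameters read by the constructor above would correspond to a PSet along these lines. This is only a sketch: the default values are inferred from the option descriptions in the README hunk earlier in this PR, not taken from the PR's actual `_cff` defaults.

```python
import FWCore.ParameterSet.Config as cms

# Hypothetical fallback PSet matching the renamed C++ struct members:
fallback = cms.untracked.PSet(
    enable = cms.untracked.bool(False),
    debug = cms.untracked.bool(False),
    verbose = cms.untracked.bool(False),
    container = cms.untracked.string("apptainer"),  # replaces useDocker (bool)
    device = cms.untracked.string("auto"),          # replaces useGPU (bool)
    retries = cms.untracked.int32(-1),
    wait = cms.untracked.int32(-1),
)
```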
10 changes: 0 additions & 10 deletions HeterogeneousCore/SonicTriton/python/TritonService_cff.py
```diff
@@ -2,18 +2,8 @@
 
 from Configuration.ProcessModifiers.enableSonicTriton_cff import enableSonicTriton
 
-_gpu_available_cached = None
-
-def _gpu_available():
-    global _gpu_available_cached
-    if _gpu_available_cached is None:
-        import os
-        _gpu_available_cached = (os.system("nvidia-smi -L") == 0)
-    return _gpu_available_cached
-
 enableSonicTriton.toModify(TritonService,
     fallback = dict(
         enable = True,
-        useGPU = _gpu_available(),
     ),
 )
```
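The deleted helper shelled out to `nvidia-smi` at configuration time; with this PR, equivalent detection presumably happens inside `cmsTriton` when `-g auto` is used. A minimal standalone sketch of such a check (not the PR's actual implementation), using `subprocess` rather than the removed `os.system` call:

```python
import shutil
import subprocess

def gpu_available():
    """Return True if nvidia-smi is present and lists at least one GPU."""
    if shutil.which("nvidia-smi") is None:
        return False  # driver tools not installed -> assume CPU-only host
    result = subprocess.run(["nvidia-smi", "-L"], capture_output=True)
    # "nvidia-smi -L" exits 0 and prints one line per detected GPU.
    return result.returncode == 0 and bool(result.stdout.strip())
```

Unlike the removed helper, this version does not cache its result; callers that need the old behavior can wrap it in `functools.lru_cache`.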