Control number of threads used by CPU server #2018
Comments
Triton "core" itself uses one CPU thread per model instance. This thread batches/schedules and issues requests to that model instance. How busy this thread is depends on the number of requests to the model. Triton HTTP/REST endpoint and GRPC endpoint will also use threads. Each FW backend will also use threads in a FW specific manner and Triton may or may not be able to control that. The total number of threads doesn't really matter does it? As you say, perhaps what you want to be able to control is to ensure that a single request doesn't completely monopolize the server so that other requests don't make progress. We are working on a more general solution to this problem in the rate-limiter. See #1507 (comment) |
I think I didn't describe the use case well enough, so here is some more information (it's certainly a non-standard use). Having already written the C++ code to prepare inputs for, and process outputs from, the Triton server, we would prefer not to have to write a separate piece of C++ code that uses the ML frameworks natively when a GPU server isn't available. The CPU server capability is a nice way to reuse our existing, simpler C++ implementations.

However, in a production scenario there may be multiple jobs sharing the same physical node; for example, an 8-core node could be split into two 4-core jobs. We therefore need to ensure that, if each job sets up its own Triton server using the local CPU, each server only tries to use 4 cores. (Ideally it should be restricted even further, to 1 core per request, since the other cores might be scheduled by other CPU processes running in the job; the rate-limiter may help here.) If we launch the server using Docker (or equivalent), there are arguments like …
That seems like a restriction that should be controlled outside of Triton. Perhaps by cgroups? https://serverfault.com/questions/478946/how-can-i-create-and-use-linux-cgroups-as-a-non-root-user
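For example, a minimal sketch assuming the server runs in Docker (whose CPU flags are implemented via cgroups underneath); the image tag and model-repository path are placeholders:

```sh
# Confine the whole Triton container to 4 CPUs (cores 0-3).
docker run --rm --cpus=4 --cpuset-cpus=0-3 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models
```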
Closing due to inactivity.
For future reference: this is now implemented in backend-specific model configuration settings for ONNX and TensorFlow.
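For the ONNX Runtime backend these take the form of per-model `parameters` entries in `config.pbtxt` (parameter names as I recall them from the onnxruntime_backend README, so double-check there; the model name and values are placeholders). The TensorFlow backend exposes analogous intra/inter-op settings.

```
# Hypothetical config.pbtxt fragment capping ONNX Runtime's own thread pools.
name: "my_cpu_model"
platform: "onnxruntime_onnx"
parameters { key: "intra_op_thread_count" value: { string_value: "4" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }
```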
Is your feature request related to a problem? Please describe.
https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/run.html#running-triton-on-a-system-without-a-gpu describes how to run a Triton server on a system with no GPU, so that the CPU is used to perform inference. I have found this very useful as a rapid testing option. I would also like to use this as a fallback option in a production context; however, I do not see any switches to control how many CPU threads the CPU-only server is allowed to use.
Describe the solution you'd like
An option to control the maximum number of threads used by a CPU-only server. (At a deeper level, it's important to understand whether a server limited to 4 threads will use all of them for a single request, or will run one request per thread, up to the limit.)
Describe alternatives you've considered
I looked through the documentation to see whether this was already available, but did not find anything.
Additional context
N/A