
Control number of threads used by CPU server #2018

Closed
kpedro88 opened this issue Sep 14, 2020 · 5 comments

@kpedro88
Contributor

Is your feature request related to a problem? Please describe.
https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/run.html#running-triton-on-a-system-without-a-gpu describes how to run a Triton server on a system with no GPU, so that the CPU is used to perform inference. I have found this very useful as a rapid testing option. I would also like to use this as a fallback option in a production context; however, I do not see any switches to control how many CPU threads the CPU-only server is allowed to use.

Describe the solution you'd like
An option to control the maximum number of threads used by a CPU-only server. (At a deeper level, it's important to understand whether a server limited to 4 threads will use all of those threads for a single request, or will run one request per thread, up to the limit.)

Describe alternatives you've considered
I looked through the documentation to see whether this was already available, but did not find anything.

Additional context
N/A

@deadeyegoodwin
Contributor

Triton "core" itself uses one CPU thread per model instance. This thread batches/schedules and issues requests to that model instance. How busy this thread is depends on the number of requests to the model. Triton HTTP/REST endpoint and GRPC endpoint will also use threads. Each FW backend will also use threads in a FW specific manner and Triton may or may not be able to control that.

The total number of threads doesn't really matter, does it? As you say, perhaps what you want to control is ensuring that a single request doesn't completely monopolize the server and prevent other requests from making progress. We are working on a more general solution to this problem in the rate limiter. See #1507 (comment)
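
For context on the "one CPU thread per model instance" point, the number of instances for a model is set in its config.pbtxt via instance_group. A minimal sketch, assuming a hypothetical CPU-only model named mymodel (the name and count are placeholders):

    # models/mymodel/config.pbtxt (sketch)
    name: "mymodel"
    instance_group [
      {
        count: 2        # two CPU instances; per the comment above, one scheduling thread each
        kind: KIND_CPU
      }
    ]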

@kpedro88
Contributor Author

kpedro88 commented Oct 8, 2020

I think I didn't describe the use case well enough, so here is some more information (it's certainly a non-standard use): we have already written the C++ code to prepare inputs for and process outputs from the Triton server, and we would prefer not to write a separate piece of C++ code that uses the ML frameworks natively when a GPU server isn't available. The CPU server capability is a nice way to reuse our existing, simpler C++ implementations.

However, in a production scenario, there may be multiple jobs sharing the same physical node. For example, an 8-core node could be split into two 4-core jobs. Therefore, we need to ensure that, if each job sets up its own Triton server using local CPU, each server only tries to use 4 cores. (Ideally, it should be restricted even further to use 1 core per request, since the other cores might be scheduled by other CPU processes running in the job. The rate-limiter may help here.)

If we launch the server using Docker (or equivalent), there are arguments like --cpus to restrict the container. However, this seems to universally require superuser permission, which is not necessarily available.
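
For reference, the Docker-level restriction mentioned above looks roughly like the following sketch (the image tag, ports, and model repository path are placeholders), with the caveat that running Docker this way typically needs elevated privileges:

    # Sketch: cap the container at 4 CPUs (--cpuset-cpus=0-3 would pin specific cores instead)
    docker run --rm --cpus=4 \
      -p 8000:8000 -p 8001:8001 -p 8002:8002 \
      -v /path/to/models:/models \
      nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
      tritonserver --model-repository=/models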

@deadeyegoodwin
Contributor

That seems like a restriction that should be controlled outside of Triton. Perhaps by cgroups? https://serverfault.com/questions/478946/how-can-i-create-and-use-linux-cgroups-as-a-non-root-user
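
For anyone following the cgroups route, a minimal sketch along the lines of the linked answer, assuming the libcgroup tools are installed and an administrator performs the one-time delegation step (the group name and core list are placeholders):

    # One-time admin step: create a cpuset cgroup the unprivileged user may manage
    sudo cgcreate -t $USER:$USER -a $USER:$USER -g cpuset:triton_cpu
    # The user then restricts it to 4 cores (a cpuset also needs a memory node)
    cgset -r cpuset.cpus=0-3 triton_cpu
    cgset -r cpuset.mems=0 triton_cpu
    # Launch the CPU-only server inside that cgroup
    cgexec -g cpuset:triton_cpu tritonserver --model-repository=/models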

@dzier
Contributor

dzier commented Sep 10, 2021

Closing due to inactivity.

@dzier dzier closed this as completed Sep 10, 2021
@kpedro88
Contributor Author

kpedro88 commented Nov 8, 2022

For future reference: this is now implemented in backend-specific model configuration settings for ONNX and TensorFlow.
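
As an illustration, a sketch of what this looks like for the ONNX Runtime backend, which exposes thread-count parameters in the model's config.pbtxt (the model name and values are placeholders; the TensorFlow backend has analogous settings):

    # models/mymodel/config.pbtxt (sketch)
    name: "mymodel"
    platform: "onnxruntime_onnx"
    parameters { key: "intra_op_thread_count" value: { string_value: "4" } }
    parameters { key: "inter_op_thread_count" value: { string_value: "1" } }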
