Control number of threads used by CPU server #2018
Comments
Triton "core" itself uses one CPU thread per model instance. This thread batches/schedules and issues requests to that model instance. How busy this thread is depends on the number of requests to the model. Triton HTTP/REST endpoint and GRPC endpoint will also use threads. Each FW backend will also use threads in a FW specific manner and Triton may or may not be able to control that. The total number of threads doesn't really matter does it? As you say, perhaps what you want to be able to control is to ensure that a single request doesn't completely monopolize the server so that other requests don't make progress. We are working on a more general solution to this problem in the rate-limiter. See #1507 (comment) |
I think I didn't describe the use case well enough, so here is some more information (it's certainly a non-standard use). Having already written the C++ code to prepare inputs for, and process outputs from, the Triton server, we would prefer not to have to write a separate piece of C++ code that uses the ML frameworks natively when a GPU server isn't available. The CPU server capability is a nice way to reuse our existing, simpler C++ implementations.

However, in a production scenario there may be multiple jobs sharing the same physical node; for example, an 8-core node could be split into two 4-core jobs. We therefore need to ensure that, if each job sets up its own Triton server using the local CPU, each server only tries to use 4 cores. (Ideally it should be restricted even further, to 1 core per request, since the other cores might be scheduled by other CPU processes running in the job; the rate-limiter may help here.) If we launch the server using Docker (or equivalent), there are arguments like …
That seems like a restriction that should be controlled outside of Triton. Perhaps by cgroups? https://serverfault.com/questions/478946/how-can-i-create-and-use-linux-cgroups-as-a-non-root-user
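For example, a minimal sketch assuming the server runs in Docker (whose CPU flags are implemented via cgroups underneath); the image tag and model-repository path are placeholders:

```sh
# Confine the whole Triton container to 4 CPUs (cores 0-3).
docker run --rm --cpus=4 --cpuset-cpus=0-3 \
  -v /path/to/model_repository:/models \
  nvcr.io/nvidia/tritonserver:<xx.yy>-py3 \
  tritonserver --model-repository=/models
```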
Closing due to inactivity.
For future reference: this is now implemented in backend-specific model configuration settings for ONNX and TensorFlow.
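For the ONNX Runtime backend these take the form of per-model `parameters` entries in `config.pbtxt` (parameter names as I recall them from the onnxruntime_backend README, so double-check there; the model name and values are placeholders). The TensorFlow backend exposes analogous intra/inter-op settings.

```
# Hypothetical config.pbtxt fragment capping ONNX Runtime's own thread pools.
name: "my_cpu_model"
platform: "onnxruntime_onnx"
parameters { key: "intra_op_thread_count" value: { string_value: "4" } }
parameters { key: "inter_op_thread_count" value: { string_value: "1" } }
```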
Is your feature request related to a problem? Please describe.
https://docs.nvidia.com/deeplearning/triton-inference-server/master-user-guide/docs/run.html#running-triton-on-a-system-without-a-gpu describes how to run a Triton server on a system with no GPU, so that the CPU is used to perform inference. I have found this very useful as a rapid testing option. I would also like to use this as a fallback option in a production context; however, I do not see any switches to control how many CPU threads the CPU-only server is allowed to use.
Describe the solution you'd like
An option to control the maximum number of threads used by a CPU-only server. (At a deeper level, it's important to understand whether a server limited to 4 threads will use all of them for a single request, or will run one request per thread, up to the limit.)
Describe alternatives you've considered
I looked through the documentation to see whether this was already available, but did not find anything.
Additional context
N/A