Facing OOM while running TF models #1499
Comments
TF dynamically allocates memory to perform inference, and the amount of memory needed can vary over the lifetime of a given inference execution. So the high-water allocation will occur when all 4 models are simultaneously running inferences (and perhaps even when each inference is in a particular part of its execution). The dynamic allocation also increases with larger batch sizes in the requests. Do you see OOM if you do 1 request at a time across the 4 models? 2 at a time? You can use multiple simultaneous runs of perf-client to put a lot of simultaneous model load on the server. Characterizing model memory usage is a difficult problem, but we are working on characterization tools that may help in cases like this to understand what the high-water memory usage is, and so understand which models can co-exist.
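One way to apply the simultaneous load suggested above is to launch one perf-client run per model in parallel. The following is a minimal sketch, not part of the original thread: the model names, endpoint, batch size, and concurrency value are placeholders, and the flags shown are the commonly documented perf_client options, so adjust them to your setup.

```python
# Minimal sketch: start one perf_client process per model so all four models
# receive inference load at the same time. Model names, URL, batch size, and
# concurrency are hypothetical placeholders; perf_client is assumed to be on
# PATH (it ships with the Triton client utilities).
import subprocess

MODELS = ["model_a", "model_b", "model_c", "model_d"]  # hypothetical model names
SERVER_URL = "localhost:8001"                          # gRPC endpoint (assumed)

procs = [
    subprocess.Popen([
        "perf_client",
        "-m", model,
        "-u", SERVER_URL,
        "-i", "gRPC",
        "-b", "8",                    # larger batches increase dynamic allocation
        "--concurrency-range", "4",   # simultaneous outstanding requests per model
    ])
    for model in MODELS
]

# Wait for all four load generators to finish.
for p in procs:
    p.wait()
```

Running this while watching GPU memory should show whether the high-water allocation only appears when all four models are loaded simultaneously.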
No, I don't see the OOM issue until I start simultaneous requests to all 4 models. It works fine as long as only 3 models are called simultaneously.
So then it appears that the high-water allocation of running all 4 models at the same time exceeds the GPU memory available. See #1507 (comment) for some comments on how Triton may be able to handle this problem in the future.
Description
I'm running 4 models on the NVIDIA Triton Inference Server and see 465MB of free memory (FB memory usage) after all 4 models are loaded and warmed up (each has served at least one inference request from a client). It runs successfully for some time, but then suddenly starts hitting OOM issues. Although the container heals itself and resumes serving successfully afterwards, the OOMs are worrisome for a production environment. I've seen 4 separate spans of OOMs during a 12-hour run.
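To check whether the reported 465MB of free FB memory is really the high-water mark, it may help to poll GPU memory while the models are serving traffic. This is a minimal sketch using the NVML Python bindings (pynvml), not something from the original report; GPU index 0 and a one-second polling interval are assumptions.

```python
# Minimal sketch: poll GPU framebuffer usage while the models are under load
# and record the peak (high-water) usage. Requires the pynvml package;
# GPU index 0 and the 1s interval are assumptions.
import time
import pynvml

pynvml.nvmlInit()
handle = pynvml.nvmlDeviceGetHandleByIndex(0)

peak_used = 0
try:
    while True:
        info = pynvml.nvmlDeviceGetMemoryInfo(handle)
        peak_used = max(peak_used, info.used)
        print(f"used={info.used / 2**20:.0f} MiB  "
              f"free={info.free / 2**20:.0f} MiB  "
              f"peak={peak_used / 2**20:.0f} MiB")
        time.sleep(1)
except KeyboardInterrupt:
    pynvml.nvmlShutdown()
```

If the peak usage climbs toward the full GPU capacity only while all four models are handling requests, that would match the high-water explanation in the comments above.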
Triton Information
What version of Triton are you using?
tritonserver:20.03-py3
Are you using the Triton container or did you build it yourself?
Using Triton container
To Reproduce
Steps to reproduce the behavior.
Load TF models that consume most of the GPU memory, leaving around 465MB free after each model has been loaded and has served at least one request. Keep sending inference requests for some time; OOM will be seen once or more.
Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).
Expected behavior
A container with 465MB of free GPU memory after serving inference requests for the same models shouldn't run into OOM issues.