
Facing OOM while running TF models #1499

Closed
yasirofficial opened this issue May 18, 2020 · 3 comments

@yasirofficial

Description
I'm running 4 models on the NVIDIA Triton inference server and see 465 MB of free GPU memory (FB memory usage) after all 4 models are loaded and warmed up (each has served at least one client inference request). The server runs successfully for a while, but then suddenly starts hitting OOM errors. Although the container recovers and resumes serving afterwards, the OOMs are worrisome for a production environment. I hit 4 separate OOM episodes during a 12-hour run.
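A minimal way to watch the free FB memory over time while the server is under load (the one-second interval and log file name are arbitrary choices for this sketch):

```
# Log GPU memory once per second to spot allocation spikes;
# the interval and output file are placeholders.
nvidia-smi --query-gpu=timestamp,memory.used,memory.free \
           --format=csv -l 1 > gpu_mem.log
```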

Triton Information
What version of Triton are you using?
tritonserver:20.03-py3

Are you using the Triton container or did you build it yourself?
Using Triton container

To Reproduce
Steps to reproduce the behavior.
Load TF models that take up enough GPU memory to leave only about 465 MB free after each model has loaded and served at least one request. Keep sending inference requests for some time; an OOM will occur once or more.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Expected behavior
A container with 465 MB of free GPU memory after serving inference requests for the same models shouldn't run into OOM issues.

@deadeyegoodwin
Contributor

TF dynamically allocates memory to perform inference... and it is likely that the amount of memory needed can vary over the lifetime of a given inference execution. So the high-water allocation will occur when all 4 models are simultaneously running inferences (and perhaps even when each inference is in a particular part of the execution). The dynamic allocation will also increase with larger batch sizes in the requests.
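One knob that may help bound TF's dynamic allocation is the server's --tf-gpu-memory-fraction option, assuming it is available in the 20.03 release (check the server's --help output). A rough sketch:

```
# Launch the server with TF pinned to a fixed slice of GPU memory.
# The binary name (trtserver vs tritonserver), the /models path and
# the 0.25 fraction are illustrative; verify against your release.
trtserver --model-repository=/models \
          --tf-gpu-memory-fraction=0.25
```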

Do you see OOM if you do 1 request at a time across the 4 models? 2 at a time? You can use multiple simultaneous runs of perf_client to put heavy concurrent load on the server.
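For example, a sketch of loading all 4 models at once (model names, endpoint, and the -t concurrency value are placeholders; flag spellings should be checked against the 20.03 perf_client):

```
# Run one perf_client per model in parallel; adjust model names,
# server endpoint and concurrency (-t) to match your setup.
for model in model_a model_b model_c model_d; do
  perf_client -m "$model" -u localhost:8001 -i grpc -t 4 &
done
wait
```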

Characterizing model memory usage is a difficult problem, but we are working on a characterization tool that may help in cases like this to understand what the high-water memory usage is, and therefore which models can co-exist.

@yasirofficial
Author

No, I don't see the OOM issue until I start sending simultaneous requests to all 4 models. It works fine with any 3 models called simultaneously.

@deadeyegoodwin
Contributor

So then it appears that the high-water allocation of running all 4 models at the same time exceeds the GPU memory available. See #1507 (comment) for some comments on how Triton may be able to handle this problem in the future.
