
Facing OOM while running TF models #1499

Closed
yasirofficial opened this issue May 18, 2020 · 3 comments

@yasirofficial

Description
I'm running 4 models on the NVIDIA Triton inference server and see 465 MB of free GPU memory (FB memory usage) after all 4 models are loaded and warmed up (each has served at least one client inference request). The server runs successfully for a while, but then suddenly starts hitting OOM errors. Although the container recovers and resumes serving afterwards, the OOMs are worrisome for a production environment. I hit 4 separate OOM episodes during a 12-hour run.
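A minimal way to watch the free FB memory over time while the server is under load (the one-second interval and log file name are arbitrary choices for this sketch):

```
# Log GPU memory once per second to spot allocation spikes;
# the interval and output file are placeholders.
nvidia-smi --query-gpu=timestamp,memory.used,memory.free \
           --format=csv -l 1 > gpu_mem.log
```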

Triton Information
What version of Triton are you using?
tritonserver:20.03-py3

Are you using the Triton container or did you build it yourself?
Using Triton container

To Reproduce
Steps to reproduce the behavior.
Load TF models that take up enough GPU memory to leave only about 465 MB free after each model has loaded and served at least one request. Keep sending inference requests for some time; an OOM will occur once or more.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

Expected behavior
A container with 465 MB of free GPU memory after serving inference requests for the same models shouldn't run into OOM issues.

@deadeyegoodwin
Contributor

TF dynamically allocates memory to perform inference... and it is likely that the amount of memory needed can vary over the lifetime of a given inference execution. So the high-water allocation will occur when all 4 models are simultaneously running inferences (and perhaps even when each inference is in a particular part of the execution). The dynamic allocation will also increase with larger batch sizes in the requests.
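One knob that may help bound TF's dynamic allocation is the server's --tf-gpu-memory-fraction option, assuming it is available in the 20.03 release (check the server's --help output). A rough sketch:

```
# Launch the server with TF pinned to a fixed slice of GPU memory.
# The binary name (trtserver vs tritonserver), the /models path and
# the 0.25 fraction are illustrative; verify against your release.
trtserver --model-repository=/models \
          --tf-gpu-memory-fraction=0.25
```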

Do you see OOM if you do 1 request at a time across the 4 models? 2 at a time? You can use multiple simultaneous runs of perf_client to put heavy concurrent load on the server.
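For example, a sketch of loading all 4 models at once (model names, endpoint, and the -t concurrency value are placeholders; flag spellings should be checked against the 20.03 perf_client):

```
# Run one perf_client per model in parallel; adjust model names,
# server endpoint and concurrency (-t) to match your setup.
for model in model_a model_b model_c model_d; do
  perf_client -m "$model" -u localhost:8001 -i grpc -t 4 &
done
wait
```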

Characterizing model memory usage is a difficult problem, but we are working on a characterization tool that may help in cases like this to understand what the high-water memory usage is, and therefore which models can co-exist.

@yasirofficial
Author

No, I don't see the OOM issue until I start sending simultaneous requests to all 4 models. It works fine with any 3 models called simultaneously.

@deadeyegoodwin
Contributor

So then it appears that the high-water allocation of running all 4 models at the same time exceeds the GPU memory available. See #1507 (comment) for some comments on how Triton may be able to handle this problem in the future.
