
General observation of Triton Server OOMs. #1507

Closed
whatdhack opened this issue May 19, 2020 · 9 comments

Comments

@whatdhack

Description
We have observed that Triton Server OOMs under certain conditions. The OOMs generally occur when the aggregate memory footprint of the models exceeds the GPU memory and all the models are exercised simultaneously.

We have used TF, but the behavior should be similar for PT. If we take a TF SavedModel and load it in plain TF, each model works without OOM at, say, 7 GB. If we then load, say, 4 of these models into Triton, they load without any problem. But once we start inferencing on each of the models simultaneously, we observe intermittent OOMs.

I think the problem here is the lack of holistic memory management. Memory management is complicated by Triton's (and TF Serving's, for that matter) dynamic juggling of multiple models whose total memory footprint exceeds that of the GPU, combined with TF's own dynamic memory management (e.g. memory growth).

Warmup does not seem to help here.

Triton Information
What version of Triton are you using?
20.03

Are you using the Triton container or did you build it yourself?
20.03

To Reproduce
The general way to reproduce this problem is as follows.

  1. Load a number of models whose aggregate memory footprint is higher than the GPU memory, say 5 models each with a footprint of 5 GB on a T4.
  2. Exercise each model simultaneously (say at 5 requests per second per model).
  3. Monitor the log, and you will start seeing OOMs.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

You can use TF with Mask R-CNN, which is known to be resource intensive.
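For illustration, a minimal config.pbtxt for one of the TF SavedModel models might look like the sketch below. The model name, tensor names, and shapes are placeholders assumed for a Mask R-CNN-style model; they are not taken from the actual deployment described in this report.

  # config.pbtxt -- illustrative placeholder values only
  name: "maskrcnn_tf"
  platform: "tensorflow_savedmodel"
  max_batch_size: 1
  input [
    {
      name: "image_input"
      data_type: TYPE_FP32
      dims: [ 3, 1024, 1024 ]
    }
  ]
  output [
    {
      name: "detections"
      data_type: TYPE_FP32
      dims: [ 100, 6 ]
    }
  ]
  instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
  ]

Repeating a configuration like this for each of the 5 models (under different names) would give an aggregate footprint that exceeds the 16 GB of a T4, matching step 1 above.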

Expected behavior

Should not OOM.

@deadeyegoodwin
Contributor

deadeyegoodwin commented May 19, 2020

Your observations are correct, and I have made similar comments in other tickets. Triton relies on each framework backend to manage its own memory; unifying those backends to use a common allocator seems an unrealistic goal and also wouldn't solve the problem. Note that TensorRT does not have this problem since it does all allocation at load time. Two options are possible:

  • Some type of retry capability, so that when an inference fails Triton can back off and retry. The acceptable rate of retries could be a per-model option.
  • Rate-limiting across models. This has many benefits beyond what we are discussing here, but in this context it would allow you to load multiple models while placing limits on how many could be executing simultaneously. So if you knew that only 3 out of the 4 models could run simultaneously due to dynamic memory allocation, you could specify that in the configuration and thus avoid the OOM while still having access to all 4 models (see the sketch below).
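As a purely hypothetical sketch of that second option, using the rate-limiter syntax that later appeared in the model configuration: each instance could claim one unit of a shared resource, with the server-wide total capped at 3 so that at most 3 instances execute at once. The field names follow model_config.proto as I understand it; the resource name "gpu_slots" and the counts are illustrative assumptions, not part of any configuration discussed in this thread.

  # Sketch only: this instance claims 1 "gpu_slots" unit while executing.
  # With a server-wide total of 3 such units, at most 3 model instances
  # (across all models sharing this resource) can run concurrently.
  instance_group [
    {
      count: 1
      kind: KIND_GPU
      rate_limiter {
        resources [
          {
            name: "gpu_slots"
            count: 1
          }
        ]
        priority: 1
      }
    }
  ]

The server-wide resource total would be set when rate limiting is enabled at server startup; with only 3 units available, the fourth model remains loaded but waits for a slot instead of triggering an OOM.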

@nieksand
Contributor

nieksand commented May 21, 2020

Having a cross-model max in-flight request limit would be handy for me as well.

For our slow models, I'm currently using 20.03 with max_queue_size=1. My inference front-end treats queue size exceeded errors as retryable, hits the load balancer again, and hopefully directs the retried queries to an idle inference server.

This works quite well and has driven down tail latency under load. However, we host multiple models per inference server, so having a global limit would make that even more robust.
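For reference, the max_queue_size=1 setup described in this comment maps onto the dynamic-batching queue policy in the model configuration, roughly as in the minimal sketch below (assuming the default_queue_policy fields; batching-related settings are omitted):

  # Sketch only: reject a new request for this model as soon as one
  # request is already waiting in its queue, so the client can treat the
  # error as retryable and go back through the load balancer.
  dynamic_batching {
    default_queue_policy {
      max_queue_size: 1
    }
  }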

@jacobdang

Resolving this issue is critical for our application in medical image analysis, as we have several large 3D CNNs for analyzing CT images. Running them all at once will surely lead to OOM, even with a batch size of one. However, executing them one by one may lead to suboptimal GPU utilization. Hence, it would be very useful to schedule them properly with GPU RAM allocation taken into account.

@razorx89

Since this issue is somewhat critical for some folks, e.g. medical imaging with 3D networks and huge memory requirements, I am going to bump this back up.

I love using Triton for accelerating computation on medical research cohorts, but also for research purposes integrated into the clinical routine. At the moment, we have 2D networks (object detection, regression, classification) and 3D semantic segmentation networks. All of them run quite rarely, some every few minutes, others only upon GUI actions. I have also observed OOM crashes when one or more different models were queried at once, or even when the very same model with different versions was queried (e.g. an ensemble from different weights). This is also why, currently, the only way to perform an ensemble inference for 3D semantic segmentation is to do the individual requests on the client side and strictly separate the different model versions (ensuring that only one model runs in parallel). An ensemble model in Triton would be much better, especially regarding the large payload sent to/retrieved from the server. But this also only works if you have a single application sending requests.

I think one solution might be to explicitly define mutually exclusive access to a GPU instance. This could be implemented via a mutex, either system- or process-visible. Some models are absolutely fine to run in parallel (2D), but we need a way to tell the scheduler that some models should never get a chance to execute unless all of the memory can be allocated for a single model instance.

Example:

  instance_group [
    {
      count: 1
      kind: KIND_GPU
      exclusive_execution: true
    }
  ]

@deadeyegoodwin
Contributor

@razorx89 The rate limiter I mentioned above will be able to solve this problem (and other related issues). We are 50-75% done with the rate-limiter implementation, but unfortunately it has been delayed a couple of times due to other priorities. We do understand that the rate limiter is an important feature that many use cases (like yours) require, so we are hoping to get it completed as soon as we can.

@razorx89

Great to hear that it is being worked on! Any guess when this feature might be ready and released?

@deadeyegoodwin
Contributor

We are aiming for release in Feb. or March (that means it would likely be available 2-4 weeks earlier on the master branch) but can't commit to that.

@erfaneshrati

@deadeyegoodwin Do you have an update on this issue? Thanks.

@deadeyegoodwin
Contributor

The rate limiter is now integrated into the main branch. We still have some additional testing to perform, but it seems likely that at least some of the functionality will be enabled in the 21.09 or 21.10 releases.
