
General observation of Triton Server OOMs. #1507

Closed
whatdhack opened this issue May 19, 2020 · 9 comments

Comments

@whatdhack

Description
We have observed that Triton Server OOMs under certain conditions. The OOMs generally occur when the aggregate memory footprint of the models exceeds the GPU memory and all the models are exercised simultaneously.

We have used TF, but the behavior should be similar for PT. If we take a TF SavedModel and load it in plain TF, each model works without OOM at, say, 7 GB. If we then load, say, 4 of these models into Triton, they load without any problem. But once we start inferencing on each of the models simultaneously, we observe intermittent OOMs.

I think the problem here is the lack of holistic memory management. Memory management is complicated by Triton's (and TF Serving's, for that matter) dynamic juggling of multiple models whose total memory footprint exceeds that of the GPU, combined with TF's own dynamic memory management (e.g. memory growth).

Warmup does not seem to help here.

Triton Information
What version of Triton are you using?
20.03

Are you using the Triton container or did you build it yourself?
20.03

To Reproduce
The general way to reproduce this problem is as follows.

  1. Load a number of models whose aggregate memory footprint is higher than the GPU memory, say 5 models each with a footprint of 5 GB on a T4.
  2. Exercise each model simultaneously (say at 5 requests per second per model).
  3. Monitor the log, and you will start seeing OOMs.

Describe the models (framework, inputs, outputs), ideally include the model configuration file (if using an ensemble include the model configuration file for that as well).

You can use TF with Mask R-CNN, which is known to be resource intensive.
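For illustration, a minimal config.pbtxt for one of the TF SavedModel models might look like the sketch below. The model name, tensor names, and shapes are placeholders assumed for a Mask R-CNN-style model; they are not taken from the actual deployment described in this report.

  # config.pbtxt -- illustrative placeholder values only
  name: "maskrcnn_tf"
  platform: "tensorflow_savedmodel"
  max_batch_size: 1
  input [
    {
      name: "image_input"
      data_type: TYPE_FP32
      dims: [ 3, 1024, 1024 ]
    }
  ]
  output [
    {
      name: "detections"
      data_type: TYPE_FP32
      dims: [ 100, 6 ]
    }
  ]
  instance_group [
    {
      count: 1
      kind: KIND_GPU
    }
  ]

Repeating a configuration like this for each of the 5 models (under different names) would give an aggregate footprint that exceeds the 16 GB of a T4, matching step 1 above.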

Expected behavior

Should not OOM.

@deadeyegoodwin
Contributor

deadeyegoodwin commented May 19, 2020

Your observations are correct, and I have made similar comments in other tickets. Triton relies on each framework backend to manage its own memory; unifying those backends to use a common allocator seems an unrealistic goal and also wouldn't solve the problem. Note that TensorRT does not have this problem since it does all allocation at load time. Two options are possible:

  • Some type of retry capability, so that when an inference fails Triton can back off and retry. The acceptable rate of retries could be a per-model option.
  • Rate-limiting across models. This has many benefits beyond what we are discussing here, but in this context it would allow you to load multiple models while placing limits on how many could be executing simultaneously. So if you knew that only 3 out of the 4 models could run simultaneously due to dynamic memory allocation, you could specify that in the configuration and thus avoid the OOM while still having access to all 4 models (see the sketch below).
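As a purely hypothetical sketch of that second option, using the rate-limiter syntax that later appeared in the model configuration: each instance could claim one unit of a shared resource, with the server-wide total capped at 3 so that at most 3 instances execute at once. The field names follow model_config.proto as I understand it; the resource name "gpu_slots" and the counts are illustrative assumptions, not part of any configuration discussed in this thread.

  # Sketch only: this instance claims 1 "gpu_slots" unit while executing.
  # With a server-wide total of 3 such units, at most 3 model instances
  # (across all models sharing this resource) can run concurrently.
  instance_group [
    {
      count: 1
      kind: KIND_GPU
      rate_limiter {
        resources [
          {
            name: "gpu_slots"
            count: 1
          }
        ]
        priority: 1
      }
    }
  ]

The server-wide resource total would be set when rate limiting is enabled at server startup; with only 3 units available, the fourth model remains loaded but waits for a slot instead of triggering an OOM.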

@nieksand
Contributor

nieksand commented May 21, 2020

Having a cross-model max in-flight request limit would be handy for me as well.

For our slow models, I'm currently using 20.03 with max_queue_size=1. My inference front-end treats queue size exceeded errors as retryable, hits the load balancer again, and hopefully directs the retried queries to an idle inference server.

This works quite well and has driven down tail latency under load. However, we host multiple models per inference server, so having a global limit would make that even more robust.
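For reference, the max_queue_size=1 setup described in this comment maps onto the dynamic-batching queue policy in the model configuration, roughly as in the minimal sketch below (assuming the default_queue_policy fields; batching-related settings are omitted):

  # Sketch only: reject a new request for this model as soon as one
  # request is already waiting in its queue, so the client can treat the
  # error as retryable and go back through the load balancer.
  dynamic_batching {
    default_queue_policy {
      max_queue_size: 1
    }
  }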

@jacobdang

Resolving this issue is critical for our application in medical image analysis, as we have several large 3D CNNs for analyzing CT images. Running them all at once will surely lead to OOM, even with a batch size of one. However, executing them one by one may lead to suboptimal GPU utilization. Hence, it would be very useful to schedule them properly with GPU RAM allocation taken into account.

@razorx89

Since this issue is somewhat critical for some folks, e.g. medical imaging with 3D networks and huge memory requirements, I am going to bump this back up.

I love using Triton for accelerating computation on medical research cohorts, but also for research purposes integrated into the clinical routine. At the moment, we have 2D networks (object detection, regression, classification) and 3D semantic segmentation networks. All of them run quite rarely, some every few minutes, others only upon GUI actions. I have also observed OOM crashes when one or more different models were queried at once, or even when the very same model with different versions was queried (e.g. an ensemble from different weights). This is also why, currently, the only way to perform an ensemble inference for 3D semantic segmentation is to do the individual requests on the client side and strictly separate the different model versions (ensuring that only one model runs in parallel). An ensemble model in Triton would be much better, especially regarding the large payload sent to/retrieved from the server. But this also only works if you have a single application sending requests.

I think one solution might be to explicitly define mutually exclusive access to a GPU instance. This could be implemented via a mutex, either system- or process-visible. Some models are absolutely fine to run in parallel (2D), but we need a way to tell the scheduler that some models should never get a chance to execute unless all of the memory can be allocated for a single model instance.

Example:

  instance_group [
    {
      count: 1
      kind: KIND_GPU
      exclusive_execution: true
    }
  ]

@deadeyegoodwin
Contributor

@razorx89 The rate limiter I mentioned above will be able to solve this problem (and other related issues). We are 50-75% done with the rate-limiter implementation, but unfortunately it has been delayed a couple of times due to other priorities. We do understand that the rate limiter is an important feature that many use cases (like yours) require, so we are hoping to get it completed as soon as we can.

@razorx89

Great to hear that it is being worked on! Any guess when this feature might be ready and released?

@deadeyegoodwin
Contributor

We are aiming for release in Feb. or March (that means it would likely be available 2-4 weeks earlier on the master branch) but can't commit to that.

@erfaneshrati

@deadeyegoodwin Do you have an update on this issue? Thanks.

@deadeyegoodwin
Contributor

The rate limiter is now integrated into the main branch. We still have some additional testing to perform, but it seems likely that at least some of the functionality will be enabled in the 21.09 or 21.10 releases.
