
ensemble model question and model priority #1194

Closed

mengkai94 opened this issue Mar 17, 2020 · 5 comments

@mengkai94

1. Ensemble model question:
An ensemble model represents a pipeline of one or more models. Can the models in an ensemble be distributed across different GPUs?

2. Model priority:
TRTIS can set a model priority, and model priority maps to CUDA stream priority. Why does model priority only work for TensorRT models?

@deadeyegoodwin
Contributor

  1. Yes, you can place a model on specific GPU(s) using the instance_group model configuration options: https://docs.nvidia.com/deeplearning/sdk/tensorrt-inference-server-guide/docs/model_configuration.html#instance-groups
    This controls where the model runs, whether or not it is part of an ensemble (see the config sketch after this list).

  2. We can only set the stream priority when we are able to create the CUDA stream ourselves. As far as we know, the other model frameworks manage their CUDA streams themselves. Please let us know if you have some insight into how to do this for a particular framework.
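
For reference, here is a minimal config.pbtxt sketch of such a placement, assuming a machine with two GPUs and one model instance pinned to each (the counts and GPU indices are only illustrative):

```
# instance_group controls where instances of this model run.
# Place one instance on GPU 0 and one instance on GPU 1.
instance_group [
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 0 ]
  },
  {
    count: 1
    kind: KIND_GPU
    gpus: [ 1 ]
  }
]
```

Each model in an ensemble has its own configuration, so different steps of the pipeline can be pinned to different GPUs this way.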

Note that setting the CUDA stream priority doesn't really do that much. In 20.03 we added some additional prioritization options to the dynamic batch scheduler; they only prioritize within a model, not across models, but may be of interest to you (a sketch follows): https://github.com/NVIDIA/tensorrt-inference-server/blob/master/docs/model_configuration.rst#dynamic-batcher
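
As a rough illustration of those 20.03 options, a dynamic_batching section along these lines defines two priority levels within a single model's queue (the exact values are only an example; individual requests then carry a priority when they are submitted):

```
dynamic_batching {
  preferred_batch_size: [ 4, 8 ]
  max_queue_delay_microseconds: 100

  # Two priority levels for this model's queue; requests that don't
  # specify a priority fall into level 2 (the lower priority).
  priority_levels: 2
  default_priority_level: 2
}
```

Again, this only orders requests within one model's queue; it does not prioritize one model over another.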

@mengkai94
Author

Within an ensemble, how is the preceding model's output tensor transferred to the following model's input tensor when the models are on different GPUs?
Is the output tensor copied to system memory and then used as the input tensor?

@deadeyegoodwin
Contributor

A peer-to-peer copy should be performed if the GPUs support it. Otherwise the tensor will have to be staged through CPU memory. Are you seeing different behavior?

@mys007

mys007 commented Dec 10, 2020

Sorry for resurrecting this old issue, but is there currently any way to prioritize computation (not just streams) across models? Say, for example, that emptying some models' queues should be prioritized over handling requests from other models' queues.

@deadeyegoodwin
Contributor

A rate limiter is being worked on: #1507 (comment)
