Roadmap 🏎️

Steven Dake edited this page Aug 31, 2023 · 7 revisions

Overview

CUDA Threading Model: see [1].

[1] GPU Performance Background User's Guide

Inference

Measurement

To improve performance, you first need to measure it: any claimed improvement must be proven with data.

Tools

Currently, our recommendation is to use Kineto to profile PyTorch-based ML/AI implementations, TensorBoard to analyze performance via visualization, and the TensorBoard profiler plugin as the integration glue. Kineto profiles both CUDA and CPU well enough. perf is superior for CPU profiling, although I am unaware of any perf integration with CUDA or PyTorch.
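As a minimal sketch of the recommended workflow, the snippet below profiles a tiny stand-in workload with the Kineto-backed `torch.profiler` and writes a trace that the TensorBoard profiler plugin can visualize. The model, input shapes, and `./log` output directory are illustrative assumptions, not part of any real benchmark here.

```python
# Hedged sketch: profile a small PyTorch workload with torch.profiler (Kineto)
# and export a trace for the TensorBoard profiler plugin.
import torch
from torch.profiler import ProfilerActivity, profile, tensorboard_trace_handler

model = torch.nn.Linear(512, 512)   # stand-in for a real inference model
inputs = torch.randn(8, 512)        # illustrative batch

activities = [ProfilerActivity.CPU]
if torch.cuda.is_available():
    activities.append(ProfilerActivity.CUDA)  # Kineto captures CUDA too

with profile(
    activities=activities,
    on_trace_ready=tensorboard_trace_handler("./log"),  # TensorBoard reads this dir
    record_shapes=True,
) as prof:
    with torch.no_grad():
        model(inputs)

# Print the top operators by self CPU time to the console.
print(prof.key_averages().table(sort_by="self_cpu_time_total", row_limit=5))
```

After a run, `tensorboard --logdir ./log` should pick up the trace under the profiler plugin's tab.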

Profile requirements.txt

ubuntu@20230828-gpudev:~/profile$ cat requirements.txt
tensorboard
torch_tb_profiler
vllm
tensorboard-plugin-profile
--extra-index-url https://download.pytorch.org/whl/nightly/cu118
torch
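One possible way to set up this environment from the requirements.txt above (the `v-profile` venv name matches the prompt shown in the pip list below; the `./log` trace directory is an assumption):

```shell
# Hypothetical setup: create the profiling venv and install the stack above.
python3 -m venv v-profile
. v-profile/bin/activate
pip install -r requirements.txt

# Launch TensorBoard against the profiler trace directory (./log assumed).
tensorboard --logdir ./log
```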

With the following dependencies installed:

(v-profile) ubuntu@20230828-gpudev:~/profile$ pip list
Package                    Version    Editable project location
-------------------------- ---------- -------------------------
absl-py                    1.4.0
aiosignal                  1.3.1
anyio                      3.7.1
attrs                      23.1.0
cachetools                 5.3.1
certifi                    2023.7.22
charset-normalizer         3.2.0
click                      8.1.7
cmake                      3.27.2
exceptiongroup             1.1.3
fastapi                    0.103.0
filelock                   3.12.3
frozenlist                 1.4.0
fsspec                     2023.6.0
google-auth                2.22.0
google-auth-oauthlib       1.0.0
grpcio                     1.57.0
gviz-api                   1.10.0
h11                        0.14.0
huggingface-hub            0.16.4
idna                       3.4
Jinja2                     3.1.2
jsonschema                 4.19.0
jsonschema-specifications  2023.7.1
lit                        16.0.6
Markdown                   3.4.4
MarkupSafe                 2.1.3
mpmath                     1.3.0
msgpack                    1.0.5
networkx                   3.1
ninja                      1.11.1
numpy                      1.25.2
nvidia-cublas-cu11         11.10.3.66
nvidia-cuda-cupti-cu11     11.7.101
nvidia-cuda-nvrtc-cu11     11.7.99
nvidia-cuda-runtime-cu11   11.7.99
nvidia-cudnn-cu11          8.5.0.96
nvidia-cufft-cu11          10.9.0.58
nvidia-curand-cu11         10.2.10.91
nvidia-cusolver-cu11       11.4.0.1
nvidia-cusparse-cu11       11.7.4.91
nvidia-nccl-cu11           2.14.3
nvidia-nvtx-cu11           11.7.91
oauthlib                   3.2.2
packaging                  23.1
pandas                     2.0.3
pip                        22.0.2
protobuf                   4.24.2
psutil                     5.9.5
pyasn1                     0.5.0
pyasn1-modules             0.3.0
pydantic                   1.10.12
python-dateutil            2.8.2
pytz                       2023.3
PyYAML                     6.0.1
ray                        2.6.3
referencing                0.30.2
regex                      2023.8.8
requests                   2.31.0
requests-oauthlib          1.3.1
rpds-py                    0.10.0
rsa                        4.9
safetensors                0.3.3
sentencepiece              0.1.99
setuptools                 59.6.0
six                        1.16.0
sniffio                    1.3.0
starlette                  0.27.0
sympy                      1.12
tensorboard                2.14.0
tensorboard-data-server    0.7.1
tensorboard-plugin-profile 2.13.1
tokenizers                 0.13.3
torch                      2.0.1
torch-tb-profiler          0.4.1
tqdm                       4.66.1
transformers               4.32.1
triton                     2.0.0
typing_extensions          4.7.1
tzdata                     2023.3
urllib3                    1.26.16
uvicorn                    0.23.2
vllm                       0.1.4      /home/ubuntu/repos/vllm
Werkzeug                   2.3.7
wheel                      0.37.1
xformers                   0.0.21

vLLM Performance Results

Inference Performance Study