# Roadmap 🏎️

Steven Dake edited this page Aug 31, 2023
- Comparing long-range context attention
- NVIDIA CUDA Performance
- NVIDIA Matrix Multiplication Background User's Guide
- NVIDIA GPU Performance Background User's Guide
- Knowledge Distillation
- Transfer Learning
- Batch Normalization
- Attention
- Deterministic approximations using unsupervised domain adaptation
- Arxiv (paper may be coming)
- Asynchronous inferencing / preemption
- Property Distillation
- Geometric Distillation
- Future
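Several of the roadmap items above center on distillation. As a rough illustration of the core idea only (this is the standard soft-target formulation, not code from this project; the temperature and weighting values are arbitrary):

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Classic knowledge-distillation loss: KL divergence between the
    teacher's and student's temperature-softened distributions, blended
    with ordinary hard-label cross-entropy. T and alpha are illustrative."""
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)  # rescale so gradient magnitude is comparable across temperatures
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1.0 - alpha) * hard

# Toy usage: random logits for a 4-way classification batch of 8
student = torch.randn(8, 4)
teacher = torch.randn(8, 4)
labels = torch.randint(0, 4, (8,))
loss = distillation_loss(student, teacher, labels)
```

Property and geometric distillation would swap in different targets for the soft term, but the student/teacher blending structure stays the same.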
To improve performance, you need to prove it. Currently, our recommendation is to use Kineto to profile PyTorch-based ML/AI implementations, TensorBoard to analyze performance with visualization, and the TensorBoard profiler plugin as integration glue. Kineto profiles CUDA and CPU well enough. perf is superior, although I am unaware of any perf integration with CUDA or PyTorch.
- Holistic Trace Analysis - looks very well done, a must to eval
- Perf is always exceptional and improves, like a fine wine, with age
- TensorBoard visualization
- PyTorch TensorBoard profiler
- PyTorch Profiler
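The pieces above wire together in a few lines. A minimal sketch, assuming a CPU-only run for portability (the workload and log directory name are illustrative; on a GPU host you would add `ProfilerActivity.CUDA` to the activities list):

```python
import torch
from torch.profiler import profile, ProfilerActivity, tensorboard_trace_handler

x = torch.randn(256, 256)

# Kineto does the actual trace collection under the hood; the trace
# handler writes a file the TensorBoard profiler plugin can load.
with profile(
    activities=[ProfilerActivity.CPU],  # add ProfilerActivity.CUDA on a GPU host
    on_trace_ready=tensorboard_trace_handler("./log/profile-demo"),
    record_shapes=True,
) as prof:
    for _ in range(10):
        y = torch.mm(x, x)

# Quick text summary without leaving the terminal
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=5))
```

From there, `tensorboard --logdir ./log` surfaces the trace under the profiler tab, assuming torch-tb-profiler is installed as in the requirements below.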
```
ubuntu@20230828-gpudev:~/profile$ cat requirements.txt
tensorboard
torch_tb_profiler
vllm
tensorboard-plugin-profile
--extra-index-url https://download.pytorch.org/whl/nightly/cu118
torch
```
With the following deps:
```
(v-profile) ubuntu@20230828-gpudev:~/profile$ pip list
Package                    Version    Editable project location
-------------------------- ---------- -------------------------
absl-py                    1.4.0
aiosignal                  1.3.1
anyio                      3.7.1
attrs                      23.1.0
cachetools                 5.3.1
certifi                    2023.7.22
charset-normalizer         3.2.0
click                      8.1.7
cmake                      3.27.2
exceptiongroup             1.1.3
fastapi                    0.103.0
filelock                   3.12.3
frozenlist                 1.4.0
fsspec                     2023.6.0
google-auth                2.22.0
google-auth-oauthlib       1.0.0
grpcio                     1.57.0
gviz-api                   1.10.0
h11                        0.14.0
huggingface-hub            0.16.4
idna                       3.4
Jinja2                     3.1.2
jsonschema                 4.19.0
jsonschema-specifications  2023.7.1
lit                        16.0.6
Markdown                   3.4.4
MarkupSafe                 2.1.3
mpmath                     1.3.0
msgpack                    1.0.5
networkx                   3.1
ninja                      1.11.1
numpy                      1.25.2
nvidia-cublas-cu11         11.10.3.66
nvidia-cuda-cupti-cu11     11.7.101
nvidia-cuda-nvrtc-cu11     11.7.99
nvidia-cuda-runtime-cu11   11.7.99
nvidia-cudnn-cu11          8.5.0.96
nvidia-cufft-cu11          10.9.0.58
nvidia-curand-cu11         10.2.10.91
nvidia-cusolver-cu11       11.4.0.1
nvidia-cusparse-cu11       11.7.4.91
nvidia-nccl-cu11           2.14.3
nvidia-nvtx-cu11           11.7.91
oauthlib                   3.2.2
packaging                  23.1
pandas                     2.0.3
pip                        22.0.2
protobuf                   4.24.2
psutil                     5.9.5
pyasn1                     0.5.0
pyasn1-modules             0.3.0
pydantic                   1.10.12
python-dateutil            2.8.2
pytz                       2023.3
PyYAML                     6.0.1
ray                        2.6.3
referencing                0.30.2
regex                      2023.8.8
requests                   2.31.0
requests-oauthlib          1.3.1
rpds-py                    0.10.0
rsa                        4.9
safetensors                0.3.3
sentencepiece              0.1.99
setuptools                 59.6.0
six                        1.16.0
sniffio                    1.3.0
starlette                  0.27.0
sympy                      1.12
tensorboard                2.14.0
tensorboard-data-server    0.7.1
tensorboard-plugin-profile 2.13.1
tokenizers                 0.13.3
torch                      2.0.1
torch-tb-profiler          0.4.1
tqdm                       4.66.1
transformers               4.32.1
triton                     2.0.0
typing_extensions          4.7.1
tzdata                     2023.3
urllib3                    1.26.16
uvicorn                    0.23.2
vllm                       0.1.4      /home/ubuntu/repos/vllm
Werkzeug                   2.3.7
wheel                      0.37.1
xformers                   0.0.21
```
- Model: meta-llama/llama-2-7b-chat-hf
- Study Author: Steven Dake
- License: CC-BY-ND-4.0