# ⚙️ Benchmarking ML Engines

This file contains benchmark numbers for different engines and precisions. Since many upgrades have been made to the models and engines, these results are now archived. The latest implementation, however, does not include benchmarks for Metal or the Mac CPU, so if you want to see those, feel free to check them out here.

## A100 80GB Inference Bench:

Environment:

- Model: LLAMA-2-7B
- CUDA Version: 11.7
- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cuda --prompt 'Write an essay about the transformer model architecture'`
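The "mean ± std" figures reported below are aggregated over the 10 repetitions. A minimal sketch of how such a summary could be computed (the `summarize` helper and the sample numbers here are illustrative, not part of `benchmark.sh`):

```python
import statistics

def summarize(tokens_per_sec):
    """Return (mean, sample standard deviation) over benchmark repetitions."""
    mean = statistics.mean(tokens_per_sec)
    std = statistics.stdev(tokens_per_sec)  # sample std, n - 1 denominator
    return mean, std

# e.g. throughput (tokens/s) from 10 repetitions of a 512-token generation
runs = [89.2, 91.0, 90.4, 92.1, 90.8, 89.9, 91.5, 90.2, 91.7, 90.6]
mean, std = summarize(runs)
print(f"{mean:.2f} \u00b1 {std:.2f}")  # prints "90.74 ± 0.88"
```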

Performance Metrics: (unit: Tokens / second)

| Engine                 | float32       | float16       | int8          | int4          |
| ---------------------- | ------------- | ------------- | ------------- | ------------- |
| candle                 | -             | 36.78 ± 2.17  | -             | -             |
| llama.cpp              | -             | -             | 79.15 ± 1.20  | 100.90 ± 1.46 |
| ctranslate             | 35.23 ± 4.01  | 55.72 ± 16.66 | 35.73 ± 10.87 | -             |
| onnx                   | -             | 54.16 ± 3.15  | -             | -             |
| transformers (pytorch) | 43.79 ± 0.61  | 46.39 ± 0.28  | 6.98 ± 0.05   | 21.72 ± 0.11  |
| vllm                   | 90.78 ± 1.60  | 90.54 ± 2.22  | -             | 114.69 ± 11.20 |
| exllamav2              | -             | -             | 121.63 ± 0.74 | 130.16 ± 0.35 |
| ctransformers          | -             | -             | 76.75 ± 10.36 | 84.26 ± 5.79  |
| AutoGPTQ               | 42.01 ± 1.03  | 30.24 ± 0.41  | -             | -             |
| AutoAWQ                | -             | -             | -             | 109.20 ± 3.28 |
| DeepSpeed              | -             | 81.44 ± 8.13  | -             | -             |
| PyTorch Lightning      | 24.85 ± 0.07  | 44.56 ± 2.89  | 10.50 ± 0.12  | 24.83 ± 0.05  |
| Optimum Nvidia         | 110.36 ± 0.52 | 109.09 ± 4.26 | -             | -             |
| Nvidia TensorRT-LLM    | 55.19 ± 1.03  | 85.03 ± 0.62  | 167.66 ± 2.05 | 235.18 ± 3.20 |

*(Data updated: 5th April 2024)*

## M2 MAX 32GB Inference Bench:

### CPU

Environment:

- Model: LLAMA-2-7B
- CUDA Version: NA
- Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device cpu --prompt 'Write an essay about the transformer model architecture'`

Performance Metrics: (unit: Tokens / second)

| Engine        | float32 | float16     | int8         | int4         |
| ------------- | ------- | ----------- | ------------ | ------------ |
| candle        | -       | 3.43 ± 0.02 | -            | -            |
| llama.cpp     | -       | -           | 13.24 ± 0.62 | 21.43 ± 0.47 |
| ctranslate    | -       | -           | 1.87 ± 0.14  | -            |
| ctransformers | -       | -           | 13.50 ± 0.48 | 20.57 ± 2.50 |

### GPU (Metal)

Command: `./benchmark.sh --repetitions 10 --max_tokens 512 --device metal --prompt 'Write an essay about the transformer model architecture'`

Performance Metrics: (unit: Tokens / second)

| Engine        | float32 | float16 | int8         | int4         |
| ------------- | ------- | ------- | ------------ | ------------ |
| llama.cpp     | -       | -       | 30.11 ± 0.45 | 44.27 ± 0.12 |
| ctransformers | -       | -       | 20.75 ± 0.36 | 34.04 ± 2.11 |

*(Data updated: 5th April 2024)*