
State of quantization #154

Open
RomeoV opened this issue Sep 27, 2023 · 3 comments

RomeoV commented Sep 27, 2023

Hello, first of all thanks for this awesome package - this is very impressive!

We often run into resource constraints when running larger models from Hugging Face, even just for inference.
A common strategy has been to apply quantization to model weights before running them.

Do you know whether quantization has been successfully applied anywhere in the Julia ML ecosystem? Is this something that might become part of this package in the future?
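
For concreteness, here is a minimal sketch of what symmetric int8 weight quantization looks like in plain Julia (illustrative only; these names are not part of this package):

```julia
# Illustrative sketch, not this package's API: symmetric int8 quantization
# of a single weight matrix, with dequantization back to Float32.
function quantize_int8(W::AbstractMatrix{Float32})
    scale = maximum(abs, W) / typemax(Int8)  # one scale for the whole tensor
    Wq = round.(Int8, W ./ scale)            # values land in [-127, 127]
    return Wq, scale
end

dequantize(Wq, scale) = Float32.(Wq) .* scale

W = randn(Float32, 64, 64)
Wq, s = quantize_int8(W)
maximum(abs, dequantize(Wq, s) .- W)  # per-element quantization error
```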

chengchingwen (Owner) commented

Quantization is widely used in some transformer implementations, such as llama.cpp, but supporting it in this package is currently a low priority on my list.

RomeoV commented Sep 28, 2023

Small update: it seems that CUDA.jl supports Float16 natively, but Float8 and Float4 are not even data types in Base Julia. There is an issue in CUDA.jl about supporting Float8, but it doesn't seem to be active.

However, without having looked into the exact specifics, it seems that supporting Float16 should be possible by loading the weights from a Float16-quantized model.
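
A rough sketch of that conversion, assuming the model is a Functors.jl-compatible tree of layers (as Flux and Transformers.jl layers are):

```julia
using Functors

# Sketch under the assumption that the model is a Functors-compatible
# struct: walk the layer tree and convert every Float32 array to Float16,
# leaving non-float fields (index arrays, configs, ...) untouched.
to_f16(model) = fmap(x -> x isa AbstractArray{Float32} ? Float16.(x) : x, model)
```

CUDA.jl handles Float16 arrays natively, so weights converted this way should be movable to the GPU as-is.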

RomeoV commented Sep 28, 2023

Grepping for Float32 in this repo suggests one would need to change

  • layers/embed.jl
  • layers/base.jl
  • huggingface/models/load.jl

and then the respective load.jl files for each model; a rough sketch of the kind of change is below.
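
To make the scope concrete, a hypothetical sketch of the direction (the function name and keyword here are illustrative, not the package's current API): replace hardcoded Float32 conversions with an element-type parameter threaded through weight loading.

```julia
# Hypothetical sketch; `load_weights_as` is an illustrative name, not the
# package's actual API. The idea: given an already-loaded state dict,
# convert every floating-point array to the requested element type T.
function load_weights_as(state::AbstractDict; T::Type{<:AbstractFloat} = Float16)
    return Dict(k => (v isa AbstractArray{<:AbstractFloat} ? T.(v) : v)
                for (k, v) in state)
end
```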
