Hello, first of all thanks for this awesome package - this is very impressive!
We often run into resource constraints when running larger models from huggingface, even just for inference.
A common strategy has been to apply quantization to model weights before running them.
Do you know if quantization has been successfully applied anywhere in the Julia ML ecosystem? Is this something that might potentially be part of this package in the future?
Small update: it seems that CUDA.jl supports Float16 natively, but Float8 and Float4 are not even datatypes in Base Julia -- there is an issue in CUDA.jl about supporting Float8, but it doesn't seem to be active.
However, without looking into the exact specifics, supporting Float16 should be possible by loading the weights from a model quantized to Float16.
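For reference, a minimal sketch of what that could look like with plain Flux/Functors (nothing specific to this package); the `Chain` here is just a toy stand-in for a loaded pretrained model, and `to_f16` is a hypothetical helper:

```julia
using Flux  # assumes Flux.jl is installed (and CUDA.jl if you move to the GPU)

# Toy stand-in for a loaded model; in practice this would be the pretrained weights.
model = Chain(Dense(768 => 768, gelu), Dense(768 => 768))

# Recursively cast floating-point arrays to Float16, leaving everything else untouched.
to_f16(x::AbstractArray{<:AbstractFloat}) = Float16.(x)
to_f16(x) = x

model_f16 = fmap(to_f16, model)   # roughly halves the memory footprint of the weights
# model_f16 = model_f16 |> gpu    # CUDA.jl handles Float16 arrays natively
```

This only changes the storage type of the weights; whether the kernels actually run faster (or remain numerically stable) in Float16 would still need to be checked per model.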