GPTQ Quantization via from_pretrained: why enforcing fp16? #25888

Closed

HanGuo97 opened this issue Aug 31, 2023 · 10 comments · Fixed by #25894

Comments

@HanGuo97

Feature request

Hi, I noticed in the following line that the model has to be in fp16 before GPTQ quantization. I'm curious whether this condition can be dropped.

torch_dtype = torch.float16

Motivation

My use case runs into trouble with fp16 but works with bf16, and I noticed that if I simply remove this line and keep torch_dtype=None, everything runs fine.
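
For reference, here is a minimal sketch of the usage in question (placeholder model id); with the current code, the torch_dtype passed below is silently replaced by torch.float16:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# The request: honor the dtype passed here (e.g. bf16) instead of forcing fp16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    torch_dtype=torch.bfloat16,
)
```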

Your contribution

NA

@amyeroberts
Collaborator

cc @younesbelkada

@younesbelkada
Contributor

Hi!
That might be a copy-paste from the previous bnb integration, but I'm not sure. We should probably override it to torch.float16 only if torch_dtype=None. @SunMarc, what do you think? Maybe there is something I have overlooked about GPTQ.
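
A minimal sketch of that proposal (a hypothetical helper, not the actual transformers source):

```python
import logging

import torch

logger = logging.getLogger(__name__)


def resolve_torch_dtype(torch_dtype):
    # Hypothetical helper: fall back to fp16 only when the caller did not
    # request a dtype explicitly, and log what happened.
    if torch_dtype is None:
        torch_dtype = torch.float16
        logger.info(
            "Loading the model in torch.float16 for GPTQ quantization. "
            "Pass torch_dtype explicitly to override this default."
        )
    return torch_dtype
```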

@HanGuo97
Author

HanGuo97 commented Aug 31, 2023

Why override it if torch_dtype=None? I think fp32 runs just fine too?

@younesbelkada
Contributor

younesbelkada commented Aug 31, 2023

If we set it to float32 by default, it will create a lot of overhead from the non-linear modules (such as the embedding layer) being in fp32, making it impossible to fit some models on Google Colab, for example. Therefore, for bnb we set them to half precision, with a logger.info explaining what is happening under the hood.
(For bnb) you can always cast the non-linear modules to fp32 by passing torch_dtype=torch.float32.
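
A usage sketch of that bnb behavior, with a placeholder model id:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Non-linear modules (embeddings, norms, ...) default to half precision when
# quantizing with bnb; passing torch_dtype keeps them in fp32 instead.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # placeholder model for illustration
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    torch_dtype=torch.float32,
)
```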

@HanGuo97
Author

Understood -- thanks for the explanation!

So just to confirm: there are no correctness issues with using torch.float32; it's just that using fp16 instead can result in better efficiency for certain workloads?

If that’s the case, would it be more effective to add a warning when torch_dtype=None and suggest using fp16 for better efficiency? Personally, I prefer having fewer overrides, but I’m open to either approach.
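
A compact sketch of that alternative (hypothetical helper name; compare with the override sketch above):

```python
import logging

logger = logging.getLogger(__name__)


def warn_on_default_dtype(torch_dtype):
    # Hypothetical alternative: keep the caller's choice (including None)
    # untouched and only emit a hint about fp16 efficiency.
    if torch_dtype is None:
        logger.warning(
            "torch_dtype was not set; GPTQ kernels are typically fastest in "
            "torch.float16, so consider passing torch_dtype=torch.float16."
        )
    return torch_dtype
```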

@SunMarc
Member

SunMarc commented Aug 31, 2023

Hi @HanGuo97, the backend in the auto_gptq library has always used torch_dtype = torch.float16 by default, and I ran into a couple of issues with torch_dtype = torch.float32 in the past, most probably due to how the kernels are implemented. That is why I hardcoded it to torch.float16. But since it works for you, I will do as you suggested!

@HanGuo97
Author

Interesting, thanks for the clarification!

I briefly looked into the auto_gptq library, and I think they have a different code path depending on whether the data is in fp16 or not.

@SunMarc
Member

SunMarc commented Aug 31, 2023

Yeah, I must have forgotten to deactivate use_cuda_fp16 as it is enabled by default ;)

@HanGuo97
Author

HanGuo97 commented Aug 31, 2023

Oh yes you are right, I missed this :)

(In hindsight, it's a bit odd they set this to True by default when it clearly depends on the model.)

Edit: optimum will detect the proper flag here
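
A hedged sketch of how this resolves on the transformers side, assuming GPTQConfig exposes a use_cuda_fp16 flag (its default may differ across versions) and using a placeholder model id:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
    use_cuda_fp16=False,  # the fused fp16 kernel only helps when weights are fp16
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    torch_dtype=torch.bfloat16,
)
```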

@SunMarc
Member

SunMarc commented Aug 31, 2023

Thanks again for looking into that!
