GPTQ Quantization via from_pretrained: why enforcing fp16? #25888

Closed

HanGuo97 opened this issue Aug 31, 2023 · 10 comments · Fixed by #25894

Comments

@HanGuo97

Feature request

Hi, I noticed in the following line that the model has to be in fp16 before GPTQ quantization. I'm curious whether this condition can be dropped.

torch_dtype = torch.float16

Motivation

My use case runs into trouble with fp16 but works with bf16, and I noticed that if I simply remove this line and keep torch_dtype=None, everything runs fine.
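
For reference, here is a minimal sketch of the usage in question (placeholder model id); with the current code, the torch_dtype passed below is silently replaced by torch.float16:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)
gptq_config = GPTQConfig(bits=4, dataset="c4", tokenizer=tokenizer)

# The request: honor the dtype passed here (e.g. bf16) instead of forcing fp16.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    torch_dtype=torch.bfloat16,
)
```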

Your contribution

NA

@amyeroberts
Collaborator

cc @younesbelkada

@younesbelkada
Contributor

Hi!
That might be a copy-paste from the previous bnb integration, but I'm not sure. We should probably override it to torch.float16 only if torch_dtype=None. @SunMarc, what do you think? Maybe there is something I have overlooked about GPTQ.
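
A minimal sketch of that proposal (a hypothetical helper, not the actual transformers source):

```python
import logging

import torch

logger = logging.getLogger(__name__)


def resolve_torch_dtype(torch_dtype):
    # Hypothetical helper: fall back to fp16 only when the caller did not
    # request a dtype explicitly, and log what happened.
    if torch_dtype is None:
        torch_dtype = torch.float16
        logger.info(
            "Loading the model in torch.float16 for GPTQ quantization. "
            "Pass torch_dtype explicitly to override this default."
        )
    return torch_dtype
```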

@HanGuo97
Author

HanGuo97 commented Aug 31, 2023

Why override it if torch_dtype=None? I think fp32 runs just fine too?

@younesbelkada
Contributor

younesbelkada commented Aug 31, 2023

If we set it to float32 by default, it will create a lot of overhead from the non-linear modules (such as the embedding layer) being in fp32, making it impossible to fit some models on Google Colab, for example. Therefore, for bnb we set them to half precision, with a logger.info explaining what is happening under the hood.
(For bnb) you can always cast the non-linear modules to fp32 by passing torch_dtype=torch.float32.
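
A usage sketch of that bnb behavior, with a placeholder model id:

```python
import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig

# Non-linear modules (embeddings, norms, ...) default to half precision when
# quantizing with bnb; passing torch_dtype keeps them in fp32 instead.
model = AutoModelForCausalLM.from_pretrained(
    "facebook/opt-350m",  # placeholder model for illustration
    quantization_config=BitsAndBytesConfig(load_in_8bit=True),
    device_map="auto",
    torch_dtype=torch.float32,
)
```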

@HanGuo97
Author

Understood -- thanks for the explanation!

So just to confirm: there are no correctness issues with using torch.float32; it's just that using fp16 instead can result in better efficiency for certain workloads?

If that’s the case, would it be more effective to add a warning when torch_dtype=None and suggest using fp16 for better efficiency? Personally, I prefer having fewer overrides, but I’m open to either approach.
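
A compact sketch of that alternative (hypothetical helper name; compare with the override sketch above):

```python
import logging

logger = logging.getLogger(__name__)


def warn_on_default_dtype(torch_dtype):
    # Hypothetical alternative: keep the caller's choice (including None)
    # untouched and only emit a hint about fp16 efficiency.
    if torch_dtype is None:
        logger.warning(
            "torch_dtype was not set; GPTQ kernels are typically fastest in "
            "torch.float16, so consider passing torch_dtype=torch.float16."
        )
    return torch_dtype
```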

@SunMarc
Member

SunMarc commented Aug 31, 2023

Hi @HanGuo97, the backend in the auto_gptq library has always used torch_dtype = torch.float16 by default, and I ran into a couple of issues with torch_dtype = torch.float32 in the past, most probably due to how the kernels are implemented. That is why I hardcoded it to torch.float16. But since it works for you, I will do as you suggested!

@HanGuo97
Author

Interesting, thanks for the clarification!

I briefly looked into the auto_gptq library, and I think they have a different code path depending on whether the data is in fp16 or not.

@SunMarc
Member

SunMarc commented Aug 31, 2023

Yeah, I must have forgotten to deactivate use_cuda_fp16 as it is enabled by default ;)

@HanGuo97
Author

HanGuo97 commented Aug 31, 2023

Oh yes you are right, I missed this :)

(In hindsight, it's a bit odd they set this to True by default when it clearly depends on the model.)

Edit: optimum will detect the proper flag here
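
A hedged sketch of how this resolves on the transformers side, assuming GPTQConfig exposes a use_cuda_fp16 flag (its default may differ across versions) and using a placeholder model id:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, GPTQConfig

model_id = "facebook/opt-125m"  # placeholder model for illustration
tokenizer = AutoTokenizer.from_pretrained(model_id)

gptq_config = GPTQConfig(
    bits=4,
    dataset="c4",
    tokenizer=tokenizer,
    use_cuda_fp16=False,  # the fused fp16 kernel only helps when weights are fp16
)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=gptq_config,
    torch_dtype=torch.bfloat16,
)
```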

@SunMarc
Member

SunMarc commented Aug 31, 2023

Thanks again for looking into that!
