support 2bit quip# method #1293
Comments
Hi! cc @SunMarc @Titus-von-Koeller FYI
The canonical way to install QuIP# kernels is to install the fast-hadamard-transform package and build quiptools (in our codebase on GitHub). We do not have a PyPI package yet, but we plan to publish one once the project becomes more stable. The two key "linear" classes that QuIP# relies on are here https://github.com/Cornell-RelaxML/quip-sharp/tree/main/lib/linear, and you can see how we replace nn.Linear in llama with those classes in https://github.com/Cornell-RelaxML/quip-sharp/blob/main/model/llama.py. A few questions: how are you planning on integrating QuIP# into huggingface code? Where will it be integrated, and how will you keep up with future iterations of QuIP#?
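To illustrate what that replacement looks like in practice, here is a minimal sketch of walking a module tree and swapping every nn.Linear for a codebook-backed layer. The QuantizedLinear below is a simplified placeholder, not the real class from lib/linear, and it just stores the dense weight instead of packed codebook indices:

```
import torch.nn as nn
import torch.nn.functional as F

class QuantizedLinear(nn.Module):
    """Placeholder for a QuIP#-style quantized linear layer."""

    def __init__(self, linear: nn.Linear):
        super().__init__()
        self.in_features = linear.in_features
        self.out_features = linear.out_features
        # A real implementation would store packed codebook indices here.
        self.register_buffer("packed_weight", linear.weight.detach().clone())
        self.bias = linear.bias

    def forward(self, x):
        # A real implementation would decode the codebook in a fused CUDA kernel.
        return F.linear(x, self.packed_weight, self.bias)

def replace_linears(model: nn.Module):
    """Recursively replace every nn.Linear in the model."""
    for name, child in model.named_children():
        if isinstance(child, nn.Linear):
            setattr(model, name, QuantizedLinear(child))
        else:
            replace_linears(child)
```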
Hi @tsengalb99
I think it would be best to avoid duplicating code from the QuIP# codebase. The QuantizedLinear class is not standalone and relies on implementations in the codebook files (e.g. here for E8P: https://github.com/Cornell-RelaxML/quip-sharp/blob/1d6e3c2d4c144eba80b945cca5429ce8d79d2cec/lib/codebook/latticee8_padded12.py#L180), which means you'll have to copy all of those over as well. QuIP# is still in active development, and we will almost certainly make changes to the codebooks in the future that will require you to update your copies as well. Perhaps you could include QuIP# as a submodule or something similar so users only have to pull our code once.
We still plan to support QuIP# inference. @tsengalb99 I will provide more details once huggingface/transformers#28703 gets merged
Hi @tsengalb99 !
Hi Younes, I’ll take a look at that, it definitely sounds interesting!
Awesome, thanks very much @tsengalb99!
Hi @tsengalb99 ! Let me know if you need any help to kick off QuIP# integration in transformers! 🙏 With the recent quantizer support it should be quite straightforward, and I am happy to help if needed.
Hi Younes, will do. I got caught up with some other stuff, but just released the updated QuIP# code and models today (https://github.com/Cornell-RelaxML/quip-sharp, https://arxiv.org/abs/2402.04396). Hoping to get integration going soon.
Awesome, thanks so much @tsengalb99, let me know if you face any issues!
@younesbelkada we've finally started working on this, expect some progress in a week or so.
Nice, thanks very much! Let me know if you need any help or guidance!
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
We are still working on integration, albeit very slowly.
Thanks again @tsengalb99 ! 🚀
AQLM is already fine #1476
This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.
We have a better method coming out soon, so QuIP# development has been superseded. We may eventually get around to HF support, but without working CUDA graphs during generation it's difficult to justify spending time on integration.
Thanks for the update @tsengalb99 ! Very excited for this new method 🔥 Would you mind explaining a bit more why CUDA graphs are needed? Also, in general, do you have any recommendations on what to improve in transformers to allow better support of quantization methods?
Hi Marc,
CUDA graphs are essential for fast inference since they mask out much of the kernel launch overhead. Many quantization algorithms like QuIP# use multiple kernels during inference, and the launch overhead can often be much higher than the actual compute. For example, with CUDA graphs, QuIP# can hit 170 tok/s for a 2-bit 7B model. Without them, IIRC it does around 20-30 tok/s.
Much of this could be solved by kernel fusion, and groups that have large teams working on quantization have the engineering manpower to do that. However, smaller teams like ours can't always do everything, so having CUDA graph support would be very useful.
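For context on what CUDA graph support buys you, here is a minimal, self-contained PyTorch sketch (not QuIP# code, just a toy layer standing in for a decode step) of capturing a step into a graph and replaying it, so the whole captured kernel sequence is launched in one shot instead of paying per-kernel launch overhead:

```
import torch

# Toy "decode step": a layer whose kernels are small enough that
# launch overhead dominates at batch size 1.
layer = torch.nn.Linear(4096, 4096).cuda().half()
static_x = torch.randn(1, 4096, device="cuda", dtype=torch.half)

# Warm up on a side stream so lazy library initialization is not captured.
s = torch.cuda.Stream()
s.wait_stream(torch.cuda.current_stream())
with torch.cuda.stream(s):
    for _ in range(3):
        layer(static_x)
torch.cuda.current_stream().wait_stream(s)

# Capture one step into a CUDA graph.
g = torch.cuda.CUDAGraph()
with torch.cuda.graph(g):
    static_y = layer(static_x)

# Replay: copy new data into the captured input buffer, then relaunch
# the entire recorded kernel sequence with a single call.
static_x.copy_(torch.randn_like(static_x))
g.replay()
print(static_y.float().norm())
```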
CUDA graphs are supported in transformers for models that support a static KV cache.
Is there a list of such models and a guide on how to use CUDA graphs with transformers? I just tried torch.compile(model.generate, mode='reduce-overhead') on transformers 4.42.3 with Llama 2 7B and got the following error:
```
>> gen(input)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_dynamo/external_utils.py", line 36, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/utils/_contextlib.py", line 115, in decorate_context
return func(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/transformers/generation/utils.py", line 1538, in generate
@torch.no_grad()
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/transformers/generation/utils.py", line 1456, in _prepare_special_tokens
def _prepare_special_tokens(
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_dynamo/eval_frame.py", line 451, in _fn
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_dynamo/external_utils.py", line 36, in inner
return fn(*args, **kwargs)
^^^^^^^^^^^^^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_functorch/aot_autograd.py", line 917, in forward
return compiled_fn(full_args)
^^^^^^^^^^^^^^^^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 89, in g
return f(*args)
^^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/runtime_wrappers.py", line 106, in runtime_wrapper
all_outs = call_func_at_runtime_with_args(
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/utils.py", line 113, in call_func_at_runtime_with_args
out = normalize_as_list(f(args))
^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_functorch/_aot_autograd/jit_compile_runtime_wrappers.py", line 152, in rng_functionalization_wrapper
return compiled_fw(args)
^^^^^^^^^^^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_inductor/codecache.py", line 906, in __call__
return self.get_current_callable()(inputs)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 838, in run
return compiled_fn(new_inputs)
^^^^^^^^^^^^^^^^^^^^^^^
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_inductor/cudagraph_trees.py", line 381, in deferred_cudagraphify
copy_misaligned_inputs(inputs, check_input_idxs)
File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/torch/_inductor/compile_fx.py", line 751, in copy_misaligned_inputs
if new_inputs[i].data_ptr() % ALIGNMENT:
^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: Error: accessing tensor output of CUDAGraphs that has been overwritten by a subsequent run. Stack trace: File "/home/at676/miniconda3/envs/latest_hf/lib/python3.11/site-packages/transformers/generation/utils.py", line 1500, in _prepare_special_tokens
eos_token_id = eos_token_id.unsqueeze(0). To prevent overwriting, clone the tensor outside of torch.compile() or call torch.compiler.cudagraph_mark_step_begin() before each model invocation.
```
The compile should be run on the forward, not generate, for now! huggingface/transformers#30788 will add end-to-end support
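A minimal sketch of that workaround, assuming transformers ≥ 4.38 and a model with static-cache support (the Llama checkpoint here is just an example): enable the static KV cache and compile only model.forward, then call generate() as usual.

```
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "meta-llama/Llama-2-7b-hf"  # example; any static-cache-capable model
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype=torch.float16, device_map="cuda"
)

# Static KV cache gives fixed shapes, which reduce-overhead (CUDA graphs) needs.
model.generation_config.cache_implementation = "static"
# Compile the forward pass only; generate() itself stays eager for now.
model.forward = torch.compile(model.forward, mode="reduce-overhead", fullgraph=True)

inputs = tokenizer("The capital of France is", return_tensors="pt").to("cuda")
out = model.generate(**inputs, max_new_tokens=32)
print(tokenizer.decode(out[0], skip_special_tokens=True))
```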
https://github.com/Cornell-RelaxML/quip-sharp