
Code Completion and Code Infilling - LLama 7B Inference #256

Closed
akaneshiro7 opened this issue Oct 19, 2023 · 7 comments
akaneshiro7 commented Oct 19, 2023

Hello,

I am trying to run inference for Code Llama using the Hugging Face Transformers model for Llama 2 7B with Accelerate. I am able to run inference with the examples/inference.py script. However, when I try to run the example command for code completion and code infilling, I get the following error:

TypeError: llama_forward() got an unexpected keyword argument 'padding_mask'

Warning: The BetterTransformer implementation does not support padding during training, as the fused kernels do not support attention masks. Beware that passing padded batched data during training may result in unexpected outputs. Please refer to https://huggingface.co/docs/optimum/bettertransformer/overview for more details.

Command: python examples/code_llama/code_infilling_example.py --model_name ../hf_transformers/7B/ --prompt_file examples/code_llama/code_infilling_prompt.txt --temperature 0.2 --top_p 0.9

Versions:
torch==2.0.1
transformers==4.34.1
tokenizers==0.14.1
optimum==1.13.2
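For context, this TypeError pattern comes from a version skew: transformers 4.34 started passing a padding_mask keyword into the attention forward, while the BetterTransformer override in optimum 1.13.2 did not yet accept it. A minimal, self-contained Python sketch of the failure mode (the function names here are hypothetical stand-ins, not optimum's actual internals):

```python
def llama_forward_old(hidden_states, attention_mask=None):
    # Pre-fix override: no padding_mask parameter and no **kwargs,
    # so any new keyword argument from the caller raises TypeError.
    return hidden_states

def llama_forward_fixed(hidden_states, attention_mask=None, **kwargs):
    # Post-fix override: tolerates newer keyword arguments such as
    # padding_mask that a newer caller may now pass.
    return hidden_states

try:
    llama_forward_old([1, 2, 3], padding_mask=None)
except TypeError as e:
    print(e)  # unexpected keyword argument 'padding_mask'

llama_forward_fixed([1, 2, 3], padding_mask=None)  # accepted
```

This is why pinning a transformers version newer than the optimum version it was tested with can break: the caller and the monkey-patched forward disagree on the accepted signature.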

@jxmorris12

I ran into this issue too. It's supposed to be fixed in the latest update of Optimum: huggingface/optimum#1446

@HamidShojanazeri
Contributor

Thanks @jxmorris12. @akaneshiro7, it seems the fix in optimum should help. Please let us know if that works for you.

@eduardosanchezg

Can confirm it works after installing optimum from source.

@akaneshiro7
Author

@HamidShojanazeri

Yes, installing from source seems to have fixed the issue with running code completion and code infilling with the Hugging Face Accelerate model. Can I also run it with a PEFT fine-tuned model? When I try to run the same command but pass in the peft_model, I get the following error:

File "/home/ubuntu/anaconda3/envs/vllm/lib/python3.8/site-packages/torch/nn/modules/module.py", line 1685, in __delattr__
    super().__delattr__(name)
AttributeError: _hf_hook

@huangyangyu

I am hitting the "AttributeError: _hf_hook" issue too.

@callumHub

> I meet the issue "AttributeError: _hf_hook" too.

It went away when I set use_fast_kernels=False.
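That workaround makes sense if use_fast_kernels gates the BetterTransformer conversion: with it disabled, the code path that removes _hf_hook is never reached. A simplified sketch of that guard pattern (not the repo's actual code; BetterTransformer.transform is optimum's real API):

```python
def prepare_model(model, use_fast_kernels=False):
    # Only convert to BetterTransformer when fast kernels are requested;
    # with use_fast_kernels=False the conversion (and any hook removal
    # inside it) is skipped entirely and the model passes through as-is.
    if use_fast_kernels:
        from optimum.bettertransformer import BetterTransformer
        model = BetterTransformer.transform(model)
    return model

sentinel = object()
assert prepare_model(sentinel, use_fast_kernels=False) is sentinel
```

The trade-off is that you give up the fused-kernel speedup, which is why the LoRA case below still wants a real fix.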

@XvHaidong

I am hitting the "AttributeError: _hf_hook" issue too. Could you please fix this? We really want to speed up the inference stage (with a LoRA model). Thanks!

@init27 init27 closed this as completed Aug 19, 2024