flash-attention is not running, although is_flash_attn_2_available() returns true #30547

6sixteen · 2024-04-29T15:07:52Z

System Info

transformers version: 4.40.1
Platform: Windows-10-10.0.22631-SP0
Python version: 3.11.9
Huggingface_hub version: 0.22.2
Safetensors version: 0.4.3
Accelerate version: 0.29.3
Accelerate config: not found
PyTorch version (GPU?): 2.3.0+cu121 (True)
Tensorflow version (GPU?): not installed (NA)
Flax version (CPU?/GPU?/TPU?): not installed (NA)
Jax version: not installed
JaxLib version: not installed
Using GPU in script?:
Using distributed or parallel set-up in script?:

Who can help?

@Narsil
No response

Information

The official example scripts
My own modified scripts

Tasks

An officially supported task in the examples folder (such as GLUE/SQuAD, ...)
My own task or dataset (give details below)

Reproduction

I pip flash-attention through https://github.com/bdashore3/flash-attention/releases， which provides compiled package for windows. And is_flash_attn_2_available() returns true.
I run the comfyui workflow https://github.com/ZHO-ZHO-ZHO/ComfyUI-Phi-3-mini, which runs the pipeline of model microsoft/Phi-3-mini-4k-instruct .
I received "You are not running the flash-attention implementation, expect numerical differences."
I try to search this message in transformers, but find nothing （Unbelievable！！This is the first question）. Then I clone the newest code from github，and I find this message in modeling_phi3.py class Phi3Attention(nn.Module). And I find the transformers installed by pip is different between the transformers in github. The latter have a phi3 folder in model folder. So I try to build transformer by running python setup.py install , but i received
error: [Errno 2] No such file or directory: 'c:\users\79314\anaconda3\envs\comfyuitest\lib\site-packages\transformers-4.41.0.dev0py3.11.egg\transformers\models\deprecated\trajectory_transformer\pycache\convert_trajectory_transformer_original_pytorch_checkpoint_to_pytorch.cpython-311.pyc.1368162759184'
I try my best to use the flash-attention, but I failed.

Expected behavior

How to run with flash-attention？
My curiosity：
Is there another reason limiting flash-attention，except is_flash_attn_2_available()
Why I can't find "You are not running the flash-attention implementation, expect numerical differences." Which file contains these message?
How can I build the latest transformers?

The text was updated successfully, but these errors were encountered:

amyeroberts · 2024-04-29T17:06:47Z

cc @ArthurZucker @Rocketknight1

vikram71198 · 2024-04-30T00:45:59Z

Yeah, so the PR to integrate Phi-3 with transformers has already been merged here. There hasn't been a stable release yet, which is why there's a difference between the pip version and when you install directly from source. So, you have to do the latter for now.

That being said, I also see the message You are not running the flash-attention implementation, expect numerical differences. pop up & it seems like this is the first LOC in the forward func, although I do not understand what this message means myself.

Maybe @gugarosa can help, since this was his PR.

nunovix · 2024-05-16T17:45:54Z

Hi!
Might be out of topic, but I'm also trying to install the flash-attn package to run the Phi-3 model.
I've run into an issue trying to installl the flash-attn package from its github repo with the following "python setup.py install".

I get a lot of warning like the one below, it does not crash but it does not seem to ever finish as well. Has anyone ran into something similar and was able to solve it?

(I'm using a cuda with version 12.1 and pytorch 2.3)

"/usr/local/cuda-12/include/cusparse.h:254:20: note: declared here
254 | struct pruneInfo* pruneInfo_t CUSPARSE_DEPRECATED_TYPE;
| ^~~~~~~~~~~
/usr/local/cuda-12/include/cusparse.h:4868:366: warning: 'pruneInfo_t' is deprecated: The type will be removed in the next major release [-Wdeprecated-declarations]
4868 | cusparseSpruneCsr2csrByPercentage(cusparseHandle_t handle,
| ^
/usr/local/cuda-12/include/cusparse.h:254:20: note: declared here
254 | struct pruneInfo* pruneInfo_t CUSPARSE_DEPRECATED_TYPE;
| ^~~~~~~~~~~
/usr/local/cuda-12/include/cusparse.h:4886:368: warning: 'pruneInfo_t' is deprecated: The type will be removed in the next major release [-Wdeprecated-declarations]
4886 | cusparseDpruneCsr2csrByPercentage(cusparseHandle_t handle,
| ^
/usr/local/cuda-12/include/cusparse.h:254:20: note: declared here
254 | struct pruneInfo* pruneInfo_t CUSPARSE_DEPRECATED_TYPE;"

naveenfaclon · 2024-05-18T06:55:06Z

i am trying to use Phi-3-128k model, getting this problem

.modeling_phi3:You are not running the flash-attention implementation, expect numerical differences.
Killed

has anyone faced this error. How to solve this

BlueFoxPrime · 2024-05-22T12:42:58Z

Works fine after telling the pipeline which attention mechanism to use. I don't think there is a problem with Transformers lib:

!pip install transformers
!pip install flash-attn

from transformers import AutoModelForCausalLM
from transformers.utils import is_flash_attn_2_available

model = AutoModelForCausalLM.from_pretrained(
    "microsoft/Phi-3-mini-128k-instruct",
    device_map="cuda",
    torch_dtype="auto",
    trust_remote_code=True,
    attn_implementation="flash_attention_2"
)

is_flash_attn_2_available()
# True

naveenfaclon · 2024-05-29T11:45:47Z

i still sometimes get the same issue and also CUDA is out of memory issue.

github-actions · 2024-08-12T08:06:10Z

This issue has been automatically marked as stale because it has not had recent activity. If you think this still needs to be addressed please comment on this thread.

Please note that issues that do not follow the contributing guidelines are likely to be ignored.

hammoudhasan mentioned this issue May 23, 2024

Phi-3 conversation format, example training script and perplexity metric axolotl-ai-cloud/axolotl#1582

Merged

huggingface deleted a comment from github-actions bot Jun 23, 2024

huggingface deleted a comment from github-actions bot Jul 18, 2024

github-actions bot closed this as completed Aug 20, 2024

JosephPai mentioned this issue Dec 14, 2024

Error while training showlab/VideoLISA#3

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

flash-attention is not running, although is_flash_attn_2_available() returns true #30547

flash-attention is not running, although is_flash_attn_2_available() returns true #30547

6sixteen commented Apr 29, 2024 •

edited

Loading

amyeroberts commented Apr 29, 2024

vikram71198 commented Apr 30, 2024

nunovix commented May 16, 2024 •

edited

Loading

naveenfaclon commented May 18, 2024 •

edited

Loading

BlueFoxPrime commented May 22, 2024 •

edited

Loading

naveenfaclon commented May 29, 2024

github-actions bot commented Aug 12, 2024

flash-attention is not running, although is_flash_attn_2_available() returns true #30547

flash-attention is not running, although is_flash_attn_2_available() returns true #30547

Comments

6sixteen commented Apr 29, 2024 • edited Loading

System Info

Who can help?

Information

Tasks

Reproduction

Expected behavior

amyeroberts commented Apr 29, 2024

vikram71198 commented Apr 30, 2024

nunovix commented May 16, 2024 • edited Loading

naveenfaclon commented May 18, 2024 • edited Loading

BlueFoxPrime commented May 22, 2024 • edited Loading

naveenfaclon commented May 29, 2024

github-actions bot commented Aug 12, 2024

6sixteen commented Apr 29, 2024 •

edited

Loading

nunovix commented May 16, 2024 •

edited

Loading

naveenfaclon commented May 18, 2024 •

edited

Loading

BlueFoxPrime commented May 22, 2024 •

edited

Loading