Is there any way to compile the codes with nvcc debug flag(-G)? #1364
Comments
You can try the
|
Btw I always just print things out to debug |
Thx for the answer. |
I'm trying to compile the debug version of FlashAttention 2 (A100, CUDA 12.4, PyTorch 2.4), but the compilation failed due to OOM. The machine has 1 TiB of RAM and 64 cores.
I tried to set
I set |
@miaomiaoma0703 After a few hours of workarounds, I managed to compile the code without errors. |
@Dev-Jahn Thank you very much! Perhaps a single PTX intermediate file takes a few GB, so even 1 TiB of RAM is too little for the build to succeed. I compiled the release version successfully in 15 minutes. For the debug version, I plan either to print things out to debug, as the author does, or, like you, to explicitly instantiate only the template parameters my debug case needs and remove all the others (as you did in your first comment). |
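The pruning idea discussed in these comments (build only the template instantiations the debug case needs) could look roughly like the sketch below in a build script. It assumes the kernel instantiations are split into one source file per head dimension and dtype. The file names, the `ALL_SOURCES` list, and the `prune_sources` helper are hypothetical illustrations, not taken from the thread or from the actual repository layout.

```python
import re

# Hypothetical list of per-instantiation source files (names are illustrative only).
ALL_SOURCES = [
    f"flash_fwd_hdim{dim}_{dtype}_sm90.cu"
    for dim in (64, 128, 256)
    for dtype in ("fp16", "bf16")
]

# Assumed naming scheme: ..._hdim<head dim>_<dtype>_sm<arch>.cu
_NAME_PATTERN = re.compile(r"hdim(\d+)_([a-z0-9]+)_sm\d+\.cu$")


def prune_sources(sources, head_dims, dtypes):
    """Keep only the files whose head dim and dtype are needed for debugging."""
    kept = []
    for path in sources:
        match = _NAME_PATTERN.search(path)
        if match is None:
            continue  # not an instantiation file under the assumed naming scheme
        if int(match.group(1)) in head_dims and match.group(2) in dtypes:
            kept.append(path)
    return kept


if __name__ == "__main__":
    # Debug build: compile a single instantiation instead of all six.
    print(prune_sources(ALL_SOURCES, head_dims={128}, dtypes={"bf16"}))
```

With fewer instantiations, each device-debug compile has far less code to push through cicc and ptxas, which is consistent with the reduced memory use reported in this thread. Any dispatch code that references the removed instantiations would also have to be trimmed, or the link step will fail.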
I'm trying to implement custom behaviors on top of the flash-attn 3 (Hopper) base.
Building the library is no problem in general, but compilation takes far too long when I add the `nvcc -G` flag (or `--ptxas-options=-g`) to debug the mainloop and tile schedulers.
Above is the list of all nvcc flags I'm using.
- `-g -O0` added to `cxx_flags`: compiles in a reasonable time
- `-g -O0` added to `nvcc_flags`: compiles in a reasonable time
- `-G` (device code debug) added to `nvcc_flags`: never finishes
- `/tmp` dir as a ramdisk

My dev machine has 224 CPU cores, but increasing the number of ninja or nvcc threads doesn't help because cicc and ptxas are not parallelizable.
The ptxas process for a single PTX file takes almost forever (more than 2 hours).
I'm fairly new to CUDA, so I might be missing some important options.
Is there any other way to reduce the compilation time when adding the device debug flag?
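For reference, the flag combinations listed above can be sketched as a small helper that builds separate flag lists for the host compiler and for nvcc. The `"cxx"` / `"nvcc"` keys mirror the dictionary form that PyTorch's `CUDAExtension` accepts for `extra_compile_args`. The function name `build_compile_args` and the `-O3` default for non-debug builds are assumptions made for illustration, not the project's actual build script.

```python
def build_compile_args(host_debug: bool, device_debug: bool) -> dict:
    """Return per-compiler flag lists for the combinations described above."""
    # Host-side debug info: "-g -O0". The "-O3" fallback is an assumed default.
    base = ["-g", "-O0"] if host_debug else ["-O3"]
    cxx_flags = list(base)
    nvcc_flags = list(base)
    if device_debug:
        # "-G" turns on device-code debug info and disables device optimizations.
        # This is the flag that made the build in this thread never finish.
        nvcc_flags.append("-G")
    return {"cxx": cxx_flags, "nvcc": nvcc_flags}


if __name__ == "__main__":
    # Host-only debug build: reported to compile in a reasonable time.
    print(build_compile_args(host_debug=True, device_debug=False))
    # Host and device debug build: the problematic configuration.
    print(build_compile_args(host_debug=True, device_debug=True))
```

Keeping `-G` behind its own switch makes it easy to enable device debugging only for builds where the set of instantiated kernels has already been cut down.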
(edit)
Above is the log printed when the compilation failed after a few hours. (I don't know exactly how long, since I was AFK.)
Compilation uses all of the RAM (2 TiB) before pruning templates, but after the workaround I mentioned above it consumes under 100 GB, so I guess it's not an OOM issue.