
monarch_cuda_interface_fwd_bf16.cu failed with invalid argument (1). #6

Open
jeohalves opened this issue Nov 23, 2023 · 2 comments
Labels: bug (Something isn't working)

@jeohalves

ERROR: CUDA RT call "cudaFuncSetAttribute(&monarch_conv_cuda_32_32_32_kernel<32, 8, 32768, 2, 16, false, 2, 8, 8>, cudaFuncAttributeMaxDynamicSharedMemorySize, 135168)" in line 969 of file /root/flash-fft-conv/csrc/flashfftconv/monarch_cuda/monarch_cuda_interface_fwd_bf16.cu failed with invalid argument (1). CUDA Runtime Error at: /root/flash-fft-conv/csrc/flashfftconv/monarch_cuda/monarch_cuda_interface_fwd_bf16.cu:1041 invalid argument

I tried the example code with my_flashfftconv(x, k) as well as tests/test_flashfftconv.py, using the NVIDIA PyTorch Docker container (23.05). Previously, I had used conda environments with different CUDA versions (12.1, 12.2, and 12.3).

I'm using two NVIDIA RTX 3090 GPUs with Driver Version 535.129.03 and CUDA Version 12.2.

Is there any fix for this problem? (Changing the tensor types didn't fix it.)
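For reference, here is a minimal standalone check (not part of flash-fft-conv; the 135168-byte value is copied from the error above, and device 0 is assumed) that compares the GPU's opt-in dynamic shared-memory limit with what the kernel requests:

```cuda
// Standalone diagnostic (not from the repo): compare this GPU's opt-in
// dynamic shared-memory limit with the 135168 bytes (132 KB) requested by
// the failing cudaFuncSetAttribute call. On an RTX 3090 (sm_86) the limit
// is 101376 bytes (~99 KB), so the request fails with "invalid argument".
#include <cstdio>
#include <cuda_runtime.h>

int main() {
    int device = 0;                       // assumes the first visible GPU
    cudaSetDevice(device);

    int max_optin = 0;
    cudaDeviceGetAttribute(&max_optin,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, device);

    const int requested = 135168;         // value from the error message above
    printf("opt-in shared memory limit: %d bytes, kernel requests: %d bytes\n",
           max_optin, requested);
    if (requested > max_optin) {
        printf("request exceeds the device limit -> cudaFuncSetAttribute "
               "returns cudaErrorInvalidValue (1)\n");
    }
    return 0;
}
```

On my machine this shows the requested 132 KB exceeding the ~99 KB opt-in limit, which seems to be exactly the condition that makes cudaFuncSetAttribute return error 1 (invalid argument).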

@DanFu09 (Contributor) commented Nov 24, 2023

Thanks for this bug report! This happens because the RTX series has less SRAM than the A100/H100 (99 KB vs. 163/227 KB), which I didn't check for during development. For now, you should be fine with sequence lengths up to 16K and with sequence lengths between 64K and 524K.

We'll try to fill in the remaining sequence lengths for the 3090 and 4090 in the next week or so, up to 2M (it requires some code changes and special-casing for different GPUs).
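Roughly, the special-casing will look something like the sketch below (illustrative only, not the actual code; configure_smem is a placeholder helper): only opt into the large dynamic shared-memory configuration when the device's opt-in limit allows it, and otherwise dispatch to a kernel variant that fits.

```cuda
// Rough sketch of per-GPU special-casing (placeholder, not the actual
// flash-fft-conv code). Only request the large dynamic shared-memory
// configuration when the device's opt-in limit allows it; the caller falls
// back to a smaller-tile kernel variant otherwise.
#include <cuda_runtime.h>

// Returns true if `kernel` was successfully configured to use `smem_bytes`
// of dynamic shared memory on `device`; false means the caller should pick
// a variant that fits (the opt-in limit is ~99 KB on RTX 3090/4090 vs.
// 163/227 KB on A100/H100).
static bool configure_smem(const void* kernel, int smem_bytes, int device) {
    int max_optin = 0;
    cudaDeviceGetAttribute(&max_optin,
                           cudaDevAttrMaxSharedMemoryPerBlockOptin, device);
    if (smem_bytes > max_optin) {
        return false;  // e.g. 135168 > 101376 bytes on an RTX 3090
    }
    return cudaFuncSetAttribute(kernel,
                                cudaFuncAttributeMaxDynamicSharedMemorySize,
                                smem_bytes) == cudaSuccess;
}
```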

@KarlUlbaek

Thanks for your excellent work on this project! I'm running into the same CUDA error with an RTX 4090.

I specifically need to use sequence lengths between 8K and 64K, which seems to be the current gap in RTX support due to the SRAM limitations compared to A100/H100. Are there any plans to address this particular range for RTX cards? It would be really valuable for my use case.

Thanks for all your work on this!
