ERROR: CUDA RT call "cudaFuncSetAttribute(&monarch_conv_cuda_32_32_32_kernel<32, 8, 32768, 2, 16, false, 2, 8, 8>, cudaFuncAttributeMaxDynamicSharedMemorySize, 135168)" in line 969 of file /root/flash-fft-conv/csrc/flashfftconv/monarch_cuda/monarch_cuda_interface_fwd_bf16.cu failed with invalid argument (1).
CUDA Runtime Error at: /root/flash-fft-conv/csrc/flashfftconv/monarch_cuda/monarch_cuda_interface_fwd_bf16.cu:1041 invalid argument
I tried the example code with my_flashfftconv(x, k) and tests/test_flashfftconv.py using the NVIDIA PyTorch Docker container (23.05). Previously, I used conda with different CUDA versions (12.1, 12.2, and 12.3).
I'm using two NVIDIA RTX 3090s with driver version 535.129.03 and CUDA version 12.2.
Is there any fix for this problem? (Changing tensor types didn't fix it.)
Thanks for this bug report! This happens because the RTX series has less SRAM than the A100/H100 (99 KB vs. 163/227 KB), which I didn't check for during development. For now, you should be good for sequence lengths up to 16K and between 64K and 524K.
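In the meantime, you can check the opt-in dynamic shared memory limit on your own card to see which kernels will fit. Here's a minimal sketch using ctypes against the CUDA runtime (my assumptions: the library is loadable as libcudart.so, and enum value 97 is cudaDevAttrMaxSharedMemoryPerBlockOptin from cuda_runtime_api.h):

```python
import ctypes

# Load the CUDA runtime; the exact library name may differ on your system
# (e.g. a versioned libcudart.so.12 inside the NGC containers).
cudart = ctypes.CDLL("libcudart.so")

# cudaDevAttrMaxSharedMemoryPerBlockOptin = 97: the largest dynamic shared
# memory a kernel can opt in to via cudaFuncSetAttribute.
MAX_SMEM_PER_BLOCK_OPTIN = 97

smem = ctypes.c_int()
err = cudart.cudaDeviceGetAttribute(ctypes.byref(smem), MAX_SMEM_PER_BLOCK_OPTIN, 0)
assert err == 0, f"cudaDeviceGetAttribute failed with error {err}"
print(f"Max opt-in shared memory per block: {smem.value} bytes")

# On a 3090 this prints 101376 (99 KB), so the 135168-byte request in the
# error above fails with invalid argument; on A100/H100 it fits.
```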
We'll try to fill in the rest of the sequence lengths for 3090 & 4090 in the next week or so, up to 2M (it requires some code changes and special-casing for different GPUs).
Thanks for your excellent work on this project! I'm running into the same CUDA error with an RTX 4090.
I specifically need to use sequence lengths between 8K and 64K, which seems to be the current gap in RTX support due to the SRAM limitations compared to A100/H100. Are there any plans to address this particular range for RTX cards? It would be really valuable for my use case.
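In the meantime, I've been falling back to a plain PyTorch FFT convolution for the lengths that error out on my card. A minimal sketch of what I mean (it assumes the (B, H, L) input / (H, L) filter layout I use with my_flashfftconv, and a causal convolution; much slower, but it keeps things running):

```python
import torch

def ref_fft_conv(u: torch.Tensor, k: torch.Tensor) -> torch.Tensor:
    # Causal long convolution via FFT, truncated back to the input length.
    # Layout assumption: u is (B, H, L) and k is (H, L).
    L = u.shape[-1]
    fft_size = 2 * L  # zero-pad so circular convolution equals linear convolution
    u_f = torch.fft.rfft(u.float(), n=fft_size)
    k_f = torch.fft.rfft(k.float(), n=fft_size)  # broadcasts over the batch dim
    y = torch.fft.irfft(u_f * k_f, n=fft_size)[..., :L]
    return y.to(u.dtype)
```

I just branch on the sequence length and only call FlashFFTConv for the sizes that currently work on RTX cards.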