Command R Plus crashed on large context (~40K) with CUDA #6948
Comments
I can reproduce this issue and I don't think this is specific to this quant but I'll test other ones.
EDIT: Same issue with other quants.
I also tested with PR #6563 and it doesn't resolve this issue, but I think this may be related to int overflow.
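To make the int overflow hypothesis concrete, here is a minimal sketch of how a flat element count can exceed the range of a 32-bit signed int once the token count grows. The helper names `flat_count` and `overflows_int32` are hypothetical and not part of llama.cpp, and the row width of 60000 is a made-up number chosen only for illustration; it is not taken from the model or from the code in this thread.

```cpp
#include <cstdint>
#include <limits>

// Hypothetical helper (not from llama.cpp): number of elements in a
// rows x cols buffer, computed in 64-bit so the product itself cannot wrap.
constexpr int64_t flat_count(int64_t rows, int64_t cols) {
    return rows * cols;
}

// Hypothetical helper: true when that count no longer fits in a 32-bit
// signed int, i.e. when an `int` index into the buffer could overflow.
constexpr bool overflows_int32(int64_t rows, int64_t cols) {
    return flat_count(rows, cols) > std::numeric_limits<int32_t>::max();
}

// Illustrative width only (an assumption, not a real tensor dimension).
constexpr int64_t kAssumedCols = 60000;

// With this assumed width, 32768 rows still fit in int32 ...
static_assert(!overflows_int32(32768, kAssumedCols), "fits in int32");
// ... but 40000 rows do not.
static_assert(overflows_int32(40000, kAssumedCols), "exceeds int32");
```

Under these assumptions, indexing code that uses `int` works at the smaller size and breaks at the larger one. Whether any particular kernel is actually affected would need to be checked against its real index arithmetic.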
You can run it with
It might crash at Lines 1259 to 1267 in 7bb36cc
All parameters of this function use
It sometimes also crashed at Lines 2297 to 2301 in 7bb36cc
Get into the actual implementation of MUL op in
BTW, I'm using CUDA 11.8 for compiling.
The kernel launches are async, so knowing which call returns an error is not very useful, because any previous kernel may have caused the error. You could try setting the
Got it. Would synchronize manually inside
Setting
That took several hours... hopefully this is useful.
compute-sanitizer
These changes fix the issue; I updated my branch from PR #6563.
I tested Command R Plus on 4 L20 cards with a maximum context of 64K, with 64 layers offloaded to GPU, 16 layers per card.
My prompt is relatively large, around 50K tokens. During the prefill phase, llama.cpp crashed at ~40K tokens.
Here's the error message:
I'm using @dranger003's Q6_K model with the perplexity test fix in #6491 applied.
I also tested with a 32K context and it works fine.