Support for large context sizes #2021
Currently, if one attempts to use a context size larger than some threshold, llama.cpp fails. On the CPU, it fails with an assert. On the GPU using CUDA, the buffer overrun is not detected and we get NaNs (for context sizes > ~5120 at 7B, and > ~3600 at 13B). The context size at which this occurs depends on the model size.
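A minimal reproduction sketch, assuming the mid-2023 llama.cpp C API (`llama_context_default_params`, `llama_init_from_file`); the model path and context size here are illustrative, not taken from the report:

```cpp
// Sketch under the assumptions stated above; not the exact reproduction from the report.
#include "llama.h"

int main() {
    llama_context_params params = llama_context_default_params();
    params.n_ctx = 8192; // larger than the threshold described above

    llama_context * ctx = llama_init_from_file("models/7B/ggml-model-q4_0.bin", params);
    if (!ctx) {
        return 1;
    }

    // Evaluation works at first; the CPU assert (or the CUDA NaNs) shows up
    // only once the number of tokens in the context approaches n_ctx, e.g.
    // near the end of a long generation or a full perplexity bucket.
    // ... llama_eval() calls omitted ...

    llama_free(ctx);
    return 0;
}
```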
Comments

This will be resolved with ggml-org/ggml#288
Hi, did you get that error immediately, or does it run for a while before faulting? I use wiki.test.raw and run perplexity, but I'm afraid of how long it'll take on my device. There's no indication that it's operating. It would be neat to see an ETA, but maybe I'm just doing it wrong.
@JackJollimore It gives a time estimate after it finishes the first bucket. With a context length of 8k there are only 40 buckets or so. Depending on the speed of your computer, a bucket can take many minutes. It only runs out of memory towards the end of a bucket, when the used context length approaches the max context length. So, you need to be patient to get to the assert.
I understand. My Android device maxes out at 3 tokens/second, so it'll take a while. Thank you for explaining how it works - it'll show an ETA after finishing the 1st bucket.
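As a back-of-the-envelope check of the bucket count mentioned above, here is a minimal sketch; the token count for wiki.test.raw is an assumption, since the exact figure depends on the tokenizer:

```cpp
#include <cstdio>

int main() {
    const int n_tokens = 330000; // assumed token count of wiki.test.raw
    const int n_ctx    = 8192;   // context length from the comment above

    // The perplexity run scores the file one n_ctx-sized bucket at a time,
    // so the ETA printed after the first bucket covers the remaining ones.
    const int n_chunks = n_tokens / n_ctx;
    printf("%d buckets of %d tokens each\n", n_chunks, n_ctx); // prints 40
    return 0;
}
```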
This issue was closed because it has been inactive for 14 days since being marked as stale.