
Support for large context sizes #2021

Closed
ikawrakow opened this issue Jun 27, 2023 · 5 comments
Labels
enhancement New feature or request stale

Comments

@ikawrakow
Contributor

Currently, if one attempts to use a context size larger than some threshold, llama.cpp fails.

On the CPU, it fails with an assert such as

./bin/perplexity -m q6k.bin -f ../tests/wikitext-2-raw/wiki.test.raw -s 1234 -t 16 -c 8192
...
system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 40 chunks, batch_size=512
ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 545259520, available 536870912)
Segmentation fault (core dumped)

On the GPU with CUDA, the buffer overrun is not detected and we get NaNs (for context sizes > ~5120 at 7B, and > ~3600 at 13B).

The context size at which this occurs is dependent on the model size.
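For a sense of scale: the assert above shows the scratch pool falling 8 MiB short (545259520 needed vs. 536870912 available), and the memory that must grow with context is dominated by the KV cache. Below is a minimal sketch of that arithmetic, assuming typical 7B LLaMA-style dimensions (n_layer = 32, n_embd = 4096, fp16 values); these parameters are illustrative assumptions, not values read from llama.cpp itself.

```python
# ASSUMED dimensions for a 7B LLaMA-style model (illustrative only):
N_LAYER = 32     # transformer layers
N_EMBD = 4096    # embedding width
FP16_BYTES = 2   # bytes per element

def kv_cache_bytes(n_ctx, n_layer=N_LAYER, n_embd=N_EMBD):
    # K and V each hold n_ctx * n_embd fp16 values per layer.
    return 2 * n_layer * n_ctx * n_embd * FP16_BYTES

# The shortfall reported by the assert: needed minus available.
shortfall = 545259520 - 536870912
print(f"pool shortfall: {shortfall / 2**20:.0f} MiB")

for n_ctx in (2048, 4096, 8192):
    print(f"n_ctx={n_ctx}: kv cache ~{kv_cache_bytes(n_ctx) / 2**20:.0f} MiB")
```

The point is only that the per-context memory need scales linearly with n_ctx, while the scratch pool was sized as a fixed 512 MiB, so any sufficiently large context overruns it; the exact threshold shifts with model size, as noted above.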

@ggerganov
Owner

This will be resolved with ggml-org/ggml#288

@ggerganov ggerganov added the enhancement New feature or request label Jun 28, 2023
@ghost

ghost commented Jun 29, 2023

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 545259520, available 536870912)
Segmentation fault (core dumped)

Hi, did you get that error immediately, or does it run for a while before faulting?

I use wiki.test.raw and run perplexity, but I'm afraid of how long it'll take on my device. There's no indication that it's operating other than
perplexity: calculating perplexity over 109 chunks, batch_size=10

It would be neat to see an ETA, but maybe I'm just doing it wrong.

@ikawrakow
Contributor Author

@JackJollimore It gives a time estimate after it finishes the first bucket. With a context length of 8k there are only 40 buckets or so. Depending on the speed of your computer, a bucket can take many minutes. It only runs out of memory towards the end of the bucket, when the used context length approaches the max context length. So, you need to be patient to get to the assert.

@ghost

ghost commented Jun 30, 2023

> @JackJollimore It gives a time estimate after it finishes the first bucket. With a context length of 8k there are only 40 buckets or so. Depending on the speed of your computer, a bucket can take many minutes. It only runs out of memory towards the end of the bucket, when the used context length approaches the max context length. So, you need to be patient to get to the assert.

I understand. My Android device is maxed at 3 tokens/second, so it'll take a while. Thank you for explaining how it works - it'll show an ETA after finishing the 1st bucket.
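To make the ETA logic described above concrete: once the first bucket finishes, a linear extrapolation over the remaining buckets gives the estimate. A trivial sketch of that arithmetic follows; the timings used are made-up assumptions for illustration, not measurements.

```python
def eta_hours(first_chunk_seconds, n_chunks):
    # Assumes every chunk takes roughly as long as the first one.
    return first_chunk_seconds * n_chunks / 3600

# e.g. 40 chunks at 8k context, with a hypothetical 5 minutes per chunk:
print(f"~{eta_hours(5 * 60, 40):.1f} hours")
```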

@github-actions github-actions bot added the stale label Mar 25, 2024
Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024