
Support for large context sizes #2021

Closed
ikawrakow opened this issue Jun 27, 2023 · 5 comments
Labels
enhancement New feature or request stale

Comments

@ikawrakow
Contributor

Currently, if one attempts to use a context size larger than some threshold, llama.cpp fails.

On the CPU, it fails with an assert such as

./bin/perplexity -m q6k.bin -f ../tests/wikitext-2-raw/wiki.test.raw -s 1234 -t 16 -c 8192
...
system_info: n_threads = 16 / 32 | AVX = 1 | AVX2 = 1 | AVX512 = 0 | AVX512_VBMI = 0 | AVX512_VNNI = 0 | FMA = 1 | NEON = 0 | ARM_FMA = 0 | F16C = 1 | FP16_VA = 0 | WASM_SIMD = 0 | BLAS = 1 | SSE3 = 1 | VSX = 0 | 
perplexity: calculating perplexity over 40 chunks, batch_size=512
ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 545259520, available 536870912)
Segmentation fault (core dumped)

On the GPU with CUDA, the buffer overrun is not detected and we get NaNs (for context sizes > ~5120 at 7B, and > ~3600 at 13B).

The context size at which this occurs is dependent on the model size.
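For a sense of scale: the assert above shows the scratch pool falling 8 MiB short (545259520 needed vs. 536870912 available), and the memory that must grow with context is dominated by the KV cache. Below is a minimal sketch of that arithmetic, assuming typical 7B LLaMA-style dimensions (n_layer = 32, n_embd = 4096, fp16 values); these parameters are illustrative assumptions, not values read from llama.cpp itself.

```python
# ASSUMED dimensions for a 7B LLaMA-style model (illustrative only):
N_LAYER = 32     # transformer layers
N_EMBD = 4096    # embedding width
FP16_BYTES = 2   # bytes per element

def kv_cache_bytes(n_ctx, n_layer=N_LAYER, n_embd=N_EMBD):
    # K and V each hold n_ctx * n_embd fp16 values per layer.
    return 2 * n_layer * n_ctx * n_embd * FP16_BYTES

# The shortfall reported by the assert: needed minus available.
shortfall = 545259520 - 536870912
print(f"pool shortfall: {shortfall / 2**20:.0f} MiB")

for n_ctx in (2048, 4096, 8192):
    print(f"n_ctx={n_ctx}: kv cache ~{kv_cache_bytes(n_ctx) / 2**20:.0f} MiB")
```

The point is only that the per-context memory need scales linearly with n_ctx, while the scratch pool was sized as a fixed 512 MiB, so any sufficiently large context overruns it; the exact threshold shifts with model size, as noted above.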

@ggerganov
Owner

This will be resolved with ggml-org/ggml#288

@ggerganov ggerganov added the enhancement New feature or request label Jun 28, 2023
@ghost

ghost commented Jun 29, 2023

ggml_new_tensor_impl: not enough space in the scratch memory pool (needed 545259520, available 536870912)
Segmentation fault (core dumped)

Hi, did you get that error immediately, or does it run for a while before faulting?

I use wiki.test.raw and run perplexity, but I'm afraid of how long it'll take on my device. There's no indication that it's operating other than
perplexity: calculating perplexity over 109 chunks, batch_size=10

It would be neat to see an ETA, but maybe I'm just doing it wrong.

@ikawrakow
Contributor Author

@JackJollimore It gives a time estimate after it finishes the first bucket. With a context length of 8k there are only 40 buckets or so. Depending on the speed of your computer, a bucket can take many minutes. It only runs out of memory towards the end of the bucket, when the used context length approaches the max context length. So, you need to be patient to get to the assert.

@ghost

ghost commented Jun 30, 2023

> @JackJollimore It gives a time estimate after it finishes the first bucket. With a context length of 8k there are only 40 buckets or so. Depending on the speed of your computer, a bucket can take many minutes. It only runs out of memory towards the end of the bucket, when the used context length approaches the max context length. So, you need to be patient to get to the assert.

I understand. My Android device is maxed at 3 tokens/second, so it'll take a while. Thank you for explaining how it works - it'll show an ETA after finishing the 1st bucket.
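To make the ETA logic described above concrete: once the first bucket finishes, a linear extrapolation over the remaining buckets gives the estimate. A trivial sketch of that arithmetic follows; the timings used are made-up assumptions for illustration, not measurements.

```python
def eta_hours(first_chunk_seconds, n_chunks):
    # Assumes every chunk takes roughly as long as the first one.
    return first_chunk_seconds * n_chunks / 3600

# e.g. 40 chunks at 8k context, with a hypothetical 5 minutes per chunk:
print(f"~{eta_hours(5 * 60, 40):.1f} hours")
```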

@github-actions github-actions bot added the stale label Mar 25, 2024
Contributor

github-actions bot commented Apr 9, 2024

This issue was closed because it has been inactive for 14 days since being marked as stale.

@github-actions github-actions bot closed this as completed Apr 9, 2024