
Llama2_7b Example Will Crash When the Model Outputs Too Many Words #378

Open
shenzhiy21 opened this issue Jan 13, 2025 · 1 comment
@shenzhiy21 (Contributor)
How to Reproduce

Just make the model keep generating new tokens non-stop until the generated sequence length exceeds the default seq_len.

For example, change the prompt to

prompt = 'a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a'

and it will crash after generating 1022 tokens:

    local_cache = val_cache.select(0, l).narrow(0, pos, 3)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (1022) + length (3) exceeds dimension size (1024).
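
The failure can also be reproduced in isolation; the snippet below is a minimal sketch with a tiny stand-in tensor whose second dimension matches the default seq_len of 1024 (the other sizes are illustrative, not the repository's real shapes):

    import torch

    seq_len, dim = 1024, 8                      # seq_len matches the default; dim is illustrative
    val_cache = torch.zeros([2, seq_len, dim])  # tiny stand-in for the real KV cache

    pos = 1022                                  # the position reported in the traceback
    # rows 1022, 1023 and 1024 are requested from a 1024-row cache, so PyTorch raises:
    # RuntimeError: start (1022) + length (3) exceeds dimension size (1024).
    local_cache = val_cache.select(0, 0).narrow(0, pos, 3)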

How to Fix

The bug is due to the construction of local_cache:

local_cache = val_cache.select(0, l).narrow(0, pos, 3)

When pos reaches seq_len - 2, narrow(0, pos, 3) requests three rows starting at position 1022 from a cache that only has seq_len = 1024 rows, so constructing local_cache from val_cache raises the error above.

For a quick (but perhaps not "beautiful") fix, just change line 74 to

val_cache = torch.zeros([n_layers, seq_len + 3, dim], dtype=data_type, device=device).clone()

to reserve extra space for local_cache.
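
A more defensive alternative is to bound the narrow() call instead of over-allocating the cache. The helper below is only a sketch (build_local_cache is a hypothetical name, not a function in the repository), assuming the generation loop stops decoding once it gets None back:

    import torch

    def build_local_cache(val_cache, layer, pos, window=3):
        # Bounds-aware version of the crashing line: return None once the
        # cache is exhausted so the caller can stop decoding cleanly.
        seq_len = val_cache.size(1)
        if pos + window > seq_len:
            return None
        return val_cache.select(0, layer).narrow(0, pos, window)

    # shapes are illustrative; the second dimension matches the default seq_len
    val_cache = torch.zeros([2, 1024, 8])
    assert build_local_cache(val_cache, 0, 1021) is not None
    assert build_local_cache(val_cache, 0, 1022) is None   # would previously have crashed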

@ghostplant (Contributor) commented Jan 13, 2025

Thank you. The KV cache is set up for 1024 - 2 words by default. To support a longer context, you can also change seq_len from 1024 to 4096 here
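
For reference, a sketch of that change, assuming the cache is allocated as in the example (the configuration values below are illustrative stand-ins for what the script actually defines):

    import torch

    # illustrative stand-ins for the script's model configuration
    n_layers, dim = 32, 4096
    data_type, device = torch.float16, 'cpu'

    seq_len = 4096   # raised from the default 1024, as suggested above
    val_cache = torch.zeros([n_layers, seq_len, dim], dtype=data_type, device=device)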
