
Llama2_7b Example Will Crash When the Model Outputs Too Many Words #378

Open
shenzhiy21 opened this issue Jan 13, 2025 · 1 comment
@shenzhiy21 (Contributor)
How to Reproduce

Just make the model keep generating new tokens non-stop until the generated sequence length exceeds the default seq_len.

For example, change the prompt to

prompt = 'a a a a a a a a a a a a a a a a a a a a a a a a a a a a a a'

and it will crash after generating 1022 tokens:

    local_cache = val_cache.select(0, l).narrow(0, pos, 3)
                  ^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
RuntimeError: start (1022) + length (3) exceeds dimension size (1024).
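
The failure can also be reproduced in isolation; the snippet below is a minimal sketch with a tiny stand-in tensor whose second dimension matches the default seq_len of 1024 (the other sizes are illustrative, not the repository's real shapes):

    import torch

    seq_len, dim = 1024, 8                      # seq_len matches the default; dim is illustrative
    val_cache = torch.zeros([2, seq_len, dim])  # tiny stand-in for the real KV cache

    pos = 1022                                  # the position reported in the traceback
    # rows 1022, 1023 and 1024 are requested from a 1024-row cache, so PyTorch raises:
    # RuntimeError: start (1022) + length (3) exceeds dimension size (1024).
    local_cache = val_cache.select(0, 0).narrow(0, pos, 3)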

How to Fix

The bug is due to the construction of local_cache:

local_cache = val_cache.select(0, l).narrow(0, pos, 3)

When pos reaches seq_len - 2, narrow(0, pos, 3) requests three rows starting at position 1022 from a cache that only has seq_len = 1024 rows, so constructing local_cache from val_cache raises the error above.

For a quick (but perhaps not "beautiful") fix, just change line 74 to

val_cache = torch.zeros([n_layers, seq_len + 3, dim], dtype=data_type, device=device).clone()

to reserve extra space for local_cache.
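
A more defensive alternative is to bound the narrow() call instead of over-allocating the cache. The helper below is only a sketch (build_local_cache is a hypothetical name, not a function in the repository), assuming the generation loop stops decoding once it gets None back:

    import torch

    def build_local_cache(val_cache, layer, pos, window=3):
        # Bounds-aware version of the crashing line: return None once the
        # cache is exhausted so the caller can stop decoding cleanly.
        seq_len = val_cache.size(1)
        if pos + window > seq_len:
            return None
        return val_cache.select(0, layer).narrow(0, pos, window)

    # shapes are illustrative; the second dimension matches the default seq_len
    val_cache = torch.zeros([2, 1024, 8])
    assert build_local_cache(val_cache, 0, 1021) is not None
    assert build_local_cache(val_cache, 0, 1022) is None   # would previously have crashed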

@ghostplant (Contributor) commented Jan 13, 2025

Thank you. The KV cache is set up for 1024 - 2 words by default. To support a longer context, you can also change seq_len from 1024 to 4096 here
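
For reference, a sketch of that change, assuming the cache is allocated as in the example (the configuration values below are illustrative stand-ins for what the script actually defines):

    import torch

    # illustrative stand-ins for the script's model configuration
    n_layers, dim = 32, 4096
    data_type, device = torch.float16, 'cpu'

    seq_len = 4096   # raised from the default 1024, as suggested above
    val_cache = torch.zeros([n_layers, seq_len, dim], dtype=data_type, device=device)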
