Batched & chunked prefill #216
@lucasavila00, this looks great. It'll require modifying the attention mask calculation of every model, so it may be helpful to factor those out into a …
Closed
EricLBuehler added the models (Additions to model or architectures) label and removed the urgent label on Apr 28, 2024
@lucasavila00, I am actually going to end up adding this in #242.
@lucasavila00 this has been added already!
Similar to what was described in huggingface/candle#2108:
"When prompts get longer than trivial sizes, the memory usage spikes as the prompt is thrown into one Tensor and sent off to a forward pass in the model at whatever length it comes in as. These spikes can be reduced by processing the batch in chunks."
There's a candle implementation in huggingface/candle#2111.
Let's say we configure a setting batch_size = 512.
The scheduler would need to be aware of it and, for example, only schedule two prompts together if they are fewer than 512 tokens combined.
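A minimal sketch of that scheduler-side budget check, assuming a hypothetical Sequence type that knows its prompt length (the names here are illustrative, not the actual mistral.rs API):

```rust
/// Illustrative only: a waiting sequence with a known prompt length.
struct Sequence {
    prompt_len: usize,
}

/// Pick sequences for the next prefill batch without exceeding the
/// configured token budget (e.g. batch_size = 512). Returns the indices
/// of the sequences that fit.
fn schedule_prefill(waiting: &[Sequence], batch_size: usize) -> Vec<usize> {
    let mut scheduled = Vec::new();
    let mut budget = batch_size;
    for (i, seq) in waiting.iter().enumerate() {
        if seq.prompt_len <= budget {
            budget -= seq.prompt_len;
            scheduled.push(i);
        } else {
            // No room left in this batch; leave the rest for a later step.
            break;
        }
    }
    scheduled
}

fn main() {
    let waiting = vec![
        Sequence { prompt_len: 200 },
        Sequence { prompt_len: 250 },
        Sequence { prompt_len: 300 },
    ];
    // With batch_size = 512, only the first two prompts fit (200 + 250 = 450 tokens).
    assert_eq!(schedule_prefill(&waiting, 512), vec![0, 1]);
}
```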
And the engine should be aware of it too: if a sequence's prompt is longer than 512 tokens, it should be split into chunks.
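And a sketch of the engine-side split, again with hypothetical names: the important detail is that each chunk's forward pass is told the position of its first token, so the KV cache and the causal attention mask stay consistent across chunks:

```rust
/// Illustrative stand-in for a model that can prefill one chunk of tokens,
/// given the position of the chunk's first token within the full prompt.
trait PrefillModel {
    fn forward_prefill(&mut self, tokens: &[u32], start_pos: usize);
}

/// Split a long prompt into chunks of at most `chunk_size` tokens and run
/// one forward pass per chunk instead of a single large prefill pass.
fn chunked_prefill<M: PrefillModel>(model: &mut M, prompt: &[u32], chunk_size: usize) {
    for (i, chunk) in prompt.chunks(chunk_size).enumerate() {
        model.forward_prefill(chunk, i * chunk_size);
    }
}

struct DummyModel;

impl PrefillModel for DummyModel {
    fn forward_prefill(&mut self, tokens: &[u32], start_pos: usize) {
        println!("prefilling {} tokens starting at position {}", tokens.len(), start_pos);
    }
}

fn main() {
    let prompt: Vec<u32> = (0..1300).collect();
    // With chunk_size = 512 this runs three passes: 512 + 512 + 276 tokens.
    chunked_prefill(&mut DummyModel, &prompt, 512);
}
```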
To reproduce it locally, run the benchmark with a high enough -p and you get an OOM, but generating this same number of tokens works.