Reset token budget after every user intervention. #306
Conversation
In interactive mode, every time the model has to respond to user input it has an increasingly reduced token budget, eventually generating only a few words before stopping. The token budget in interactive mode should apply to every batch of tokens after a user intervention, not globally.
main.cpp (Outdated)
@@ -1054,11 +1054,11 @@ int main(int argc, char ** argv) {
             embd_inp.insert(embd_inp.end(), inp_sfx.begin(), inp_sfx.end());
         }

-        remaining_tokens -= line_inp.size();
+        remaining_tokens = params.n_predict - line_inp.size();
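To make the problem concrete, here is a small, self-contained sketch (illustration only, not llama.cpp code; the token counts and stand-in values are made up) of how the shared budget shrinks across interventions with the old behaviour:

// Illustration of the shrinking-budget arithmetic, using stand-in values
// for params.n_predict and the per-turn token counts.
#include <cstdio>

int main() {
    const int n_predict = 64;  // stand-in for params.n_predict
    const int line_len  = 10;  // tokens in each user input (line_inp.size())
    const int gen_len   = 20;  // tokens generated before the next user input

    int remaining_tokens = n_predict;
    for (int turn = 1; turn <= 3; ++turn) {
        remaining_tokens -= gen_len;   // generation consumes the budget
        remaining_tokens -= line_len;  // old behaviour: user input consumes it too
        std::printf("after turn %d: %d tokens left\n", turn, remaining_tokens);
    }
    // With this patch, remaining_tokens would instead be reset to
    // params.n_predict - line_inp.size() after each user input, so every
    // response starts with (roughly) the same budget.
    return 0;
}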
This can get bigger than the remaining space in the context.
And after https://github.com/ggerganov/llama.cpp/blob/da5303c1ea68aa19db829c634f1e10d08d409680/main.cpp#L850, remaining_tokens is actually all the space that is left, no?
Resetting remaining_tokens to params.n_predict would only make sense when we reset the memory, which we don't right now. See #71.
I see. Yes, going over the context size can be a problem. But remaining_tokens is usually smaller than the size of the context (because params.n_predict is), so it should still be reset after every interaction with the user, so that each series of tokens can be as long as the first one, or as long as the remaining context space allows. It should be clamped so it never exceeds the remaining space in the context.

std::min(params.n_predict, model.hparams.n_ctx - (int) embd_inp.size())

is exactly the formula that should be used instead of the simple assignment I did to reset it, so that it doesn't overflow the context. And in fact, it should also be used when resetting it due to running out of tokens. Good catch.
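As a rough sketch of that clamped reset (illustrative only; the values for n_ctx and the prompt length are assumed, and still subtracting the user's input afterwards is carried over from the original patch rather than stated in the discussion):

// Illustration of the clamped reset: the budget is capped by the space
// left in the context window, so it can never overflow the context.
#include <algorithm>
#include <cstdio>
#include <vector>

int main() {
    const int n_predict = 128;       // params.n_predict
    const int n_ctx     = 512;       // model.hparams.n_ctx (assumed value)
    std::vector<int> embd_inp(440);  // tokens already in the context (assumed)
    std::vector<int> line_inp(20);   // tokens from the latest user input

    // Naive reset from this PR: can exceed the remaining context space.
    int naive = n_predict - (int) line_inp.size();

    // Clamped reset discussed above: never larger than what is left in the
    // context; the user's input is still subtracted, as in the original patch.
    int clamped = std::min(n_predict, n_ctx - (int) embd_inp.size())
                  - (int) line_inp.size();

    std::printf("naive reset = %d, clamped reset = %d\n", naive, clamped);
    return 0;
}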
I'm closing this pull request because it doesn't make sense as things stand, given the lack of context shifting. Once that's figured out, this will have to be implemented differently.