Add interactive mode #61
Conversation
🦙
I know we can't expect much without instruction tuning, but this is hilariously bad.
It always ends in an assertion error for me. Before this pull request got merged I had messed with the code, increasing the max tokens from 512 to 2048 to get longer outputs. Maybe there is some memory limit that needs to be increased to enable it to keep going for longer? Thank you for the chat prompt example, I didn't really realize how good LLaMA could be until now...
@ssvenn I suspect this is because I'm currently not accounting for the tokens that get fed in by subsequent user interactions: the loop ensures that prompt + generated tokens < max tokens, but prompt + generated tokens + subsequent inputs can exceed it, presumably resulting in the crash you see. Shouldn't be too hard to fix... (edit: should hopefully be fixed by 460c482, let me know)
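For illustration, a minimal sketch of the kind of bound the loop needs once user input can be injected mid-generation. The names (`GenState`, `can_accept`, `tokens_in_ctx`) are hypothetical and this is not the actual patch in 460c482; it only shows that the check has to cover every token in the context, not just the original prompt plus the model's output.

```cpp
// Sketch only, not llama.cpp code. The context window has to account for
// prompt tokens, generated tokens, AND tokens the user interjects later.
#include <cstddef>
#include <vector>

struct GenState {
    std::vector<int> tokens_in_ctx; // everything evaluated so far: prompt,
                                    // generated tokens, user interjections
    size_t n_ctx = 512;             // hard context limit (512 by default)
};

// Called before every eval step, including right after interactive input.
// Bounding only "prompt + generated" lets interactive input push past n_ctx
// and trip an assertion (or segfault).
bool can_accept(const GenState & s, size_t n_new_tokens) {
    return s.tokens_in_ctx.size() + n_new_tokens <= s.n_ctx;
}
```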
@blackhole89 Alas, this does not fix the problem. I fear the challenge is buried deeper in the key/value caching mechanism.
Related issue: #71
@semiring Ah, I see. Now that I check it again, the text fragment that you posted only comes out to 628 tokens on my end, so maybe something about the way you extended the max. number of tokens to 2048 did not quite work out. (When I ran out of tokens before the patch earlier, I would simply get a segfault.) Do you have a diff for what you did to the source to increase the max. tokens?
@blackhole89 It was @ssvenn who originally posted this concern, but I've run into the same problem. Let a dialogue run for a number of turns and it will eventually happen every time.
Is there a way to skip having the model recompute the output for the original prompt (via caching or similar) before it starts generating the new text? For a large prompt it takes some time to "reach" the end of the prompt.
@blackhole89 I think this is the core challenge: #71 (comment)
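One way to approach the question above is to keep the tokens already evaluated into the key/value cache, compare them against the new prompt, and only evaluate the suffix that differs. A rough sketch under those assumptions; `eval_tokens` is a placeholder standing in for whatever call feeds tokens to the model, not a real llama.cpp function.

```cpp
// Sketch only: reuse a cached prompt prefix instead of re-evaluating it.
#include <cstddef>
#include <vector>

// Length of the shared prefix between the cached tokens and the new prompt.
size_t common_prefix(const std::vector<int> & cached,
                     const std::vector<int> & prompt) {
    size_t i = 0;
    while (i < cached.size() && i < prompt.size() && cached[i] == prompt[i]) {
        ++i;
    }
    return i;
}

// Placeholder: feed `n` tokens to the model at position `n_past`.
void eval_tokens(const int * /*tokens*/, size_t /*n*/, size_t /*n_past*/) {
    // ... call into the model here ...
}

// Evaluate only the part of the prompt that is not already in the cache.
void eval_prompt_with_cache(const std::vector<int> & cached_tokens,
                            const std::vector<int> & prompt_tokens) {
    const size_t n_reuse = common_prefix(cached_tokens, prompt_tokens);
    eval_tokens(prompt_tokens.data() + n_reuse,
                prompt_tokens.size() - n_reuse,
                /*n_past=*/n_reuse);
}
```

The work saved is proportional to the length of the shared prefix, so a long fixed chat preamble would only have to be evaluated once.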
Introduce `-sysf FNAME` / `--system-file FNAME`; `-e` escapes both the prompt and the system
Add UTF-8 Encoding in read_text.
Add support for an interactive mode, where the user can interject to add more tokens to the context after generation has started. (#23)
Features: