decoding LM vocab #19
Comments
The mapping between token IDs and words is stored in the tokenizer, which is this line: https://github.com/HazyResearch/state-spaces/blob/83a9f136a6353648681cdd5dcc2a0eac48a69340/src/dataloaders/lm.py#L431 If you've already trained your model, the tokenizer should be cached: https://github.com/HazyResearch/state-spaces/blob/83a9f136a6353648681cdd5dcc2a0eac48a69340/src/dataloaders/lm.py#L469 We have not actually tried to generate from the trained LM, so unfortunately we can't help you with this. Let us know if you get it working and maybe we can incorporate it into a PR.
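For illustration, here is a minimal sketch of loading a cached vocabulary and mapping integer IDs back to words. The cache path is hypothetical, and it assumes the cached object is a Transformer-XL-style `Vocab` saved with `torch.save`, exposing `idx2sym` (index to token string) and `sym2idx` (token string to index); adjust the path and attribute names to whatever your run actually produced.

```python
import torch

# Hypothetical location -- point this at wherever your run cached the vocab/tokenizer.
CACHE_PATH = "data/wt103/cache/vocab.pt"

# Assumption: the cache holds a Vocab object with idx2sym / sym2idx mappings,
# as in the Transformer-XL-style vocabulary the dataloader is based on.
vocab = torch.load(CACHE_PATH)

def decode(token_ids):
    """Map a sequence of integer token IDs back to a whitespace-joined string."""
    return " ".join(vocab.idx2sym[i] for i in token_ids)

def encode(text):
    """Map a whitespace-tokenized string to a list of integer token IDs.

    Falls back to the <unk> entry for out-of-vocabulary words; the exact
    unknown-token symbol is an assumption and may differ in your vocab.
    """
    unk = vocab.sym2idx.get("<unk>", 0)
    return [vocab.sym2idx.get(tok, unk) for tok in text.split()]

# Usage: decode([10, 42, 7]) returns whatever words those IDs map to in your trained vocab.
```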
Great, thanks. I'm still working on getting it running; if I do, I'll let you know.
The generation script has been improved, and we now have a trained WikiText-103 checkpoint that generates text. Instructions can be found here.
Hello, I trained a model using something like the wt103 task and modified the SaShiMi generation script to generate text like a CLM: basically, conditioning on a text string, it generates the next N words sequentially in the same loop as the SaShiMi generation script. I believe I have it working; however, I don't know which word in the vocab each integer output corresponds to. Is there a hash table or something that stores the vocab somewhere easily accessible? Sorry, I can't seem to find any obvious place where it would reside. Thank you for your help.
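For what it's worth, a rough sketch of the kind of conditional generation loop described above, with the ID-to-word mapping applied at the end. `model` and `vocab` are placeholders rather than the actual repo API: it assumes `vocab` exposes `sym2idx` / `idx2sym` and that `model(input_ids)` returns logits of shape (batch, seq_len, vocab_size), so adapt it to the real interfaces (e.g. the stateful step-by-step decoding used in the SaShiMi script).

```python
import torch
import torch.nn.functional as F

@torch.no_grad()
def generate(model, vocab, prompt, n_new_tokens=50, temperature=1.0, device="cuda"):
    """Condition on a text prompt, then sample n_new_tokens autoregressively."""
    model.eval()

    # Encode the prompt (whitespace tokenization, matching a word-level vocab).
    ids = [vocab.sym2idx[tok] for tok in prompt.split()]
    x = torch.tensor(ids, dtype=torch.long, device=device).unsqueeze(0)

    for _ in range(n_new_tokens):
        logits = model(x)                                   # (1, seq_len, vocab_size)
        next_logits = logits[0, -1] / temperature           # logits for the next token
        probs = F.softmax(next_logits, dim=-1)
        next_id = torch.multinomial(probs, num_samples=1)   # sample one token ID
        x = torch.cat([x, next_id.view(1, 1)], dim=1)       # append and continue

    # Map all generated IDs back to words via the vocab's index-to-symbol table.
    return " ".join(vocab.idx2sym[i] for i in x[0].tolist())
```

This re-feeds the full sequence at every step for simplicity; the recurrent/stateful decoding in the actual generation script avoids that quadratic cost.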