When I download the BARTModel stored in 'chemical/checkpoints/bart.large', it comes with a dict.txt that is ~52k lines. To my understanding, this is the BPE vocabulary for all of the tokens BART was trained on.
Now, when I try to load the model while keeping the default dict.txt, I get an error saying that the number of tokens in the model does not match the checkpoint I am trying to load.
But if I instead use the dict.txt generated by preprocessing.py (the same contents as chem.vocab.fs), the model loads fine.
My question is: is this valid? Do I need a separate dict.txt? I'm concerned because the original dict.txt from BART-large contains {token, count} pairs, whereas the dict.txt we are using contains {str, token} pairs.
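
For what it's worth, one way to sanity-check the mismatch is to compare the vocabulary size implied by a dict.txt against the embedding table stored in the checkpoint. This is just a sketch using standard fairseq/torch calls; the file names below are placeholders for your local paths:

```python
# Sketch: compare the vocab size implied by a dict.txt against the embedding
# table stored in the checkpoint. The file names below are placeholders.
import torch
from fairseq.data import Dictionary

# Dictionary.load parses fairseq's "token count" lines and prepends the
# special symbols (<s>, <pad>, </s>, <unk>), so len() is the full vocab size.
vocab = Dictionary.load("chemical/checkpoints/bart.large/dict.txt")

state = torch.load("chemical/checkpoints/bart.large/model.pt", map_location="cpu")
rows = state["model"]["encoder.embed_tokens.weight"].shape[0]

print(f"dict.txt vocab size:       {len(vocab)}")
print(f"checkpoint embedding rows: {rows}")
# If these two numbers disagree, from_pretrained fails with the size-mismatch
# error; the dict.txt written by preprocessing should make them agree.
```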
Hi,
I am loading this model via fairseq in Python, more or less copying the code in compute_score.py, lines 79-81. Here is what I am calling:

bart = BARTModel.from_pretrained(model, checkpoint_file=chkpt_path, bpe="sentencepiece", sentencepiece_model=f"{root}/BARTSmiles/chemical/tokenizer/chem.model")
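
One thing that may explain the behaviour (my reading of fairseq's hub_utils, not anything specific to this repo): from_pretrained looks for dict.txt in the model directory you pass as the first argument, so that directory has to hold the dictionary matching the checkpoint's vocabulary. A sketch with the paths spelled out; the checkpoint file name here is made up:

```python
# Sketch: the first argument to from_pretrained is the directory fairseq
# searches for dict.txt, so the dict there must match the checkpoint's vocab.
# "model.pt" is a placeholder for the actual checkpoint file name.
from fairseq.models.bart import BARTModel

bart = BARTModel.from_pretrained(
    "chemical/checkpoints/bart.large",  # directory containing the matching dict.txt
    checkpoint_file="model.pt",
    bpe="sentencepiece",
    sentencepiece_model="chemical/tokenizer/chem.model",
)
```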