When loading BART via fairseq, which dict.txt should be in the model directory? #8

Open
kosonocky opened this issue Dec 9, 2022 · 0 comments

@kosonocky

Hi,

I am loading this model via fairseq in Python, more or less copying the code in compute_score.py, lines 79-81:

bart = BARTModel.from_pretrained(model, checkpoint_file = chkpt_path, bpe="sentencepiece", sentencepiece_model=f"{root}/BARTSmiles/chemical/tokenizer/chem.model")

And here is what I am calling:

bart = BARTModel.from_pretrained('chemical/checkpoints/bart.large',
                                 checkpoint_file='pretrained.pt',
                                 bpe='sentencepiece',
                                 sentencepiece_model='chemical/tokenizer/chem.model')

When I download the BART model that is stored in 'chemical/checkpoints/bart.large', it comes with a dict.txt that is ~52k lines long. To my understanding, this is the BPE vocabulary for all of the words that BART was trained on.

Now, when I run my code to load the model while keeping the default dict.txt, I get an error saying that the number of tokens in the model does not match the checkpoint I am trying to load.
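
For context, here is a minimal sketch of how the two sizes can be compared, assuming the standard fairseq checkpoint layout (embedding weights under 'model' / 'encoder.embed_tokens.weight') and fairseq's four default special tokens:

```python
import torch
from fairseq.data import Dictionary

# Number of embedding rows the checkpoint expects (standard fairseq BART key).
ckpt = torch.load("chemical/checkpoints/bart.large/pretrained.pt", map_location="cpu")
emb_rows = ckpt["model"]["encoder.embed_tokens.weight"].shape[0]

# Vocabulary size implied by dict.txt: one entry per line, plus the four
# default specials (<s>, <pad>, </s>, <unk>) that fairseq prepends
# (and possibly "madeupword" padding symbols appended at load time).
vocab = Dictionary.load("chemical/checkpoints/bart.large/dict.txt")

print(emb_rows, len(vocab))  # these have to agree for from_pretrained to work
```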

But if I instead place the dict.txt that gets generated by the preprocessing.py file (the same as chem.vocab.fs) in the model directory, the model loads fine.

My question is: is this valid? Do I need a separate dict.txt? I'm concerned because the original dict.txt from BART-large consists of {token, count} pairs, whereas the other dict.txt we are using consists of {str, token} pairs.
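
To make the comparison concrete, both files can be inspected with fairseq's own dictionary loader; a sketch (the path to the generated dict.txt below is hypothetical):

```python
from fairseq.data import Dictionary

# The ~52k-entry dict.txt shipped with bart.large.
orig = Dictionary.load("chemical/checkpoints/bart.large/dict.txt")

# The dict.txt produced by preprocessing.py (hypothetical location).
chem = Dictionary.load("chemical/tokenizer/dict.txt")

# Sizes and the first few non-special symbols of each dictionary.
print(len(orig), orig.symbols[4:8])
print(len(chem), chem.symbols[4:8])
```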
