What are the data formats of the dataset and vocab folders? #81
Comments
Jieba is used only for constructing whole-word masking, so it does not affect the model's tokenizer. If you want to replace it, you can either change Jieba's dictionary by following this link, which helps Jieba recognize new words in your training data, or use another word segmenter by changing the dataloader in the pre-training codebase.
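To illustrate the point above: whole-word masking only needs a word segmentation (e.g. from Jieba) to decide which spans of characters are masked together; the model's own tokenizer is untouched. Below is a minimal, hypothetical sketch of that idea — the function name, the sample sentence, and the hard-coded segmentation are mine, not from the codebase:

```python
import random

def whole_word_mask(words, mask_prob=0.15, seed=0):
    # `words` is a word segmentation (e.g. Jieba output) of a sentence.
    # One masking decision is drawn per word, then copied to every
    # character of that word, so a word is always masked as a whole.
    rng = random.Random(seed)
    mask = []
    for word in words:
        m = 1 if rng.random() < mask_prob else 0
        mask.extend([m] * len(word))  # whole word shares one decision
    return mask

# Hypothetical Jieba-style segmentation of a short sentence.
words = ["我们", "喜欢", "自然", "语言", "处理"]
print(whole_word_mask(words, mask_prob=0.5))
```

Swapping in a different segmenter (or an updated Jieba dictionary, e.g. via `jieba.load_userdict`) only changes the `words` list fed in here; the character-level mask the model sees keeps the same format.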
After I prepared the input data, I decided to pre-train in the BART format. While running, I hit an error that I have no clue how to solve. Can you help?
In the README of the pre-training code, it mentions that dataset, vocab and roberta_zh have to be prepared before training. Is there any example of the files in the dataset and vocab folders? Also, what do you mean by "Place the checkpoint of Chinese RoBERTa"? I would like to train Chinese BART.
Last, if I wish to replace the Jieba tokenizer with my custom tokenizer, how can I do so? Thanks.