[BART] issues on BPE preprocess (examples.roberta.multiprocessing_bpe_encoder) #1423
After checking, I'm also facing problem 2. (However, I don't encounter problem 1.)
It turns out that problem 2 is not a problem; it's expected behaviour. After reading the source code, this is what I understood: the Hub interface of BART (`bart.encode()` / `bart.decode()`) works with fairseq Dictionary indices, not with raw GPT-2 BPE ids. That is why the preprocessing has 2 steps: BPE-encoding of the dataset, then binarization of the dataset. During binarization, each BPE id is remapped to its index in the fairseq Dictionary (`dict.txt`), so the numbers stored in the binarized dataset differ from the numbers in the `.bpe` files. Because the two id spaces differ, decoding the content of a `.bpe` file with `bart.decode()` produces garbage, while encoding and decoding through the Hub interface round-trips correctly. Confirmation of my understanding by the original authors would be appreciated!
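For what it's worth, here is a minimal sketch of the two id spaces, assuming a locally downloaded `bart.large` checkpoint (the paths and the sample sentence are illustrative; `from_pretrained`, `encode`, `decode`, and the `bpe` attribute are the hub-interface members fairseq exposes):

```python
from fairseq.models.bart import BARTModel

# Load BART through the hub interface.
bart = BARTModel.from_pretrained('bart.large', checkpoint_file='model.pt')

sentence = 'A man in suburban Boston is selling snow online.'

# Stage 1: GPT-2 BPE encoding. This is what
# examples.roberta.multiprocessing_bpe_encoder writes into *.bpe.* files:
# a space-separated string of GPT-2 BPE ids.
bpe_ids = bart.bpe.encode(sentence)

# Stage 2: binarization. bart.encode() runs stage 1 and then remaps each
# BPE symbol to its index in the fairseq Dictionary (dict.txt).
dict_ids = bart.encode(sentence)

print(bpe_ids)   # ids in the GPT-2 BPE space
print(dict_ids)  # ids in the Dictionary space -- different numbers

# bart.decode() inverts stage 2 and then stage 1, so it only
# round-trips Dictionary indices, never raw BPE ids.
print(bart.decode(dict_ids))  # recovers the original sentence
```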
Thanks, @colanim! The dataset occasionally contains the ASCII code `\r` (carriage return, CR). For other researchers: CR should be replaced with a normal blank during step 1 (BPE preprocess). Check line No. 35711 of `train.source`.
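A hedged sketch of that cleanup, assuming the CNN/DM text files live under `cnn_dm/` (the paths are illustrative; `newline=''` matters, because Python's universal-newline handling would otherwise translate the CR away before it can be replaced):

```python
# Replace every carriage return with a blank so a stray CR cannot split
# a document into two lines during BPE encoding.
for split in ('train', 'val', 'test'):
    for side in ('source', 'target'):
        path = f'cnn_dm/{split}.{side}'
        # newline='' disables newline translation on read, so '\r'
        # survives long enough for us to see and replace it.
        with open(path, encoding='utf-8', newline='') as f:
            text = f.read()
        with open(path, 'w', encoding='utf-8', newline='') as f:
            f.write(text.replace('\r', ' '))
```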
@colanim yes, that's correct. It's a two-stage encoding process: first BPE encode, followed by encoding with the fairseq Dictionary. @wonjininfo, glad you got it working :)
Hi, I cannot fix it even after replacing '\r' with ' '.
Summary: this adds an argument to load_dataset that provides task configuration from the checkpoint. different tasks can decide what to do with it afterwards. Pull Request resolved: fairinternal/fairseq-py#1423 Reviewed By: myleott Differential Revision: D24875706 Pulled By: alexeib fbshipit-source-id: 5bb1e2b7495520c456024dc7b0751b65cb05b473
Hi,
Congratulations on the great work! I appreciate you all for making these resources publicly available.
I was following the README on fine-tuning BART on the CNN-DailyMail (CNNDM) summarization task.
While performing step 2) BPE preprocess, I faced some problems. Here are the details:

Problem 1: the line counts of `train.bpe.source` and `train.bpe.target` are not identical. Both should be 287227, but there are 250 additional lines after processing `train.source`.
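For reference, here is a quick way to compare the line counts (a hypothetical snippet; paths assumed as in the README):

```python
# Count lines in the raw and BPE-encoded files to spot the mismatch.
for name in ('train.source', 'train.bpe.source',
             'train.target', 'train.bpe.target'):
    with open(f'cnn_dm/{name}', encoding='utf-8') as f:
        print(name, sum(1 for _ in f))
```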
Problem 2: when I check `val.bpe.target`, the first BPE-encoded sentence shows up like the following:

`32 582 287 20154 6182 318 6301 6729 2691 284 4297 287 23254 2585 13 1114 720 4531 11 339 481 4074 718 8059 286 6729 287 281 47869 42378 305 6513 321 3091 13`

Using `bart.decode()`, I can decode it, and it shows:

`are pay As spellszi If km wages Women familybut Asolia Con for idea global85 in win free 51il temporarily For wages AsasAlternativelyStage W Fin 0 sites for`

which should instead be:

`A man in suburban Boston is selling snow online to customers in warmer states. For $89, he will ship 6 pounds of snow in an insulated Styrofoam box.`
The same problem applies to the other BPE-processed files.
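For reference, one way to sanity-check the `.bpe` files in the right id space, assuming `encoder.json` and `vocab.bpe` were downloaded as in the README (this snippet is a sketch, not part of the repo):

```python
from fairseq.data.encoders.gpt2_bpe import get_encoder

# Build the same GPT-2 BPE codec that multiprocessing_bpe_encoder uses.
bpe = get_encoder('encoder.json', 'vocab.bpe')

# First line of val.bpe.target: space-separated GPT-2 BPE ids.
with open('cnn_dm/val.bpe.target', encoding='utf-8') as f:
    line = f.readline().strip()

# Decoding in the same id space should recover the original sentence;
# bart.decode() garbles it because it expects Dictionary indices.
print(bpe.decode([int(tok) for tok in line.split()]))
```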
It appears that there is some point I missed.
I am checking this on
Would you share any thoughts on the matter? It would help me a lot.
Once again, thank you very much!
WonJin
@ngoyal2707 @yinhanliu