[BART] issues on BPE preprocess (examples.roberta.multiprocessing_bpe_encoder) #1423

Closed
wonjininfo opened this issue Nov 25, 2019 · 5 comments

Comments

@wonjininfo

wonjininfo commented Nov 25, 2019

Hi,
Congratulations on the great work! I appreciate you all for making these resources publicly available.

I was following the README on fine-tuning BART on the CNN-DM task.

While performing step 2) BPE preprocess, I ran into some problems.

Here are the details of my problems:

  1. I found that the numbers of lines in train.bpe.source and train.bpe.target are not identical.
    Both should have 287227 lines, but train.bpe.source ends up with 247 extra lines after processing train.source.
ubuntu@server:~/fairseq/cnn_dm$ wc -l *
     11490 test.source
     11490 test.target
    287474 train.bpe.source  <= not matching
    287227 train.bpe.target
    287227 train.source
    287227 train.target
     13368 val.bpe.source
     13368 val.bpe.target
     13368 val.source
     13368 val.target
    200000 vocab
   1425607 total
  2. While trying to check problem 1, I ran into another issue that seems closely related to it.
    When I check val.bpe.target, the first BPE-encoded sentence looks like the following:
    32 582 287 20154 6182 318 6301 6729 2691 284 4297 287 23254 2585 13 1114 720 4531 11 339 481 4074 718 8059 286 6729 287 281 47869 42378 305 6513 321 3091 13
    Decoding it with bart.decode() gives: are pay As spellszi If km wages Women familybut Asolia Con for idea global85 in win free 51il temporarily For wages AsasAlternativelyStage W Fin 0 sites for.
    It should instead read: A man in suburban Boston is selling snow online to customers in warmer states. For $89, he will ship 6 pounds of snow in an insulated Styrofoam box.
    The same problem applies to the other BPE-processed files.

It seems that I am missing something.
I am checking this on:

  • Python 3.6
  • stanford-corenlp-3.7.0.jar (3.9 also checked)
  • PyTorch 1.0
  • CUDA 10.0
  • Ubuntu 16.04

Would you share any thoughts on the matter? It would help me a lot.
Once again, thank you very much!
WonJin

@ngoyal2707 @yinhanliu

@astariul
Contributor

After checking, I'm also seeing problem 2.

(However, I don't see problem 1.)

@astariul
Contributor

astariul commented Nov 25, 2019

It turns out problem 2 is not a problem; it's expected behavior.


After reading the source code, this is what I understood:

The encode() method of BART's hub interface does 2 things:

  1. BPE encoding
  2. Binarization (?)

That's why the preprocessing has 2 steps: BPE-encoding of the dataset, then binarization of the dataset.


In BART's encode method, the 2 steps are done here:

  1. https://github.com/pytorch/fairseq/blob/5349052aae4ec1350822c894fbb6be350dff61a0/fairseq/models/bart/hub_interface.py#L64
  2. https://github.com/pytorch/fairseq/blob/5349052aae4ec1350822c894fbb6be350dff61a0/fairseq/models/bart/hub_interface.py#L71

Because the .bpe files contain only the output of the first step, we should compare them against the first step alone.


And indeed:

bart.bpe.encode("A man in suburban Boston is selling snow online to customers in warmer states. For $89, he will ship 6 pounds of snow in an insulated Styrofoam box.")

gives:

32 582 287 20154 6182 318 6301 6729 2691 284 4297 287 23254 2585 13 1114 720 4531 11 339 481 4074 718 8059 286 6729 287 281 47869 42378 305 6513 321 3091 13


And similarly:

bart.bpe.decode('32 582 287 20154 6182 318 6301 6729 2691 284 4297 287 23254 2585 13 1114 720 4531 11 339 481 4074 718 8059 286 6729 287 281 47869 42378 305 6513 321 3091 13')

gives:

A man in suburban Boston is selling snow online to customers in warmer states. For $89, he will ship 6 pounds of snow in an insulated Styrofoam box.
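
To put the two stages side by side, here is a minimal sketch (loading BART via torch.hub; the <s>/</s> handling is my assumption about what encode() does internally, not copied from the source):

import torch

# load a released BART checkpoint through torch.hub
bart = torch.hub.load('pytorch/fairseq', 'bart.large')
text = ("A man in suburban Boston is selling snow online to customers in warmer states. "
        "For $89, he will ship 6 pounds of snow in an insulated Styrofoam box.")

# stage 1: GPT-2 BPE only -- this is what the *.bpe.* files contain
bpe_ids = bart.bpe.encode(text)   # -> "32 582 287 20154 ..."

# stage 2: map the BPE-id string through the fairseq Dictionary (binarization)
dict_ids = bart.task.source_dictionary.encode_line(
    '<s> ' + bpe_ids, append_eos=True, add_if_not_exist=False
)

# bart.encode() runs both stages at once, so its output should be compared with
# dict_ids, not with the stage-1 ids stored in the *.bpe.* files
both = bart.encode(text)

print(bart.bpe.decode(bpe_ids))   # recovers the original sentences
print(bart.decode(both))          # also recovers the original sentences

In short: the *.bpe.* files are meant to be decoded with bart.bpe.decode(), while bart.decode() expects the fully binarized ids.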


Confirmation of my understanding by the original authors would be appreciated!

@wonjininfo
Author

wonjininfo commented Nov 25, 2019

Thanks, @colanim!
I found the cause of problem 1 as well. It was not related to the fairseq code.

The dataset occasionally contains the ASCII character 0D, i.e. CR (carriage return).
It seems the BPE encoder replaces CR with LF (line feed), which is normal behavior; this is where the extra lines come from.

For other researchers: CR should be replaced with a plain space during step 1 ("1) Follow the instructions here to download and process into data-files with non-tokenized cased samples."); see the sketch below.

For example, check line 35711 of train.source, starting with: -- Brian Steel was taught from birth that he was "handicapped."
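
A minimal sketch of what I mean (file names assume the cnn_dm/ layout from the README; adjust paths to your setup):

# strip_cr.py -- replace stray carriage returns with a plain space so the BPE
# encoder does not turn one sample into several lines (a sketch, not part of
# the official preprocessing scripts)
for split in ("train", "val", "test"):
    for side in ("source", "target"):
        path = f"cnn_dm/{split}.{side}"
        # newline="" keeps raw '\r' characters visible instead of letting
        # Python's universal-newline handling silently translate them
        with open(path, encoding="utf-8", newline="") as f:
            text = f.read()
        with open(path, "w", encoding="utf-8", newline="") as f:
            f.write(text.replace("\r", " "))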

@myleott
Contributor

myleott commented Nov 25, 2019

@colanim yes, that's correct. It's a two-stage encoding process: first BPE encoding, followed by encoding with the fairseq Dictionary.

@wonjininfo, glad you got it working :)

myleott closed this as completed Nov 25, 2019
@zhaoguangxiang

0D

Hi, I could not fix it, so I replaced '\r' with ' '.
link: #1391 (comment)

facebook-github-bot pushed a commit that referenced this issue Nov 11, 2020
Summary:
this adds an argument to load_dataset that provides task configuration from the checkpoint. different tasks can decide what to do with it afterwards.

Pull Request resolved: fairinternal/fairseq-py#1423

Reviewed By: myleott

Differential Revision: D24875706

Pulled By: alexeib

fbshipit-source-id: 5bb1e2b7495520c456024dc7b0751b65cb05b473
sshleifer pushed a commit that referenced this issue Apr 7, 2021
Summary:
this adds an argument to load_dataset that provides task configuration from the checkpoint. different tasks can decide what to do with it afterwards.

Pull Request resolved: fairinternal/fairseq-py#1423

Reviewed By: myleott

Differential Revision: D24875706

Pulled By: alexeib

fbshipit-source-id: 5bb1e2b7495520c456024dc7b0751b65cb05b473