Grammatical error correction utilities for data preparation and results evaluation.
This section introduces the steps to prepare GEC parallel data. The datasets considered in this repo are FCE, NUCLE, Lang-8, Write&Improve+LOCNESS, CoNLL-2014 and JFLEG.
- First, download the GEC datasets from their websites.
  - FCE, NUCLE, Lang-8 and Write&Improve+LOCNESS datasets can be downloaded from the BEA-2019 website.
  - CoNLL-2014 can be downloaded from the CoNLL-2014 website.
  - JFLEG can be downloaded from the JFLEG repo.
- Compress the datasets into datasets.zip and place datasets.zip in the root directory so that the data preparation script can find it. datasets.zip contains:
  - fce_v2.1.bea19.tar.gz // FCE dataset tar ball
  - release3.3.tar.bz2 // NUCLE dataset tar ball
  - lang8.bea19.tar.gz // Lang-8 dataset tar ball
  - wi+locness_v2.1.bea19.tar.gz // Write&Improve+LOCNESS dataset tar ball
  - conll14st-test-data.tar.gz // CoNLL-2014 dataset tar ball
  - jfleg.tar.gz // JFLEG dataset tar ball made by compressing the cloned repo
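Before running the preparation script, it may help to confirm that datasets.zip holds all six tar balls listed above. The following is a minimal sketch, not part of the repo: the helper name `missing_tarballs` is hypothetical, and the check compares base names only, since the folder layout the preparation script expects inside the archive is not stated here.

```python
import zipfile

# File names copied from the list above.
EXPECTED_TARBALLS = [
    "fce_v2.1.bea19.tar.gz",
    "release3.3.tar.bz2",
    "lang8.bea19.tar.gz",
    "wi+locness_v2.1.bea19.tar.gz",
    "conll14st-test-data.tar.gz",
    "jfleg.tar.gz",
]


def missing_tarballs(zip_path="datasets.zip"):
    """Return the expected tar balls that are absent from the archive.

    Hypothetical helper: only base names are compared, so a tar ball stored
    inside a folder in the archive still counts as present.
    """
    with zipfile.ZipFile(zip_path) as archive:
        present = {name.rsplit("/", 1)[-1] for name in archive.namelist()}
    return [name for name in EXPECTED_TARBALLS if name not in present]
```

Calling `missing_tarballs()` from the root directory returns an empty list when every tar ball is present.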
Run the command below to prepare the GEC data.
sh prepare_data.sh
A data directory will be generated with three subdirectories: tar, ori and raw.
- tar: contains the GEC dataset tar balls unzipped from datasets.zip.
- ori: contains the files extracted from each tar ball in the tar directory.
- raw: contains parallel GEC data files, which can be tokenized later according to specific needs such as model training and testing.
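A quick sanity check on the raw directory is to confirm that each source file and its target file have the same number of lines. The sketch below assumes the files are line-aligned (one sentence per line) and use the .src and .trg suffixes; the helper names are hypothetical and not part of the repo.

```python
from pathlib import Path


def count_lines(path):
    """Count newline characters in a file, similar to `wc -l`."""
    with open(path, "rb") as handle:
        return sum(chunk.count(b"\n") for chunk in iter(lambda: handle.read(1 << 20), b""))


def mismatched_pairs(raw_dir="data/raw"):
    """Return (src_name, src_lines, trg_lines) for pairs whose line counts differ."""
    mismatches = []
    for src in sorted(Path(raw_dir).glob("*.src")):
        trg = src.with_suffix(".trg")
        if not trg.exists():
            # A source file without a target file is skipped, not reported.
            continue
        n_src, n_trg = count_lines(src), count_lines(trg)
        if n_src != n_trg:
            mismatches.append((src.name, n_src, n_trg))
    return mismatches
```

An empty list from `mismatched_pairs()` means every pair found in data/raw is aligned by line count.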
Statistics:
28350 data/raw/fce-train.src
28350 data/raw/fce-train.trg
57151 data/raw/nucle.src
57151 data/raw/nucle.trg
1102354 data/raw/lang8.src
1102354 data/raw/lang8.trg
34308 data/raw/wi_locness-train.src
34308 data/raw/wi_locness-train.trg
1222163 data/raw/train.w.err-free.src
1222163 data/raw/train.w.err-free.trg
624958 data/raw/train.wo.err-free.src
624958 data/raw/train.wo.err-free.trg
4384 data/raw/wi_locness-valid.src
4384 data/raw/wi_locness-valid.trg
4384 data/raw/valid.src
4384 data/raw/valid.trg
2695 data/raw/fce-test.src
2695 data/raw/fce-test.trg
1312 data/raw/conll14.src
1312 data/raw/conll14.trg
4477 data/raw/wi_locness-test.src
747 data/raw/jfleg.src
5783304 total
This section introduces the steps to evaluate GEC outputs. Note that m2scorer is needed for evaluating the FCE-test and CoNLL-2014 outputs. Download m2scorer and place it in the scorers directory.
sh test_fce.sh output_samples/fce.out
- Precision : 0.6129
- Recall : 0.4181
- F_0.5 : 0.5606
sh test_conll14.sh output_samples/conll14.out
- Precision : 0.6835
- Recall : 0.4636
- F_0.5 : 0.6243
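The F_0.5 values above follow from the reported precision and recall through the standard F-beta formula with beta = 0.5, which weights precision more heavily than recall. A minimal sketch for recomputing them is given below; the function name `f_beta` is illustrative and is not taken from m2scorer.

```python
def f_beta(precision, recall, beta=0.5):
    """F_beta = (1 + beta^2) * P * R / (beta^2 * P + R)."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1.0 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall as reported above. They are already rounded to four
# decimals, so the recomputed scores may differ from the reported F_0.5
# in the last digit.
fce_f05 = f_beta(0.6129, 0.4181)      # FCE-test
conll14_f05 = f_beta(0.6835, 0.4636)  # CoNLL-2014
print(f"FCE-test F_0.5: {fce_f05:.4f}")
print(f"CoNLL-2014 F_0.5: {conll14_f05:.4f}")
```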
The Write&Improve+LOCNESS output should be evaluated on Codalab: compress the output into a .zip file and submit it under "Participate".
- p_cs:66.40
- r_cs:61.21
- f0.5_cs:65.29
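The compression step for the Codalab submission can be sketched as follows. This is an illustration under assumptions: the function name and the default archive name submission.zip are hypothetical, and the file name and layout that Codalab expects inside the archive are not stated here, so they should be checked against the task page.

```python
import zipfile
from pathlib import Path


def zip_submission(output_path, zip_path="submission.zip"):
    """Write a .zip archive holding the single output file at its top level."""
    output_path = Path(output_path)
    with zipfile.ZipFile(zip_path, "w", compression=zipfile.ZIP_DEFLATED) as archive:
        # arcname drops the directory part so the file sits at the archive root.
        archive.write(output_path, arcname=output_path.name)
    return zip_path
```

Placing the file at the archive root rather than inside a folder is a design choice made for this sketch, not a documented requirement.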
sh test_jfleg.sh output_samples/jfleg.out
- GLEU: 0.614138
Additionally, data files in data/raw can be converted into m2 format with the following command.
sh prepare_m2.sh