All newly added files have the following prefix at the beginning:
###############################################################################
# CUSTOM MODULE FOR GEC
###############################################################################
Among all changes, the only modification that requires a package rebuild is fairseq/data/token_labeled_language_pair_dataset.py and its import in fairseq/data/__init__.py, because datasets are not registered separately.
Please run:
pip install --upgrade --editable .
eval_lm_fp16.py: single-line edit for fp16 LM evaluation
fairseq/models/copy_augmented_transformer_el.py: copy-augmented transformer + edit label prediction model definition
fairseq/data/token_labeled_language_pair_dataset.py: custom dataset loader for "m3" (i.e. ori-cor sentence pairs along with token-level edit labels)
fairseq/data/__init__.py: include token_labeled_language_pair_dataset in the module definition (somehow there's no registry for datasets)
fairseq/criterion/gec_loss.py: weighted cross-entropy using target-side edit labels, along with an auxiliary source-side edit label prediction loss
fairseq/tasks/gec.py: define a GEC task using custom models, datasets, and losses
fairseq/sequence_copygenerator.py: a "fork" of fairseq/sequence_generator.py that also keeps track of and returns copy scores in decoding
generate_or_copy.py: generation with <unk> tokens replaced based on copy scores
fairseq/scripts/test_gec_modules.py: unit tests for newly created modules
lm_scorer.py: scoring using pre-trained neural language models
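
To illustrate the idea behind generate_or_copy.py (replacing <unk> in the output based on copy scores), here is a minimal, self-contained sketch. It is not the code from this repository: the function name replace_unk_with_copy, the plain-list inputs, and the assumption that copy_scores[t][s] holds the copy score of source position s at target step t are all hypothetical and chosen for illustration only.

```python
from typing import List, Sequence

UNK = "<unk>"


def replace_unk_with_copy(
    hypo_tokens: Sequence[str],
    src_tokens: Sequence[str],
    copy_scores: Sequence[Sequence[float]],
    unk: str = UNK,
) -> List[str]:
    """Replace each <unk> with the source token that has the highest copy score.

    Assumption (hypothetical layout): copy_scores[t][s] is the copy score of
    source position s at target step t.
    """
    if len(copy_scores) != len(hypo_tokens):
        raise ValueError("need one row of copy scores per hypothesis token")

    output = []
    for token, scores in zip(hypo_tokens, copy_scores):
        if token != unk:
            # Ordinary tokens are kept as generated.
            output.append(token)
            continue
        if len(scores) != len(src_tokens):
            raise ValueError("each score row needs one entry per source token")
        # Pick the source position with the largest copy score at this step.
        best = max(range(len(scores)), key=scores.__getitem__)
        output.append(src_tokens[best])
    return output


src = ["He", "visited", "Reykjavik", "yesterday", "."]
hypo = ["He", "visited", "<unk>", "yesterday", "."]
scores = [
    [0.90, 0.03, 0.03, 0.02, 0.02],
    [0.05, 0.85, 0.05, 0.03, 0.02],
    [0.02, 0.08, 0.80, 0.05, 0.05],
    [0.01, 0.04, 0.05, 0.85, 0.05],
    [0.01, 0.01, 0.02, 0.06, 0.90],
]
print(" ".join(replace_unk_with_copy(hypo, src, scores)))
# prints: He visited Reykjavik yesterday .
```

In the actual pipeline, the copy scores would come from the generator in fairseq/sequence_copygenerator.py; the exact tensor layout it returns is not documented here, so any real use would need to adapt the indexing accordingly.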