Implementation of the COLING 2024 paper "LM-Combiner: A Contextual Rewriting Model for Chinese Grammatical Error Correction".
All the code and model are released. Thank you for your patience!
The model is implemented with the Hugging Face framework. The required environment is as follows:
- Python
- torch
- transformers
- datasets
- tqdm
For the evaluation, we refer to the relevant environment configurations of ChERRANT.
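Before running the scripts, it can help to confirm that the packages listed above are importable. The following is a minimal sketch, not part of the repository; the helper name `missing_packages` is hypothetical, and no specific package versions are assumed because the list above does not state any.

```python
from importlib.util import find_spec

# Package names taken from the requirement list above.
REQUIRED = ["torch", "transformers", "datasets", "tqdm"]


def missing_packages(names=REQUIRED):
    """Return the packages from `names` that cannot be found for import."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```

The check only looks the packages up and does not import them, so it stays fast even when torch is installed.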
- Firstly, we train a baseline model (Chinese-Bart-large) for LM-Combiner on the FCGEC dataset using the Seq2Seq format.
sh ./script/run_bart_baseline.sh
- Candidate Sentence Generation
- We use the baseline model to generate candidate sentences for the training and test sets.
- On tasks where the baseline fits the training data more closely (e.g., spelling correction), we recommend using the K-fold cross-inference method from the paper, so that the candidates for each part of the training set are generated separately by a model that was not trained on that part.
python ./src/predict_bl_tsv.py
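The K-fold cross-inference idea mentioned above can be sketched as follows. This is an illustration under stated assumptions, not the code in `predict_bl_tsv.py`: `train_fn` and `predict_fn` are hypothetical callables standing in for baseline training and decoding, and the round-robin fold assignment is one arbitrary choice.

```python
from typing import Callable, List, Sequence, Tuple

Pair = Tuple[str, str]  # (source sentence, gold target sentence)


def k_fold_candidates(
    pairs: Sequence[Pair],
    k: int,
    train_fn: Callable[[List[Pair]], object],
    predict_fn: Callable[[object, List[str]], List[str]],
) -> List[str]:
    """Return one candidate per pair, produced by a model that never saw it."""
    candidates: List[str] = [""] * len(pairs)
    for fold in range(k):
        # Round-robin split: example i belongs to fold i % k (an assumption).
        held_out = [i for i in range(len(pairs)) if i % k == fold]
        train = [pairs[i] for i in range(len(pairs)) if i % k != fold]
        model = train_fn(train)  # hypothetical: train a baseline on k-1 folds
        preds = predict_fn(model, [pairs[i][0] for i in held_out])
        for i, pred in zip(held_out, preds):
            candidates[i] = pred
    return candidates
```

Because every candidate comes from a model that did not train on that sentence, the training-set candidates should contain errors closer to those the baseline makes on unseen data.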
- Golden Labels Merging
- We use the ChERRANT tool to fully decouple the error correction task and the rewriting task by merging the correct labels.
python ./scorer_wapper/golden_label_merging.py
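As a rough illustration of what merging golden labels can look like, the sketch below assumes that edits are character spans over the source sentence and that the merged target keeps only those candidate edits that also appear among the gold edits. Both points are assumptions made for this example; the actual logic lives in `golden_label_merging.py` and relies on ChERRANT for edit extraction, which is not reproduced here. The `Edit` type and both function names are hypothetical.

```python
from typing import List, Tuple

Edit = Tuple[int, int, str]  # replace source[start:end] with the given text


def apply_edits(source: str, edits: List[Edit]) -> str:
    """Apply non-overlapping edits to `source`, left to right."""
    out, cursor = [], 0
    for start, end, replacement in sorted(edits):
        out.append(source[cursor:start])
        out.append(replacement)
        cursor = end
    out.append(source[cursor:])
    return "".join(out)


def merge_golden_label(
    source: str, candidate_edits: List[Edit], gold_edits: List[Edit]
) -> str:
    """Keep only candidate edits that match a gold edit (assumed merge rule)."""
    gold = set(gold_edits)
    kept = [edit for edit in candidate_edits if edit in gold]
    return apply_edits(source, kept)


source = "he go to school yesterday"
candidate = [(3, 5, "went"), (9, 15, "the school")]  # second edit over-corrects
gold = [(3, 5, "went")]
print(merge_golden_label(source, candidate, gold))  # he went to school yesterday
```

Under this reading, the rewriting target never asks the model to introduce a correction the baseline did not propose, which is one way to separate rewriting from error correction.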
- Subsequently, we train LM-Combiner on the constructed candidate dataset
- In particular, we supplement the GPT-2 vocabulary (mainly with double quotation marks) to better fit the FCGEC dataset; see ./pt_model/gpt2-base/vocab.txt for details.
sh ./script/run_lm_combiner.py
- We use the official ChERRANT script to evaluate the model on FCGEC-dev.
sh ./script/compute_score.sh
method | Prec | Rec | F0.5 |
---|---|---|---|
bart_baseline | 28.88 | 38.95 | 40.46 |
+lm_combiner | 52.15 | 37.41 | 48.34 |
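The F0.5 column weights precision more heavily than recall. A minimal sketch of the standard F-beta formula is given below as a sanity check on the `+lm_combiner` row; the function name `f_beta` is hypothetical. The ChERRANT scorer works from edit counts, so recomputing from rounded percentages may differ in the last digit.

```python
def f_beta(precision: float, recall: float, beta: float = 0.5) -> float:
    """Standard F-beta score; beta < 1 favours precision over recall."""
    if precision == 0.0 and recall == 0.0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)


# Precision and recall of the +lm_combiner row in the table above.
print(round(f_beta(52.15, 37.41), 2))  # 48.34
```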
If you find this work useful for your research, please cite our paper:
@inproceedings{wang-etal-2024-lm-combiner,
title = "{LM}-Combiner: A Contextual Rewriting Model for {C}hinese Grammatical Error Correction",
author = "Wang, Yixuan and
Wang, Baoxin and
Liu, Yijun and
Wu, Dayong and
Che, Wanxiang",
editor = "Calzolari, Nicoletta and
Kan, Min-Yen and
Hoste, Veronique and
Lenci, Alessandro and
Sakti, Sakriani and
Xue, Nianwen",
booktitle = "Proceedings of the 2024 Joint International Conference on Computational Linguistics, Language Resources and Evaluation (LREC-COLING 2024)",
month = may,
year = "2024",
address = "Torino, Italia",
publisher = "ELRA and ICCL",
url = "https://aclanthology.org/2024.lrec-main.934",
pages = "10675--10685",
}