MarkBERT: Marking Word Boundaries Improves Chinese BERT
bash fastnlp-ner/run_ner.sh
This training example is for msra-ner. The dataset can be found in the folder msra-mark. The checkpoints can be downloaded from markbert.
You should see results like below:
ontonotes-results:
run-1
FitlogCallback evaluation on data-test:
span: f=0.827065, pre=0.816909, rec=0.837478
acc_span: acc=0.978047
Evaluate data in 5.5 seconds!
FitlogCallback evaluation on data-train:
span: f=0.931626, pre=0.926514, rec=0.936794
acc_span: acc=0.990249
Evaluation on dev at Epoch 3/10. Step:3762/15850:
span: f=0.813191, pre=0.812256, rec=0.814128
acc_span: acc=0.977757
run-2
FitlogCallback evaluation on data-test: span: f=0.824422, pre=0.80732, rec=0.842264 acc_span: acc=0.978673 Evaluate data in 5.48 seconds! FitlogCallback evaluation on data-train: span: f=0.911083, pre=0.89973, rec=0.922726 acc_span: acc=0.987973 Evaluation on dev at Epoch 2/3. Step:2970/4755: span: f=0.806176, pre=0.797273, rec=0.81528 acc_span: acc=0.977811
run-3
FitlogCallback evaluation on data-test:
span: f=0.824504, pre=0.810659, rec=0.838831
acc_span: acc=0.978354
Evaluate data in 5.58 seconds!
FitlogCallback evaluation on data-train:
span: f=0.938988, pre=0.932747, rec=0.945314
acc_span: acc=0.991173
Evaluation on dev at Epoch 3/5. Step:3762/7925:
span: f=0.806487, pre=0.804959, rec=0.80802
acc_span: acc=0.978021
msra-results
In Epoch:5/Step:21793, got best dev performance: span: f=0.96069, pre=0.961054, rec=0.960327 acc_span: acc=0.994596
We add markers in the data preprocess phase during fine-tuning therefore the usage of MarkBERT is simple. We use TexSmart tookit to do segmentation and pos-tagging in preprocessing the data.
In the CLUE experiments: You can simply use the tokenizer in run_glue.py to replace BERT tokenizers and run fine-tuning experiments in any huggingface transformers versions. You MUST follow the dataset sample (as seen in the data_sample.txt file) to preprocess the corresponding fine-tuning dataset sothat the MarkBertTokenizer can correctly tokenize the input texts for MarkBERT.
In the NER experiments: You also need to insert markers manually since the dataset is char-level (as seen in the data_sample.txt file), then you can use MarkBERT just like normal BERT-models. You can use the cutoff function provided to avoid sentences over 512 tokens.
The special tokens for the markers is:
in MarkBERT, the special token is '[unused1]'.
Without using the MarkBERT tokenizer, you can also use MarkBERT checkpoints as an improved version of BERT-BASE.
We provide a FastNLP version to quickly test the effectiveness of MarkBERT.
You can install the fastnlp and fitlog packages and enter the fastnlp folder to run the bash.
You need to prepare your train and dev file and assign the path in fastnlp-ner/run_ner.py line21-22 and assign the model checkpoint path in the fastnlp-ner/run_ner.sh
Also, you can use MarkBERT as following the pre-process steps and then use it in huggingface Transformers or any other toolkit that operate pre-trained models.
If you encounter any errors, you may find help in https://github.com/LeeSureman/Flat-Lattice-Transformer .
We thank Hao Jiang for the help in locating an evaluation bug in the NER task in the previous MarkBERT implementation.