PolEval 2021 Task 3: Post-correction of OCR results

Results

Submission	test-A	test-B
ED 3 (mt5-base)	4.986	5.001
ED 3 pl (plt5-large)	4.281	4.302
plt5-base	5.355	5.327

Mateusz Piotrowski (mt5-xxl)	3.725	3.744

Model plT5-large achieves 4.302 WER score. The best solution is 3.744 WER score using xxl model.

Instructions

git clone -b secret https://github.com/poleval/2021-ocr-correction.git
mkdir data
virtualenv -p python3 venv
source venv/bin/activate
pip3 install -r requirements.txt

You can use model enelpol/poleval2021-task3 or train yourself.

Training

Prepare aligned texts using space-tokenizer.

python3 align.py 2021-ocr-correction/dev-0/in.tsv 2021-ocr-correction/dev-0/expected.tsv data/dev.jsonl data/dev-baseline.tsv
python3 align.py 2021-ocr-correction/train/in.tsv 2021-ocr-correction/train/expected.tsv data/train.jsonl data/train-baseline.tsv
python3 align.py 2021-ocr-correction/test-A/in.tsv 2021-ocr-correction/test-A/expected.tsv data/test-A.jsonl data/test-A-baseline.tsv

Prepare data for training.

python3 chunk_generator_context.py data/dev.jsonl data/dev.train.tsv --chunk 50 --stride 0
Input and output size in subtokens:
Input mean: 100.20682910235149, max: 157, min: 1
Output mean: 99.27256878003146, max: 165, min: 1

python3 chunk_generator_context.py data/train.jsonl data/train.train.tsv --chunk 50 --stride 0
Input and output size in subtokens:
Input mean: 100.26346787291853, max: 197, min: 1
Output mean: 99.32334532501505, max: 241, min: 1

python3 chunk_generator_context.py data/test-A.jsonl data/test-A.train.tsv --chunk 50 --stride 0
Input and output size in subtokens:
Input mean: 100.11499283996649, max: 179, min: 1
Output mean: 99.20710059171597, max: 179, min: 1

shuf data/train.train.tsv > data/train.train.tsv.shuf

Train mt5.

python3 train_mt5.py data/train.train.tsv.shuf data/dev.train.tsv --batch_size 12 --num_beams 2 --warmup_steps 1 --epochs 1 --evaluate_during_training_steps 500 --scheduler linear_schedule_with_warmup
python3 train_mt5.py data/train.train.tsv.shuf data/dev.train.tsv --batch_size 12 --num_beams 2 --warmup_steps 1 --epochs 1 --evaluate_during_training_steps 500 --scheduler linear_schedule_with_warmup --model_name allegro/plt5-base --model_type t5
python3 train_mt5.py data/train.train.tsv.shuf data/dev.train.tsv --batch_size 12 --num_beams 2 --warmup_steps 1 --epochs 1 --evaluate_during_training_steps 500 --scheduler linear_schedule_with_warmup --model_name allegro/plt5-large --model_type t5

Prediciton

Predict.

MODEL="enelpol/poleval2021-task3" # "model_bin_*/best_model/" #edit
MODEL_TYPE=t5
python3 predict_mt5.py 2021-ocr-correction/test-A/in.tsv 2021-ocr-correction/test-A/output.tsv --model_name $MODEL --model_type $MODEL_TYPE --chunk 50 --stride 0 --num_beams 8 --batch_size 8
python3 predict_mt5.py 2021-ocr-correction/test-B/in.tsv 2021-ocr-correction/test-B/output.tsv --model_name $MODEL --model_type $MODEL_TYPE --chunk 50 --stride 0 --num_beams 8 --batch_size 8

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

PolEval 2021 Task 3: Post-correction of OCR results

Results

Instructions

Training

Prediciton

About

Releases

Packages

Languages

Name		Name	Last commit message	Last commit date
Latest commit History 3 Commits
Readme.md		Readme.md
align.py		align.py
chunk_generator_context.py		chunk_generator_context.py
predict_mt5.py		predict_mt5.py
requirements.txt		requirements.txt
train_mt5.py		train_mt5.py

enelpol/poleval2021-task3

Folders and files

Latest commit

History

Repository files navigation

PolEval 2021 Task 3: Post-correction of OCR results

Results

Instructions

Training

Prediciton

About

Resources

Stars

Watchers

Forks

Releases

Packages 0

Languages

Packages