LayoutReader

Why this repo?

The original LayoutReader is published by Microsoft Research. It is based on LayoutLM and uses a seq2seq architecture to predict the reading order of the words in a document. There are several problems with the original repo:

It doesn't use the transformers library, it contains a lot of experimental code, and it is not well organized, so it is hard to train and deploy.
seq2seq is too slow in production. I want to get all the predictions in one pass.
The pre-trained model takes English word-level input, which does not match real use. Real inputs are the spans extracted by a PDF parser or OCR.
I want a multilingual model. I noticed that using only the bbox is just slightly worse than bbox+text, so I want to train a model that uses only the bbox and ignores the text.

What I did?

Refactored the code to use LayoutLMv3ForTokenClassification from transformers for training and evaluation.
Provided a script that turns the original word-level dataset into a span-level dataset.
Implemented a better post-processor to avoid duplicate predictions.
Released a pre-trained model fine-tuned from layoutlmv3-large.
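
To illustrate why a post-processor is needed: when every span picks its reading position independently, two spans can end up claiming the same position. The following is a minimal sketch of one way to resolve that, assuming a square score matrix where `scores[i][j]` is the score of span `i` for position `j`. The function name and the greedy strategy are hypothetical and are not necessarily what `parse_logits` in this repo does.

```python
from typing import List


def greedy_unique_orders(scores: List[List[float]]) -> List[int]:
    """Assign every span a distinct position, taking the best scores first.

    Hypothetical helper: scores[i][j] is the score of span i for position j.
    Returns orders, where orders[i] is the position assigned to span i.
    """
    n = len(scores)
    # Every (score, span, position) candidate, highest score first.
    candidates = sorted(
        ((scores[i][j], i, j) for i in range(n) for j in range(n)),
        key=lambda c: -c[0],
    )
    orders = [-1] * n
    taken = set()
    for _, span, pos in candidates:
        # Accept a candidate only if both the span and the position are free.
        if orders[span] == -1 and pos not in taken:
            orders[span] = pos
            taken.add(pos)
    return orders


scores = [
    [0.9, 0.1, 0.0],
    [0.8, 0.2, 0.0],  # plain argmax would also pick position 0 here
    [0.1, 0.3, 0.6],
]
print(greedy_unique_orders(scores))
```

A plain per-span argmax on this example gives positions `[0, 0, 2]`, while the greedy pass moves the second span to its next free choice. The pass is quadratic in the number of spans, which is usually acceptable for a single page.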

How to use?

from transformers import LayoutLMv3ForTokenClassification
from v3.helpers import prepare_inputs, boxes2inputs, parse_logits

model = LayoutLMv3ForTokenClassification.from_pretrained("hantian/layoutreader")

# list of [left, top, right, bottom] bboxes of spans; each value should be in the range 0 to 1000
boxes = [[...], ...]  
inputs = boxes2inputs(boxes)
inputs = prepare_inputs(inputs, model)
logits = model(**inputs).logits.cpu().squeeze(0)
orders = parse_logits(logits, len(boxes))
print(orders)

# [0, 1, 2, ...]

Or you can run python main.py to serve the model.
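
The boxes passed to boxes2inputs are expected in the 0 to 1000 range. If your PDF parser or OCR returns page coordinates instead, they need to be rescaled first. Below is a minimal sketch of that step; `scale_boxes`, `page_width`, and `page_height` are hypothetical names and are not part of this repo.

```python
from typing import List, Sequence


def scale_boxes(
    boxes: Sequence[Sequence[float]],
    page_width: float,
    page_height: float,
) -> List[List[int]]:
    """Rescale [left, top, right, bottom] boxes from page units to 0..1000."""
    scaled = []
    for left, top, right, bottom in boxes:
        box = [
            round(left * 1000 / page_width),
            round(top * 1000 / page_height),
            round(right * 1000 / page_width),
            round(bottom * 1000 / page_height),
        ]
        # Clamp in case a box slightly overshoots the page border.
        scaled.append([min(1000, max(0, v)) for v in box])
    return scaled


# Example: a US Letter page measured in points (612 x 792).
print(scale_boxes([[306, 396, 612, 792]], page_width=612, page_height=792))
```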

Dataset

Download Original Dataset

The original dataset can be downloaded from ReadingBank. More details can be found in the original repo.

Build Span-Level Dataset

unzip ReadingBank.zip
python tools.py ./train/ train.jsonl.gz
python tools.py ./dev/ dev.jsonl.gz
python tools.py ./test/ test.jsonl.gz --src-shuffle-rate=0
python tools.py ./test/ test_shuf.jsonl.gz --src-shuffle-rate=1
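
To inspect the generated files, a small reader like the one below may help. It assumes, based only on the .jsonl.gz extension, that each file is gzip-compressed JSON Lines with one record per line. The record fields are not described here, so the sketch makes no assumption about them, and `read_jsonl_gz` is a hypothetical helper rather than something from tools.py.

```python
import gzip
import json
from typing import Iterator


def read_jsonl_gz(path: str) -> Iterator[dict]:
    """Yield one parsed JSON object per non-empty line of a gzip file."""
    with gzip.open(path, "rt", encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if line:
                yield json.loads(line)
```

For example, `next(read_jsonl_gz("dev.jsonl.gz")).keys()` would show which fields a record actually contains.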

Train & Eval

The core code is in the ./v3 folder. train.sh and eval.py are the entry points.

bash train.sh
python eval.py ../test.jsonl.gz hantian/layoutreader
python eval.py ../test_shuf.jsonl.gz hantian/layoutreader

Span-Level Results

shuf indicates whether the input order is shuffled.
BLEU Idx is the BLEU score of the predicted tokens' orders.
BLEU Text is the BLEU score of the final merged text.

I only trained the layout-only model and tested it on the span-level dataset, so the Heuristic Method result is quite different from the original word-level result. I mainly focus on BLEU Text, which is only a bit lower than the original word-level result, while the speed is much faster.

Method	shuf	BLEU Idx	BLEU Text
Heuristic Method	no	44.4	70.7
LayoutReader (layout only)	no	94.9	97.5
LayoutReader (layout only)	yes	94.8	97.4
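
For reference, a heuristic baseline of this kind can be sketched as a plain sort of the boxes from top to bottom and then left to right. This is an assumption about what the Heuristic Method row stands for, based on the left-to-right, top-to-bottom ordering mentioned in the original paper, and `heuristic_order` is a hypothetical name.

```python
from typing import List, Sequence


def heuristic_order(boxes: Sequence[Sequence[int]]) -> List[int]:
    """Return box indices sorted by top edge first, then by left edge.

    Each box is [left, top, right, bottom].
    """
    return sorted(range(len(boxes)), key=lambda i: (boxes[i][1], boxes[i][0]))


boxes = [
    [500, 10, 900, 30],  # first line, right part
    [10, 10, 400, 30],   # first line, left part
    [10, 50, 400, 70],   # second line
]
print(heuristic_order(boxes))
```

A sort like this reads straight across multi-column pages, which is one likely reason such a baseline scores far below the model.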

Word-Level Results

My eval script

I trained the layout-only model myself using the original code, and the public model is the released pre-trained model. The layout-only model is nearly as good as the public model, and shuffling has only a small effect on the results.

Only the first part of the test dataset was evaluated, because the evaluation is too slow.

Method	shuf	BLEU Idx	BLEU Text
Heuristic Method	no	78.3	79.4
LayoutReader (layout only)	no	98.0	98.2
LayoutReader (layout only)	yes	97.8	98.0
LayoutReader (public model)	no	98.0	98.3

Old eval script (copied from the original paper)

Evaluation results of the LayoutReader on the reading order detection task, where the source-side of training/testing data is in the left-to-right and top-to-bottom order

Method	Encoder	BLEU	ARD
Heuristic Method	-	0.6972	8.46
LayoutReader (layout only)	LayoutLM (layout only)	0.9732	2.31
LayoutReader	LayoutLM	0.9819	1.75
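
The ARD column is the Average Relative Distance. The sketch below assumes the definition as I understand it from the original paper: for each element of the reference sequence, take the absolute difference between its index in the reference and its index in the prediction, use the sequence length as a penalty when the element is missing, and average over the reference. The function name is hypothetical, elements are assumed to be unique, and this is not the evaluation code that produced the numbers above.

```python
from typing import Hashable, Sequence


def average_relative_distance(
    reference: Sequence[Hashable],
    predicted: Sequence[Hashable],
) -> float:
    """Average index displacement between reference and predicted order."""
    n = len(reference)
    if n == 0:
        return 0.0
    # Position of every element in the predicted sequence.
    index_in_pred = {elem: i for i, elem in enumerate(predicted)}
    total = 0
    for k, elem in enumerate(reference):
        if elem in index_in_pred:
            total += abs(k - index_in_pred[elem])
        else:
            # A missing element is penalized with the sequence length.
            total += n
    return total / n


print(average_relative_distance([0, 1, 2, 3], [1, 0, 2, 3]))
```

Lower is better: an identical ordering gives 0, and swapping two neighbours in a four-element sequence gives 0.5.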

Input order study with left-to-right and top-to-bottom inputs in evaluation, where r is the proportion of shuffled samples in training.

Method	BLEU (r=100%)	BLEU (r=50%)	BLEU (r=0%)	ARD (r=100%)	ARD (r=50%)	ARD (r=0%)
LayoutReader (layout only)	0.9701	0.9729	0.9732	2.85	2.61	2.31
LayoutReader	0.9765	0.9788	0.9819	2.50	2.24	1.75

Input order study with token-shuffled inputs in evaluation, where r is the proportion of shuffled samples in training.

Method	BLEU (r=100%)	BLEU (r=50%)	BLEU (r=0%)	ARD (r=100%)	ARD (r=50%)	ARD (r=0%)
LayoutReader (layout only)	0.9718	0.9714	0.1331	2.72	2.82	105.40
LayoutReader	0.9772	0.9770	0.1783	2.48	2.46	72.94

Citation

If this model helps you, please cite it.

@software{Pang_Faster_LayoutReader_based_2024,
  author = {Pang, Hantian},
  month = feb,
  title = {{Faster LayoutReader based on LayoutLMv3}},
  url = {https://github.com/ppaanngggg/layoutreader},
  version = {1.0.0},
  year = {2024}
}

