LayoutLM is a Transformer-based model combining image and text information to understand structured documents such as scanned receipts or forms. There are three versions of the model:
- v1 -- combines text and layout information
- v2 -- combines text, layout and image information
- v3 -- simplifies processing of v2 into a single transformer
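
To make "layout information" concrete, here is a minimal sketch of how each word on a page might be paired with a bounding box before being passed to a model. It assumes the 0-1000 coordinate scale used by common LayoutLM implementations; the `Word` class, the `normalize_box` helper, and the sample receipt values are hypothetical and serve only as an illustration.

```python
from dataclasses import dataclass


@dataclass
class Word:
    """A word recognised on the page (hypothetical container for this sketch)."""
    text: str
    box: tuple  # (x0, y0, x1, y1) in pixels, origin at the top-left corner


def normalize_box(box, page_width, page_height, scale=1000):
    """Rescale a pixel bounding box to a page-size-independent 0..scale grid."""
    x0, y0, x1, y1 = box
    return (
        int(scale * x0 / page_width),
        int(scale * y0 / page_height),
        int(scale * x1 / page_width),
        int(scale * y1 / page_height),
    )


# Made-up sample: two words from a 500x800 pixel receipt scan.
page_width, page_height = 500, 800
words = [
    Word("Total", (50, 700, 120, 720)),
    Word("12.50", (400, 700, 460, 720)),
]

# Text plus layout: the kind of (token, box) pairs that v1 works with.
# v2 and v3 would additionally receive the page image itself.
layout_inputs = [
    (w.text, normalize_box(w.box, page_width, page_height)) for w in words
]
print(layout_inputs)
```

Normalizing the boxes means the same position on the page maps to the same coordinates regardless of scan resolution.
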
The first version of the model combines only text and layout information, while the second version also uses image features.
v1 and v2 differ substantially, so this note describes v1 only briefly, as an introduction to processing text and images. The rest of the note is dedicated to v2.
TODO: there is also a v3 ...