This code is based on a PyTorch implementation of the code from the original repository. The model code is designed to enable ONNX* export and inference on CPU/GPU via OpenVINO™.
- Ubuntu* 18.04
- Python* 3.7 or newer
- PyTorch* (1.7.0)
- OpenVINO™ 2021.1 with Python API
These packages are used for rendering images during evaluation and demo.
sudo apt-get update &&
sudo apt-get install -y --no-install-recommends \
texlive \
imagemagick \
ghostscript
The evaluation process uses ImageMagick to convert PDF-rendered formulas into PNG images. Sometimes errors like the following can occur:
convert-im6.q16: not authorized `/tmp/tmpgr1m4d4_.pdf' @ error/constitute.c/ReadImage/412.
convert-im6.q16: no images defined `/tmp/tmpgr1m4d4_.png' @ error/convert.c/ConvertImageCommand/3258.
The problem is missing permissions in the ImageMagick security policy.
To fix this, open the file /etc/ImageMagick-6/policy.xml:
sudo nano /etc/ImageMagick-6/policy.xml
Find <policy domain="coder" rights="none" pattern="PDF" />
and replace with:
<policy domain="coder" rights="read|write" pattern="PDF" />
Create and activate virtual environment:
bash init_venv.sh
Dataset format is similar to im2latex-100k. The main structure of the dataset is the following:
- `formulas.norm.lst` - file with one formula per line.
- `images_processed` - folder containing input images.
- `split_file` - each line contains `image_name` (tab symbol) `formula_idx`, connecting the corresponding index of the formula in the formulas file with a particular image by `image_name`. Example:
```
11.png  11
34.png  34
...
```
There should be at least two such files: `train_filter.lst` and `validate_filter.lst`.
You can prepare your own dataset in the same format as above. Samples of the dataset can be found here.
NOTE: By default, the following structure of the dataset is assumed:
- `images_processed` - folder with images
- `formulas.norm.lst` - file with preprocessed formulas. If you want to use your own dataset, the formulas should be preprocessed. For details, refer to this script.
- `validate_filter.lst` and `train_filter.lst` - corresponding splits of the data.
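Putting this together, a dataset directory in the default layout would look roughly like the sketch below (file and folder names are taken from the description above; image names are examples):
```
<dataset root>/
├── images_processed/
│   ├── 11.png
│   └── 34.png
├── formulas.norm.lst
├── train_filter.lst
└── validate_filter.lst
```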
When you prepare your own dataset with a `formulas.norm.lst` file, you will have to create a vocabulary file for this dataset.
The vocabulary file is a special file used to cast token ids to human-readable tokens and vice versa.
Like letters and digits in natural language, tokens here are the atomic units of the LaTeX language (e.g. `\sin`, `1`, `\sqrt`, etc.).
You can find an example in the vocabs folder of this project.
Use this script to create a vocab file from your own formulas file.
The script reads the formulas and creates the vocabulary from the formulas used in the train split of the dataset.
To train formula recognition model run:
python tools/train.py --config configs/medium_config.yml --work_dir <path to work dir>
The work dir is used to store training artifacts: saved model checkpoints and logs.
The config file is divided into 4 sections: train, eval, export, demo. Common parameters (like the path to the model) are stored at the same level as the train and other sections. Section-specific parameters (like the learning rate) are stored in their respective sections. Section-specific and common parameters are mutually exclusive.
Note: All values in the config file that have 'path' in their name are treated as paths, and the script that reads the configuration will try to resolve all relative paths. By default, all relative paths are resolved relative to the folder where this README.md file is placed. Keep this in mind or use full paths.
- `backbone_config`:
  - `arch` - type of the architecture (if `backbone_type` is resnet). For more details, please refer to ResnetLikeBackBone.
  - `disable_layer_3` and `disable_layer_4` - disable layers 3 and 4 in the resnet-like backbone. The ResNet backbone from the torchvision module consists of 4 blocks of layers, each of which increases the number of channels and decreases the spatial dimensionality. These parameters allow switching off the 3rd and the 4th of such layers, respectively.
  - `enable_last_conv` - enables an additional convolution layer to adjust the number of output channels to the number of input channels of the LSTM. Optional. Default: false.
  - `output_channels` - number of output channels. If `enable_last_conv` is `true`, this parameter should be equal to `head.encoder_input_size`, otherwise it should be equal to the actual number of output channels of the backbone.
- `backbone_type`: `resnet` for the resnet-like backbone or anything else for the original backbone from the im2markup paper. Optional. Default is `resnet`.
- `head` - configuration of the text recognition head. All of the following parameters have default values, which you can check in the text recognition head:
  - `beam_width` - width used in beam search. 0 - do not use beam search, 1 and more - use beam search with the corresponding number of possible tracks.
  - `dec_rnn_h` - number of channels in decoding
  - `emb_size` - dimension of the embedding
  - `encoder_hidden_size` - number of channels in encoding
  - `encoder_input_size` - number of channels in the LSTM input; should be equal to `backbone_config.output_channels`
  - `max_len` - maximum possible length of the predicted formula
  - `n_layer` - number of layers in the trainable initial hidden state for each row
- `model_path` - path to the pretrained model checkpoint (you can find the links to the checkpoints below in this document).
- `vocab_path` - path where the vocabulary file is stored.
- `val_transforms_list` - set of desired transformations for the validation dataset. An example is given in the config file; for other options, please refer to the constructor of transforms (section `create_list_of_transforms`).
- `device` - device for training, used in the PyTorch `.to()` method. Possible options: 'cuda', 'cpu'. `cpu` is used by default.
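To make the layout concrete, here is a rough sketch of how the common part of a config could look. The parameter names come from the list above; the nesting and all values are illustrative placeholders, so check configs/medium_config.yml for the actual layout and defaults.

```yaml
# Illustrative sketch only: parameter names come from the list above,
# values are placeholders rather than the repository defaults.
backbone_type: resnet
backbone_config:
  arch: <architecture name>      # used when backbone_type is resnet
  disable_layer_3: true
  disable_layer_4: true
  enable_last_conv: true
  output_channels: 256           # must equal head.encoder_input_size when enable_last_conv is true
head:
  encoder_input_size: 256        # should equal backbone_config.output_channels
  encoder_hidden_size: 256
  dec_rnn_h: 512
  emb_size: 80
  beam_width: 0                  # 0 disables beam search
  max_len: 256
  n_layer: 1
model_path: <path to the model checkpoint>
vocab_path: <path to the vocabulary file>
device: cuda
```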
In addition to the common parameters, you can specify the following arguments:
- `batch_size` - batch size used for training
- `learning_rate` - learning rate
- `log_path` - path to store training logs
- `optimizer` - Adam or SGD
- `save_dir` - directory to save checkpoints
- `train_paths` - list of paths from which to get training data (if more than one path is specified, the datasets are concatenated). If you want to concatenate several instances of the same dataset, specify it several times.
- `train_ann_file` - path to the train annotation file with `file_name <\t> formula_number` per line
- `val_path` - path to the validation data
- `val_ann_file` - the same as `train_ann_file`
- `train_transforms_list` - similar to `val_transforms_list`
- `epochs` - number of epochs to train
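As a rough illustration, a train section built from these parameters might look like the sketch below; all values and paths are placeholders, not the repository defaults.

```yaml
# Illustrative sketch only: names come from the list above, values are placeholders.
train:
  epochs: 30
  batch_size: 32
  learning_rate: 0.001
  optimizer: Adam
  log_path: <path to training logs>
  save_dir: <path to checkpoint dir>
  train_paths:
    - <path to training data>
  train_ann_file: <path to train_filter.lst>
  val_path: <path to validation data>
  val_ann_file: <path to validate_filter.lst>
  train_transforms_list:
    - name: TransformBin
      threshold: 100
```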
One can use some pretrained models. Right now two models are available:
- medium model:
  - checkpoint link
  - digits, letters, some Greek letters, fractions, and trigonometric operations are supported; for more details, please look at the corresponding vocab file.
  - to use this model, just set the correct value of the `model_path` field in the corresponding config file:
    ```
    model_path: <path to the model>
    ```
  The model can be used for recognizing both rendered and scanned formulas (e.g. from a scanner or from a phone camera).
- handwritten polynomials model:
  - checkpoint
  - digits, letters, and upper indices are supported
  - to use this model, change the model path in the corresponding config file:
    ```
    model_path: <path to the model>
    ```
  The model can be used for recognizing handwritten polynomial equations.

All the above models can be used for fine-tuning or as ready-for-inference models. To achieve maximum quality when recognizing formulas, it is highly recommended to preprocess the image by simply binarizing it:
val_transforms_list:
- name: TransformBin
threshold: 100
You can find other preprocessing transforms in this file. Sample images in the data section of this repo are already preprocessed; you can look at the examples.
- `test_path` - path to the test data
- `test_ann_file` - path to the test annotation file. The same as `train_ann_file`
- `transforms_list` - list of image transformations (optional)
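A hedged sketch of an eval section using these parameters (paths and values are placeholders):

```yaml
# Illustrative sketch only.
eval:
  test_path: <path to test data>
  test_ann_file: <path to test annotation file>
  transforms_list:
    - name: TransformBin
      threshold: 100
```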
These parameters are used for model export to ONNX & OpenVINO™ IR:
- `res_encoder_name` - filename to save the converted encoder model (with the `.onnx` postfix)
- `res_decoder_name` - filename to save the converted decoder model (with the `.onnx` postfix)
- `export_ir` - set this flag to `true` to export the model to the OpenVINO™ IR. For details, refer to the convert to IR section.
- `verbose_export` - set this flag to `true` to perform verbose export (i.e. print Model Optimizer commands to the terminal)
- `input_shape_decoder` - list of dimensions describing the input shape of the encoder for OpenVINO™ IR conversion.
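A possible export section built from these parameters is sketched below; the file names and the shape value are illustrative assumptions, not verified defaults.

```yaml
# Illustrative sketch only: the shape value is an assumption, not a verified default.
export:
  res_encoder_name: encoder.onnx
  res_decoder_name: decoder.onnx
  export_ir: true
  verbose_export: false
  input_shape_decoder: [1, 3, 96, 480]   # example input shape for IR conversion
```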
The `tools/test.py` script is designed for quality evaluation of formula-recognition models.
For example, one can run the evaluation process using the config for the medium model:
python tools/test.py --config configs/medium_config.yml
The evaluation process is the following:
- Run the model and get predictions
- Render predictions from the first step into images of the formulas
- Compare images.
The third step is very important because in the LaTeX language one can write different formulas that look the same. For example, `s^{12}_{i}` and `s_{i}^{12}` look the same: both of them are rendered as the same image. That is why we cannot just compare text predictions one by one; we have to render the images and compare them.
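For reference, a minimal LaTeX document demonstrating the equivalence (compile it with any LaTeX distribution to check that both orderings typeset identically):

```latex
\documentclass{article}
\begin{document}
% Both source orderings produce the same typeset symbol arrangement:
$s^{12}_{i}$ and $s_{i}^{12}$
\end{document}
```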
In order to see how the trained model works using OpenVINO™, please refer to the Formula Recognition Python* Demo. Before running the demo, you have to export the trained model to IR. See below for how to do that.
If you want to see how the trained PyTorch model works, you can run the `tools/demo.py` script with the correct config file. Fill in the `input_images` variable with the paths to the desired images. For every image in this list, the model will predict the formula and print it to the terminal.
To run the model via OpenVINO™, one has to export the PyTorch model to ONNX first and then convert it to the OpenVINO™ Intermediate Representation (IR) using Model Optimizer.
The model will be split into two parts:
- Encoder (CNN-backbone and part of the text recognition head)
- Text recognition decoder (LSTM + attention-based head)
The `tools/export.py` script exports a given model to the ONNX representation.
python tools/export.py --config configs/medium_config.yml
Conversion from ONNX model representation to OpenVINO™ IR is straightforward and handled by OpenVINO™ Model Optimizer.
To convert the model to IR, one has to set the `export_ir` flag in the config file:
...
export_ir: true
...
If this flag is set, the full pipeline (PyTorch -> ONNX -> OpenVINO™ IR) is run; otherwise, the model is exported to ONNX only.