This repo contains scripts for converting the PubLayNet dataset to tfrecords for semantic segmentation. The tfrecords can be used to train and evaluate semantic segmentation networks for document structure extraction and document layout recognition.
The tfrecords follow the style and formatting of the official semantic segmentation model (deeplab) released in TensorFlow's model repository (https://github.com/tensorflow/models/tree/master/research/deeplab). More specifically, the scripts released here follow the style and formatting used for the Pascal VOC dataset in deeplab.
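For reference, the deeplab build scripts serialize each page image together with its segmentation mask into a `tf.train.Example`. The sketch below illustrates that convention only; the helper names and the exact feature keys written by `build_PubLayNet_tfrecords.py` are assumptions here, not a specification of this repo's output.

```python
import tensorflow as tf

def _bytes_feature(value):
    """Wrap a bytes value in a tf.train.Feature."""
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[value]))

def _int64_feature(value):
    """Wrap an int value in a tf.train.Feature."""
    return tf.train.Feature(int64_list=tf.train.Int64List(value=[value]))

def image_seg_to_tfexample(image_data, filename, height, width, seg_data):
    """Pack one JPEG page image and its PNG class mask, deeplab-style.

    Feature keys follow the Pascal VOC convention used by deeplab's
    build scripts; this repo's output may differ in detail.
    """
    return tf.train.Example(features=tf.train.Features(feature={
        'image/encoded': _bytes_feature(image_data),
        'image/filename': _bytes_feature(filename.encode('utf-8')),
        'image/format': _bytes_feature(b'jpeg'),
        'image/height': _int64_feature(height),
        'image/width': _int64_feature(width),
        'image/channels': _int64_feature(3),
        'image/segmentation/class/encoded': _bytes_feature(seg_data),
        'image/segmentation/class/format': _bytes_feature(b'png'),
    }))
```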
To use the code:
- Download the PubLayNet files from its official GitHub repo (https://github.com/ibm-aur-nlp/PubLayNet).
- Put `train.json` and `dev.json` under the `PubLayNet_tfrecords/PubLayNet` folder.
- Unzip the downloaded files and put each batch in its appropriate folder under `PubLayNet_tfrecords/PubLayNet/RawImages/`.
- In Terminal, navigate to `./PubLayNet_tfrecords`.
- To create the segmentation mask PNG files, run `python create_PubLayNet_segmentation_mask_png_files.py`.
- To create tfrecords, run `python build_PubLayNet_tfrecords.py`.
- Tfrecords will be saved in `PubLayNet_tfrecords/PubLayNet/tfrecords` (a read-back check is sketched after this list).
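To sanity-check the output, the tfrecords can be read back with `tf.data`. This is a minimal sketch assuming deeplab-style feature keys and a `*.tfrecord` shard naming pattern; adjust the glob and the keys to whatever `build_PubLayNet_tfrecords.py` actually writes.

```python
import tensorflow as tf

# Hypothetical shard pattern; adjust to the filenames actually produced.
files = tf.io.gfile.glob('PubLayNet/tfrecords/*.tfrecord')

# Feature keys assume the deeplab Pascal VOC convention.
features = {
    'image/encoded': tf.io.FixedLenFeature([], tf.string),
    'image/height': tf.io.FixedLenFeature([], tf.int64),
    'image/width': tf.io.FixedLenFeature([], tf.int64),
    'image/segmentation/class/encoded': tf.io.FixedLenFeature([], tf.string),
}

def parse(record):
    """Decode one serialized example into (page image, class mask)."""
    parsed = tf.io.parse_single_example(record, features)
    image = tf.io.decode_jpeg(parsed['image/encoded'], channels=3)
    mask = tf.io.decode_png(parsed['image/segmentation/class/encoded'], channels=1)
    return image, mask

dataset = tf.data.TFRecordDataset(files).map(parse)
for image, mask in dataset.take(1):
    print(image.shape, mask.shape)
```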
PubLayNet is a large dataset of document images whose layout is annotated with both bounding boxes and polygonal segmentations. The documents come from the PubMed Central Open Access Subset (commercial use collection). The annotations were generated automatically by matching the PDF and XML formats of the articles in that subset. More details are available in the paper "PubLayNet: largest dataset ever for document layout analysis."