historgraph is a graph database of four different handwritten historical documents, viz.
Please cite the following papers (as well as the corresponding references of the handwritten historical documents, see above):
[1] M. Stauffer, A. Fischer, K. Riesen, Keyword Spotting in Historical Handwritten Documents based on Graph Matching, Pattern Recognit. (2018).
[2] M. Stauffer, A. Fischer, K. Riesen, A Novel Graph Database for Handwritten Word Images, in: Int. Work. Struct. Syntactic, Stat. Pattern Recognit., 2016: pp. 553–563.
The following image preprocessing steps have been employed for the four different manuscripts. Note that the segmentation of AK and BOT has been done in conjunction of the ICFHR2016 KWS benchmark competition, while the segmentation of GW and PAR has been done by ourself.
- Remove noise by Difference of Gaussians
- Binarisation by global thresholding
- Semi-automatic segmentation of word line images to word images by means of projection profiles
- Skew-correction by means of linear regression of lower baseline
- Skeletonisation
- Remove noise by Difference of Gaussians
- Binarisation by global thresholding
- Filtering of segmentation errors by means of morphological filtering
- Skeletonisation
We denote segmented word images that are binarised by B. If the image is additionally skeletonised we use the term S from now on.
Keypoint
: The first graph representation makes use of characteristics points (so called keypoints) in skeletonised word images S. These keypoints are represented as nodes that are labelled with the corresponding (x,y)-coordinates. Between pairs of keypoints (which are connected on the skeleton) further intermediate points are converted to nodes and added to the graph at equidistant intervals. Finally, undirected edges are inserted into the graph for each pair of nodes that is directly connected by a stroke.
Grid
: The second graph representation is based on a grid-wise segmentation of binarised word images B into equally sized segments. For each segment, a node is inserted into the graph and labelled by the (x,y)-coordinates of its respective centre of mass. Undirected edges are inserted between two neighbouring segments that are actually represented by a node. Finally, the inserted edges are reduced by means of a Minimal Spanning Tree algorithm.
Projection
: The third graph representation is based an adaptive and threshold-based segmentation of binarised word images B. Basically, this segmentation is computed on the horizontal and vertical projection profiles of B. The resulting segmentation is further refined in the horizontal and vertical direction by means of two distance-based thresholds. A node is inserted into the graph for each segment and labelled by the (x,y)-coordinates of the corresponding centre of mass. Undirected edges are inserted into the graph for each pair of nodes that is directly connected by a stroke in the original word image.
Split
: The fourth graph representation is based on an iterative segmentation of binarised word images B. That is, segments are iteratively split into smaller subsegments until the width and height of all segments are below certain thresholds. A node is inserted into the graph and labelled by the (x,y)-coordinates of the point on the stroke closest to the centre of mass of each segment. For the insertion of the edges, a similar procedure as for Projection
is applied.
The resulting graphs or more precisely the (x,y)-coordinates of the node labels are normalised by a z-score. Formally, we use normalised coordinates (x-norm, y-norm) derived by
x-norm = (x - µ(x))\σ(x)
and y-norm = (y - µ(y))\σ(y)
where (µ(x),µ(y)) and (σ(x),σ(y)) represent the mean and standard deviation of all (x,y)-coordinates in the graph under consideration.
In the following we describe the Keyword Spotting test setup for each of the four manuscripts. For each manuscript a ground truth can be found in the subfolder ./00_GroundTruth
.
To reproduce the test results achieved on GW and PAR, extract all keywords (see keywords.txt
) in train.txt
and match or classify them with all words from test.txt
. Note that GW is a four-fold, and thus, this procedure needs to repeated for each fold (i.e. each subfolder ./cv1
to ./cv4
) and then the accuracy (e.g. Average Precision and Mean Average Precision) needs to be averaged over the four folds.
To reproduce the test results achieved on AK and BOT, all keywords in queries.txt
are matched or classified with all words from words.txt
(see subfolder ./02_Test
). Note that the ground truth in subfolder ./01_Train
can be used for meta parameter optimisation only.
For reference Keyword Spottings results of the four different handwritten historical documents with different fast suboptimal algorithms for Graph Edit Distance we refer to the following papers:
[1] M. Stauffer, A. Fischer, K. Riesen, Keyword Spotting in Historical Handwritten Documents based on Graph Matching, Pattern Recognit. (2018).
[3] M. Stauffer, A. Fischer, K. Riesen, Speeding-Up Graph-based Keyword Spotting by Quadtree Segmentations, in: Int. Conf. Comput. Anal. Images Patterns, 2017.
[4] M. Stauffer, A. Fischer, K. Riesen, Filters for Graph-Based Keyword Spotting in Historical Handwritten Documents, Pattern Recognit. Lett. (2018).
[5] M. Stauffer, A. Fischer, K. Riesen, Ensembles for Graph-based Keyword Spotting in Historical Handwritten Documents, in: Int. Conf. Doc. Anal. Recognit., 2017: pp. 714–720.
[6] M.R. Ameri, M. Stauffer, K. Riesen, T.D. Bui, A. Fischer, Keyword Spotting in Historical Documents Based on Handwriting Graphs and Hausdorff Edit Distance, in: Int. Graphonomics Soc. Conf., 2017.
[7] M. Stauffer, A. Fischer, K. Riesen, Graph-Based Keyword Spotting in Historical Documents Using Context-Aware Hausdorff Edit Distance, in: Int. Work. Doc. Anal. Syst., 2018.