How to run UNLV tests.

The scripts in this directory make it possible to duplicate the tests published in the Fourth Annual Test of OCR Accuracy. See but first you have to get the tools and data used by UNLV:

Step 1: to download the images go to and get doe3.3B.tar.gz, bus.3B.tar.gz, mag.3B.tar.gz and news.3B.tar.gz. The spn.3B.tar.gz archive is incorrect in this repo, so get it from

mkdir -p ~/isri-downloads
cd ~/isri-downloads
curl  -L > bus.3B.tar.gz
curl  -L > doe3.3B.tar.gz
curl  -L > mag.3B.tar.gz
curl  -L > news.3B.tar.gz
curl  -L > spn.3B.tar.gz

Step 2: extract the files.

It doesn't really matter where in your filesystem you put them, but they must go under a common root so that you have the directories doe3.3B, bus.3B, mag.3B and news.3B (plus spn.3B for the Spanish test) in, for example, ~/ISRI-OCRtk.

mkdir -p ~/ISRI-OCRtk
cd ~/ISRI-OCRtk
tar xzvf ~/isri-downloads/bus.3B.tar.gz
tar xzvf ~/isri-downloads/doe3.3B.tar.gz
tar xzvf ~/isri-downloads/mag.3B.tar.gz
tar xzvf ~/isri-downloads/news.3B.tar.gz
tar xzvf ~/isri-downloads/spn.3B.tar.gz
mkdir -p stopwords
cd stopwords
wget -O spa.stopwords.txt
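Before moving on, it can help to confirm that every archive unpacked under the common root. The following is a minimal sketch, not part of the UNLV scripts: check_isri_layout is a hypothetical helper name, and the directory list is taken from the archive names above.

```shell
# Sketch: report which expected ISRI data directories are missing under a root.
# check_isri_layout is a hypothetical helper, not part of the UNLV tools.
check_isri_layout() {
    root=$1
    missing=0
    for d in bus.3B doe3.3B mag.3B news.3B spn.3B; do
        if [ ! -d "$root/$d" ]; then
            echo "missing: $root/$d"
            missing=1
        fi
    done
    # Exit status 0 only when every directory was found.
    return "$missing"
}
```

Calling `check_isri_layout ~/ISRI-OCRtk` prints one line per missing directory and returns a non-zero status if any are absent.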

Edit ~/ISRI-OCRtk/stopwords/spa.stopwords.txt: wordacc uses a space-delimited stopwords file, not a line-delimited one, so replace every newline with a space (s/\n/ /g).

perl -pi -e 's/\n/ /' ~/ISRI-OCRtk/stopwords/spa.stopwords.txt
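Where perl is not available, the same newline-to-space conversion can be sketched with tr. flatten_stopwords is a hypothetical helper name; it writes through a temporary copy because tr cannot edit a file in place.

```shell
# Sketch: turn a line-delimited stopwords file into a space-delimited one.
# flatten_stopwords is a hypothetical helper, not part of the ISRI toolkit.
flatten_stopwords() {
    file=$1
    # Replace every newline with a space, then swap the result into place.
    tr '\n' ' ' < "$file" > "$file.tmp" && mv "$file.tmp" "$file"
}
```

After `flatten_stopwords ~/ISRI-OCRtk/stopwords/spa.stopwords.txt`, running `wc -l` on the file should report 0, since no newline characters remain.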

Edit ~/ISRI-OCRtk/spn.3B/pages and delete the line containing the following image name, as it crashes tesseract:

7733_005.3B 3
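The deletion can also be scripted instead of done by hand. This is a minimal sketch that assumes the entry occupies a whole line exactly as shown, with no trailing whitespace; remove_page is a hypothetical helper name, and it keeps a .bak copy of the original file.

```shell
# Sketch: drop one entry from an ISRI "pages" file, keeping a backup.
# remove_page is a hypothetical helper, not part of the UNLV scripts.
remove_page() {
    pages_file=$1
    entry=$2
    cp "$pages_file" "$pages_file.bak" || return 1
    # -F: fixed string, -x: match the whole line, -v: keep non-matching lines.
    # grep exits with 1 when it prints nothing, which is not an error here.
    grep -vxF -- "$entry" "$pages_file.bak" > "$pages_file" || [ $? -eq 1 ]
}
```

For the case above the call would be `remove_page ~/ISRI-OCRtk/spn.3B/pages '7733_005.3B 3'`.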

Step 3: download the modified ISRI toolkit, then make and install the tools:

These will be installed in /usr/local/bin.

git clone
cd ~/ocr-evaluation-tools
sudo make install

Step 4: cd back to your main tesseract-ocr dir and build tesseract.

Step 5: run test/unlvtests/ with the root ISRI data dir, testname, tessdata-dir:

test/unlvtests/ ~/ISRI-OCRtk 4_fast_eng ../tessdata_fast

and go to the gym, have lunch etc. It takes a while to run.

Step 6: there should be a RELEASE.summary file, for example unlvtests/reports/4-beta_fast.summary, that contains the final summarized accuracy report and a comparison with the 1995 results.

Step 7: run the test for Spanish.

test/unlvtests/ ~/ISRI-OCRtk 4_fast_spa ../tessdata_fast

Notes from Nick White regarding wordacc

If you just want to remove all lines which have 100% recognition, you can add an 'awk' command like this:

ocrevalutf8 wordacc ground.txt ocr.txt | awk '$3 != 100 {print $0}' > results.txt

or if you've already got a results file you want to change, you can do this:

awk '$3 != 100 {print $0}' results.txt > newresults.txt

If you only want the last sections where things are broken down by word, you can add a sed command, like this:

ocrevalutf8 wordacc ground.txt ocr.txt | sed '/^ Count Missed %Right $/,$ !d' | awk '$3 != 100 {print $0}' > results.txt
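To make the two filters easier to reuse, they can be wrapped in a small function that reads a report on standard input. This is a sketch only: word_section_errors is a hypothetical name, and the header pattern is copied from the command above, so its spacing may need adjusting to match the header line in your own report.

```shell
# Sketch: keep only the per-word section of a wordacc report, minus perfect rows.
# word_section_errors is a hypothetical helper, not part of ocrevalutf8.
word_section_errors() {
    # sed: delete every line outside the range "first header match .. end of input".
    # awk: print lines whose third whitespace-separated field (%Right) is not 100;
    #      the header line itself passes because "%Right" is not equal to 100.
    sed '/^ Count Missed %Right $/,$ !d' | awk '$3 != 100 {print $0}'
}
```

It would be used as `ocrevalutf8 wordacc ground.txt ocr.txt | word_section_errors > results.txt`. Because awk compares the field numerically when it looks like a number, a value printed as 100.00 is filtered out as well.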