Merge pull request #33 from rmtheis/tweak-readme
Minor edits to Readme
zdenop committed May 22, 2015
2 parents f8ebff2 + a36a5f9 commit e4136f2
Showing 1 changed file with 35 additions and 31 deletions.
66 changes: 35 additions & 31 deletions README.md
@@ -1,48 +1,52 @@
Note that this is a text-only and possibly out-of-date version of the
wiki ReadMe, which is located at:

-https://github.com/tesseract-ocr/tesseract/blob/master/README
+https://github.com/tesseract-ocr/tesseract/blob/master/README.md

Introduction
============

This package contains the Tesseract Open Source OCR Engine.
-Originally developed at Hewlett Packard Laboratories Bristol and
-at Hewlett Packard Co, Greeley Colorado, all the code
+Originally developed at Hewlett-Packard Laboratories Bristol and
+at Hewlett-Packard Co, Greeley Colorado, all the code
in this distribution is now licensed under the Apache License:

-* Licensed under the Apache License, Version 2.0 (the "License");
-* you may not use this file except in compliance with the License.
-* You may obtain a copy of the License at
-* http://www.apache.org/licenses/LICENSE-2.0
-* Unless required by applicable law or agreed to in writing, software
-* distributed under the License is distributed on an "AS IS" BASIS,
-* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
-* See the License for the specific language governing permissions and
-* limitations under the License.
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.


Dependencies and Licenses
=========================

-Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
-without Leptonica.
+[Leptonica](http://www.leptonica.com) is required. Tesseract no longer
+compiles without Leptonica.

Libtiff is no longer required as a direct dependency.


Installing and Running Tesseract
--------------------------------

All Users Do NOT Ignore!

The tarballs are split into pieces.

tesseract-x.xx.tar.gz contains all the source code.

-tesseract-x.xx.<lang>.tar.gz contains the language data files for <lang>.
+tesseract-x.xx.`<lang>`.tar.gz contains the language data files for `<lang>`.
You need at least one of these or Tesseract will not work.

Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory.
-tesseract-x.xx.<lang>.tar.gz unpacks to the tessdata directory which
+tesseract-x.xx.`<lang>`.tar.gz unpacks to the tessdata directory which
belongs inside your tesseract-ocr directory. It is therefore best to
download them into your tesseract-x.xx directory, so you can use unpack
here or equivalent. You can unpack as many of the language packs as you
@@ -52,7 +56,7 @@ before you run make install. If you unpack them as root to the
destination directory of make install, then the user ids and access
permissions might be messed up.

-boxtiff-2.xx.<lang>.tar.gz contains data that was used in training for
+boxtiff-2.xx.`<lang>`.tar.gz contains data that was used in training for
those that want to do their own training. Most users should NOT download
these files.

@@ -63,8 +67,8 @@ Tesseract wiki https://github.com/tesseract-ocr/tesseract/wiki
Windows
-------

-Please use installer (for 3.00 and above). Tesseract is library with
-command line interface. If you need GUI, please check AddOns wiki page
+Please use the installer (for 3.00 and above). Tesseract is a library with a
+command line interface. If you need a GUI, please check the AddOns wiki page.

TODO-UPDATE-WIKI-LINKS

@@ -74,15 +78,15 @@ If you are building from the sources, the recommended build platform is
VC++ Express 2008 (optionally 2010).

The executables are built with static linking, so they stand more chance
-of working out of the box on more windows systems.
+of working out of the box on more Windows systems.

The executable must reside in the same directory as the tessdata
directory or you need to set up environment variable TESSDATA_PREFIX.
Installer will set it up for you.

The command line is:

-tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
+tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]

If you need interface to other applications, please check wrapper section
on AddOns wiki page:
@@ -98,19 +102,19 @@ Non-Windows (or Cygwin)
You have to tell Tesseract through a standard unix mechanism where to
find its data directory. You must either:

-./autogen.sh
-./configure
-make
-make install
-sudo ldconfig
+./autogen.sh
+./configure
+make
+make install
+sudo ldconfig

to move the data files to the standard place, or:

-export TESSDATA_PREFIX="directory in which your tessdata resides/"
+export TESSDATA_PREFIX="directory in which your tessdata resides/"

In either case the command line is:

-tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
+tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
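
As a rough illustration of the command line and the `TESSDATA_PREFIX` setting described above, here is a minimal Python sketch that assembles the argument list and the environment without actually invoking Tesseract. The helper names (`build_tesseract_command`, `tessdata_env`), the file names, the language code, and the install path are hypothetical and are not part of the README or of this commit.

```python
import os
import shlex


def build_tesseract_command(image, outputbase, lang=None, psm=None, configfiles=()):
    """Assemble argv in the order the README gives:
    tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
    """
    argv = ["tesseract", image, outputbase]
    if lang is not None:
        argv += ["-l", lang]
    if psm is not None:
        argv += ["-psm", str(psm)]
    argv += list(configfiles)  # optional config files come last
    return argv


def tessdata_env(tessdata_parent):
    """Return a copy of the environment with TESSDATA_PREFIX set.

    The README's example value ends in a slash, so one is appended if missing.
    """
    env = dict(os.environ)
    if not tessdata_parent.endswith("/"):
        tessdata_parent += "/"
    env["TESSDATA_PREFIX"] = tessdata_parent
    return env


if __name__ == "__main__":
    # Hypothetical inputs, for illustration only.
    argv = build_tesseract_command("page.tif", "page_out", lang="eng", psm=3)
    print(" ".join(shlex.quote(a) for a in argv))
    print(tessdata_env("/usr/local/share")["TESSDATA_PREFIX"])
```

The resulting list and environment could then be handed to `subprocess.run(argv, env=env)` on a machine where Tesseract and at least one language pack are installed.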

New there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for
the help.) It might work with your OS if you know how to do that.
@@ -126,8 +130,8 @@ instead of `./configure` above.

History
=======
-The engine was developed at Hewlett Packard Laboratories Bristol and
-at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
+The engine was developed at Hewlett-Packard Laboratories Bristol and
+at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
more changes made in 1996 to port to Windows, and some C++izing in 1998.
A lot of the code was written in C, and then some more was written in C++.
Since then all the code has been converted to at least compile with a C++
@@ -138,7 +142,7 @@ lists, but has the big negative that if you do get a segmentation violation,
it is hard to debug.

The most recent change is that Tesseract can now recognize 39 languages,
-including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants
+including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants,
is fully UTF8 capable, and is fully trainable. See TrainingTesseract for
more information on training.
