From a36a5f96d0598a731d00a5c3b21f7fe4e7b9d91f Mon Sep 17 00:00:00 2001
From: Robert Theis
Date: Thu, 21 May 2015 19:23:42 -0700
Subject: [PATCH] Minor edits to Readme
---
 README.md | 66 +++++++++++++++++++++++++++++--------------------------
 1 file changed, 35 insertions(+), 31 deletions(-)
diff --git a/README.md b/README.md
index e7a8e9502b..8a8fffb6ad 100644
--- a/README.md
+++ b/README.md
@@ -1,32 +1,35 @@
 Note that this is a text-only and possibly out-of-date version of the
 wiki ReadMe, which is located at:
- https://github.com/tesseract-ocr/tesseract/blob/master/README
+ https://github.com/tesseract-ocr/tesseract/blob/master/README.md
 Introduction
 ============
 This package contains the Tesseract Open Source OCR Engine.
-Originally developed at Hewlett Packard Laboratories Bristol and
-at Hewlett Packard Co, Greeley Colorado, all the code
+Originally developed at Hewlett-Packard Laboratories Bristol and
+at Hewlett-Packard Co, Greeley Colorado, all the code
 in this distribution is now licensed under the Apache License:
- * Licensed under the Apache License, Version 2.0 (the "License");
- * you may not use this file except in compliance with the License.
- * You may obtain a copy of the License at
- * http://www.apache.org/licenses/LICENSE-2.0
- * Unless required by applicable law or agreed to in writing, software
- * distributed under the License is distributed on an "AS IS" BASIS,
- * WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
- * See the License for the specific language governing permissions and
- * limitations under the License.
+ Licensed under the Apache License, Version 2.0 (the "License");
+ you may not use this file except in compliance with the License.
+ You may obtain a copy of the License at
+
+ http://www.apache.org/licenses/LICENSE-2.0
+
+ Unless required by applicable law or agreed to in writing, software
+ distributed under the License is distributed on an "AS IS" BASIS,
+ WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+ See the License for the specific language governing permissions and
+ limitations under the License.
 Dependencies and Licenses
 =========================
-Leptonica is required. (www.leptonica.com). Tesseract no longer compiles
-without Leptonica.
+[Leptonica](http://www.leptonica.com) is required. Tesseract no longer
+compiles without Leptonica.
+
 Libtiff is no longer required as a direct dependency.
@@ -34,15 +37,16 @@ Installing and Running Tesseract
 --------------------------------
 All Users Do NOT Ignore!
+
 The tarballs are split into pieces.
 tesseract-x.xx.tar.gz contains all the source code.
-tesseract-x.xx..tar.gz contains the language data files for .
+tesseract-x.xx.``.tar.gz contains the language data files for ``.
 You need at least one of these or Tesseract will not work.
 Note that tesseract-x.xx.tar.gz unpacks to the tesseract-ocr directory.
-tesseract-x.xx..tar.gz unpacks to the tessdata directory which
+tesseract-x.xx.``.tar.gz unpacks to the tessdata directory which
 belongs inside your tesseract-ocr directory. It is therefore best to
 download them into your tesseract-x.xx directory, so you can use unpack
 here or equivalent. You can unpack as many of the language packs as you
@@ -52,7 +56,7 @@ before you run make install.
 If you unpack them as root to the destination directory of make install,
 then the user ids and access permissions might be messed up.
-boxtiff-2.xx..tar.gz contains data that was used in training for
+boxtiff-2.xx.``.tar.gz contains data that was used in training for
 those that want to do their own training. Most users should NOT download
 these files.
@@ -63,8 +67,8 @@ Tesseract wiki https://github.com/tesseract-ocr/tesseract/wiki
 Windows
 -------
-Please use installer (for 3.00 and above). Tesseract is library with
-command line interface. If you need GUI, please check AddOns wiki page
+Please use the installer (for 3.00 and above). Tesseract is a library with a
+command line interface. If you need a GUI, please check the AddOns wiki page.
 TODO-UPDATE-WIKI-LINKS
@@ -74,7 +78,7 @@ If you are building from the sources, the recommended build platform is
 VC++ Express 2008 (optionally 2010).
 The executables are built with static linking, so they stand more chance
-of working out of the box on more windows systems.
+of working out of the box on more Windows systems.
 The executable must reside in the same directory as the tessdata
 directory or you need to set up environment variable TESSDATA_PREFIX.
@@ -82,7 +86,7 @@ Installer will set it up for you.
 The command line is:
-tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
+ tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
 If you need interface to other applications, please check wrapper section
 on AddOns wiki page:
@@ -98,19 +102,19 @@ Non-Windows (or Cygwin)
 You have to tell Tesseract through a standard unix mechanism where to
 find its data directory. You must either:
-./autogen.sh
-./configure
-make
-make install
-sudo ldconfig
+ ./autogen.sh
+ ./configure
+ make
+ make install
+ sudo ldconfig
 to move the data files to the standard place, or:
-export TESSDATA_PREFIX="directory in which your tessdata resides/"
+ export TESSDATA_PREFIX="directory in which your tessdata resides/"
 In either case the command line is:
-tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
+ tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
 New there is a tesseract.spec for making rpms. (Thanks to Andrew Ziem for
 the help.) It might work with your OS if you know how to do that.
@@ -126,8 +130,8 @@ instead of `./configure` above.
 History
 =======
-The engine was developed at Hewlett Packard Laboratories Bristol and
-at Hewlett Packard Co, Greeley Colorado between 1985 and 1994, with some
+The engine was developed at Hewlett-Packard Laboratories Bristol and
+at Hewlett-Packard Co, Greeley Colorado between 1985 and 1994, with some
 more changes made in 1996 to port to Windows, and some C++izing in 1998.
 A lot of the code was written in C, and then some more was written in C++.
 Since then all the code has been converted to at least compile with a C++
@@ -138,7 +142,7 @@ lists, but has the big negative that if you do get a segmentation
 violation, it is hard to debug.
 The most recent change is that Tesseract can now recognize 39 languages,
-including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants
+including Arabic, Hindi, Vietnamese, plus 3 Fraktur variants,
 is fully UTF8 capable, and is fully trainable. See TrainingTesseract for more
 information on training.
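
Note on the command line touched by this patch: the README gives the invocation as `tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]` and says the data directory can be located through `TESSDATA_PREFIX`. The following is a minimal Python sketch of how a caller might assemble that invocation. It is an illustration, not part of the patch or of Tesseract itself: the helper names `build_command`, `build_env`, and `run_tesseract` are hypothetical, and the flag spellings are copied from the README text as it stands in this diff.

```python
import os
import subprocess


def build_command(image, output_base, lang=None, psm=None, config_files=()):
    """Assemble argv in the order the README shows:
    tesseract imagename outputbase [-l lang] [-psm pagesegmode] [configfiles...]
    (hypothetical helper; flag names are taken from the README text).
    """
    argv = ["tesseract", image, output_base]
    if lang is not None:
        argv += ["-l", lang]
    if psm is not None:
        argv += ["-psm", str(psm)]
    argv += list(config_files)
    return argv


def build_env(tessdata_parent=None, base_env=None):
    """Return a copy of the environment, optionally with TESSDATA_PREFIX set.

    The README's example value ends with a slash, so one is enforced here.
    """
    env = dict(os.environ if base_env is None else base_env)
    if tessdata_parent is not None:
        env["TESSDATA_PREFIX"] = tessdata_parent.rstrip("/") + "/"
    return env


def run_tesseract(image, output_base, tessdata_parent=None, **options):
    """Run the tesseract binary; assumes it is installed and on PATH."""
    return subprocess.run(
        build_command(image, output_base, **options),
        env=build_env(tessdata_parent),
        check=True,
    )
```

Under these assumptions, `run_tesseract("page.png", "out", lang="eng")` would execute `tesseract page.png out -l eng`. Keeping the argv construction separate from the process launch lets the command be inspected without Tesseract installed.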