Jules - improved how we specify and find tesseract data. #4093

Merged
merged 2 commits into from
Nov 27, 2024
30 changes: 13 additions & 17 deletions docs/functions.rst
@@ -68,7 +68,6 @@ Yet others are handy, general-purpose utilities.
:meth:`get_tessdata` locates the language support of the Tesseract-OCR installation
:attr:`fitz_fontdescriptors` dictionary of available supplement fonts
:attr:`PYMUPDF_MESSAGE` destination of |PyMuPDF| messages.
:attr:`TESSDATA_PREFIX` a copy of `os.environ["TESSDATA_PREFIX"]`
:attr:`pdfcolor` dictionary of almost 500 RGB colors in PDF format.
==================================== ==============================================================

@@ -379,18 +378,6 @@ Yet others are handy, general-purpose utilities.
Also see `set_messages()`.


-----

.. attribute:: TESSDATA_PREFIX

* New in v1.19.4

Copy of `os.environ["TESSDATA_PREFIX"]` for convenient checking whether there is integrated Tesseract OCR support.

If this attribute is `None`, Tesseract-OCR is either not installed, or the environment variable is not set to point to Tesseract's language support folder.

.. note:: This variable is now checked before OCR functions are tried. This prevents verbose messages from MuPDF.

-----

.. attribute:: pdfcolor
@@ -850,13 +837,22 @@ Yet others are handy, general-purpose utilities.

-----

.. method:: get_tessdata()
.. method:: get_tessdata(tessdata=None)

Detect Tesseract language support folder.

Return the name of Tesseract's language support folder. Use this function if the environment variable `TESSDATA_PREFIX` has not been set.
This function is used to enable OCR via Tesseract even if the language
support folder is not specified directly or in environment variable
TESSDATA_PREFIX.

:returns: `os.getenv("TESSDATA_PREFIX")` if not `None`. Otherwise, if Tesseract-OCR is installed, locate the name of `tessdata`. If no installation is found, return `False`.
* If <tessdata> is set we return it directly.

* Otherwise we return `os.environ['TESSDATA_PREFIX']` if set.

* Otherwise we search for a Tesseract installation and return its language
support folder.

The folder name can be used as parameter `tessdata` in methods :meth:`Page.get_textpage_ocr`, :meth:`Pixmap.pdfocr_save` and :meth:`Pixmap.pdfocr_tobytes`.
* Otherwise we raise an exception.
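
The lookup order listed above can be sketched roughly as follows. This is a simplified illustration, not the PyMuPDF implementation: `resolve_tessdata` and its `search` parameter are hypothetical names, and the search for a Tesseract installation is stubbed out as a callable.

```python
import os


def resolve_tessdata(tessdata=None, search=None):
    """Hypothetical sketch of the lookup order; not PyMuPDF's actual code."""
    # 1. An explicitly passed folder wins.
    if tessdata:
        return tessdata
    # 2. Fall back to the environment variable.
    env = os.environ.get("TESSDATA_PREFIX")
    if env:
        return env
    # 3. Search for an installation (assumed to be supplied as a callable here).
    found = search() if search else None
    if found:
        return found
    # 4. Nothing found: raise instead of returning False.
    raise RuntimeError("No tessdata specified and Tesseract is not installed")
```

Callers that previously tested the return value for `False` would, under this scheme, wrap the call in `try`/`except RuntimeError` instead.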

-----

32 changes: 23 additions & 9 deletions docs/installation.rst
@@ -159,7 +159,12 @@ Notes
* `Pillow <https://pypi.org/project/Pillow/>`_ is required for :meth:`Pixmap.pil_save` and :meth:`Pixmap.pil_tobytes`.
* `fontTools <https://pypi.org/project/fonttools/>`_ is required for :meth:`Document.subset_fonts`.
* `pymupdf-fonts <https://pypi.org/project/pymupdf-fonts/>`_ is a collection of nice fonts to be used for text output methods.
* `Tesseract-OCR <https://github.com/tesseract-ocr/tesseract>`_ for optical character recognition in images and document pages. Tesseract is separate software, not a Python package. To enable OCR functions in PyMuPDF, the software must be installed and the system environment variable `"TESSDATA_PREFIX"` must be defined and contain the `tessdata` folder name of the Tesseract installation location. See below.
*
`Tesseract-OCR <https://github.com/tesseract-ocr/tesseract>`_ for optical
character recognition in images and document pages. Tesseract is separate
software, not a Python package. To enable OCR functions in PyMuPDF,
Tesseract must be installed and the `tessdata` folder name specified; see
below.

.. note:: You can install these additional components at any time -- before or after installing PyMuPDF. PyMuPDF will detect their presence during import or when the respective functions are being used.

@@ -271,18 +276,27 @@ If you do not intend to use this feature, skip this step. Otherwise, it is requi

PyMuPDF will already contain all the logic to support OCR functions. But it additionally does need `Tesseract’s language support data <https://github.com/tesseract-ocr/tessdata>`_.

The language support folder location must be communicated either via storing it in the environment variable `"TESSDATA_PREFIX"`, or as a parameter in the applicable functions.
If not specified explicitly, PyMuPDF will attempt to find the installed
Tesseract's `tessdata` folder, but this search should not be relied upon.

If that search fails, PyMuPDF requires that Tesseract's language support
folder be specified explicitly, either in the `tessdata` argument of the
PyMuPDF OCR functions or in `os.environ["TESSDATA_PREFIX"]`.

So for a working OCR functionality, make sure to complete this checklist:

1. Locate Tesseract's language support folder. Typically you will find it here:
- Windows: `C:/Program Files/Tesseract-OCR/tessdata`
- Unix systems: `/usr/share/tesseract-ocr/4.00/tessdata`

2. Set the environment variable `TESSDATA_PREFIX`
- Windows: `setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"`
- Unix systems: `declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata`

.. note:: On Windows systems, this must happen outside Python -- before starting your script. Just manipulating `os.environ` will not work!
* Windows: `C:/Program Files/Tesseract-OCR/tessdata`
* Unix systems: `/usr/share/tesseract-ocr/4.00/tessdata`

2. Specify the language support folder when calling PyMuPDF OCR functions:

* Set the `tessdata` argument.
* Or set `os.environ["TESSDATA_PREFIX"]` from within Python.
* Or set environment variable `TESSDATA_PREFIX` before running Python, for example:

* Windows: `setx TESSDATA_PREFIX "C:/Program Files/Tesseract-OCR/tessdata"`
* Unix systems: `declare -x TESSDATA_PREFIX=/usr/share/tesseract-ocr/4.00/tessdata`
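
As an illustration of the second option, the environment variable can be set from within Python for the duration of an OCR call. The sketch below uses only the standard library; `tessdata_prefix` is a hypothetical helper, not part of PyMuPDF, and the path in the usage comment is the typical Unix location from the checklist, which may differ on your system.

```python
import os
from contextlib import contextmanager


@contextmanager
def tessdata_prefix(path):
    """Temporarily set TESSDATA_PREFIX (hypothetical helper, not part of PyMuPDF)."""
    previous = os.environ.get("TESSDATA_PREFIX")
    os.environ["TESSDATA_PREFIX"] = path
    try:
        yield path
    finally:
        # Restore the previous state, whether or not the variable existed.
        if previous is None:
            os.environ.pop("TESSDATA_PREFIX", None)
        else:
            os.environ["TESSDATA_PREFIX"] = previous


# Usage, assuming `page` is a PyMuPDF Page object:
# with tessdata_prefix("/usr/share/tesseract-ocr/4.00/tessdata"):
#     textpage = page.get_textpage_ocr()
```

Passing the folder through the `tessdata` argument of the OCR function achieves the same result without touching the process environment.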

.. include:: footer.rst
2 changes: 2 additions & 0 deletions scripts/gh_release.py
@@ -359,6 +359,8 @@ def env_set(name, value, pass_=False):
if pass_:
env_pass(name)

env_pass('PYMUPDF_SETUP_PY_LIMITED_API')

if os.environ.get('PYMUPDF_SETUP_LIBCLANG'):
env_pass('PYMUPDF_SETUP_LIBCLANG')

72 changes: 37 additions & 35 deletions src/__init__.py
@@ -430,7 +430,6 @@ def _format_g(value, *, fmt='%g'):
Page = 'Page_forward_decl'
Point = 'Point_forward_decl'

TESSDATA_PREFIX = os.environ.get("TESSDATA_PREFIX")
matrix_like = 'matrix_like'
point_like = 'point_like'
quad_like = 'quad_like'
@@ -10305,8 +10304,7 @@ def pdfocr_save(self, filename, compress=1, language=None, tessdata=None):
'''
Save pixmap as an OCR-ed PDF page.
'''
if not TESSDATA_PREFIX and not tessdata:
raise RuntimeError('No OCR support: TESSDATA_PREFIX not set')
tessdata = get_tessdata(tessdata)
opts = mupdf.FzPdfocrOptions()
opts.compress = compress
if language:
@@ -10328,15 +10326,15 @@ def pdfocr_tobytes(self, compress=True, language="eng", tessdata=None):
compress: (bool) compress, default 1 (True).
language: (str) language(s) occurring on page, default "eng" (English),
multiples like "eng+ger" for English and German.
tessdata: (str) folder name of Tesseract's language support. Must be
given if environment variable TESSDATA_PREFIX is not set.
tessdata: (str) folder name of Tesseract's language support. If None
we use environment variable TESSDATA_PREFIX or search for
Tesseract installation.
Notes:
On failure, make sure Tesseract is installed and you have set the
environment variable "TESSDATA_PREFIX" to the folder containing your
Tesseract's language support data.
On failure, make sure Tesseract is installed and you have set
<tessdata> or environment variable "TESSDATA_PREFIX" to the folder
containing your Tesseract's language support data.
"""
if not TESSDATA_PREFIX and not tessdata:
raise RuntimeError('No OCR support: TESSDATA_PREFIX not set')
tessdata = get_tessdata(tessdata)
from io import BytesIO
bio = BytesIO()
self.pdfocr_save(bio, compress=compress, language=language, tessdata=tessdata)
@@ -18309,55 +18307,59 @@ def make_utf16be(s):
return "(" + r + ")"


def get_tessdata():
"""Detect Tesseract-OCR and return its language support folder.
def get_tessdata(tessdata=None):
"""Detect Tesseract language support folder.

This function can be used to enable OCR via Tesseract even if the
environment variable TESSDATA_PREFIX has not been set.
If the value of TESSDATA_PREFIX is None, the function tries to locate
Tesseract-OCR and fills the required variable.
This function is used to enable OCR via Tesseract even if the language
support folder is not specified directly or in environment variable
TESSDATA_PREFIX.

Returns:
Folder name of tessdata if Tesseract-OCR is available, otherwise False.
"""
TESSDATA_PREFIX = os.getenv("TESSDATA_PREFIX")
if TESSDATA_PREFIX: # use environment variable if set
return TESSDATA_PREFIX
* If <tessdata> is set we return it directly.

* Otherwise we return `os.environ['TESSDATA_PREFIX']` if set.

* Otherwise we search for a Tesseract installation and return its language
support folder.

* Otherwise we raise an exception.
"""
Try to locate the tesseract-ocr installation.
"""
if tessdata:
return tessdata
tessdata = os.getenv("TESSDATA_PREFIX")
if tessdata: # use environment variable if set
return tessdata

# Try to locate the tesseract-ocr installation.

import subprocess
# Windows systems:
if sys.platform == "win32":
cp = subprocess.run("where tesseract", shell=1, capture_output=1, check=0, text=True)
response = cp.stdout.strip()
if cp.returncode or not response:
message("Tesseract-OCR is not installed")
return False
raise RuntimeError("No tessdata specified and Tesseract is not installed")
dirname = os.path.dirname(response) # path of tesseract.exe
tessdata = os.path.join(dirname, "tessdata") # language support
if os.path.exists(tessdata): # all ok?
return tessdata
else: # should not happen!
message("unexpected: Tesseract-OCR has no 'tessdata' folder")
return False
raise RuntimeError(f"No tessdata specified and Tesseract installation has no {tessdata} folder")

# Unix-like systems:
cp = subprocess.run("whereis tesseract-ocr", shell=1, capture_output=1, check=0, text=True)
response = cp.stdout.strip().split()
if cp.returncode or len(response) != 2: # if not 2 tokens: no tesseract-ocr
message("tesseract-ocr is not installed")
return False
raise RuntimeError("No tessdata specified and Tesseract is not installed")

# search tessdata in folder structure
dirname = response[1] # contains tesseract-ocr installation folder
tessdatas = glob.glob(f"{dirname}/*/tessdata")
pattern = f"{dirname}/*/tessdata"
tessdatas = glob.glob(pattern)
tessdatas.sort()
if len(tessdatas) == 0:
message("unexpected: tesseract-ocr has no 'tessdata' folder")
return False
return tessdatas[-1]
if tessdatas:
return tessdatas[-1]
else:
raise RuntimeError(f"No tessdata specified and Tesseract installation has no {pattern} folder.")
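
The Unix branch above globs for `*/tessdata` below the installation folder, sorts the matches and returns the last one. A small standalone sketch of that selection step, using a temporary directory in place of a real installation (`pick_tessdata` is a hypothetical name, and the version folder names are made up for the demonstration):

```python
import glob
import os
import tempfile


def pick_tessdata(dirname):
    """Return the last match of <dirname>/*/tessdata in sorted order, or None."""
    candidates = sorted(glob.glob(f"{dirname}/*/tessdata"))
    return candidates[-1] if candidates else None


with tempfile.TemporaryDirectory() as root:
    # Simulate two installed versions side by side.
    for version in ("4.00", "5"):
        os.makedirs(os.path.join(root, version, "tessdata"))
    chosen = pick_tessdata(root)
    # Plain string sorting places ".../5/tessdata" after ".../4.00/tessdata".
    chosen_version = os.path.basename(os.path.dirname(chosen))
```

Because the sort is a plain string comparison, a folder named `10` would sort before one named `9`, so the last entry is not necessarily the newest version.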


def css_for_pymupdf_font(
4 changes: 1 addition & 3 deletions src/utils.py
@@ -25,7 +25,6 @@

g_exceptions_verbose = pymupdf.g_exceptions_verbose

TESSDATA_PREFIX = os.environ.get("TESSDATA_PREFIX")
point_like = "point_like"
rect_like = "rect_like"
matrix_like = "matrix_like"
@@ -748,8 +747,7 @@ def get_textpage_ocr(
full: (bool) whether to OCR the full page image, or only its images (default)
"""
pymupdf.CheckParent(page)
if not TESSDATA_PREFIX and not tessdata:
raise RuntimeError("No OCR support: TESSDATA_PREFIX not set")
tessdata = pymupdf.get_tessdata(tessdata)

def full_ocr(page, dpi, language, flags):
zoom = dpi / 72