PyMuPDF integration #658

Closed
apyrgio opened this issue Dec 18, 2023 · 6 comments
Labels
development Development-focused changes

Comments

@apyrgio
Contributor

apyrgio commented Dec 18, 2023

The possibility of using PyMuPDF was brought up as a solution to the congestion problem we encountered in #616, and was immediately introduced in PR #622.

While looking further into how PyMuPDF works, though, we realized that it can help us tackle more problems than the original one. At the time of writing, our understanding is that we can use PyMuPDF to:

  1. Replace pdfinfo / pdftoppm in the 1st stage of the conversion (#622, "Use PyMuPDF to solve most congestions issues in /tmp (client & Server)").
  2. Replace all external commands (gm / tesseract / pdfunite / ps2pdf) in the 2nd stage of the conversion and perform the conversion on the Linux/macOS/Windows hosts (#625, "On-host pixels to PDF conversion").
  3. Convert a PDF to pixels, and pixels to a (searchable) PDF, without touching the filesystem (#633, "Defense in Depth - Traceless Sanitization"; #443, "Containers: have progress streamed instead of via mounted volumes (and deprecate doc_to_pixels_qubes_wrapper.py)").

This issue holds all of our questions regarding the integration of PyMuPDF, either in terms of feasibility, security, or performance, as well as other effects it has in our code.
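
As a rough illustration of item 3 above, the sketch below round-trips a PDF through raw RGB pixels and back into a searchable PDF entirely in memory. It assumes that PyMuPDF is installed and that Tesseract language data is reachable (through the `tessdata` argument or the `TESSDATA_PREFIX` environment variable). The function names, the DPI value, and the `(width, height, rgb_bytes)` tuple layout are hypothetical and chosen for illustration; this is not Dangerzone's actual code.

```python
try:
    import fitz  # PyMuPDF
except ImportError:  # the pure helper below still works without PyMuPDF
    fitz = None


def rgb_buffer_size(width: int, height: int) -> int:
    """Number of bytes in a tightly packed 8-bit RGB buffer (no alpha)."""
    return width * height * 3


def pdf_to_pixels(pdf_bytes: bytes, dpi: int = 150):
    """Yield (width, height, rgb_bytes) per page, reading the PDF from memory."""
    with fitz.open(stream=pdf_bytes, filetype="pdf") as doc:
        for page in doc:
            pix = page.get_pixmap(dpi=dpi, alpha=False)
            yield pix.width, pix.height, pix.samples


def pixels_to_pdf(pages, tessdata=None) -> bytes:
    """Rebuild a searchable PDF in memory from (width, height, rgb_bytes) tuples."""
    out = fitz.open()  # new, empty PDF
    for width, height, samples in pages:
        if len(samples) != rgb_buffer_size(width, height):
            raise ValueError("pixel buffer does not match the page dimensions")
        pix = fitz.Pixmap(fitz.Colorspace(fitz.CS_RGB), width, height, samples, False)
        # OCR the page into a one-page PDF, then append it to the output.
        with fitz.open("pdf", pix.pdfocr_tobytes(tessdata=tessdata)) as page_pdf:
            out.insert_pdf(page_pdf)
    return out.tobytes()
```

With both functions in place, `pixels_to_pdf(pdf_to_pixels(pdf_bytes))` would produce the sanitized document as a `bytes` object, with no temporary files involved.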

@apyrgio apyrgio added the development Development-focused changes label Dec 18, 2023
@apyrgio
Contributor Author

apyrgio commented Dec 18, 2023

How does PyMuPDF integrate with Tesseract?

PyMuPDF directly uses the C API of Tesseract. More specifically, it seems to statically link with the Tesseract library. To confidently answer this, we need to review the build scripts. However, there are some good indications that this is the case:

  1. The PyMuPDF package on Debian does not list Tesseract or MuPDF as dependencies.
  2. The PyMuPDF API allows the user to specify the Tesseract data directory, but not the path to the Tesseract binary.

Also, we have tested that, on both a Windows and a macOS host, the following code works without installing Tesseract, with only PyMuPDF installed from PyPI:

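import fitz  # PyMuPDF

# Render the first page of a sample PDF to a pixmap.
doc = fitz.open("./tests/test_docs/sample-pdf.pdf")
page = doc.load_page(0)
pix = page.get_pixmap()
# OCR the pixmap into a searchable, one-page PDF held in memory.
buf = pix.pdfocr_tobytes(tessdata="/path/tessdata_fast-4.1.0")
# Write the result, making sure the file is flushed and closed.
with open("./test.pdf", "wb") as f:
    f.write(buf)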

This means that we can do OCR on macOS / Windows hosts, which we previously thought highly difficult (#625).
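
When repeating the test above, one way to rule out a system-wide Tesseract being picked up is to confirm that no `tesseract` binary is on the `PATH`. The helper below is a small sketch for that check. The function name is hypothetical, and a missing binary only supports the static-linking theory; it does not prove it.

```python
import shutil


def find_binaries(names):
    """Map each tool name to its resolved path on PATH, or None if absent."""
    return {name: shutil.which(name) for name in names}


if __name__ == "__main__":
    for name, path in find_binaries(["tesseract"]).items():
        print(f"{name}: {path or 'not found on PATH'}")
```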

@apyrgio
Contributor Author

apyrgio commented Dec 18, 2023

Does PyMuPDF use GhostScript?

Even though PyMuPDF and GhostScript are developed by the same company (Artifex), (Py)MuPDF does not use GhostScript. From https://en.wikipedia.org/wiki/MuPDF:

Fitz was originally intended as an R&D project to replace the aging Ghostscript graphics library, but has instead become the rendering engine powering MuPDF.

Grepping for ghostscript / postscript throughout the code does not yield any result that shows that GhostScript is involved. In fact, PostScript code seems to be handled within MuPDF itself.

Removing our dependency on GhostScript is good news, since it has been the source of CVEs in the past.
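
The search described above can be reproduced with a plain `grep -ri`, or with a short script such as the sketch below. It assumes a local checkout of the source tree; `grep_tree` and the pattern are illustrative names, and a text search like this shows only where the words appear, so every hit still needs to be read in context.

```python
import os
import re

PATTERN = re.compile(r"ghostscript|postscript", re.IGNORECASE)


def grep_tree(root, pattern=PATTERN):
    """Return (path, line_number, line) for every line under root matching pattern."""
    hits = []
    for dirpath, _dirnames, filenames in os.walk(root):
        for filename in filenames:
            path = os.path.join(dirpath, filename)
            try:
                with open(path, encoding="utf-8", errors="ignore") as handle:
                    for number, line in enumerate(handle, start=1):
                        if pattern.search(line):
                            hits.append((path, number, line.strip()))
            except OSError:
                continue  # skip unreadable files
    return hits
```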

@apyrgio
Contributor Author

apyrgio commented Dec 18, 2023

How does PyMuPDF affect our container image size?

The fact that PyMuPDF allows 2nd stage conversion on the host opens the way for lots of improvements in the container image. Basically, the only packages that we need to install are:

  1. LibreOffice
  2. PyMuPDF
  3. python3-magic
  4. fonts-noto-cjk
  5. OpenJDK8

Unfortunately, PyMuPDF is not available on Alpine Linux. This means that we need to install it with pip install, and add some build dependencies as well. Here are some findings for reducing the image size:

  1. We should delete our build dependencies on the same step that we install them, so that they are not included in the image layer.

  2. When using pip install, we should prevent it from using a filesystem cache (for example, with the --no-cache-dir option). Otherwise, the cache can take up more than 100MiB:

     / # du -hd 1 /root | sort -h
     133.8M  /root/.cache
     133.9M  /root

  3. When building PyMuPDF from source, a fitz_new module is also built. This is a "rebased" implementation of PyMuPDF that is probably not ready for production use yet. We can shave off about 50MiB by removing it:

     / # du -hd 1 /usr/lib/python3.11/site-packages | sort -h
     [...]
     28.3M   /usr/lib/python3.11/site-packages/fitz
     49.7M   /usr/lib/python3.11/site-packages/fitz_new
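
For images that lack `du` or `sort`, the same kind of measurement can be approximated in Python. The sketch below is a hypothetical helper, not part of Dangerzone. Note that it sums apparent file sizes, whereas `du` reports disk usage, so the numbers will not match the listings above exactly.

```python
import os


def dir_size(path):
    """Total size in bytes of all files under path (symlinks are not followed)."""
    total = 0
    for dirpath, _dirnames, filenames in os.walk(path):
        for filename in filenames:
            file_path = os.path.join(dirpath, filename)
            if not os.path.islink(file_path):
                total += os.path.getsize(file_path)
    return total


def subdir_sizes(root):
    """Return (name, size_in_bytes) for each direct subdirectory, smallest first."""
    sizes = [
        (entry.name, dir_size(entry.path))
        for entry in os.scandir(root)
        if entry.is_dir(follow_symlinks=False)
    ]
    return sorted(sizes, key=lambda item: item[1])
```

For example, `subdir_sizes("/usr/lib/python3.11/site-packages")` would list the largest packages last, mirroring `du -hd 1 ... | sort -h`.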

What about other OSes?

The fact that PyMuPDF is difficult to build on Alpine Linux raises the question: can we use a different OS? It turns out that PyMuPDF is available in the official Debian repos. This is good, because we can take advantage of two Debian properties that Alpine Linux does not have:

  1. Slim down our container image with --no-install-recommends / --no-install-suggests. Alpine Linux does not have this flag, but instead allows you to arbitrarily delete packages. This may be very brittle though.
  2. Install the libreoffice-core-nogui flavor of LibreOffice. This flavor has the minimum requirements for scripting LibreOffice, and does not bring any extra libraries, such as Wayland and Mesa.

On the flip side, Alpine Linux is a rolling release distro, which always gets the latest version of the upstream packages. So, we use it not just for its small footprint, but for its security properties as well. Debian takes security very seriously as well, in two different ways:

  • Stable flavors (Bullseye / Bookworm) generally offer less recent versions of a software, but backport security fixes from upstream as soon as possible.

  • Testing / Unstable flavors (Trixie / Sid) are closer to the upstream versions, but are not guaranteed to get security fixes, because they rely on upstream to include them:

    Sid exclusively gets security updates through its package maintainers. The Debian Security Team only maintains security updates for the current "stable" release.

So, it seems that if we were to switch from Alpine Linux to Debian, the Testing/Unstable flavors would offer similar security guarantees.

Comparisons

The following tables offer comparisons between the following image types:

  • Alpine (current): This is the Alpine image as built from the main branch.
  • Alpine (PyMuPDF): This is the Alpine image that has been tweaked to install only the necessary packages, plus PyMuPDF.
  • Debian (Unstable): This is the debian:unstable-slim image that installs only the necessary packages with --no-install-recommends / --no-install-suggests.
  • Debian (Stable): This is the debian:bookworm-slim image that installs only the necessary packages with --no-install-recommends / --no-install-suggests.

Image size impact

| Image | Compressed (MiB) | Uncompressed (MiB) |
|---|---|---|
| Alpine (current) | 624 | 1372 |
| Alpine (PyMuPDF) | 413 | 862 |
| Debian (Unstable) | 256 | 570 |
| Debian (Stable) | 253 | 564 |

| Image | Packages |
|---|---|
| Alpine (current) | 286 |
| Alpine (PyMuPDF) | 273 |
| Debian (Unstable) | 222 |
| Debian (Stable) | 221 |
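
To put the size figures in relative terms, the sketch below computes how much smaller each candidate image is than the current Alpine image. The numbers are copied from the size table above; the dictionary layout and the helper name are illustrative.

```python
# (compressed MiB, uncompressed MiB), copied from the size table above
SIZES_MIB = {
    "Alpine (current)": (624, 1372),
    "Alpine (PyMuPDF)": (413, 862),
    "Debian (Unstable)": (256, 570),
    "Debian (Stable)": (253, 564),
}


def reduction_percent(baseline, value):
    """Percentage by which value is smaller than baseline."""
    return (baseline - value) / baseline * 100


base_compressed, base_uncompressed = SIZES_MIB["Alpine (current)"]
for image, (compressed, uncompressed) in SIZES_MIB.items():
    if image == "Alpine (current)":
        continue  # the baseline itself
    print(
        f"{image}: compressed -{reduction_percent(base_compressed, compressed):.1f}%, "
        f"uncompressed -{reduction_percent(base_uncompressed, uncompressed):.1f}%"
    )
```

By this measure, both Debian images come out at a bit under 60% smaller than the current Alpine image, compressed and uncompressed alike.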

CVEs impact

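| Image | Critical | High | Medium | Low | Negligible |
|---|---|---|---|---|---|
| Alpine (current) | 0 | 14 | 37 | 6 | 0 |
| Alpine (PyMuPDF) | 0 | 13 | 35 | 6 | 0 |
| Debian (Unstable) | 0 | 3 | 8 | 6 | 129 |
| Debian (Stable) | 1 | 17 | 25 | 10 | 132 |
| Debian (Stable, excluding won't fix) | 0 | 4 | 6 | 0 | 131 |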

(Debian Stable marks some CVEs as won't fix, meaning that a vulnerability does not apply to it)
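
The large Negligible counts on Debian make the rows hard to compare at a glance. The sketch below totals each row while leaving that column out. Excluding Negligible findings is an assumption of this sketch, not a conclusion of the investigation, and the counts are copied from the CVE table above.

```python
SEVERITIES = ("Critical", "High", "Medium", "Low", "Negligible")

# Counts copied from the CVE table above, in SEVERITIES order.
CVES = {
    "Alpine (current)": (0, 14, 37, 6, 0),
    "Alpine (PyMuPDF)": (0, 13, 35, 6, 0),
    "Debian (Unstable)": (0, 3, 8, 6, 129),
    "Debian (Stable)": (1, 17, 25, 10, 132),
    "Debian (Stable, excluding won't fix)": (0, 4, 6, 0, 131),
}


def total_without_negligible(counts):
    """Sum every severity column except Negligible."""
    return sum(
        count for name, count in zip(SEVERITIES, counts) if name != "Negligible"
    )


for image, counts in CVES.items():
    print(f"{image}: {total_without_negligible(counts)}")
```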

@apyrgio
Contributor Author

apyrgio commented Dec 18, 2023

What is PyMuPDF's potential impact?

The following diagram shows how the integration of PyMuPDF opens the door for more improvements throughout the codebase, and how it solves some limitations.

[Diagram: PyMuPDF Impact]

(this file was created in https://draw.io, and can be edited there by uploading the above .png, since it has the diagram embedded in it. sweet...)

@deeplow
Contributor

deeplow commented Dec 19, 2023

Thanks for this investigation @apyrgio! The combination of PyMuPDF and Debian stable slim does seem really promising.

deeplow added a commit that referenced this issue Dec 22, 2023
PyMuPDF replaced the need for almost all dependencies, which this commit
now removes.

We are also removing tesseract-ocr as a dependency since
(to our surprise) PyMuPDF ships directly with tesseract binaries [1].
However, now that tesseract-ocr is not available directly as a binary
tool, the `test_ocr.py` needed to be changed.

Fixes #568

[1]: #658 (comment)
deeplow added a commit that referenced this issue Dec 22, 2023
PyMuPDF replaced the need for almost all dependencies, which this commit
now removes.

We are also removing tesseract-ocr as a dependency since
(to our surprise) PyMuPDF ships directly with tesseract binaries [1].
However, now that tesseract-ocr is not available directly as a binary
tool, the `test_ocr.py` needed to be changed.

Fixes #658

[1]: #658 (comment)
deeplow added a commit that referenced this issue Jan 3, 2024
License change required due to the inclusion of the AGPL-licensed
PyMuPDF. This library greatly benefited Dangerzone in many aspects
detailed in [1].

Fixes #658

[1]: #658
@deeplow
Contributor

deeplow commented Jan 3, 2024

Performance Impact of PyMuPDF

We stress-tested PyMuPDF on a large set of test documents and found that, overall, it did not degrade performance for most of them. In many cases it actually improved performance, although this is hard to quantify since we don't have a real-world set of documents.

Other impacts of PyMuPDF

We summarized some of the results in this presentation

@deeplow deeplow closed this as completed in f676891 Jan 4, 2024