PyMuPDF integration #658

apyrgio · 2023-12-18T16:02:11Z

The possibility of using PyMuPDF was brought up as a solution to the congestion problem we encountered in #616, and was immediately introduced in PR #622.

While looking further into how PyMuPDF works, we realized that it can help us tackle more problems than the original one. At the time of writing, our understanding is that we can use PyMuPDF to:

Replace pdfinfo / pdftoppm in the 1st stage of the conversion (see #622, "Use PyMuPDF to solve most congestions issues in /tmp (client & Server)").
Replace all external commands (gm / tesseract / pdfunite / ps2pdf) in the 2nd stage of the conversion, and perform that conversion on the Linux/macOS/Windows hosts (see #625, "On-host pixels to PDF conversion").
Convert a PDF to pixels, and pixels to a (searchable) PDF, without touching the filesystem (see #633, "Defense in Depth - Traceless Sanitization", and #443, "Containers: have progress streamed instead of via mounted volumes (and deprecate doc_to_pixels_qubes_wrapper.py)").

This issue holds all of our questions regarding the integration of PyMuPDF, either in terms of feasibility, security, or performance, as well as other effects it has in our code.


apyrgio · 2023-12-18T16:50:13Z

How does PyMuPDF integrate with Tesseract?

PyMuPDF directly uses the C API of Tesseract. More specifically, it seems to statically link with the Tesseract library. To confidently answer this, we need to review the build scripts. However, there are some good indications that this is the case:

The PyMuPDF package on Debian does not list Tesseract or MuPDF as dependencies.
The PyMuPDF API allows the user to specify the Tesseract data directory, but not the path to the Tesseract binary.

Also, we have tested that on Windows and macOS hosts, the following code works without installing Tesseract, with only PyMuPDF installed from PyPI:

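import fitz  # PyMuPDF

# Render the first page to pixels, then OCR the pixels into a searchable PDF.
doc = fitz.open("./tests/test_docs/sample-pdf.pdf")
page = doc.load_page(0)
pix = page.get_pixmap()
buf = pix.pdfocr_tobytes(tessdata="/path/tessdata_fast-4.1.0")
# Use a context manager so that the output file is flushed and closed.
with open("./test.pdf", "wb") as f:
    f.write(buf)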

This means that we can do OCR on macOS / Windows hosts, which we previously thought would be very difficult (#625).
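One way to reproduce the precondition of that test is to confirm that no standalone Tesseract binary is reachable on the host before running the snippet. The following is a minimal sketch that uses only the standard library. The helper name `missing_tools` is hypothetical, and the tool list simply mirrors the external commands named earlier in this issue:

```python
import shutil
from typing import Callable, Iterable, Optional


def missing_tools(
    tools: Iterable[str],
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list[str]:
    """Return the tools from `tools` that cannot be found on PATH."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    # External commands that the 2nd stage of the conversion used to call.
    external = ["gm", "tesseract", "pdfunite", "ps2pdf"]
    print("Not installed:", missing_tools(external))
```

If `tesseract` shows up in the output and the OCR snippet still produces a searchable PDF, the OCR engine must be coming from PyMuPDF itself. The `which` parameter exists only so that the lookup can be replaced in tests.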

apyrgio · 2023-12-18T17:19:35Z

Does PyMuPDF use GhostScript?

Even though PyMuPDF and GhostScript are developed by the same company (Artifex), (Py)MuPDF does not use GhostScript. From https://en.wikipedia.org/wiki/MuPDF:

Fitz was originally intended as an R&D project to replace the aging Ghostscript graphics library, but has instead become the rendering engine powering MuPDF.

Grepping for ghostscript / postscript throughout the code does not yield any result that suggests GhostScript is involved. In fact, PostScript code seems to be handled within MuPDF itself.
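The search described above can be repeated with a few lines of Python. This is only a sketch of a case-insensitive search over a source checkout. The function name and the default terms are assumptions, and it is not necessarily how the original check was run:

```python
import re
from pathlib import Path


def grep_tree(root: str, terms=("ghostscript", "postscript")):
    """Yield (path, line_number, line) for lines mentioning any term."""
    pattern = re.compile("|".join(map(re.escape, terms)), re.IGNORECASE)
    for path in sorted(Path(root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file, skip it
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                yield path, number, line.strip()
```

Calling `list(grep_tree(path_to_checkout))` on a local clone of the PyMuPDF or MuPDF sources returns every matching line, which can then be reviewed by hand to tell real GhostScript usage apart from comments or PostScript handling.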

Removing our dependency on GhostScript is good news, since it has been the source of CVEs in the past.

apyrgio · 2023-12-18T18:14:40Z

How does PyMuPDF affect our container image size?

The fact that PyMuPDF allows 2nd stage conversion on the host opens the way for lots of improvements in the container image. Basically, the only packages that we need to install are:

LibreOffice
PyMuPDF
python3-magic
fonts-noto-cjk
OpenJDK8

Unfortunately, PyMuPDF is not available on Alpine Linux. This means that we need to install it with pip install, and add some build dependencies as well. Here are some findings for reducing the image size:

We should delete our build dependencies in the same step that installs them, so that they are not included in the image layer.
When using pip install, we should make it not use a filesystem cache (e.g. with pip's --no-cache-dir option). Otherwise, the cache can take up more than 100 MiB:
```
 / # du -hd 1 /root | sort -h
 133.8M  /root/.cache
 133.9M  /root
```
When building PyMuPDF from source, a fitz_new module is also built. It is a "rebased" implementation of PyMuPDF that is probably not ready for production use yet. We can shave off about 50 MiB by removing it:
```
/ # du -hd 1 /usr/lib/python3.11/site-packages | sort -h
[...]
28.3M   /usr/lib/python3.11/site-packages/fitz
49.7M   /usr/lib/python3.11/site-packages/fitz_new
```
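The `du -hd 1` listings above can also be produced from Python, which may be handy when auditing an image from a script. This is a rough sketch that assumes nothing beyond the standard library. `dir_sizes` and `report` are hypothetical helpers, and they sum apparent file sizes, so the numbers can differ slightly from `du`, which reports disk usage:

```python
import os
from pathlib import Path


def dir_sizes(root: str) -> dict[str, int]:
    """Map each immediate subdirectory of `root` to its total size in bytes."""
    sizes = {}
    for child in Path(root).iterdir():
        if not child.is_dir() or child.is_symlink():
            continue
        total = 0
        for dirpath, _dirnames, filenames in os.walk(child):
            for name in filenames:
                file_path = os.path.join(dirpath, name)
                if not os.path.islink(file_path):
                    total += os.path.getsize(file_path)
        sizes[child.name] = total
    return sizes


def report(root: str) -> None:
    """Print subdirectories from smallest to largest, in MiB."""
    for name, size in sorted(dir_sizes(root).items(), key=lambda item: item[1]):
        print(f"{size / 2**20:8.1f}M  {name}")
```

Running `report("/usr/lib/python3.11/site-packages")` inside the container should list `fitz` and `fitz_new` near the bottom, in the same order as the `du` output.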

What about other OSes?

The fact that PyMuPDF is difficult to build on Alpine Linux raises the question: can we use a different OS? It turns out that PyMuPDF is available in the official Debian repos. This is good, because we can take advantage of two Debian properties that Alpine Linux does not have:

Slim down our container image with --no-install-recommends / --no-install-suggests. Alpine Linux has no equivalent flags. Instead, it lets you delete arbitrary packages after installation, which may be very brittle.
Install the libreoffice-core-nogui flavor of LibreOffice. This flavor has the minimum requirements for scripting LibreOffice, and does not bring any extra libraries, such as Wayland and Mesa.

On the flip side, Alpine Linux is a rolling release distro, which always gets the latest version of the upstream packages. So, we use it not just for its small footprint, but for its security properties as well. Debian also takes security very seriously, but handles it differently depending on the flavor:

Stable flavors (Bullseye / Bookworm) generally offer less recent versions of software, but backport security fixes from upstream as soon as possible.
Testing / Unstable flavors (Trixie / Sid) are closer to the upstream versions, but are not guaranteed to get security fixes, because they rely on upstream releases to include them:

Sid exclusively gets security updates through its package maintainers. The Debian Security Team only maintains security updates for the current "stable" release.

So, it seems that if we were to switch from Alpine Linux to Debian, the Testing/Unstable flavors would offer security guarantees similar to the ones we get from Alpine Linux today.

Comparisons

The following tables compare these image types:

Alpine (current): This is the Alpine image as built from the main branch.
Alpine (PyMuPDF): This is the Alpine image that has been tweaked to install only the necessary packages, plus PyMuPDF.
Debian (Unstable): This is the debian:unstable-slim image that installs only the necessary packages with --no-install-recommends / --no-install-suggests.
Debian (Stable): This is the debian:bookworm-slim image that installs only the necessary packages with --no-install-recommends / --no-install-suggests.

Image size impact

| Image | Compressed (MiB) | Uncompressed (MiB) |
| --- | --- | --- |
| Alpine (current) | 624 | 1372 |
| Alpine (PyMuPDF) | 413 | 862 |
| Debian (Unstable) | 256 | 570 |
| Debian (Stable) | 253 | 564 |

| Image | Packages |
| --- | --- |
| Alpine (current) | 286 |
| Alpine (PyMuPDF) | 273 |
| Debian (Unstable) | 222 |
| Debian (Stable) | 221 |
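To put the image sizes in relative terms, the reductions against the current Alpine image can be computed directly from the figures above. The sketch below only reuses the numbers from the size table, and the variable and function names are illustrative:

```python
# (compressed MiB, uncompressed MiB), copied from the image size table.
SIZES = {
    "Alpine (current)": (624, 1372),
    "Alpine (PyMuPDF)": (413, 862),
    "Debian (Unstable)": (256, 570),
    "Debian (Stable)": (253, 564),
}


def reduction(baseline: float, value: float) -> float:
    """Percentage by which `value` is smaller than `baseline`."""
    return (baseline - value) / baseline * 100


base_compressed, base_uncompressed = SIZES["Alpine (current)"]
for image, (compressed, uncompressed) in SIZES.items():
    print(
        f"{image:18} "
        f"compressed -{reduction(base_compressed, compressed):.1f}%  "
        f"uncompressed -{reduction(base_uncompressed, uncompressed):.1f}%"
    )
```

By this measure, both Debian images come out roughly 59% smaller than the current Alpine image, compressed as well as uncompressed, while the tweaked Alpine image saves roughly a third.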

CVEs impact

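| Image | Critical | High | Medium | Low | Negligible |
| --- | --- | --- | --- | --- | --- |
| Alpine (current) | 0 | 14 | 37 | 6 | 0 |
| Alpine (PyMuPDF) | 0 | 13 | 35 | 6 | 0 |
| Debian (Unstable) | 0 | 3 | 8 | 6 | 129 |
| Debian (Stable) | 1 | 17 | 25 | 10 | 132 |
| Debian (Stable, excluding `won't fix`) | 0 | 4 | 6 | 0 | 131 |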

(Debian Stable marks some CVEs as won't fix, meaning that a vulnerability does not apply to it)
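Since the raw counts are hard to compare at a glance, it can help to total them with and without the Negligible column. This is a small sketch that reuses only the numbers from the CVE table, and the names are illustrative:

```python
SEVERITIES = ("Critical", "High", "Medium", "Low", "Negligible")

# Counts per severity, copied from the CVE table.
CVES = {
    "Alpine (current)": (0, 14, 37, 6, 0),
    "Alpine (PyMuPDF)": (0, 13, 35, 6, 0),
    "Debian (Unstable)": (0, 3, 8, 6, 129),
    "Debian (Stable)": (1, 17, 25, 10, 132),
    "Debian (Stable, excluding won't fix)": (0, 4, 6, 0, 131),
}


def total(counts, skip=("Negligible",)):
    """Sum the counts, leaving out the severities named in `skip`."""
    return sum(n for name, n in zip(SEVERITIES, counts) if name not in skip)


for image, counts in CVES.items():
    print(f"{image:38} all={sum(counts):4}  non-negligible={total(counts):3}")
```

With the Negligible column left out, the totals are 57, 54, 17, 53 and 10 respectively, which makes the gap between the Alpine images and Debian Unstable (or Debian Stable without `won't fix` entries) easier to see.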

apyrgio · 2023-12-18T19:07:34Z

What is PyMuPDF's potential impact?

The following diagram shows how the integration of PyMuPDF opens the door for more improvements throughout the codebase, and how it solves some limitations.

(this file was created in https://draw.io, and can be edited there by uploading the above .png, since it has the diagram embedded in it. sweet...)

deeplow · 2023-12-19T07:58:26Z

Thanks for this investigation @apyrgio! The PyMuPDF + debian stable slim does seem really promising.

PyMuPDF replaced the need for almost all dependencies, which this commit now removes. We are also removing tesseract-ocr as a dependency since (to our surprise) PyMuPDF ships directly with tesseract binaries [1]. However, now that tesseract-ocr is not available directly as a binary tool, the `test_ocr.py` needed to be changed. Fixes #568 [1]: #658 (comment)

License change required due to the inclusion of the AGPL-licensed PyMuPDF. This library greatly benefited Dangerzone in many aspects detailed in [1]. Fixes #658 [1]: #658

deeplow · 2024-01-03T17:34:51Z

Performance Impact of PyMuPDF

We stress-tested PyMuPDF in a large set of tests and found that, overall, it did not decrease performance for most documents. In many cases it actually improved it, but it is hard to draw firm conclusions, since we don't have a real-world set of documents.

Other impacts of PyMuPDF

We summarized some of the results in this presentation


apyrgio added the development Development-focused changes label Dec 18, 2023

deeplow mentioned this issue Dec 22, 2023

Support Remaining File Formats that PyMuPDF Supports (MOBI, FB2, CBZ, TXT, PGM, PSD) #660

Open

deeplow mentioned this issue Dec 22, 2023

Use PyMuPDF to solve most congestions issues in /tmp (client & Server) #622

Merged

2 tasks

apyrgio mentioned this issue Jan 3, 2024

Consider switching from gzip to lzma #663

Open

deeplow added a commit that referenced this issue Jan 3, 2024

Replace MIT license with AGPLv3

a45189e

License change required due to the inclusion of the AGPL-licensed PyMuPDF. This library greatly benefited Dangerzone in many aspects detailed in [1]. Fixes #658 [1]: #658

deeplow closed this as completed in f676891 Jan 4, 2024

apyrgio mentioned this issue Jan 12, 2024

Smallest possible container image for Tails #669

Open

This was referenced Feb 23, 2024

QA and Release 0.6.0 #704

Closed

Qubes: Allow user to disable timeouts #559

Closed

Test timeouts over large set of documents #334

Closed

apyrgio mentioned this issue Sep 3, 2024

On-host pixels to PDF conversion #625

Closed

6 tasks
