PyMuPDF integration #658
Comments
How does PyMuPDF integrate with Tesseract?
PyMuPDF uses the C API of Tesseract directly. More specifically, it seems to link statically against the Tesseract library. To answer this with confidence, we need to review the build scripts. However, there are some good indications that this is the case:
Also, we have tested that, on Windows and macOS hosts, the following code works with only PyMuPDF installed from PyPI and no separate Tesseract installation:

    import fitz  # PyMuPDF

    doc = fitz.open("./tests/test_docs/sample-pdf.pdf")
    page = doc.load_page(0)
    pix = page.get_pixmap()
    # OCR the rendered page and get a searchable PDF back as bytes
    buf = pix.pdfocr_tobytes(tessdata="/path/tessdata_fast-4.1.0")
    with open("./test.pdf", "wb") as f:
        f.write(buf)

This means that we can do OCR on macOS / Windows hosts, which we previously thought to be highly difficult (#625).
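To repeat this check on another machine, it helps to confirm that no `tesseract` executable is being picked up from PATH and that the tessdata directory actually holds language files. The following is a minimal sketch using only the standard library. The helper names `system_tesseract_path` and `available_languages` are hypothetical (they are not part of PyMuPDF), and the second one assumes Tesseract's convention of naming language files `<lang>.traineddata`.

```python
import shutil
from pathlib import Path


def system_tesseract_path():
    """Return the path of a `tesseract` executable found on PATH, or None."""
    return shutil.which("tesseract")


def available_languages(tessdata_dir):
    """List language codes found as `<lang>.traineddata` files in a directory."""
    root = Path(tessdata_dir)
    if not root.is_dir():
        return []
    return sorted(p.stem for p in root.glob("*.traineddata"))
```

If `system_tesseract_path()` returns `None` and `available_languages("/path/tessdata_fast-4.1.0")` contains `eng`, the OCR call is at least not relying on a `tesseract` executable installed on the host. It says nothing about how the library itself is linked, which still requires looking at the build scripts.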
Does PyMuPDF use GhostScript?
Even though PyMuPDF and GhostScript are developed by the same company (Artifex), (Py)MuPDF does not use GhostScript. From https://en.wikipedia.org/wiki/MuPDF:
Grepping for ghostscript / postscript throughout the code does not yield any result that shows that GhostScript is involved. In fact, PostScript code seems to be handled within MuPDF itself. Removing our dependency on GhostScript is good news, since it has been a source of CVEs in the past.
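The search described above can be reproduced with a short script. This is only a sketch: `find_mentions` is a hypothetical helper, and the directory you pass in (a local checkout of the MuPDF / PyMuPDF sources) is something you need to supply yourself.

```python
import re
from pathlib import Path

# Case-insensitive match for either keyword.
PATTERN = re.compile(r"ghostscript|postscript", re.IGNORECASE)


def find_mentions(source_root, pattern=PATTERN):
    """Yield (path, line_number, line) for every line matching the pattern."""
    for path in sorted(Path(source_root).rglob("*")):
        if not path.is_file():
            continue
        try:
            text = path.read_text(encoding="utf-8", errors="ignore")
        except OSError:
            continue  # unreadable file: skip it
        for number, line in enumerate(text.splitlines(), start=1):
            if pattern.search(line):
                yield path, number, line.strip()
```

A hit only tells you where to look. Each match still has to be read in context to decide whether it refers to the GhostScript program or to PostScript handling inside the library.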
How does PyMuPDF affect our container image size?
The fact that PyMuPDF allows 2nd stage conversion on the host opens the way for lots of improvements in the container image. Basically, the only packages that we need to install are:
Unfortunately, PyMuPDF is not available on Alpine Linux. This means that we need to install it with
What about other OSes?
The fact that PyMuPDF is difficult to build on Alpine Linux raises the question: can we use a different OS? It turns out that PyMuPDF is available in the official Debian repos. This is good, because we can take advantage of two Debian properties that Alpine Linux does not have:
On the flip side, Alpine Linux is a rolling release distro, which always gets the latest version of the upstream packages. So, we use it not just for its small footprint, but for its security properties as well. Debian takes security very seriously as well, in two different ways:
So, it seems that if we were to switch from Alpine Linux to Debian, the Testing/Unstable flavors would offer similar security guarantees.
Comparisons
The following tables offer comparisons between the following image types:
Image size impact
CVEs impact
(Debian Stable marks some CVEs as won't fix, meaning that a vulnerability does not apply to it)
What is PyMuPDF's potential impact?
The following diagram shows how the integration of PyMuPDF opens the door for more improvements throughout the codebase, and how it solves some limitations. (this file was created in https://draw.io, and can be edited there by uploading the above
Thanks for this investigation @apyrgio! The PyMuPDF + Debian stable slim does seem really promising.
PyMuPDF replaced the need for almost all dependencies, which this commit now removes. We are also removing tesseract-ocr as a dependency since (to our surprise) PyMuPDF ships directly with tesseract binaries [1]. However, now that tesseract-ocr is not available directly as a binary tool, the `test_ocr.py` needed to be changed. Fixes #568 [1]: #658 (comment)
PyMuPDF replaced the need for almost all dependencies, which this commit now removes. We are also removing tesseract-ocr as a dependency since (to our surprise) PyMuPDF ships directly with tesseract binaries [1]. However, now that tesseract-ocr is not available directly as a binary tool, the `test_ocr.py` needed to be changed. Fixes #658 [1]: #658 (comment)
Performance Impact of PyMuPDF
We stress tested PyMuPDF on a large set of tests and found that, overall, it did not decrease performance for most documents. Quite the contrary in a lot of cases, but it's hard to tell since we don't have a real-world set of documents.
Other impacts of PyMuPDF
We summarized some of the results in this presentation
The possibility of using PyMuPDF was brought up as a solution to the congestion problem we encountered in #616, and was immediately introduced in PR #622.
While looking more into how PyMuPDF works, though, we realized that it can help us tackle more problems than the original one. As of writing this issue, our current understanding is that we can use PyMuPDF to:
- Replace `pdfinfo` / `pdftoppm` in the 1st stage of the conversion (Use PyMuPDF to solve most congestions issues in /tmp (client & Server) #622).
- Replace `gm` / `tesseract` / `pdfunite` / `ps2pdf` in the 2nd stage of the conversion and perform the conversion on the Linux/macOS/Windows hosts (On-host pixels to PDF conversion #625).
This issue holds all of our questions regarding the integration of PyMuPDF, either in terms of feasibility, security, or performance, as well as other effects it has in our code.
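For the 1st stage, replacing `pdfinfo` and `pdftoppm` might look roughly like the sketch below. It is an illustration under assumptions, not the code from #622: `render_pages` and `convert` are hypothetical names, the 150 DPI default is arbitrary, and the only PyMuPDF calls used are `fitz.open`, `Document.page_count`, `Page.get_pixmap` and the `Pixmap` attributes `width`, `height` and `samples`.

```python
def render_pages(doc, dpi=150):
    """Yield (width, height, rgb_bytes) for each page of an opened document.

    `page.get_pixmap()` stands in for the rasterization that `pdftoppm`
    performed. `doc` only needs to be iterable over page-like objects.
    """
    for page in doc:
        pix = page.get_pixmap(dpi=dpi)  # RGB without alpha by default
        yield pix.width, pix.height, pix.samples


def convert(path, dpi=150):
    """Open `path` with PyMuPDF and return (page_count, rendered_pages)."""
    import fitz  # PyMuPDF; imported lazily so render_pages has no hard dependency

    with fitz.open(path) as doc:
        page_count = doc.page_count  # the value previously read from `pdfinfo`
        pages = list(render_pages(doc, dpi=dpi))
    return page_count, pages
```

Calling `convert("./tests/test_docs/sample-pdf.pdf")` would then return the page count together with one raw RGB buffer per page, with no temporary image files written along the way.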