Reduce Container Dependencies #305
This is now a bit faster. On my system |
I also experimented with replacing pdftocairo to convert |
General comment. I was looking into PyMuPDF, and it would be a perfect tool for our needs. It converts to PPM, exposes rich info through a Pythonic interface, and could make some of our current tools unnecessary. ... but we can't install it on Alpine Linux, at least not easily. That's a shame, but I'd like to explore it in the future. |
Nice. We could investigate it more in the future. But apart from the installation issues on Alpine Linux, it's also not available on ARM macOS (source: README.md). |
Looking a bit further into it, PyMuPDF's installer introduces supply chain issues (some of which we are trying to mitigate in this commit):
But feel free to open an issue about it so we can explore it further. Or reconsider our PDF-related software options in general. |
Thanks for the dig regarding PyMuPDF. Let's punt this discussion until we look into this code once more. |
PDFtk actually isn't needed. It was being used for breaking a PDF into pages, but this can be done by the already present 'pdftoppm'. Furthermore, removing this dependency contributes to reproducible builds and overall supply chain security, because PDFtk was obtained from GitLab with no signature verification or version pinning.

The replacement 'pdftoppm' enabled us to take a shortcut:
- before: PDF -> PDF pages -> PNG images -> RGB images
- after: PDF -> PPM images -> RGB images

This last conversion step is trivial since the RGB format we were using is just a PPM file without the metadata in its header.
default-jre and java dependencies had been added initially [1] because of libreoffice-java-common, which is no longer present. Then, when the image was changed from ubuntu to alpine [2], default-jre was replaced with openjdk-8. If java is still a dependency for libreoffice, then it should be pulled in automatically.

[1] firstlookmedia/dangerzone-converter@9ecdb9e
[2] firstlookmedia/dangerzone-converter@650ae6e
ee8438e to 2da9732
Rebased from main and squashed commits. |
Removes three dependencies:
- sudo - no longer needed
- openjdk-8 - see reasoning here
- PDFtk - explained below

Removing PDFtk dependency (replace w/ pdftoppm)
PDFtk actually isn't needed. It was being used for breaking a PDF into pages, but this can be done by the already present pdftoppm (packaged in poppler-utils). Furthermore, removing this dependency contributes to reproducible builds and overall supply chain security, because PDFtk was obtained from GitLab with no signature verification or version pinning.

The replacement pdftoppm enabled us to take a shortcut:
- before: PDF -> PDF pages -> PNG images -> RGB images
- after: PDF -> PPM images -> RGB images
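
The first step of the shortened pipeline (PDF -> PPM images) could be driven from Python roughly as follows. This is only a sketch, assuming pdftoppm is on the PATH; the function names, the "page" output prefix, and the default resolution are hypothetical and not taken from this PR.

```python
import subprocess
from pathlib import Path


def build_pdftoppm_cmd(pdf_path, out_prefix, dpi=150):
    """Return the argv for rendering every page of a PDF to PPM files.

    pdftoppm writes PPM by default, one file per page, named
    <out_prefix>-<page>.ppm (page numbers may be zero-padded).
    The dpi default is an assumption for illustration only.
    """
    return ["pdftoppm", "-r", str(dpi), str(pdf_path), str(out_prefix)]


def pdf_to_ppm_pages(pdf_path, out_dir, dpi=150):
    """Render a PDF to per-page PPM files and return their sorted paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    # check=True raises CalledProcessError if pdftoppm exits non-zero.
    subprocess.run(build_pdftoppm_cmd(pdf_path, out_dir / "page", dpi), check=True)
    return sorted(out_dir.glob("page-*.ppm"))
```

Keeping the command construction in its own function makes it possible to check the arguments without having poppler-utils installed.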
Note about PPM -> RGB "conversion"
And this last conversion step is trivial since the RGB format we were using is just a PPM file without the metadata at the beginning.
Furthermore, we were using a depth of 8 bits per color channel, which is exactly the same depth as the .ppm files produced here.
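
To illustrate why this step is trivial, here is a minimal sketch of stripping the header from a binary (P6) PPM to obtain the raw RGB bytes. The function name ppm_to_rgb is hypothetical, and the code assumes an 8-bit file (maxval 255); it is not the code from this PR.

```python
def ppm_to_rgb(ppm_bytes):
    """Split a binary (P6) PPM into (width, height, raw RGB bytes)."""
    tokens = []
    pos = 0
    # The header is four whitespace-separated tokens:
    # magic number, width, height, maxval. '#' starts a comment.
    while len(tokens) < 4:
        while ppm_bytes[pos:pos + 1].isspace():
            pos += 1
        if ppm_bytes[pos:pos + 1] == b"#":
            # A comment runs to the end of the line.
            while ppm_bytes[pos:pos + 1] not in (b"\n", b""):
                pos += 1
            continue
        start = pos
        while pos < len(ppm_bytes) and not ppm_bytes[pos:pos + 1].isspace():
            pos += 1
        tokens.append(ppm_bytes[start:pos])

    magic, width, height, maxval = tokens
    if magic != b"P6":
        raise ValueError("not a binary PPM (P6) file")
    if int(maxval) != 255:
        raise ValueError("expected 8 bits per color channel")
    width, height = int(width), int(height)

    # Exactly one whitespace byte separates maxval from the pixel data.
    expected = width * height * 3
    pixels = ppm_bytes[pos + 1:pos + 1 + expected]
    if len(pixels) != expected:
        raise ValueError("truncated pixel data")
    return width, height, pixels
```

For example, a 2x1 image with one red and one green pixel yields six bytes of pixel data once the header is removed.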
To verify the color accuracy, I compared two samples: one converted from .pdf to .ppm (via pdftoppm) and a "true" RGB file produced by the old process (pdftocairo -png + gm convert input.png -depth 8 rgb:file.rgb). Here's the result (compressed in .png) - left .pdf, center .ppm and right .png:

When we compare the .ppm and the original .rgb file, we can see that they differ in some pixel values. But as can be seen in the picture, these minor differences are of little to no consequence to the final human-readable image.
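
A comparison like this could also be quantified instead of judged by eye. The sketch below counts how many channel values differ between two raw RGB buffers of equal size and reports the largest difference. The function name compare_rgb is hypothetical, and this is not the method used for the picture above.

```python
def compare_rgb(a, b):
    """Return (number of differing channel values, largest absolute difference)."""
    if len(a) != len(b):
        raise ValueError("buffers must have the same length")
    # Iterating over bytes yields integers in the range 0..255.
    diffs = [abs(x - y) for x, y in zip(a, b)]
    changed = sum(1 for d in diffs if d)
    return changed, max(diffs, default=0)
```

A small maximum difference would support the observation that the two outputs are visually equivalent.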