Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Reduce Container Dependencies #305

Merged
merged 4 commits into from
Jan 23, 2023
Merged

Reduce Container Dependencies #305

merged 4 commits into from
Jan 23, 2023

Conversation

deeplow
Copy link
Contributor

@deeplow deeplow commented Jan 10, 2023

Removes three dependencies:

  • sudo - no longer needed
  • openjdk-8 - see reasoning here
  • PDFtk - explained below

Removing PDFtk dependency (replace w/ pdftoppm)

PDFtk actually isn't needed. It was being used for breaking a PDF into pages but this is something that be replaced by the already present pdftoppm (packaged in poppler-utils). Furthermore, by removing this dependency we contribute to reproducible builds and overall supply chain security because it was obtained from gitlab with no signature verification or version pinning.

The replacement pdftoppm enabled us to do a shortcut:
- before: PDF -> PDF pages -> PNG images -> RGB images
- after: PDF -> PPM images -> RGB images

Note about PPM -> RGB "conversion"

And this last conversion step is trivial since the RGB format we were using is just a PPM file without the metadata at the beginning.

Furthermore, we were using a depth of 8 bits per color channel which is exactly the same depth as the the .ppm file format.

To verify the color accuracy, I compared two samples - one obtained from from .pdf to .ppm (via pdftoppm) and a "true" RGB file via the old process (pdftocairo -png + gm convert input.png -depth 8 rgb:file.rgb). Here's the result (compressed in .png) - left .pdf, center .ppm and right .png:

comparison

When we compare the .ppm and the original .rgb file, we can see that they differ is some pixel values. But as can be seen in the picture, these minor differences are of little to no consequence to the final human-readable image.

@deeplow deeplow requested a review from apyrgio January 10, 2023 10:27
@deeplow
Copy link
Contributor Author

deeplow commented Jan 10, 2023

The replacement pdftoppm enabled us to do a shortcut:

  • before: PDF -> PDF pages -> PNG images -> RGB images
  • after: PDF -> PPM images -> RGB images

This is now a bit faster. On my system make test changed from 2m30s to 1m30s!

@deeplow deeplow changed the title 232 Remove Container Deps Remove Unused Container Dependencies Jan 10, 2023
container/dangerzone.py Outdated Show resolved Hide resolved
@deeplow
Copy link
Contributor Author

deeplow commented Jan 10, 2023

I also experimented with replacing pdftocairo to convert .pdf to .png, but that wouldn't be as efficient as converting directly to rgb (.ppm).

@deeplow deeplow changed the title Remove Unused Container Dependencies Reduce Container Dependencies Jan 12, 2023
container/dangerzone.py Outdated Show resolved Hide resolved
container/dangerzone.py Show resolved Hide resolved
container/dangerzone.py Show resolved Hide resolved
@apyrgio
Copy link
Contributor

apyrgio commented Jan 17, 2023

General comment. I was looking into PyMuPDF, and it would be a perfect tool for our needs. It converts to PPM, has rich info in a Pythonic interface, and can render some of the tools useless.

... but we can't install it on Alpine Linux, at least not easily. That's a shame, but I'd like to explore it in the future.

@deeplow
Copy link
Contributor Author

deeplow commented Jan 19, 2023

I was looking into PyMuPDF, and it would be a perfect tool for our needs.

Nice. We could investigate it more in the future. But apart from the installation issues on alpine linux, it's also not available on ARM macOS (source: README.md).

@deeplow
Copy link
Contributor Author

deeplow commented Jan 19, 2023

Looking a bit further into it, the PYMuPDF's installer introduces supply chain issues (some of which we are trying to mitigate in this commit):

  • It calls git clone git://git.ghostscript.com/mupdf.git.
  • And it also may attempt to download a .tar.gz from `mupdf.org 😬

But feel free to open an issue about it so we can explore it further. Or reconsider our PDF-related software options in general.

@apyrgio
Copy link
Contributor

apyrgio commented Jan 19, 2023

Thanks for the dig regarding PyMuPDF. Let's punt this discussion until we look into this code once more.

PDFtk actually isn't needed. It was being used for breaking a PDF
into pages but this is something that be replaced by the already present
'pdftoppm'. Furthermore, by removing this dependency we contribute to
reproducible builds and overall supply chain security because it was
obtained from gitlab with no signature verification or version pinning.

The replacement 'pdftoppm' enabled us to do a shortcut:
 - before: PDF -> PDF pages -> PNG images -> RGB images
 - after:  PDF -> PPM images -> RGB images

And this last conversion step is trivial since the RGB format we were
using is just a PPM file without the metadata in its header.
default-jre and java dependencies dependencies had been added initially
[1] because of libreoffice-java-common, which is no longer present.
Then, when the image was changed from ubuntu to alpine [2], default-jre
was replaced with openjdk-8.

If java is still a dependency for libreoffice, then it should be pulled
automatically.

[1] firstlookmedia/dangerzone-converter@9ecdb9e
[2] firstlookmedia/dangerzone-converter@650ae6e
@deeplow deeplow force-pushed the 232-rem-container-deps branch from ee8438e to 2da9732 Compare January 23, 2023 14:16
@deeplow
Copy link
Contributor Author

deeplow commented Jan 23, 2023

Rebased from main and squashed commits.

@deeplow deeplow merged commit 2da9732 into main Jan 23, 2023
@deeplow
Copy link
Contributor Author

deeplow commented Jan 24, 2023

There were some unintended side-effects of removing openjdk8 in excell files:

before after
xlsxbefore xlsxafter

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants