Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Containers Page Streaming based on PyMuPDF #627

Merged
merged 17 commits into from
Feb 7, 2024
Merged

Conversation

deeplow
Copy link
Contributor

@deeplow deeplow commented Nov 27, 2023

Fixes #443, #574. Depends on #622.

Adds initial support for page streaming in containers. This is a critical move to homogenize the isolation providers code as it makes Qubes and Containers share a lot of it.

Copy link
Contributor

@apyrgio apyrgio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've made a first pass of the code, and I have some minor comments. My main concern is the non-blocking read implementation, and how it will work on Windows. I'll help with figuring this out, along with streaming support on the second stage of the conversion (reserved for a next PR).

dangerzone/conversion/doc_to_pixels.py Outdated Show resolved Hide resolved
dangerzone/isolation_provider/base.py Outdated Show resolved Hide resolved
dangerzone/isolation_provider/base.py Outdated Show resolved Hide resolved
dangerzone/isolation_provider/container.py Show resolved Hide resolved
dangerzone/isolation_provider/container.py Outdated Show resolved Hide resolved
dangerzone/isolation_provider/base.py Outdated Show resolved Hide resolved
dangerzone/conversion/doc_to_pixels.py Outdated Show resolved Hide resolved
@deeplow
Copy link
Contributor Author

deeplow commented Dec 7, 2023

I've seen with multiple documents the percentage goes over 100% so we should look into that.

@deeplow deeplow force-pushed the 443-stream-pages-pymupdf2 branch from e3932d0 to 8ab2abe Compare December 13, 2023 12:09
@deeplow
Copy link
Contributor Author

deeplow commented Dec 13, 2023

@apyrgio I have just rebased on top of the pymupdf branch so it can work as a base for you.

@deeplow deeplow force-pushed the 443-stream-pages-pymupdf2 branch from 42dc3c6 to a751582 Compare January 2, 2024 11:14
@deeplow deeplow force-pushed the 443-stream-pages-pymupdf2 branch 18 times, most recently from 4211f28 to eab95a5 Compare January 11, 2024 10:26
@eloquence eloquence added this to the 0.6.0 milestone Jan 11, 2024
@deeplow deeplow force-pushed the 443-stream-pages-pymupdf2 branch from a3c2530 to 609b77e Compare January 15, 2024 11:58
@apyrgio
Copy link
Contributor

apyrgio commented Feb 6, 2024

I am curious as to why on the force push diff this particular line was changed.

Hm, I made this change btw. See this commit: ac38d99. Could it be that you haven't pulled my changes prior to the push, since Ubuntu Jammy and Debian Bullseye fails?

Simple rename of the __convert() method in the Qubes conversion to make
the code structurally similar.
Remove container_log messages ahead of debug info being sent over
standard streams.
apyrgio pushed a commit that referenced this pull request Feb 6, 2024
The conversion was hanging arbitrarily [1] on some systems. Sometimes it
would send the full page other times stop half-way.

Originally found by @apyrgio.

Co-authored-by: @apyrgio

[1]: #627 (comment)
deeplow added a commit that referenced this pull request Feb 6, 2024
The conversion was hanging arbitrarily [1] on some systems. Sometimes it
would send the full page other times stop half-way.

Originally found by @apyrgio.

Co-authored-by: @apyrgio

[1]: #627 (comment)
@deeplow deeplow force-pushed the 443-stream-pages-pymupdf2 branch from b82179c to b0bd7dd Compare February 6, 2024 19:08
Merge Qubes and Containers isolation providers core code into the class
parent IsolationProviders abstract class.

This is done by streaming pages in containers for exclusively in first
conversion process. The commit is rather large due to the multiple
interdependencies of the code, making it difficult to split into various
commits.

The main conversion method (_convert) now in the superclass simply calls
two methods:
  - doc_to_pixels()
  - pixels_to_pdf()

Critically, doc_to_pixels is implemented in the superclass, diverging
only in a specialized method called "start_doc_to_pixels_proc()". This
method obtains the process responsible that communicates with the
isolation provider (container / disp VM) via `podman/docker` and qrexec
on Containers and Qubes respectively.

Known regressions:
  - progress reports stopped working on containers

Fixes #443
If one converted more than one document, since the state of
IsolationProvider.percentage would be stored in the IsolationProvider
instance, it would get reused for the second document. The fix is to
keep it as a local variable, but we can explore having progress stored
on the document itself, for example. Or having one IsolationProvider per
conversion.
Now that only the second container can send JSON-encoded progress
information, we can the untrusted JSON parsing. The parse_progress was
also renamed to `parse_progress_trusted` to ensure future developers
don't mistake this as a safe method.

The old methods for sending untrusted JSON were repurposed to send the
progress instead to stderr for troubleshooting in development mode.

Fixes #456
Avoids downloading the container image 4 times in the multi-stage build
by first pulling the alpine image once and then building without any
pulls.

Implemented following a suggestion of @apyrgio.
Conversions methods had changed and that was part of the reason why
the tests were failing. Furthermore, due to the `provider.proc`, which
stores the associated qrexec / container process, "server" exceptions
raise a IterruptedConversion error (now ConverterProcException), which
then requires interpretation of the process exit code to obtain the
"real" exception.
Now that Qubes and Containers essentially share the same code, we can
have both run the same tests.
If we increased the number of parallel conversions, we'd run into an
issue where the streams were getting mixed together. This was because
the Converter.proc was a single attribute. This breaks it down into a
local variable such that this mixup doesn't happen.
Fix issues in older distros that don't yet support python 3.11 where
endianness was not a default argument [1]. This is in response to CI
failures [2].

[1]: https://docs.python.org/3/library/stdtypes.html#int.to_bytes
[2]: https://app.circleci.com/pipelines/github/freedomofpress/dangerzone/2186/workflows/e340ca21-85ce-42b6-9bc3-09e66f96684a/jobs/27380y
The conversion was hanging arbitrarily [1] on some systems. Sometimes it
would send the full page other times stop half-way.

Originally found by @apyrgio.

Co-authored-by: @apyrgio

[1]: #627 (comment)
This reverts commit fea193e.

This is part of the purge of timeout-related code since we no longer
need it [1]. Non-blocking reads were introduced in the reverted commit
in order to be able to cut a stream mid-way due to a timeout. This is
no longer needed now that we're getting rid of timeouts.

[1]: #687
This reverts commit 344d6f7.
Stopwatch is no longer needed now that we're removing timeouts.
Remove timeouts due to several reasons:

1. Lost purpose: after implementing the containers page streaming the
   only subprocess we have left is LibreOffice. So don't have such a
   big risk of commands hanging (the original reason for timeouts).

2. Little benefit: predicting execution time is generically unsolvable
   computer science problem. Ultimately we were guessing an arbitrary
   time based on the number of pages and the document size. As a guess
   we made it pretty lax (30s per page or MB). A document hanging for
   this long will probably lead to user frustration in any case and the
   user may be compelled to abort the conversion.

3. Technical Challenges with non-blocking timeout: there have been
several technical challenges in keeping timeouts that we've made effort
to accommodate. A significant one was having to do non-blocking read to
ensure we could timeout when reading conversion stream (and then used
here)

Fixes #687
Since the progress information is now inferred on host based on the
number of pages obtained, progress-tracking variables should be removed
from the server.
@deeplow deeplow force-pushed the 443-stream-pages-pymupdf2 branch from b0bd7dd to c348774 Compare February 6, 2024 20:30
@apyrgio apyrgio force-pushed the 443-stream-pages-pymupdf2 branch from c348774 to 63622e0 Compare February 7, 2024 12:25
@deeplow
Copy link
Contributor Author

deeplow commented Feb 7, 2024

I'll fix the lint check

Switching from mounting files to writing to stdout has introduced some
Podman crashes in specific environments (Ubuntu Jammy / Debian Bullseye)
due to a conmon bug that affects version 2.0.25.

Fixing it for various permutations of the environments we support
requires the following:

1. CI tests: Install conmon from the oldstable-proposed-updates in
   our Debian Bullseye / Ubuntu Jammy dev/end-user environments.
2. Developers: Add a line in BUILD.md that suggests users to install
   conmon from the oldstable-proposed-updates repo, or some other repo
   they prefer.
3. End-user installations: We will build conmon for Ubuntu Jammy, and
   wait until the proposed updates repo gets merged in Debian Bullseye.

Fixes #685
@deeplow deeplow force-pushed the 443-stream-pages-pymupdf2 branch from 63622e0 to d1afe4c Compare February 7, 2024 12:53
@deeplow
Copy link
Contributor Author

deeplow commented Feb 7, 2024

@apyrgio I think this is just waiting for one final approval.

Copy link
Contributor

@apyrgio apyrgio left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A+ for the amazing effort @deeplow, A+++ for the fact that we did it while removing ~300 lines of code. Let's merge it!

@deeplow
Copy link
Contributor Author

deeplow commented Feb 7, 2024

A+ for the amazing effort @deeplow, A+++ for the fact that we did it while removing ~300 lines of code. Let's merge it!

Thanks as well @apyrgio!! 🙂 It's a team effort.

The last thing are the Qubes tests. They are failing on my system. I'll have to check tomorrow what's up with them. I thought I had addressed them already. Let's see...

They are passing now.

Merging...

@deeplow deeplow merged commit d1afe4c into main Feb 7, 2024
50 checks passed
@deeplow deeplow deleted the 443-stream-pages-pymupdf2 branch February 7, 2024 14:35
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
timeout Dangerzone Times Out
Projects
None yet
3 participants