-
Notifications
You must be signed in to change notification settings - Fork 177
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Containers Page Streaming based on PyMuPDF #627
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I've made a first pass of the code, and I have some minor comments. My main concern is the non-blocking read implementation, and how it will work on Windows. I'll help with figuring this out, along with streaming support on the second stage of the conversion (reserved for a next PR).
I've seen with multiple documents the percentage goes over 100% so we should look into that. |
e3932d0
to
8ab2abe
Compare
@apyrgio I have just rebased on top of the pymupdf branch so it can work as a base for you. |
42dc3c6
to
a751582
Compare
4211f28
to
eab95a5
Compare
a3c2530
to
609b77e
Compare
Hm, I made this change btw. See this commit: ac38d99. Could it be that you haven't pulled my changes prior to the push, since Ubuntu Jammy and Debian Bullseye fails? |
Simple rename of the __convert() method in the Qubes conversion to make the code structurally similar.
Remove container_log messages ahead of debug info being sent over standard streams.
The conversion was hanging arbitrarily [1] on some systems. Sometimes it would send the full page other times stop half-way. Originally found by @apyrgio. Co-authored-by: @apyrgio [1]: #627 (comment)
The conversion was hanging arbitrarily [1] on some systems. Sometimes it would send the full page other times stop half-way. Originally found by @apyrgio. Co-authored-by: @apyrgio [1]: #627 (comment)
b82179c
to
b0bd7dd
Compare
Merge Qubes and Containers isolation providers core code into the class parent IsolationProviders abstract class. This is done by streaming pages in containers for exclusively in first conversion process. The commit is rather large due to the multiple interdependencies of the code, making it difficult to split into various commits. The main conversion method (_convert) now in the superclass simply calls two methods: - doc_to_pixels() - pixels_to_pdf() Critically, doc_to_pixels is implemented in the superclass, diverging only in a specialized method called "start_doc_to_pixels_proc()". This method obtains the process responsible that communicates with the isolation provider (container / disp VM) via `podman/docker` and qrexec on Containers and Qubes respectively. Known regressions: - progress reports stopped working on containers Fixes #443
If one converted more than one document, since the state of IsolationProvider.percentage would be stored in the IsolationProvider instance, it would get reused for the second document. The fix is to keep it as a local variable, but we can explore having progress stored on the document itself, for example. Or having one IsolationProvider per conversion.
Now that only the second container can send JSON-encoded progress information, we can the untrusted JSON parsing. The parse_progress was also renamed to `parse_progress_trusted` to ensure future developers don't mistake this as a safe method. The old methods for sending untrusted JSON were repurposed to send the progress instead to stderr for troubleshooting in development mode. Fixes #456
Avoids downloading the container image 4 times in the multi-stage build by first pulling the alpine image once and then building without any pulls. Implemented following a suggestion of @apyrgio.
Conversions methods had changed and that was part of the reason why the tests were failing. Furthermore, due to the `provider.proc`, which stores the associated qrexec / container process, "server" exceptions raise a IterruptedConversion error (now ConverterProcException), which then requires interpretation of the process exit code to obtain the "real" exception.
Now that Qubes and Containers essentially share the same code, we can have both run the same tests.
If we increased the number of parallel conversions, we'd run into an issue where the streams were getting mixed together. This was because the Converter.proc was a single attribute. This breaks it down into a local variable such that this mixup doesn't happen.
Attempt to fix missing issue installing poetry [1]. [1]: https://github.com/freedomofpress/dangerzone/actions/runs/7487413482/job/20379748604?pr=627
Fix issues in older distros that don't yet support python 3.11 where endianness was not a default argument [1]. This is in response to CI failures [2]. [1]: https://docs.python.org/3/library/stdtypes.html#int.to_bytes [2]: https://app.circleci.com/pipelines/github/freedomofpress/dangerzone/2186/workflows/e340ca21-85ce-42b6-9bc3-09e66f96684a/jobs/27380y
The conversion was hanging arbitrarily [1] on some systems. Sometimes it would send the full page other times stop half-way. Originally found by @apyrgio. Co-authored-by: @apyrgio [1]: #627 (comment)
This reverts commit fea193e. This is part of the purge of timeout-related code since we no longer need it [1]. Non-blocking reads were introduced in the reverted commit in order to be able to cut a stream mid-way due to a timeout. This is no longer needed now that we're getting rid of timeouts. [1]: #687
This reverts commit 344d6f7. Stopwatch is no longer needed now that we're removing timeouts.
Remove timeouts due to several reasons: 1. Lost purpose: after implementing the containers page streaming the only subprocess we have left is LibreOffice. So don't have such a big risk of commands hanging (the original reason for timeouts). 2. Little benefit: predicting execution time is generically unsolvable computer science problem. Ultimately we were guessing an arbitrary time based on the number of pages and the document size. As a guess we made it pretty lax (30s per page or MB). A document hanging for this long will probably lead to user frustration in any case and the user may be compelled to abort the conversion. 3. Technical Challenges with non-blocking timeout: there have been several technical challenges in keeping timeouts that we've made effort to accommodate. A significant one was having to do non-blocking read to ensure we could timeout when reading conversion stream (and then used here) Fixes #687
Since the progress information is now inferred on host based on the number of pages obtained, progress-tracking variables should be removed from the server.
b0bd7dd
to
c348774
Compare
c348774
to
63622e0
Compare
I'll fix the lint check |
Switching from mounting files to writing to stdout has introduced some Podman crashes in specific environments (Ubuntu Jammy / Debian Bullseye) due to a conmon bug that affects version 2.0.25. Fixing it for various permutations of the environments we support requires the following: 1. CI tests: Install conmon from the oldstable-proposed-updates in our Debian Bullseye / Ubuntu Jammy dev/end-user environments. 2. Developers: Add a line in BUILD.md that suggests users to install conmon from the oldstable-proposed-updates repo, or some other repo they prefer. 3. End-user installations: We will build conmon for Ubuntu Jammy, and wait until the proposed updates repo gets merged in Debian Bullseye. Fixes #685
63622e0
to
d1afe4c
Compare
@apyrgio I think this is just waiting for one final approval. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A+ for the amazing effort @deeplow, A+++ for the fact that we did it while removing ~300 lines of code. Let's merge it!
Thanks as well @apyrgio!! 🙂 It's a team effort.
They are passing now. Merging... |
Fixes #443, #574. Depends on #622.
Adds initial support for page streaming in containers. This is a critical move to homogenize the isolation providers code as it makes Qubes and Containers share a lot of it.