Rewrite in Python 3.7 #7

ibokuri · 2020-04-02T19:51:24Z

Woohoo! Python and multiple file support!

Note that this doesn't touch any of the GUI parts of the converter nor any of the testing infrastructure, just the CLI. I'm still a bit newer to the former two so I'll still be working on those.

Also there's still quite a bit of polishing left to be done on the CLI but I'll save that for the mailing list.

ibokuri · 2020-04-02T19:58:38Z

Hm, I uploaded my key to pool.sks-keyservers.net so not too sure why it isn't being picked up.

qpdf-convert-client.py

marmarek · 2020-04-03T20:27:25Z

On Fri, Apr 03, 2020 at 11:22:27AM -0700, Jason Phan wrote:

So while I was replacing `input()`, I found that even if the client exits due to an error, the server still runs. For example, after the server receives a PDF file, it only sends the # of pages and RGB bitmaps to the client from then on. However, say an invalid # of pages is sent and the client exits with an error. The server will still continue to process pages into bitmaps and send them over to the client, even if the client closed its `sys.stdout` and `sys.stdin` (which I thought would raise an IOError on the server the next time it called `print()` or `sys.stdout.buffer.write()` to let it know the client died). Is there some way to indicate to the server that the client died? Or maybe some way to end the qrexec-client-vm process if the client died?

If the server doesn't read any more data, there is no "nice" way to tell it to terminate. But killing qrexec-client-vm process should do the trick.

…

-- Best Regards, Marek Marczykowski-Górecki Invisible Things Lab A: Because it messes up the order in which people normally read text. Q: Why is top-posting such a bad thing?

ibokuri · 2020-04-03T20:45:28Z

If the server doesn't read any more data, there is no "nice" way to tell it to terminate. But killing qrexec-client-vm process should do the trick.

So what do you think of replacing sys.exit() calls in the client with os.kill(os.getppid(), signal.SIGTERM) (where the ppid resolves to the pid of qrexec-client-vm)? Tbh it seems kind of uh... ugly/harsh compared to sys.exit() but it'll ensure that if the client dies for whatever reason, so will the server and their communication channel.

marmarek · 2020-04-03T21:17:27Z

It may make more sense to reverse calling order, like done here. In short: instead of calling qrexec-client-vm ... qpdf-converter-client, let qpdf-converter-client call qrexec-client-vm and operate on its stdin/stdout. This way you can easily kill the process whenever you want.
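A minimal sketch of the reversed calling order, assuming a synchronous `subprocess.Popen()` approach. The helper names `start_peer` and `stop_peer` are hypothetical, and a small Python echo child stands in for `qrexec-client-vm` so the example runs anywhere; it is not the code from this PR.

```python
import subprocess
import sys


def start_peer(cmd):
    """Spawn the peer process with pipes that the caller owns (hypothetical helper)."""
    return subprocess.Popen(cmd, stdin=subprocess.PIPE, stdout=subprocess.PIPE)


def stop_peer(proc, timeout=5):
    """Terminate the peer, escalating to kill() if it does not exit in time."""
    if proc.poll() is None:
        proc.terminate()
        try:
            proc.wait(timeout=timeout)
        except subprocess.TimeoutExpired:
            proc.kill()
            proc.wait()
    for pipe in (proc.stdin, proc.stdout):
        if pipe is not None:
            pipe.close()
    return proc.returncode


# Stand-in for qrexec-client-vm: a child that echoes every line it receives.
STAND_IN = [
    sys.executable,
    "-c",
    "import sys\nfor line in sys.stdin:\n    sys.stdout.write(line)\n    sys.stdout.flush()",
]

proc = start_peer(STAND_IN)
proc.stdin.write(b"hello\n")
proc.stdin.flush()              # buffered pipe: flush after sending
reply = proc.stdout.readline()  # the child echoes the line back
exit_code = stop_peer(proc)     # the parent decides when the child dies
```

Because the client owns the child process, any error path can call `stop_peer()` instead of signalling its own parent.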

ibokuri · 2020-04-04T01:57:35Z

It may make more sense to reverse calling order, [...] let qpdf-converter-client call qrexec-client-vm and operate on its stdin/stdout.

Okay, I've set up a small test to see how that would work. It's basically a synchronous version of the link you sent using subprocess.Popen(). Async stuff can come after a working version.

The problem I have now is that I could only get consistent communication between the server & client if the subprocess is unbuffered. I think it's fine performance-wise since we're just doing IPC and not file writes. But it means the server needs to send each RGB file's size to stdout along with its contents: the server may be faster and send the contents of 2 separate RGB files before the client calls read(), which would then return the 2 files as a single one.

Any objections to this? If a bad/wrong filesize is sent to the client, we'll end up with either an invalid RGB file or a partial one. Either way, we can verify that either visually or when we pass it to convert. If a bad/wrong RGB file is sent, that'll get taken care of in convert as well.

Oh also, if we go this way, can I just have qvm-convert-pdf hold the client's code? No real need for the wrapper if the client's going to be the one calling qrexec-client-vm.

marmarek · 2020-04-04T02:22:47Z

The problem I have now is that I could only get consistent communication between the server & client if the subprocess is unbuffered.

I don't think that's necessary, but you may need to add flush() calls in some places - especially in the client after sending the data.

I think it's fine performance-wise since we're just doing IPC and not file writes. But it means the server needs to send each RGB file's size to stdout along with its contents: the server may be faster and send the contents of 2 separate RGB files before the client calls read(), which would then return the 2 files as a single one.

You're doing something wrong. You should know exactly how many bytes a page has and you should read exactly that many bytes (see the argument to read()), not everything available. Note that in unbuffered mode, read() may return fewer bytes than requested, even if there will be more data later. But I'd recommend switching back to buffered mode.
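The "read exactly that many bytes" advice can be sketched as a small loop that keeps calling `read()` until the requested count is reached. `read_exact` and the `ShortReads` test double are hypothetical names used for illustration only.

```python
import io


def read_exact(stream, size):
    """Read exactly `size` bytes from a binary stream, or raise EOFError."""
    chunks = []
    remaining = size
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:  # empty result: the other side closed the stream
            raise EOFError(f"stream ended with {remaining} bytes still expected")
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)


class ShortReads:
    """Test double that returns at most `limit` bytes per read(), like an unbuffered pipe may."""

    def __init__(self, data, limit):
        self._buf = io.BytesIO(data)
        self._limit = limit

    def read(self, size):
        return self._buf.read(min(size, self._limit))


# Two "pages" sent back to back; each is recovered at its own length.
stream = ShortReads(b"AAAAAA" + b"BBBB", limit=4)
page_a = read_exact(stream, 6)
page_b = read_exact(stream, 4)
```

With the size known in advance, two pages arriving together are never merged, and a truncated page surfaces as an `EOFError` instead of silently short data.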

Any objections to this? If a bad/wrong filesize is sent to the client, we'll end up with either an invalid RGB file or a partial one. Either way, we can verify that either visually or when we pass it to convert. If a bad/wrong RGB file is sent, that'll get taken care of in convert as well.

Basic verification (data size in this case) should be done before data hit convert. ImageMagick is known for not-so-high code quality and I wouldn't risk what could happen if data size doesn't match exactly.
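A size check of this kind might look like the sketch below. It assumes 3 bytes per pixel (8-bit RGB, no alpha or row padding) and a made-up upper bound on page dimensions; both constants and both function names are assumptions for illustration, not values taken from this thread.

```python
BYTES_PER_PIXEL = 3    # assumption: 8-bit R, G, B with no alpha or padding
MAX_DIMENSION = 10000  # hypothetical sanity limit on width and height


def expected_rgb_size(width, height):
    """Return the byte count a width x height RGB bitmap must have."""
    if not (0 < width <= MAX_DIMENSION and 0 < height <= MAX_DIMENSION):
        raise ValueError(f"implausible page dimensions: {width}x{height}")
    return width * height * BYTES_PER_PIXEL


def check_rgb_payload(data, width, height):
    """Reject the payload unless its length matches the dimensions exactly."""
    expected = expected_rgb_size(width, height)
    if len(data) != expected:
        raise ValueError(f"expected {expected} bytes, got {len(data)}")
    return data


# A 2x2 page needs exactly 12 bytes; anything else is refused before conversion.
page = check_rgb_payload(bytes(12), 2, 2)
```

Running the check before the data reaches convert keeps malformed input away from ImageMagick entirely.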

Oh also, if we go this way, can I just have qvm-convert-pdf hold the client's code? No real need for the wrapper if the client's going to be the one calling qrexec-client-vm.

Yes, one file less.

ibokuri · 2020-04-04T02:30:18Z

I don't think that's necessary, but you may need to add flush() calls in some places - especially in the client after sending the data.

~~I was putting flushes everywhere but I'll go back and try again.~~
edit: I'm an idiot, ignore this. I was flushing the wrong stdin...

You're doing something wrong. You should know exactly how many bytes a page has and you should read exactly that many bytes [...]

Oh my god, how did I completely forget about the image dimensions the client gets lol.

Basic verification (data size in this case) should be done before data hit convert. ImageMagick is known for not-so-high code quality and I wouldn't risk what could happen if data size doesn't match exactly.

Gotcha.

marmarek

I've done a more thorough review of the current code. Some of these issues would be fixed by the change discussed above, but some are independent.
I think it would make more sense to focus on one thing at a time - first the python rewrite keeping the old protocol and one-file limit. And only then add multi-file support, using one way or another.

qpdf-convert-client.py

qpdf-convert-server.py

qvm-convert-pdf.py

marmarek

Besides comments inline, do not rename directory to pdf-converter. It is a name of python module (see setup.py) and it cannot contain dashes.

pdf-converter/client.py

pdf-converter/server.py

pdf-converter/client.py

ibokuri · 2020-04-19T03:50:31Z

do not rename directory to pdf-converter. It is a name of python module (see setup.py) and it cannot contain dashes.

That's my bad, not as familiar with Python packaging as I'd like to be.

I can keep the files in a sub-directory though right? ~~I just need to change it to pdf_converter or something?~~ I just didn't want them at the top-level.

edit: Maybe I'll just call it src.

marmarek · 2020-04-19T11:45:58Z

I can keep the files in a sub-directory though right?

You can simply add them into existing qubespdfconverter. They will be installed also as part of the package, but that isn't an issue.

ibokuri · 2020-04-26T16:56:05Z

You can simply add them into existing qubespdfconverter. They will be installed also as part of the package, but that isn't an issue.

Done in 326e867.

ibokuri · 2020-04-26T17:20:01Z

Almost Done!

Barring any changes/fixes you might suggest after going over the new changes, I really only have a few major things I want to get in:

Documentation
Replace the big zip() with a class object.

I'll probably be done with both of these some time today. Other changes/features can wait until after the merge.

Points of Interest

Error handling saw a major improvement in 0449636 and fb7f609 (thank you to whoever thought of asyncio.all_tasks()). I spent quite a bit of time making sure to clean up running tasks and processes if an Exception is raised, but of course if you find something or have any questions/improvements be sure to let me know.
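For readers unfamiliar with the pattern, here is a rough sketch of task cleanup built on `asyncio.all_tasks()`. It is not the code from 0449636 or fb7f609; `cancel_other_tasks`, `worker`, and the simulated failure are hypothetical.

```python
import asyncio


async def cancel_other_tasks():
    """Cancel every task except the caller, then wait for them to unwind."""
    current = asyncio.current_task()
    others = [task for task in asyncio.all_tasks() if task is not current]
    for task in others:
        task.cancel()
    # return_exceptions=True collects each CancelledError instead of re-raising it
    await asyncio.gather(*others, return_exceptions=True)
    return len(others)


async def worker(name, log):
    """Pretend to convert something for a long time; record when cancelled."""
    try:
        await asyncio.sleep(3600)
    except asyncio.CancelledError:
        log.append(name)  # per-task cleanup would go here
        raise


async def main():
    log = []
    for i in range(3):
        asyncio.create_task(worker(f"job{i}", log))
    await asyncio.sleep(0)  # give the workers a chance to start
    try:
        raise RuntimeError("simulated conversion failure")
    except RuntimeError:
        cancelled = await cancel_other_tasks()
    return cancelled, sorted(log)


result = asyncio.run(main())
```

Each worker sees `CancelledError` at its pending `await`, so it can release its own resources before the program exits.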

ibokuri · 2020-05-10T14:38:05Z

Drafting this PR for now. After working on the UX for a bit, I found some bugs that need fixing (mainly around error handling; some around the conversion process too). Besides, nice output and error messages for the user should really be a part of this PR anyway.

ibokuri · 2020-05-10T14:39:07Z

Oops, didn't mean to re-request a review @marmarek, sorry about that.

neowutran · 2020-05-24T07:55:29Z

I don't count myself as experienced enough in python to review the code, however for the failed checks:

https://github.com/bl0nd/qubes-app-linux-pdf-converter/blob/master/.travis.yml#L18 those files don't exist anymore, so the build fails
The commit 13e22fc is not signed

and

+ make install-dom0 DESTDIR=/home/user/rpmbuild/BUILDROOT/qubes-pdf-converter-dom0-2.1.7-1.fc25.x86_64
python3 setup.py install -O1 --root /home/user/rpmbuild/BUILDROOT/qubes-pdf-converter-dom0-2.1.7-1.fc25.x86_64

...

copying qubespdfconverter/client.py -> build/lib/qubespdfconverter
copying qubespdfconverter/server.py -> build/lib/qubespdfconverter

....

 /usr/bin/python3 -O /tmp/tmpl44vzjkj.py
  File "/usr/lib/python3.5/site-packages/qubespdfconverter/client.py", line 60
    width: int
         ^
SyntaxError: invalid syntax

  File "/usr/lib/python3.5/site-packages/qubespdfconverter/server.py", line 111
    self.initial = prefix.with_suffix(f".{i_suffix}")
                                                   ^
SyntaxError: invalid syntax

This syntax doesn't exist in Python 3.5 (variable annotations and f-strings both arrived in 3.6), but anyway there's no reason for fedora-25 (dom0) to try to copy/install/check those files. It's related to 'setup.py', but I didn't look into its exact role.

ibokuri · 2020-05-24T13:34:15Z

Thanks so much! I'm really awful at CI/testing stuff so bear with me if these are stupid questions.

For travis.yml, should I just delete this whole portion?

jobs:
  include:
    - script:
      - shellcheck qpdf-convert-client qpdf-convert-server

For the unsigned commit, how do I sign it without rebasing everything? Or is rebasing okay in this case.
For the dom0 bit, looks like rpm_spec/qpdf-convert-dom0.spec.in calls make install-dom0, which has some setup.py install lines. I agree with dom0 not needing to install those so I can probably just remove the install-dom0 target and the make install-dom0 lines right? In the spec.in there's also this part:

%files
%config(noreplace) %attr(0664,root,qubes) /etc/qubes-rpc/policy/qubes.PdfConvert
%dir %{python2_sitelib}/qubespdfconverter-*.egg-info
%{python2_sitelib}/qubespdfconverter-*.egg-info/*
%{python2_sitelib}/qubespdfconverter
%dir %{python3_sitelib}/qubespdfconverter-*.egg-info
%{python3_sitelib}/qubespdfconverter-*.egg-info/*
%{python3_sitelib}/qubespdfconverter

I think I need to also get rid of the egg-info and qubespdfconverter lines?

marmarek · 2020-05-26T20:41:32Z

1. For travis.yml, should I just delete this whole portion?

Yes. But consider adding pylint instead.

2. For the unsigned commit, how do I sign it without rebasing everything? Or is rebasing okay in this case.

Rebase is the only way. And it is ok for pull request related branch.

3. For the dom0 bit, looks like `rpm_spec/qpdf-convert-dom0.spec.in` calls `make install-dom0`, which has some `setup.py install` lines. I agree with dom0 not needing to install those so I can probably just remove the `install-dom0` target and the `make install-dom0` lines right?

Dom0 indeed doesn't require the actual pdf converter scripts. But the integration tests file (tests.py) should stay. There should be a way to do that with setup() arguments. You can leave it as is, I'll take care of this part.

PNG tasks were being enqueued too quickly, leaving no time for RGB conversions or PNG deletions. This meant that the server would create PNGs for every single page of a PDF before any conversions started, which is clearly not ideal. After experimenting with limits on the number of PNGs created before forcing the PNG creation task to join on the queue, I found that a limit of 1 gave the best performance. Technically, it's a limit of 2 since we start a new task before we await the previous one. In any case, the server is quite a bit faster now and won't run out of space easily.
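The "start a new task before awaiting the previous one" idea can be sketched as follows. This is a simplified illustration rather than the server's actual code: `render_page` is a hypothetical stand-in for the per-page PNG work, and the `state` counters exist only to show how many tasks overlap.

```python
import asyncio


async def render_page(page, state):
    """Hypothetical stand-in for rendering one page; tracks concurrency."""
    state["active"] += 1
    state["peak"] = max(state["peak"], state["active"])
    await asyncio.sleep(0.01)  # pretend to do the rendering
    state["active"] -= 1
    return page


async def render_all(pages):
    """Keep at most two page tasks in flight: the previous and the current."""
    state = {"active": 0, "peak": 0}
    done = []
    previous = None
    for page in pages:
        current = asyncio.create_task(render_page(page, state))
        if previous is not None:
            done.append(await previous)  # wait for the older task first
        previous = current
    if previous is not None:
        done.append(await previous)      # drain the final task
    return done, state["peak"]


pages_done, peak = asyncio.run(render_all(range(1, 6)))
```

Pages still come back in order, and the number of temporary images alive at once stays bounded no matter how long the PDF is.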

ibokuri · 2020-06-19T19:19:04Z

python3-tqdm in Debian 10 seems to be older (unsurprisingly), specifically
there is no reset() method [...] also format_dict doesn't seem to be used

Sorry, did you mean format_dict() isn't used by Debian 10's tqdm or by client.py? If the latter, the function's used internally by tqdm.

(changing bar_format to {desc}...{n}/{total} makes it work)

Does it? To update {total} we would still need to use reset() no? That is, unless you want to create the bars after their associated dispvms have started and have sent us the page numbers, which I didn't do since it (and the use of {n}/{total}) has its own problems:

The order of the bars won't be the same as how the user specified them on the command-line, which is surprisingly very annoying.
If the server fails at parsing out or sending the page numbers, we'd have to create a completely different bar in the exception handling code instead of simply updating an existing one. I'd rather just have 1 bar for each job and update it accordingly rather than maintaining two separate ones for successes or failures.
{n}/{total} can't show statuses like "fail" or "done". I guess you can put it in r_bar and then update r_bar with statuses instead of page numbers on success or failure? I'm pretty sure I've tried this before and decided against it, though I can't remember the exact reason.

it may be better to remove rgb/png files just after merging them into pdf (client side) - converting this 3MB input and 163MB output file took over 3GB in /tmp and barely fit there

Oops, I guess I deleted the removal code and forgot to put it back in. It's back in there now.

I didn't get any error message when run out of space in /tmp - just Total Sanitized Files: 0/1 (and progress for a file "finished" in the middle of file); and the same for issue when bar.reset() failed [...]

Did you run out of space on the server or client? The client should raise an IOError (on merges/saves) or a CalledProcessError (on rep conversions) if you run out of space. If it's not then that's a problem. As for the server, c00e7a1 should prevent the server from running out of space since it only has 1 or 2 images in /tmp at a time.

Interestingly, when I tried to reproduce your error (with the new changes in place) by using a batch size of 500, I didn't even get to run out of space, the process just ended up getting killed by OOM lol.

running 100+ pdftocairo in parallel means the system will be very busy with context switches instead of actual rendering; and also takes more space in /tmp

c00e7a1 makes it so that there should only be 1-2 pdftocairo processes running at any time on the server. The server essentially starts up a conversion and then waits until the previous one finishes before queueing the current one. I don't know why, but this gives waaay better performance than other solutions I've tried.

As for the performance [...]

Changes:

Server starts conversion and waits on previous one
Bulk saving

Performance (excludes VM startup time):

new server, new client (batch: 50, bulk): 3 min
OG server, new client (batch: 50, bulk): 5 min
new server, new client (batch: 50): 16 min
new server, OG client: 27 min

Comments:

Lowering the batch size didn't really have any impact on the time (unless the size was set to something super low like 1 or 2, in which case the time went down a bit).

marmarek · 2020-06-19T20:57:17Z

Sorry, did you mean format_dict() isn't used by Debian 10's tqdm or by client.py? If the latter, the function's used internally by tqdm.

The former. That function simply doesn't exist in that version, the dict is built inline in format_meter.

To update {total} we would still need to use reset() no?

Yes, that's separate issue that is easy to solve like this:

            try:
                self.bar.reset(total=pagenums)
            except AttributeError:
                # tqdm older than 4.32 do not have reset(), open-code it here
                self.bar.last_print_n = self.bar.n = 0 
                self.bar.last_print_t = self.bar.start_t = self.bar._time()
                self.bar.total = pagenums
                self.bar.refresh()

* `{n}/{total}` can't show statuses like "fail" or "done".

That's indeed the case with this solution.

I guess you can put it in r_bar and then update r_bar with statuses instead of page numbers on success or failure?

I don't know tqdm enough, but perhaps the more naive method would work: changing bar_format to done/fail instead of {n}/{total}?

Did you run out of space on the server or client?

In fact both. One because of missing /tmp cleanup on the client side, the other one because of too many parallel pdftocairo processes producing all the output at once.
I think the lack of message was related to bar_format, which didn't include done/fail. But in case of "fail", it would be nice to get some more details.

ibokuri · 2020-06-19T21:34:06Z

    try:
        self.bar.reset(total=pagenums)
    except AttributeError:
        # tqdm older than 4.32 do not have reset(), open-code it here
        self.bar.last_print_n = self.bar.n = 0
        self.bar.last_print_t = self.bar.start_t = self.bar._time()
        self.bar.total = pagenums
        self.bar.refresh()

Ah, I see. I'll try it out.

I don't know tqdm enough, but perhaps the more naive method would work: changing bar_format to done/fail instead of {n}/{total}?

I'll play with the bar some more and see what works.

In fact both. One because of missing /tmp cleanup on the client side, the other one because of too many parallel pdftocairo processes producing all the output at once.

If you have time, try out the new commits and see if it's any better. They should help.

But in case of "fail", it would be nice to get some more details.

Hmmmmm... You didn't see error logs at the end of the program like this?

Sending files...

 foobar.txt...fail
 foobbar2.txt...done

ERROR: foobar.txt: a very nice log message

Total Sanitized Files: 1/2

If an exception's raised and caught, there should be an error like that (note: they all show up at once at the very end of the program). If that's not showing then something's up.

marmarek · 2020-06-19T21:51:47Z

You didn't see error logs at the end of the program like this?

No, it wasn't there.
And also exit code was 0.

qubespdfconverter/client.py

marmarek · 2020-06-20T00:38:04Z

With recent commits, the missing error message is still an issue (but now I do get non-zero exit code correctly). How to test:

mkdir /tmp/small
sudo mount -t tmpfs  none /tmp/small -o size=100M
TMPDIR=/tmp/small qvm-convert-pdf (that large pdf)

ibokuri · 2020-06-20T01:34:09Z

It looks like it was just an unhandled OSError from when we save initial representations. The error logs now show up nicely for me.

I think all that's left is the bar stuff.
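The behaviour discussed above (catch the out-of-space error, print all error logs at the very end, exit non-zero) could be sketched like this. `sanitize_all`, `report`, and `fake_convert` are hypothetical names, and the fake converter only simulates a full /tmp; the real client's structure may differ.

```python
import errno
import subprocess


def sanitize_all(paths, convert):
    """Run convert(path) for each path, collecting errors instead of aborting."""
    errors = []
    for path in paths:
        try:
            convert(path)
        except (OSError, subprocess.CalledProcessError) as exc:
            errors.append(f"ERROR: {path}: {exc}")
    return errors


def report(paths, errors):
    """Build the end-of-run summary text and choose the exit code."""
    lines = list(errors)
    lines.append(f"Total Sanitized Files: {len(paths) - len(errors)}/{len(paths)}")
    return "\n".join(lines), (1 if errors else 0)


def fake_convert(path):
    # Hypothetical stand-in: pretend /tmp fills up while handling one input.
    if path == "foobar.txt":
        raise OSError(errno.ENOSPC, "No space left on device")


paths = ["foobar.txt", "foobar2.txt"]
summary, exit_code = report(paths, sanitize_all(paths, fake_convert))
print(summary)
```

A caller would pass `exit_code` to `sys.exit()` so that scripts wrapping the converter can detect failures.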

marmarek · 2020-06-20T01:39:37Z

Better :)

So, now the only remaining issue is working with older tqdm (Debian buster).

ibokuri · 2020-06-20T21:54:56Z

So, I installed the Debian 10 template (didn't have it before) and made an appvm off of it. Then I copied over the client program, installed python3-pip, ran pip3 install click tqdm pillow, and then ran the program.

It seems to run fine with reset() and format_dict(). I'm probably doing it all wrong, but am I not supposed to install the dependencies through pip or something?

marmarek · 2020-06-20T21:57:51Z

installed python3-pip, ran pip3 install click tqdm pillow

This is the place where you cheated ;)
sudo apt install python3-click python3-tqdm python-pillow

ibokuri · 2020-06-20T22:06:30Z

Huh, is apt used instead of pip when the Makefile runs python3 setup.py install?

marmarek · 2020-06-20T22:15:45Z

No, python3 setup.py install doesn't install dependencies at all. The point is that it should work with dependencies packaged as distribution packages (Debian here), not installed on the side by pip (which has poor integrity protection).

ibokuri · 2020-06-21T17:51:06Z

Uhhh not too sure what to do about the .travis.yml conflict..

marmarek · 2020-06-22T03:07:05Z

Don't worry about conflict, I'll handle it on merge.

marmarek reviewed Apr 3, 2020

qpdf-convert-client.py (outdated)

marmarek requested changes Apr 17, 2020

marmarek requested changes Apr 19, 2020

ibokuri marked this pull request as draft May 10, 2020 14:35

ibokuri requested a review from marmarek May 10, 2020 14:38

ibokuri changed the title ~~Port to Python 3~~ Rewrite in Python 3.7 May 19, 2020

ibokuri marked this pull request as ready for review May 19, 2020 19:56

Jason Phan added 7 commits May 27, 2020 23:05

readme: Remove extra parenthesis

a84a215

wrapper: Update qvm-convert-pdf into Python 3

4e0d635

This commit also adds more robust argument parsing in anticipation of future options and filepath existence checks to avoid potentially wasteful qrexec-client-vm runs.

wrapper: Add logging and trim options

7ef5b33

wrapper: Prepare for multiple file support

9668bfb

wrapper: Remove unneeded main() try block

a014fb3

wrapper: Remove logging

fd9175f

client: Update to Python 3

0a7cfae

Jason Phan added 5 commits June 16, 2020 15:33

server: Rename batch entry variables

8e32abb

client: Implement bulk saves and remove reps appropriately

bcbaf9e

client: Exit with 1 on error

ef41913

meta: Copyright info

2abacda

pylint: Add bad-continuation to .pylintrc

9b655af

marmarek reviewed Jun 19, 2020

qubespdfconverter/client.py (outdated)

Jason Phan added 2 commits June 19, 2020 18:35

client: Simplify image appending

c87e61b

client: Fix output spacing

5eab363

client: Handle out of space error

f1d35f2

Jason Phan added 3 commits June 21, 2020 11:21

client: Add support for older tqdm versions

379659b

pylint: Add expression-not-assigned

64cc14f

makefile: Resolve makefile conflict

1ee08f7

marmarek approved these changes Jun 22, 2020

marmarek merged commit 60b6b5c into QubesOS:master Jun 24, 2020

ibokuri mentioned this pull request Jul 12, 2020

Add support for more file types + archlinux packaging #9

Open

unman mentioned this pull request Sep 16, 2020

Fix check - Ubuntu only has python3.6 #13

Closed
