Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite in Python 3.7 #7

Merged
merged 92 commits into from
Jun 24, 2020
Merged

Rewrite in Python 3.7 #7

merged 92 commits into from
Jun 24, 2020

Conversation

ibokuri
Copy link
Contributor

@ibokuri ibokuri commented Apr 2, 2020

Woohoo! Python and multiple file support!

Note that this doesn't touch any of the GUI parts of the converter nor any of the testing infrastructure, just the CLI. I'm still bit newer to the former two so I'll still be working on those.

Also there's still quite a bit of polishing left to be done on the CLI but I'll save that for the mailing list.

@ibokuri
Copy link
Contributor Author

ibokuri commented Apr 2, 2020

Hm, I uploaded my key to pool.sks-keyservers.net so not too sure why it isn't being picked up.

qpdf-convert-client.py Outdated Show resolved Hide resolved
@marmarek
Copy link
Member

marmarek commented Apr 3, 2020 via email

@ibokuri
Copy link
Contributor Author

ibokuri commented Apr 3, 2020

If the server doesn't read any more data, there is no "nice" way to tell it to terminate. But killing qrexec-client-vm process should do the trick.

So what do you think of replacing sys.exit() calls in the client with os.kill(os.getppid(), signal.SIGTERM) (where the ppid resolves to the pid of qrexec-client-vm)? Tbh it seems kind of uh... ugly/harsh compared to sys.exit() but it'll ensure that if the client dies for whatever reason, so will the server and their communication channel.

@marmarek
Copy link
Member

marmarek commented Apr 3, 2020

It may make more sense to reverse calling order, like done here. In short: instead of calling qrexec-client-vm ... qpdf-converter-client, let qpdf-converter-client call qrexec-client-vm and operate on its stdin/stdout. This way you can easily kill the process whenever you want.

@ibokuri
Copy link
Contributor Author

ibokuri commented Apr 4, 2020

It may make more sense to reverse calling order, [...] let qpdf-converter-client call qrexec-client-vm and operate on its stdin/stdout.

Okay, I've setup a small test to see how that would work. It's basically a synchronous version of the link you sent using subprocess.Popen(). Async stuff can come after a working version.

The problem I have now is that I could only get consistent communication between the server & client if the subprocess is unbuffered. I think it's fine performance-wise since we're just doing IPC and not file writes but this means that the server needs to send to stdout each rgb file's size along with their contents since the server may be faster and send the contents of 2 separate rgb files before the client calls read() which will return the 2 files as a single one.

Any objections to this? If a bad/wrong filesize is sent to the client, either we'll end up an invalid rgb file or a partial one. Either way, we can verify that either visually or when we pass it to convert. If a bad/wrong RGB file is sent, that'll get taken care of in convert as well.

Oh also, if we go this way, can I just have qvm-convert-pdf hold the client's code? No real need for the wrapper if the client's going to be the one calling qrexec-client-vm.

@marmarek
Copy link
Member

marmarek commented Apr 4, 2020

The problem I have now is that I could only get consistent communication between the server & client if the subprocess is unbuffered.

I don't think that's necessary, but you may need to add flush() call in some places - especially in client after sending the data.

I think it's fine performance-wise since we're just doing IPC and not file writes but this means that the server needs to send to stdout each rgb file's size along with their contents since the server may be faster and send the contents of 2 separate rgb files before the client calls read() which will return the 2 files as a single one.

You're doing something wrong. You should know exactly how many bytes a page have and you should read exactly that many bytes (see argument to read()), not everything available. Note that in unbuffered mode, read() may return less bytes than requested, even if they will be more data later. But I'd recommend switching back to buffered mode.

Any objections to this? If a bad/wrong filesize is sent to the client, either we'll end up an invalid rgb file or a partial one. Either way, we can verify that either visually or when we pass it to convert. If a bad/wrong RGB file is sent, that'll get taken care of in convert as well.

Basic verification (data size in this case) should be done before data hit convert. ImageMagick is known for not-so-high code quality and I wouldn't risk what could happen if data size doesn't match exactly.

Oh also, if we go this way, can I just have qvm-convert-pdf hold the client's code? No real need for the wrapper if the client's going to be the one calling qrexec-client-vm.

Yes, one file less.

@ibokuri
Copy link
Contributor Author

ibokuri commented Apr 4, 2020

I don't think that's necessary, but you may need to add flush() call in some places - especially in client after sending the data.

I was putting flushes everywhere but I'll go back and try again.
edit: I'm an idiot, ignore this. I was flushing the wrong stdin...

You're doing something wrong. You should know exactly how many bytes a page have and you should read exactly that many bytes [...]

Oh my god, how did I completely forget about the image dimensions the client gets lol.

Basic verification (data size in this case) should be done before data hit convert. ImageMagick is known for not-so-high code quality and I wouldn't risk what could happen if data size doesn't match exactly.

Gotcha.

Copy link
Member

@marmarek marmarek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've done some more thorough review of the current code. Some of this issues would be fixed by the above discussed change, but some are independent.
I think it would make more sense to focus on one thing at a time - first the python rewrite keeping the old protocol and one-file limit. And only then add multi-file support, using one way or another.

qpdf-convert-client.py Outdated Show resolved Hide resolved
qpdf-convert-client.py Outdated Show resolved Hide resolved
qpdf-convert-client.py Outdated Show resolved Hide resolved
qpdf-convert-client.py Outdated Show resolved Hide resolved
qpdf-convert-client.py Outdated Show resolved Hide resolved
qpdf-convert-server.py Outdated Show resolved Hide resolved
qpdf-convert-server.py Outdated Show resolved Hide resolved
qpdf-convert-server.py Outdated Show resolved Hide resolved
qpdf-convert-server.py Outdated Show resolved Hide resolved
qvm-convert-pdf.py Outdated Show resolved Hide resolved
Copy link
Member

@marmarek marmarek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Besides comments inline, do not rename directory to pdf-converter. It is a name of python module (see setup.py) and it cannot contain dashes.

pdf-converter/client.py Outdated Show resolved Hide resolved
pdf-converter/client.py Outdated Show resolved Hide resolved
pdf-converter/client.py Outdated Show resolved Hide resolved
pdf-converter/client.py Outdated Show resolved Hide resolved
pdf-converter/client.py Outdated Show resolved Hide resolved
pdf-converter/client.py Outdated Show resolved Hide resolved
pdf-converter/client.py Outdated Show resolved Hide resolved
pdf-converter/server.py Outdated Show resolved Hide resolved
pdf-converter/server.py Outdated Show resolved Hide resolved
pdf-converter/client.py Outdated Show resolved Hide resolved
@ibokuri
Copy link
Contributor Author

ibokuri commented Apr 19, 2020

do not rename directory to pdf-converter. It is a name of python module (see setup.py) and it cannot contain dashes.

That's my bad, not as familiar with Python packaging as I'd like to be.

I can keep the files in a sub-directory though right? I just need to change it to pdf_converter or something? I just didn't want them at the top-level.

edit: Maybe I'll just call it src.

@marmarek
Copy link
Member

I can keep the files in a sub-directory though right?

You can simply add them into existing qubespdfconverter. They will be installed also as part of the package, but that isn't an issue.

@ibokuri
Copy link
Contributor Author

ibokuri commented Apr 26, 2020

You can simply add them into existing qubespdfconverter. They will be installed also as part of the package, but that isn't an issue.

Done in 326e867.

@ibokuri
Copy link
Contributor Author

ibokuri commented Apr 26, 2020

Almost Done!

Barring any changes/fixes you might suggest after going over the new changes, I really only have a few major things I want to get in:

  1. Documentation
  2. Replace the big zip() with a class object.

I'll probably be done with both of these some time today. Other changes/features can wait until after the merge.

Points of Interest

Error handling saw a major improvement in 0449636 and fb7f609 (thank you to whoever thought of asyncio.all_tasks()). I spent quite a bit of time making sure to cleanup running tasks and processes if an Exception is raised, but of course if you find something or have any questions/improvements be sure to let me know.

@ibokuri ibokuri marked this pull request as draft May 10, 2020 14:35
@ibokuri
Copy link
Contributor Author

ibokuri commented May 10, 2020

Drafting this PR for now. After working on the UX for a bit, I found some bugs that need fixing (mainly around error handling; some around the conversion process too). Besides, nice output and error messages for the user should really be a part of this PR anyway.

@ibokuri ibokuri requested a review from marmarek May 10, 2020 14:38
@ibokuri
Copy link
Contributor Author

ibokuri commented May 10, 2020

Oops, didn't mean to re-request a review @marmarek, sorry about that.

@ibokuri ibokuri changed the title Port to Python 3 Rewrite in Python 3.7 May 19, 2020
@ibokuri ibokuri marked this pull request as ready for review May 19, 2020 19:56
@neowutran
Copy link

I don't count myself as experienced enough in python to review the code, however for the failed checks:

and

+ make install-dom0 DESTDIR=/home/user/rpmbuild/BUILDROOT/qubes-pdf-converter-dom0-2.1.7-1.fc25.x86_64
python3 setup.py install -O1 --root /home/user/rpmbuild/BUILDROOT/qubes-pdf-converter-dom0-2.1.7-1.fc25.x86_64

...

copying qubespdfconverter/client.py -> build/lib/qubespdfconverter
copying qubespdfconverter/server.py -> build/lib/qubespdfconverter

....

 /usr/bin/python3 -O /tmp/tmpl44vzjkj.py
  File "/usr/lib/python3.5/site-packages/qubespdfconverter/client.py", line 60
    width: int
         ^
SyntaxError: invalid syntax

  File "/usr/lib/python3.5/site-packages/qubespdfconverter/server.py", line 111
    self.initial = prefix.with_suffix(f".{i_suffix}")
                                                   ^
SyntaxError: invalid syntax

This syntax doesn't exist for python3.5, but anyway no reason for fedora-25 (dom0) to try to copy/install/check those files. Related to 'setup.py' but didn't searched his exact role

@ibokuri
Copy link
Contributor Author

ibokuri commented May 24, 2020

Thanks so much! I'm really awful at CI/testing stuff so bear with me if these are stupid questions.

  1. For travis.yml, should I just delete this whole portion?
jobs:
  include:
    - script:
      - shellcheck qpdf-convert-client qpdf-convert-server
  1. For the unsigned commit, how do I sign it without rebasing everything? Or is rebasing okay in this case.

  2. For the dom0 bit, looks like rpm_spec/qpdf-convert-dom0.spec.in calls make install-dom0, which has some setup.py install lines. I agree with dom0 not needing to install those so I can probably just remove the install-dom0 target and the make install-dom0 lines right? In the spec.in there's also this part:

%files
%config(noreplace) %attr(0664,root,qubes) /etc/qubes-rpc/policy/qubes.PdfConvert
%dir %{python2_sitelib}/qubespdfconverter-*.egg-info
%{python2_sitelib}/qubespdfconverter-*.egg-info/*
%{python2_sitelib}/qubespdfconverter
%dir %{python3_sitelib}/qubespdfconverter-*.egg-info
%{python3_sitelib}/qubespdfconverter-*.egg-info/*
%{python3_sitelib}/qubespdfconverter

I think I need to also get rid of the egg-info and qubespdfconverter lines?

@marmarek
Copy link
Member

1. For travis.yml, should I just delete this whole portion?

Yes. But consider adding pylint instead.

1. For the unsigned commit, how do I sign it without rebasing everything? Or is rebasing okay in this case.

Rebase is the only way. And it is ok for pull request related branch.

2\. For the dom0 bit, looks like `rpm_spec/qpdf-convert-dom0.spec.in` calls `make install-dom0`, which has some `setup.py install` lines. I agree with dom0 not needing to install those so I can probably just remove the `install-dom0` target and the `make install-dom0` lines right?

Dom0 indeed doesn't require the actual pdf converter scripts. But integration tests (tests.py) file should stay. There should be a way to do that with setup() arguments. You can leave it as is, I'll take care of this part.

Jason Phan added 7 commits May 27, 2020 23:05
Jason Phan added 5 commits June 16, 2020 15:33
PNG tasks were being enqueued too quickly, leaving no time for RGB conversions
or PNG deletions.  This meant that the server would create PNGs for every
single page of a PDF before any conversions started, which is clearly not
ideal.

After experimenting with limits on the number of PNGs created before forcing
the PNG creation task to join on the queue, I found that a limit of 1 gave the
best performance. Technically, it's a limit of 2 since we start a new task
before we await the previous one. In any case, the server is quite a bit faster
now and won't run out of space easily.
@ibokuri
Copy link
Contributor Author

ibokuri commented Jun 19, 2020

python3-tqdm in Debian 10 seems to be older (unsurprisingly), specifically
there is no reset() method [...] also format_dict doesn't seem to be used

Sorry, did you mean format_dict() isn't used by Debian 10's tqdm or by client.py? If the latter, the function's used internally by tqdm.

(changing bar_format to {desc}...{n}/{total} makes it work)

Does it? To update {total} we would still need to use reset() no? That is, unless you want to create the bars after their associated dispvms have started and have sent us the page numbers, which I didn't do since it (and the use of {n}/{total}) has its own problems:

  • The order of the bars won't be the same as how the user specified them on the command-line, which is surprisingly very annoying.

  • If the server fails at parsing out or sending the page numbers, we'd have to create a completely different bar in the exception handling code instead of simply updating an existing one. I'd rather just have 1 bar for each job and update it accordingly rather than maintaining two separate ones for successes or failures.

  • {n}/{total} can't show statuses like "fail" or "done". I guess you can put it in r_bar and then update r_bar with statuses instead of pages numbers on success or failures? I'm pretty sure I've tried this before though and decided against it, can't remember the exact reason though.

it may be better to remove rgb/png files just after merging them into pdf (client side) - converting this 3MB input and 163MB output file took over 3GB in /tmp and barely fit there

Oops, I guess I deleted the removal code and forgot to put it back in. It's back in there now.

I didn't get any error message when run out of space in /tmp - just Total Sanitized Files: 0/1 (and progress for a file "finished" in the middle of file); and the same for issue when bar.reset() failed [...]

Did you run out of space on the server or client? The client should raise an IOError (on merges/saves) or a CalledProcessError (on rep conversions) if you run out of space. If it's not then that's a problem. As for the server, c00e7a1 should prevent the server from running out of space since it only has 1 or 2 images in /tmp at a time.

Interestingly, when I tried to reproduce your error (with the new changes in place) by using a batch size of 500, I didn't even get to run out of space, the process just ended up getting killed by OOM lol.

running 100+ pdftocairo in parallel means the system will be very busy with context switches instead of actual rendering; and also takes more space in /tmp

c00e7a1 makes it so that there should only be 1-2 pdftocairo processes running at any time on the server. The server essentially starts up a conversion and then waits until the last one finishes before queueing the current on. idk why, this gives waaay better performance than other solutions I've tried.

As for the performance [...]

Changes:

  • Server starts conversion and waits on previous one
  • Bulk saving

Performance (excludes VM startup time):

  • new server, new client (batch: 50, bulk): 3 min
  • OG server, new client (batch: 50, bulk): 5 min
  • new server, new client (batch: 50): 16 min
  • new server, OG client 27 min

Comments:

  • Lowering the batch size didn't really have any impact on the time (unless the size was set to something super low like 1 or 2, in which case the time went down a bit).

@marmarek
Copy link
Member

Sorry, did you mean format_dict() isn't used by Debian 10's tqdm or by client.py? If the latter, the function's used internally by tqdm.

The former. That function simply doesn't exist in that version, the dict is built inline in format_meter.

To update {total} we would still need to use reset() no?

Yes, that's separate issue that is easy to solve like this:

            try:
                self.bar.reset(total=pagenums)
            except AttributeError:
                # tqdm older than 4.32 do not have reset(), open-code it here
                self.bar.last_print_n = self.bar.n = 0 
                self.bar.last_print_t = self.bar.start_t = self.bar._time()
                self.bar.total = pagenums
                self.bar.refresh()
* `{n}/{total}` can't show statuses like "fail" or "done".

That's indeed the case with this solution.

I guess you can put it in r_bar and then update r_bar with statuses instead of pages numbers on success or failures?

I don't know tqdm enough, but perhaps the more naive method would work: changing bar_format to done/fail instead of {n}/{total}?

Did you run out of space on the server or client?

In fact both. One because of missing /tmp cleanup on the client side, the other one because of too many parallel pdfcairo processes producing all the output at once.
I think the lack of message was related to bar_format, which didn't included done/fail now. But in case of "fail", it would be nice to get some more details.

@ibokuri
Copy link
Contributor Author

ibokuri commented Jun 19, 2020

 try:
     self.bar.reset(total=pagenums)
 except AttributeError:
      # tqdm older than 4.32 do not have reset(), open-code it here
      self.bar.last_print_n = self.bar.n = 0 
      self.bar.last_print_t = self.bar.start_t = self.bar._time()
      self.bar.total = pagenums
      self.bar.refresh()

Ah, I see. I'll try it out.

I don't know tqdm enough, but perhaps the more naive method would work: changing bar_format to done/fail instead of {n}/{total}?

I'll play with the bar some more and see what works.

In fact both. One because of missing /tmp cleanup on the client side, the other one because of too many parallel pdfcairo processes producing all the output at once.

If you have time, try out the new commits and see if it's any better. They should help.

But in case of "fail", it would be nice to get some more details.

Hmmmmm... You didn't see error logs at the end of the program like this?

Sending files...

 foobar.txt...fail
 foobbar2.txt...done

ERROR: foobar.txt: a very nice log message

Total Sanitized Files: 1/2

If an exception's raised and caught, there should be an error like that (note: they all show up at once at the very end of the program). If that's not showing then something's up.

@marmarek
Copy link
Member

You didn't see error logs at the end of the program like this?

No, it wasn't there.
And also exit code was 0.

@marmarek
Copy link
Member

With recent commits, the missing error message is still an issue (but now I do get non-zero exit code correctly). How to test:

mkdir /tmp/small
sudo mount -t tmpfs  none /tmp/small -o size=100M
TMPDIR=/tmp/small qvm-convert-pdf (that large pdf)

@ibokuri
Copy link
Contributor Author

ibokuri commented Jun 20, 2020

It looks like it was just an unhandled OSError from when we save initial representations. The error logs now show up nicely now for me.

I think all that's left is the bar stuff.

@marmarek
Copy link
Member

Better :)

So, now the only remaining issue is working with older tqdm (Debian buster).

@ibokuri
Copy link
Contributor Author

ibokuri commented Jun 20, 2020

So, I installed the Debian 10 template (didn't have it before) and made an appvm off of it. Then I copied over the client program, installed python3-pip, ran pip3 install click tqdm pillow, and then ran the program.

It seems to run fine with reset() and format_dict(). I'm probably doing it all wrong, but am I not supposed to install the dependencies through pip or something?

@marmarek
Copy link
Member

installed python3-pip, ran pip3 install click tqdm pillow

This is the place where you cheated ;)
sudo apt install python3-click python3-tqdm python-pillow

@ibokuri
Copy link
Contributor Author

ibokuri commented Jun 20, 2020

Huh, is apt instead pip used when the Makefile runs python3 setup.py install?

@marmarek
Copy link
Member

No, python3 setup.py install doesn't install dependencies at all. The point is it should work with dependencies packaged as distribution packages (Debian here), not installed on a side by pip (which has poor integrity protection).

@ibokuri
Copy link
Contributor Author

ibokuri commented Jun 21, 2020

Uhhh not too sure what to do about the .travis.yml conflict..

@marmarek
Copy link
Member

Don't worry about conflict, I'll handle it on merge.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants