Differential cmap compression #4790
Conversation
We just need to have unit tests that read the initial cmap and bcmap contents and compare the resulting maps/dictionaries.
For the generic viewer, multiple file requests could be a performance issue.
Looks like it's a standard BSD license (didn't check it character by character though :) Is this util used only for compression, or is it used for decompression as well? Is there an npm install for it?
So is that good or bad?
Compression only. I'll check for something npm-installable. I guess that this will be the Python version then.
If it's used for compression only, we don't have to worry about the license policies (http://www.mozilla.org/MPL/license-policy.html) for mozilla-central code (if they are applicable). Also, did you analyze the dependencies, and can we group some files into one?
What is the performance impact? I could imagine that this is quite a bit more complicated.
yury:
ok, I can do that.
This could improve compression a little bit. A quick test shows that bcmaps.tar.zip is ~3% smaller than bcmaps.zip. Are there any advantages other than that? Is reading two small files much slower than reading one large file of the same size?
Yes, each HTTP request adds the overhead of HTTP headers (in our case the bcmaps are often smaller than the headers), server processing, and waiting for the transfer.
bthorben:
It's not complicated. To restore a file, you just copy some Uint8Array subsets from file A and some other Uint8Array subsets from file B. The files are 4k bytes on average (40k max), so it should be fast even if 10 or more files are traversed (which I'll try to avoid). From a theoretical point of view it's fast in the sense that we have global compression that considers all N bytes during data compression, but touches only an expected O(sqrt(N)) bytes for decompression. In contrast, solid rar or other arithmetic coding methods always have to read O(N) bytes to decode a file.
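For illustration, here is a minimal sketch of that reconstruction step, assuming the patch has already been decoded into a list of alternating copy/insert operations (the operation shape and the function name are made up for this example, not taken from the PR):

```js
// Rebuild a cmap from its base file and a decoded patch.
// `base` is the Uint8Array of the base cmap; `ops` is a hypothetical
// list of alternating operations: {copy: true, start, length} copies
// bytes from `base`, {copy: false, bytes} inserts literal bytes.
function reconstructCMap(base, ops) {
  var totalLength = ops.reduce(function (sum, op) {
    return sum + (op.copy ? op.length : op.bytes.length);
  }, 0);
  var result = new Uint8Array(totalLength);
  var offset = 0;
  ops.forEach(function (op) {
    if (op.copy) {
      // Copy a subset of the base file (no per-byte decoding needed).
      result.set(base.subarray(op.start, op.start + op.length), offset);
      offset += op.length;
    } else {
      // Insert bytes stored verbatim in the patch file.
      result.set(op.bytes, offset);
      offset += op.bytes.length;
    }
  });
  return result;
}
```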
Concerning the performance impact, I noticed that bcmaps are parsed multiple times per file, even multiple times per page (tested with pdfs/mao.pdf). Is there a reason for that? EDIT: Opened #4794.
Yury, Thorben, I see the concerns now. @yurydelendik: Is it a good idea to use differentially packed cmaps for the extension, and normally packed cmaps for the viewer? The decompression code would be the same; only the cmap files would be packed differently.
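Purely as an illustration of that split, a minimal sketch; the flag and the directory names below are hypothetical and not part of the repository:

```js
// Hypothetical: the extension could ship differentially packed cmaps
// (smaller on disk, extra local reads are cheap), while the generic
// viewer keeps the normally packed ones to avoid extra HTTP requests.
function getCMapBaseUrl(isExtension) {
  return isExtension ? 'bcmaps-diff/' : 'bcmaps/';
}
```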
I guess we would just have to test how the performance changed. Using PDF.js as an "extension" will probably be the most common use case. If space and performance improve for this use case, it would probably be okay to sacrifice a little bit of performance for the web viewer. The question is how much ;) I would also see two ways around any network / roundtrip problems:
Do you mean ALL files? In this case, all cmaps would be downloaded for each document that uses cmaps.
That would improve my current implementation, but still the download size would be bigger than necessary, as only parts of B, C, ... are needed.
It's interesting to see the dependencies. I might be wrong, but I'm thinking that we can group, e.g., the 78-* files as one, perform only one request, and have an insignificant impact on the size increase for your solution.
In the meantime I improved the generation of dependencies so that
But still, I agree that small files could be grouped.
I skimmed through the commits and thought of a similar yet simpler approach:
Unpacking is trivial and requires a single request. The size gain of this approach should be similar to the gains seen with this PR. Actually it should be slightly better, since we don't need to store the filename of the "parent" cmap.
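The original comment lost some code-formatted details, so the exact layout is unclear; from the 300k blob / 2k cmap arithmetic later in the thread, the idea appears to be one shared base blob plus a small per-cmap patch against it. A hedged sketch, reusing the hypothetical reconstruction helper from above:

```js
// Hypothetical single-blob variant: every cmap is stored as a small
// patch against one shared base blob, so only one base file exists and
// a cmap never depends on another cmap.
function loadCMapFromBlob(baseBlob, patchOps) {
  // `baseBlob` is the single shared Uint8Array that all patches
  // reference; `patchOps` is the decoded patch for one cmap.
  return reconstructCMap(baseBlob, patchOps);
}
```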
Hmm, the purpose of grouping was to speed up loading from the viewer, but grouping will also increase the amount of junk that is loaded but not needed. Grouping can only help the viewer if we bundle those files which are requested together. Do we know which files these are, so that we can decide which files to bundle?
Not yet, but we shall definitely look into that. Knowing which data is common and to which cmaps will be useful, at least to understand the impact on performance. Not sure how to analyze that just yet, but I hope your difflib patch will help us to do that.
That would be a job for the classifier we want to build ;)
Here is a possible grouping:
All numbers are averages of the respective group (exception: the first three totals). The grouping here is optimized in favor of a small overhead when loading separately. EDIT: added single files to the table.
That's really good. If I'm not mistaken, the base files that are separately clustered are a good choice? That will give us only one additional request, right?
Yes, 0.96 on average.
I tweaked the algorithms a little bit and could improve all three metrics (total size, overhead, number of dependencies). I think the viewer penalties (overhead, dependencies) are small enough to go with it now.
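The selection algorithm itself is not spelled out in the thread; as a rough illustration only, this is one way such a base-selection step could look, with all names hypothetical and ignoring details the real code would have to handle (dependency cycles, chain length, grouping):

```js
// Hypothetical greedy base selection: for every cmap, pick as its base
// the file that yields the smallest patch, and fall back to storing
// the file normally when no patch beats the raw size. patchSize(a, b)
// stands for "size of the diff of a against b" and is assumed to be
// precomputed (the ~1h all-pairs diff mentioned elsewhere).
function chooseBases(names, patchSize, rawSize) {
  var bases = {};
  names.forEach(function (name) {
    var best = null;
    names.forEach(function (candidate) {
      if (candidate === name) {
        return;
      }
      var size = patchSize(name, candidate);
      if (best === null || size < best.size) {
        best = { base: candidate, size: size };
      }
    });
    // Only use a base if the patch is actually smaller than the raw file.
    bases[name] = best && best.size < rawSize(name) ? best.base : null;
  });
  return bases;
}
```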
That's awesome. Can you regroup the commits to move the bcmap changes into their own commit?
Yes, I need to clean up the commits anyway. I included a .tmp file in the last commit which contains the diff statistics (the computation takes ~1h), in case anybody wants to play around with it.
Quick show of hands: has anyone given some thought to what I suggested?
Am I correct that you want to pack all data in a single file? This works well for the extension, but if used by the viewer, you have to load the whole pack even if you need only a single file.
Correct. The idea is to have a single binary blob. Yes, you need to load the whole blob.
Sorry, I still don't get how this is more efficient. Am I missing something? Let's assume your blob is 300k and each cmap is 2k. If you need two cmaps, then you need to load 304k. If each cmap is, say, 4k, like in the current version, then the viewer needs to load only 2x (4k + 2k overhead), which is much cheaper.
Loading a single file in memory instantly takes at least 600k (just for comparison, the size of our average PDF is 200k). We shall be looking at something:
Please notice that blobs fetched over the network sometimes cannot be stored, e.g. due to storage policies or private browsing, and we still need to check whether they have been updated to refresh the cache. IMHO grouping into smaller groups and performing 2-3 additional requests per document might be an acceptable solution.
Right. My usage pattern of PDF.js is far from the normal use case. I open multiple PDFs, one after the other, while the extension/built-in PDF.js viewer would load a single PDF at a time, in which case loading a couple of small files is preferable to my approach of one big blob and a single small file.
Code is ready for review from my side.
/botio test
From: Bot.io (Linux) - Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0. Live output at: http://107.21.233.14:8877/90e86e11dc3437a/output.txt
From: Bot.io (Windows) - Received. Command cmd_test from @Snuffleupagus received. Current queue size: 0. Live output at: http://107.22.172.223:8877/6eb90eeb33ead42/output.txt
From: Bot.io (Windows) - Success. Full output at http://107.22.172.223:8877/6eb90eeb33ead42/output.txt Total script time: 23.74 mins
From: Bot.io (Linux) - Success. Full output at http://107.21.233.14:8877/90e86e11dc3437a/output.txt Total script time: 26.03 mins
I slightly updated Readme.md
@yurydelendik What is the status of this PR?
I removed [WIP] from the title, as I'm done with all my checkboxes.
Force-pushed from f7d3330 to 35eeac5
When Yury asked me if there was an npm install for difflib, I must not have looked properly. There are a couple of alternatives on npm, and the current commit uses one of them, so the inclusion of our own difflib code can be skipped. Additionally, I reformatted a lot of code to comply more closely with the style guide and added a long comment which explains the algorithm. I also included the temp file
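As a rough sketch of how an npm difflib port could feed the patch generation, assuming a Python-style SequenceMatcher with getOpcodes() (the exact package API may differ, and the real compressor additionally collapses the result into strictly alternating copy/insert commands):

```js
var difflib = require('difflib'); // assumed Python-style difflib port

// Turn a diff of two cmap files into copy/insert operations compatible
// with the reconstruction sketch shown earlier in the thread.
// `baseBytes` and `newBytes` are assumed to be plain arrays (or strings)
// of the file contents.
function computePatchOps(baseBytes, newBytes) {
  var matcher = new difflib.SequenceMatcher(null, baseBytes, newBytes);
  return matcher.getOpcodes().map(function (op) {
    var tag = op[0], baseStart = op[1], baseEnd = op[2],
        newStart = op[3], newEnd = op[4];
    if (tag === 'equal') {
      // Unchanged range: becomes a "copy" command from the base file.
      return { copy: true, start: baseStart, length: baseEnd - baseStart };
    }
    // 'replace'/'insert' (and 'delete', with an empty range) become an
    // "insert" command: the new bytes are stored verbatim in the patch.
    return { copy: false, bytes: newBytes.slice(newStart, newEnd) };
  });
}
```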
BTW: use w=1 to compare external/cmapscompress/compress.js
@yurydelendik Is there anything else I can do to get it merged?
@fkaelberer I believe Yury is waiting for input from @brendandahl before landing this.
Force-pushed from 35eeac5 to e679f97
Force-pushed from e679f97 to ebff000
I updated my commits to make the code more readable:
and made a minor algorithmic change:
Any updates on this?
Ping @brendandahl for review. As far as I know, @yurydelendik agreed on this change.
Are we still interested in implementing this in some form, or should we close the PR? Note that since this PR was originally submitted, the loading/fetching of CMap files has been refactored and now runs on the main thread instead. Furthermore, we also cache (compressed) CMap files in the worker, which should help reduce the amount of (font-related) data that needs to be loaded. Edit: See PR #8064.
Closing since this is not a problem in practice anymore due to the outlined refactoring steps. If this becomes a concern later on, we can always revisit this. Thanks. |
When I compressed cmap files with winrar, I noticed that they compress 2x better when put in a solid archive, which means that the file contents are very similar.
This PR makes use of this fact by storing cmap files differentially, that is, a cmap file A is either stored normally, or it is stored as a patch that contains the necessary data to recover A from a similar file B.
This reduces the combined size of the cmaps to 620k (54% of 1150k) and the size of pdf.js.xpi from 2MB to 1.5MB.

The patches are computed using a third-party diff tool (cemerick/jsdifflib), which is a partial port of Python's difflib. The dependencies are chosen so that the number of dependencies, the overhead, and the total file size are minimized. This takes over an hour to compute (because each pair of files is diff'ed), but because that has to be done only if the cmap files change, I didn't care much about performance optimization.

The reconstruction of the original files is very simple. Each file starts with a `filename` string. If `filename` is empty, the rest of the file is stored normally. Otherwise, `filename` points to a base file from which the cmap is to be reconstructed. Reconstruction commands are either `copy`, followed by a `start` and `length` number, which means that the respective data has to be fetched from the base file, or `insert`, followed by a `length` number and the corresponding number of bytes to insert. (Since the `copy` and `insert` commands are always alternating, the commands themselves do not have to be stored.) Strings and numbers are stored and restored using the existing functions in cmapscompress/compress.js and cmaps.js.

I marked this as work in progress, because I still have some questions / TODOs that I wanted to check before polishing this:
- Do we need more tests for this? Can somebody provide them?
- Update file verification.
- Update/fix formatting in cmapscompress/Readme.md
- Code / commits
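To make the description above a bit more concrete, here is a hedged skeleton of what reading such a patch file could look like; the reader interface (readString, readNumber, readBytes, and so on) is a hypothetical stand-in for the existing encoding routines in cmaps.js and cmapscompress/compress.js, and reconstructCMap refers to the sketch earlier in the thread:

```js
// Skeleton of a reader for the patch format described above (all
// helper names are hypothetical).
function parseDifferentialCMap(reader, fetchBaseFile) {
  var baseName = reader.readString();
  if (!baseName) {
    // Empty filename: the rest of the file is the cmap itself.
    return reader.readRemainingBytes();
  }
  var base = fetchBaseFile(baseName); // Uint8Array of the base cmap
  var ops = [];
  var copyNext = true; // copy and insert commands strictly alternate
  while (!reader.atEnd()) {
    if (copyNext) {
      var start = reader.readNumber();
      var length = reader.readNumber();
      ops.push({ copy: true, start: start, length: length });
    } else {
      var insertLength = reader.readNumber();
      ops.push({ copy: false, bytes: reader.readBytes(insertLength) });
    }
    copyNext = !copyNext;
  }
  return reconstructCMap(base, ops);
}
```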