
Differential cmap compression #4790

Closed
wants to merge 2 commits

Conversation

fkaelberer
Contributor

When I compressed cmap files with WinRAR, I noticed that they compress 2x better when put into a solid archive, which means that the file contents are very similar.

This PR makes use of this fact by storing cmap files differentially, that is, a cmap file A is either stored normally, or it is stored as a patch that contains the necessary data to recover A from a similar file B.
This reduces the combined size of the cmaps to ~~608k~~ 620k (54% of 1150k) and the size of pdf.js.xpi from 2MB to 1.5MB.

The patches are computed using a third-party diff tool (cemerick/jsdifflib), which is a partial port of Python's difflib.

The dependencies are chosen optimally with respect to the output size, so that the number of dependencies, the overhead, and the total file size are minimized. This takes over an hour to compute (because each pair of files is diff'ed), but since that has to be done only when the cmap files change, I didn't care much about performance optimization.
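
To illustrate the idea, here is a Prim-style greedy sketch of the base selection (not necessarily the exact algorithm in the commits; `patchSize[i][j]` is assumed to hold the size of file i stored as a patch against file j, and `rawSize[i]` the size of file i stored normally):

```js
// Greedy, Prim-style choice of patch bases over a precomputed size matrix.
// Each file is either a root (stored raw, base = -1) or diffed against a
// file that is already placed, so the result is a forest with no cycles.
function chooseBases(rawSize, patchSize) {
  const n = rawSize.length;
  const base = new Array(n).fill(-1);      // -1 = stored normally
  const placed = new Array(n).fill(false);
  const bestCost = rawSize.slice();        // cheapest known way to store file i
  for (let step = 0; step < n; step++) {
    // Pick the cheapest file that is not placed yet.
    let i = -1;
    for (let k = 0; k < n; k++) {
      if (!placed[k] && (i < 0 || bestCost[k] < bestCost[i])) i = k;
    }
    placed[i] = true;
    // Now that file i is available as a base, try to make others cheaper.
    for (let k = 0; k < n; k++) {
      if (!placed[k] && patchSize[k][i] < bestCost[k]) {
        bestCost[k] = patchSize[k][i];
        base[k] = i;
      }
    }
  }
  return base; // base[i] = index of the base file for file i, or -1
}
```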

The reconstruction of the original files is very simple. Each file starts with a filename string. If the filename is empty, the rest of the file is stored normally. Otherwise, the filename points to a base file from which the cmap is to be reconstructed. Reconstruction commands are either

  • copy, followed by a start and length number, which means that the respective data has to be fetched from the base file, or
  • insert followed by a length number and the corresponding number of bytes to insert.

(Since copy and insert commands always alternate, the commands themselves do not have to be stored.) Strings and numbers are stored and restored using the existing functions in cmapscompress/compress.js and cmaps.js.
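
For illustration, the reconstruction loop looks roughly like this (a simplified sketch with a hypothetical in-memory patch layout; the actual byte-level encoding goes through the helpers mentioned above):

```js
// Rebuild a cmap from alternating copy/insert operands. `patch` is a
// hypothetical decoded form: { baseFileName, data, commands }, where
// commands holds [start, length] pairs for copies and plain byte arrays
// for inserts, starting with a copy.
function applyPatch(patch, loadFile) {
  if (patch.baseFileName === '') {
    return patch.data; // stored normally, nothing to reconstruct
  }
  const base = loadFile(patch.baseFileName); // Uint8Array of the base cmap
  const chunks = [];
  let isCopy = true;
  for (const cmd of patch.commands) {
    if (isCopy) {
      const [start, length] = cmd;
      chunks.push(base.subarray(start, start + length)); // fetch from base
    } else {
      chunks.push(cmd); // literal bytes to insert
    }
    isCopy = !isCopy; // copy and insert strictly alternate
  }
  // Concatenate all chunks into the reconstructed file.
  const total = chunks.reduce((sum, c) => sum + c.length, 0);
  const result = new Uint8Array(total);
  let offset = 0;
  for (const c of chunks) {
    result.set(c, offset);
    offset += c.length;
  }
  return result;
}
```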

I marked this as work in progress because I still have some questions / TODOs that I wanted to check before polishing this.

  • Does the license of jsdifflib allow usage in pdf.js? [Yury says yes, if only for compression]
  • Are there any drawbacks I didn't think of?
  • Do we need more tests for this? Can somebody provide them? Update file verification.
  • Reduce number of dependencies to ~1-3 ~~(currently, each file depends recursively on ~10 other files)~~ [At the cost of 1% compression ratio, average dependency length can be reduced to 1.1.]
  • Avoid small cmap files depending on big cmap files.
  • ~~update~~ fix formatting in cmapscompress/Readme.md
  • clean up code / commits

@yurydelendik
Contributor

> Do we need more tests for this? Can somebody provide them?

We just need to have unit(?) tests that read the initial cmap and bcmap contents and compare the resulting maps/dictionaries.
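
Something along these lines, for example (a sketch in the Jasmine style used by the pdf.js test suite; `readFile`, `parseCMap`, and `parseBinaryCMap` are hypothetical stand-ins for the real entry points):

```js
// Compare the mapping produced by a binary cmap against the original.
describe('binary cmaps', function () {
  it('decodes 78-H to the same mapping as the source cmap', function () {
    const expected = parseCMap(readFile('cmaps/78-H'));
    const actual = parseBinaryCMap(readFile('bcmaps/78-H.bcmap'));
    expect(actual.getMap()).toEqual(expected.getMap());
  });
});
```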

> Are there any drawbacks I didn't think of?
> Reduce number of dependencies to ~1-3 (currently, each file depends recursively on ~10 other files)

For the generic viewer, multiple file requests could be a performance issue?

> Does the license of jsdifflib allow usage in pdf.js?

Looks like it's the standard BSD License (didn't check character by character though :) Is this util used only for compression, or is it used for decompression as well? Is there an npm install for it?

@fkaelberer
Contributor Author

> Looks like it's the standard BSD License

So is that good or bad?

> Is this util used only for compression

Compression only. I'll check for something npm-installable. I guess that this will be the Python version then.

@yurydelendik
Contributor

If it's for compression only, we don't have to worry about the license policies (http://www.mozilla.org/MPL/license-policy.html) for mozilla-central code (if they are applicable).

Also, did you analyze the dependencies? Can we group some files into one?

@bthorben
Contributor

What is the performance impact? I could imagine that this is quite a bit more complicated.

@fkaelberer
Contributor Author

yury:

> Do we need more tests for this? Can somebody provide them?
>
> [...] tests that read the initial cmap and bcmap contents and compare the resulting maps/dictionaries.

ok, I can do that.

> Also, did you analyze the dependencies? Can we group some files into one?

This could improve compression a little bit. A quick test shows that bcmaps.tar.zip is ~3% smaller than bcmaps.zip. Are there any advantages other than that? Is reading two small files much slower than reading one large file of the same size?

@yurydelendik
Contributor

> Is reading two small files much slower than reading one large file of the same size?

Yes, each HTTP request adds the overhead of HTTP headers (in our case, bcmaps are often smaller than the header size), server processing, and waiting for the transfer.

@fkaelberer
Contributor Author

bthorben:

> What is the performance impact? I could imagine that this is quite a bit more complicated.

It's not complicated. To restore a file, you just copy some Uint8Array subsets from file A and some other Uint8Array subsets from file B. The files are 4k on average (40k max), so it should be fast even if 10 or more files are traversed (which I'll try to avoid).

From a theoretical point of view it's fast in the sense that we have global compression that considers all N bytes during compression, but touches only an expected O(sqrt(N)) bytes for decompression. In contrast, solid RAR or other arithmetic-coding methods always have to read O(N) bytes to decode a file.
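
For illustration, resolving a file through its dependency chain is just a short recursion over the patch format sketched in the PR description (`fetchPatch` is hypothetical, and a small cache keeps repeated lookups cheap):

```js
// Resolve a cmap by recursively reconstructing its base files first.
const cmapCache = new Map();
function loadCMap(name, fetchPatch) {
  if (cmapCache.has(name)) {
    return cmapCache.get(name);
  }
  const patch = fetchPatch(name); // load the (possibly diffed) file
  const data = applyPatch(patch, (baseName) => loadCMap(baseName, fetchPatch));
  cmapCache.set(name, data);
  return data;
}
```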

@fkaelberer
Contributor Author

Concerning the performance impact, I noticed that bcmaps are parsed multiple times per file, even multiple times per page (tested with pdfs/mao.pdf). Is there a reason for that?

EDIT: Opened #4794.

@fkaelberer closed this May 13, 2014
@fkaelberer reopened this May 13, 2014
@fkaelberer
Contributor Author

Yury, Thorben, I see concerns now:
I always had the browser extension in mind, where all data is available on disk, so additional file accesses don't hurt. But if the file is loaded via the viewer, then differential compression is always worse than regular storage.

@yurydelendik: Is it a good idea to use differentially packed cmaps for the extension, and normally packed cmaps for the viewer? The decompression code would be the same, only the cmap files would be packed differently.

@bthorben
Contributor

> I always had the browser extension in mind, where all data is available on disk, so additional file accesses don't hurt. But if the file is loaded via the viewer, then differential compression is always worse than regular storage.

I guess we would just have to test how the performance changes. Using PDF.js as an "extension" will probably be the most common use case. If space and performance improve for this use case, it would probably be okay to sacrifice a little bit of performance for the web viewer. The question is how much ;)


I would also see two ways around any network / roundtrip problems:

  • We could always just concatenate all the differential files together with an index-table and always request the concatenated file.
  • We actually KNOW which files we need for a certain map. We could just store that we need files A, B, C to get map X and request them all at once.

@fkaelberer
Contributor Author

> We could always just concatenate all the differential files together with an index-table and always request the concatenated file.

Do you mean ALL files? In this case, all cmaps would be downloaded for each document that uses cmaps.

> We actually KNOW which files we need for a certain map. We could just store that we need files A, B, C to get map X and request them all at once.

That would improve my current implementation, but the download size would still be bigger than necessary, as only parts of B, C, ... are needed.

@yurydelendik
Contributor

It's interesting to see the dependencies. I might be wrong, but I'm thinking that we can group, e.g., the 78-* files as one, perform only one request, and have an insignificant impact on the size increase for your solution.

@fkaelberer
Contributor Author

In the meantime I improved the generation of dependencies so that

  • the average number of dependency files is down from 10 to 1.1
  • the average number of loaded, but unused bytes is down from 8300 to 4200
  • cmap sizes increased from 608k to 620k

But still, I agree that small files could be grouped.

@p01
Contributor

p01 commented May 14, 2014

I skimmed through the commits and thought of a similar yet simpler approach:

  1. Concatenate all the unique binary data of the cmaps into a single binary blob B.
  2. Then each cmap consists of a sequence of offset/length pairs (stored as Uint16s or Uint32s) that refer to chunks of data within B.

Unpacking is trivial and requires a single request, since B is common to all cmaps.
Of course B can be compressed too.

The size gain of this approach should be similar to the gains seen with this PR. Actually it should be slightly better since we don't need to store the filename of the "parent" cmap.
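
A minimal sketch of the unpacking step (all names hypothetical):

```js
// Rebuild one cmap from offset/length pairs into the shared blob B.
// `pairs` is a Uint32Array of alternating offset and length values.
function unpackCMap(pairs, blob) {
  let total = 0;
  for (let i = 1; i < pairs.length; i += 2) {
    total += pairs[i]; // sum up the lengths
  }
  const out = new Uint8Array(total);
  let pos = 0;
  for (let i = 0; i < pairs.length; i += 2) {
    const chunk = blob.subarray(pairs[i], pairs[i] + pairs[i + 1]);
    out.set(chunk, pos);
    pos += chunk.length;
  }
  return out;
}
```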

@fkaelberer
Contributor Author

> But still, I agree that small files could be grouped.

Hmm, the purpose of grouping was to speed up loading from the viewer, but grouping will also increase the amount of junk that is loaded but not needed. Grouping can only help the viewer if we bundle those files which are requested together. Do we know which files these are, so that we can decide which files to bundle?

@yurydelendik
Contributor

> Do we know which files these are, so that we can decide which files to bundle?

Not yet, but we shall definitely look into that. Knowing which data is common to which cmaps will be useful, at least to understand the impact on performance. Not sure how to analyze that just yet, but I hope your difflib patch will help us do that.

@bthorben
Contributor

> Do we know which files these are, so that we can decide which files to bundle?

Would be a job for the classifier we want to build ;)

@fkaelberer
Contributor Author

Here is a possible grouping:

Files |  Size  | DiffedSize | Overhead (separate) | Overhead (cluster) | Members
------|--------|------------|---------------------|--------------------|--------
   4  |   2521 |      893   |             267     |           1052     | 78-H, 78-EUC-H, Ext-H, NWP-H
  23  |    369 |       56   |             131     |            919     | Adobe-CNS1-0, Adobe-CNS1-1, Adobe-GB1-1, Adobe-Japan1-3, Adobe-Japan1-2, Adobe-Japan1-1, Adobe-GB1-0
  12  |    151 |       40   |              53     |            330     | B5-V, B5pc-V, HKdla-B5-V, HKgccs-B5-V, HKm314-B5-V, HKdlb-B5-V, HKscs-B5-V, CNS1-V, HKm471-B5-V, ETe
   4  |   1227 |      516   |             319     |            839     | CNS2-H, CNS-EUC-H, CNS-EUC-V, CNS1-H
   9  |    183 |       47   |              37     |            241     | GBK-EUC-V, GB-EUC-V, GBT-EUC-V, GBKp-EUC-V, GBK2K-V, GBpc-EUC-V, GB-V, GBTpc-EUC-V, GBT-V
   6  |   4678 |     3193   |             218     |          14484     | KSC-EUC-H, KSCms-UHC-HW-H, KSCpc-EUC-H, KSC-H, KSC-Johab-H, KSCms-UHC-H
   6  |    165 |       49   |              37     |            133     | KSC-EUC-V, KSCpc-EUC-V, KSC-V, KSC-Johab-V, KSCms-UHC-V, KSCms-UHC-HW-V
   2  |    158 |       97   |              17     |             36     | UniCNS-UTF16-V, UniCNS-UTF32-V
   3  |    180 |       95   |              33     |            104     | UniGB-UTF16-V, UniGB-UTF32-V, UniGB-UTF8-V
   4  |  26587 |    13348   |            6505     |          26804     | UniKS-UTF16-H, UniKS-UCS2-H, UniKS-UTF32-H, UniKS-UTF8-H
   2  |    166 |      101   |              17     |             36     | UniKS-UTF16-V, UniKS-UTF32-V
   3  |   2530 |     1066   |             235     |            668     | 78-RKSJ-H, 78ms-RKSJ-H, Ext-RKSJ-H
   9  |  38909 |    11444   |           13595     |          64094     | UniJIS2004-UTF16-H, UniJIS-UTF16-H, UniJIS2004-UTF32-H, UniJIS2004-UTF8-H, UniJIS-UTF8-H, UniJISX021
   2  |   4431 |     2263   |              45     |             95     | ETHK-B5-H, HKscs-B5-H
   4  |  51251 |    30835   |           17397     |          72088     | UniCNS-UTF16-H, UniCNS-UTF8-H, UniCNS-UTF32-H, UniCNS-UCS2-H
   7  |  32723 |    19061   |           37974     |         100703     | UniGB-UTF16-H, UniGB-UTF32-H, UniGB-UCS2-H, GBKp-EUC-H, GBK2K-H, UniGB-UTF8-H, GBK-EUC-H
   3  |   7285 |     2607   |             181     |            535     | GBT-EUC-H, GBTpc-EUC-H, GBT-H
  65* |   2761 |     2761   |               0     |              0     | * single files
------------------------------------------------------------------------
 168  |1165667 |   636214   |            2935     |          10739     |

All numbers are averages over the respective group (except in the last row, where the first three numbers are totals).
The Overhead columns give the number of unused bytes that have to be loaded before a file can be decoded, when the required base files are loaded separately or as a cluster.

The grouping here is optimized in favor of small overhead when loading separately.

EDIT: added single files to table.

@yurydelendik
Contributor

That's really good. If I'm not mistaken, the separately clustered base files are a good choice? That will give us only one additional request, right?

@fkaelberer
Contributor Author

> That will give us only one additional request, right?

Yes, 0.96 on average.

@fkaelberer
Contributor Author

I tweaked the algorithms a little bit and could improve all three metrics (total size, overhead, number of dependencies). I think the viewer penalties (overhead, dependencies) are small enough to go with it now.

 Files |  Size  | DiffedSize | Overhead | Dependencies | Members
-------|--------|------------|----------|--------------|--------
    4  |   1227 |      516   |     624  |        0.75  | CNS-EUC-H, CNS-EUC-V, CNS1-H, CNS2-H
    6  |   4678 |     3193   |     218  |        1.00  | KSC-EUC-H, KSCms-UHC-HW-H, KSCpc-EUC-H, KSC-H, KSC-Johab-H, KSCms-UHC-H
    2  |    158 |       97   |      17  |        0.50  | UniCNS-UTF16-V, UniCNS-UTF32-V
    4  |  26587 |    13348   |    6505  |        0.75  | UniKS-UTF16-H, UniKS-UCS2-H, UniKS-UTF32-H, UniKS-UTF8-H
    2  |    166 |      101   |      17  |        0.50  | UniKS-UTF16-V, UniKS-UTF32-V
    7  |   1760 |      855   |     482  |        1.14  | B5-H, ETen-B5-H, HKdla-B5-H, HKm471-B5-H, B5pc-H, HKdlb-B5-H, HKm314-B5-H
    3  |    176 |       72   |      12  |        0.67  | GB-V, GB-EUC-V, GBT-V
    3  |    163 |       74   |      18  |        0.67  | KSC-V, KSC-EUC-V, KSC-Johab-V
    3  |    167 |       72   |      15  |        0.67  | KSCms-UHC-V, KSCpc-EUC-V, KSCms-UHC-HW-V
    2  |    180 |      108   |      17  |        0.50  | UniGB-UTF16-V, UniGB-UTF32-V
    4  |    545 |      180   |      85  |        0.75  | Adobe-GB1-4, Adobe-GB1-3, Adobe-GB1-5, Adobe-Japan1-6
    2  |    241 |      133   |      12  |        0.50  | Adobe-Korea1-0, Adobe-Japan1-3
    3  |   3718 |     1974   |    1206  |        0.67  | ETHK-B5-H, HKgccs-B5-H, HKscs-B5-H
    4  |  51251 |    30835   |   17397  |        0.75  | UniCNS-UTF16-H, UniCNS-UTF8-H, UniCNS-UTF32-H, UniCNS-UCS2-H
    3  |    544 |      207   |      28  |        0.67  | GB-EUC-H, GBpc-EUC-H, GB-H
    3  |   7285 |     2607   |     181  |        0.67  | GBT-EUC-H, GBTpc-EUC-H, GBT-H
    5  |  39937 |    24743   |   22534  |        1.00  | UniGB-UTF32-H, GBK2K-H, UniGB-UTF16-H, UniGB-UTF8-H, UniGB-UCS2-H
    2  |  14689 |     7358   |      12  |        0.50  | GBKp-EUC-H, GBK-EUC-H
    3  |    290 |      112   |      15  |        0.67  | 90ms-RKSJ-V, 78ms-RKSJ-V, 90msp-RKSJ-V
    9  |  38909 |    11444   |   13595  |        1.56  | UniJIS2004-UTF16-H, UniJIS-UTF16-H, UniJIS2004-UTF32-H, UniJIS2004-UTF8-H, UniJIS-UTF8-H, UniJI...
    6  |    414 |      101   |      54  |        1.17  | Adobe-CNS1-5, Adobe-GB1-2, Adobe-CNS1-6, Adobe-CNS1-4, Adobe-Japan1-5, Adobe-CNS1-2
    3  |    392 |      155   |      30  |        0.67  | Adobe-CNS1-3, Adobe-Korea1-2, Adobe-Korea1-1
    6  |    287 |       88   |      79  |        1.00  | Adobe-CNS1-0, Adobe-CNS1-1, Adobe-GB1-1, Adobe-Japan1-2, Adobe-GB1-0, Adobe-Japan1-4
    2  |    225 |      125   |      12  |        0.50  | Adobe-Japan1-0, Adobe-Japan1-1
    2  |    148 |       83   |       9  |        0.50  | HKdla-B5-V, HKdlb-B5-V
    3  |    149 |       64   |      15  |        0.67  | HKm314-B5-V, HKgccs-B5-V, HKm471-B5-V
    3  |    181 |       74   |      14  |        0.67  | GBpc-EUC-V, GBKp-EUC-V, GBTpc-EUC-V
    6  |    670 |      165   |      93  |        1.17  | UniJIS-UTF32-V, UniJIS-UTF16-V, UniJIS2004-UTF32-V, UniJISX0213-UTF32-V, UniJISX02132004-UTF32-...
    2  |    171 |       93   |       7  |        0.50  | EUC-V, 78-EUC-V
    3  |    201 |       97   |      10  |        0.67  | RKSJ-V, 78-RKSJ-V, 90pv-RKSJ-V
    2  |    284 |      190   |      46  |        0.50  | Add-V, Add-RKSJ-V
    5  |    771 |      342   |     129  |        1.20  | RKSJ-H, 90ms-RKSJ-H, 83pv-RKSJ-H, 90pv-RKSJ-H, 90msp-RKSJ-H
    3  |   2530 |     1066   |     303  |        0.67  | 78ms-RKSJ-H, Ext-RKSJ-H, 78-RKSJ-H
    5  |   2430 |     1302   |    1129  |        1.00  | 78-H, 78-EUC-H, Ext-H, Add-H, Add-RKSJ-H
    2  |    565 |      305   |      16  |        0.50  | H, EUC-H
    2  |    158 |       88   |       9  |        0.50  | ETen-B5-V, ETHK-B5-V
    4  |    684 |      206   |      55  |        1.00  | UniJIS-UCS2-HW-V, UniJISPro-UCS2-HW-V, UniJIS-UCS2-V, UniJISPro-UCS2-V
    3  |    695 |      414   |     171  |        0.67  | UniJIS-UTF8-V, UniJIS2004-UTF8-V, UniJISPro-UTF8-V
    3  |    228 |      120   |      59  |        0.67  | NWP-V, Ext-V, Ext-RKSJ-V
    2  |    167 |       89   |       5  |        0.50  | V, 78-V
   27* |   5388 |     5388   |       0  |           0  | * single files
-------|--------|------------|----------|--------------|
  168  |1165667 |   626948   |    2099  |        0.71  |

@yurydelendik
Contributor

That's awesome. Can you regroup commits to move bcmap changes into their own commit?

@fkaelberer
Contributor Author

Yes, I need to clean up the commits anyway.

I included a .tmp file in the last commit which contains the diff statistics (the computation takes ~1h). So if anybody wants to play around with it, `node make cmaps` should now complete within a few minutes.
I'll remove the .tmp file when cleaning up the commits.

@p01
Contributor

p01 commented May 16, 2014

Quick show of hands: has anyone given some thought to what I suggested? Should I pursue this idea and implement it next week to see where it stands in practice?

@fkaelberer
Contributor Author

Am I correct that you want to pack all data in a single file? This works well for the extension, but if used by the viewer, you have to load the whole pack even if you need only a single file.

@p01
Contributor

p01 commented May 16, 2014

Correct. The idea is to have a single binary blob B (which we can compress, of course), and the cmaps refer to different runs from B. These cmaps should be very small then.

Yes, you need to load B once and for all, even if you need only a single cmap, but it pays off if you need to load more, because each cmap is then very cheap to load and decode.

@fkaelberer
Contributor Author

> Yes, you need to load B once and for all, even if you need only a single cmap, but it pays off if you need to load more, because each cmap is then very cheap to load and decode.

Sorry, I still don't get how this is more efficient. Am I missing something? Let's assume your blob is 300k and each cmap is 2k. If you need two cmaps, then you need to load 304k.

If each cmap is, say, 4k, like in the current version, then the viewer needs to load only 2x (4k + 2k overhead), which is much cheaper.

@yurydelendik
Contributor

> Concatenate all the unique binary data of the cmaps into a single binary blob B

Loading a single file into memory instantly takes at least 600k (just for comparison, the size of our average PDF is 200k).

We should be looking for something that:

  1. is small;
  2. requires fewer requests to fetch;
  3. is fast to decompress into the CMap structure.

Please note that blobs fetched over the network sometimes cannot be stored, e.g. due to storage policies or private browsing, and we still need to check whether they have been updated to refresh the cache.

IMHO, grouping into smaller groups and performing 2-3 additional requests per document might be an acceptable solution.

@p01
Contributor

p01 commented May 16, 2014

Right. My usage pattern of PDF.js is far from the normal use case: I open multiple PDFs, one after the other, while the extension/built-in PDF.js viewer would load a single PDF at a time, in which case loading a couple of small files is preferred over my approach of one big blob and a single small file.

@fkaelberer
Contributor Author

Code is ready for review from my side.
The bcmaps are even down to < 50% now, by using delta offsets for the start/end parameters and by discovering a quality parameter in the diff tool.
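
The delta-offset idea, roughly (a hypothetical encoder sketch: instead of absolute start positions, store the gap since the end of the previous copy, so the numbers stay small and encode into fewer bytes):

```js
// Convert absolute [start, length] copy commands into delta form.
function deltaEncodeCopies(copies) {
  let prevEnd = 0;
  return copies.map(([start, length]) => {
    const delta = start - prevEnd; // usually a small number
    prevEnd = start + length;
    return [delta, length];
  });
}
```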

@Snuffleupagus
Collaborator

/botio test

@pdfjsbot

From: Bot.io (Linux)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.21.233.14:8877/90e86e11dc3437a/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_test from @Snuffleupagus received. Current queue size: 0

Live output at: http://107.22.172.223:8877/6eb90eeb33ead42/output.txt

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://107.22.172.223:8877/6eb90eeb33ead42/output.txt

Total script time: 23.74 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@pdfjsbot

From: Bot.io (Linux)


Success

Full output at http://107.21.233.14:8877/90e86e11dc3437a/output.txt

Total script time: 26.03 mins

  • Font tests: Passed
  • Unit tests: Passed
  • Regression tests: Passed

@fkaelberer
Contributor Author

I slightly updated Readme.md

@timvandermeij
Contributor

@yurydelendik What is the status of this PR?

@fkaelberer changed the title from "[WIP] Differential cmap compression" to "Differential cmap compression" on Jun 27, 2014
@fkaelberer
Contributor Author

I removed [WIP] from the title, as I'm done with all my checkboxes.

@fkaelberer
Contributor Author

When Yury asked me if there was an npm install for difflib, I must not have looked right. There are a couple of alternatives on npm, and the current commit uses one of them, so the inclusion of our own difflib code can be skipped.

Additionally, I reformatted a lot of code to comply more closely with the style guide and added a long comment which explains the algorithm. I also included the temp file /external/bcmaps_temp/savings.json, which holds a matrix of how many bytes can be saved when using differential storage to store file i with file j as base. It is in the 'temp' folder because it is automatically recomputed if the folder is missing, but that takes hours to do.

@fkaelberer
Contributor Author

BTW: use ?w=1 (ignore whitespace changes) to compare external/cmapscompress/compress.js.

@fkaelberer
Contributor Author

@yurydelendik Is there anything else I can do to get it merged?

@timvandermeij
Contributor

@fkaelberer I believe Yury is waiting for input from @brendandahl before landing this.

@brendandahl self-assigned this Oct 23, 2014
@fkaelberer
Contributor Author

I updated my commits to make the code more readable:

  • moved tree logic to its own class
  • split up long functions
  • use more descriptive variable and function names

and a minor algorithmic change:

  • don't store file sizes in data files

@mattdbr

mattdbr commented Jul 25, 2015

Any updates on this?

@timvandermeij
Contributor

Ping @brendandahl for reviewing this. As far as I know @yurydelendik agreed on this change.

@Snuffleupagus
Collaborator

Snuffleupagus commented Feb 3, 2018

Are we still interested in implementing this in some form, or should we close the PR?

Note that since this PR was originally submitted, the loading/fetching of CMap files has been refactored and now runs on the main thread instead. Furthermore, we also cache (compressed) CMap files in the worker, which should help reduce the amount of (font-related) data that needs to be loaded. Edit: See PR #8064.

@timvandermeij
Contributor

timvandermeij commented Feb 4, 2018

Closing since this is not a problem in practice anymore due to the outlined refactoring steps. If this becomes a concern later on, we can always revisit this. Thanks.
