Improve memory usage around the `BasePdfManager.docBaseUrl` parameter (PR 7689 follow-up) #13105

Snuffleupagus · 2021-03-15T11:58:40Z

While there is nothing outright wrong with the existing implementation, it can lead to increased memory usage in one particular case (which I completely overlooked when implementing this):
"data:"-URLs by definition contain the entire PDF document and can thus be arbitrarily large, so in that case we obviously want to avoid sending, storing, and/or logging the "raw" docBaseUrl.

To address this, this patch makes the following changes:

- Ignore any non-string `docBaseUrl` option passed to `getDocument` already on the main-thread, since those are unsupported anyway.
- Ignore "data:"-URLs in the `docBaseUrl` option passed to `getDocument`, to avoid having to send what could potentially be a very long string to the worker-thread.
- Parse the `docBaseUrl` option directly in the `BasePdfManager`-constructors, on the worker-thread, to avoid having to store the "raw" docBaseUrl in the first place.
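
As a rough illustration of the three changes listed above, here is a minimal sketch. It is not the code from this patch: `sanitizeDocBaseUrl` and `parseDocBaseUrl` are hypothetical names, and the body of `isDataScheme` is an assumed implementation of a helper with that name.

```javascript
// Hypothetical sketch only; names and bodies are assumptions, not PDF.js code.

// Assumed implementation: true if the string uses the "data:" scheme,
// ignoring leading whitespace and letter case.
function isDataScheme(url) {
  return /^\s*data:/i.test(url);
}

// Main-thread side: drop non-string values and "data:"-URLs before the
// option would be sent to the worker-thread.
function sanitizeDocBaseUrl(docBaseUrl) {
  if (typeof docBaseUrl !== "string" || isDataScheme(docBaseUrl)) {
    return null;
  }
  return docBaseUrl;
}

// Worker-thread side: parse once (e.g. in a constructor) and keep only the
// normalized absolute URL, so the "raw" string need not be stored.
function parseDocBaseUrl(docBaseUrl) {
  if (!docBaseUrl) {
    return null;
  }
  try {
    return new URL(docBaseUrl).href;
  } catch {
    return null; // Not a valid absolute URL.
  }
}
```

With this sketch, a large "data:"-URL never leaves the main-thread, and the worker only retains a string when it parses as an absolute URL.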

pdfjsbot · 2021-03-15T12:02:01Z

From: Bot.io (Windows)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/09ea6c4c5e0f289/output.txt

pdfjsbot · 2021-03-15T12:02:01Z

From: Bot.io (Linux m4)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/afb7e8661eff4a8/output.txt

pdfjsbot · 2021-03-15T12:05:34Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/afb7e8661eff4a8/output.txt

Total script time: 3.53 mins

Unit Tests: Passed

pdfjsbot · 2021-03-15T12:07:55Z

From: Bot.io (Windows)

Success

Full output at http://3.101.106.178:8877/09ea6c4c5e0f289/output.txt

Total script time: 5.88 mins

Unit Tests: Passed

pdfjsbot · 2021-03-15T16:23:14Z

From: Bot.io (Linux m4)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/b2d44514e5627ab/output.txt

pdfjsbot · 2021-03-15T16:23:14Z

From: Bot.io (Windows)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/7941b2c8a0fedd4/output.txt

pdfjsbot · 2021-03-15T16:26:49Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/b2d44514e5627ab/output.txt

Total script time: 3.58 mins

Unit Tests: Passed

pdfjsbot · 2021-03-15T16:29:10Z

From: Bot.io (Windows)

Success

Full output at http://3.101.106.178:8877/7941b2c8a0fedd4/output.txt

Total script time: 5.92 mins

Unit Tests: Passed

pdfjsbot · 2021-03-16T11:24:44Z

From: Bot.io (Windows)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/6335077df911e03/output.txt

pdfjsbot · 2021-03-16T11:24:44Z

From: Bot.io (Linux m4)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/e7eccbdcccfe022/output.txt

pdfjsbot · 2021-03-16T11:28:21Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/e7eccbdcccfe022/output.txt

Total script time: 3.60 mins

Unit Tests: Passed

pdfjsbot · 2021-03-16T11:29:56Z

From: Bot.io (Windows)

Success

Full output at http://3.101.106.178:8877/6335077df911e03/output.txt

Total script time: 5.19 mins

Unit Tests: Passed

…s` and into `src/display/display_utils.js`

It seems reasonable to place this alongside the *similar* `getFilenameFromUrl` helper function. This way, with the changes in the next patch, we also avoid having to expose the `isDataScheme` function in the API itself, and we instead expose `getPdfFilenameFromUrl` in the API (which feels more appropriate overall).

pdfjsbot · 2021-03-17T14:52:33Z

From: Bot.io (Windows)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/ea9dc7c0af755bc/output.txt

pdfjsbot · 2021-03-17T14:52:33Z

From: Bot.io (Linux m4)

Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/a95ebfb267277e9/output.txt

pdfjsbot · 2021-03-17T14:56:06Z

From: Bot.io (Linux m4)

Success

Full output at http://54.67.70.0:8877/a95ebfb267277e9/output.txt

Total script time: 3.52 mins

Unit Tests: Passed

pdfjsbot · 2021-03-17T14:58:09Z

From: Bot.io (Windows)

Success

Full output at http://3.101.106.178:8877/ea9dc7c0af755bc/output.txt

Total script time: 5.58 mins

Unit Tests: Passed

timvandermeij · 2021-03-19T22:03:27Z

Thank you for improving this!

Snuffleupagus marked this pull request as draft March 15, 2021 13:45

Snuffleupagus force-pushed the BasePdfManager-parseDocBaseUrl branch 4 times, most recently from 27f4489 to 7756e93 Compare March 15, 2021 16:19

Snuffleupagus marked this pull request as ready for review March 15, 2021 16:22

Snuffleupagus force-pushed the BasePdfManager-parseDocBaseUrl branch 6 times, most recently from b8b0ea8 to 6515cff Compare March 16, 2021 11:18

timvandermeij added the core label Mar 16, 2021

Snuffleupagus force-pushed the BasePdfManager-parseDocBaseUrl branch from 6515cff to acc953f Compare March 16, 2021 19:59

Snuffleupagus added 2 commits March 17, 2021 15:48

Snuffleupagus force-pushed the BasePdfManager-parseDocBaseUrl branch from acc953f to c4c7216 Compare March 17, 2021 14:48

timvandermeij approved these changes Mar 19, 2021

timvandermeij merged commit 8269ddb into mozilla:master Mar 19, 2021

Snuffleupagus deleted the BasePdfManager-parseDocBaseUrl branch March 19, 2021 22:10
