Improve memory usage around the BasePdfManager.docBaseUrl parameter (PR 7689 follow-up) #13105

Merged


Snuffleupagus
Collaborator

@Snuffleupagus Snuffleupagus commented Mar 15, 2021

While there is nothing outright wrong with the existing implementation, it can lead to increased memory usage in one particular case (which I completely overlooked when implementing this):
"data:"-URLs by definition contain the entire PDF document and can thus be arbitrarily large, so we obviously want to avoid sending, storing, and/or logging the "raw" docBaseUrl in that case.

To address this, this patch makes the following changes:

  • Ignore any non-string docBaseUrl option passed to getDocument already on the main-thread, since those are unsupported anyway.

  • Ignore "data:"-URLs in the docBaseUrl option passed to getDocument, to avoid having to send what could potentially be a very long string to the worker-thread.

  • Parse the docBaseUrl option directly in the BasePdfManager-constructors, on the worker-thread, to avoid having to store the "raw" docBaseUrl in the first place.
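
The three changes listed above can be illustrated with a rough sketch. This is not the code from the patch: `sanitizeDocBaseUrl`, `parseDocBaseUrl`, and `SketchPdfManager` are hypothetical names chosen for illustration, and the `isDataScheme` shown here is a simplified stand-in for the real helper of that name.

```javascript
// Simplified stand-in: true if the URL uses the "data:" scheme,
// ignoring leading whitespace and letter case.
function isDataScheme(url) {
  return /^\s*data:/i.test(url);
}

// Main-thread side (hypothetical helper): drop anything that is not a
// string, and drop "data:"-URLs so that a potentially huge string is
// never sent to the worker-thread.
function sanitizeDocBaseUrl(docBaseUrl) {
  if (typeof docBaseUrl !== "string" || isDataScheme(docBaseUrl)) {
    return null;
  }
  return docBaseUrl;
}

// Worker-thread side (hypothetical helper): parse once and keep only
// the normalized absolute URL, or null if the input is unusable.
function parseDocBaseUrl(url) {
  if (typeof url !== "string" || url.length === 0) {
    return null;
  }
  try {
    return new URL(url).href;
  } catch {
    return null; // Not a valid absolute URL.
  }
}

// Stand-in for a manager class: the constructor stores the parsed
// value only, so the "raw" option is never retained.
class SketchPdfManager {
  constructor(docId, docBaseUrl) {
    this._docId = docId;
    this._docBaseUrl = parseDocBaseUrl(docBaseUrl);
  }

  get docBaseUrl() {
    return this._docBaseUrl;
  }
}
```

With this arrangement, a caller that passes a "data:"-URL as `docBaseUrl` ends up with a `null` base URL on the worker-thread, and the long string is discarded before any message is posted.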

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/09ea6c4c5e0f289/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/afb7e8661eff4a8/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/afb7e8661eff4a8/output.txt

Total script time: 3.53 mins

  • Unit Tests: Passed

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://3.101.106.178:8877/09ea6c4c5e0f289/output.txt

Total script time: 5.88 mins

  • Unit Tests: Passed

@Snuffleupagus Snuffleupagus marked this pull request as draft March 15, 2021 13:45
@Snuffleupagus Snuffleupagus force-pushed the BasePdfManager-parseDocBaseUrl branch 4 times, most recently from 27f4489 to 7756e93 Compare March 15, 2021 16:19
@Snuffleupagus Snuffleupagus marked this pull request as ready for review March 15, 2021 16:22
@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/b2d44514e5627ab/output.txt

@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/7941b2c8a0fedd4/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/b2d44514e5627ab/output.txt

Total script time: 3.58 mins

  • Unit Tests: Passed

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://3.101.106.178:8877/7941b2c8a0fedd4/output.txt

Total script time: 5.92 mins

  • Unit Tests: Passed

@Snuffleupagus Snuffleupagus force-pushed the BasePdfManager-parseDocBaseUrl branch 6 times, most recently from b8b0ea8 to 6515cff Compare March 16, 2021 11:18
@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/6335077df911e03/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/e7eccbdcccfe022/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/e7eccbdcccfe022/output.txt

Total script time: 3.60 mins

  • Unit Tests: Passed

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://3.101.106.178:8877/6335077df911e03/output.txt

Total script time: 5.19 mins

  • Unit Tests: Passed

@Snuffleupagus Snuffleupagus force-pushed the BasePdfManager-parseDocBaseUrl branch from 6515cff to acc953f Compare March 16, 2021 19:59
…s` and into `src/display/display_utils.js`

It seems reasonable to place this alongside the *similar* `getFilenameFromUrl` helper function. This way, with the changes in the next patch, we also avoid having to expose the `isDataScheme` function in the API itself and we instead expose `getPdfFilenameFromUrl` in the API (which feels overall more appropriate).
… (PR 7689 follow-up)

While there is nothing *outright* wrong with the existing implementation, it can lead to increased memory usage in one particular case (which I completely overlooked when implementing this):
"data:"-URLs by definition contain the entire PDF document and can thus be arbitrarily large, so we obviously want to avoid sending, storing, and/or logging the "raw" docBaseUrl in that case.

To address this, this patch makes the following changes:
 - Ignore any non-string `docBaseUrl` option passed to `getDocument` already on the main-thread, since those are unsupported anyway.

 - Ignore "data:"-URLs in the `docBaseUrl` option passed to `getDocument`, to avoid having to send what could potentially be a *very* long string to the worker-thread.

 - Parse the `docBaseUrl` option *directly* in the `BasePdfManager`-constructors, on the worker-thread, to avoid having to store the "raw" docBaseUrl in the first place.
@Snuffleupagus Snuffleupagus force-pushed the BasePdfManager-parseDocBaseUrl branch from acc953f to c4c7216 Compare March 17, 2021 14:48
@pdfjsbot

From: Bot.io (Windows)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://3.101.106.178:8877/ea9dc7c0af755bc/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Received

Command cmd_unittest from @Snuffleupagus received. Current queue size: 0

Live output at: http://54.67.70.0:8877/a95ebfb267277e9/output.txt

@pdfjsbot

From: Bot.io (Linux m4)


Success

Full output at http://54.67.70.0:8877/a95ebfb267277e9/output.txt

Total script time: 3.52 mins

  • Unit Tests: Passed

@pdfjsbot

From: Bot.io (Windows)


Success

Full output at http://3.101.106.178:8877/ea9dc7c0af755bc/output.txt

Total script time: 5.58 mins

  • Unit Tests: Passed

@timvandermeij timvandermeij merged commit 8269ddb into mozilla:master Mar 19, 2021
@timvandermeij
Contributor

Thank you for improving this!

@Snuffleupagus Snuffleupagus deleted the BasePdfManager-parseDocBaseUrl branch March 19, 2021 22:10
3 participants