
Refactoring: Interception interruption + Better handling of non-web resources as main url #115

Merged · 15 commits · Mar 7, 2023

Conversation


@matteocargnelutti (Collaborator) commented on Mar 3, 2023:

Edit: Extended the scope of this PR to include both #113 and #112


Refactoring: Interception interruption

  • Makes it possible for certain generated exchanges to skip time / size constraints.
    As discussed, this is necessary for the provenance summary, to which this new rule applies.
  • Refactoring: logic to decide if a step needs to run or not.
  • Refactoring: interruption of requests interception.
    Instead of calling capture.teardown(), we now turn the recordExchanges flag off.
  • Refactoring: removed keepPartialResponses option. TBD

Better handling of non-web resources as main url

Scoop might be used to capture non-web resources (.pdf, .docx, .mp4, etc.) directly, as opposed to as part of a web page.

When the main url to capture is not a web page, using the browser to do so is more a problem than a solution: for example, Chromium does not support rendering PDFs in headless mode.

In this PR, I added an out-of-browser detection and capture step to account for cases when the main url is not a web page.
If that's the case, we use curl behind ScoopProxy to intercept exchanges.

This therefore supports our captureTimeout, maxCaptureSize, blocklist and raw exchanges systems out of the box.
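As a rough sketch of what routing that curl call through the proxy can look like (the helper and option names below are illustrative assumptions, not Scoop's actual API; only the curl flags themselves are real):

```javascript
// Illustrative sketch: building the argument list for an out-of-browser curl
// capture routed through the intercepting proxy. `proxyPort` and
// `captureTimeout` are assumed option names, not Scoop's actual API.
function buildCurlArgs (url, { proxyPort, captureTimeout }) {
  return [
    url,
    '--proxy', `http://localhost:${proxyPort}`, // every exchange goes through ScoopProxy
    '--max-time', String(Math.ceil(captureTimeout / 1000)), // curl takes seconds
    '--insecure', // the intercepting proxy re-signs TLS with its own certificate
    '--output', '/dev/null' // the proxy records the exchange; curl's copy is discarded
  ]
}
```

Because the proxy sits in the middle of every exchange, the same blocklist and size accounting that apply to browser traffic apply here too.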

Example: logs when Scoop detects that the main url is a PDF file.

```
[18:38:55] INFO STEP [1/11]: Intercepter
[18:38:55] INFO TCP-Proxy-Server started {"address":"::1","family":"IPv6","port":9000}
[18:38:55] INFO STEP [2/11]: Out-of-browser detection and capture of non-web resource
[18:38:56] WARN Requested URL is not a web page (detected: application/pdf)
[18:38:56] INFO Scoop will attempt to capture this resource out-of-browser
[18:39:01] INFO Resource fully captured (9753770 bytes)
[18:39:01] WARN STEP [3/11]: Initial page load (skipped)
[18:39:01] WARN STEP [4/11]: Browser scripts (skipped)
[18:39:01] WARN STEP [5/11]: Wait for network idle (skipped)
[18:39:01] WARN STEP [6/11]: Scroll-up (skipped)
[18:39:01] WARN STEP [7/11]: Screenshot (skipped)
[18:39:01] WARN STEP [8/11]: Out-of-browser capture of video as attachment (if any) (skipped)
[18:39:01] INFO STEP [9/11]: Provenance summary
[18:39:01] WARN STEP [10/11]: Capture page info (skipped)
[18:39:01] INFO STEP [11/11]: Teardown
[18:39:01] INFO Closing browser and intercepter
```

@matteocargnelutti changed the title from "Refactoring: Interception interruption" to "Refactoring: Interception interruption + Better handling of non-web resources as main url" on Mar 3, 2023
```javascript
//
try {
  const before = new Date()
  const headRequest = await fetch(this.url, {
```
@matteocargnelutti (Collaborator, Author) commented on Mar 3, 2023:

Note: I was tempted to move this entire step after the initial page load in capture(), and use the info we got from the page to make that determination instead of using fetch() to detect Content-Type.

The problem is that page.goto() throws when trying to access non-web resources and wait for page load.
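For context, the detection boils down to inspecting the Content-Type of a lightweight request before involving the browser. A minimal sketch of that decision (the function name and MIME list are illustrative assumptions, not Scoop's actual code):

```javascript
// Illustrative sketch: only these MIME types are treated as "web pages" worth
// loading in the browser; anything else triggers the out-of-browser path.
const WEB_PAGE_MIME_TYPES = ['text/html', 'application/xhtml+xml']

function isWebPage (contentTypeHeader) {
  if (!contentTypeHeader) {
    return true // no Content-Type: let the browser try anyway
  }
  // Strip parameters such as "; charset=utf-8" before comparing.
  const mimeType = contentTypeHeader.split(';')[0].trim().toLowerCase()
  return WEB_PAGE_MIME_TYPES.includes(mimeType)
}
```

A HEAD request's `headers.get('content-type')` would feed this check; `application/pdf`, for instance, comes back false.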

Made it run even on `PARTIAL` state.

This step grabs info about the page from the browser and also tries to grab the favicon if missing (it is not pulled automatically in headless mode).

Replaced the fetch call for doing that with `curl` + `ScoopProxy`, as a way to enforce `PARTIAL` state and the `blocklist`.

Bonus: the favicon is no longer a generated exchange in that case.
Added title and description to WACZ export
Publish allowlist tweak
Replaced by ytDlpHash (privacy concern, TBD)
@matteocargnelutti (Collaborator, Author) commented:
@leppert -- Added to this PR: redaction of ytDlpPath in provenance summary, replaced by checksum of yt-dlp executable.

`Scoop.js` (outdated review thread, resolved)
**Recommended:**
- `captureVideoAsAttachment` option: A Python 3 interpreter should be available for `yt-dlp` to function.
**Other (recommended) system-wide dependencies:**
- `curl`
Collaborator commented:

All roads lead to curl

```diff
   */
  checkAndEnforceSizeLimit () {
    if (this.byteLength >= this.options.maxCaptureSize && this.capture.state === Scoop.states.CAPTURE) {
      this.capture.log.warn(`Max size ${this.options.maxCaptureSize} reached. Ending interception.`)
      this.capture.state = Scoop.states.PARTIAL
-     this.capture.teardown()
+     this.recordExchanges = false
```
Collaborator commented:
The only thing that comes to mind here is that a capture could take longer than it needs to once it hits this limit, since it'll continue loading resources that aren't then captured; but I suppose we need to do that if there's going to be a screenshot or PDF down the line.

@matteocargnelutti (Collaborator, Author) replied:

That's the trade-off indeed. I believe it is worth it: that way, we make sure we can still collect the provenance summary and page info regardless of what happens. Users also still have the captureTimeout lever to make sure things don't take "too" long.

@matteocargnelutti merged commit ae6c26d into main on Mar 7, 2023
@matteocargnelutti deleted the interrupt-refactoring branch on Mar 8, 2023