
Refactoring: Interception interruption + Better handling of non-web resources as main url #115

Merged · 15 commits · Mar 7, 2023

Conversation


@matteocargnelutti (Collaborator) commented on Mar 3, 2023:

Edit: Extended the scope of this PR to include both #113 and #112


Refactoring: Interception interruption

  • Makes it possible for certain generated exchanges to skip time / size constraints.
    As discussed, this is necessary for the provenance summary, to which this new rule applies.
  • Refactoring: logic to decide if a step needs to run or not.
  • Refactoring: interruption of requests interception.
    Instead of calling capture.teardown(), we now turn the recordExchanges flag off.
  • Refactoring: removed keepPartialResponses option. TBD

Better handling of non-web resources as main url

Scoop might be used to capture non-web resources (.pdf, .docx, .mp4, etc.) directly, as opposed to as part of a web page.

When the main url to capture is not a web page, using the browser to do so is more a problem than a solution: for example, Chromium does not support rendering PDFs in headless mode.

In this PR, I added an out-of-browser detection and capture step to account for cases when the main url is not a web page.
If that's the case, we use curl behind ScoopProxy to intercept exchanges.

This therefore supports our captureTimeout, maxCaptureSize, blocklist and raw exchanges systems out of the box.
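As a rough sketch of what routing that curl call through the proxy can look like (the helper and option names below are illustrative assumptions, not Scoop's actual API; only the curl flags themselves are real):

```javascript
// Illustrative sketch: building the argument list for an out-of-browser curl
// capture routed through the intercepting proxy. `proxyPort` and
// `captureTimeout` are assumed option names, not Scoop's actual API.
function buildCurlArgs (url, { proxyPort, captureTimeout }) {
  return [
    url,
    '--proxy', `http://localhost:${proxyPort}`, // every exchange goes through ScoopProxy
    '--max-time', String(Math.ceil(captureTimeout / 1000)), // curl takes seconds
    '--insecure', // the intercepting proxy re-signs TLS with its own certificate
    '--output', '/dev/null' // the proxy records the exchange; curl's copy is discarded
  ]
}
```

Because the proxy sits in the middle of every exchange, the same blocklist and size accounting that apply to browser traffic apply here too.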

Example: logs when Scoop detects that the main url is a PDF file.

```
[18:38:55] INFO STEP [1/11]: Intercepter
[18:38:55] INFO TCP-Proxy-Server started {"address":"::1","family":"IPv6","port":9000}
[18:38:55] INFO STEP [2/11]: Out-of-browser detection and capture of non-web resource
[18:38:56] WARN Requested URL is not a web page (detected: application/pdf)
[18:38:56] INFO Scoop will attempt to capture this resource out-of-browser
[18:39:01] INFO Resource fully captured (9753770 bytes)
[18:39:01] WARN STEP [3/11]: Initial page load (skipped)
[18:39:01] WARN STEP [4/11]: Browser scripts (skipped)
[18:39:01] WARN STEP [5/11]: Wait for network idle (skipped)
[18:39:01] WARN STEP [6/11]: Scroll-up (skipped)
[18:39:01] WARN STEP [7/11]: Screenshot (skipped)
[18:39:01] WARN STEP [8/11]: Out-of-browser capture of video as attachment (if any) (skipped)
[18:39:01] INFO STEP [9/11]: Provenance summary
[18:39:01] WARN STEP [10/11]: Capture page info (skipped)
[18:39:01] INFO STEP [11/11]: Teardown
[18:39:01] INFO Closing browser and intercepter
```

@matteocargnelutti changed the title from "Refactoring: Interception interruption" to "Refactoring: Interception interruption + Better handling of non-web resources as main url" on Mar 3, 2023
```javascript
//
try {
  const before = new Date()
  const headRequest = await fetch(this.url, {
```
@matteocargnelutti (Collaborator, Author) commented on Mar 3, 2023:

Note: I was tempted to move this entire step after the initial page load in capture(), and use the info we got from the page to make that determination instead of using fetch() to detect Content-Type.

The problem is that page.goto() throws when trying to access non-web resources and wait for page load.
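For context, the detection boils down to inspecting the Content-Type of a lightweight request before involving the browser. A minimal sketch of that decision (the function name and MIME list are illustrative assumptions, not Scoop's actual code):

```javascript
// Illustrative sketch: only these MIME types are treated as "web pages" worth
// loading in the browser; anything else triggers the out-of-browser path.
const WEB_PAGE_MIME_TYPES = ['text/html', 'application/xhtml+xml']

function isWebPage (contentTypeHeader) {
  if (!contentTypeHeader) {
    return true // no Content-Type: let the browser try anyway
  }
  // Strip parameters such as "; charset=utf-8" before comparing.
  const mimeType = contentTypeHeader.split(';')[0].trim().toLowerCase()
  return WEB_PAGE_MIME_TYPES.includes(mimeType)
}
```

A HEAD request's `headers.get('content-type')` would feed this check; `application/pdf`, for instance, comes back false.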

Made it run even on `PARTIAL` state.

This step grabs info about the page from the browser and also tries to grab the favicon if missing (it is not pulled automatically in headless mode).

Replaced the fetch call for doing that with `curl` + `ScoopProxy`, as a way to enforce `PARTIAL` state and the `blocklist`.

Bonus: the favicon is no longer a generated exchange in that case.
Added title and description to WACZ export
Publish allowlist tweak
Replaced by ytDlpHash (privacy concern, TBD)
@matteocargnelutti (Collaborator, Author) commented:
@leppert -- Added to this PR: redaction of ytDlpPath in provenance summary, replaced by checksum of yt-dlp executable.

`Scoop.js` (outdated review thread, resolved)
**Recommended:**
- `captureVideoAsAttachment` option: A Python 3 interpreter should be available for `yt-dlp` to function.
**Other (recommended) system-wide dependencies:**
- `curl`
Collaborator commented:

All roads lead to curl

```diff
   */
  checkAndEnforceSizeLimit () {
    if (this.byteLength >= this.options.maxCaptureSize && this.capture.state === Scoop.states.CAPTURE) {
      this.capture.log.warn(`Max size ${this.options.maxCaptureSize} reached. Ending interception.`)
      this.capture.state = Scoop.states.PARTIAL
-     this.capture.teardown()
+     this.recordExchanges = false
```
Collaborator commented:
The only thing that comes to mind here is that a capture could take longer than it needs to once it hits this limit, since it'll continue loading resources that aren't then captured; but I suppose we need to do that if there's going to be a screenshot or PDF down the line.

@matteocargnelutti (Collaborator, Author) replied:

That's the trade-off indeed. I believe it is worth it: that way, we make sure we can still collect the provenance summary and page info regardless of what happens. Users also still have the captureTimeout lever to make sure things don't take "too" long.

@matteocargnelutti merged commit ae6c26d into main on Mar 7, 2023
@matteocargnelutti deleted the interrupt-refactoring branch on Mar 8, 2023