Refactoring: Interception interruption + Better handling of non-web resources as main url #115
Conversation
- Makes it possible for certain generated exchanges to skip time / size constraints. Necessary for the provenance summary.
- Refactoring: logic to decide if a step needs to run (see the sketch after this list).
- Refactoring: interruption of requests interception. Instead of calling for capture teardown, we instead turn the `recordExchanges` flag off.
- Refactoring: removed `keepPartialResponses` option. TBD
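To make the step-run refactor concrete, here is a minimal sketch of what per-step run conditions could look like; the step names, state constants and `stepsToRun` helper are illustrative, not Scoop's actual internals:

```js
// Sketch: each capture step declares which capture states it may run in.
// Names and structure are illustrative, not Scoop's actual internals.
const CAPTURE = 'capture'
const PARTIAL = 'partial'

const steps = [
  // Regular steps only run while the capture is fully underway.
  { name: 'captureVideoAsAttachment', states: [CAPTURE] },
  // The provenance summary must run even when the capture was cut short,
  // and its generated exchanges skip the time / size constraints.
  { name: 'provenanceSummary', states: [CAPTURE, PARTIAL] }
]

function stepsToRun (state) {
  return steps.filter(step => step.states.includes(state))
}

console.log(stepsToRun(PARTIAL).map(step => step.name)) // ['provenanceSummary']
```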
```js
try {
  const before = new Date()
  const headRequest = await fetch(this.url, {
```
Note: I was tempted to move this entire step after the initial page load in `capture()`, and use the info we got from the page to make that determination instead of using `fetch()` to detect `Content-Type`. The problem is that `page.goto()` throws when trying to access non-web resources while waiting for the page to load.
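For reference, this kind of pre-browser check boils down to reading the `Content-Type` of a `HEAD` response; a minimal sketch, assuming Node 18+'s global `fetch` and an illustrative helper name:

```js
// Sketch: decide whether the main url is a web page before handing it to the
// browser. `isWebPage` is an illustrative name, not Scoop's actual code.
async function isWebPage (url) {
  try {
    const headRequest = await fetch(url, { method: 'HEAD' })
    const contentType = headRequest.headers.get('content-type') || ''
    return contentType.startsWith('text/html')
  } catch {
    // If the HEAD request fails, let the browser-based capture try anyway.
    return true
  }
}
```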
Made it run even in `PARTIAL` state. This step grabs info about the page from the browser, and also tries to grab the favicon if it's missing (it is not pulled automatically in headless mode). Replaced the fetch call used for that with `curl` + `ScoopProxy`, as a way to enforce `PARTIAL` state and `blocklist`. Bonus: the favicon is no longer a generated exchange in that case.
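Routing that favicon grab through the proxy could look roughly like this; the curl flags are standard, but the surrounding function and parameter names are illustrative:

```js
import { execFile } from 'node:child_process'
import { promisify } from 'node:util'

const execFileAsync = promisify(execFile)

// Sketch: fetch the favicon via curl behind the intercepting proxy, so the
// exchange is recorded and subject to state / blocklist checks. Illustrative.
async function grabFaviconThroughProxy (faviconUrl, proxyHost, proxyPort) {
  await execFileAsync('curl', [
    '--silent',
    '--insecure', // the intercepting proxy re-signs TLS with its own certificate
    '--proxy', `http://${proxyHost}:${proxyPort}`,
    '--output', '/dev/null', // the proxy records the exchange; the local copy is not needed
    faviconUrl
  ])
}
```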
Added title and description to WACZ export
Publish allowlist tweak
Replaced by ytDlpHash (privacy concern, TBD)
@leppert -- Added to this PR: redaction of
**Recommended:**
- `captureVideoAsAttachment` option: A Python 3 interpreter should be available for `yt-dlp` to function.

**Other (recommended) system-wide dependencies:**
- `curl`
All roads lead to curl
```diff
  checkAndEnforceSizeLimit () {
    if (this.byteLength >= this.options.maxCaptureSize && this.capture.state === Scoop.states.CAPTURE) {
      this.capture.log.warn(`Max size ${this.options.maxCaptureSize} reached. Ending interception.`)
      this.capture.state = Scoop.states.PARTIAL
-     this.capture.teardown()
+     this.recordExchanges = false
```
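The practical effect: intercept hooks keep running so the page can finish loading (for the screenshot / PDF steps), but stop recording. Schematically, with illustrative names rather than the module's actual code:

```js
// Sketch: an interceptor that consults the flag before recording exchanges.
// Class, method and property names are illustrative, not Scoop's internals.
class InterceptorSketch {
  constructor (capture) {
    this.capture = capture
    this.recordExchanges = true
  }

  onResponseIntercepted (exchange) {
    if (!this.recordExchanges) {
      return // traffic still flows, but is no longer added to the capture
    }
    this.capture.exchanges.push(exchange)
  }
}
```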
The only thing that comes to mind for this is that a capture could take longer than it needs to once it hits this limit since it'll continue loading resources that aren't then captured, but I suppose we need to do that if there's going to be a screenshot or PDF down the line.
That's the trade-off indeed. I believe it is worth it: that way, we make sure we can still collect the provenance summary and page info regardless of what happens. Users also still have the `captureTimeout` lever to make sure things don't take "too" long.
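For reference, both levers are plain capture options; a minimal usage sketch, with illustrative values and assuming the `@harvard-lil/scoop` package name:

```js
import { Scoop } from '@harvard-lil/scoop'

// Sketch: cap both wall-clock time and total intercepted bytes.
const capture = await Scoop.capture('https://example.com', {
  captureTimeout: 60 * 1000, // ms: hard ceiling on the whole capture
  maxCaptureSize: 200 * 1024 * 1024 // bytes: past this, exchanges stop being recorded
})
```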
Edit: Extended the scope of this PR to include both #113 and #112
Refactoring: Interception interruption

As discussed, this is necessary for the provenance summary, to which this new rule applies. Instead of calling for `capture.teardown()`, we instead turn the `recordExchanges` flag off. Removed the `keepPartialResponses` option. TBD

Better handling of non-web resources as main url
Scoop might be used to capture non-web resources (`.pdf`, `.docx`, `.mp4`, etc.) directly, as opposed to as part of a web page. When the main url to capture is not a web page, using the browser to do so is more a problem than a solution: for example, Chromium does not support rendering PDFs in headless mode.

In this PR, I added an out-of-browser detection and capture step to account for cases where the main url is not a web page. If that's the case, we use `curl` behind `ScoopProxy` to intercept exchanges. This therefore supports our `captureTimeout`, `maxCaptureSize`, `blocklist` and raw exchanges systems out of the box.

Example: logs when Scoop detects that the main url is a PDF file.
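Put together, the detection-and-capture handoff might look roughly like the following; only the general flow and the curl-behind-proxy idea come from this PR, while all names and structure here are illustrative:

```js
import { execFile } from 'node:child_process'
import { promisify } from 'node:util'

const execFileAsync = promisify(execFile)

// Sketch: if the main url is not a web page, capture it out of browser via
// curl behind the proxy. Names are illustrative, not Scoop's actual code.
async function maybeCaptureOutOfBrowser (url, proxyHost, proxyPort, timeoutMs) {
  const head = await fetch(url, { method: 'HEAD' })
  const contentType = head.headers.get('content-type') || ''

  if (contentType.startsWith('text/html')) {
    return false // web page: proceed with the browser-based capture
  }

  // Non-web resource: curl through the proxy, so the timeout, size limit and
  // blocklist systems all apply to the intercepted exchange.
  await execFileAsync('curl', [
    '--silent', '--insecure',
    '--proxy', `http://${proxyHost}:${proxyPort}`,
    '--output', '/dev/null',
    url
  ], { timeout: timeoutMs })

  return true
}
```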