Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Load non-HTML resources directly whenever possible #583

Merged
merged 5 commits into from
May 24, 2024

Conversation

ikreymer
Copy link
Member

@ikreymer ikreymer commented May 24, 2024

Optimize the direct loading of non-HTML pages. Currently, the behavior is:

  • make a HEAD request first
  • make a direct fetch request only if HEAD request is a non-HTML and 200
  • only use fetch request if non-HTML and 200 and doesn't set any cookies

This changes the behavior to:

  • get cookies from browser for page URL
  • make a direct fetch request with cookies, if provided
  • only use fetch request if non-HTML and 200
    Also:
  • ensures pageinfo is properly set with timestamp for direct fetch.
  • remove obsolete Agent handling that is no longer used in default (fetch)

If fetch request results in HTML, the response is aborted and browser loading is used.

Note: initially attempted to handle redirects, but gets a bit complicated since many need to follow the redirect check to determine if request is non-HTML, and would need to buffer those responses, and not write them in case they need to be dropped in favor of browser loading. Erring on side of caution of requiring 4xx/5xx responses to still be loaded through browser, just in case.

Other optimization: it is possible to turn on http/2 loading for fetch, as mentioned in: nodejs/undici#2750 (comment) (the obsolete Agent setup from http/https modules has been removed)

ikreymer added 5 commits May 23, 2024 10:55
- drop initial HEAD check (and obsolete 'agent' params to fetch, no longer used)
- load cookies for each page for direct fetch
- attempt direct fetch GET request on every page with cookies + correct user-agent
- abort direct fetch if response is HTML, and then load in browser, otherwise proceed with direct fetch
- ensure direct fetch timestamp is set correctly, populated in pageinfo
- handle direct fetch redirects via additional fetching
- use manual redirect mode for AsyncFetcher
- fallback to browser for error responses, just in case
@ikreymer ikreymer requested a review from tw4l May 24, 2024 18:34
Copy link
Member

@tw4l tw4l left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! Tested well, and the reorganization is a nice touch.

@ikreymer ikreymer merged commit a7d279c into main May 24, 2024
4 checks passed
@ikreymer ikreymer deleted the direct-fetch-optimize branch May 24, 2024 21:51
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants