Load non-HTML resources directly whenever possible #583

ikreymer · 2024-05-24T18:34:45Z

Optimize the direct loading of non-HTML pages. Currently, the behavior is:

- make a HEAD request first
- make a direct fetch request only if the HEAD response is non-HTML and 200
- only use the fetch response if it is non-HTML, 200, and doesn't set any cookies

This changes the behavior to:

- get cookies from the browser for the page URL
- make a direct fetch request with cookies, if provided
- only use the fetch response if it is non-HTML and 200

Also:

- ensure pageinfo is properly set with the timestamp for direct fetch
- remove obsolete Agent handling that is no longer used in default (fetch)

If the fetch request results in HTML, the response is aborted and browser loading is used instead.
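The flow described above can be sketched roughly as follows. This is a minimal illustration, not the crawler's actual code: `BrowserCookie`, `buildCookieHeader`, `isHTMLContentType`, `canUseDirectFetch`, and `tryDirectFetch` are hypothetical names, and treating a missing `Content-Type` as HTML is a conservative assumption made here, not something stated in this PR.

```typescript
// Hypothetical shape for cookies read from the browser for the page URL.
interface BrowserCookie {
  name: string;
  value: string;
}

// Build a Cookie request header value from the browser's cookies.
function buildCookieHeader(cookies: BrowserCookie[]): string {
  return cookies.map((c) => `${c.name}=${c.value}`).join("; ");
}

// Assumption: a missing Content-Type is treated as HTML, so the page
// falls back to browser loading rather than being fetched directly.
function isHTMLContentType(contentType: string | null): boolean {
  if (!contentType) {
    return true;
  }
  const mime = contentType.split(";")[0].trim().toLowerCase();
  return mime === "text/html" || mime === "application/xhtml+xml";
}

// Direct fetch is only kept for 200 responses that are not HTML.
// Redirects (3xx) and errors (4xx/5xx) are left to the browser.
function canUseDirectFetch(status: number, contentType: string | null): boolean {
  return status === 200 && !isHTMLContentType(contentType);
}

type FetchLike = (url: string, init: RequestInit) => Promise<Response>;

// Attempt a direct GET with cookies and the crawler's user-agent.
// Returns the response if it can be used, or null if the caller
// should load the page in the browser instead.
async function tryDirectFetch(
  url: string,
  cookies: BrowserCookie[],
  userAgent: string,
  fetchImpl: FetchLike = fetch,
): Promise<Response | null> {
  const controller = new AbortController();
  const headers: Record<string, string> = { "User-Agent": userAgent };
  const cookieHeader = buildCookieHeader(cookies);
  if (cookieHeader) {
    headers["Cookie"] = cookieHeader;
  }

  const resp = await fetchImpl(url, {
    headers,
    redirect: "manual", // do not follow redirects automatically
    signal: controller.signal,
  });

  if (!canUseDirectFetch(resp.status, resp.headers.get("content-type"))) {
    // Stop downloading the body; the browser will load this page.
    controller.abort();
    return null;
  }
  return resp;
}
```

Passing `fetchImpl` as a parameter is only there to make the sketch easy to exercise without network access; a caller would normally rely on the default global `fetch`.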

Note: initially attempted to handle redirects, but this gets a bit complicated: it may be necessary to follow the redirect to determine whether the request is non-HTML, which would mean buffering those responses and not writing them in case they need to be dropped in favor of browser loading. Erring on the side of caution, 4xx/5xx responses are also still loaded through the browser, just in case.

Other optimization: it is possible to turn on HTTP/2 loading for fetch, as mentioned in nodejs/undici#2750 (comment). The obsolete Agent setup from the http/https modules has been removed.

- drop initial HEAD check (and obsolete 'agent' params to fetch, no longer used)
- load cookies for each page for direct fetch
- attempt direct fetch GET request on every page with cookies + correct user-agent
- abort direct fetch if response is HTML, and then load in browser, otherwise proceed with direct fetch
- ensure direct fetch timestamp is set correctly, populated in pageinfo

tw4l

Very nice! Tested well, and the reorganization is a nice touch.

ikreymer added 5 commits May 23, 2024 10:55

additional cleanup:

1290050

- handle direct fetch redirects via additional fetching
- use manual redirect mode for AsyncFetcher
- fallback to browser for error responses, just in case

make manualRedirect opt configurable

4957396

direct fetch: ensure redirect succeeds before committing direct fetch…

7d0adc4

… redirect records

only handle non-redirects in direct fetch, as redirects records get s…

e8e1cdf

…erialized even if need to redo via browser

ikreymer requested a review from tw4l May 24, 2024 18:34

tw4l approved these changes May 24, 2024

ikreymer merged commit a7d279c into main May 24, 2024
4 checks passed

ikreymer deleted the direct-fetch-optimize branch May 24, 2024 21:51
