-
Notifications
You must be signed in to change notification settings - Fork 681
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: resilient sitemap loading #2619
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Nice work, just a couple of comments on tests. I'll review this more thoroughly when tests pass.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty cool stuff!
method: 'GET', | ||
timeout: networkTimeouts, | ||
headers: { | ||
'accept': 'application/xhtml+xml,application/xml,text/plain', |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can't this make a well-behaved server not return a .gz file? I'd say Content negotiation failed in this case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is (a shortened) version of what regular web browsers send in the accept
header. I added a */*
directive to allow for any type the server has. The accept-encoding
header is something got-scraping
sets itself, as it depends on the HTTP client capabilities.
I say let's monitor for errors like this, worst case, we know where to look :)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do browsers send that for files that end in .gz as well? It sounds fishy...
Adds:
async SitemapRequestList.teardown()
Allows users to signal that they won't need the
SitemapRequestList
instance anymore (so the sitemap loading can stop and trash all pending URLs). Internally, this unblocks the waitingpushNextUrl
calls and allows the runtime environment to terminate correctly.On-error retries for
parseSitemap()
Network errors do not crash the
parseSitemap()
orSitemapRequestList.load()
calls anymore. Instead, the sitemap load is retried (plus, withSitemapRequestList
, the URLs aren't duplicated if the error happened somewhere in the middle of the sitemap load).Parametrizable network timeouts for
parseSitemap()
Both
parseSitemap()
andSitemapRequestList.load()
calls now accept got-like timeout object (see more details here). This can help with issues described in #2577 .Closes #2577
Closes #2617