CheerioCrawler - add got-scraping headers persistence per session. #1008
Comments
I think we can fix this in gotScraping.prepareHeaders('unique-session-id')
await gotScraping({
...,
headersSessionId: 'unique-session-id'
})
// perform all the requests
gotScraping.removeHeaders('unique-session-id')
What if a request fails? When should the generated headers change? |
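To make the proposal above concrete, here is a minimal sketch of a header store keyed by a string session ID. It is only an illustration: `prepareHeaders` and `removeHeaders` mirror the names proposed in the comment and are not an existing got-scraping API, and `generateHeaders` is a hypothetical stand-in for the real header generator.

```javascript
// Hypothetical sketch: none of these functions exist in got-scraping.
const headersBySession = new Map();

// Stand-in for a real header generator; returns a new header set on every call.
let generatedCount = 0;
function generateHeaders() {
    generatedCount += 1;
    return { 'user-agent': `ExampleAgent/${generatedCount}` };
}

// Return the stored headers for a session, generating them on first use.
function prepareHeaders(sessionId) {
    if (!headersBySession.has(sessionId)) {
        headersBySession.set(sessionId, generateHeaders());
    }
    return headersBySession.get(sessionId);
}

// Forget the stored headers, for example when the session is retired.
function removeHeaders(sessionId) {
    return headersBySession.delete(sessionId);
}

const first = prepareHeaders('unique-session-id');
const second = prepareHeaders('unique-session-id');
console.log(first === second); // true, the same headers are reused

removeHeaders('unique-session-id');
const third = prepareHeaders('unique-session-id');
console.log(third === first); // false, new headers after removal
```

With a plain `Map`, the caller is responsible for calling `removeHeaders`; entries that are never removed stay in memory for the lifetime of the process.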
I'm not sure if it makes sense to add this to the The So technically, yeah. The ability to save the headers and then remove them would be compatible with how
sessionPool.on('sessionRetired', (s) => gotScraping.removeHeaders(s.id)) |
I think about this more like an SDK feature than a |
On the other hand, even if we forget about the whole |
const http2 = require('http2-wrapper');
// Works without .default as well but with .default IDEs give hints as it's a TypeScript module
const gotScraping = require('.').default;
const { context } = gotScraping.defaults.options;
const instance = gotScraping.extend({
context: {
headers: {
1: context.headerGenerator.getHeaders({
httpVersion: '1',
...context.headerGeneratorOptions,
}),
2: context.headerGenerator.getHeaders({
httpVersion: '2',
...context.headerGeneratorOptions,
}),
},
useHeaderGenerator: false,
},
hooks: {
beforeRequest: [
async (options) => {
const { url } = options;
const protocol = await http2.auto.resolveProtocol({
host: url.hostname,
port: url.port || 443,
rejectUnauthorized: false,
ALPNProtocols: ['h2', 'http/1.1'],
servername: url.hostname,
});
let headers;
if (protocol === 'h2') {
headers = options.context.headers[2]; // eslint-disable-line prefer-destructuring
options.ALPNProtocols = ['h2'];
} else {
headers = options.context.headers[1]; // eslint-disable-line prefer-destructuring
options.ALPNProtocols = ['http/1.1'];
}
// TODO: proper merge
Object.assign(options.headers, headers);
},
],
},
});
console.log(instance.defaults.options.context);
(async () => {
// httpbin.org always normalizes the headers to Pascal-Case
const response = await instance('https://httpbin.org/anything', {
responseType: 'json',
}).on('request', (request) => {
request.once('response', () => {
console.log(request.socket.alpnProtocol);
});
});
console.log(response.body.headers['User-Agent']);
})(); |
This is a good showcase of the implementation, but more importantly we need to see what the user interface would look like. |
It would also be good for the headers to be related to each other. E.g. if h1 contains |
I don't think that's needed. I might be wrong, but I assume a website either accepts H2 or doesn't. So once we use H2 we will always use H2 for that website. No need to have matching H1 headers then. |
I think that |
I think it's a bit overcomplicated. We can do it two ways IMO.
Personally I would go with 1. because we want to replace SessionPool with UserPool in the future so who knows how it will look. |
If the
// in Session
this.sessionToken = new Object();
// in `requestAsBrowser` or somewhere else appropriate
gotScraping(..., { context: { sessionToken } }) |
Technically, we could use the whole |
Yeah, that would work as well! 👍🏼
The best way would be to use a symbol as the key. But that's not possible, they can either use
const sessionToken = {}; // it doesn't matter what the object is, it just has to be non-primitive
gotScraping(..., { context: { sessionToken } })
Cause it's rare. But that's the point of a |
Yeah, I know WeakMaps. I'm only concerned that we're trying a bit too hard to use them. When designing APIs we should put the user first. And I think we can safely say that most users expect string IDs. Anything else would be counter-intuitive and would need to be explained in docs. For us, WeakMaps are the "easy way out". But I'm not convinced (yet) it's the best approach for the users. |
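For readers following the token discussion, here is a rough sketch of the WeakMap variant. The names `headersByToken`, `headersFor` and `generateHeaders` are made up for illustration and are not part of got-scraping. The point is that headers are keyed by the token object itself, so no explicit removal call is needed once the token becomes unreachable.

```javascript
// Hypothetical sketch of the token-based variant; not got-scraping's API.
const headersByToken = new WeakMap();

// Stand-in for a real header generator.
let generatedCount = 0;
function generateHeaders() {
    generatedCount += 1;
    return { 'user-agent': `ExampleAgent/${generatedCount}` };
}

// Look up headers by token identity, generating them on first use.
function headersFor(sessionToken) {
    // This sketch only accepts objects as tokens; strings would need a Map.
    if (typeof sessionToken !== 'object' || sessionToken === null) {
        throw new TypeError('sessionToken must be an object');
    }
    let headers = headersByToken.get(sessionToken);
    if (!headers) {
        headers = generateHeaders();
        headersByToken.set(sessionToken, headers);
    }
    return headers;
}

const tokenA = {};
const tokenB = {};

console.log(headersFor(tokenA) === headersFor(tokenA)); // true, same token
console.log(headersFor(tokenA) === headersFor(tokenB)); // false, different tokens
```

The trade-off raised in the thread still applies: the caller has to keep the same object reference around, which is less familiar than passing a string ID.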
I don't think strings are the easy way either. We'd need to explain how to generate them so they're unique. See https://nodejs.org/api/crypto.html#crypto_crypto_randomuuid_options |
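For reference, generating a unique string ID with the function linked above is short. This sketch assumes a Node.js version that ships `crypto.randomUUID`; the variable names are only illustrative.

```javascript
const { randomUUID } = require('crypto');

// Each call returns a new random RFC 4122 version 4 UUID string.
const sessionId = randomUUID();
const otherSessionId = randomUUID();

console.log(typeof sessionId); // string
console.log(sessionId.length); // 36
console.log(sessionId === otherSessionId); // false
```

Unlike an object token, a string like this can be serialized and passed between processes.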
I prefer |
What if you can't keep the object around to reuse, like inside of a stateless server environment? |
Describe the feature
If cookie persistence is on, the request headers should be persisted in the session userData object. If no headers are found, got-scraping will work with its default settings. If headers are found, the stored headers will be reused to stay consistent.
Inspiration
Motivation
got-scraping automatically generates headers for each request, which means that even if the SessionPool cookie persistence is enabled, it generates new headers over and over, which can cause serious blocking issues.
Constraints
I don't know of any.
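A minimal sketch of the behavior requested under "Describe the feature", assuming a session is any object with a `userData` property. `headersForSession` and `generateHeaders` are hypothetical helpers, not SDK or got-scraping functions, and the `headers` key inside `userData` is an arbitrary choice for this example.

```javascript
// Stand-in for the header generator; a real one would return browser-like headers.
let generatedCount = 0;
function generateHeaders() {
    generatedCount += 1;
    return { 'user-agent': `ExampleAgent/${generatedCount}` };
}

// Reuse headers stored on the session, generating them only when none are found.
function headersForSession(session) {
    if (!session.userData.headers) {
        session.userData.headers = generateHeaders();
    }
    return session.userData.headers;
}

// A plain object standing in for a session from the pool.
const session = { userData: {} };

const firstRequestHeaders = headersForSession(session);
const secondRequestHeaders = headersForSession(session);
console.log(firstRequestHeaders === secondRequestHeaders); // true

// A different session gets its own headers.
const otherSession = { userData: {} };
console.log(headersForSession(otherSession) === firstRequestHeaders); // false
```

Because the headers live on the session object, they would be dropped together with the session when it is retired.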