Google Scholar: Add delays between consecutive network requests. #3043
Conversation
I am currently using 500ms as the (minimum) delay between successive requests. But this may not be enough. Sometimes, I still get a 429. There are a few more ideas for playing even nicer with the server:
I'll do some experiments and see if we can catch a 429 response, and if we can adaptively handle that without hard-failing.
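The adaptive handling floated here could look something like the following sketch. This is not the translator's actual code: `requestWithBackoff` and the error's `status` property are hypothetical stand-ins for whatever request wrapper would be used.

```javascript
// Hypothetical sketch: retry a request function with exponential backoff
// when the server answers 429, instead of hard-failing on the first one.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function requestWithBackoff(doRequest, { baseDelay = 500, maxRetries = 3 } = {}) {
	for (let attempt = 0; ; attempt++) {
		try {
			return await doRequest();
		}
		catch (error) {
			// Only back off on rate-limiting; re-throw anything else
			if (error.status !== 429 || attempt >= maxRetries) throw error;
			// 500ms, 1s, 2s, ... plus a little jitter to avoid lockstep retries
			await sleep(baseDelay * 2 ** attempt + Math.random() * 100);
		}
	}
}
```

If Google ever did send a Retry-After header, the sleep duration could be taken from it instead of the doubling schedule.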
Oh, Google almost certainly doesn't include a Retry-After. That would be offered for a documented public API, not a site like Google trying to stop bots. Approximately how many items were you able to save before you got a 429 with the 500ms delay? Something like the token idea might help, but I think we'd have to implement it in the Connector itself and enforce it based on HTTP request domain, since we'd want it to apply across pages. And we wouldn't do it with
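A minimal sketch of the per-domain "token" idea mentioned here, assuming it lived in the Connector (where a class would be acceptable) rather than in a translator. All names are hypothetical:

```javascript
// Hypothetical Connector-side throttle: serialize requests per hostname and
// enforce a minimum spacing between them, so the limit applies across pages.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

class DomainThrottle {
	constructor(minIntervalMs = 1000) {
		this.minIntervalMs = minIntervalMs;
		this.queues = new Map(); // hostname -> tail promise of that domain's queue
	}

	// Schedule fn so that calls for the same hostname run one after another,
	// at least minIntervalMs apart; different hostnames don't block each other.
	schedule(url, fn) {
		const host = new URL(url).hostname;
		const tail = this.queues.get(host) || Promise.resolve();
		const result = tail.then(fn);
		// The next request for this host waits for this one plus the spacing delay
		this.queues.set(host, result.catch(() => {}).then(() => sleep(this.minIntervalMs)));
		return result;
	}
}
```

Because the queue is keyed by hostname, two saves from different tabs would still share the same scholar.google.com queue, which is the cross-page behaviour described above.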
I have great difficulty getting any consistent behaviour. Now I'm getting a different error. Moreover, there are too many moving parts (IP address, browser/version, prior interactions with the servers, etc.) for me to deduce anything for now. Sometimes, I am blocked on the first request, which is a GET for the citation-info page fragment (containing the BibTeX link), with the URL pattern shown here:

Here's some explanation of what we're trying to do, for anyone looking into this. On a GS results page (e.g. this one), there's a "Cite" button under each result item. In the web app UI, clicking "Cite" initiates an XHR to obtain an HTML fragment, which the web app uses to display the citation info (and BibTeX link) in the style of a bubble. It is this XHR that the translator tries to emulate by requesting the same resource.

The differences between the webapp XHR and the translator-initiated one include the following. First and most obvious, the "native" one's request URL looks like this:
while in ours, the query parameters differ. In addition, the native XHR is sent with the "correct" headers (including the Referer).
In Chrome, switch on Developer mode and view the extension's background page to see the translator requests. In Firefox, use about:debugging.
Yeah, we've always had the view that it's essentially a black box and not something we can really solve in a reliable way. But anything in the direction of looking more human-like is likely to help to some extent. That may mean a 1s minimum rather than 500ms, at least between items.
When selecting multiple items to save, I'm still having issues. With the changes below, in Firefox the referrers were indeed applied, but in Chromium they weren't. (EDIT: The inability to set the Referer in Chromium explains the difference.)
Thanks for working on this. Did you still want my input on something? I'm not quite sure I understand on what. If it's exact strategies on constructing the URLs, I have no recollection. I (or whoever last changed this) almost certainly did exactly what you're doing now, replicating the behavior of the website, which may well have changed.
Thanks! Indeed I'm following the same principle, emulating the behaviour of the site's requests.
I'm working on a better solution that cuts down the number of requests we have to send to Google servers, reducing the server load and hence the likelihood of a block/captcha. Namely, if the main link to the article contains an extractable identifier (DOI, arXiv ID, etc.), and if that identifier alone is sufficient for us to retrieve the item metadata by means of a search translator (i.e. no rendering of the actual page needed), we will delegate to that and bypass Google. For the rest of the links, we will fall back to Google Scholar. And for those 3rd-party resolution attempts that fail, we will also use the GS fallback. In any case, network requests will be interspersed with delays.
When there are multiple items to be translated, the previous implementation fired network requests to the Google Scholar server in rapid succession. This increases the likelihood of being rate-limited or denied by the server. To fix this, an artificial delay of at least 500ms is added between successive network requests. In addition, calls to the callback-based `doGet()` are replaced with promise-based asynchronous calls. A few stylistic fixes (such as `===` vs. `==`) are also included. Resolves zotero#3037.
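The pattern this commit message describes might be sketched as follows. `requestDocument` is passed in as a placeholder for a promise-based request helper rather than taken from the real Zotero translator API, and the extraction step is elided:

```javascript
// Sketch of "promise-based calls with a minimum delay between requests".
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));
const MIN_DELAY_MS = 500;

async function scrapeAll(urls, requestDocument) {
	const docs = [];
	for (let i = 0; i < urls.length; i++) {
		if (i > 0) await sleep(MIN_DELAY_MS); // space out consecutive requests
		docs.push(await requestDocument(urls[i])); // ...then extract citation info
	}
	return docs;
}
```

The key difference from the callback-based `doGet()` version is that `await` makes the requests strictly sequential, so the delay between them is actually enforced.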
- The delay between requests of successive entries is increased to 1s.
- Add the Referer header to the requests for citation-info page and BibTeX text, emulating the in-browser behaviour.
- Update the citation-info pages' URL pattern to better match the ones used during in-browser navigation.
- Added comments accordingly.
- Add the missing delays between requests when processing the URL listing on the profile.
- Refactoring and minor fixes for maintainability:
  - Make the delay between scrapes a constant instead of a hard-coded number.
  - Refactor scrape() so that it always accepts a URL as the second parameter. The logic for determining what to do (batch vs. single scraping) is moved to the caller.
  - Prefer "let" over "const" when the latter doesn't add to clarity.
These changes, introduced in 4c53d52, are reverted. Other uses of `===` in the same file, predating said commit, are not touched.
- For the citation-info page request, inherit the language parameter from the browsing context in the URL.
- For the BibTeX request, use the correct TLD in the hostname part of the referrer.
To reduce the load on Google Scholar servers, and therefore the likelihood of getting an error or captcha, we use the DOI and arXiv identifiers (inferred from article title links) with the corresponding search translators when possible. The Google Scholar "native" scrape is used as a fallback when the external search fails. The requests to different services are made concurrently, but within each service requests are serialized, and consecutive requests are interspersed with delays. If there are items left untranslated at the end of this process, an error is raised.
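A rough, hypothetical sketch of the fallback order described in this commit message. The search helpers are injected placeholders, and the ID-extraction regexes are deliberately naive, not the translator's actual patterns:

```javascript
// Naive ID extraction from a URL; the real patterns are more involved.
const extractDOI = url => (url.match(/10\.\d{4,9}\/[^\s?#]+/) || [null])[0];
const extractArXivID = url => (url.match(/arxiv\.org\/(?:abs|pdf)\/(\d{4}\.\d{4,5})/i) || [, null])[1];

// Try a DOI search, then an arXiv search, then native Google Scholar scraping.
async function translateRow(row, { searchByDOI, searchByArXiv, scrapeGoogleScholar }) {
	const doi = extractDOI(row.articleURL || '');
	if (doi && await searchByDOI(doi)) return 'doi';
	const arxivID = extractArXivID(row.articleURL || '');
	if (arxivID && await searchByArXiv(arxivID)) return 'arxiv';
	// Last resort: hit Google Scholar itself
	await scrapeGoogleScholar(row);
	return 'google-scholar';
}
```

Each successful external search avoids touching Google Scholar for that item entirely, which is where the reduction in server load comes from.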
- Add the ability to delay in-pipeline tasks on the leading edge in addition to the trailing one (for profile pages).
- Add a choice to control row-generation delays (not necessary for search-page scraping given the in-pipeline delays).
- Cancel the trailing-edge delays for the last row to translate, to improve responsiveness.
- No longer require work functions to signal their completion status by return value.
- Increased the default delay for GS native requests to 2 seconds.
(force-pushed from 5f3fa4e to abaedc0)
Force-pushed after rebase. The logic of applying translators has changed significantly. We now use a "pipelined" chain of DOI search -> arXiv search -> "native" Google Scholar scraping, where each selected item goes through the chain in that order and stops propagating upon the first successful translation. In other words, the translation method using Google Scholar network resources is used as a last resort. Within each pipeline, the requests to network resources are queued with suitable delays between successive ones. But the three pipelines work concurrently, without waiting for each other. In the end, if there are items left untranslated after going through the full chain, an error is thrown.

The DOI and arXiv searches work by examining the item's link for candidate ID strings. When successful, each such translation saves two requests to GS (one for the citation-info page fragment, the other for the BibTeX file). The DOI was chosen because it is often part of the article's link URL (easy to obtain without further requests for search results). Another benefit of this approach is a unified way to add additional pipelines should this be further extended.

There are some corner cases that need addressing, including saving the article text or snapshot when the DOI is used, and the handling of citation-only entries (without a title link). Overall, this does reduce the likelihood of getting a 403 or captcha while using Google Scholar. However, I'm not under the impression that we can completely prevent that without using an external API for "resolving" the citations to objects (something like Crossref's API, which can be queried using title, author, and other info).
Replace all occurrences of the space character with the plus sign (+), not just the first one.
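For illustration, the underlying JavaScript pitfall this commit fixes: `String.prototype.replace` with a string pattern substitutes only the first match, so a global regex (or `split`/`join`) is needed. The function name is hypothetical:

```javascript
// Encode spaces as '+' everywhere in the string, not just the first one.
function spacesToPlus(s) {
	return s.replace(/ /g, '+'); // not s.replace(' ', '+'), which stops after one
}
```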
Other minor fixes:
- Some property names are revised to be more descriptive and better follow convention.
- The title of the attachment obtained from the GS link now appears as "Full Text (via Google Scholar)" instead of merely "Full Text". The GS-linked version may not be the VoR.
- Minor comment fixes.
Haven't tried this or studied the logic, but some quick style/clarity/policy notes
Google Scholar.js
// attach linked document as attachment if available
if (row.attachmentLink) {
	let attachment = {
		title: "Full Text (via Google Scholar)",
"(via Google Scholar)" isn't really appropriate for the title. What's the concern? That Google Scholar may link to an unofficial copy of the PDF?
These are the links appearing on the right of the search results page, beside each entry. Google seemed to prefer a publicly available version that matches the biblio entry, for example, personal archived versions hosted by the researchers' institutions, preprints, and files hosted on social media (for instance ResearchGate, Academia.edu, which may have some sort of data sharing with GS). My feeling was that this should be distinguished from the usual "Full Text" files. I was thinking about "Article Preview" but it might be misleading, suggesting a partial view which is often not the case.
How about simply "Linked Document"?
Hmm, yeah. This is tough.
I think "Linked Document" is too vague. Linked from where?
For files from Unpaywall, we use "Submitted Version", "Accepted Version", and "Published Version", but we don't know what those are here. Could be "Google Scholar Linked Version", but that's not really appropriate for the same reason "Full Text (via Google Scholar)" isn't really appropriate — the fact that it came from Google Scholar isn't really relevant to what it is. "Available Version"? "Publicly Available Version"?
FWIW, I actually liked the (via Google Scholar) option because it allows quick searching/filtering by the attachment title and also makes clear to users why they get a 'wrong' (e.g., old working paper) PDF. Otoh, with "Available Version" it's unclear where it's available from and, where it's wrong, it looks like Zotero mishandled it.
That's fair. I guess the practical aspect of this outweighs any conceptual objections to putting that in the title. OK, @zoe-translates, so we can revert to that.
@dstillman @adam3smith these ideas aren't contradictory. Can we have "Available Version via Google Scholar"? This is a bit long, but it should be more accurate (the earlier iteration said "full text", but this may be inaccurate).
In eff689d I decided to use "Available Version via Google Scholar". This will still make search and filtering easier, while being more accurate about the file's provenance. I hope we can close this fairly soon so we can focus on further improvements.
OK, sorry, let's go with "Available Version (Google Scholar)". "Available Version via Google Scholar" is vaguely ungrammatical, but "Available Version" is nice and consistent, and "(Google Scholar)" makes sense as a sort of tag. Put them together and we have a winner.
- Use more descriptive variable names.
- Corrected misleading comments.
- More accurate type annotation in JSDoc comments.
- Various formatting changes for code readability.
- Rename the linked document that appears on the right side of a search result/citation page "Available Version" (see https://scholar.google.com/intl/en/scholar/help.html#access).
- Remove the "Snapshot" attachment.
Settled on "Available Version via Google Scholar"
In addition to the dot (.), try the slash (/) as a possible marker separating the candidate DOI from common file-type extensions such as "pdf", "html", etc. A test case is added.
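A hedged illustration of this heuristic (not the translator's actual regex or function name): treat either a dot or a slash as the separator before a trailing file-type extension.

```javascript
// When a candidate DOI is pulled from a URL, trim a trailing file-type
// extension whether it follows a dot or a slash (e.g. "...abc.pdf" or "...abc/pdf").
function trimDOIExtension(candidate) {
	return candidate.replace(/[./](?:pdf|html?|xml)$/i, '');
}
```

The list of extensions here is an assumption; the point is only the `[./]` character class covering both separators.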
It's unnecessary to keep an array of all task promises just to synchronize with the end of all pipelines' processing. The end of each pipeline's queue is sufficient.
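The point can be illustrated with a toy queue (names hypothetical): awaiting each queue's tail is equivalent to awaiting every task pushed onto it, so a separate array of task promises is redundant.

```javascript
// Minimal promise queue: the tail settles only after every task enqueued
// before it has run, so awaiting the tail awaits the whole queue.
function makeQueue() {
	let tail = Promise.resolve();
	return {
		push(task) { tail = tail.then(task); return tail; },
		get tail() { return tail; },
	};
}
```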
- Remove unnecessary ZU namespace reference
- Don't break function definitions into two lines
Google Scholar.js
 * @property {Z.Translate<Z.SearchTranslator>} translator
 * @property {string} key
 */
function ExternalSearch(searchKey, translatorUUID, doAttach = false) {
So echoing #3062 (comment), as a general rule, let's just say no constructors in translators. There are likely a few bad examples of this in existing translators (RIS comes to mind), but most translators just use functions and it is good and fine.
There's just too much engineering here. Translators need to be simple and easy to understand. There can't be constructors, methods, pipelines, factory functions, generators — it's just not appropriate for a translator. I understand that you're trying to do some things here that other translators don't have to deal with, but if the logic needs to be slightly less optimized to keep the translator simpler, so be it.
@dstillman, could you provide some concrete criteria for the Google Scholar translator code you'd like to accept? Am I correct in understanding that you want the following for the code itself:
And the behaviour of the code should be like this:
Are there any other requirements I missed? Please let me know, thanks. (A lot of the current complexity comes from the desire to optimize for concurrency [so the user perceives progress] and from possible future expansion to include more 3rd-party services. Should I forgo those, for reasons 1-3 above?)
Yes, that's all correct. Optimizing for concurrency is a worthy goal, but not at the expense of readability and maintainability. I suspect there are some simpler ways we can achieve some level of concurrency without so much extra machinery.
Hi, thanks!
This is a bit tricky. It's analogous to writing without using a particular glyph, as in a lipogram. But that won't stop us from trying, right? (Jest aside, suggestions are ever appreciated!)
- To reduce the cognitive load on maintainers and present the control flow in a more straightforward way, the code for the fallback-scraping technique is rewritten to eliminate classes (constructors), dynamically generated and chained promises, and higher-order helper functions.
- Currently, this results in some loss of concurrency; there are a few techniques to compensate for that, TBD.
- The attachment is now titled "Available Version (via Google Scholar)".
- When the item is obtained from external search, do not add "via Google Scholar" to the Library Catalog field.
- Some test cases are slightly modified to reflect these changes.
Re: 47957c8 - a few
- Don't wait for the row to pass through the DOI -> arXiv -> GS processing inside the loop over the rows, but do wait between adjacent row iterations. This will help improve the perception of progress.
- Error-reporting improvements.
- Fix some typos.
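The scheduling change described above might be sketched as follows, with hypothetical helpers: start each row's processing without awaiting its completion, but still delay between starts.

```javascript
// Kick off each row's DOI -> arXiv -> GS chain without awaiting it inside the
// loop; only the spacing between row *starts* is awaited.
const sleep = ms => new Promise(resolve => setTimeout(resolve, ms));

async function processRows(rows, translateRow, delayMs = 500) {
	const pending = [];
	for (let i = 0; i < rows.length; i++) {
		if (i > 0) await sleep(delayMs); // space out row starts, not completions
		pending.push(translateRow(rows[i]));
	}
	return Promise.all(pending);
}
```

Because later rows no longer wait for earlier rows to finish the whole chain, the first results appear sooner, which is the perceived-progress improvement.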
As of 504c50e, the code now does what it's supposed to do (delegating to external searches, running with some level of concurrency) using largely elementary constructs (no constructors, no factories). I inherited a large part of the code about this