
Google Scholar: Add delays between consecutive network requests. #3043

Merged
22 commits merged into zotero:master on Sep 23, 2023

Conversation

zoe-translates
Collaborator

When there are multiple items to be translated, the previous implementation fires network requests to the Google Scholar server in rapid succession. This increases the likelihood of being rate-limited or denied by the server.

To fix this, an artificial delay of at least 500ms is added between successive network requests.

In addition, calls to the callback-based doGet() are replaced with promise-based asynchronous calls.

A few stylistic fixes (such as === vs. ==) are also included.

Resolves #3037.
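
As a rough sketch, the delay logic amounts to something like the following (sleep() and the constant name are illustrative, not necessarily the identifiers used in the diff):

const DELAY_BETWEEN_REQUESTS_MS = 500; // minimum gap between successive requests

function sleep(ms) {
	return new Promise(resolve => setTimeout(resolve, ms));
}

async function scrapeAll(urls) {
	let first = true;
	for (let url of urls) {
		if (!first) {
			await sleep(DELAY_BETWEEN_REQUESTS_MS); // throttle between items
		}
		first = false;
		// promise-based request, replacing the callback-based doGet()
		let doc = await requestDocument(url);
		await scrape(doc, url);
	}
}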

@zoe-translates
Collaborator Author

I am currently using 500ms as the (minimum) delay between successive requests. But this may not be enough.

Sometimes, I get a 429 Too Many Requests response, and perhaps I could inspect that and see if there's a Retry-After: header with it, which may be helpful. (Currently, the requestXXX helpers reject on any non-2xx status.)

There are a few more ideas for playing even nicer with the server:

  • Use longer delays, perhaps a second or more.
  • Add some randomness to the baseline delay; see the Python package scholarly.
  • Use a queue of requests throttled by a "pool of tokens", as in the Google Chrome extension "Scholar H-index Calculator": each request consumes a token (i.e. decreases a counter), while the total number of tokens is slowly replenished by setInterval but never exceeds a maximum; see also the Stack Exchange discussion. This would make the code a lot more complex, though (a sketch follows below).

I'll do some experiments and see if we can catch a 429 response, and if we can adaptively handle that without hard-failing.
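
For illustration, a minimal sketch of the token-pool idea (all numbers and names here are made up, and this is not code from this PR; as noted below, this kind of throttling may belong in the Connector rather than in the translator):

const MAX_TOKENS = 5;
const REFILL_MS = 2000; // one token replenished every 2 s, capped at MAX_TOKENS
let tokens = MAX_TOKENS;
let lastRefill = Date.now();

function refillTokens() {
	let newTokens = Math.floor((Date.now() - lastRefill) / REFILL_MS);
	if (newTokens > 0) {
		tokens = Math.min(MAX_TOKENS, tokens + newTokens);
		lastRefill += newTokens * REFILL_MS;
	}
}

async function throttledRequest(url) {
	refillTokens();
	while (tokens < 1) { // pool exhausted; wait for replenishment
		await sleep(REFILL_MS);
		refillTokens();
	}
	tokens--; // each request consumes a token
	return requestDocument(url);
}

Randomizing the baseline delay, as scholarly does, could be as simple as await sleep(BASE_MS + Math.random() * JITTER_MS).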

@dstillman
Member

Oh, Google almost certainly doesn't include a Retry-After. That would be offered for a documented public API, not a site like Google trying to stop bots.

Approximately how many items were you able to save before you got a 429 with the 500ms delay?

Something like the token idea might help, but I think we'd have to implement it in the Connector itself and enforce it based on HTTP request domain, since we'd want it to apply across pages. And we wouldn't do it with setInterval — it would just have to be mathematical based on an in-memory record of recent translator HTTP requests for that domain or URL pattern. So we can do that separately from this commit.

@zoe-translates
Collaborator Author

Approximately how many items were you able to save before you got a 429 with the 500ms delay?

I have great difficulty getting any consistent behaviour. Now I'm getting 403 Forbidden instead of 429, even with 1s delay.

Moreover, there are too many moving parts (IP address, browser/version, prior interactions with the servers, etc.) for me to deduce anything for now.

Sometimes, I am blocked on the first request, which is a GET for the citation info page-fragment (containing the BibTeX link), with URL pattern shown here:
https://github.com/zotero/translators/blob/e458dc3402cb2d7a788aeb0c4758035492182623/Google%20Scholar.js#LL170C1-L170C1

Here's some explanation of what we're trying to do, for anyone looking into this. On a GS results page, e.g. this one:
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=desire+production&btnG=

there's a "Cite" button under each result item. In the web app UI, clicking on "Cite" will initiate an XHR to obtain a HTML fragment, which the web app uses to display the citation info (and BibTeX link) in the style of a bubble. It is this XHR that the translator tries to emulate, by requesting citeUrl in the code referenced above.

The differences between the web-app XHR and the translator-initiated one include the following.

First and most obvious, the "native" request URL looks like this:

https://scholar.google.com/scholar?q=info:kXCamG4ZSioJ:scholar.google.com/&output=cite&scirp=0&hl=en

while in ours, the query parameters have scirp=1 (instead of 0, as in the native one above). I have no idea whether this is significant (maybe @adam3smith could help? thx).

In addition, the native XHR is sent with the "correct" headers, including Accept: */*, Referer, CORS-related ones, etc. I haven't checked whether the translator emulates these (neither FF nor Chromium lets me see the extension-initiated request's details in the inspector's Network tab).
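
For reference, attaching such headers from the translator side might look like this (a sketch: it assumes the promise-based request helpers accept a headers option, and searchPageUrl is a hypothetical variable holding the URL of the page being translated):

let citeFragment = await requestDocument(citeUrl, {
	headers: {
		// emulate the referrer that the in-browser XHR would carry
		Referer: searchPageUrl
	}
});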

@dstillman
Member

dstillman commented Jun 5, 2023

In Chrome, switch on Developer mode and view the extension's background page to see the translator requests. In Firefox, use about:debugging.

@dstillman
Member

I have great difficulty getting any consistent behaviour. Now I'm getting 403 Forbidden instead of 429, even with 1s delay.

Moreover, there are too many moving parts (IP address, browser/version, prior interactions with the servers, etc.) for me to deduce anything for now.

Yeah, we've always had the view that it's essentially a black box and not something we can really solve in a reliable way. But anything in the direction of looking more human-like is likely to help to some extent. That may mean a 1s minimum rather than 500ms, at least between items.

@zoe-translates
Collaborator Author

zoe-translates commented Jun 5, 2023

When selecting multiple items to save, I'm still getting 403s in Firefox and Scaffold (running doWeb() from the Scaffold UI). This is after increasing the delay to 1s and applying Referer: to the requests.

With the changes below, in Firefox the referrers were indeed applied, but in Chromium they weren't.

(EDIT: The inability to set Referer is a bug in Chromium; see this Chromium bug discussion, by Adomas)

@adam3smith
Collaborator

Thanks for working on this. Did you still want my input on something? I'm not quite sure I understand on what. If it's the exact strategies for constructing the URLs, I have no recollection. I (or whoever last changed this) almost certainly did exactly what you're doing now, replicating the behavior of the website, which may well have changed.

@zoe-translates
Collaborator Author

I (or whoever last changed this) almost certainly did exactly what you're doing now, replicating the behavior of the website, which may well have changed.

Thanks! Indeed, I'm following the same principle, emulating the behaviour of the in-browser requests.

@zoe-translates
Collaborator Author

I'm working on a better solution that cuts down the number of requests we have to send to Google servers, reducing the server load and hence the likelihood of a block/captcha.

Namely, if the main link to the article contains an extractable identifier (DOI, arXiv id, etc.), and if that identifier alone is sufficient for us to retrieve the item metadata by means of a search translator (i.e. no rendering of the actual page needed), we will delegate to that and bypass Google.

For the rest of the links, we will fall back to Google Scholar.

And for those 3rd-party resolution attempts that fail, we will also use the GS fallback.

In any case, network requests will be interspersed with delays.
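
A sketch of the identifier extraction this implies (ZU.cleanDOI() is the existing Zotero utility; the arXiv pattern and the function name are illustrative):

function extractIdentifier(url) {
	let doi = ZU.cleanDOI(decodeURIComponent(url));
	if (doi) {
		return { DOI: doi };
	}
	// e.g. https://arxiv.org/abs/2106.01345 or https://arxiv.org/pdf/2106.01345.pdf
	let arxiv = url.match(/arxiv\.org\/(?:abs|pdf)\/([0-9]{4}\.[0-9]{4,5})/i);
	if (arxiv) {
		return { arXiv: arxiv[1] };
	}
	return null;
}

The returned object can then be handed to the corresponding search translator, and only a null result falls through to Google Scholar.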

- The delay between requests of successive entries is increased to 1s.
- Add the Referer header to the requests for citation-info page and
  BibTeX text, emulating the in-browser behaviour.
- Update the citation-info pages' URL pattern to better match the ones
  used during in-browser navigation.
- Add comments accordingly.
- Add the missing delays between requests when processing the URL
  listing on the profile page.
- Refactoring and minor fixes for maintainability:
  - Make the delay between scrapes a constant instead of a hard-coded
    number.
  - Refactor scrape() so that it always accepts a URL as the second
    parameter. The logic for determining what to do (batch scraping vs.
    single) is moved to the caller.
  - Prefer "let" over "const" when the latter doesn't add to clarity.
The `===` changes introduced in 4c53d52 are reverted. Other uses of
`===` in the same file, predating that commit, are not touched.
- For the citation-info page request, inherit the language parameter
  from the browsing context in the URL.
- For the BibTeX request, use the correct TLD in the hostname part of
  the referrer.
To reduce the load on Google Scholar servers, and therefore the
likelihood of getting an error or captcha, we use the DOI and arXiv
identifiers (inferred from article title links) with the corresponding
search translators when possible.

The Google Scholar "native" scrape is used as fallback when the external
search fails.

The requests to different services are done concurrently, but for each
service, requests are serialized, and consecutive requests are
interspersed with delays.

If there are items left untranslated at the end of this process, an
error is raised.
- Add the ability to delay an in-pipeline task on the leading edge in
  addition to the trailing one (for profile pages); see the sketch
  after this list.
- Add an option to control row-generation delays (not necessary for
  search-page scraping, given the in-pipeline delays).
- Cancel the trailing-edge delay for the last row to translate, to
  improve responsiveness.
- No longer require work functions to signal their completion status by
  return value.
- Increase the default delay for GS native requests to 2 seconds.
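
Schematically, the leading/trailing distinction amounts to something like this (names illustrative; sleep() as in the earlier sketch):

async function delayedTask(work, { leading = false, trailing = true } = {}) {
	if (leading) {
		await sleep(TASK_DELAY_MS); // pause before the request (profile pages)
	}
	let result = await work();
	if (trailing) {
		await sleep(TASK_DELAY_MS); // pause after; skipped for the last row
	}
	return result;
}
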
@zoe-translates force-pushed the google-scholar-rate-limit branch from 5f3fa4e to abaedc0 on June 13, 2023, 09:16
@zoe-translates
Collaborator Author

Force-pushed after rebase.

The logic of applying translators has changed significantly. We now use a "pipelined" chain of DOI search -> ArXiv search -> "native" Google Scholar scraping, where each selected item will go through the chain in that order, and upon the first successful translation it will stop propagating. In other words, the translation method using Google Scholar network resources will be used as a last resort.

In each pipeline, the requests to network resources are queued with suitable delays between successive ones. But the three pipelines work concurrently, without waiting for each other.

In the end, if there are items left untranslated after going through the full chain, an error will be thrown.

The DOI and arXiv searches work by examining the item's link for candidate ID strings. When successful, each such translation saves two requests to GS (one for the citation-info page fragment, the other for the BibTeX file). DOI comes first because the DOI is often part of the article's link URL, so it is easy to obtain without further requests to the search results.

Another benefit of this approach is a unified way to add additional pipelines should this be further extended.

There are some corner cases that need addressing, including saving the article text or snapshot when the DOI is used, and the handling of citation-only entries (without a title link).

Overall, this does reduce the likelihood of getting a 403 or captcha while using Google Scholar. However, I'm not under the impression that we can completely prevent that without using an external API for "resolving" the citations to objects (something like Crossref's API, which can be queried using title, author, and other info).
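
Condensed to its essentials, the per-item chain is as follows (a sketch; the actual code additionally queues requests per service, so that the three pipelines proceed concurrently rather than item by item):

async function translateRow(row) {
	// upon the first successful translation, the item stops propagating
	for (let stage of [searchByDOI, searchByArXiv, scrapeGoogleScholar]) {
		try {
			await stage(row);
			return true;
		}
		catch (e) {
			// fall through to the next stage
		}
	}
	return false; // left untranslated; counted toward the final error
}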

Replace all occurrences of the space character with the plus sign (+),
not just the first one.
Other minor fixes:

- Some property names are revised to be more descriptive and better
  follow convention.
- The title of the attachment obtained from the GS link now appears as
  "Full Text (via Google Scholar)" instead of merely "Full Text". The
  GS-linked version may not be the version of record (VoR).
- Minor comment fixes.
Member

@dstillman left a comment


Haven't tried this or studied the logic, but some quick style/clarity/policy notes

// attach linked document as attachment if available
if (row.attachmentLink) {
	let attachment = {
		title: "Full Text (via Google Scholar)",
Member


"(via Google Scholar)" isn't really appropriate for the title. What's the concern? That Google Scholar may link to an unofficial copy of the PDF?

Collaborator Author


These are the links appearing on the right of the search results page, beside each entry. Google seems to prefer a publicly available version that matches the biblio entry: for example, self-archived versions hosted by the researchers' institutions, preprints, and files hosted on academic social media (for instance ResearchGate and Academia.edu, which may have some sort of data sharing with GS). My feeling was that this should be distinguished from the usual "Full Text" files. I was thinking about "Article Preview", but it might be misleading, suggesting a partial view, which is often not the case.

Collaborator Author


How about simply "Linked Document"?

Member


Hmm, yeah. This is tough.

I think "Linked Document" is too vague. Linked from where?

For files from Unpaywall, we use "Submitted Version", "Accepted Version", and "Published Version", but we don't know what those are here. Could be "Google Scholar Linked Version", but that's not really appropriate for the same reason "Full Text (via Google Scholar)" isn't really appropriate — the fact that it came from Google Scholar isn't really relevant to what it is. "Available Version"? "Publicly Available Version"?

Collaborator


FWIW, I actually liked the (via Google Scholar) option because it allows quick searching/filtering by the attachment title and also makes clear to users why they got a 'wrong' (e.g., old working paper) PDF. OTOH, with "Available Version" it's unclear where it's available from, and, where it's wrong, it looks like Zotero mishandled it.

Member


That's fair. I guess the practical aspect of this outweighs any conceptual objections to putting that in the title. OK, @zoe-translates, so we can revert to that.

Collaborator Author

@zoe-translates Jun 27, 2023


@dstillman @adam3smith these ideas aren't contradictory. Can we have Available Version via Google Scholar? This is a bit long, but it should be more accurate (the earlier iteration said "full text", which may be inaccurate).

Collaborator Author


In eff689d I decided to use Available Version via Google Scholar. This will still make searching and filtering easier, while being more accurate about the file's provenance. I hope we can close this fairly soon so we can focus on further improvements.

Member


OK, sorry, let's go with Available Version (Google Scholar). Available Version via Google Scholar is vaguely ungrammatical, but Available Version is nice and consistent, and (Google Scholar) makes sense as a sort of tag. Put them together and we have a winner.

- Use more descriptive variable names.
- Corrected misleading comments.
- More accurate type annotation in JSDoc comments.
- Various formatting changes for code readability.
- Rename the linked document that appears on the right side of a search
  result/citation page to "Available Version" (see
  https://scholar.google.com/intl/en/scholar/help.html#access).
- Remove the "Snapshot" attachment.
Settled on "Available Version via Google Scholar"
In addition to the dot (.), try the slash (/) as a possible marker that
separates the candidate DOI from common file-type extensions such as
"pdf", "html", etc.

A test case is added.
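
Roughly, the trimming amounts to a pattern like this (illustrative; the actual regex in the commit may differ):

// Strip a trailing file-type extension from a DOI candidate, whether it is
// introduced by a dot or by a slash, e.g. "10.1000/xyz123.pdf" or "10.1000/xyz123/pdf"
candidateDOI = candidateDOI.replace(/[./](?:pdf|html?|xml)$/i, "");
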
It's unnecessary to keep an array of all task promises just to
synchronize with the end of all pipelines' processing. The end of each
pipeline's queue is sufficient.
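
In code terms, roughly (a sketch, with pipelines and tailPromise as hypothetical names):

// awaiting each pipeline's last queued promise suffices;
// no need to collect every task promise along the way
await Promise.all(pipelines.map(pipeline => pipeline.tailPromise));
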
- Remove unnecessary ZU namespace reference
- Don't break function definitions into two lines
 * @property {Z.Translate<Z.SearchTranslator>} translator
 * @property {string} key
 */
function ExternalSearch(searchKey, translatorUUID, doAttach = false) {
Member


So echoing #3062 (comment), as a general rule, let's just say no constructors in translators. There are likely a few bad examples of this in existing translators (RIS comes to mind), but most translators just use functions and it is good and fine.

There's just too much engineering here. Translators need to be simple and easy to understand. There can't be constructors, methods, pipelines, factory functions, generators — it's just not appropriate for a translator. I understand that you're trying to do some things here that other translators don't have to deal with, but if the logic needs to be slightly less optimized to keep the translator simpler, so be it.

@zoe-translates
Collaborator Author

@dstillman, Could you provide some concrete criteria for the Google Scholar translator code you'd like to accept?

Am I correct in understanding that you want the following for the code itself:

  1. "There can't be constructors, methods, pipelines, factory functions, generators" -- this is very clear, and this is what you meant by "simple and easy to understand"
  2. Based on WoS Tagged: Reimplement the core algorithms. #3062 (comment) you referenced, you want that "someone uninitiated and/or inexperienced can make tweaks to [it] in a few minutes" -- this is less objective; My understanding is to write things as a mostly loose collection of functions, which takes simple input and returns simple output -- rather than custom data types conforming to some "typedef" -- so that the reader of the code don't have to understand and remember those "typedef"s. Right?
  3. It should be written "how almost all our other translators are written" -- "unless there's an extremely compelling reason to change it". Is this just the same as 1? And a compelling reason would be to code obsolescence, so unlike most others, I should use requestXXX and remove doGet/doPost/processDocuments, and async doWeb/scrape functions.

And the behaviour of the code should be like this:

  1. Make fewer requests, and serialize them (of this I'm sure).
  2. Save Available Version (via Google Scholar) attachments when it makes sense (we've decided on the name).
  3. Opportunistically delegate to DOI or arXiv (see 1).
  4. In a chat we mentioned the speed of translating a page of items (typically 20). Are there any requirements about this? This question is specific to Google Scholar because we need artificial delays.

Are there any other requirements I missed? Please let me know, thanks.

(A lot of the current complexity comes from the desire to optimize for concurrency [so the user perceives progress] and from possible future expansion to include more 3rd-party services. Should I forgo those, for reasons 1-3 above?)

@dstillman
Member

Yes, that's all correct.

Optimizing for concurrency is a worthy goal, but not at the expense of readability and maintainability. I suspect there are some simpler ways we can achieve some level of concurrency without so much extra machinery.

@zoe-translates
Collaborator Author

Hi, thanks!

I suspect there are some simpler ways we can achieve some level of concurrency without so much extra machinery.

This is a bit tricky. It's analogous to writing without using a particular glyph, as in a lipogram. But that won't stop us from trying... right?

(Jest aside, suggestions are ever appreciated!)

- To reduce the cognitive load of maintainers and present the control
  flow in a more straightforward way, the code for the fallback-scraping
  technique is rewritten to eliminate classes (constructors),
  dynamically-generated and chained promises, and higher-order helper
  functions.
- Currently, this results in some loss of concurrency; a few techniques
  to compensate for that are TBD.
- The attachment is now titled "Available Version (via Google Scholar)".
- When the item is obtained from external search, do not add "via Google
  Scholar" to the Library Catalog field.
- Some test cases are slightly modified to reflect these changes.
@zoe-translates
Collaborator Author

Re: 47957c8 - a few awaits can still be optimized away without sending requests to Google Scholar too frequently. I'm experimenting with them.

- Don't wait for each row to pass through the DOI -> arXiv -> GS
  processing inside the loop over the rows, but do wait between
  adjacent row iterations (sketched below). This will help improve the
  perception of progress.
- Error reporting improvements.
- Fix some typos.
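
Schematically (names illustrative; error handling elided):

for (let row of rows) {
	// fire the DOI -> arXiv -> GS chain for this row without awaiting it...
	translateRow(row);
	// ...but still pace the row iterations themselves
	await sleep(ROW_DELAY_MS);
}
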
@zoe-translates
Collaborator Author

As of 504c50e, the code now does what it's supposed to do (delegating to external searches, running with some level of concurrency) using largely elementary constructs (no constructors, no factories, no then()). The only place where I pass a callback is to choose between two predefined functions (for the search page and the profile page, respectively), so as to reduce a lot of redundant code. I kept one "typedef" in the comments, because it is just a group of properties describing the info strings from a row of search/profile results.

I inherited a large part of the ItemFactory code for scraping case law. I didn't modify that part, even though it didn't seem fully functional (understandable given its age); it doesn't affect the problem of excessive requests.

@dstillman merged commit 158299c into zotero:master on Sep 23, 2023

Closes: Google Scholar: Add delay between metadata requests (#3037)