
Google Scholar: Add delays between consecutive network requests. #3043

Merged
22 commits merged into zotero:master on Sep 23, 2023

Conversation

zoe-translates
Collaborator

When there are multiple items to be translated, the previous implementation fires network requests to the Google Scholar server in rapid succession. This increases the likelihood of being rate-limited or denied by the server.

To fix this, an artificial delay of at least 500ms is added between successive network requests.

In addition, calls to the callback-based doGet() are replaced with promise-based asynchronous calls.

A few stylistic fixes (such as === vs. ==) are also included.

Resolves #3037.
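
As a rough sketch, the delay logic amounts to something like the following (sleep() and the constant name are illustrative, not necessarily the identifiers used in the diff):

const DELAY_BETWEEN_REQUESTS_MS = 500; // minimum gap between successive requests

function sleep(ms) {
	return new Promise(resolve => setTimeout(resolve, ms));
}

async function scrapeAll(urls) {
	let first = true;
	for (let url of urls) {
		if (!first) {
			await sleep(DELAY_BETWEEN_REQUESTS_MS); // throttle between items
		}
		first = false;
		// promise-based request, replacing the callback-based doGet()
		let doc = await requestDocument(url);
		await scrape(doc, url);
	}
}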

@zoe-translates
Collaborator Author

I am currently using 500ms as the (minimum) delay between successive requests. But this may not be enough.

Sometimes, I get a 429 Too Many Requests response, and perhaps I could inspect that and see if there's a Retry-After: header with it, which may be helpful. (Currently, the requestXXX helpers reject on any non-2xx status.)

There are a few more ideas for playing even nicer with the server:

  • Use longer delays, perhaps a second or more.
  • Add some randomness to the baseline delay; see the Python package scholarly.
  • Use a queue of requests throttled by a "pool of tokens", as in the Google Chrome extension "Scholar H-index Calculator": each request consumes a token (i.e. decreases a counter), while the total number of tokens is slowly replenished by setInterval but never exceeds a maximum; see also the Stack Exchange discussion. This would make the code a lot more complex, though (a sketch follows below).

I'll do some experiments and see if we can catch a 429 response, and if we can adaptively handle that without hard-failing.
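
For illustration, a minimal sketch of the token-pool idea (all numbers and names here are made up, and this is not code from this PR; as noted below, this kind of throttling may belong in the Connector rather than in the translator):

const MAX_TOKENS = 5;
const REFILL_MS = 2000; // one token replenished every 2 s, capped at MAX_TOKENS
let tokens = MAX_TOKENS;
let lastRefill = Date.now();

function refillTokens() {
	let newTokens = Math.floor((Date.now() - lastRefill) / REFILL_MS);
	if (newTokens > 0) {
		tokens = Math.min(MAX_TOKENS, tokens + newTokens);
		lastRefill += newTokens * REFILL_MS;
	}
}

async function throttledRequest(url) {
	refillTokens();
	while (tokens < 1) { // pool exhausted; wait for replenishment
		await sleep(REFILL_MS);
		refillTokens();
	}
	tokens--; // each request consumes a token
	return requestDocument(url);
}

Randomizing the baseline delay, as scholarly does, could be as simple as await sleep(BASE_MS + Math.random() * JITTER_MS).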

@dstillman
Member

Oh, Google almost certainly doesn't include a Retry-After. That would be offered for a documented public API, not a site like Google trying to stop bots.

Approximately how many items were you able to save before you got a 429 with the 500ms delay?

Something like the token idea might help, but I think we'd have to implement it in the Connector itself and enforce it based on HTTP request domain, since we'd want it to apply across pages. And we wouldn't do it with setInterval — it would just have to be mathematical based on an in-memory record of recent translator HTTP requests for that domain or URL pattern. So we can do that separately from this commit.

@zoe-translates
Collaborator Author

Approximately how many items were you able to save before you got a 429 with the 500ms delay?

I have great difficulty getting any consistent behaviour. Now I'm getting 403 Forbidden instead of 429, even with 1s delay.

Moreover, there are too many moving parts (IP address, browser/version, prior interactions with the servers, etc.) for me to deduce anything for now.

Sometimes, I am blocked on the first request, which is a GET for the citation info page-fragment (containing the BibTeX link), with URL pattern shown here:
https://github.com/zotero/translators/blob/e458dc3402cb2d7a788aeb0c4758035492182623/Google%20Scholar.js#LL170C1-L170C1

Here's some explanation of what we're trying to do, for anyone looking into this. On a GS results page, e.g. this one:
https://scholar.google.com/scholar?hl=en&as_sdt=0%2C5&q=desire+production&btnG=

there's a "Cite" button under each result item. In the web app UI, clicking on "Cite" will initiate an XHR to obtain a HTML fragment, which the web app uses to display the citation info (and BibTeX link) in the style of a bubble. It is this XHR that the translator tries to emulate, by requesting citeUrl in the code referenced above.

The differences between the web-app XHR and the translator-initiated one include the following.

First and most obvious, the "native" request URL looks like this:

https://scholar.google.com/scholar?q=info:kXCamG4ZSioJ:scholar.google.com/&output=cite&scirp=0&hl=en

while in ours, the query parameters have scirp=1 (instead of 0, as in the native one above). I have no idea whether this is significant (maybe @adam3smith could help? thx).

In addition, the native XHR is sent with the "correct" headers, including Accept: */*, Referer, CORS-related ones, etc. I haven't checked whether the translator emulates these (neither FF nor Chromium lets me see the extension-initiated request's details in the inspector's Network tab).
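
For reference, attaching such headers from the translator side might look like this (a sketch: it assumes the promise-based request helpers accept a headers option, and searchPageUrl is a hypothetical variable holding the URL of the page being translated):

let citeFragment = await requestDocument(citeUrl, {
	headers: {
		// emulate the referrer that the in-browser XHR would carry
		Referer: searchPageUrl
	}
});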

@dstillman
Member

dstillman commented Jun 5, 2023

In Chrome, switch on Developer mode and view the extension's background page to see the translator requests. In Firefox, use about:debugging.

@dstillman
Member

I have great difficulty getting any consistent behaviour. Now I'm getting 403 Forbidden instead of 429, even with 1s delay.

Moreover, there are too many moving parts (IP address, browser/version, prior interactions with the servers, etc.) for me to deduce anything for now.

Yeah, we've always had the view that it's essentially a black box and not something we can really solve in a reliable way. But anything in the direction of looking more human-like is likely to help to some extent. That may mean a 1s minimum rather than 500ms, at least between items.

@zoe-translates
Collaborator Author

zoe-translates commented Jun 5, 2023

When selecting multiple items to save, I'm still getting 403s in Firefox and Scaffold (running doWeb() from the Scaffold UI). This is after increasing the delay to 1s and applying Referer: to the requests.

With the changes below, in Firefox the referrers were indeed applied, but in Chromium they weren't.

(EDIT: The inability to set Referer is a bug in Chromium; see this Chromium bug discussion, by Adomas)

@adam3smith
Collaborator

Thanks for working on this. Did you still want my input on something? I'm not quite sure I understand on what. If it's the exact strategies for constructing the URLs, I have no recollection. I (or whoever last changed this) almost certainly did exactly what you're doing now, replicating the behavior of the website, which may well have changed.

@zoe-translates
Collaborator Author

I (or whoever last changed this) almost certainly did exactly what you're doing now, replicating the behavior of the website, which may well have changed.

Thanks! Indeed, I'm following the same principle, emulating the behaviour of the in-browser requests.

@zoe-translates
Collaborator Author

I'm working on a better solution that cuts down the number of requests we have to send to Google servers, reducing the server load and hence the likelihood of a block/captcha.

Namely, if the main link to the article contains an extractable identifier (DOI, arXiv id, etc.), and if that identifier alone is sufficient for us to retrieve the item metadata by means of a search translator (i.e. no rendering of the actual page needed), we will delegate to that and bypass Google.

For the rest of the links, we will fall back to Google Scholar.

And for those 3rd-party resolution attempts that fail, we will also use the GS fallback.

In any case, network requests will be interspersed with delays.
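
A sketch of the identifier extraction this implies (ZU.cleanDOI() is the existing Zotero utility; the arXiv pattern and the function name are illustrative):

function extractIdentifier(url) {
	let doi = ZU.cleanDOI(decodeURIComponent(url));
	if (doi) {
		return { DOI: doi };
	}
	// e.g. https://arxiv.org/abs/2106.01345 or https://arxiv.org/pdf/2106.01345.pdf
	let arxiv = url.match(/arxiv\.org\/(?:abs|pdf)\/([0-9]{4}\.[0-9]{4,5})/i);
	if (arxiv) {
		return { arXiv: arxiv[1] };
	}
	return null;
}

The returned object can then be handed to the corresponding search translator, and only a null result falls through to Google Scholar.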

- The delay between requests of successive entries is increased to 1s.
- Add the Referer header to the requests for citation-info page and
  BibTeX text, emulating the in-browser behaviour.
- Update the citation-info pages' URL pattern to better match the ones
  used during in-browser navigation.
- Add comments accordingly.
- Add the missing delays between requests when processing the URL
  listing on the profile page.
- Refactoring and minor fixes for maintainability:
  - Make the delay between scrapes a constant instead of a hard-coded
    number.
  - Refactor scrape() so that it always accepts a URL as the second
    parameter. The logic for determining what to do (batch scraping vs.
    single) is moved to the caller.
  - Prefer "let" over "const" when the latter doesn't add to clarity.
The `===` changes introduced in 4c53d52 are reverted. Other uses of
`===` in the same file, predating that commit, are not touched.
- For the citation-info page request, inherit the language parameter
  from the browsing context in the URL.
- For the BibTeX request, use the correct TLD in the hostname part of
  the referrer.
To reduce the load on Google Scholar servers, and therefore the
likelihood of getting an error or captcha, we use the DOI and arXiv
identifiers (inferred from article title links) with the corresponding
search translators when possible.

The Google Scholar "native" scrape is used as fallback when the external
search fails.

The requests to different services are done concurrently, but for each
service, requests are serialized, and consecutive requests are
interspersed with delays.

If there are items left untranslated at the end of this process, an
error is raised.
- Add the ability to delay an in-pipeline task on the leading edge in
  addition to the trailing one (for profile pages); see the sketch
  after this list.
- Add an option to control row-generation delays (not necessary for
  search-page scraping, given the in-pipeline delays).
- Cancel the trailing-edge delay for the last row to translate, to
  improve responsiveness.
- No longer require work functions to signal their completion status by
  return value.
- Increase the default delay for GS native requests to 2 seconds.
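
Schematically, the leading/trailing distinction amounts to something like this (names illustrative; sleep() as in the earlier sketch):

async function delayedTask(work, { leading = false, trailing = true } = {}) {
	if (leading) {
		await sleep(TASK_DELAY_MS); // pause before the request (profile pages)
	}
	let result = await work();
	if (trailing) {
		await sleep(TASK_DELAY_MS); // pause after; skipped for the last row
	}
	return result;
}
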
@zoe-translates force-pushed the google-scholar-rate-limit branch from 5f3fa4e to abaedc0 on June 13, 2023, 09:16
@zoe-translates
Collaborator Author

Force-pushed after rebase.

The logic of applying translators has changed significantly. We now use a "pipelined" chain of DOI search -> ArXiv search -> "native" Google Scholar scraping, where each selected item will go through the chain in that order, and upon the first successful translation it will stop propagating. In other words, the translation method using Google Scholar network resources will be used as a last resort.

In each pipeline, the requests to network resources are queued with suitable delays between successive ones. But the three pipelines work concurrently, without waiting for each other.

In the end, if there are items left untranslated after going through the full chain, an error will be thrown.

The DOI and arXiv searches work by examining the item's link for candidate ID strings. When successful, each such translation saves two requests to GS (one for the citation-info page fragment, the other for the BibTeX file). DOI comes first because the DOI is often part of the article's link URL, so it is easy to obtain without further requests to the search results.

Another benefit of this approach is a unified way to add additional pipelines should this be further extended.

There are some corner cases that need addressing, including saving the article text or snapshot when the DOI is used, and the handling of citation-only entries (without a title link).

Overall, this does reduce the likelihood of getting a 403 or captcha while using Google Scholar. However, I'm not under the impression that we can completely prevent that without using an external API for "resolving" the citations to objects (something like Crossref's API, which can be queried using title, author, and other info).
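
Condensed to its essentials, the per-item chain is as follows (a sketch; the actual code additionally queues requests per service, so that the three pipelines proceed concurrently rather than item by item):

async function translateRow(row) {
	// upon the first successful translation, the item stops propagating
	for (let stage of [searchByDOI, searchByArXiv, scrapeGoogleScholar]) {
		try {
			await stage(row);
			return true;
		}
		catch (e) {
			// fall through to the next stage
		}
	}
	return false; // left untranslated; counted toward the final error
}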

Replace all occurrences of the space character with the plus sign (+),
not just the first one.
Other minor fixes:

- Some property names are revised to be more descriptive and better
  follow convention.
- The title of the attachment obtained from the GS link now appears as
  "Full Text (via Google Scholar)" instead of merely "Full Text". The
  GS-linked version may not be the version of record (VoR).
- Minor comment fixes.
Member

@dstillman left a comment


Haven't tried this or studied the logic, but some quick style/clarity/policy notes

// attach linked document as attachment if available
if (row.attachmentLink) {
	let attachment = {
		title: "Full Text (via Google Scholar)",
Member


"(via Google Scholar)" isn't really appropriate for the title. What's the concern? That Google Scholar may link to an unofficial copy of the PDF?

Collaborator Author


These are the links appearing on the right of the search results page, beside each entry. Google seems to prefer a publicly available version that matches the biblio entry: for example, self-archived versions hosted by the researchers' institutions, preprints, and files hosted on academic social media (for instance ResearchGate and Academia.edu, which may have some sort of data sharing with GS). My feeling was that this should be distinguished from the usual "Full Text" files. I was thinking about "Article Preview", but it might be misleading, suggesting a partial view, which is often not the case.

Collaborator Author


How about simply "Linked Document"?

Member


Hmm, yeah. This is tough.

I think "Linked Document" is too vague. Linked from where?

For files from Unpaywall, we use "Submitted Version", "Accepted Version", and "Published Version", but we don't know what those are here. Could be "Google Scholar Linked Version", but that's not really appropriate for the same reason "Full Text (via Google Scholar)" isn't really appropriate — the fact that it came from Google Scholar isn't really relevant to what it is. "Available Version"? "Publicly Available Version"?

Collaborator


FWIW, I actually liked the (via Google Scholar) option because it allows quick searching/filtering by the attachment title and also makes clear to users why they got a 'wrong' (e.g., old working paper) PDF. OTOH, with "Available Version" it's unclear where it's available from, and, where it's wrong, it looks like Zotero mishandled it.

Member


That's fair. I guess the practical aspect of this outweighs any conceptual objections to putting that in the title. OK, @zoe-translates, so we can revert to that.

Collaborator Author

@zoe-translates Jun 27, 2023


@dstillman @adam3smith these ideas aren't contradictory. Can we have Available Version via Google Scholar? This is a bit long, but it should be more accurate (the earlier iteration said "full text", which may be inaccurate).

Collaborator Author


In eff689d I decided to use Available Version via Google Scholar. This will still make searching and filtering easier, while being more accurate about the file's provenance. I hope we can close this fairly soon so we can focus on further improvements.

Member


OK, sorry, let's go with Available Version (Google Scholar). Available Version via Google Scholar is vaguely ungrammatical, but Available Version is nice and consistent, and (Google Scholar) makes sense as a sort of tag. Put them together and we have a winner.

- Use more descriptive variable names.
- Corrected misleading comments.
- More accurate type annotation in JSDoc comments.
- Various formatting changes for code readability.
- Rename the linked document that appears on the right side of a search
  result/citation page to "Available Version" (see
  https://scholar.google.com/intl/en/scholar/help.html#access).
- Remove the "Snapshot" attachment.
Settled on "Available Version via Google Scholar"
In addition to the dot (.), try the slash (/) as a possible marker that
separates the candidate DOI from common file-type extensions such as
"pdf", "html", etc.

A test case is added.
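
Roughly, the trimming amounts to a pattern like this (illustrative; the actual regex in the commit may differ):

// Strip a trailing file-type extension from a DOI candidate, whether it is
// introduced by a dot or by a slash, e.g. "10.1000/xyz123.pdf" or "10.1000/xyz123/pdf"
candidateDOI = candidateDOI.replace(/[./](?:pdf|html?|xml)$/i, "");
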
It's unnecessary to keep an array of all task promises just to
synchronize with the end of all pipelines' processing. The end of each
pipeline's queue is sufficient.
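
In code terms, roughly (a sketch, with pipelines and tailPromise as hypothetical names):

// awaiting each pipeline's last queued promise suffices;
// no need to collect every task promise along the way
await Promise.all(pipelines.map(pipeline => pipeline.tailPromise));
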
- Remove unnecessary ZU namespace reference
- Don't break function definitions into two lines
 * @property {Z.Translate<Z.SearchTranslator>} translator
 * @property {string} key
 */
function ExternalSearch(searchKey, translatorUUID, doAttach = false) {
Member


So echoing #3062 (comment), as a general rule, let's just say no constructors in translators. There are likely a few bad examples of this in existing translators (RIS comes to mind), but most translators just use functions and it is good and fine.

There's just too much engineering here. Translators need to be simple and easy to understand. There can't be constructors, methods, pipelines, factory functions, generators — it's just not appropriate for a translator. I understand that you're trying to do some things here that other translators don't have to deal with, but if the logic needs to be slightly less optimized to keep the translator simpler, so be it.

@zoe-translates
Collaborator Author

@dstillman, Could you provide some concrete criteria for the Google Scholar translator code you'd like to accept?

Am I correct in understanding that you want the following for the code itself:

  1. "There can't be constructors, methods, pipelines, factory functions, generators" -- this is very clear, and this is what you meant by "simple and easy to understand"
  2. Based on WoS Tagged: Reimplement the core algorithms. #3062 (comment) you referenced, you want that "someone uninitiated and/or inexperienced can make tweaks to [it] in a few minutes" -- this is less objective; My understanding is to write things as a mostly loose collection of functions, which takes simple input and returns simple output -- rather than custom data types conforming to some "typedef" -- so that the reader of the code don't have to understand and remember those "typedef"s. Right?
  3. It should be written "how almost all our other translators are written" -- "unless there's an extremely compelling reason to change it". Is this just the same as 1? And a compelling reason would be to code obsolescence, so unlike most others, I should use requestXXX and remove doGet/doPost/processDocuments, and async doWeb/scrape functions.

And the behaviour of the code should be like this:

  1. Make fewer requests, and serialize them (of this I'm sure).
  2. Save Available Version (via Google Scholar) attachments when it makes sense (we've decided on the name).
  3. Opportunistically delegate to DOI or arXiv (see 1).
  4. In a chat we mentioned the speed of translating a page of items (typically 20). Are there any requirements about this? This question is specific to Google Scholar because we need artificial delays.

Are there any other requirements I missed? Please let me know, thanks.

(A lot of the current complexity comes from the desire to optimize for concurrency [so the user perceives progress] and from possible future expansion to include more 3rd-party services. Should I forgo those, for reasons 1-3 above?)

@dstillman
Member

Yes, that's all correct.

Optimizing for concurrency is a worthy goal, but not at the expense of readability and maintainability. I suspect there are some simpler ways we can achieve some level of concurrency without so much extra machinery.

@zoe-translates
Collaborator Author

Hi, thanks!

I suspect there are some simpler ways we can achieve some level of concurrency without so much extra machinery.

This is a bit tricky. It's analogous to writing without using a particular glyph, as in a lipogram. But that won't stop us from trying... right?

(Jest aside, suggestions are ever appreciated!)

- To reduce the cognitive load of maintainers and present the control
  flow in a more straightforward way, the code for the fallback-scraping
  technique is rewritten to eliminate classes (constructors),
  dynamically-generated and chained promises, and higher-order helper
  functions.
- Currently, this results in some loss of concurrency; a few techniques
  to compensate for that are TBD.
- The attachment is now titled "Available Version (via Google Scholar)".
- When the item is obtained from external search, do not add "via Google
  Scholar" to the Library Catalog field.
- Some test cases are slightly modified to reflect these changes.
@zoe-translates
Collaborator Author

Re: 47957c8 - a few awaits can still be optimized away without sending requests to Google Scholar too frequently. I'm experimenting with them.

- Don't wait for each row to pass through the DOI -> arXiv -> GS
  processing inside the loop over the rows, but do wait between
  adjacent row iterations (sketched below). This will help improve the
  perception of progress.
- Error reporting improvements.
- Fix some typos.
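
Schematically (names illustrative; error handling elided):

for (let row of rows) {
	// fire the DOI -> arXiv -> GS chain for this row without awaiting it...
	translateRow(row);
	// ...but still pace the row iterations themselves
	await sleep(ROW_DELAY_MS);
}
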
@zoe-translates
Collaborator Author

As of 504c50e, the code now does what it's supposed to do (delegating to external searches, running with some level of concurrency) using largely elementary constructs (no constructors, no factories, no then()). The only place where I pass a callback is to choose between two predefined functions (for the search page and the profile page, respectively), so as to reduce a lot of redundant code. I kept one "typedef" in the comments, because it is just a group of properties describing the info strings from a row of search/profile results.

I inherited a large part of the ItemFactory code for scraping case law. I didn't modify that part, even though it didn't seem fully functional (understandable given its age); it doesn't affect the problem of excessive requests.

@dstillman merged commit 158299c into zotero:master on Sep 23, 2023

Closes: Google Scholar: Add delay between metadata requests (#3037)