Quick fix google scholar entry fetching #2082
Some small remarks.
int lastRegionStart = 0;

while (m.find()) {
    String link = m.group(1).replace("&amp;", "&");
    link = link+"&oe=utf-8"; // append param 'oe=utf-8' to tell google to serve UTF-8 encoded results
Is the UTF-8 parameter no longer required?
Yes. Because I am using a Firefox user agent, UTF-8 is delivered automatically.
String link = m.group(1).replace("&amp;", "&");
link = link+"&oe=utf-8"; // append param 'oe=utf-8' to tell google to serve UTF-8 encoded results

String citationsPageURL = CITATIONS_PAGE_URL_BASE+m.group(1)+CITATIONS_PAGE_URL_SUFFIX;
Please use Apache's URIBuilder instead of String concatenation.
Mhmm. I don't think this is much of an improvement, as some string concatenation is still needed for the param "q" with "info:"+id+":scholar.google.com/"
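For illustration, here is a minimal sketch of building the query string from named parameters rather than from one concatenated string. It uses only the JDK (URLEncoder), not Apache's URIBuilder, so it shows the idea rather than the reviewer's exact suggestion. The endpoint in BASE and the "output" parameter are assumptions made for this example; only the "q" value format comes from the discussion above, and the class and method names are hypothetical.

```java
import java.net.URLEncoder;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.stream.Collectors;

public class CitationsUrlSketch {

    // Assumed endpoint, for illustration only; the real CITATIONS_PAGE_URL_BASE is not shown in this PR excerpt.
    static final String BASE = "https://scholar.google.com/scholar";

    // Percent-encodes every key and value and joins the pairs with '&'.
    static String buildQuery(Map<String, String> params) {
        return params.entrySet().stream()
                .map(e -> URLEncoder.encode(e.getKey(), StandardCharsets.UTF_8)
                        + "=" + URLEncoder.encode(e.getValue(), StandardCharsets.UTF_8))
                .collect(Collectors.joining("&"));
    }

    static String citationsPageUrl(String id) {
        Map<String, String> params = new LinkedHashMap<>();
        // The value of "q" still needs a small concatenation, as noted in the comment above.
        params.put("q", "info:" + id + ":scholar.google.com/");
        // Assumed parameter, standing in for whatever CITATIONS_PAGE_URL_SUFFIX contains.
        params.put("output", "cite");
        return BASE + "?" + buildQuery(params);
    }
}
```

The concatenation for "q" remains, but the encoding of ':' and '/' is handled in one place instead of being baked into the base and suffix constants.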
String citationsPage = URLDownload.createURLDownloadWithBrowserUserAgent(citationsPageURL).downloadToString(StandardCharsets.UTF_8);

Matcher citationPageMatcher = GoogleScholarFetcher.BIBTEX_LINK_PATTERN.matcher(citationsPage);
Isn't it possible to get the BibTeX URL directly from the id?
For example, by directly accessing https://scholar.googleusercontent.com/scholar.bib?q=info:b2pGeL14LLMJ:scholar.google.com/&output=citation&scisig=AAGBfm0AAAAAV-vITSVVNk7OSER8S_LFaMSElM9jnUcv&scisf=4&ct=citation&cd=-1&hl=en
No, because the "scisig" parameter is required and seems to change.
That's unfortunate.
I did some research and I think I found something.
Google stores the setting to display bib-links in a cookie. Sending the cookie GSP=IN=d192d757fd09588a+7e6cc990821af64:CF=4 along with the first query directly shows the "Import into BibTeX" link:
<a href="https://scholar.googleusercontent.com/scholar.bib?q=info:b2pGeL14LLMJ:scholar.google.com/&output=citation&scisig=AAGBfm0AAAAAV-vafm6NWi20weGxxou9W2xi8GzZ8YCf&scisf=4&ct=citation&cd=0&hl=en" class="gs_nta gs_nph">Import into BibTeX</a>
Apparently, the part after GSP=IN=... is not that important, since my real id had 63 at the end instead of 64.
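A rough sketch of that idea, using only the JDK: send the GSP cookie with the search request, then pull the "Import into BibTeX" href out of the returned HTML. The class name, method names, and the regular expression are hypothetical and written for this example (they are not JabRef's BIBTEX_LINK_PATTERN). The cookie value is the one quoted above, and whether Google keeps accepting it is an open assumption.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class BibtexCookieSketch {

    // Cookie value quoted in the comment above; assumed to enable the BibTeX import links.
    static final String GSP_COOKIE = "GSP=IN=d192d757fd09588a+7e6cc990821af64:CF=4";

    // Hypothetical pattern: captures the href of an "Import into BibTeX" anchor.
    static final Pattern BIBTEX_LINK = Pattern.compile(
            "<a href=\"([^\"]*scholar\\.bib[^\"]*)\"[^>]*>Import into BibTeX</a>");

    // Returns the first BibTeX link found in the page, with HTML-escaped ampersands decoded.
    static Optional<String> firstBibtexLink(String html) {
        Matcher m = BIBTEX_LINK.matcher(html);
        return m.find() ? Optional.of(m.group(1).replace("&amp;", "&")) : Optional.empty();
    }

    // Downloads a page as UTF-8 text, sending the cookie and a caller-supplied browser user agent.
    static String download(String url, String userAgent) throws IOException {
        HttpURLConnection connection = (HttpURLConnection) URI.create(url).toURL().openConnection();
        connection.setRequestProperty("Cookie", GSP_COOKIE);
        connection.setRequestProperty("User-Agent", userAgent);
        try (InputStream in = connection.getInputStream()) {
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        }
    }
}
```

If the cookie works as described, the link returned by firstBibtexLink already carries a valid "scisig", so the separate citations-page request would no longer be needed.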
Ah, using a cookie instead of the config method could do the trick...
I'll merge the current state in and then start implementing a new version based on the new fetcher interfaces. Using these, we should be able to run automated tests, which will help during initial development and let us detect issues caused by changes to Google's page structure faster...
Google Scholar fetching was broken again (see #1886).
With this fix, at least fetching the first 10 search hits is possible again.
Configuration is no longer possible in its current form, and Google generally limits the responses (per page) to 20 hits (however, even using this will cause a captcha challenge for JabRef).
As only 10 hits are allowed, a rewrite to the new FetcherInfrastructure should now be possible (thus omitting the 2-step approach).