New approach to data collection (using Google Scholar then WoS then Unpaywall) #10

solstag · 2021-05-07T15:33:32Z

Ni! Here's a new approach that @amchagas and I put together earlier today. It follows our conclusion (see #8) that one can't easily tell automatically from the abstract whether a paper is about open hardware, so our previous approaches (WoS+Scopus+SciELO) were very limiting, since those databases don't permit full-text search.

The new approach consists of:

Search Google Scholar for "open hardware" or "open-source hardware" (or something similar), saving all result pages and extracting article data. That will be about 20k results; at 10 results per page, that makes 2k requests.

Here's an example of a proper (i.e. using Scrapy) Google Scholar scraper: Build Your Own Google Scholar API With Python Scrapy
Using the titles and whatever metadata is available (first author, year), do a WoS/Scopus search to get each article's full metadata and DOI, and to confirm whether it falls into our scope (published journal and conference papers?).
As long as we treat the full-text contents manually, we can also get the full text manually, as that's very little overhead. If we're automating, then get the full text through Google Scholar, Unpaywall, or, if necessary, Sci-Hub.
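For the automated route, a minimal sketch of the Unpaywall step could look like the following. It assumes the Unpaywall v2 REST endpoint, which is queried by DOI together with a contact email, and it reads the `best_oa_location` field of the response. The helper names, the DOI and the response fragment are hypothetical, and no request is actually sent here.

```python
from typing import Optional
from urllib.parse import quote

UNPAYWALL_API = "https://api.unpaywall.org/v2/"


def unpaywall_request_url(doi: str, email: str) -> str:
    """Build the Unpaywall lookup URL for one DOI (the API asks for a contact email)."""
    return f"{UNPAYWALL_API}{quote(doi, safe='/')}?email={quote(email)}"


def pick_fulltext_url(record: dict) -> Optional[str]:
    """Return the best open-access link from a parsed Unpaywall response, if any."""
    location = record.get("best_oa_location") or {}
    # Prefer a direct PDF link, fall back to the generic location URL.
    return location.get("url_for_pdf") or location.get("url")


# Hypothetical response fragment, only to show the shape being read.
example = {
    "doi": "10.1234/example",
    "is_oa": True,
    "best_oa_location": {"url": "https://example.org/paper", "url_for_pdf": None},
}

print(unpaywall_request_url("10.1234/example", "me@example.org"))
print(pick_fulltext_url(example))
```

Entries for which `pick_fulltext_url` returns `None` would be the ones left for the other sources or for manual retrieval.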


amchagas · 2021-06-08T21:55:26Z

Ok, after a long time without managing to work on this, I looked at the link showing how to build a spider using Scrapy and spent a bit of time trying to download things. The first query I managed was "open hardware", and the CSV file can be found here:

https://github.com/amchagas/open-hardware-supply/blob/master/data/scrapy/openhardware_query.csv

I still have to check the contents of the data in more detail. Maybe we could have a chat about where to take this?

solstag · 2021-06-13T13:23:41Z

Cool! I agree we should use a more restricted query when searching the full text: probably only things we are really sure refer to what we want, like combinations of ("open", "open-source") with ("hardware", "science hardware", "scientific hardware"). This way we can argue for a well-defined scope tied to the usage of the term in the full text. Let me know when is a good time for a chat!
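The combinations mentioned above can be enumerated mechanically. This is only a sketch: the two term lists come from the comment, while joining the phrases with OR into a single query is an assumption about how they would be used, and the function names are made up for illustration.

```python
from itertools import product

PREFIXES = ("open", "open-source")
SUFFIXES = ("hardware", "science hardware", "scientific hardware")


def exact_phrases(prefixes=PREFIXES, suffixes=SUFFIXES):
    """Return every prefix/suffix combination as a quoted exact phrase."""
    return [f'"{p} {s}"' for p, s in product(prefixes, suffixes)]


def combined_query(phrases):
    """Join the phrases with OR (an assumption about how they would be combined)."""
    return " OR ".join(phrases)


phrases = exact_phrases()
print(len(phrases))  # 2 prefixes x 3 suffixes = 6 phrases
print(combined_query(phrases))
```

Running each phrase as its own query instead of one OR query would also keep track of which wording matched each paper.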

amchagas · 2021-06-26T18:50:09Z

A little update:
We learned the hard way that Google Scholar displays at most 1000 results for any search term. The search "open hardware" gives 14800 hits, but only the first 1000 can be viewed.

This complicates things, as we are then very far from a data collection that is representative of all the papers that have "open hardware" somewhere in the body.

We are trying other things to see if we can break down the search terms so that each returns fewer than 1000 results, and then scrape things "slowly".
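One possible way to get below the cap, sketched here as an assumption rather than as what was actually tried, is to restrict each query to a single publication year and page through that slice. The `as_ylo`, `as_yhi` and `start` URL parameters are the ones visible in the Google Scholar web interface; the function names and the result count passed in are hypothetical.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"
PAGE_SIZE = 10      # results per page, as in the first comment
MAX_VISIBLE = 1000  # results Google Scholar displays per query


def page_urls(query: str, year: int, n_results: int):
    """Yield result-page URLs for one query restricted to one publication year."""
    visible = min(n_results, MAX_VISIBLE)
    for start in range(0, visible, PAGE_SIZE):
        params = {"q": query, "as_ylo": year, "as_yhi": year, "start": start}
        yield f"{BASE_URL}?{urlencode(params)}"


def needs_further_split(n_results: int) -> bool:
    """A slice with more hits than the display cap cannot be scraped in full."""
    return n_results > MAX_VISIBLE


urls = list(page_urls('"open hardware"', 2020, 25))
print(len(urls))
print(urls[-1])
```

For the "slowly" part, Scrapy has a `DOWNLOAD_DELAY` setting that spaces out consecutive requests; a suitable value would have to be found by trial.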

One more thing we learned is that GS allows searches like "open * hardware", which finds "open source hardware", "open science hardware", "open access hardware" and "open loop hardware", but does not find "open hardware".

Removing terms (that is, running a search that keeps all of these matches but one, e.g. "open source hardware" and "open science hardware" but not "open loop hardware") can be done with a syntax like this (the order of the terms matters): "open * hardware" -loop
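A small helper can assemble such query strings while keeping the quoted phrase first and the exclusions after it. The function name is made up for illustration, and the sketch only builds the string; it does not check how Google Scholar interprets it.

```python
def wildcard_query(phrase: str, excluded=()) -> str:
    """Quote a wildcard phrase first, then append one -term per excluded word."""
    parts = [f'"{phrase}"']
    parts.extend(f"-{word}" for word in excluded)
    return " ".join(parts)


print(wildcard_query("open * hardware", ["loop"]))
```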

amchagas · 2021-06-30T21:14:54Z

Here is some new data. All files are "JSON Lines" files that can be opened with pandas, as can be seen in the code linked below.

And here is some new code.
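For readers unfamiliar with the format: a JSON Lines file holds one JSON object per line. With pandas it can be loaded with `pandas.read_json(path, lines=True)`. The sketch below does the same with only the standard library; the records and their fields are hypothetical, since the real files' fields are not described in this thread.

```python
import io
import json


def read_json_lines(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


# Hypothetical records standing in for a scraped file.
sample = io.StringIO(
    '{"title": "Paper A", "year": 2019}\n'
    '{"title": "Paper B", "year": 2020}\n'
)
records = read_json_lines(sample)
print(len(records), records[0]["title"])
```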

I think it is time for a chat again...

solstag · 2021-07-01T16:39:13Z

Ni! Let's chat chat chat. This week and early next week I am busy with responding to a reviewer. Maybe Thursday or Friday next week?

amchagas · 2021-07-02T07:49:28Z

Works for me! I'll DM you for time details

solstag · 2021-10-06T15:32:12Z

Note on how to match gscholar entries with wos entries given our new powers of querying the wos api: "it lets us pull the titles, but we need to improve the search input with elements beyond the title (e.g., the first author of the gscholar result and maybe the year) and the logic for choosing the best result in the output (e.g., check that the gscholar first author appears in the wos author list, check that the year is close, ± 1)"

solstag · 2021-11-09T13:54:30Z

85ed621 implements querying with the author name and year passed along, then checking title, author name and year among the matching records, with tolerance for small errors and logging of accepted non-exact matches for later manual verification.
There are still some corners to polish, but I suspect we should be able to get the WoS records for pretty much every scraped entry that has one now.
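To illustrate the kind of check described here, the following is a sketch and not the code in 85ed621. It assumes title similarity is measured with the standard library's `difflib.SequenceMatcher`, that the first author is matched by surname, and that the year may differ by one. The record fields, thresholds and function names are all hypothetical.

```python
from difflib import SequenceMatcher


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial differences are ignored."""
    return " ".join(text.lower().split())


def title_similarity(a: str, b: str) -> float:
    """Similarity ratio between two normalized titles, from 0.0 to 1.0."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def classify_match(gs: dict, wos: dict, min_title=0.9, year_tolerance=1) -> str:
    """Return 'exact', 'approximate' (to log for manual review) or 'reject'."""
    surname = normalize(gs["first_author"]).split()[-1]
    author_ok = any(surname in normalize(name) for name in wos["authors"])
    year_gap = abs(gs["year"] - wos["year"])
    similarity = title_similarity(gs["title"], wos["title"])
    if not author_ok or year_gap > year_tolerance or similarity < min_title:
        return "reject"
    if similarity == 1.0 and year_gap == 0:
        return "exact"
    return "approximate"


# Hypothetical records, only to exercise the three outcomes.
gs_entry = {"title": "Open hardware for science", "first_author": "A Silva", "year": 2020}
wos_same = {"title": "Open Hardware for Science", "authors": ["Silva, A.", "Souza, B."], "year": 2020}
wos_next_year = dict(wos_same, year=2021)
wos_far_year = dict(wos_same, year=2023)

for candidate in (wos_same, wos_next_year, wos_far_year):
    print(classify_match(gs_entry, candidate))
```

Records classified as approximate are the ones that would go into the log for later manual verification.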

solstag · 2021-11-10T16:48:10Z

Currently running on one of my lab's servers to produce the full WoS records database from the Google Scholar entries (=

solstag referenced this issue Sep 1, 2021

Add get_wos_records_from_api.py

ec1db4f

amchagas closed this as completed Jan 15, 2024
