New approach to data collection (using Google Scholar then WoS then Unpaywall) #10

Closed
solstag opened this issue May 7, 2021 · 9 comments

Comments

@solstag
Collaborator

solstag commented May 7, 2021

Ni! Here's a new approach that @amchagas and I put together earlier today. It follows from our conclusion (see #8) that one can't easily tell automatically from the abstract whether a paper is about open hardware. Our previous approaches (WoS+Scopus+SciELO) were therefore very limiting, since they don't permit full-text search.

The new approach consists of:

  1. Search Google Scholar for "open hardware" or "open-source hardware" (or something similar), saving all result pages and extracting the article data. That will be about 20k results; at 10 results per page, that makes about 2k requests.

    Here's an example of a proper (i.e. using Scrapy) Google Scholar scraper: Build Your Own Google Scholar API With Python Scrapy

  2. Using the titles and whatever other metadata is available (first author, year), do a WoS/Scopus search to get each article's full metadata and DOI, and to confirm whether it falls within our scope (published journal and conference papers?).

  3. As long as we process the full-text contents manually, we can also fetch the full text manually, as that adds very little overhead. If we automate, then get the full text through Google Scholar, Unpaywall, or, if necessary, Sci-Hub.
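
To make the request count in step 1 concrete, here is a minimal sketch of how the result-page URLs could be enumerated. It assumes the usual Google Scholar URL parameters `q` and `start`, and the helper name `page_urls` is made up for illustration; it is not the scraper from the linked tutorial.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"  # assumed search endpoint
RESULTS_PER_PAGE = 10  # page size mentioned in step 1


def page_urls(query, total_results):
    """Yield one result-page URL per block of RESULTS_PER_PAGE hits."""
    n_pages = -(-total_results // RESULTS_PER_PAGE)  # ceiling division
    for page in range(n_pages):
        params = {"q": query, "start": page * RESULTS_PER_PAGE}
        yield f"{BASE_URL}?{urlencode(params)}"


urls = list(page_urls('"open hardware"', 20_000))
print(len(urls))  # number of requests needed for ~20k results
print(urls[1])    # second result page
```

This only builds URLs; rate limiting, retries, and parsing are left to the scraper itself.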

@amchagas
Owner

amchagas commented Jun 8, 2021

OK, after a long time without managing to work on this, I looked at the link showing how to build a spider using Scrapy and spent a bit of time trying to download things. The first query I managed was "open hardware", and the CSV file can be found here:

https://github.com/amchagas/open-hardware-supply/blob/master/data/scrapy/openhardware_query.csv

I still have to check the contents of the data in more detail. Maybe we could have a chat about where to take this?

@solstag
Collaborator Author

solstag commented Jun 13, 2021

Cool! I agree we should use a more restrained query when searching the full text. Probably only things we are really sure refer to what we want, like combinations of ("open", "open-source") with ("hardware", "science hardware", "scientific hardware"). This way we can argue for a well-defined scope tied to the usage of the term in the full text. Let me know when is a good time for a chat!

@amchagas
Owner

amchagas commented Jun 26, 2021

A little update:
We learned the hard way that Google Scholar only displays the first 1000 results for any search term. So the search "open hardware" gives 14800 hits, but only the first 1000 are displayed.

This complicates things, as we are then very far from a data collection that is representative of all the papers that have "open hardware" somewhere in the body.

We are trying other things to see whether we can break down the search terms so that each one returns fewer than 1000 references, and scrape things "slowly".
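
One possible way to break a search down, sketched here purely as an illustration, is to bisect the publication-year range until every slice fits under the display cap. The callback `count_hits` is hypothetical: it stands for whatever code reports the hit count of a query restricted to a year range (Google Scholar URLs accept `as_ylo` and `as_yhi` for that). The per-year numbers below are invented for the example.

```python
MAX_VISIBLE = 1000  # display cap described above


def split_by_year(count_hits, query, year_lo, year_hi):
    """Return (lo, hi) year ranges whose hit counts fit under the cap.

    A single year that still exceeds the cap is returned as-is, since it
    cannot be split further by year alone.
    """
    hits = count_hits(query, year_lo, year_hi)
    if hits <= MAX_VISIBLE or year_lo == year_hi:
        return [(year_lo, year_hi)]
    mid = (year_lo + year_hi) // 2
    return (split_by_year(count_hits, query, year_lo, mid)
            + split_by_year(count_hits, query, mid + 1, year_hi))


# Invented per-year hit counts standing in for real requests.
fake_hits = {2015: 400, 2016: 500, 2017: 700, 2018: 900}


def fake_count(query, lo, hi):
    return sum(n for year, n in fake_hits.items() if lo <= year <= hi)


slices = split_by_year(fake_count, '"open hardware"', 2015, 2018)
print(slices)
```

Each returned slice can then be scraped as its own query. Splitting by year is an assumption here, not necessarily the approach taken in this repository.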

One more thing we learned is that GS allows searches like "open * hardware", which finds "open source hardware", "open science hardware", "open access hardware" and "open loop hardware", but does not find "open hardware".

Excluding terms (that is, a search that matches every variant but one, e.g. "open source hardware" or "open science hardware" but not "open loop hardware") can be done with a syntax like this (the order of the terms matters): "open * hardware" -loop
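
As a small illustration of that syntax, the query string can be assembled with the quoted phrase first and the excluded words after it. The function name `wildcard_query` is made up for this sketch.

```python
def wildcard_query(phrase, exclude=()):
    """Build a query: the quoted phrase first, then -word exclusions."""
    parts = [f'"{phrase}"'] + [f"-{term}" for term in exclude]
    return " ".join(parts)


query = wildcard_query("open * hardware", exclude=["loop"])
print(query)  # "open * hardware" -loop
```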

@amchagas
Owner

Here is some new data. All files are JSON Lines files that can be opened with pandas, as can be seen in the code linked below.

And here is some new code.
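
For anyone unfamiliar with the format: a JSON Lines file holds one JSON object per line. The sketch below reads such data with the standard library only; the sample records and field names are invented. With pandas, `pandas.read_json(path, lines=True)` loads the same kind of file into a DataFrame.

```python
import io
import json


def read_json_lines(handle):
    """Parse one JSON object per non-empty line."""
    return [json.loads(line) for line in handle if line.strip()]


# Invented sample standing in for one of the data files.
sample = io.StringIO(
    '{"title": "An open hardware example", "year": 2019}\n'
    '{"title": "Another example", "year": 2020}\n'
)
records = read_json_lines(sample)
print(len(records), records[0]["title"])
```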

I think it is time for a chat again...

@solstag
Collaborator Author

solstag commented Jul 1, 2021

Ni! Let's chat chat chat. This week and early next week I am busy with responding to a reviewer. Maybe Thursday or Friday next week?

@amchagas
Owner

amchagas commented Jul 2, 2021

Works for me! I'll DM you for time details

@solstag
Collaborator Author

solstag commented Oct 6, 2021

Note on how to match gscholar entries with WoS entries, given our new powers of querying the WoS API: "it lets us pull the titles, but the search input needs to be improved with elements other than the title (e.g., first author of the gscholar result and maybe year), as does the logic for choosing the best result in the output (e.g., check that the gscholar first author is present in the WoS author list, check whether the year is close, ± 1)"

@solstag
Collaborator Author

solstag commented Nov 9, 2021

85ed621 implements querying passing author name and year, and then checking title, author name and year among the matching records, with tolerance for small errors, and logging of accepted non-exact matches for later manual verification.
There are still some corners to polish, but I suspect we should be able to get the WoS records for pretty much every scraped entry that has one now.
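
For readers who want a feel for this kind of matching, here is a rough sketch of the checks described above (title, first author, year within ± 1, and flagging non-exact matches). It is not the code from 85ed621: the record fields, the 0.9 title threshold, and the function names are all assumptions made for illustration.

```python
from difflib import SequenceMatcher

TITLE_THRESHOLD = 0.9  # assumed tolerance for small title differences


def normalize(text):
    """Lowercase and collapse whitespace."""
    return " ".join(text.lower().split())


def title_similarity(a, b):
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def is_match(gs, wos):
    """Check title, first-author surname, and year (within one year)."""
    title_ok = title_similarity(gs["title"], wos["title"]) >= TITLE_THRESHOLD
    name_parts = normalize(gs["first_author"]).split()
    surname = name_parts[-1] if name_parts else ""
    author_ok = bool(surname) and any(
        surname in normalize(author) for author in wos["authors"]
    )
    year_ok = abs(gs["year"] - wos["year"]) <= 1
    return title_ok and author_ok and year_ok


def best_match(gs, candidates):
    """Return (record, exact); exact=False marks a match to verify by hand."""
    matches = [c for c in candidates if is_match(gs, c)]
    if not matches:
        return None, False
    best = max(matches, key=lambda c: title_similarity(gs["title"], c["title"]))
    exact = (normalize(best["title"]) == normalize(gs["title"])
             and best["year"] == gs["year"])
    return best, exact


# Invented records for illustration only.
gs_entry = {"title": "Open hardware for science", "first_author": "A Silva", "year": 2020}
wos_records = [
    {"title": "Open Hardware for Science", "authors": ["Silva, A.", "Souza, B."], "year": 2021},
    {"title": "Closed software for art", "authors": ["Lee, C."], "year": 2020},
]
match, exact = best_match(gs_entry, wos_records)
print(match["title"], exact)
```

Here the first record is accepted but flagged as non-exact because the year is off by one, which is the kind of case that would be logged for manual verification.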

@solstag
Collaborator Author

solstag commented Nov 10, 2021

Currently running on one of my lab's servers to produce the full WoS records database from the Google Scholar entries (=


2 participants