New approach to data collection (using Google Scholar then WoS then Unpaywall) #10
Comments
Ok, after a long time without managing to work on this, I looked at the link showing how to build a spider using Scrapy and spent a bit of time trying to download things... The first query I managed was "open hardware", and the CSV file can be found here. I still have to check the content of the data in more detail, and maybe we could have a chat on where to take this?
Cool! I agree we should use a more restrained query when searching the full text. Probably only terms we are really sure refer to what we want, like combinations of ("open", "open-source") with ("hardware", "science hardware", "scientific hardware"); this way we can argue for a well-defined scope tied to the usage of the term in the full text. Let me know when is a good time for a chat!
A little update: This makes things complicated, as we are then very far from a data collection that is representative of all the papers that have "open hardware" somewhere in the body. We are trying other things to see if we can break down the search terms so that each query returns fewer than 1000 references, and scrape things "slowly". One more thing learned is that GS allows searches like "open * hardware", which finds "open source hardware", "open science hardware", "open access hardware" and "open loop hardware", but does not find "open hardware". Removing terms (that is, a search that matches all of these but one, i.e. "open source hardware" and "open science hardware" but not "open loop hardware") can be done with a syntax like this (the order of terms matters): "open * hardware" -loop
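To make the query syntax above easier to reuse while splitting searches into smaller ones, here is a minimal sketch of a helper that assembles such query strings. The function name `build_query` and its parameters are made up for illustration; it only builds the text of the query and does not contact Google Scholar.

```python
def build_query(phrase, exclude=()):
    """Build a query string: the quoted phrase first, then -term exclusions.

    The phrase may contain the * wildcard, e.g. "open * hardware".
    The quoted phrase is kept first because the order of terms matters.
    """
    parts = ['"{}"'.format(phrase)]
    parts.extend("-{}".format(term) for term in exclude)
    return " ".join(parts)


# Wildcard search that drops the "open loop hardware" hits.
query = build_query("open * hardware", exclude=["loop"])
```

Since the wildcard form does not find plain "open hardware", that phrase would still need its own query, e.g. `build_query("open hardware")`.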
Ni! Let's chat chat chat. This week and early next week I am busy with responding to a reviewer. Maybe Thursday or Friday next week?
Works for me! I'll DM you for time details
Note on how to match gscholar entries with wos entries given our new powers of querying the wos api: "it lets us pull the titles, but the search input needs to be improved with other elements besides the title (e.g., first author of the gscholar result and maybe the year), as does the logic for choosing the best result in the output (e.g., check that the gscholar first author is present in the wos author list, check that the year is close, ± 1)"
85ed621 implements querying passing author name and year, and then checking title, author name and year among the matching records, with tolerance for small errors, and logging of accepted non-exact matches for later manual verification.
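For readers who want the gist of that matching logic without opening the commit, here is a rough sketch of the idea. It is not the code from 85ed621: the function names, the dictionary keys (`title`, `first_author`, `authors`, `year`) and the thresholds are all assumptions chosen for illustration, and string similarity is computed with the standard library's `difflib`.

```python
import logging
import unicodedata
from difflib import SequenceMatcher

log = logging.getLogger("gs_wos_match")  # hypothetical logger name


def normalize(text):
    """Lowercase, strip accents and collapse whitespace before comparing."""
    decomposed = unicodedata.normalize("NFKD", text)
    no_accents = "".join(c for c in decomposed if not unicodedata.combining(c))
    return " ".join(no_accents.lower().split())


def similarity(a, b):
    """Similarity ratio in [0, 1] between two normalized strings."""
    return SequenceMatcher(None, normalize(a), normalize(b)).ratio()


def best_match(gs, candidates, title_min=0.9, author_min=0.8, year_tol=1):
    """Pick the WoS candidate that best matches a Google Scholar entry.

    Returns (candidate, exact); (None, False) when nothing is acceptable.
    Thresholds are illustrative guesses, not tuned values.
    """
    best, best_score = None, 0.0
    for cand in candidates:
        # The year must be close (within +/- year_tol).
        if abs(cand["year"] - gs["year"]) > year_tol:
            continue
        # The GS first author must appear somewhere in the WoS author list.
        author_score = max(
            (similarity(gs["first_author"], name) for name in cand["authors"]),
            default=0.0,
        )
        title_score = similarity(gs["title"], cand["title"])
        if (author_score >= author_min and title_score >= title_min
                and title_score > best_score):
            best, best_score = cand, title_score
    if best is None:
        return None, False
    exact = (normalize(best["title"]) == normalize(gs["title"])
             and best["year"] == gs["year"])
    if not exact:
        # Accepted but not exact: log it for later manual verification.
        log.warning("non-exact match: %r -> %r", gs["title"], best["title"])
    return best, exact
```

Author names are compared here as plain strings, so in practice both sides would probably need to be reduced to the same form (for example surname only) before calling `best_match`.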
Currently running on one of my lab's servers to produce the full WoS records database from the Google Scholar entries (=
Ni! Here's a new approach that @amchagas and I put together earlier today. It follows from our conclusion (see #8) that one can't easily tell automatically from the abstract whether a paper is about open hardware, so our previous approaches (WoS+Scopus+SciELO) were very limiting, since they don't permit full-text search.
The new approach consists of:
1. Search Google Scholar for "open hardware" or "open-source hardware" (or something similar), saving all result pages and extracting article data. That will be about 20k results; with 10 results per page, it makes for 2k requests. Here's an example of a proper (i.e. using Scrapy) Google Scholar scraper: Build Your Own Google Scholar API With Python Scrapy
2. Using the titles and whatever metadata is available (first author, year), do a WoS/Scopus search to get their full metadata and DOI, and to confirm whether each article falls into our scope (published journal and conference papers?).
3. As long as we process the full-text contents manually, we can also get the full text manually, as that adds very little overhead. If we're automating, then get the full text through Google Scholar, Unpaywall, or Sci-hub if necessary.
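As a quick check of the request count in the first step, here is a small sketch that lists the result-page URLs a scraper would have to visit. It assumes the usual Google Scholar URL pattern with `q` and `start` query parameters; `page_urls` is a made-up helper name, and the code only builds URLs without sending any request.

```python
from urllib.parse import urlencode

BASE_URL = "https://scholar.google.com/scholar"  # assumed results endpoint


def page_urls(query, total_results, per_page=10):
    """Return one URL per result page needed to cover total_results hits."""
    n_pages = -(-total_results // per_page)  # ceiling division
    urls = []
    for page in range(n_pages):
        # 'start' is the zero-based offset of the first result on the page.
        params = urlencode({"q": query, "start": page * per_page})
        urls.append(BASE_URL + "?" + params)
    return urls


# About 20k results at 10 per page gives 2k requests.
urls = page_urls('"open hardware"', total_results=20000)
```

In a Scrapy spider these URLs could be yielded as requests one at a time with a download delay, which fits the "slow" scraping mentioned in the comments.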