Integrate Plos Collection Data Dump in sources #8
I worked on the original file that lives in https://github.com/amchagas/open-source-toolkit. Namely, I classified all entries as hardware, software or both... Now it should be easy to filter out only the hardware ones. Sorry, I am only now seeing that you are working on the data. Would you mind giving a little more detail on what you have done so far? |
ACK, since it's such a small file, I'll update the code to download it from that repository instead of keeping a copy. We'll talk details on Monday (: |
Ni! André (@amchagas), when you find the time, can you work a bit more on the
PS: I think I've explained the situation in DM, but if you wanna see what I'm seeing check out b2cab27 |
Hi @solstag! I just worked a bit more on the csv file; it is uploaded to the repo.
To keep track of which entries were missing DOIs, I added a new column "missing DOIs" and added the missing information there. I managed to get 32 DOIs, but in the process I noticed that there were some misclassified entries. For instance, some things that should have been classified as "web articles" were classified as "research articles" (the page showing the GOSH manifesto was one of them). I counted 14 of these misclassifications (note that this number also includes the reverse case: research articles classified as web articles). From what I remember, this classification came from the PLoS people, so I am not sure whether there are more of these cases in there. For the cases that have a new DOI, I found that some of them are not found by our query, some are found, and some are not found because of the hyphen: the papers spell "open-hardware" and not "open hardware" or "open-source hardware", etc. Maybe we would gain from adding these variants to the query. The same is true for the DOIs from PLOS ONE you listed above. (Also, these DOIs, in their current format, do not lead to the article webpage when I paste them into the address bar of my browser.) |
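On the DOIs that do not open from the address bar: a bare DOI is an identifier, not a URL, so it needs the `https://doi.org/` resolver prefix. Below is a minimal sketch of that conversion, assuming the csv stores DOIs either bare, with a `doi:` label, or already as resolver links. The function name `doi_to_url` and the example DOI are hypothetical and not taken from the csv.

```python
import re

def doi_to_url(raw):
    """Turn a DOI in one of a few common spellings into a resolvable URL."""
    doi = raw.strip()
    # Drop a leading "doi:" label or an existing doi.org resolver prefix.
    doi = re.sub(r"^(?:doi:\s*|https?://(?:dx\.)?doi\.org/)", "", doi,
                 flags=re.IGNORECASE)
    # Every DOI starts with the directory indicator "10."
    if not doi.startswith("10."):
        raise ValueError(f"not a DOI: {raw!r}")
    return "https://doi.org/" + doi

# Hypothetical DOI, used only to show the three accepted spellings.
for raw in ("10.1371/journal.pone.0000000",
            "doi:10.1371/journal.pone.0000000",
            "https://dx.doi.org/10.1371/journal.pone.0000000"):
    print(doi_to_url(raw))
```

All three spellings end up as the same link, which can then be pasted into a browser or stored in an extra csv column.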
Ni! Cool, excellent, thanks! Hm, I would be surprised if Scopus or WOS would not find "open-hardware" from "open hardware", I was assuming they treat hyphens as spaces. Did you confirm that? |
ok, so you are correct, both databases manage differences between "open hardware" and "open-hardware" and so does Scielo |
I forgot to follow up on my comment: (Could be that the ones that are not found are not indexed in those databases somehow?) I guess the question is, do we want to do a deep dive in this issue, or do we acknowledge it exists and move on? |
The issue is that these open hardware papers may not use any terms to designate open hardware in their title+abstracts+keywords, or we might not have the right terms. The idea would be, for the 10, to make a table with:
|
Some results are in :P Other than that, one of the papers does not actually share the data needed for replication and is more focused on a biological question than on the description of a tool or method. Two other papers mention likely keywords only in the introduction, they are One paper has no mention of open source whatsoever, even though code and design files are nicely placed in GH. None of these have our terms in the title and/or abstract. I have saved a table with this info under the |
A quick look at WOS shows that "open source design" returns about 100 entries. Some are not hardware, but there are also some hardware articles we did not find before. A possible keyword combination is |
Ni! Ok, I've checked the table. It answers question 3 (which other terms it uses to indicate that it deals with OH), but it doesn't answer questions 1 and 2 (which of our terms are present / whether it would be found by our query or not). So we still don't know whether our current search would catch those. Or am I missing something? Cheeeers |
You are not missing anything. I failed to write down specifically: |
So, basically, you're saying that none of those 10 papers would show up in our current search? Well that's pretty bad. EDIT: sorry, paying more attention to the keywords (and not only the observations) I guess two of them would show up because they contain "open hardware". It's still bad though. The upside is that by adding "open source design" we'd cover half of them. |
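Questions 1 and 2 (which of our terms are present, and whether the query would find the paper) can be checked mechanically per record. The sketch below is only an approximation: it assumes each paper is a dict with `title`, `abstract` and `keywords` strings, and it mimics the databases' behaviour of treating hyphens as spaces. The function names and the sample record are hypothetical.

```python
def normalize(text):
    # Lowercase, treat hyphens as spaces, collapse runs of whitespace.
    return " ".join(text.lower().replace("-", " ").split())

def matching_terms(record, terms):
    """Return the query terms that occur in title+abstract+keywords."""
    searchable = normalize(" ".join(
        record.get(field, "") for field in ("title", "abstract", "keywords")
    ))
    # Plain substring match: a rough stand-in for a database phrase search.
    return [term for term in terms if normalize(term) in searchable]

# Hypothetical record, not one of the 10 sampled papers.
paper = {
    "title": "An open-source design for a low-cost microscope",
    "abstract": "We describe the build and share all files.",
    "keywords": "microscopy; 3D printing",
}
current_terms = ["open hardware", "open source hardware"]
print(matching_terms(paper, current_terms))
print(matching_terms(paper, current_terms + ["open source design"]))
```

An empty list means the record would be missed on title+abstract+keywords. Running the same check with a candidate term added shows how much a query change would recover.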
I think it would be useful to check again how many papers in the plos collection are not in our query. I mean, do these 10 selected papers represent a big number of papers that are not caught by our query? Or was it more of a fluke? |
This was a random sample of "papers from the plos collection having plos DOIs" (:X). So it should be somewhat representative of the plos papers. I'm going to directly check our current data for all the DOIs in X, and see what we find in those papers absent from it. But it's clear we are leaving a lot of - and still possibly most! - papers out. |
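The direct check described here comes down to a set comparison on DOIs. A minimal sketch follows, assuming both sides are available as lists of DOI strings; the function names and the DOIs are hypothetical, and the loading code is left out.

```python
def normalize_doi(doi):
    # DOIs are case-insensitive, so compare them lowercased and trimmed.
    return doi.strip().lower()

def coverage(collection_dois, biblio_dois):
    """Split the collection DOIs into those present in and absent from our data."""
    collection = {normalize_doi(d) for d in collection_dois}
    biblio = {normalize_doi(d) for d in biblio_dois}
    found = collection & biblio
    missing = collection - biblio
    share = len(found) / len(collection) if collection else 0.0
    return found, missing, share

# Hypothetical DOIs, for illustration only.
plos_collection = [
    "10.1371/journal.pone.0000001",
    "10.1371/journal.pone.0000002",
    "10.1371/JOURNAL.PONE.0000003",
]
our_data = ["10.1371/journal.pone.0000003", "10.1016/j.example.0000"]

found, missing, share = coverage(plos_collection, our_data)
print(f"{len(found)} found, {len(missing)} missing, {share:.0%} covered")
```

The `missing` set is the list of papers to inspect by hand for the terms they use instead of ours.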
Ni! Done:
The conclusion is that these numbers make it hard to argue that BIBLIO is representative. We can try to add "design" to the query and see how this improves, given that in our sample of 10 so many used that. I'm committing that change to the query generator in project_definitions.py. Is it too much work to regenerate the RIS files? But I'm not very confident that it will improve the situation enough. I'd hope to find at least half of the PLOSC stuff in BIBLIO. I'm trying to think of what else we can do. Maybe we could add "open source method" and "open source electronics" like you suggested. Or include "open source tools" but conditioned on the record also mentioning "hardware" or "electronics". Cheers! |
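To make the conditional idea concrete, here is a sketch of how such a query string could be assembled. It is not the generator in project_definitions.py: the function names are hypothetical, and the output is a generic boolean expression that would still need each database's own field tags and syntax.

```python
def quote(term):
    return '"' + term + '"'

def any_of(terms):
    # Parenthesized OR group of quoted phrases.
    return "(" + " OR ".join(quote(t) for t in terms) + ")"

def build_query(plain_terms, conditional_terms=(), required_context=()):
    """Plain terms match on their own; conditional terms only count
    when the record also mentions one of the context words."""
    parts = [any_of(plain_terms)]
    if conditional_terms:
        parts.append(
            "(" + any_of(conditional_terms) + " AND " + any_of(required_context) + ")"
        )
    return " OR ".join(parts)

query = build_query(
    plain_terms=["open hardware", "open source hardware", "open source design"],
    conditional_terms=["open source tools"],
    required_context=["hardware", "electronics"],
)
print(query)
```

Keeping the plain and conditional terms in separate lists makes it easy to try "open source method" or "open source electronics" in either group and compare the resulting counts.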
Just did a new search using the new definitions for Scopus, WOS and Scielo. |
Ni! There's something wrong with the new files. Files "scielo.ciw" and "wos1-500.ciw" are the same. Can you check? |
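A duplicated export like this is easy to catch before loading the data by comparing content hashes. The sketch below assumes the exports sit in one directory; the demo writes throwaway files to a temporary directory, reusing the two file names from this thread plus a hypothetical third one.

```python
import hashlib
import tempfile
from collections import defaultdict
from pathlib import Path

def file_digest(path):
    # SHA-256 of the raw bytes: identical files give identical digests.
    return hashlib.sha256(Path(path).read_bytes()).hexdigest()

def find_duplicates(paths):
    """Group files whose contents are byte-for-byte identical."""
    by_digest = defaultdict(list)
    for path in paths:
        by_digest[file_digest(path)].append(str(path))
    return [sorted(group) for group in by_digest.values() if len(group) > 1]

# Demo on throwaway files; "other.ciw" is a made-up name.
with tempfile.TemporaryDirectory() as tmp:
    folder = Path(tmp)
    (folder / "scielo.ciw").write_text("same export\n")
    (folder / "wos1-500.ciw").write_text("same export\n")
    (folder / "other.ciw").write_text("different export\n")
    dupes = find_duplicates(sorted(folder.glob("*.ciw")))
    dupe_names = [[Path(p).name for p in group] for group in dupes]

print(dupe_names)
```

An empty result means no two export files share the same content.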
Oops! Should be fixed now... |
Ni!
So I checked with the earlier data and it seems we weren't getting 7 and 10 before either. Then I went looking into the contents of the articles, and it turns out you marked them using the full-text. The problem is that Scopus and Wos, and probably Scielo as well, only let us search on the title+abstract+keywords.
That could be a different approach, less usual, but perhaps easier. We can say that we limit ourselves to open access publications available through CORE because searching the full text is more reliable, even if CORE coverage is messy to define. We'll miss both paywalled and OA content not in CORE. I checked and Sensors is consistently there because it's in PMC, PLOS seems to depend on institutional repositories but does get indexed full text; HardwareX papers can be found but it's worse as it denies full text indexing.
So, do you think you can play some more with the searches, maybe we want to include some more search statements like
Hugs, |
Two perspectives:
In doing this I may be tempted to refactor data loading a little bit.