
Integrate Plos Collection Data Dump in sources #8

Open
solstag opened this issue Feb 3, 2021 · 20 comments
@solstag
Collaborator
solstag commented Feb 3, 2021

Two perspectives:

  1. ensure we've got all articles in this collection covered
  2. together with the other sources, produce a larger collection for https://github.com/amchagas/open-source-toolkit

In doing this I may be tempted to refactor data loading a little bit.

@amchagas
Owner

I worked on the original file that lives in https://github.com/amchagas/open-source-toolkit. Namely, I classified all entries as hardware, software, or both... Now it should be easy to filter out only the hardware ones.

Sorry, I am only now seeing that you are working on the data.. would you mind giving a bit more detail on what you have done so far?

@solstag
Collaborator Author

solstag commented Feb 18, 2021

ACK, since it's such a small file, I'll update the code to download it from that repository instead of keeping a copy. We'll talk details on Monday (:

@solstag
Collaborator Author

solstag commented Mar 3, 2021

Ni! André (@amchagas), when you find the time, can you work a bit more on the plos-items.csv? Specifically:

  • Most addresses in the DOI field are already DOIs or very close to a DOI (like plos URLs that contain the DOI), only 46 are not. Can you check if those 46 have DOIs on the page in the URLs, and in that case replace them by the DOIs?

  • As you do that, can you take note of whether our search query would match those articles (title+abstract+keywords)?

  • Also, here's a sample of 10 pone dois from the collection, can you also check if they'd get picked by our search query?

index  doi
350    10.1371/journal.pone.0187219
405    10.1371/journal.pone.0059840
407    10.1371/journal.pone.0030837
393    10.1371/journal.pone.0118545
310    10.1371/journal.pone.0206678
388    10.1371/journal.pone.0143547
295    10.1371/journal.pone.0220751
398    10.1371/journal.pone.0107216
281    10.1371/journal.pone.0226761
338    10.1371/journal.pone.0193744
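
To check programmatically whether a record like these would be picked up, a small matcher over title+abstract+keywords could look like the following. This is only a sketch: `QUERY_TERMS` is a hypothetical subset, the real term list lives in project_definitions.py.

```python
import re

# Hypothetical subset of our query terms; the real list lives in
# project_definitions.py.
QUERY_TERMS = ["open source hardware", "open-source hardware", "open hardware"]

def matches_query(title, abstract, keywords=(), terms=QUERY_TERMS):
    """Return True if any query term appears in the title, abstract, or
    keywords, treating hyphens as spaces the way the databases seem to."""
    haystack = " ".join([title, abstract, *keywords]).lower()
    haystack = haystack.replace("-", " ")
    return any(term.lower().replace("-", " ") in haystack for term in terms)

# A record using the hyphenated spelling still matches:
print(matches_query(
    "An open-hardware platform for physiology",
    "We present a low cost device with full build instructions.",
))  # True
```

This only simulates a title+abstract+keywords field search locally; it says nothing about whether a given database actually indexes the record.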

PS: I think I've explained the situation in DM, but if you wanna see what I'm seeing check out b2cab27
.~´

@amchagas
Owner

amchagas commented Mar 4, 2021

Hi @solstag !

just worked a bit more on the csv file, it is uploaded to the repo.

  • Most addresses in the DOI field are already DOIs or very close to a DOI (like plos URLs that contain the DOI), only 46 are not. Can you check if those 46 have DOIs on the page in the URLs, and in that case replace them by the DOIs?

To keep track of which entries were missing DOIs, I added a new column "missing DOIs" and added the missing information there. I managed to get 32 DOIs, but in the process I noticed that some entries were misclassified. For instance, some things that should have been classified as "web articles" were classified as "research articles" (the page showing the GOSH manifesto was one of them). I counted 14 of these misclassifications (but note that this number also includes cases the other way around: research articles classified as web articles). From what I remember, this classification came from the PLoS people, so I am not sure if there are more such cases in there.

For the cases that have a new DOI, I found that some of them are found by our query and some are not, and some are missed because of hyphenation: the papers spell "open-hardware" rather than "open hardware" or "open-source hardware", etc. Maybe we would gain from adding these hyphenated variants to the query.

The same is true for the DOIs from PLoS ONE you listed above. (Also, these DOIs, in their current format, do not lead to the article webpage when I paste them into my browser's address bar.)

@solstag
Collaborator Author

solstag commented Mar 4, 2021

Ni! Cool, excellent, thanks!

Hm, I would be surprised if Scopus or WoS did not find "open-hardware" when searching for "open hardware"; I was assuming they treat hyphens as spaces. Did you confirm that?

@amchagas
Owner

amchagas commented Mar 4, 2021

OK, so you are correct: both databases treat "open hardware" and "open-hardware" as equivalent, and so does SciELO

@amchagas
Owner

I forgot to follow up on my comment:
If hyphenation is not the issue for some of them getting found and some not, then what could be the issue? I could not think of anything obvious...

(Could be that the ones that are not found are not indexed in those databases somehow?)

I guess the question is, do we want to do a deep dive in this issue, or do we acknowledge it exists and move on?

@solstag
Collaborator Author

solstag commented Mar 12, 2021

The issue is that these open hardware papers may not use any terms to designate open hardware in their title+abstracts+keywords, or we might not have the good terms.

The idea would be, for those 10 papers, to build a table with:

  1. which of our terms are present
  2. whether it would be found by our query or not
  3. which other terms it uses to indicate that it is about OH

@amchagas
Owner

amchagas commented Mar 15, 2021

some results are in :P
From the ten papers, I found some keywords that we might want to test:
3 papers use "open source design" (~100 hits at WoS)
1 paper uses "open source method" (50 hits at WoS)
1 paper uses "open source tool" (1201 hits at WoS, including quite a few software papers)
1 paper uses "open source electronics" (62 hits at WoS)
1 paper uses "inexpensive hardware" (using this could lead to a lot of papers that describe affordable solutions but not necessarily open source ones)

Other than that, one of the papers does not actually share the data needed for replication and is more focused on a biological question than on describing a tool or method.

Two other papers mention likely keywords only in the introduction: "open hardware design" and "open source hardware".

One paper makes no mention of open source whatsoever, even though its code and design files are nicely placed on GH.

None of these have our terms in the title and/or abstract.

I have saved a table with this info under the /data folder


@amchagas
Owner

A quick look at WoS shows that "open source design" returns ~100 entries. Some are not hardware, but there are also some hardware articles we did not find before.

A possible keyword combination is "open source 3D printed", which returns 20 articles in WoS, all of them about hardware.

@solstag
Collaborator Author

solstag commented Mar 26, 2021

Ni! Ok, I've checked the table. It answers question 3 (which other terms it uses to indicate that it is about OH), but it doesn't answer questions 1 and 2 (which of our terms are present / whether it would be found by our query or not). So we still don't know whether our current search would catch those. Or am I missing something? Cheeeers

@amchagas
Owner

you are not missing anything. I forgot to spell it out:
In the ods file there is an "observations" column where I made comments on some of the papers, things like "our keywords are only present in the introduction", i.e. not in the abstract, title, or keywords. For the rows with no observations, our keywords were not found in the paper at all. This is because the PLoS collection was built by people either submitting things to it (so the authors knew their paper matched the collection requirements, even though they did not use the keywords we are now looking for) or by us finding papers "by chance"...

@solstag
Collaborator Author

solstag commented Mar 26, 2021

So, basically, you're saying that none of those 10 papers would show up in our current search? Well that's pretty bad.

EDIT: sorry, paying more attention to the keywords (and not only the observations) I guess two of them would show up because they contain "open hardware". It's still bad though. The upside is that by adding "open source design" we'd cover half of them.

@amchagas
Owner

amchagas commented Mar 26, 2021

I think it would be useful to check again how many papers in the PLoS collection are not caught by our query. I mean, are these 10 sampled papers representative of a large number of papers that our query misses, or was it more of a fluke?
In any case, we can make a note of that in the writing and try to assess the size of the issue.
We are probably going to have to go through more entries manually anyway, to check a statistically significant subsample of the papers, so we might learn more when we do that...

@solstag
Collaborator Author

solstag commented Mar 26, 2021

This was a random sample of "papers from the plos collection having plos DOIs" (:X). So it should be somewhat representative of the plos papers. I'm going to directly check our current data for all the DOIs in X, and see what we find in those papers absent from it. But it's clear we are leaving a lot of - and still possibly most! - papers out.

@solstag
Collaborator Author

solstag commented Mar 26, 2021

Ni! Done:

  • recap:
    • there are 706 articles in the bibliographic database query data (BIBLIO).
    • there are 623 articles in BIBLIO that contain a DOI.
    • there are 172 articles in the PLOS Collection (PLOSC).
  • new:
    • there are 126 articles in PLOSC that contain a DOI (either in DOI format or inside a URL, which I then extract).
    • only 19 articles are found both in BIBLIO and PLOSC.
    • if we compare titles, we get 36 titles in both BIBLIO and PLOSC
    • but consistent with the DOI matching, only 20 titles match if we consider titles that have DOIs
      • (there are 20 instead of the 19 for DOI matches because there's a matching title in PLOSC with a bioRxiv DOI).
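
The DOI extraction and overlap counting above can be sketched roughly as follows. The rows are made-up stand-ins for the real BIBLIO and PLOSC records, and the regex assumes the usual "10.prefix/suffix" DOI shape.

```python
import re

# Match a DOI whether it appears bare or embedded in a URL.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

def extract_doi(value):
    """Pull a DOI out of a field that may be a bare DOI or a URL containing one."""
    match = DOI_RE.search(value or "")
    return match.group(0).lower() if match else None

# Made-up example rows standing in for the real BIBLIO and PLOSC records.
biblio = ["10.1371/journal.pone.0187219", "10.3390/s19040894"]
plosc = [
    "https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0187219",
    "10.1101/2020.01.01.000001",
]

biblio_dois = {doi for doi in map(extract_doi, biblio) if doi}
plosc_dois = {doi for doi in map(extract_doi, plosc) if doi}
print(sorted(biblio_dois & plosc_dois))  # ['10.1371/journal.pone.0187219']
```

Lowercasing before intersecting avoids spurious mismatches, since DOIs are case-insensitive.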

The conclusion is that these numbers make it hard to argue that BIBLIO is representative.

We can try to add "design" to the query and see how much this improves things, given that so many in our sample of 10 used that. I'm committing that change to the query generator in project_definitions.py. Is it too much work to regenerate the RIS files?

But I'm not very confident that it will improve the situation enough. I'd hope to find at least half of the PLOSC papers in BIBLIO. I'm trying to think of what else we can do. Maybe we could add "open source method" and "open source electronics" like you suggested, or include "open source tools" conditioned on the record also mentioning "hardware" or "electronics".

Cheers!

@amchagas
Owner

Just did a new search using the new definitions for Scopus, WoS and SciELO..
For Scopus, I exported only "articles", avoiding "proceedings", "book chapters", etc., since the number of entries was quite high and exporting is limited to 2000 entries (I could have exported every type, but would then have to filter, select, etc.).

@solstag
Collaborator Author

solstag commented Mar 31, 2021

Ni! There's something wrong with the new files. Files "scielo.ciw" and "wos1-500.ciw" are the same. Can you check?

@amchagas
Owner

amchagas commented Apr 1, 2021

Oops! Should be fixed now...

@solstag
Collaborator Author

solstag commented Apr 2, 2021

Ni!

  1. Ok, big mistake: it turns out "open design" brings in a lot of unrelated medical literature, so we have to exclude it and keep only the other combinations with "design" : P

  2. I've checked the 10 sampled articles and found something strange: with the new search, we get articles 2, 3 and 6 (your table index). Those are the articles you tagged with "open source design". But we do not get articles 7 and 10, tagged respectively with "open hardware" and "open source hardware". That's weird!

So I checked against the earlier data, and it seems we weren't getting 7 and 10 before either. Then I went looking into the contents of the articles, and it turns out you marked them using the full text. The problem is that Scopus and WoS, and probably SciELO as well, only let us search the title+abstract+keywords.

  3. The only way to search full text at scale is through CORE, but it's restricted to open access sources, mostly from institutional repositories: https://core.ac.uk/

That could be a different approach, less usual but perhaps easier. We can say that we limit ourselves to open access publications available through CORE, because searching the full text is more reliable, even if CORE's coverage is messy to define. We'll miss both paywalled content and OA content not in CORE. I checked: Sensors is consistently there because it's in PMC; PLOS seems to depend on institutional repositories but does get its full text indexed; HardwareX papers can be found, but coverage is worse because the journal denies full-text indexing.
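
A full-text search against CORE could be set up along these lines. The v3 endpoint and parameter names below are assumptions based on CORE's public API documentation, and a real request would additionally need an API key in an Authorization header.

```python
import urllib.parse

# Assumed CORE v3 full-text search endpoint (check CORE's API docs).
CORE_SEARCH = "https://api.core.ac.uk/v3/search/works"

def build_core_query(phrases, limit=100):
    """Build a search URL ORing together a list of exact phrases."""
    q = " OR ".join(f'"{p}"' for p in phrases)
    return CORE_SEARCH + "?" + urllib.parse.urlencode({"q": q, "limit": limit})

url = build_core_query(["open source hardware", "open hardware"])
print(url)
```

The request itself would then be sent with an `Authorization: Bearer <key>` header; only the URL construction is shown here.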

  4. With some recent improvements in the code and the current search, we went from 19 to 28 abstracts present in both BIBLIO and PLOSC. But that's still pretty low. I'm thinking we might want some kind of AI to learn from the abstract whether a paper is introducing open source hardware. We should keep improving our search until we reach about a third of the papers in PLOSC, at which point we may publish something. Later we can use that data to train a machine learning model.

So, do you think you can play some more with the searches? Maybe we want to include some more search statements like ("open source" AND "electronics"), which will catch papers containing both terms even if they're not contiguous, or even "DIY".

  5. In any case, since I've spent some hours improving the ETL, I'm going to ask you to replay the searches with the current new definition, which excludes "open design" (see bbf3cd7), to see what we get after processing. Ok? Also, if you try some other stuff like I suggested in the previous paragraph and you figure it's good, you can add that too. And yes, do export only "articles".
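
Query combinations like the ones discussed here could be generated in the spirit of the generator in project_definitions.py (the actual generator there may differ; this is an illustrative sketch). TITLE-ABS-KEY is Scopus field syntax: exact phrases go in quotes, and AND pairs catch records where both terms appear even if not contiguous.

```python
# Illustrative term sets; the real ones live in project_definitions.py.
phrases = ["open source hardware", "open hardware", "open source design"]
pairs = [("open source", "electronics"), ("open source", "hardware")]

# Exact phrases, plus co-occurrence pairs that need not be contiguous.
clauses = [f'"{p}"' for p in phrases]
clauses += [f'("{a}" AND "{b}")' for a, b in pairs]
query = "TITLE-ABS-KEY(" + " OR ".join(clauses) + ")"
print(query)
```

WoS and SciELO use different field tags (e.g. TS= in WoS), so the wrapper would change per database while the boolean body stays the same.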

Abraço,
