
Integrate Plos Collection Data Dump in sources #8

Open
solstag opened this issue Feb 3, 2021 · 20 comments
@solstag
Collaborator
solstag commented Feb 3, 2021

Two perspectives:

  1. ensure we've got all articles in this collection covered
  2. together with the other sources, produce a larger collection for https://github.com/amchagas/open-source-toolkit

In doing this I may be tempted to refactor data loading a little bit.

@amchagas
Owner

I worked on the original file that lives in https://github.com/amchagas/open-source-toolkit. Namely, I classified all entries as hardware, software, or both... Now it should be easy to filter out only the hardware ones.

Sorry, I am only now seeing that you are working on the data.. would you mind giving a bit more detail on what you have done so far?

@solstag
Collaborator Author

solstag commented Feb 18, 2021

ACK, since it's such a small file, I'll update the code to download it from that repository instead of keeping a copy. We'll talk details on Monday (:

@solstag
Collaborator Author

solstag commented Mar 3, 2021

Ni! André (@amchagas), when you find the time, can you work a bit more on the plos-items.csv? Specifically:

  • Most addresses in the DOI field are already DOIs or very close to a DOI (like plos URLs that contain the DOI), only 46 are not. Can you check if those 46 have DOIs on the page in the URLs, and in that case replace them by the DOIs?

  • As you do that, can you take note of whether our search query would match those articles (title+abstract+keywords)?

  • Also, here's a sample of 10 pone dois from the collection, can you also check if they'd get picked by our search query?

index  doi
350    10.1371/journal.pone.0187219
405    10.1371/journal.pone.0059840
407    10.1371/journal.pone.0030837
393    10.1371/journal.pone.0118545
310    10.1371/journal.pone.0206678
388    10.1371/journal.pone.0143547
295    10.1371/journal.pone.0220751
398    10.1371/journal.pone.0107216
281    10.1371/journal.pone.0226761
338    10.1371/journal.pone.0193744
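
To check programmatically whether a record like these would be picked up, a small matcher over title+abstract+keywords could look like the following. This is only a sketch: `QUERY_TERMS` is a hypothetical subset, the real term list lives in project_definitions.py.

```python
import re

# Hypothetical subset of our query terms; the real list lives in
# project_definitions.py.
QUERY_TERMS = ["open source hardware", "open-source hardware", "open hardware"]

def matches_query(title, abstract, keywords=(), terms=QUERY_TERMS):
    """Return True if any query term appears in the title, abstract, or
    keywords, treating hyphens as spaces the way the databases seem to."""
    haystack = " ".join([title, abstract, *keywords]).lower()
    haystack = haystack.replace("-", " ")
    return any(term.lower().replace("-", " ") in haystack for term in terms)

# A record using the hyphenated spelling still matches:
print(matches_query(
    "An open-hardware platform for physiology",
    "We present a low cost device with full build instructions.",
))  # True
```

This only simulates a title+abstract+keywords field search locally; it says nothing about whether a given database actually indexes the record.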

PS: I think I've explained the situation in DM, but if you wanna see what I'm seeing check out b2cab27
.~´

@amchagas
Owner

amchagas commented Mar 4, 2021

Hi @solstag !

just worked a bit more on the csv file, it is uploaded to the repo.

  • Most addresses in the DOI field are already DOIs or very close to a DOI (like plos URLs that contain the DOI), only 46 are not. Can you check if those 46 have DOIs on the page in the URLs, and in that case replace them by the DOIs?

To keep track of which entries were missing DOIs, I added a new column "missing DOIs" and added the missing information there. I managed to get 32 DOIs, but in the process I noticed that some entries were misclassified. For instance, some things that should have been classified as "web articles" were classified as "research articles" (the page showing the GOSH manifesto was one of them). I counted 14 of these misclassifications (but note that this number also includes cases the other way around: research articles classified as web articles). From what I remember, this classification came from the PLoS people, so I am not sure if there are more such cases in there.

For the cases that have a new DOI, I found that some of them are found by our query and some are not, and some are missed because of hyphenation: the papers spell "open-hardware" rather than "open hardware" or "open-source hardware", etc. Maybe we would gain from adding these hyphenated variants to the query.

The same is true for the DOIs from PLoS ONE you listed above. (Also, these DOIs, in their current format, do not lead to the article webpage when I paste them into my browser's address bar.)

@solstag
Collaborator Author

solstag commented Mar 4, 2021

Ni! Cool, excellent, thanks!

Hm, I would be surprised if Scopus or WoS did not find "open-hardware" when searching for "open hardware"; I was assuming they treat hyphens as spaces. Did you confirm that?

@amchagas
Owner

amchagas commented Mar 4, 2021

OK, so you are correct: both databases treat "open hardware" and "open-hardware" as equivalent, and so does SciELO

@amchagas
Owner

I forgot to follow up on my comment:
If hyphenation is not the issue for some of them getting found and some not, then what could be the issue? I could not think of anything obvious...

(Could be that the ones that are not found are not indexed in those databases somehow?)

I guess the question is, do we want to do a deep dive in this issue, or do we acknowledge it exists and move on?

@solstag
Collaborator Author

solstag commented Mar 12, 2021

The issue is that these open hardware papers may not use any terms to designate open hardware in their title+abstracts+keywords, or we might not have the good terms.

The idea would be, for those 10 papers, to build a table with:

  1. which of our terms are present
  2. whether it would be found by our query or not
  3. which other terms it uses to indicate that it is about OH

@amchagas
Owner

amchagas commented Mar 15, 2021

some results are in :P
From the ten papers, I found some keywords that we might want to test:
3 papers use "open source design" (~100 hits at WoS)
1 paper uses "open source method" (50 hits at WoS)
1 paper uses "open source tool" (1201 hits at WoS, including quite a few software papers)
1 paper uses "open source electronics" (62 hits at WoS)
1 paper uses "inexpensive hardware" (using this could lead to a lot of papers that describe affordable solutions but not necessarily open source ones)

Other than that, one of the papers does not actually share the data needed for replication and is more focused on a biological question than on describing a tool or method.

Two other papers mention likely keywords only in the introduction: "open hardware design" and "open source hardware".

One paper makes no mention of open source whatsoever, even though its code and design files are nicely placed on GH.

None of these have our terms in the title and/or abstract.

I have saved a table with this info under the /data folder


@amchagas
Owner

A quick look at WoS shows that "open source design" returns ~100 entries. Some are not hardware, but there are also some hardware articles we did not find before.

A possible keyword combination is "open source 3D printed", which returns 20 articles in WoS, all of them about hardware.

@solstag
Collaborator Author

solstag commented Mar 26, 2021

Ni! Ok, I've checked the table. It answers question 3 (which other terms it uses to indicate that it is about OH), but it doesn't answer questions 1 and 2 (which of our terms are present / whether it would be found by our query or not). So we still don't know whether our current search would catch those. Or am I missing something? Cheeeers

@amchagas
Owner

you are not missing anything. I forgot to spell it out:
In the ods file there is an "observations" column where I made comments on some of the papers, things like "our keywords are only present in the introduction", i.e. not in the abstract, title, or keywords. For the rows with no observations, our keywords were not found in the paper at all. This is because the PLoS collection was built by people either submitting things to it (so the authors knew their paper matched the collection requirements, even though they did not use the keywords we are now looking for) or by us finding papers "by chance"...

@solstag
Collaborator Author

solstag commented Mar 26, 2021

So, basically, you're saying that none of those 10 papers would show up in our current search? Well that's pretty bad.

EDIT: sorry, paying more attention to the keywords (and not only the observations) I guess two of them would show up because they contain "open hardware". It's still bad though. The upside is that by adding "open source design" we'd cover half of them.

@amchagas
Owner

amchagas commented Mar 26, 2021

I think it would be useful to check again how many papers in the PLoS collection are not caught by our query. I mean, are these 10 sampled papers representative of a large number of papers that our query misses, or was it more of a fluke?
In any case, we can make a note of that in the writing and try to assess the size of the issue.
We are probably going to have to go through more entries manually anyway, to check a statistically significant subsample of the papers, so we might learn more when we do that...

@solstag
Collaborator Author

solstag commented Mar 26, 2021

This was a random sample of "papers from the plos collection having plos DOIs" (:X). So it should be somewhat representative of the plos papers. I'm going to directly check our current data for all the DOIs in X, and see what we find in those papers absent from it. But it's clear we are leaving a lot of - and still possibly most! - papers out.

@solstag
Collaborator Author

solstag commented Mar 26, 2021

Ni! Done:

  • recap:
    • there are 706 articles in the bibliographic database query data (BIBLIO).
    • there are 623 articles in BIBLIO that contain a DOI.
    • there are 172 articles in the PLOS Collection (PLOSC).
  • new:
    • there are 126 articles in PLOSC that contain a DOI (either in DOI format or inside a URL, which I then extract).
    • only 19 articles are found both in BIBLIO and PLOSC.
    • if we compare titles, we get 36 titles in both BIBLIO and PLOSC
    • but consistent with the DOI matching, only 20 titles match if we consider titles that have DOIs
      • (there are 20 instead of the 19 for DOI matches because there's a matching title in PLOSC with a bioRxiv DOI).
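
The DOI extraction and overlap counting above can be sketched roughly as follows. The rows are made-up stand-ins for the real BIBLIO and PLOSC records, and the regex assumes the usual "10.prefix/suffix" DOI shape.

```python
import re

# Match a DOI whether it appears bare or embedded in a URL.
DOI_RE = re.compile(r"10\.\d{4,9}/[^\s\"<>]+")

def extract_doi(value):
    """Pull a DOI out of a field that may be a bare DOI or a URL containing one."""
    match = DOI_RE.search(value or "")
    return match.group(0).lower() if match else None

# Made-up example rows standing in for the real BIBLIO and PLOSC records.
biblio = ["10.1371/journal.pone.0187219", "10.3390/s19040894"]
plosc = [
    "https://journals.plos.org/plosone/article?id=10.1371/journal.pone.0187219",
    "10.1101/2020.01.01.000001",
]

biblio_dois = {doi for doi in map(extract_doi, biblio) if doi}
plosc_dois = {doi for doi in map(extract_doi, plosc) if doi}
print(sorted(biblio_dois & plosc_dois))  # ['10.1371/journal.pone.0187219']
```

Lowercasing before intersecting avoids spurious mismatches, since DOIs are case-insensitive.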

The conclusion is that these numbers make it hard to argue that BIBLIO is representative.

We can try to add "design" to the query and see how much this improves things, given that so many in our sample of 10 used that. I'm committing that change to the query generator in project_definitions.py. Is it too much work to regenerate the RIS files?

But I'm not very confident that it will improve the situation enough. I'd hope to find at least half of the PLOSC papers in BIBLIO. I'm trying to think of what else we can do. Maybe we could add "open source method" and "open source electronics" like you suggested, or include "open source tools" conditioned on the record also mentioning "hardware" or "electronics".

Cheers!

@amchagas
Owner

Just did a new search using the new definitions for Scopus, WoS and SciELO..
For Scopus, I exported only "articles", avoiding "proceedings", "book chapters", etc., since the number of entries was quite high and exporting is limited to 2000 entries (I could have exported every type, but would then have to filter, select, etc.).

@solstag
Collaborator Author

solstag commented Mar 31, 2021

Ni! There's something wrong with the new files. Files "scielo.ciw" and "wos1-500.ciw" are the same. Can you check?

@amchagas
Owner

amchagas commented Apr 1, 2021

Oops! Should be fixed now...

@solstag
Collaborator Author

solstag commented Apr 2, 2021

Ni!

  1. Ok, big mistake: it turns out "open design" brings in a lot of unrelated medical literature, so we have to exclude it and keep only the other combinations with "design" : P

  2. I've checked the 10 sampled articles and found something strange: with the new search, we get articles 2, 3 and 6 (your table index). Those are the articles you tagged with "open source design". But we do not get articles 7 and 10, tagged respectively with "open hardware" and "open source hardware". That's weird!

So I checked against the earlier data, and it seems we weren't getting 7 and 10 before either. Then I went looking into the contents of the articles, and it turns out you marked them using the full text. The problem is that Scopus and WoS, and probably SciELO as well, only let us search the title+abstract+keywords.

  3. The only way to search full text at scale is through CORE, but it's restricted to open access sources, mostly from institutional repositories: https://core.ac.uk/

That could be a different approach, less usual but perhaps easier. We can say that we limit ourselves to open access publications available through CORE, because searching the full text is more reliable, even if CORE's coverage is messy to define. We'll miss both paywalled content and OA content not in CORE. I checked: Sensors is consistently there because it's in PMC; PLOS seems to depend on institutional repositories but does get its full text indexed; HardwareX papers can be found, but coverage is worse because the journal denies full-text indexing.
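
A full-text search against CORE could be set up along these lines. The v3 endpoint and parameter names below are assumptions based on CORE's public API documentation, and a real request would additionally need an API key in an Authorization header.

```python
import urllib.parse

# Assumed CORE v3 full-text search endpoint (check CORE's API docs).
CORE_SEARCH = "https://api.core.ac.uk/v3/search/works"

def build_core_query(phrases, limit=100):
    """Build a search URL ORing together a list of exact phrases."""
    q = " OR ".join(f'"{p}"' for p in phrases)
    return CORE_SEARCH + "?" + urllib.parse.urlencode({"q": q, "limit": limit})

url = build_core_query(["open source hardware", "open hardware"])
print(url)
```

The request itself would then be sent with an `Authorization: Bearer <key>` header; only the URL construction is shown here.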

  4. With some recent improvements in the code and the current search, we went from 19 to 28 abstracts present in both BIBLIO and PLOSC. But that's still pretty low. I'm thinking we might want some kind of AI to learn from the abstract whether a paper is introducing open source hardware. We should keep improving our search until we reach about a third of the papers in PLOSC, at which point we may publish something. Later we can use that data to train a machine learning model.

So, do you think you can play some more with the searches? Maybe we want to include some more search statements like ("open source" AND "electronics"), which will catch papers containing both terms even if they're not contiguous, or even "DIY".

  5. In any case, since I've spent some hours improving the ETL, I'm going to ask you to replay the searches with the current new definition, which excludes "open design" (see bbf3cd7), to see what we get after processing. Ok? Also, if you try some other stuff like I suggested in the previous paragraph and you figure it's good, you can add that too. And yes, do export only "articles".
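
Query combinations like the ones discussed here could be generated in the spirit of the generator in project_definitions.py (the actual generator there may differ; this is an illustrative sketch). TITLE-ABS-KEY is Scopus field syntax: exact phrases go in quotes, and AND pairs catch records where both terms appear even if not contiguous.

```python
# Illustrative term sets; the real ones live in project_definitions.py.
phrases = ["open source hardware", "open hardware", "open source design"]
pairs = [("open source", "electronics"), ("open source", "hardware")]

# Exact phrases, plus co-occurrence pairs that need not be contiguous.
clauses = [f'"{p}"' for p in phrases]
clauses += [f'("{a}" AND "{b}")' for a, b in pairs]
query = "TITLE-ABS-KEY(" + " OR ".join(clauses) + ")"
print(query)
```

WoS and SciELO use different field tags (e.g. TS= in WoS), so the wrapper would change per database while the boolean body stays the same.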

Abraço,
