Getting PDF from identifiers #560

Open
kermitt2 opened this issue Oct 19, 2018 · 7 comments

@kermitt2
Member

To exploit the annotations, we need to be able to get the same version of the PDF that the annotators used.

If I am not mistaken, the only information available for this right now is the identifier provided with the article attributes. For PMC there is no problem: we can unambiguously find the corresponding PDF URL, and everything is well archived/preserved at NIH for a couple of millennia.

With a DOI, we have several issues:

  • using Unpaywall, the Open Access PDF URL that we get can lead to a PDF, but there is no guarantee that this PDF is the same as the one used by the annotator, and no guarantee that the same version will be accessible in the future.

For example for DOI: 10.1007/s00148-011-0355-y
Unpaywall will give the Springer Open Access version: https://link.springer.com/content/pdf/10.1007%2Fs00148-011-0355-y.pdf
While the preprint version on the OA repositories (for instance the version linked via https://econpapers.repec.org/paper/nbrnberwo/14900.htm) is a different version:
https://www.nber.org/papers/w14900.pdf

  • the Open Access PDF identified by Unpaywall for a DOI is not always reliable over time. For instance, this DOI (10.1007/bf00163432) is associated with an open access PDF via Unpaywall, but the URL now leads to a paid version:
    https://link.springer.com/content/pdf/10.1007%2FBF00163432.pdf

  • some DOIs are not associated with an Open Access PDF by Unpaywall. In this case we cannot access the PDF automatically, and it might be copyrighted, so we cannot exploit the annotations (an ML model trained on them would be a derived product under copyright too).

Example: 10.1257/089533002320951064
https://api.unpaywall.org/v2/10.1257/[email protected]
-> no OA PDF
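
The lookup above can be sketched in Python as follows. This is a minimal illustration, not the check actually used in this thread: it assumes the Unpaywall v2 REST endpoint with a required `email` query parameter and the `best_oa_location` / `url_for_pdf` fields of the JSON response. The function names and the example email address are hypothetical.

```python
import json
from typing import Optional
from urllib.parse import quote, urlencode
from urllib.request import urlopen

UNPAYWALL_BASE = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi: str, email: str) -> str:
    """Build the lookup URL for one DOI (the email parameter is required by the service)."""
    return UNPAYWALL_BASE + quote(doi, safe="/") + "?" + urlencode({"email": email})


def fetch_record(doi: str, email: str) -> dict:
    """Fetch the Unpaywall JSON record for a DOI (performs a network request)."""
    with urlopen(unpaywall_url(doi, email), timeout=30) as resp:
        return json.load(resp)


def oa_pdf_url(record: dict) -> Optional[str]:
    """Return the PDF URL of the best OA location, or None when there is none."""
    best = record.get("best_oa_location") or {}
    return best.get("url_for_pdf")
```

A result of `None` from `oa_pdf_url` corresponds to the "no OA PDF" case shown above. Note that a non-empty URL only says where Unpaywall points today; it does not establish which version of the article sits behind it.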

This is significant: 21 DOIs are currently not open access according to Unpaywall.

There are also non-DOI identifiers:
a2001-35-NAT_BIOTECHNOL, a2010-05-BMC_MOL_BIOL, a2010-05-BMC_MOL_BIOL, a2010-05-BMC_MOL_BIOL
For these, there is no standard, stable, automatic way to download the PDF.

@kermitt2
Member Author

To solve the issue, what about:

  • keeping track of the original URL of the PDF in the dataset

  • preserving a copy of that version in an AWS S3 bucket

  • ensuring the Open Access status of the annotated documents, based on Unpaywall, as a minimal requirement
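
One way to make the first two points concrete is to store a small provenance record next to each PDF, including a checksum, so that a later download can be compared against the version the annotators used. The sketch below is an assumption about how such a record could look; the `PdfProvenance` class and its field names are hypothetical and not part of the dataset.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class PdfProvenance:
    identifier: str    # DOI, PMC ID, or local identifier from the article attributes
    source_url: str    # URL the PDF was originally downloaded from
    sha256: str        # checksum of the exact bytes that were annotated
    size_bytes: int
    retrieved_at: str  # ISO 8601 timestamp (UTC)


def make_provenance(identifier: str, source_url: str, pdf_bytes: bytes,
                    retrieved_at: Optional[str] = None) -> PdfProvenance:
    """Describe one downloaded PDF so the same version can be recognised later."""
    stamp = retrieved_at or datetime.now(timezone.utc).isoformat()
    return PdfProvenance(identifier, source_url,
                         hashlib.sha256(pdf_bytes).hexdigest(),
                         len(pdf_bytes), stamp)


def same_version(record: PdfProvenance, pdf_bytes: bytes) -> bool:
    """True when the given bytes are byte-identical to the recorded PDF."""
    return hashlib.sha256(pdf_bytes).hexdigest() == record.sha256
```

A checksum only detects byte-level differences: a publisher that re-stamps the same article with a new watermark would fail `same_version` even though the text is unchanged, which is one more argument for preserving a copy rather than relying on the URL.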

@jameshowison
Contributor

jameshowison commented Oct 19, 2018 via email

@kermitt2
Member Author

Thanks a lot! Having the original PDF will save me time for sure.

We can release the PDF with the dataset if the publications are CC-0 or CC-BY, so in general the green Open Access versions.

There are different cases to distinguish, but if the goal is to release an open dataset that can be reused in a stable manner over time, the corresponding PDFs have to be well identified, accessible and legally re-usable.

The main issue is that if we have copyrighted PDFs, we cannot release them with the dataset, and we also cannot use them for training, so the annotations are not exploitable, which is a bit of a pity.

That's why I am raising these issues. Probably the simplest solution would be to restrict the set of PDFs to green open access publications that have a stable, preserved version on a main preprint archive.
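
As a rough illustration of that restriction, the sketch below filters the `oa_locations` of an Unpaywall record down to repository-hosted copies with a permissive license. It assumes that Unpaywall marks repository copies with `host_type == "repository"` and reports licenses as lowercase strings such as `cc-by` and `cc0`; the exact license spellings, the constant and the function name are assumptions to verify against the Unpaywall data format documentation.

```python
# Assumed license spellings; check them against the Unpaywall data format.
REUSABLE_LICENSES = {"cc0", "cc-by"}


def reusable_repository_locations(record: dict) -> list:
    """Keep OA locations that are repository-hosted, permissively licensed, and have a PDF URL."""
    kept = []
    for loc in record.get("oa_locations") or []:
        license_name = (loc.get("license") or "").lower()
        if (loc.get("host_type") == "repository"
                and license_name in REUSABLE_LICENSES
                and loc.get("url_for_pdf")):
            kept.append(loc)
    return kept
```

An empty result means the document would fall outside the restricted set. Many repository copies carry no license information at all, so a filter like this is likely to be conservative.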

@jameshowison
Contributor

jameshowison commented Oct 19, 2018 via email

@kermitt2
Member Author

I ran a new check; here is the current list of DOIs that are not Open Access according to Unpaywall (I am using their web service):

No Open Access PDF found via Unpaywall for DOI: 10.1080/17421772.2011.647058
No Open Access PDF found via Unpaywall for DOI: 10.1002/ijfe.1565
No Open Access PDF found via Unpaywall for DOI: 10.1080/00036846.2016.1218430
No Open Access PDF found via Unpaywall for DOI: 10.1111/jors.12246
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1467-9957.2008.01084.x
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.14.1.109
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1467-9701.2007.01019.x
No Open Access PDF found via Unpaywall for DOI: 10.1002/soej.12180
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.25.3.83
No Open Access PDF found via Unpaywall for DOI: 10.1111/cwe.12158
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1468-0297.2008.02177.x
No Open Access PDF found via Unpaywall for DOI: 10.3846/1611-1699.2009.10.279-289
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.8.2.117
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.13.1.181
No Open Access PDF found via Unpaywall for DOI: 10.1002/pam.21962
No Open Access PDF found via Unpaywall for DOI: 10.1111/1468-0106.12204
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.13
No Open Access PDF found via Unpaywall for DOI: 10.1108/jes-04-2014-0055
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.30
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.20
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.30.2.201
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1468-0351.2009.00342.x
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.31.2.211
No Open Access PDF found via Unpaywall for DOI: 10.1108/jes-01-2015-0013
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.30.1.77
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.2.1.153
No Open Access PDF found via Unpaywall for DOI: 10.1257/089533002320951064

@jameshowison
Contributor

@jasonpriem could you take a look here? It seems that some of the DOIs that came from the lists you pulled from Unpaywall aren't actually Open Access. I am about to swap over to astro articles and it would be good to avoid similar issues there.

@kermitt2 Could you use the same approach to check the astro articles here: https://github.com/howisonlab/softcite-pdf-files/blob/master/docs/pdf-files/astronomy_pdf_files/journal_articles_astronomy_random_5000_dois_with_pdf_links.csv

@kermitt2
Member Author

kermitt2 commented Dec 9, 2018

Sorry for taking so long to analyse this list of papers! It was a bit more complicated than I thought. Here are the results:

  • in this list, 100% of the DOIs are considered OA by the latest Unpaywall data dump (snapshot from last September)

  • however, I failed to download the Open Access resource for 986 out of 5000 entries with my dedicated harvester (https://github.com/kermitt2/biblio-glutton-harvester, which handles redirection, multiple retries, etc. quite well). That is high (about 20%) compared with my usual failure rate for Unpaywall (around 4%)

  • out of the 4014 successful DOIs, only 1778 are actual, correct PDFs; the rest are abstracts or full texts in HTML. Apparently there is an issue with the PDF link via the ADS server: the "url_for_pdf" field actually points to the ADS landing page. So it's a problem specific to Astronomy.

  • among these 1778 DOIs, there are still quite a few documents that are just an abstract or a very short communication (less than one page), but I don't really have a reliable way to detect them...

You'll find attached here these 1778 successful DOIs with their Open Access link.
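
For the third point, a downloaded payload can be told apart from an HTML landing page by looking at its first bytes. The sketch below is not the check used by biblio-glutton-harvester; it is a minimal illustration that assumes a PDF carries the `%PDF-` marker within its first 1024 bytes, and the function names are hypothetical.

```python
def looks_like_pdf(data: bytes) -> bool:
    """True when the PDF header marker appears near the start of the payload."""
    return b"%PDF-" in data[:1024]


def classify_download(data: bytes) -> str:
    """Roughly classify a downloaded payload as 'pdf', 'html', or 'other'."""
    if looks_like_pdf(data):
        return "pdf"
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or head.startswith(b"<html") or b"<html" in head:
        return "html"
    return "other"
```

This only separates real PDFs from landing pages and HTML full texts. It does nothing for the last point: a one-page abstract delivered as a valid PDF still passes.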
