Getting PDF from identifiers #560

kermitt2 · 2018-10-19T20:56:21Z

For exploiting the annotations, we need to be able to get the same version of the PDF which has been used by the annotators.

If I am not wrong, the only information for doing this right now is the identifier provided with the article attributes. For PMC, there is no problem because we can find unambiguously the corresponding PDF URL and everything is well archived/preserved at NIH for a couple of millennia.

With a DOI, we have several issues:

- Using Unpaywall, the Open Access PDF URL that we get leads to a PDF, but there is no guarantee that this PDF is the same as the one used by the annotator, and no guarantee that the same version will remain accessible in the future.

  For example, for DOI 10.1007/s00148-011-0355-y, Unpaywall gives the Springer Open Access version:
  https://link.springer.com/content/pdf/10.1007%2Fs00148-011-0355-y.pdf
  while the preprint on the OA repositories (for instance the version linked via https://econpapers.repec.org/paper/nbrnberwo/14900.htm) is a different version:
  https://www.nber.org/papers/w14900.pdf

- The Open Access PDF identified by Unpaywall for a DOI is not always reliable over time. For instance, this DOI (10.1007/bf00163432) is associated with an open access PDF via Unpaywall, but the URL now leads to a paid version:
  https://link.springer.com/content/pdf/10.1007%2FBF00163432.pdf

- Some DOIs are not associated with an Open Access PDF by Unpaywall. In this case, we cannot access the PDF automatically, and it might be copyrighted, so we cannot exploit the annotations (an ML model would be a derived product under copyright too).

  Example: 10.1257/089533002320951064
  https://api.unpaywall.org/v2/10.1257/[email protected]
  -> no OA PDF

  This is significant: 21 DOIs are currently not open access according to Unpaywall.

There are also non-DOI identifiers:
a2001-35-NAT_BIOTECHNOL, a2010-05-BMC_MOL_BIOL, a2010-05-BMC_MOL_BIOL, a2010-05-BMC_MOL_BIOL
For these, there is no standard and stable automatic way to download them.
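
As an illustration of the DOI lookup discussed above, here is a minimal sketch. It assumes the Unpaywall v2 REST endpoint (which takes an `email` query parameter) and the `best_oa_location` / `url_for_pdf` fields of its JSON response. The helper names are hypothetical and not part of this project.

```python
import json
import urllib.parse
import urllib.request

UNPAYWALL_API = "https://api.unpaywall.org/v2/"


def unpaywall_url(doi, email):
    """Build the lookup URL for one DOI; the email goes in the query string."""
    query = urllib.parse.urlencode({"email": email})
    return UNPAYWALL_API + urllib.parse.quote(doi, safe="/") + "?" + query


def best_pdf_url(record):
    """Return the PDF URL of the best OA location, or None when there is none."""
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf")


def fetch_record(doi, email):
    """Fetch the Unpaywall JSON record for a DOI (performs a network call)."""
    with urllib.request.urlopen(unpaywall_url(doi, email), timeout=30) as response:
        return json.load(response)
```

A `None` result from `best_pdf_url` corresponds to the "no OA PDF" case above. A non-empty result only says that Unpaywall knows a PDF link, not that it is the version the annotators used.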


kermitt2 · 2018-10-19T21:03:51Z

To solve the issue, what about:

- keeping track of the original URL of the PDF in the dataset
- preserving a version in an AWS S3 space
- ensuring the Open Access status of the annotated documents, based on Unpaywall, as a minimal requirement?
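
One possible shape for the first two points is sketched below. Storing a SHA-256 checksum next to the original URL is an assumption added here, not something proposed in the thread, and `PdfRecord`, `make_record` and `same_version` are hypothetical names. The checksum would let anyone verify that a preserved or re-downloaded file is byte-identical to the one the annotators saw. Uploading to S3 is left out.

```python
import hashlib
from dataclasses import dataclass


@dataclass
class PdfRecord:
    identifier: str  # DOI, PMC ID, or other identifier from the article attributes
    source_url: str  # URL the annotated PDF was originally downloaded from
    sha256: str      # checksum of the exact bytes used by the annotators


def make_record(identifier, source_url, pdf_bytes):
    """Create a dataset entry for one annotated PDF."""
    return PdfRecord(identifier, source_url, hashlib.sha256(pdf_bytes).hexdigest())


def same_version(record, pdf_bytes):
    """Check whether a downloaded file is byte-identical to the annotated one."""
    return hashlib.sha256(pdf_bytes).hexdigest() == record.sha256
```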

jameshowison · 2018-10-19T21:14:46Z

Hi Patrice, Sorry, I didn't understand this question at first. I have all the PDFs that the annotators used, I just haven't made that repository public. Sorry for extra work here (although it is certainly important for when we release the dataset, I don't know if we can release the PDFs with it). I have added you to that repo, I hope it is what was needed.



kermitt2 · 2018-10-19T21:28:37Z

Thanks a lot! Having the original PDF will save me time for sure.

We can release the PDFs with the dataset if the publications are CC-0 or CC-BY, so in general the green Open Access versions.

There are different cases to distinguish, but if the goal is to release a dataset that is open and can be reused in a stable manner over time, the corresponding PDFs have to be well identified, accessible and legally reusable.

The main issue is that if we have copyrighted PDFs, we cannot release them with the dataset, but we also cannot use them for training, so the annotations are not exploitable, which is a bit of a pity.

That's why I raise these issues. Probably the simplest solution would be to restrict the set of PDFs to green open access publications having a stable preserved version on a main preprint archive.
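
Such a restriction could be sketched as follows. This assumes that each entry in the `oa_locations` list of an Unpaywall record has a `host_type` ("publisher" or "repository") and a lowercase `license` string such as "cc-by" or "cc0". The accepted license set and the function name are illustrative only.

```python
# Assumption: Unpaywall-style lowercase license strings.
REUSABLE_LICENSES = {"cc0", "cc-by"}


def releasable_locations(record):
    """Return OA locations that are repository-hosted and carry a reusable license."""
    kept = []
    for location in record.get("oa_locations") or []:
        license_name = (location.get("license") or "").lower()
        if location.get("host_type") == "repository" and license_name in REUSABLE_LICENSES:
            kept.append(location)
    return kept
```

Many repository copies carry no explicit license in the metadata, so a filter like this is likely to be conservative.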

jameshowison · 2018-10-19T21:31:30Z

That makes sense to me. I thought we'd talked about that with Jason and Heather and that the articles from unpaywall were all green open access? Is there a straightforward way to find out? I can definitely avoid coding any more that aren't green open access.


kermitt2 · 2018-11-11T23:39:06Z

I ran a new check, and here is the current list of DOIs that are not Open Access according to Unpaywall (I am using their web service):

No Open Access PDF found via Unpaywall for DOI: 10.1080/17421772.2011.647058
No Open Access PDF found via Unpaywall for DOI: 10.1002/ijfe.1565
No Open Access PDF found via Unpaywall for DOI: 10.1080/00036846.2016.1218430
No Open Access PDF found via Unpaywall for DOI: 10.1111/jors.12246
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1467-9957.2008.01084.x
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.14.1.109
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1467-9701.2007.01019.x
No Open Access PDF found via Unpaywall for DOI: 10.1002/soej.12180
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.25.3.83
No Open Access PDF found via Unpaywall for DOI: 10.1111/cwe.12158
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1468-0297.2008.02177.x
No Open Access PDF found via Unpaywall for DOI: 10.3846/1611-1699.2009.10.279-289
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.8.2.117
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.13.1.181
No Open Access PDF found via Unpaywall for DOI: 10.1002/pam.21962
No Open Access PDF found via Unpaywall for DOI: 10.1111/1468-0106.12204
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.13
No Open Access PDF found via Unpaywall for DOI: 10.1108/jes-04-2014-0055
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.30
No Open Access PDF found via Unpaywall for DOI: 10.3846/jbem.2010.20
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.30.2.201
No Open Access PDF found via Unpaywall for DOI: 10.1111/j.1468-0351.2009.00342.x
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.31.2.211
No Open Access PDF found via Unpaywall for DOI: 10.1108/jes-01-2015-0013
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.30.1.77
No Open Access PDF found via Unpaywall for DOI: 10.1257/jep.2.1.153
No Open Access PDF found via Unpaywall for DOI: 10.1257/089533002320951064
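
A report like the one above could be produced along these lines. This is a sketch, not the script actually used: `lookup` is a hypothetical callable that returns an Unpaywall-like record (a dict with a `best_oa_location` entry) or `None` for a DOI, so the web service call itself is left out.

```python
def report_missing_pdfs(dois, lookup):
    """Print and return the DOIs for which no Open Access PDF URL is known."""
    missing = []
    for doi in dois:
        record = lookup(doi) or {}
        location = record.get("best_oa_location") or {}
        if not location.get("url_for_pdf"):
            print("No Open Access PDF found via Unpaywall for DOI: " + doi)
            missing.append(doi)
    return missing
```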

jameshowison · 2018-11-16T20:46:14Z

@jasonpriem could you take a look here? Seems that some of the DOIs that came from the lists you pulled from unpaywall aren't actually Open Access? I am about to swap over to astro articles and it would be good to avoid similar issues there?

@kermitt2 Could you use the same approach to check the astro articles here: https://github.com/howisonlab/softcite-pdf-files/blob/master/docs/pdf-files/astronomy_pdf_files/journal_articles_astronomy_random_5000_dois_with_pdf_links.csv

kermitt2 · 2018-12-09T01:56:15Z

Sorry for taking so long to analyse this list of papers! It was a bit more complicated than I thought. Here are the results:

- In this list, 100% of the DOIs are considered OA by the latest Unpaywall data dump (snapshot from last September).
- However, I failed to download the Open Access resource for 986 out of 5000 entries with my dedicated harvester (https://github.com/kermitt2/biblio-glutton-harvester, which handles redirections, multiple retries, etc. quite well). This is high compared to my usual failure rate for Unpaywall (around 4%).
- Out of the 4014 successful DOIs, only 1778 are actual, correct PDFs; the rest are abstracts or full texts in HTML. Apparently, there is an issue with the PDF link via the ADS server: the "url_for_pdf" field actually points to the ADS landing page. So it's a problem specific to Astronomy.
- Among these 1778 DOIs, there are still quite a few documents that are just abstracts or very short communications (less than one page), but I don't really have a reliable way to detect them...

You'll find attached here these 1778 successful DOIs with their Open Access links.
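
For reference, separating real PDFs from HTML landing pages can be sketched by inspecting the first bytes of each download. This is not necessarily how the harvester does it. The check relies on the `%PDF-` signature that PDF files carry at or near their start, and the function names are hypothetical.

```python
def looks_like_pdf(data):
    """True if the payload carries the '%PDF-' signature near its start."""
    return b"%PDF-" in data[:1024]


def classify_download(data):
    """Roughly classify a downloaded payload as 'pdf', 'html' or 'other'."""
    if looks_like_pdf(data):
        return "pdf"
    head = data[:1024].lstrip().lower()
    if head.startswith(b"<!doctype html") or b"<html" in head:
        return "html"
    return "other"
```

This does not help with the last point: a PDF containing only an abstract still passes the signature check.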
