Getting PDF from identifiers #560
Comments
To solve the issue, what about:
|
Hi Patrice,
Sorry, I didn't understand this question at first. I have all the PDFs
that the annotators used, I just haven't made that repository public.
Sorry for extra work here (although it is certainly important for when we
release the dataset, I don't know if we can release the PDFs with it). I
have added you to that repo, I hope it is what was needed.
--J
On Fri, Oct 19, 2018 at 4:03 PM Patrice Lopez ***@***.***> wrote:
To solve the issue, what about:
- keeping track of the original URL of the PDF in the dataset
- preserving a version on an AWS S3 space
- ensuring the Open Access status of the annotated documents, based on Unpaywall, as a minimal requirement?
--
James Howison
Associate Professor and Director of Doctoral Studies
School of Information
University of Texas at Austin
http://james.howison.name
|
Thanks a lot! Having the original PDFs will save me time for sure. We can release the PDFs with the dataset if the publications are CC-0 or CC-BY, so in general the green Open Access versions. There are different cases to distinguish, but if the goal is to release an open dataset that can be reused in a stable manner over time, the corresponding PDFs have to be well identified, accessible, and legally reusable. The main issue is that if we have copyrighted PDFs, we cannot release them with the dataset, and we also cannot use them for training, so the annotations are not exploitable, which is a bit of a pity. That's why I raise these issues; probably the simplest solution would be to restrict the set of PDFs to green Open Access publications that have a stable, preserved version on a major preprint archive. |
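A minimal sketch of that restriction, assuming the Unpaywall v2 record for each DOI has already been fetched as a Python dict. It relies on the `oa_locations`, `host_type`, `license`, and `url_for_pdf` fields of the Unpaywall data format; the function name and the exact set of accepted license strings are assumptions for illustration and would need checking against the values Unpaywall actually returns.

```python
from typing import Optional

# Assumption: license strings considered safe to redistribute. Adjust this
# set after checking the values present in real Unpaywall records.
REUSABLE_LICENSES = frozenset({"cc0", "cc-by"})


def releasable_repository_pdf(record: dict,
                              licenses=REUSABLE_LICENSES) -> Optional[str]:
    """Return the PDF URL of the first repository-hosted (green OA) copy
    whose license is in `licenses`, or None if there is no such copy."""
    for location in record.get("oa_locations") or []:
        if (location.get("host_type") == "repository"
                and location.get("license") in licenses
                and location.get("url_for_pdf")):
            return location["url_for_pdf"]
    return None
```

A document for which this returns None would be left out of the released set under the rule proposed above.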
That makes sense to me. I thought we'd talked about that with Jason and
Heather and that the articles from unpaywall were all green open access?
Is there a straightforward way to find out? I can definitely avoid coding
any more that aren't green open access.
--J
On Fri, Oct 19, 2018 at 4:28 PM Patrice Lopez ***@***.***> wrote:
Thanks a lot! Having the original PDF will save me time for sure.
We can release the PDF with the dataset if the publications are CC-0 or
CC-BY, so in general the green Open Access versions.
There are different cases to distinguish, but if the goal is to release a
dataset that can be reused in a stable manner over time and which is open,
the corresponding PDF have to be well identified, accessible and legally
re-usable.
The main issue is, if we have copyrighted PDF, we cannot release them with
the dataset, but we also cannot use them for training and the annotations
are not exploitable which is a bit a pity.
That's why I raise these issues, and probably the simplest solution would
be to restrict the set of PDF to green open access publications having a
stable preserved version on a main preprint archive.
|
I ran a new check, and here is the current list of DOIs that are not Open Access according to Unpaywall (I am using their web service): No Open Access PDF found via Unpaywall for DOI: 10.1080/17421772.2011.647058 |
@jasonpriem could you take a look here? Seems that some of the DOIs that came from the lists you pulled from unpaywall aren't actually Open Access? I am about to swap over to astro articles and it would be good to avoid similar issues there? @kermitt2 Could you use the same approach to check the astro articles here: https://github.com/howisonlab/softcite-pdf-files/blob/master/docs/pdf-files/astronomy_pdf_files/journal_articles_astronomy_random_5000_dois_with_pdf_links.csv |
Sorry for taking so long to analyse this list of papers! It was a bit more complicated than I thought, here are the results:
You'll find attached here these 1778 successful DOIs with their Open Access links. |
For exploiting the annotations, we need to be able to get the same version of the PDF which has been used by the annotators.
If I am not wrong, the only information for doing this right now is the identifier provided with the article attributes. For PMC, there is no problem because we can find unambiguously the corresponding PDF URL and everything is well archived/preserved at NIH for a couple of millennia.
With a DOI, we have several issues:
For example, for DOI 10.1007/s00148-011-0355-y,
Unpaywall will give the Springer Open Access version: https://link.springer.com/content/pdf/10.1007%2Fs00148-011-0355-y.pdf
while the preprint version on the OA repositories (for instance, the version linked via https://econpapers.repec.org/paper/nbrnberwo/14900.htm) is a different version:
https://www.nber.org/papers/w14900.pdf
The Open Access PDF identified by Unpaywall for a DOI is not always reliable over time. For instance, this DOI (10.1007/bf00163432) is associated with an Open Access PDF via Unpaywall, but the URL now leads to a paid version:
https://link.springer.com/content/pdf/10.1007%2FBF00163432.pdf
Some DOIs are not associated with an Open Access PDF by Unpaywall. In this case, we cannot access the PDF automatically, and it might be copyrighted, so we cannot exploit the annotations (an ML model trained on it would be a derived product under copyright too).
Example: 10.1257/089533002320951064
https://api.unpaywall.org/v2/10.1257/089533002320951064?email=[email protected]
-> no OA PDF
This is significant: 21 DOIs are currently not Open Access according to Unpaywall.
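The check described above can be sketched as follows. This is only an illustration under stated assumptions: it queries the Unpaywall v2 endpoint shown in the example (which requires an `email` query parameter) and reads the `best_oa_location` and `url_for_pdf` fields of the returned record. The function names are hypothetical, and error handling (for instance the HTTP error raised for a DOI unknown to Unpaywall) is left out.

```python
import json
import urllib.parse
import urllib.request
from typing import Optional

UNPAYWALL_API = "https://api.unpaywall.org/v2/"


def fetch_unpaywall_record(doi: str, email: str) -> dict:
    """Fetch the Unpaywall record for a DOI (requires network access).

    urllib raises HTTPError if the service answers with an error status.
    """
    query = urllib.parse.urlencode({"email": email})
    url = UNPAYWALL_API + urllib.parse.quote(doi) + "?" + query
    with urllib.request.urlopen(url, timeout=30) as response:
        return json.load(response)


def oa_pdf_url(record: dict) -> Optional[str]:
    """Return the PDF URL of the best OA location, or None if there is none."""
    # best_oa_location is null (None) when Unpaywall knows no OA copy.
    location = record.get("best_oa_location") or {}
    return location.get("url_for_pdf")
```

Calling `oa_pdf_url(fetch_unpaywall_record(doi, email))` for each DOI in the dataset and collecting those that yield None would give a list like the one counted above. Since the answer can change over time, the date of the check is worth recording alongside the result.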
There are also non-DOI identifiers:
a2001-35-NAT_BIOTECHNOL, a2010-05-BMC_MOL_BIOL, a2010-05-BMC_MOL_BIOL, a2010-05-BMC_MOL_BIOL
For these, there is no standard, stable, automatic way of downloading them.
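Whatever the identifier type, one possible way to pin down exactly which PDF version the annotators used is to store a checksum of the file next to its identifier and source URL. The sketch below is an assumption about how such a manifest entry could look, not something the project currently does; the function names and field names are hypothetical.

```python
import hashlib


def pdf_fingerprint(pdf_bytes: bytes) -> str:
    """Return the SHA-256 hex digest of the raw PDF bytes."""
    return hashlib.sha256(pdf_bytes).hexdigest()


def manifest_entry(identifier: str, source_url: str, pdf_bytes: bytes) -> dict:
    """Describe one annotated PDF: its identifier, origin, and fingerprint."""
    return {
        "id": identifier,
        "source_url": source_url,
        "sha256": pdf_fingerprint(pdf_bytes),
        "size_bytes": len(pdf_bytes),
    }


def same_version(entry: dict, pdf_bytes: bytes) -> bool:
    """Check whether a freshly downloaded PDF is byte-identical to the
    one recorded in the manifest entry."""
    return entry["sha256"] == pdf_fingerprint(pdf_bytes)
```

A byte-level checksum is strict: a publisher re-stamping the same article with a new watermark would count as a different version, which is the conservative behaviour when annotations are tied to a specific PDF.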