Cleanup entry "Move DOIs from note and URL field to DOI field and remove http prefix" incorrectly recognizes URLs ending with "2010/stuff" as DOIs #6880

JasonGross · 2020-09-06T22:07:17Z

JabRef version 5.2--2020-09-06--c0b139a on Windows 10 10.0 amd64, Java 14.0.2

Mandatory: I have tested the latest development version from http://builds.jabref.org/master/ and the problem persists

Steps to reproduce the behavior:

Save the file

@Misc{TrustedSlind,
  author   = {Konrad Slind},
  title    = {Trusted Extensions of Interactive Theorem Provers: Workshop Summary},
  date     = {2010-08},
  location = {Cambridge, England},
  url      = {http://www.cs.utexas.edu/users/kaufmann/itp-trusted-extensions-aug-2010/summary/summary.pdf},
}

as a .bib file.

Open this file in JabRef
Click on the one entry to select it
Click Quality -> Cleanup entries / Alt+F8
Ensure that only the first item ("Move DOIs from note and URL field to DOI field and remove http prefix") is checked
Click OK
Double-click on the entry and click "BibTeX source"

Note that the new source is

@Misc{TrustedSlind,
  author   = {Konrad Slind},
  title    = {Trusted Extensions of Interactive Theorem Provers: Workshop Summary},
  date     = {2010-08},
  doi      = {10/summary},
  location = {Cambridge, England},
}

This url is not a DOI link, though! Presumably this is because the matcher code at

jabref/src/main/java/org/jabref/model/entry/identifier/DOI.java

Lines 30 to 77 in ba68c09

    // Regex
    // (see http://www.doi.org/doi_handbook/2_Numbering.html)
    private static final String DOI_EXP = ""
            + "(?:urn:)?"                       // optional urn
            + "(?:doi:)?"                       // optional doi
            + "("                               // begin group \1
            + "10"                              // directory indicator
            + "(?:\\.[0-9]+)+"                  // registrant codes
            + "[/:%]"                           // divider
            + "(?:.+)"                          // suffix alphanumeric string
            + ")";                              // end group \1
    private static final String FIND_DOI_EXP = ""
            + "(?:urn:)?"                       // optional urn
            + "(?:doi:)?"                       // optional doi
            + "("                               // begin group \1
            + "10"                              // directory indicator
            + "(?:\\.[0-9]+)+"                  // registrant codes
            + "[/:]"                            // divider
            + "(?:[^\\s]+)"                     // suffix alphanumeric without space
            + ")";                              // end group \1
    // Regex (Short DOI)
    private static final String SHORT_DOI_EXP = ""
            + "(?:urn:)?"                       // optional urn
            + "(?:doi:)?"                       // optional doi
            + "("                               // begin group \1
            + "10"                              // directory indicator
            + "[/:%]"                           // divider
            + "[a-zA-Z0-9]+"
            + ")";                              // end group \1
    private static final String FIND_SHORT_DOI_EXP = ""
            + "(?:urn:)?"                       // optional urn
            + "(?:doi:)?"                       // optional doi
            + "("                               // begin group \1
            + "10"                              // directory indicator
            + "[/:]"                            // divider
            + "[a-zA-Z0-9]+"
            + "(?:[^\\s]+)"                     // suffix alphanumeric without space
            + ")";                              // end group \1
    private static final String HTTP_EXP = "https?://[^\\s]+?" + DOI_EXP;
    private static final String SHORT_DOI_HTTP_EXP = "https?://[^\\s]+?" + SHORT_DOI_EXP;
    // Pattern
    private static final Pattern EXACT_DOI_PATT = Pattern.compile("^(?:https?://[^\\s]+?)?" + DOI_EXP + "$", Pattern.CASE_INSENSITIVE);
    private static final Pattern DOI_PATT = Pattern.compile("(?:https?://[^\\s]+?)?" + FIND_DOI_EXP, Pattern.CASE_INSENSITIVE);
    // Pattern (short DOI)
    private static final Pattern EXACT_SHORT_DOI_PATT = Pattern.compile("^(?:https?://[^\\s]+?)?" + SHORT_DOI_EXP, Pattern.CASE_INSENSITIVE);
    private static final Pattern SHORT_DOI_PATT = Pattern.compile("(?:https?://[^\\s]+?)?" + FIND_SHORT_DOI_EXP, Pattern.CASE_INSENSITIVE);

considers all non-space text starting with http:// or https://, followed by 10/ followed by any non-space text, to be a DOI. This is absurd. The character immediately preceding the 10, doi:, or urn: should at the very least be required to be a url separator character such as /, :, ?, &, or =.
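A minimal sketch of the separator idea, not JabRef's actual fix. `CURRENT` is assembled from the quoted `EXACT_SHORT_DOI_PATT` pieces (note that it has no trailing `$`), and applying it with `find()` to the URL from this report yields `10/summary`, which is consistent with the result above. `STRICTER` is a hypothetical variant that adds a lookbehind so the identifier must either start the string or follow one of `/ : ? & =`. The class and helper names are made up for illustration.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ShortDoiSeparatorSketch {
    // Same shape as the quoted EXACT_SHORT_DOI_PATT: optional URL prefix, then "10", a divider, alphanumerics.
    static final Pattern CURRENT = Pattern.compile(
            "^(?:https?://[^\\s]+?)?(?:urn:)?(?:doi:)?(10[/:%][a-zA-Z0-9]+)",
            Pattern.CASE_INSENSITIVE);

    // Hypothetical stricter variant: the URL prefix must end in a separator character,
    // so "2010/summary" no longer counts because "10" is preceded by "0".
    static final Pattern STRICTER = Pattern.compile(
            "^(?:https?://[^\\s]*?(?<=[/:?&=]))?(?:urn:)?(?:doi:)?(10[/:%][a-zA-Z0-9]+)",
            Pattern.CASE_INSENSITIVE);

    // Returns capture group 1 (the candidate short DOI) if the pattern is found in the text.
    static Optional<String> find(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String url = "http://www.cs.utexas.edu/users/kaufmann/itp-trusted-extensions-aug-2010/summary/summary.pdf";
        System.out.println(find(CURRENT, url));
        System.out.println(find(STRICTER, url));
        System.out.println(find(STRICTER, "https://doi.org/10/d8dn"));
    }
}
```

A URL such as https://www.abc.de/10/abcd would still match the stricter variant, so this only removes the false positives where `10` is the tail of a longer number or word.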


PremKolar · 2020-09-07T11:50:02Z

Can I please do this??
Looks like fun! :)
Please!

Siedlerchr · 2020-09-07T11:58:37Z

@PremKolar Sure, go ahead!

PremKolar · 2020-09-07T20:06:42Z

This is not as straightforward as I thought.
The problem is with the short DOIs. These can look like

10/1234
doi:10/1234
10/2021
10/d8dn
https://doi.org/d8dn
etc.

I don't think there is a way to safely detect these in a URL or in some other field. My only idea was to not delete the entry in the respective original field when a short DOI is found, so as not to lose the information in case of ambiguity. But this would inevitably result in wrong data in the DOI field sometimes, for example when the URL field is https://www.abc.de/10/abcd or when the Note field reads 01/10/2012.

Anyone willing to share their thoughts?

http://shortdoi.org/

JasonGross · 2020-09-07T22:44:39Z

> https://doi.org/d8dn

This one isn't matched because there's no 10, though, right?

I think that detecting what comes before the 10 and ensuring that it's a valid separator would already be a great improvement.

Another option is to query DOI validity (I think there's already something like this in the automatic DOI search for an entry). If the matched DOI isn't valid (I don't think 10/summary is, for example), then it shouldn't be moved to the DOI field.
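A rough sketch of such a validity query, under stated assumptions: it asks the doi.org handle lookup endpoint (`/api/handles/`) about the candidate and treats HTTP 200 as "resolves". The endpoint path, the status-code interpretation, and the 5-second timeout are assumptions of this sketch and should be checked against the resolver's documentation; the class and method names are hypothetical and this is not how JabRef does it.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.time.Duration;

class DoiLookupSketch {
    // Builds the lookup address for a candidate; the endpoint path is an assumption of this sketch.
    static URI lookupUri(String candidate) {
        try {
            // The multi-argument constructor quotes characters that are illegal in a path.
            return new URI("https", "doi.org", "/api/handles/" + candidate, null);
        } catch (URISyntaxException e) {
            throw new IllegalArgumentException("Cannot build lookup URI for " + candidate, e);
        }
    }

    // Returns true when the resolver answers with HTTP 200 for the candidate (requires Java 11+).
    static boolean resolves(HttpClient client, String candidate) throws Exception {
        HttpRequest request = HttpRequest.newBuilder(lookupUri(candidate))
                .timeout(Duration.ofSeconds(5))
                .GET()
                .build();
        HttpResponse<Void> response = client.send(request, HttpResponse.BodyHandlers.discarding());
        return response.statusCode() == 200;
    }

    public static void main(String[] args) throws Exception {
        // Performs a real network request when run.
        HttpClient client = HttpClient.newHttpClient();
        System.out.println(resolves(client, "10/summary"));
    }
}
```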

PremKolar · 2020-09-08T07:58:43Z

> https://doi.org/d8dn
>
> This one isn't matched because there's no 10, though, right?

Exactly! That's the second problem.

OK, yes, validating the DOI is of course the obvious solution to this problem. Thanks for the idea!
I have quite a busy week ahead, but I should find some time by the end of the week! :)

Siedlerchr · 2020-09-08T08:01:55Z

Please keep in mind that the cleanup actions can be executed for all entries in your library. So if you have thousands of entries, you would generate thousands of requests to the DOI resolver.

PremKolar · 2020-09-08T08:10:23Z

Right.
I will test scalability and limit the validations to ambiguous cases only!
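One way to keep the request count down, sketched under assumptions rather than taken from the eventual fix: accept full DOIs (those with a registrant code such as `10.1000/...`) without any lookup, send only short-DOI-shaped candidates to the resolver, and cache each answer so a repeated candidate costs one request. The `AmbiguousDoiGate` class and the injected `resolver` predicate are hypothetical names; what counts as "ambiguous" here is this sketch's own choice.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Predicate;
import java.util.regex.Pattern;

class AmbiguousDoiGate {
    // Full DOIs carry a registrant code; this sketch accepts them without a lookup.
    private static final Pattern FULL_DOI = Pattern.compile("^10(?:\\.[0-9]+)+/\\S+$");
    // Short-DOI-shaped candidates are the ambiguous ones that need validation.
    private static final Pattern SHORT_DOI = Pattern.compile("^10/[a-zA-Z0-9]+$");

    private final Predicate<String> resolver;                  // hypothetical: true if the candidate resolves
    private final Map<String, Boolean> cache = new HashMap<>(); // remembers earlier resolver answers

    AmbiguousDoiGate(Predicate<String> resolver) {
        this.resolver = resolver;
    }

    // Decides whether a candidate may be moved to the DOI field.
    boolean accept(String candidate) {
        if (FULL_DOI.matcher(candidate).matches()) {
            return true;   // unambiguous, no request
        }
        if (!SHORT_DOI.matcher(candidate).matches()) {
            return false;  // not DOI-shaped at all, no request
        }
        // Ambiguous: ask the resolver once per distinct candidate.
        return cache.computeIfAbsent(candidate, resolver::test);
    }
}
```

In a cleanup run over a whole library, the resolver passed in could wrap a network lookup, while tests can pass a plain lambda.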

Siedlerchr added the "type: enhancement" and "good first issue" labels Sep 7, 2020

PremKolar added a commit to PremKolar/jabref that referenced this issue Sep 17, 2020

fixed issue JabRef#6880: Improved detection of short DOIs.

68c4beb

PremKolar added a commit to PremKolar/jabref that referenced this issue Sep 17, 2020

updated changelog for issue JabRef#6880

404a41a

Siedlerchr mentioned this issue Sep 20, 2020

Improve parsing of short DOIs #6920

Merged

5 tasks

koppor closed this as completed in #6920 Sep 20, 2020

koppor moved this to Done in Features & Enhancements Nov 7, 2022

koppor added this to Features & Enhancements Nov 7, 2022
