Fix DOI URL parsing #11084

subhramit · 2024-03-23T20:46:36Z

This pull request closes #10648

The Issue

When using JabRef's "Import by ID" feature as described in the linked issue, importing by a DOI URL or DOI containing certain special characters, such as https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO;2, failed with an error, as if the DOI link itself were invalid. However, the link was perfectly fine.

The bug

The bug was extremely subtle and tough to trace. On careful inspection, it turned out that when performSearchById(identifier) was called along the flow, line 23 assigned the variable doi the value https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO. That is, the trailing ";2" was being dropped from the content of identifier (which held the correct link) before the value was assigned to doi. This produced an invalid DOI link, for which the server returned HTTP status code 404, resulting in an internal FetcherClientException.

jabref/src/main/java/org/jabref/logic/importer/CompositeIdFetcher.java

Lines 22 to 23 in 8d08279

    public Optional<BibEntry> performSearchById(String identifier) throws FetcherException {
        Optional<DOI> doi = DOI.findInText(identifier);

Then, on investigating the underlying findInText(identifier) function in the DOI class, the following snippet was analyzed:

jabref/src/main/java/org/jabref/model/entry/identifier/DOI.java

Lines 202 to 222 in 369b9a7

    public static Optional<DOI> findInText(String text) {
        Optional<DOI> result = Optional.empty();

        Matcher matcher = FIND_DOI_PATT.matcher(text);
        if (matcher.find()) {
            // match only group \1
            result = Optional.of(new DOI(matcher.group(1)));
        }

        matcher = FIND_SHORT_DOI_PATT.matcher(text);
        if (matcher.find()) {
            result = Optional.of(new DOI(matcher.group(1)));
        }

        matcher = FIND_SHORT_DOI_SHORTCUT.matcher(text);
        if (matcher.find()) {
            result = Optional.of(new DOI(matcher.group(0)));
        }

        return result;
    }

It turns out that our input identifier = https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO;2 was being matched only partially by FIND_DOI_PATT, which is built around FIND_DOI_EXP, an expression with the following structure:

jabref/src/main/java/org/jabref/model/entry/identifier/DOI.java

Lines 43 to 51 in 369b9a7

    private static final String FIND_DOI_EXP = ""
            + "(?:urn:)?"                       // optional urn
            + "(?:doi:)?"                       // optional doi
            + "("                               // begin group \1
            + "10"                              // directory indicator
            + "(?:\\.[0-9]+)+"                  // registrant codes
            + "[/:]"                            // divider
            + "(?:[^\\s,;]+[^,;(\\.\\s)])"      // suffix alphanumeric without " "/","/";" and not ending on "."/","/";"
            + ")";                              // end group \1
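The truncation can be reproduced outside JabRef with a minimal sketch. It copies FIND_DOI_EXP from the snippet above and compiles it directly; the class name `OldDoiPatternDemo` and the helper `firstMatch` are hypothetical, and JabRef's real FIND_DOI_PATT and the DOI constructor do more than this sketch shows.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class OldDoiPatternDemo {
    // Copied from the FIND_DOI_EXP snippet above (state before this PR).
    static final String FIND_DOI_EXP = ""
            + "(?:urn:)?"
            + "(?:doi:)?"
            + "("
            + "10"
            + "(?:\\.[0-9]+)+"
            + "[/:]"
            + "(?:[^\\s,;]+[^,;(\\.\\s)])"
            + ")";

    // Assumption: compiled as-is; JabRef embeds this expression in FIND_DOI_PATT.
    static final Pattern PATTERN = Pattern.compile(FIND_DOI_EXP);

    // Hypothetical helper: returns group 1 of the first match, if any.
    static Optional<String> firstMatch(String text) {
        Matcher matcher = PATTERN.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String input = "https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO;2";
        // Prints: 10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO
        System.out.println(firstMatch(input).orElse("<no match>"));
    }
}
```

In this sketch, group 1 starts at the "10" directory indicator and stops just before the ';', so the ";2" never reaches whatever consumes the match.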

In line 50, we can see that the pattern forbade ; anywhere in the suffix, so only the part before it in our example was matched. Then, when the new DOI(matcher.group(1)) constructor (line 208 in the DOI.java snippet above) was called for the match, it accepted the partially matched part (https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO) as a valid DOI URL, and result was assigned that value. A partial match with the second pattern triggered the second if statement as well, but the matched part there was https://doi.org/10.1175/1520-0493(2002)130, which the DOI(matcher.group(1)) constructor call did not accept as a valid DOI URL, so result was not overwritten. Similarly, the third if statement was immaterial here: it either did not match or partially matched something that is not a valid DOI URL.
So, the final value of result was https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO, an invalid DOI mistaken for a valid one.

The fix

The fix was relatively simple once the investigation was done. We just had to remove ; from the forbidden characters in the regex structure of FIND_DOI_EXP, so that our link fully matches and the variable result contains the correct DOI URL, without the trailing ";2" being truncated. FIND_DOI_EXP had only one usage (as per IntelliJ), so it could be safely modified.
A test has also been added in DOITest.java.
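A side-by-side sketch of the suffix before and after the change may help. It assumes the fix removes ; only from the first character class while the final character class still forbids it, which is consistent with the later comment in this thread that a trailing ';' is not matched. The class name `DoiPatternComparison` and the helper `extract` are hypothetical.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class DoiPatternComparison {
    // Shared head of the expression: optional urn:/doi:, then the start of group 1.
    static final String HEAD = "(?:urn:)?(?:doi:)?(10(?:\\.[0-9]+)+[/:]";

    // Suffix as shown in the description (';' forbidden everywhere).
    static final Pattern BEFORE = Pattern.compile(HEAD + "(?:[^\\s,;]+[^,;(\\.\\s)]))");

    // Assumed suffix after the fix (';' allowed inside, still not as last character).
    static final Pattern AFTER = Pattern.compile(HEAD + "(?:[^\\s,]+[^,;(\\.\\s)]))");

    // Hypothetical helper: group 1 of the first match, or null if nothing matches.
    static String extract(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String input = "https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO;2";
        System.out.println("before: " + extract(BEFORE, input)); // ends in "2.0.CO"
        System.out.println("after:  " + extract(AFTER, input));  // ends in "2.0.CO;2"
    }
}
```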

Result

The article was successfully retrieved using Import by ID.

Mandatory checks

Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
Tests created for changes (if applicable)
Manually tested changed features in running JabRef (always required)
Screenshots added in PR description (for UI changes)
Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.


subhramit · 2024-03-23T21:00:28Z

Unit test failing (DOITest.java). Need help fixing the test.

subhramit · 2024-03-23T21:26:07Z

Unit test failing (DOITest.java). Need help fixing the test.

Okay, it seems that allowing the semicolon makes the pattern match further than the test expects in:

jabref/src/test/java/org/jabref/model/entry/identifier/DOITest.java

Lines 202 to 203 in 8d08279

    Arguments.of("10.1007/s10549-018-4743-9",
            DOI.findInText("Breast Cancer Res Treat. 2018 July ; 170(1): 77–87. doi:10.1007/s10549-018-4743-9;something else").get().getDOI()),
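The effect on this test input can be illustrated with a small regex-only sketch. It assumes the relaxed suffix allows ';' inside the DOI but still not as its last character; the class name `TrailingTextDemo` and the helper `extract` are hypothetical, and the en dash in the citation is written as \u2013 only to keep the source file ASCII.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TrailingTextDemo {
    // Assumed expression after the fix: ';' removed from the first suffix class only.
    static final Pattern RELAXED = Pattern.compile(
            "(?:urn:)?(?:doi:)?(10(?:\\.[0-9]+)+[/:](?:[^\\s,]+[^,;(\\.\\s)]))");

    // Hypothetical helper: group 1 of the first match, or null if nothing matches.
    static String extract(String text) {
        Matcher matcher = RELAXED.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String citation = "Breast Cancer Res Treat. 2018 July ; 170(1): 77\u201387. ";

        // Text glued to the DOI with ';' is now swallowed into the match.
        System.out.println(extract(citation + "doi:10.1007/s10549-018-4743-9;something else"));

        // With whitespace after ';', the match backs off to the bare DOI.
        System.out.println(extract(citation + "doi:10.1007/s10549-018-4743-9; something else"));
    }
}
```

The first call yields 10.1007/s10549-018-4743-9;something and the second yields 10.1007/s10549-018-4743-9, which is the trade-off discussed in the following comments.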

Siedlerchr · 2024-03-23T21:42:07Z

That's a tough one and I am not sure how often that case with ;something happens e.g. text coming after the DOI
Some readings on Regex and DOI:
It's impossible to catch all edge cases
https://datacite.org/blog/cool-dois/
https://www.crossref.org/blog/dois-and-matching-regular-expressions/

Edit// This one is a hard one:
https://doi.org/10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8

subhramit · 2024-03-23T21:44:24Z

That's a tough one and I am not sure how often that case with ;something happens e.g. text coming after the DOI Some readings on Regex and DOI: It's impossible to catch all edge cases https://datacite.org/blog/cool-dois/ https://www.crossref.org/blog/dois-and-matching-regular-expressions/

I just added a space before "something" in my last commit. The character right before the whitespace is then ';', which the pattern does not accept as the last character of a DOI, so the match ends at the expected DOI. Even if the part including ;something were a valid DOI, we could put it in the expected argument and the test would pass. Will take a read.

subhramit · 2024-03-23T21:47:01Z

Edit// This one is a hard one: https://doi.org/10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8

Tried this one in my updated build, works perfectly fine!
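For reference, a regex-only sketch of this case, under the same assumption about the relaxed suffix (';' allowed inside the DOI, not as its last character). It only exercises the pattern and not JabRef's fetcher; `SiciDoiDemo` and `extract` are hypothetical names.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SiciDoiDemo {
    // Assumed expression after the fix: ';' removed from the first suffix class only.
    static final Pattern RELAXED = Pattern.compile(
            "(?:urn:)?(?:doi:)?(10(?:\\.[0-9]+)+[/:](?:[^\\s,]+[^,;(\\.\\s)]))");

    // Hypothetical helper: group 1 of the first match, or null if nothing matches.
    static String extract(String text) {
        Matcher matcher = RELAXED.matcher(text);
        return matcher.find() ? matcher.group(1) : null;
    }

    public static void main(String[] args) {
        String url = "https://doi.org/10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8";
        // Prints the DOI including the ";2-8" tail.
        System.out.println(extract(url));
    }
}
```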

subhramit · 2024-03-24T07:14:36Z

Awaiting further comments.

Siedlerchr · 2024-03-24T11:49:49Z

I'm okay with this; we cannot capture all edge cases.

koppor · 2024-03-24T22:39:51Z

src/test/java/org/jabref/model/entry/identifier/DOITest.java

@@ -316,4 +316,10 @@ public void rejectMissingDividerInShortDoi() {
    public void rejectNullDoiParameter() {
        assertThrows(NullPointerException.class, () -> new DOI(null));
    }
+
+    @Test
+    public void findDoiWithSpecialCharactersInText() {

For the future: Integrate this in the existing @ParameterizedTest. The method name is then the comment above the pair of Arguments.of. See Line 223.

subhramit · 2024-08-10

Had forgotten about this. Done in #11603.

subhramit added 4 commits March 23, 2024 19:06

Update changelog

d63426d

Fix regex pattern match, patch success

1681f27

Merge branch 'main' of https://github.com/JabRef/jabref into fix-for-…

465db58

…issue-10648

Add test

201c054

subhramit and others added 2 commits March 24, 2024 02:43

Fix test

1a7791b

Merge branch 'JabRef:main' into fix-for-issue-10648

62c1029

Siedlerchr changed the title ~~Fix for issue 10648~~ Fix DOI Url parsing Mar 23, 2024

Fix test, as forbidden ';' was not terminating a doi case

19848d9

Siedlerchr approved these changes Mar 24, 2024

Siedlerchr added this pull request to the merge queue Mar 24, 2024

Merged via the queue into JabRef:main with commit 4432dcf Mar 24, 2024
20 checks passed

koppor reviewed Mar 24, 2024

subhramit deleted the fix-for-issue-10648 branch March 25, 2024 23:23

subhramit mentioned this pull request Aug 10, 2024

Integrate DOI special character parsing test in existing parameterized test #11603

Merged

6 tasks

subhramit changed the title ~~Fix DOI Url parsing~~ Fix DOI URL parsing Aug 21, 2024
