Fix DOI URL parsing #11084

Merged
merged 7 commits into from
Mar 24, 2024

Member

@subhramit subhramit commented Mar 23, 2024

This pull request closes #10648

The Issue

While using JabRef's "Import by ID" feature as described in the linked issue, importing DOI URLs or DOI numbers containing certain special characters, such as https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO;2, threw the following error:

[screenshot of the error dialog]
It was as if the DOI link itself were invalid. However, the link was perfectly fine.

The bug

The bug was subtle and tough to trace. On careful inspection, it turned out that when performSearchById(identifier) was called somewhere along the flow, line 23 assigned https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO to the variable doi. That is, the trailing ";2" was being truncated from the identifier variable (which held the correct link) before the value was assigned to doi. The result was an invalid DOI link, for which the server returned HTTP status code 404, leading to an internal FetcherClientException.

public Optional<BibEntry> performSearchById(String identifier) throws FetcherException {
    Optional<DOI> doi = DOI.findInText(identifier);

Then, on investigating the underlying findInText(identifier) function in the DOI class, the following snippet was analyzed:
public static Optional<DOI> findInText(String text) {
    Optional<DOI> result = Optional.empty();
    Matcher matcher = FIND_DOI_PATT.matcher(text);
    if (matcher.find()) {
        // match only group \1
        result = Optional.of(new DOI(matcher.group(1)));
    }
    matcher = FIND_SHORT_DOI_PATT.matcher(text);
    if (matcher.find()) {
        result = Optional.of(new DOI(matcher.group(1)));
    }
    matcher = FIND_SHORT_DOI_SHORTCUT.matcher(text);
    if (matcher.find()) {
        result = Optional.of(new DOI(matcher.group(0)));
    }
    return result;
}

It turns out that our input identifier = https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO;2 was being matched (partially) by FIND_DOI_PATT, whose pattern embeds FIND_DOI_EXP, which had the following structure:
private static final String FIND_DOI_EXP = ""
        + "(?:urn:)?"                       // optional urn
        + "(?:doi:)?"                       // optional doi
        + "("                               // begin group \1
        + "10"                              // directory indicator
        + "(?:\\.[0-9]+)+"                  // registrant codes
        + "[/:]"                            // divider
        + "(?:[^\\s,;]+[^,;(\\.\\s)])"      // suffix alphanumeric without " "/","/";" and not ending on "."/","/";"
        + ")";                              // end group \1

In line 50, we can see that the pattern forbade ; anywhere in the suffix, so only the part of our example before the ; was matched. When the new DOI(matcher.group(1)) constructor (line 208 in the DOI.java snippet linked above) was then called for this match, it accepted the partially matched part (https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO) as a valid DOI URL, and result was assigned that value. A partial match in the second case triggered the second if statement as well, but the matched part there was https://doi.org/10.1175/1520-0493(2002)130, which the DOI(matcher.group(1)) constructor call did not identify as a valid DOI URL, so result was not overwritten. Similarly, the third if statement was immaterial here, because it produced either no match or a partial match resulting in an invalid DOI URL.
So, the final value of result was https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO, an invalid DOI mistaken for a valid one.
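
The truncation can be reproduced in isolation. The following is a minimal sketch, not the actual DOI.java: the class name OldDoiPatternDemo and the helper extract are hypothetical, and it compiles FIND_DOI_EXP directly instead of going through FIND_DOI_PATT and the DOI constructor, so it only shows what group 1 captures.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class; only the regex fragments are copied from the FIND_DOI_EXP snippet above.
final class OldDoiPatternDemo {
    // Pre-fix expression: ';' is excluded from the entire suffix.
    private static final Pattern OLD_FIND_DOI = Pattern.compile(""
            + "(?:urn:)?"                    // optional urn
            + "(?:doi:)?"                    // optional doi
            + "("                            // begin group 1
            + "10"                           // directory indicator
            + "(?:\\.[0-9]+)+"               // registrant codes
            + "[/:]"                         // divider
            + "(?:[^\\s,;]+[^,;(\\.\\s)])"   // suffix: stops at the first ';'
            + ")");                          // end group 1

    // Returns what group 1 captures for the first match, if any.
    static Optional<String> extract(String text) {
        Matcher matcher = OLD_FIND_DOI.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String input = "https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO;2";
        // Prints 10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO (the trailing ";2" is lost)
        System.out.println(extract(input).orElse("<no match>"));
    }
}
```

Because find() is satisfied with a partial match, the regex gives no signal that part of the input was dropped; the failure only surfaces later as the 404.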

The fix

The fix was relatively simple once the investigation was done. We just had to remove ; from the forbidden characters in the regex structure of FIND_DOI_EXP, so that our link matches in full and result contains the correct DOI URL, without the trailing ";2" being truncated. FIND_DOI_EXP had only one usage (as per IntelliJ), so it could be safely modified.
A test has also been added in DOITest.java.
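
As a rough illustration of the change, here is a sketch under one assumption: that ; was dropped only from the first character class of the suffix and is still rejected as the final character, which is consistent with the test discussion further down in this conversation. The class name FixedDoiPatternDemo and the helper extract are hypothetical, and the merged code may differ in detail.

```java
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class, not the actual DOI.java.
final class FixedDoiPatternDemo {
    // Assumption: ';' is allowed inside the suffix but still not as its last character.
    private static final Pattern FIXED_FIND_DOI = Pattern.compile(""
            + "(?:urn:)?"                   // optional urn
            + "(?:doi:)?"                   // optional doi
            + "("                           // begin group 1
            + "10"                          // directory indicator
            + "(?:\\.[0-9]+)+"              // registrant codes
            + "[/:]"                        // divider
            + "(?:[^\\s,]+[^,;(\\.\\s)])"   // suffix: ';' no longer ends the match
            + ")");                         // end group 1

    // Returns what group 1 captures for the first match, if any.
    static Optional<String> extract(String text) {
        Matcher matcher = FIXED_FIND_DOI.matcher(text);
        return matcher.find() ? Optional.of(matcher.group(1)) : Optional.empty();
    }

    public static void main(String[] args) {
        String input = "https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO;2";
        // Prints 10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO;2 (the full DOI)
        System.out.println(extract(input).orElse("<no match>"));
    }
}
```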

Result

The article was successfully retrieved using Import by ID.

Mandatory checks

  • Change in CHANGELOG.md described in a way that is understandable for the average user (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked developer's documentation: Is the information available and up to date? If not, I outlined it in this pull request.
  • Checked documentation: Is the information available and up to date? If not, I created an issue at https://github.com/JabRef/user-documentation/issues or, even better, I submitted a pull request to the documentation repository.

@subhramit
Member Author

Unit test failing (DOITest.java). Need help fixing the test.

@subhramit
Member Author

subhramit commented Mar 23, 2024

> Unit test failing (DOITest.java). Need help fixing the test.

Okay, it seems that allowing the semicolon makes the pattern match more than expected in:

Arguments.of("10.1007/s10549-018-4743-9",
DOI.findInText("Breast Cancer Res Treat. 2018 July ; 170(1): 77–87. doi:10.1007/s10549-018-4743-9;something else").get().getDOI()),

@Siedlerchr changed the title from "Fix for issue 10648" to "Fix DOI Url parsing" on Mar 23, 2024
@Siedlerchr
Member

Siedlerchr commented Mar 23, 2024

That's a tough one, and I am not sure how often that case with ;something happens, i.e. text coming directly after the DOI.
It's impossible to catch all edge cases.
Some readings on regex and DOIs:
https://datacite.org/blog/cool-dois/
https://www.crossref.org/blog/dois-and-matching-regular-expressions/

Edit// This one is a hard one:
https://doi.org/10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8

@subhramit
Member Author

subhramit commented Mar 23, 2024

> That's a tough one and I am not sure how often that case with ;something happens e.g. text coming after the DOI
> Some readings on Regex and DOI:
> It's impossible to catch all edge cases
> https://datacite.org/blog/cool-dois/
> https://www.crossref.org/blog/dois-and-matching-regular-expressions/

I just added a space before "something" in my last commit. The DOI is then followed by ';' and whitespace, and since the pattern does not accept ';' as the last character, the ';' is not matched, which yields the expected DOI. Even if ;something were part of a valid DOI, we could add it to the expected argument and the test would pass. Will give the links a read.
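
The effect of that space can be checked with a small sketch. It assumes, as above, that the fix only lifts the ban on ; inside the suffix while keeping it forbidden as the final character; TrailingTextDemo, pattern and extract are hypothetical names, and the expression is compiled on its own rather than through DOI.findInText.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical demo class comparing the pre-fix expression with the assumed post-fix one.
final class TrailingTextDemo {
    // Builds the expression with ';' either allowed or forbidden inside the suffix.
    static Pattern pattern(boolean semicolonAllowedInSuffix) {
        String suffixBody = semicolonAllowedInSuffix ? "[^\\s,]+" : "[^\\s,;]+";
        return Pattern.compile("(?:urn:)?(?:doi:)?"
                + "(10(?:\\.[0-9]+)+[/:](?:" + suffixBody + "[^,;(\\.\\s)]))");
    }

    // Returns group 1 of the first match, or an empty string if nothing matches.
    static String extract(Pattern pattern, String text) {
        Matcher matcher = pattern.matcher(text);
        return matcher.find() ? matcher.group(1) : "";
    }

    public static void main(String[] args) {
        String prefix = "Breast Cancer Res Treat. 2018 July ; 170(1): 77\u201387. doi:10.1007/s10549-018-4743-9";
        // Text glued to the DOI is swallowed once ';' is allowed:
        System.out.println(extract(pattern(true), prefix + ";something else"));
        // prints 10.1007/s10549-018-4743-9;something
        // With a space after ';', the match backtracks past the ';':
        System.out.println(extract(pattern(true), prefix + "; something else"));
        // prints 10.1007/s10549-018-4743-9
    }
}
```

The trade-off is visible here: allowing ; inside the suffix rescues DOIs such as the one from the issue, at the cost of absorbing text that directly follows a DOI without whitespace.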

@subhramit
Member Author

subhramit commented Mar 23, 2024

> Edit// This one is a hard one: https://doi.org/10.1002/(sici)1099-1409(199908/10)3:6/7<672::aid-jpp192>3.0.co;2-8

Tried this one in my updated build, works perfectly fine!


@subhramit
Member Author

Awaiting further comments.

@Siedlerchr
Member

I'm okay with this; we cannot capture all edge cases.

@Siedlerchr Siedlerchr added this pull request to the merge queue Mar 24, 2024
Merged via the queue into JabRef:main with commit 4432dcf Mar 24, 2024
20 checks passed
@@ -316,4 +316,10 @@ public void rejectMissingDividerInShortDoi() {
public void rejectNullDoiParameter() {
assertThrows(NullPointerException.class, () -> new DOI(null));
}

@Test
public void findDoiWithSpecialCharactersInText() {
Member


For the future: Integrate this into the existing @ParameterizedTest. The method name then becomes the comment above the pair of Arguments.of. See Line 223.

Member Author


Had forgotten about this. Done in #11603.

@subhramit subhramit deleted the fix-for-issue-10648 branch March 25, 2024 23:23
@subhramit changed the title from "Fix DOI Url parsing" to "Fix DOI URL parsing" on Aug 21, 2024
Development

Successfully merging this pull request may close these issues.

Entry creation from DOI does not recognize URL encoding, but "get bibliographical data" does
3 participants