-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fix DOI URL parsing #11084
Fix DOI URL parsing #11084
Conversation
Unit test failing ( |
Okay, seems like the inclusion of semicolon is causing an extended pattern matching in: jabref/src/test/java/org/jabref/model/entry/identifier/DOITest.java Lines 202 to 203 in 8d08279
|
That's a tough one and I am not sure how often that case with ;something happens e.g. text coming after the DOI Edit// This one is a hard one: |
I just added a space before "something" in my last commit. This will cause the last character to be ';', which will not be matched, resulting in the doi expected. Even if ;something was a valid doi, we can add it to the expected argument and the test will pass. Will take a read. |
Tried this one in my updated build, works perfectly fine! |
Awaiting further comments. |
I'm okay with this we cannot capture all edge cases |
@@ -316,4 +316,10 @@ public void rejectMissingDividerInShortDoi() { | |||
public void rejectNullDoiParameter() { | |||
assertThrows(NullPointerException.class, () -> new DOI(null)); | |||
} | |||
|
|||
@Test | |||
public void findDoiWithSpecialCharactersInText() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For the future: Integrate this in the exiting @ParamterizedTest
. The method name is then the comment above the pair of Arguments.of
. See Line 223.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Had forgotten about this. Done in #11603.
This pull request closes #10648
The Issue
On attempting to use JabRef's "Import by ID" feature as described in the linked issue, while importing using DOI URLs or DOI numbers containing some special characters, such as https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO;2, the following error was thrown:
as if, the DOI link itself was an invalid one. However, the link was perfectly fine.
The bug
The bug was extremely subtle and tough to trace. On careful inspection it was detected that when the
performSearchById(identifier)
was being called somewhere along the flow, line 23 was causing the variabledoi
to be assigned tohttps://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO
. That is, somehow the trailing ";2
" was being truncated from the identifier variable (which held the correct link) and assigned todoi
. This resulted in an invalid DOI link, which was causing the HTTP statuscode 404
to be thrown, resulting in an internalFetcherClientException
.jabref/src/main/java/org/jabref/logic/importer/CompositeIdFetcher.java
Lines 22 to 23 in 8d08279
Then, on investigating the underlying
findInText(identifier)
function in theDOI
class, the following snippet was analyzed:jabref/src/main/java/org/jabref/model/entry/identifier/DOI.java
Lines 202 to 222 in 369b9a7
It turns out that our input
identifier = https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO;2
was being matched (partially) withFIND_DOI_PATT
, which in turn contained a section that containedFIND_DOI_EXP
, which had the following structure:jabref/src/main/java/org/jabref/model/entry/identifier/DOI.java
Lines 43 to 51 in 369b9a7
In line 50, we can see that the pattern forbade
;
to be in any part of the suffix, so everything before that in our example was being matched. Then, when theDOI(new DOI(matcher.group(1));
constructor (line 208 as seen in theDOI.java
snippet linked above) was being called due to the match, it was identifying the partially matched part (https://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO
) to be a valid DOI URL, andresult
was being assigned that. Although a partial match in the second case was triggering the secondif
statement as well, the resultant matched part washttps://doi.org/10.1175/1520-0493(2002)130
, which was not being identified as a valid DOI URL by theDOI(matcher.group(1))
constructor call, and thus not overwritingresult
. Similarly thirdif
statement was also immaterial here due to a mismatch or a partial match resulting in an invalid DOI URL.So, the final value of
result
washttps://doi.org/10.1175/1520-0493(2002)130%3C1913:EDAWPO%3E2.0.CO
, which was an invalid DOI mistaken to be a valid one.The fix
The fix was relatively simpler once the investigation was done. We just had to remove the
;
from the forbidden characters in the expected regex structure ofFIND_DOI_EXP
, so that our link would fully match its structure and thus the variableresult
would contain the correct DOI URL, without the trailing ";2
" being truncated.FIND_DOI_EXP
had only one usage (as per IntelliJ), so it could be safely modified.A test has also been added in
DOITest.java
.Result
The article was successfully retrieved using Import by ID.
Mandatory checks
CHANGELOG.md
described in a way that is understandable for the average user (if applicable)