Fix DOI fetcher and add documentation on fetcher trust levels (#6990)
* Add documentation on fetcher trust levels

* Avoid 'null' as evaluation result if env variable is not defined

* Refine documentation

* Improve DOIResolution

- remove similarity of links
- add citation meta tag <meta name="citation_pdf_url">

* Fix test

Default value is "" and not "null"

* Fix typo

* Unused imports

Co-authored-by: Stefan Kolb <[email protected]>
3 people authored Jan 16, 2021
1 parent e173764 commit 8a66328
Showing 8 changed files with 82 additions and 49 deletions.
2 changes: 1 addition & 1 deletion .github/workflows/tests-fetchers.yml
@@ -38,7 +38,7 @@ jobs:
with:
java-version: 14
- uses: actions/cache@v1
name: Restore gradle chache
name: Restore gradle cache
with:
path: ~/.gradle/caches
key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle') }}
8 changes: 4 additions & 4 deletions .github/workflows/tests.yml
@@ -64,7 +64,7 @@ jobs:
with:
java-version: 14
- uses: actions/cache@v1
name: Restore gradle chache
name: Restore gradle cache
with:
path: ~/.gradle/caches
key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle') }}
@@ -108,7 +108,7 @@ jobs:
with:
java-version: 14
- uses: actions/cache@v1
name: Restore gradle chache
name: Restore gradle cache
with:
path: ~/.gradle/caches
key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle') }}
@@ -154,7 +154,7 @@ jobs:
with:
java-version: 14
- uses: actions/cache@v1
name: Restore gradle chache
name: Restore gradle cache
with:
path: ~/.gradle/caches
key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle') }}
@@ -204,7 +204,7 @@ jobs:
with:
java-version: 14
- uses: actions/cache@v1
name: Restore gradle chache
name: Restore gradle cache
with:
path: ~/.gradle/caches
key: ${{ runner.os }}-gradle-${{ hashFiles('**/*.gradle') }}
8 changes: 4 additions & 4 deletions build.gradle
@@ -274,10 +274,10 @@ processResources {
expand(version: project.findProperty('projVersionInfo') ?: '100.0.0',
"year": String.valueOf(Calendar.getInstance().get(Calendar.YEAR)),
"maintainers": new File('MAINTAINERS').readLines().findAll { !it.startsWith("#") }.join(", "),
"azureInstrumentationKey": System.getenv('AzureInstrumentationKey'),
"springerNatureAPIKey": System.getenv('SpringerNatureAPIKey'),
"astrophysicsDataSystemAPIKey": System.getenv('AstrophysicsDataSystemAPIKey'),
"ieeeAPIKey": System.getenv('IEEEAPIKey')
"azureInstrumentationKey": System.getenv('AzureInstrumentationKey') ? System.getenv('AzureInstrumentationKey') : '',
"springerNatureAPIKey": System.getenv('SpringerNatureAPIKey') ? System.getenv('SpringerNatureAPIKey') : '',
"astrophysicsDataSystemAPIKey": System.getenv('AstrophysicsDataSystemAPIKey') ? System.getenv('AstrophysicsDataSystemAPIKey') : '',
"ieeeAPIKey": System.getenv('IEEEAPIKey') ? System.getenv('IEEEAPIKey') : ''
)
filteringCharset = 'UTF-8'
}
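
The `build.gradle` change above substitutes an empty string when an API-key environment variable is not defined, so that the text `null` no longer ends up in the expanded resources. The same fallback in plain Java, as a minimal sketch: the class `EnvDefaultSketch` and the helper `envOrEmpty` are hypothetical names for illustration, and the build script itself is Groovy.

```java
final class EnvDefaultSketch {

    /** Returns the value of the environment variable, or "" if it is not defined. */
    static String envOrEmpty(String name) {
        String value = System.getenv(name);
        return value == null ? "" : value;
    }

    public static void main(String[] args) {
        // Prints the key if the variable is set, otherwise an empty line.
        System.out.println(envOrEmpty("SpringerNatureAPIKey"));
    }
}
```

With this fallback, code reading the property can test for an empty string instead of comparing against the string `"null"`.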
38 changes: 37 additions & 1 deletion docs/advanced-reading/fetchers.md
@@ -1,4 +1,4 @@
# Working on fetchers
# Information about Fetchers

Fetchers are the implementation of the [search using online services](https://docs.jabref.org/collect/import-using-online-bibliographic-database). Some fetchers require API keys to get them working. To get the fetchers running in a JabRef development setup, the keys need to be placed in the respective environment variable. The following table lists the respective fetchers, where to get the key from and the environment variable where the key has to be placed.

@@ -14,6 +14,42 @@ Fetchers are the implementation of the [search using online services](https://do

On Windows, you have to log-off and log-on to let IntelliJ know about the environment variable change. Execute the gradle task "processResources" in the group "others" within IntelliJ to ensure the values have been correctly written. Now, the fetcher tests should run without issues.

## Fulltext Fetchers

- all fulltext fetchers run in parallel
- the result with the highest priority wins
- `InterruptedException` | `ExecutionException` | `CancellationException` are ignored

### Trust Levels

- SOURCE (highest): definitive URL for a particular paper
- PUBLISHER: any publisher library
- PREPRINT: any preprint library that might include non-final publications of a paper
- META_SEARCH: meta search engines
- UNKNOWN (lowest): anything else not fitting the above categories
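
The selection rule described above (run every fulltext fetcher in parallel, ignore the three listed exceptions, keep the most trusted result) can be sketched as follows. This is a dependency-free illustration under stated assumptions, not JabRef's implementation: `FulltextSelectionSketch`, `Hit`, and `mostTrusted` are hypothetical names, and the enum simply lists the trust levels in the order given above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;
import java.util.concurrent.Callable;
import java.util.concurrent.CancellationException;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class FulltextSelectionSketch {

    /** Trust levels declared from highest to lowest, following the list above. */
    enum Trust { SOURCE, PUBLISHER, PREPRINT, META_SEARCH, UNKNOWN }

    /** Hypothetical result type: a link plus the trust level of the fetcher that found it. */
    static final class Hit {
        final Trust trust;
        final String link;

        Hit(Trust trust, String link) {
            this.trust = trust;
            this.link = link;
        }
    }

    /** Runs all fetchers in parallel and returns the hit with the highest trust level. */
    static Optional<Hit> mostTrusted(List<Callable<Optional<Hit>>> fetchers) {
        ExecutorService pool = Executors.newFixedThreadPool(Math.max(1, fetchers.size()));
        try {
            List<Future<Optional<Hit>>> futures = new ArrayList<>();
            for (Callable<Optional<Hit>> fetcher : fetchers) {
                futures.add(pool.submit(fetcher));
            }
            List<Hit> hits = new ArrayList<>();
            for (Future<Optional<Hit>> future : futures) {
                try {
                    future.get().ifPresent(hits::add);
                } catch (InterruptedException | ExecutionException | CancellationException ignored) {
                    // A failed fetcher contributes no result; the others still count.
                }
            }
            // Enum constants compare by declaration order, so the minimum is the most trusted.
            return hits.stream().min(Comparator.comparing((Hit hit) -> hit.trust));
        } finally {
            pool.shutdownNow();
        }
    }
}
```

In this sketch a fetcher that throws is treated exactly like one that finds nothing, so a single unreachable host cannot prevent a result from a more reliable source.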

### Current trust levels

All fetchers are contained in the package `org.jabref.logic.importer.fetcher`.
Here we list the trust levels of some of them:

- DOI: SOURCE, as the DOI is always forwarded to the correct publisher page for the paper
- ScienceDirect: PUBLISHER
- Springer: PUBLISHER
- ACS: PUBLISHER
- IEEE: PUBLISHER
- Google Scholar: META_SEARCH, because it is a search engine
- arXiv: PREPRINT, because preprints are published there
- OpenAccessDOI: META_SEARCH

Reasoning:

- A DOI uniquely identifies a paper. By definition, a DOI leads to the right paper; everything else is an educated guess.
- We assume that DOI resolution always points to the correct paper, whereas publisher fetchers may err. For instance, the title of a paper may match several publications of the same work, such as the conference version and the journal version, so the PDF that is returned could be either one.


The code was first introduced in [PR#3882](https://github.com/JabRef/jabref/pull/3882).

## Background on embedding the keys in JabRef

The keys are placed into the `build.properties` file.
50 changes: 29 additions & 21 deletions src/main/java/org/jabref/logic/importer/fetcher/DoiResolution.java
@@ -1,9 +1,9 @@
package org.jabref.logic.importer.fetcher;

import java.io.IOException;
import java.net.MalformedURLException;
import java.net.URL;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Objects;
@@ -12,7 +12,6 @@

import org.jabref.logic.importer.FulltextFetcher;
import org.jabref.logic.net.URLDownload;
import org.jabref.logic.util.strings.StringSimilarity;
import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.field.StandardField;
import org.jabref.model.entry.identifier.DOI;
@@ -32,11 +31,6 @@
public class DoiResolution implements FulltextFetcher {
private static final Logger LOGGER = LoggerFactory.getLogger(DoiResolution.class);

/**
* Hosts for which tailored fetchers exist, so this fetcher is not needed.
*/
private final List<String> excludedHosts = Arrays.asList("link.springer.com", "ieeexplore.ieee.org");

@Override
public Optional<URL> findFullText(BibEntry entry) throws IOException {
Objects.requireNonNull(entry);
@@ -64,11 +58,14 @@ public Optional<URL> findFullText(BibEntry entry) throws IOException {
connection.timeout(10000);

Connection.Response response = connection.execute();
if (excludedHosts.contains(response.url().getHost())) {
return Optional.empty();
}

Document html = response.parse();
// citation pdf meta tag
Optional<URL> citationMetaTag = citationMetaTag(html);
if (citationMetaTag.isPresent()) {
return citationMetaTag;
}

// scan for PDF
Elements hrefElements = html.body().select("a[href]");

@@ -91,11 +88,11 @@ public Optional<URL> findFullText(BibEntry entry) throws IOException {

// return if only one link was found (high accuracy)
if (links.size() == 1) {
LOGGER.info("Fulltext PDF found @ " + doiLink);
LOGGER.info("Fulltext PDF found @ {}", doiLink);
return Optional.of(links.get(0));
}
// return if links are similar or multiple links are similar
return findSimilarLinks(links);
// return if links are equal
return findDistinctLinks(links);
} catch (UnsupportedMimeTypeException type) {
// this might be the PDF already as we follow redirects
if (type.getMimeType().startsWith("application/pdf")) {
@@ -109,7 +106,25 @@ public Optional<URL> findFullText(BibEntry entry) throws IOException {
return Optional.empty();
}

private Optional<URL> findSimilarLinks(List<URL> urls) {
/**
* Scan for <meta name="citation_pdf_url">
* See https://scholar.google.com/intl/de/scholar/inclusion.html#indexing
*/
private Optional<URL> citationMetaTag(Document html) {
Elements citationPdfUrlElement = html.head().select("meta[name='citation_pdf_url']");
Optional<String> citationPdfUrl = citationPdfUrlElement.stream().map(e -> e.attr("content")).findFirst();

if (citationPdfUrl.isPresent()) {
try {
return Optional.of(new URL(citationPdfUrl.get()));
} catch (MalformedURLException e) {
return Optional.empty();
}
}
return Optional.empty();
}
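
The new `citationMetaTag` method reads the PDF location from the `<meta name="citation_pdf_url">` tag before falling back to scanning links. The idea can be tried without any dependencies using the sketch below. It is an approximation under stated assumptions: it swaps in regular expressions and `java.net.URI` for the jsoup selector and `java.net.URL` used in the commit, and the names `CitationMetaTagSketch` and `citationPdfUrl` are hypothetical.

```java
import java.net.URI;
import java.net.URISyntaxException;
import java.util.Optional;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

final class CitationMetaTagSketch {

    // Matches a whole <meta ...> tag; a real HTML parser is more robust than this.
    private static final Pattern META_TAG = Pattern.compile("<meta\\b[^>]*>", Pattern.CASE_INSENSITIVE);
    // Matches attr="value" or attr='value' inside a tag.
    private static final Pattern ATTRIBUTE = Pattern.compile("([\\w:-]+)\\s*=\\s*(?:\"([^\"]*)\"|'([^']*)')");

    /** Returns the content of the first citation_pdf_url meta tag, if it is a valid URI. */
    static Optional<URI> citationPdfUrl(String html) {
        Matcher tags = META_TAG.matcher(html);
        while (tags.find()) {
            String name = null;
            String content = null;
            Matcher attributes = ATTRIBUTE.matcher(tags.group());
            while (attributes.find()) {
                String value = attributes.group(2) != null ? attributes.group(2) : attributes.group(3);
                if ("name".equalsIgnoreCase(attributes.group(1))) {
                    name = value;
                } else if ("content".equalsIgnoreCase(attributes.group(1))) {
                    content = value;
                }
            }
            if ("citation_pdf_url".equals(name) && content != null) {
                try {
                    return Optional.of(new URI(content));
                } catch (URISyntaxException e) {
                    // An unparsable link is treated as "no fulltext found".
                    return Optional.empty();
                }
            }
        }
        return Optional.empty();
    }
}
```

Because publishers set this tag deliberately for indexing, a hit here is a stronger signal than any link guessed from the page body, which is why it is checked first.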

private Optional<URL> findDistinctLinks(List<URL> urls) {
List<URL> distinctLinks = urls.stream().distinct().collect(Collectors.toList());

if (distinctLinks.isEmpty()) {
@@ -119,13 +134,6 @@ private Optional<URL> findSimilarLinks(List<URL> urls) {
if (distinctLinks.size() == 1) {
return Optional.of(distinctLinks.get(0));
}
// similar
final String firstElement = distinctLinks.get(0).toString();
StringSimilarity similarity = new StringSimilarity();
List<URL> similarLinks = distinctLinks.stream().filter(elem -> similarity.isSimilar(firstElement, elem.toString())).collect(Collectors.toList());
if (similarLinks.size() == distinctLinks.size()) {
return Optional.of(similarLinks.get(0));
}

return Optional.empty();
}
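
After this change, the fallback in `findFullText` accepts a PDF link only when every candidate is the same link; the similarity heuristic is gone. A minimal sketch of that rule, assuming a hypothetical class `DistinctLinkSketch` and using `java.net.URI` in place of `java.net.URL` so that comparing links never triggers host name resolution (which `URL.equals` may do):

```java
import java.net.URI;
import java.util.List;
import java.util.Optional;
import java.util.stream.Collectors;

final class DistinctLinkSketch {

    /** Returns the link only if all candidates collapse to exactly one distinct value. */
    static Optional<URI> singleDistinctLink(List<URI> candidates) {
        List<URI> distinct = candidates.stream().distinct().collect(Collectors.toList());
        if (distinct.size() == 1) {
            return Optional.of(distinct.get(0));
        }
        // Zero candidates, or several different ones: too ambiguous to pick a PDF.
        return Optional.empty();
    }
}
```

Returning nothing for ambiguous pages leaves the decision to the other fulltext fetchers rather than risking the wrong PDF.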
@@ -1,5 +1,8 @@
package org.jabref.logic.importer.fetcher;

/**
* Discussion on the trust levels is available at our <a href="https://devdocs.jabref.org/advanced-reading/fetchers">documentation on fetchers</a>.
*/
public enum TrustLevel {
SOURCE(3),
PUBLISHER(2),
@@ -38,13 +38,13 @@ void linkWithPdfInTitleTag() throws IOException {
@Test
void linkWithPdfStringLeadsToFulltext() throws IOException {
entry.setField(StandardField.DOI, "10.1002/acr2.11101");
assertEquals(Optional.of(new URL("https://onlinelibrary.wiley.com/doi/epdf/10.1002/acr2.11101")), finder.findFullText(entry));
assertEquals(Optional.of(new URL("https://onlinelibrary.wiley.com/doi/pdf/10.1002/acr2.11101")), finder.findFullText(entry));
}

@Test
void multipleLinksWithSmallEditDistanceLeadToFulltext() throws IOException {
entry.setField(StandardField.DOI, "10.1002/acr2.11101");
assertEquals(Optional.of(new URL("https://onlinelibrary.wiley.com/doi/epdf/10.1002/acr2.11101")), finder.findFullText(entry));
void citationMetaTagLeadsToFulltext() throws IOException {
entry.setField(StandardField.DOI, "10.1007/978-3-319-89963-3_28");
assertEquals(Optional.of(new URL("https://link.springer.com/content/pdf/10.1007%2F978-3-319-89963-3_28.pdf")), finder.findFullText(entry));
}

@Test
@@ -53,18 +53,6 @@ void notReturnAnythingWhenMultipleLinksAreFound() throws IOException {
assertEquals(Optional.empty(), finder.findFullText(entry));
}

@Test
void notReturnAnythingWhenDOILeadsToSpringerLink() throws IOException {
entry.setField(StandardField.DOI, "https://doi.org/10.1007/978-3-319-89963-3_28");
assertEquals(Optional.empty(), finder.findFullText(entry));
}

@Test
void notReturnAnythingWhenDOILeadsToIEEE() throws IOException {
entry.setField(StandardField.DOI, "https://doi.org/10.1109/TTS.2020.2992669");
assertEquals(Optional.empty(), finder.findFullText(entry));
}

@Test
void notFoundByDOI() throws IOException {
entry.setField(StandardField.DOI, "10.1186/unknown-doi");
2 changes: 0 additions & 2 deletions src/test/java/org/jabref/logic/util/BuildInfoTest.java
@@ -3,7 +3,6 @@
import org.junit.jupiter.api.Test;

import static org.junit.jupiter.api.Assertions.assertEquals;
import static org.junit.jupiter.api.Assertions.assertNotEquals;
import static org.junit.jupiter.api.Assertions.assertNotNull;

public class BuildInfoTest {
@@ -24,6 +23,5 @@ public void testFileImport() {
public void azureInstrumentationKeyIsNotEmpty() {
BuildInfo buildInfo = new BuildInfo();
assertNotNull(buildInfo.azureInstrumentationKey);
assertNotEquals("", buildInfo.azureInstrumentationKey);
}
}
