Fix more fetchers (#6790)

koppor authored Aug 26, 2020
1 parent 92161af commit d052170
Showing 14 changed files with 262 additions and 107 deletions.
1 change: 1 addition & 0 deletions CHANGELOG.md
@@ -106,6 +106,7 @@ Note that this project **does not** adhere to [Semantic Versioning](http://semve
- We fixed an issue where percent sign ('%') was not formatted properly by the HTML formatter [#6753](https://github.com/JabRef/jabref/issues/6753)
- We fixed an issue with the [SAO/NASA Astrophysics Data System](https://docs.jabref.org/collect/import-using-online-bibliographic-database/ads) fetcher where `\textbackslash` appeared at the end of the abstract.
- We fixed an issue with the Science Direct fetcher where PDFs could not be downloaded. Fixes [#5860](https://github.com/JabRef/jabref/issues/5860)
- We fixed an issue with the Library of Congress importer.

### Removed

104 changes: 104 additions & 0 deletions docs/adr/0014-separate-URL-creation-to-enable-proper-logging.md
@@ -0,0 +1,104 @@
# Separate URL creation to enable proper logging

## Context and Problem Statement

Fetchers sometimes fail, and the reason for a failure needs to be investigated.

* Claim 1: Knowing the URL that was used to query the fetcher eases debugging
* Claim 2: Logging that URL (instead of inspecting it in the debugger only) eases debugging

How can the URL used for fetching be logged properly?

## Decision Drivers

* Code should be easy to read
* If an exception is thrown anyway, include the URL in the exception message instead of logging it separately (see <https://howtodoinjava.com/best-practices/java-exception-handling-best-practices/#6>)

## Considered Options

* Separate URL creation
* Create URL when logging the URL
* Include URL creation as statement before the stream creation in the try-with-resources block

## Decision Outcome

Chosen option: "Separate URL creation", because it comes out best (see below).

## Pros and Cons of the Options

### Separate URL creation

```java
URL urlForQuery;
try {
    urlForQuery = getURLForQuery(query);
} catch (URISyntaxException | MalformedURLException | FetcherException e) {
    throw new FetcherException(String.format("Search URI %s is malformed", query), e);
}
try (InputStream stream = getUrlDownload(urlForQuery).asInputStream()) {
    ...
} catch (IOException e) {
    throw new FetcherException("A network error occurred while fetching from " + urlForQuery.toString(), e);
} catch (ParseException e) {
    throw new FetcherException("An internal parser error occurred while fetching from " + urlForQuery.toString(), e);
}
```

* Good, because exceptions thrown by the method are caught directly
* Good, because exceptions in different statements belong to different catch blocks
* Good, because the code to determine the URL is written only once
* OK, because "Java by Comparison" does not state anything about it
* Bad, because multiple try/catch statements are required
* Bad, because this style seems to be uncommon among Java coders

### Create URL when logging the URL

The "logging" is done when throwing the exception.

Example code:

```java
try (InputStream stream = getUrlDownload(getURLForQuery(query)).asInputStream()) {
    ...
} catch (URISyntaxException | MalformedURLException | FetcherException e) {
    throw new FetcherException(String.format("Search URI %s is malformed", query), e);
} catch (IOException e) {
    try {
        throw new FetcherException("A network error occurred while fetching from " + getURLForQuery(query), e);
    } catch (URISyntaxException | MalformedURLException uriSyntaxException) {
        // does not happen
        throw new FetcherException("A network error occurred", e);
    }
} catch (ParseException e) {
    try {
        throw new FetcherException("An internal parser error occurred while fetching from " + getURLForQuery(query), e);
    } catch (URISyntaxException | MalformedURLException uriSyntaxException) {
        // does not happen
        throw new FetcherException("An internal parser error occurred", e);
    }
}
```

* Good, because code inside the `try` statement stays the same
* OK, because "Java by Comparison" does not state anything about it
* Bad, because an additional try/catch-block is added to each catch statement
* Bad, because a `throw` statement is needed in the `URISyntaxException` catch block (even though the exception cannot occur at this point), because the compiler otherwise reports a missing `return` statement.
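The compiler rule behind this last point can be shown in isolation. A minimal sketch (class and method names are illustrative, not from the JabRef codebase): every execution path of a non-void method must end in a `return` or a `throw`, even when the programmer knows a branch is unreachable.

```java
public class ReturnPathDemo {

    static String describe(int code) {
        if (code == 0) {
            return "ok";
        }
        // Without this throw (or a return), the compiler rejects the method
        // with "missing return statement", even if callers never pass code != 0.
        throw new IllegalStateException("unexpected code " + code);
    }

    public static void main(String[] args) {
        System.out.println(describe(0)); // prints "ok"
    }
}
```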

### Include URL creation as statement before the stream creation in the try-with-resources block

```java
try (URL urlForQuery = getURLForQuery(query); InputStream stream = urlForQuery.asInputStream()) {
    ...
} catch (URISyntaxException | MalformedURLException | FetcherException e) {
    throw new FetcherException(String.format("Search URI %s is malformed", query), e);
} catch (IOException e) {
    throw new FetcherException("A network error occurred while fetching from " + urlForQuery.toString(), e);
} catch (ParseException e) {
    throw new FetcherException("An internal parser error occurred while fetching from " + urlForQuery.toString(), e);
}
```

* Good, because the single try/catch-block can be kept
* Good, because logical flow is kept
* Bad, because it does not compile (`URL` is not an `AutoCloseable`)
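As a side note on why this fails: a resource in a try-with-resources header must implement `java.lang.AutoCloseable`. Making this option compile would require wrapping the URL, as in the following illustrative sketch (the wrapper class is hypothetical and not part of JabRef):

```java
import java.net.MalformedURLException;
import java.net.URL;

// Hypothetical wrapper: implements AutoCloseable so a URL can appear
// in a try-with-resources header; close() has nothing to release.
final class CloseableUrl implements AutoCloseable {
    final URL url;

    CloseableUrl(String spec) throws MalformedURLException {
        this.url = new URL(spec);
    }

    @Override
    public void close() {
        // a URL holds no system resource, so nothing to do here
    }
}

public class TryWithResourcesDemo {

    static String hostOf(String spec) throws MalformedURLException {
        try (CloseableUrl wrapped = new CloseableUrl(spec)) {
            return wrapped.url.getHost();
        }
    }

    public static void main(String[] args) throws MalformedURLException {
        System.out.println(hostOf("https://example.org/search?q=test"));
    }
}
```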
SearchBasedParserFetcher.java
@@ -15,9 +15,11 @@

/**
* Provides a convenient interface for search-based fetcher, which follow the usual three-step procedure:
* 1. Open a URL based on the search query
* 2. Parse the response to get a list of {@link BibEntry}
* 3. Post-process fetched entries
* <ol>
* <li>Open a URL based on the search query</li>
* <li>Parse the response to get a list of {@link BibEntry}</li>
* <li>Post-process fetched entries</li>
* </ol>
*/
public interface SearchBasedParserFetcher extends SearchBasedFetcher {

@@ -39,21 +41,14 @@ default List<BibEntry> performSearch(String query) throws FetcherException {
return Collections.emptyList();
}

try (InputStream stream = getUrlDownload(getURLForQuery(query)).asInputStream()) {
List<BibEntry> fetchedEntries = getParser().parseEntries(stream);

// Post-cleanup
fetchedEntries.forEach(this::doPostCleanup);

return fetchedEntries;
} catch (URISyntaxException e) {
throw new FetcherException("Search URI is malformed", e);
} catch (IOException e) {
// TODO: Catch HTTP Response 401/403 errors and report that user has no rights to access resource
throw new FetcherException("A network error occurred", e);
} catch (ParseException e) {
throw new FetcherException("An internal parser error occurred", e);
// ADR-0014
URL urlForQuery;
try {
urlForQuery = getURLForQuery(query);
} catch (URISyntaxException | MalformedURLException | FetcherException e) {
throw new FetcherException(String.format("Search URI crafted from query %s is malformed", query), e);
}
return getBibEntries(urlForQuery);
}

/**
@@ -65,17 +60,25 @@ default List<BibEntry> performSearch(String query) throws FetcherException {
*/
@Override
default List<BibEntry> performComplexSearch(ComplexSearchQuery complexSearchQuery) throws FetcherException {
try (InputStream stream = getUrlDownload(getComplexQueryURL(complexSearchQuery)).asInputStream()) {
// ADR-0014
URL urlForQuery;
try {
urlForQuery = getComplexQueryURL(complexSearchQuery);
} catch (URISyntaxException | MalformedURLException | FetcherException e) {
throw new FetcherException("Search URI crafted from complex search query is malformed", e);
}
return getBibEntries(urlForQuery);
}

private List<BibEntry> getBibEntries(URL urlForQuery) throws FetcherException {
try (InputStream stream = getUrlDownload(urlForQuery).asInputStream()) {
List<BibEntry> fetchedEntries = getParser().parseEntries(stream);
fetchedEntries.forEach(this::doPostCleanup);
return fetchedEntries;
} catch (URISyntaxException e) {
throw new FetcherException("Search URI is malformed", e);
} catch (IOException e) {
// TODO: Catch HTTP Response 401/403 errors and report that user has no rights to access resource
throw new FetcherException("A network error occurred", e);
throw new FetcherException("A network error occurred while fetching from " + urlForQuery, e);
} catch (ParseException e) {
throw new FetcherException("An internal parser error occurred", e);
throw new FetcherException("An internal parser error occurred while fetching from " + urlForQuery, e);
}
}

@@ -86,15 +89,16 @@ default URL getComplexQueryURL(ComplexSearchQuery complexSearchQuery) throws URI

/**
* Performs a cleanup of the fetched entry.
*
* <p>
* Only systematic errors of the fetcher should be corrected here
* (i.e. if information is consistently contained in the wrong field or the wrong format)
but not cosmetic issues which may depend on the user's taste (for example, LaTeX code vs. HTML in the abstract).
*
* <p>
* Try to reuse existing {@link Formatter} for the cleanup. For example,
* {@code new FieldFormatterCleanup(StandardField.TITLE, new RemoveBracesFormatter()).cleanup(entry);}
*
* <p>
* By default, no cleanup is done.
*
* @param entry the entry to be cleaned-up
*/
default void doPostCleanup(BibEntry entry) {
CollectionOfComputerScienceBibliographiesFetcher.java
@@ -3,6 +3,7 @@
import java.net.MalformedURLException;
import java.net.URISyntaxException;
import java.net.URL;
import java.util.Arrays;

import org.jabref.logic.formatter.bibtexfields.RemoveDigitsFormatter;
import org.jabref.logic.formatter.bibtexfields.RemoveNewlinesFormatter;
@@ -14,7 +15,10 @@
import org.jabref.logic.importer.SearchBasedParserFetcher;
import org.jabref.model.cleanup.FieldFormatterCleanup;
import org.jabref.model.entry.BibEntry;
import org.jabref.model.entry.field.Field;
import org.jabref.model.entry.field.FieldFactory;
import org.jabref.model.entry.field.StandardField;
import org.jabref.model.entry.field.UnknownField;

import org.apache.http.client.utils.URIBuilder;

@@ -31,10 +35,10 @@ public CollectionOfComputerScienceBibliographiesFetcher(ImportFormatPreferences
@Override
public URL getURLForQuery(String query) throws URISyntaxException, MalformedURLException, FetcherException {
return new URIBuilder(BASIC_SEARCH_URL)
.addParameter("query", query)
.addParameter("sort", "score")
.build()
.toURL();
.addParameter("query", query)
.addParameter("sort", "score")
.build()
.toURL();
}

@Override
@@ -53,5 +57,32 @@ public void doPostCleanup(BibEntry entry) {
new FieldFormatterCleanup(StandardField.ABSTRACT, new ReplaceTabsBySpaceFormater()).cleanup(entry);
new FieldFormatterCleanup(StandardField.ABSTRACT, new RemoveRedundantSpacesFormatter()).cleanup(entry);
new FieldFormatterCleanup(StandardField.EDITOR, new RemoveDigitsFormatter()).cleanup(entry);
// the identifier field is a key-value field
// example: "urn:isbn:978-1-4503-5217-8; doi:10.1145/3129790.3129810; ISI:000505046100032; Scopus 2-s2.0-85037741580"
// thus, key can contain multiple ":"; sometimes value separated by " " instead of ":"
UnknownField identifierField = new UnknownField("identifier");
entry.getField(identifierField)
.stream()
.flatMap(value -> Arrays.stream(value.split("; ")))
.forEach(identifierKeyValue -> {
// check for pattern "Scopus 2-..."
String[] identifierKeyValueSplit = identifierKeyValue.split(" ");
if (identifierKeyValueSplit.length == 1) {
// check for pattern "doi:..."
identifierKeyValueSplit = identifierKeyValue.split(":");
}
int length = identifierKeyValueSplit.length;
if (length < 2) {
return;
}
// in the case "urn:isbn:", just "isbn" is used
String key = identifierKeyValueSplit[length - 2];
String value = identifierKeyValueSplit[length - 1];
Field field = FieldFactory.parseField(key);
if (!entry.hasField(field)) {
entry.setField(field, value);
}
});
entry.clearField(identifierField);
}
}
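The splitting rules described in the comments of `doPostCleanup` above can be exercised in isolation. The sketch below mirrors that logic (split entries on `"; "`, fall back from a space split to a colon split, take the last two segments as key and value); the class and method names are illustrative and not part of JabRef:

```java
import java.util.LinkedHashMap;
import java.util.Map;

public class IdentifierSplitDemo {

    // Parses an identifier string in the style returned by the fetcher, e.g.
    // "urn:isbn:978-1-4503-5217-8; doi:10.1145/3129790.3129810; Scopus 2-s2.0-85037741580"
    static Map<String, String> parse(String raw) {
        Map<String, String> result = new LinkedHashMap<>();
        for (String token : raw.split("; ")) {
            // pattern "Scopus 2-..." uses a space separator
            String[] parts = token.split(" ");
            if (parts.length == 1) {
                // pattern "doi:..." (and "urn:isbn:...") uses colons
                parts = token.split(":");
            }
            if (parts.length < 2) {
                continue;
            }
            // for "urn:isbn:...", the second-to-last segment ("isbn") becomes the key
            result.put(parts[parts.length - 2], parts[parts.length - 1]);
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(parse("urn:isbn:978-1-4503-5217-8; doi:10.1145/3129790.3129810; ISI:000505046100032; Scopus 2-s2.0-85037741580"));
    }
}
```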