-
-
Notifications
You must be signed in to change notification settings - Fork 2.6k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Check duplicate DOI #6333
Check duplicate DOI #6333
Conversation
@Override | ||
public List<IntegrityMessage> check(BibEntry entry) { | ||
if (errors == null) { | ||
errors = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why not directly initialize that HashMap in the field declaration?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
MAybe you did not see that the later code uses the BibtexDatabase from the constructor. We decided to put the code into check - and not into the constructor - to have a lean constructor and not have high CPU usage when initializing the checker, but at the first time an entry is checked.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This seems like premature optimization for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you caching the errors anyway? Also in case you are worrying about performance, why do you choose to implement a solution that has O(n^2) (with n = entries). Why not implement a O(n) solution (e.g. https://stackoverflow.com/a/31341963/873661) which you call in checkDatabase
below?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
See in line 30 that the method is called once per entry. Is is by design of the Checker
interface that there is method for checking each entry. There is no method for checking a complete database. There is also no interface for a complete database.
At the first call, the result map is filled. At all subsequent calls do not fill the map and reuse it.
We are O(n+m) where n is the size of the database and m is the number of entries with DOIs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the code more reable, I moved the call to the error check to the constructor. See fd7621d
BiMap<DOI, List<BibEntry>> duplicateMap = HashBiMap.create(bibEntries.size()); | ||
for (BibEntry bibEntry : bibEntries) { | ||
bibEntry.getDOI().ifPresent(doi -> | ||
duplicateMap.computeIfAbsent(doi, x -> new ArrayList<>()).add(bibEntry)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is doi
, not entry
. However, doi
cannot be used as the variable name is already bound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@koppor Please have a close look. The second argument of duplicateMap.computeIfAbsent refers to the value and this clearly is of type List. So then maybe name it entries
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Siedlerchr Please explain the JavaDoc to me. The Map is FROM doi TO list. https://www.geeksforgeeks.org/hashmap-computeifabsent-method-in-java-with-examples/ explains the computeIfAbsent
.
To me, it reads, that the key is passed to Function<? super K, ? extends V> remappingFunction
. The key is clearly (!) a DOI not a list. -- Am I wrong?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To make the code more reable, I called the variable absentDoi
. See 6f9b148
BiMap<DOI, List<BibEntry>> duplicateMap = HashBiMap.create(bibEntries.size()); | ||
for (BibEntry bibEntry : bibEntries) { | ||
bibEntry.getDOI().ifPresent(doi -> | ||
duplicateMap.computeIfAbsent(doi, x -> new ArrayList<>()).add(bibEntry)); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is doi
, not entry
. However, doi
cannot be used as the variable name is already bound.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That seems way more complicated than it need to be: https://stackoverflow.com/a/31341963/873661
} | ||
|
||
public List<IntegrityMessage> checkDatabase() { | ||
private void initCheckers(BibDatabaseContext bibDatabaseContext, BibtexKeyPatternPreferences bibtexKeyPatternPreferences, JournalAbbreviationRepository journalAbbreviationRepository) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To be honest, I don't see any advantage of this refactoring. Just makes the code more complex and harder to maintain in my opinion.
If you really feel like the original code needs a refactoring, then you create a list of all checkers that need to be run a) always, b) bibtex c)biblatex (still I would create these lists only in the checkdatabase method). This would get ride of the repeated check
methods below. But to be honest, I don't think this yields a more readable code either.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We initialize the checkers once per database check - not once per entry check. In case of a 20k database, approx 100k less java objects in memory per quality check.
We ensured that each checker is stateless - thus, it can be reused for acroess entries.
I agree with your proposal though.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With your proposal, the code is much more reable. 🎉 03dd2a0
@Override | ||
public List<IntegrityMessage> check(BibEntry entry) { | ||
if (errors == null) { | ||
errors = new HashMap<>(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why are you caching the errors anyway? Also in case you are worrying about performance, why do you choose to implement a solution that has O(n^2) (with n = entries). Why not implement a O(n) solution (e.g. https://stackoverflow.com/a/31341963/873661) which you call in checkDatabase
below?
I dont get in the stack overlfow thing, how they check
Think, I have to implement a comparator comparing two bibentries by doi only. At the DOI comparison, I need to make a DOI object. I could cache that (which is unwanted) or I can reconstruct it every time (which is slower than oru approach above -- there, we construct the DOI object once per bib entry) We need to construct the DOI object to be able to compare two DOIs. We decided that |
Co-authored-by: Oliver Kopp <[email protected]>
Co-authored-by: Oliver Kopp <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very nice refactoring! Looks good to me, just a bit of nitpicking.
import org.jabref.model.entry.BibEntry; | ||
|
||
@FunctionalInterface | ||
public interface Checker { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rename to EntryChecker?
return duplicateMap.inverse().keySet().stream() | ||
.filter(list -> list.size() > 1) | ||
.flatMap(list -> list.stream()) | ||
.map(item -> new IntegrityMessage(Localization.lang("Unique DOI used in multiple entries"), item, StandardField.DOI)) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think "The same DOI is used..." is better
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just "Same DOI..." will do.
} | ||
|
||
public List<IntegrityMessage> checkDatabase() { | ||
List<IntegrityMessage> executeAllCheckers() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Name it simply check
?
# Conflicts: # src/main/java/org/jabref/logic/integrity/IntegrityCheck.java # src/test/java/org/jabref/logic/integrity/IntegrityCheckTest.java
…into check-duplicate-doi
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you
* upstream/master: (50 commits) Keep group pane size when resizing window (#6180) (#6423) Changelog: Fix missing citation for biblatex-mla Update AUTHORS Check duplicate DOI (#6333) Fix missing citation for biblatex-mla Change EasyBind dependency (#6480) Add testing of latest dev version as mandatory Squashed 'src/main/resources/csl-styles/' changes from 5dad23d..586e0b8 Fix libre office connection and other progress dialogs (#6478) Fix clear year and month field when converting to biblatex (#6434) Add truncate as a BibTex key modifier (#6427) Add new authors (not all - they need more work) Remove empty line Add simple Unit Tests for #6207 (#6240) Enforce LeftCurly rule (#6452) Implement task progress indicator (and dialog) in the toolbar (#6443) Consider empty brackets Changelog update Added a test Fixed brackets in regular expressions ...
# By dependabot-preview[bot] (18) and others # Via GitHub (17) and others * upstream/master: (77 commits) Reenable caching of gradle Refactor BibtexKeyPatternPreferences (#6489) Update CHANGELOG.md Add changelog entry and remove unnecessary code EasyBind revision part two Fix Drag and Drop on empty database Truncates DOIs and URLs in the column "Linked identifiers" in main table, if too long (#6498) Bump flexmark-ext-gfm-tasklist from 0.61.26 to 0.61.30 Bump flexmark from 0.61.26 to 0.61.30 Bump xmlunit-matchers from 2.6.4 to 2.7.0 Bump java-string-similarity from 1.2.1 to 2.0.0 Bump flexmark-ext-gfm-strikethrough from 0.61.26 to 0.61.30 Bump xmlunit-core from 2.6.4 to 2.7.0 Truncates the link and/or the link description in the column "linked files" in main table, if too long (#6179) Keep group pane size when resizing window (#6180) (#6423) Changelog: Fix missing citation for biblatex-mla Update AUTHORS Check duplicate DOI (#6333) Fix missing citation for biblatex-mla Change EasyBind dependency (#6480) ... # Conflicts: # src/main/java/org/jabref/gui/actions/ActionHelper.java # src/main/java/org/jabref/gui/customentrytypes/CustomEntryTypeDialogViewModel.java # src/main/java/org/jabref/gui/customentrytypes/CustomizeEntryTypeDialogView.java # src/main/java/org/jabref/model/entry/field/FieldFactory.java
f6c778e Update emerald-harvard.csl (#6335) d6c6a16 Fix Brazilian quotes on chicago-author-date.csl (#6317) a1549b6 Update medizinische-hochschule-hannover.csl (#6330) da88073 Update journal-of-the-american-college-of-cardiology.csl (#6334) a520d8e Bump nokogiri from 1.13.9 to 1.13.10 (#6333) ba54b44 Update royal-society-of-chemistry.csl (#6328) 1378ba7 LUSEM: Remove full stop (#6332) 9e3cf89 Create interpreting.csl (#6254) bef74ed Create conservation-science-and-practice.csl (#6258) 9fb7eb7 Bug fixes triangle.csl (#6251) e6112ba Update ucl-university-college-apa.csl (#6250) 6dcba3a Update engineering-technology-and-applied-science-research.csl (#6247) 00fe4a2 Create constructivist-foundations.csl (#6243) 03ad71b Create les-mondes-du-travail.csl (#6234) a2bce86 Corrections based on author instructions (#6306) git-subtree-dir: buildres/csl/csl-styles git-subtree-split: f6c778e
84dba23 Update international-union-of-crystallography.csl (#6279) 13dd9e8 Update zeitschrift-fur-deutsche-philologie.csl (#6340) d95b652 Create cahiers-mondes-anciens.csl (#6203) ded567c Create rassegna-degli-archivi-di-stato-bibliografia-generale.csl (#6275) 124777a Create scientific-online-letters-on-the-atmosphere.csl (#6261) 3c276e7 Create american-medical-association-no-url-alphabetical.csl (#6252) 595ad95 Bump mathieudutour/github-tag-action from 6.0 to 6.1 (#6287) 7008128 Create cambridge-a (#6336) 17e930c Update norsk-apa-manual-note.csl (#6338) b360859 Update norsk-apa-manual.csl (#6337) f6c778e Update emerald-harvard.csl (#6335) d6c6a16 Fix Brazilian quotes on chicago-author-date.csl (#6317) a1549b6 Update medizinische-hochschule-hannover.csl (#6330) da88073 Update journal-of-the-american-college-of-cardiology.csl (#6334) a520d8e Bump nokogiri from 1.13.9 to 1.13.10 (#6333) ba54b44 Update royal-society-of-chemistry.csl (#6328) 1378ba7 LUSEM: Remove full stop (#6332) 9e3cf89 Create interpreting.csl (#6254) bef74ed Create conservation-science-and-practice.csl (#6258) 9fb7eb7 Bug fixes triangle.csl (#6251) e6112ba Update ucl-university-college-apa.csl (#6250) 6dcba3a Update engineering-technology-and-applied-science-research.csl (#6247) 00fe4a2 Create constructivist-foundations.csl (#6243) 03ad71b Create les-mondes-du-travail.csl (#6234) a2bce86 Corrections based on author instructions (#6306) git-subtree-dir: buildres/csl/csl-styles git-subtree-split: 84dba23
Result of #latemob session with @skufer322.
Fixes JabRef#339.