Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove dash from illegal characters. #6300

Merged
merged 32 commits into from
May 6, 2020

Conversation

dextep
Copy link
Contributor

@dextep dextep commented Apr 16, 2020

Fixes #6295, Fixes #6257, Fixes #6252.

  • Change in CHANGELOG.md described (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the contribution. I walked through the linked issues and updated your code accordingly. A question still remains.

Could you also please add a CHANGELOG.md entry (as indicated in the checklist) and create a pull request for https://github.com/JabRef/user-documentation/blob/master/en/fields/README.md (as asked in the checklist)

@@ -27,7 +27,7 @@
*/
public static final String APPENDIX_CHARACTERS = "abcdefghijklmnopqrstuvwxyz";
private static final Logger LOGGER = LoggerFactory.getLogger(BibtexKeyGenerator.class);
private static final String KEY_ILLEGAL_CHARACTERS = "{}(),\\\"-#~^:'`ʹ";
private static final String KEY_ILLEGAL_CHARACTERS = "{}(),\\\"#~^:'`ʹ";
private static final String KEY_UNWANTED_CHARACTERS = "{}(),\\\"-";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why didn't you remove the - it here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

According to the recommendation in the README, I thought it was a mistake to include a dash in illegal signs, not those not recommended.

Some characters should not be used in bibtexkey as they are not compatible or not recommended: { } ( ) , \ " - # ~ ^ : '

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@koppor so should i remove dash from unwanted characters and make update on READEME file?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Remove it from illegal, but keep it in unwanted.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I checked the usage of KEY_UNWANTED_CHARACTERS.

The code comment reads as follows:

// User doesn't want us to enforce legal characters. We must still look
// for whitespace and some characters such as commas, since these would
// interfere with parsing:

Thus, the KEY_UNWANTED_CHARACTERS clearly is a subset of KEY_ILLEGAL_CHARACTERS. Not sure, whether KEY_ILLEGAL_CHARACTERS makes sense at all. Maybe because of tool interopability.

I see three options:

A) Remove KEY_UNWANTED_CHARACTERS completely - and also the preference for it.
B) Set KEY_UNWANTED_CHARACTERS to a minimum - as indicated in the comment. I would propose {}" since these are the characters causing issues at parsing. I would not add space since this did not cause any issues until now (at list that I am aware of)
C) Just remove - from KEY_UNWANTED_CHARACTERS

Please do option B). Please rename KEY_UNWANTED_CHARACTERS to KEY_CHARACTERS_CAUSING_PARSING_ERRORS. Please add a comment above that this is a subset of KEY_ILLEGAL_CHARACTERS.

@koppor
Copy link
Member

koppor commented Apr 16, 2020

Please also think of the failing unit tests:

grafik

Why was this unit test created? Why is - removed from the key? What impact does the change have on the fetchers? Do we need to add more tests there?

Thank you for digging deeper into this!

@dextep
Copy link
Contributor Author

dextep commented Apr 16, 2020

Thanks @koppor for your tips and attention i will do it asap.

@dextep
Copy link
Contributor Author

dextep commented Apr 17, 2020

Please also think of the failing unit tests:

grafik

Why was this unit test created? Why is - removed from the key? What impact does the change have on the fetchers? Do we need to add more tests there?

Thank you for digging deeper into this!

Ok, so unit test was created to correct operation of individual functions. We removed dash from the key because seems to be not illegal but still unwanted. I also check unit tests from BibtexKeyGeneratorTest and I think we have to fix existing test to accept changes with -.

Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In my opinion, the test is correct. I would remove all illegal and all unwanted characters from an automatically generated key.

CHANGELOG.md Outdated Show resolved Hide resolved
CHANGELOG.md Outdated Show resolved Hide resolved
@koppor
Copy link
Member

koppor commented Apr 17, 2020

@tobiasdiez The test needs to be adapted as the semantics of KEY_UNWANTED_CHARACTERS is KEY_CHARACTERS_CAUSING_PARSING_ERRORS. Since - does not cause parsing errors, it needs to be kept.

If we want to have "beautiful" bibtex keys, we need to add a cleanup formatter removing - at the beginning and end of a bibtex key.

@Siedlerchr
Copy link
Member

Please note that we have preferences for allowing the unwanted characters there is a checkbox in the preferences

@koppor
Copy link
Member

koppor commented Apr 17, 2020

@Siedlerchr This is what I meant with "prefernces" in option A:

grafik

(Copied from #6300 (comment))

*/
public static final String APPENDIX_CHARACTERS = "abcdefghijklmnopqrstuvwxyz";
private static final Logger LOGGER = LoggerFactory.getLogger(BibtexKeyGenerator.class);
private static final String KEY_ILLEGAL_CHARACTERS = "{}(),\\\"-#~^:'`ʹ";
private static final String KEY_UNWANTED_CHARACTERS = "{}(),\\\"-";
private static final String KEY_ILLEGAL_CHARACTERS = "{}(),=\\\"#%~'`ʹ";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please add a comment here referencing https://tex.stackexchange.com/a/408548/9075?

Proposal:

// Source of illegal characters: https://tex.stackexchange.com/a/408548/9075

private static final String KEY_ILLEGAL_CHARACTERS = "{}(),\\\"-#~^:'`ʹ";
private static final String KEY_UNWANTED_CHARACTERS = "{}(),\\\"-";
private static final String KEY_ILLEGAL_CHARACTERS = "{}(),=\\\"#%~'`ʹ";
private static final String KEY_CHARACTERS_CAUSING_PARSING_ERRORS = "{}(),\\\"-";
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please remove - here, too. Otherwise, the subset property does not hold.

@tobiasdiez
Copy link
Member

tobiasdiez commented Apr 18, 2020

So apparently there are three classes of characters that we don't want to accept:

Right now we have the option to specify the Disallowed characters and toggle if they are included or not. But there is no option to specify Unwanted characters.

Proposal:

  • Always warn and remove all Illegal, Disallowed and Unwanted characters.
  • Illegal and Disallowed are hard-coded and not changeable
  • Unwanted can be user-specified and defaults to - (plus more characters?)

Alternatively:

  • Illegal hard-coded and not changeable
  • Disallowed and Unwanted are merged to one list that can be user-specified

This would be slightly more flexible but I couldn't come up with a use case where one really needs a key which contains Disallowed characters. I think the option to disable the "enforce legal characters" came from a time where illegal keys were complained about with a dialog upon loading the library.

@dextep @mlep @melTr @JoKalliauer @JabRef/developers what do you think?

@calixtus
Copy link
Member

I'm a big fan of the second proposal. But we should point to a help page or introduce a tooltip describing the difference between disallowed and unwanted characters. Maybe also a warning validation for disallowed characters.

Let the user decide, if he want's to break bibtex.

@JoKalliauer
Copy link
Contributor

JoKalliauer commented Apr 18, 2020

I would suggest

Edit: I removed ' from the unwanted, since it is already a disalowed/illegal character ( #6300 (comment) )
Edit: I added +

@koppor
Copy link
Member

koppor commented Apr 20, 2020

@JoKalliauer So you propose to merge illegal and disallowed (into disallowed). - And introduce a new configurable "unwanted". - Sounds good for me.

I would also put ' in the list of illegal characters - as it is illegal in BibLaTeX (See https://tex.stackexchange.com/questions/408530/what-characters-are-allowed-to-use-as-delimiters-for-bibtex-keys/408548#408548)

I would NOT make any difference between BibTeX and BibLaTeX here. Just one merged set of illegal characters for BibTeX and BibLaTeX.

@koppor
Copy link
Member

koppor commented Apr 20, 2020

I see that with my proposal keys such as "Dueck_(1998)" won't be possible any more. On the one hand, when used in the context of Word, this could cause WTF at the user's side. On the other hand, this would simplify our code.

@tobiasdiez
Copy link
Member

I also really like the proposal #6300 (comment) from @JoKalliauer.

@dextep could you please go ahead and implement it like that. The last point concerning the import, we can ignore for the moment I think. Sorry for the extended discussion concerning this issue. It appeared to be a clearly-defined issue which was good for a first contribution to JabRef, but then surprisingly got more complex. Do you have any questions concerning the implementation?

@JoKalliauer
Copy link
Contributor

Stupid question: What is the difference between illegal and disallowed character? Is there any specification? Do you have to try it with BibTeX,/BibLaTeX/Biber and check which ones (don't) work?

@dextep
Copy link
Contributor Author

dextep commented Apr 21, 2020

I'm sorry, but I've been a bit busy lately. I have read your comments and propsial seems to be the best solution. @tobiasdiez no problem i'm glad to be able to participate :) I tried to summarize the changes that should be made:

  • Always warn and remove all Illegal, Disallowed characters.

No code changes required

  • Illegal and Disallowed are hard-coded and not changeable

Edit: Merge illegal KEY_ILLEGAL_CHARACTERS and disallowed KEY_CHARACTERS_CAUSING_PARSING_ERRORS into KEY_DISALLOWED_CHARACTERS = "{}(),=\\\"#%~'`ʹ"

  • Unwanted can be user-specified and defaults to - and ` and ʹ and : and ! and ; and ? and ^ and +

Edit: New KEY_UNWANTED_CHARACTERS = "- ` ʹ : ! ; ? ^ +"

No code changes required

Add unwanred to reported tooltip.

Please correct my summary of the changes :)

@JoKalliauer
Copy link
Contributor

JoKalliauer commented Apr 21, 2020

Merge illegal KEY_ILLEGAL_CHARACTERS and disallowed KEY_CHARACTERS_CAUSING_PARSING_ERRORS into KEY_DISALLOWED_CHARACTERS = "{}(),=\\\"#%~'`ʹ";

@dextep According to https://tex.stackexchange.com/a/408548/110064:
I only know KEY_DISALLOWED_CHARACTERS = "{}(),=\\\"#%~'

  • no ` ʹ ;
  • \" and " is the same?

New KEY_UNWANTED_CHARACTERS = "- ʹ : ! ; ? ^ +"`

@dextep I assume you meat - ` ʹ : ! ; ? ^ + ( ` confuses the Syntax )


PS: unwated unicode-characters can be increased to infinity ∞ < > $ § ° . | @ * / & [ ] α ¡¢£¤¥¦¨©ª«¬­®¯°±²³´µ¶•¸¹º»¼½¾¿ ‰ ,...

PS2: Are $ and & and @ allowed characters?, or did just nobody test them?

@Siedlerchr
Copy link
Member

Siedlerchr commented Apr 21, 2020

The at symbol indicates the start of the bibtex key. Dollar sign may be okay.
Ampersand should be okay as well.
we should check the biblatex spec and biber

@dextep
Copy link
Contributor Author

dextep commented May 4, 2020

I would like to thank you very much @koppor and @tobiasdiez, it will definitely help me avoid such mistakes in the future ❗

@koppor
Copy link
Member

koppor commented May 4, 2020

@dextep I would suggest you also look into https://lostechies.com/joshuaflanagan/2010/09/03/use-gitk-to-understand-git/ and patiently read it and try to understand it paragraph-by-paragraph. After going through, you'll always have gitk --all& started and be very confident what's happening in your repository.

Fix unit tests.
@dextep dextep marked this pull request as ready for review May 5, 2020 01:50
Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cool. This looks very good now. Just a bit of final nitpicking from my side (mostly concerning the tests). Other than that, it's good to go from my side!

CHANGELOG.md Show resolved Hide resolved
CHANGELOG.md Show resolved Hide resolved
src/main/resources/journals/journalList.csv Outdated Show resolved Hide resolved
@tobiasdiez tobiasdiez added status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers status: changes required Pull requests that are not yet complete labels May 5, 2020
@dextep dextep requested a review from tobiasdiez May 5, 2020 19:53
Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice! LGTM

@tobiasdiez tobiasdiez removed the status: changes required Pull requests that are not yet complete label May 5, 2020
@dextep dextep requested a review from koppor May 5, 2020 21:29
Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A huge step forward. Thank you for the update.

Two nitpicks are remaining. See below.

CHANGELOG.md Show resolved Hide resolved
Comment on lines 81 to 92
List<Character> unwantedChars = new ArrayList<>();
for (char ch : unwantedCharacters.toCharArray()) {
unwantedChars.add(ch);
}

StringBuilder newKey = new StringBuilder();
for (int i = 0; i < key.length(); i++) {
char c = key.charAt(i);
if (KEY_ILLEGAL_CHARACTERS.indexOf(c) == -1) {
if (!DISALLOWED_CHARACTERS.contains(c) && !unwantedChars.contains(c)) {
newKey.append(c);
}
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please replace the code:

        String newKey = key.chars()
                .filter(c -> unwantedCharacters.indexOf(c) == -1)
                .filter(c -> !DISALLOWED_CHARACTERS.contains((char) c))
                .collect(StringBuilder::new,
                        StringBuilder::appendCodePoint, StringBuilder::append)
                .toString();

I came up while thinking as follows (and stack-overflow-ing for String filter characters, ...). Used answer: https://stackoverflow.com/a/20268845/873282

Why not a simple indexOf thing?

"abcdef".indexOf('c') != -1;

See https://stackoverflow.com/a/506107/873282.

Second reason: Conversion into the list is IMHO more CPU consuming than the indexOf call. I know, that it is not critical, but I think, the conversion (plus the readble !unwantedChars.contains(c)) is less readable then a simple (unwantedChars.indexOf(c) >= 0)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I didn't think of it that way, thank you for your review @koppor 👍!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You're welcome. Plese note that I used multiline comments:

grafik

Thus, more code can be deleted. See also my new review comment.

Sorry for being picky here. I hope with these tough comments here we won't need to touch this code the next months again 🎈

Copy link
Member

@koppor koppor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

More old code can be deleted 🎉

@@ -82,17 +82,15 @@ public static String removeUnwantedCharacters(String key, String unwantedCharact
for (char ch : unwantedCharacters.toCharArray()) {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need the unwantedChars variable, do we? Can you please delete that for loop and the variable? --> the code on line 86 directly accesses the method parameter unwantedChararacters!

Side note: This is an example why similar variable names can be mixed up.

Copy link
Contributor Author

@dextep dextep May 6, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course, we don't need it, I'm sorry. Every time I get the impression that I didn't notice something. 🐞

@koppor koppor merged commit 2af41eb into JabRef:master May 6, 2020
@koppor koppor removed the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label May 6, 2020
@koppor
Copy link
Member

koppor commented May 6, 2020

Yeah. Side comment: It should have been "Fixes #... Fixes #... Fixes #..." not just "Fix" once. See https://help.github.com/en/enterprise/2.16/user/github/managing-your-work-on-github/closing-issues-using-keywords#closing-multiple-issues for details.

@dextep
Copy link
Contributor Author

dextep commented May 6, 2020

And anyway, thanks you guys for your patience and tips, I'm learn a lot from you! 🎉

@Siedlerchr
Copy link
Member

@dextep Thank you again for your contribution and your enduring patience! ;) Although sometimes it feels nasty, having to fix all those little code issues, but in the end it contributes to a good code standard which helps others and maybe your "future you" to more easily understand it.

Siedlerchr added a commit that referenced this pull request May 8, 2020
…ptyLib

* upstream/master: (23 commits)
  Make wrap fields also visible in entry editor (#6315)
  Add launcher to fix extension import in snap (#6439)
  add concise message when SaveException happen (#6444)
  Squashed 'src/main/resources/csl-locales/' changes from 79845b087b..cbb45961b8
  Squashed 'src/main/resources/csl-styles/' changes from 906cd6d..270cd32
  Addition to the early pull #6438 (#6441)
  Fix the bug #6421 (#6438)
  Cleanup dead code (#6436)
  Fixed entry duplication on file download (#6437)
  Add workaround for buildSrc issue (#6433)
  Remove dash from illegal characters. (#6300)
  Docs: For developers: New architectural decision added to list (#6397)
  Added a download checkbox to the import dialog (#6381)
  Bump jaxb-xjc from 2.3.2 to 2.3.3 (#6410)
  Bump flexmark-ext-gfm-strikethrough from 0.61.20 to 0.61.24 (#6413)
  Bump byte-buddy-parent from 1.10.9 to 1.10.10 (#6408)
  Bump flexmark-ext-gfm-tasklist from 0.61.20 to 0.61.24 (#6412)
  Bump org.beryx.jlink from 2.17.8 to 2.18.0 (#6411)
  Bump tika-core from 1.24 to 1.24.1 (#6409)
  Bump flexmark from 0.61.20 to 0.61.24 (#6416)
  ...
Siedlerchr added a commit that referenced this pull request May 9, 2020
* upstream/master:
  Addition to the early pull #6438 (#6441)
  Fix the bug #6421 (#6438)
  Cleanup dead code (#6436)
  Fixed entry duplication on file download (#6437)
  Add workaround for buildSrc issue (#6433)
  Remove dash from illegal characters. (#6300)
  Docs: For developers: New architectural decision added to list (#6397)
koppor pushed a commit that referenced this pull request Jan 1, 2023
43566f2 Update molecular-oncology.csl (#6354)
38f2b5f Update san-francisco-estuary-and-watershed-science.csl (#6350)
724cb12 Update australasian-journal-of-philosophy.csl (#6344)
df6af86 Fix webpage in-text citation for council-of-science-editors-author-date.csl (#6318)
6900b58 Add month to magazine for Bluebook
e0a8148 Update cambridge-university-press-author-date-cambridge-a.csl (#6345)
0823448 Add space & comma before pp (#6343)
ff38cd2 Update american-journal-of-respiratory-and-critical-care-medicine.csl (#6341)
8727dfb Update early-music-history.csl (#6323)
1154354 Update boletin-de-pediatria.csl (#6310)
f25438e Update pravnik.csl (#6309)
db7a4ae Update zoological-journal-of-the-linnean-society.csl (#6304)
6c043a7 Create polygraphia.csl (#6307)
255e00c Create intellect-newgen-books.csl (#6308)
323629f Update historical-materialism.csl (#6300)
f62b70d Create european-review-of-international-studies.csl (#6301)
dff2698 Update ucl-university-college-harvard.csl (#6298)
0ce09c9 Update sciences-po-ecole-doctorale-note-french.csl (#6290)
bdd53ec Update sciences-po-ecole-doctorale-author-date.csl (#6291)
efde4d4 Create journal-of-law-medicine-ethics.csl (#6296)
7539b2c Create theses-de-sorbonne-universite.csl (#6295)
905f25a Update biochemical-society-transactions.csl (#6292)
a76a3f5 Update smithsonian-institution-scholarly-press author-date and note (#6294)
e6b6c6c Create exploration-of-targeted-anti-tumor-therapy.csl (#6276)
024c9c8 Create nys-nydanske-studier.csl (#6331)
d9ac8e1 Update vox-sanguinis.csl (#6327)
9a98e92 Remove DOI from Genetics & Molecular Biology

git-subtree-dir: buildres/csl/csl-styles
git-subtree-split: 43566f2
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
7 participants