Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updates to institution citation keys #7210

Merged
merged 29 commits into from
Dec 28, 2020

Conversation

k3KAW8Pnf7mkmdSMPHz27
Copy link
Member

@k3KAW8Pnf7mkmdSMPHz27 k3KAW8Pnf7mkmdSMPHz27 commented Dec 20, 2020

Fixes #6942. Fixes #7199.

TL;DR

Authors only having a last name are abbreviated. #6942 is miss-abbreviated because name parts containing uni are assumed to be universities, e.g., United Airlines. #7199 is triggers the abbreviation method because, from BibTeX´s point-of-view, the author only has a last name. The name is later removed because Java does not classify the first character as an uppercase letter, hence the heuristic assumes it is an insignificant word. E.g., {eBay} gets removed, {JabRef} doesn't.

Background

Authors that are institutions can have very lengthy citation keys unless they are abbreviated in some fashion. The abbreviations are generated in

private static String generateInstitutionKey(String content) {

which is based on a heuristic that tries to determine a suitable abbreviation based on the content of the specific author.

Issue #6942 is due to a regexp being overly broad when determining if an author is a university.

Issue #7199 was created from #6706 where I changed how a company/institute is identified. In previous versions of JabRef, an author is an institute if it is enclosed in curly brackets (e.g., {JabRef}), and I changed it to also include authors' who only have a last name (e.g., JabRef), which means that the author in issue #7199 is treated as an institution.
This would not have been an issue if it wasn't because the heuristic for creating a citation key identifies words starting with a lowercase letter in the author's name as an invalid part of an institute abbreviation. Think of the "and" in "National Aeronautics and Space Administration"). Since the author's name only consist of one "word", it is removed. (this is why the workaround @tobiasdiez mentioned does not work, even if it should)

Changes for

#6942

  1. Identifying universities in the authors' names is arbitrary. The new regexp matches ^uni(v|b|$).* instead of ^uni. There are other suggestions in Readability for citation key patterns #6706 that can be implemented instead.
  2. The inline institute abbreviation should now work again (e.g., {JabRef ({JB})} should produce JB)
  3. I am ignoring the shortauthor field, because I find it unclear how and when to use the field (e.g., for multiple authors).

#7199

  1. I'll change how institutes are identified, so that the Author must only have a last name and that last name must contain a space. This should only happen if the Author originally were enclosed in brackets. In all other cases I'd argue it is acceptable to not abbreviate the author since the name is only one word.
    1. Drawbacks to this approach is that I am using the LatexToUnicodeAdapter to remove braces, and it doesn't like unbalanced ones.

Todo

  • Log a warning for unbalanced brackets in Authors
  • Change in CHANGELOG.md described (if applicable)
  • Tests created for changes (if applicable)
  • Manually tested changed features in running JabRef (always required)
  • Screenshots added in PR description (for UI changes)
  • Checked documentation: Is the information available and up to date? If not created an issue at https://github.com/JabRef/user-documentation/issues or, even better, submitted a pull request to the documentation repository.

@k3KAW8Pnf7mkmdSMPHz27 k3KAW8Pnf7mkmdSMPHz27 changed the title [WIP] Updates to institution citation keys Updates to institution citation keys Dec 21, 2020
@k3KAW8Pnf7mkmdSMPHz27 k3KAW8Pnf7mkmdSMPHz27 marked this pull request as ready for review December 21, 2020 13:25
@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020

Except for logging when LatexToUnicodeAdapter fails (otherwise names such as "{Link{"{o}}ping University}}" will generate only Uni, due to the unbalanced brackets, and no warning, I don't intend any more code changes.

There will be some updates to the documentation, change log etc.

When generating a key from a university name it should contain at least
two parts, "university" and the university's name. If it does not it is
likely that the name contained latex that could not be resolved
correctly.
@tobiasdiez
Copy link
Member

I guess you can use parse directly? Is this ready for review?

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

@tobiasdiez except for JavaDoc and changelog, yes, it is ready for review.

Which parse?

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020

@tobiasdiez sorry, I missed to push 1 local commit.

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020

O.o
That would indeed work. Should I put this PR back in draft then? I am not sure how to go about this one, should I create a LatexToUnicodeParseAdapter that throws an exception so that the rest of the code in LatexToUnicodeAdapter is not repeated? Edit I were thinking about the Formatter classes, which LatexToUnicodeAdapter is not.

@tobiasdiez
Copy link
Member

I have to admit I had not yet looked at your PR. It was just in response to your question. What do you want to do?

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020

@tobiasdiez good then :)

In my opinion it is better if the generation warns if there is incorrect latex. But I am not sure how to do it in practice. I don't have experience with Scala <-> Java but if I am not misreading the LaTeX2Unicode#parse I am going to have to check if the result is instanceof fastparse.Success in which case we need to add fastparse as a dependency?

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

I don't know the drawbacks/advantages of that approach, but if this is the only part of the JabRef code that needs this, perhaps it is better to leave the code as-is?

@tobiasdiez
Copy link
Member

We already have farstparse as a dependency, so that's not a problem. I agree it would be nice to check for parsing errors, so I support you here (sadly only morally since I don't have any scala experience too).

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 21, 2020

fastparse is an indirect dependency from latex2unicode?
The issue I am having is that latex2unicode#parse returns a fastparse.core.Parsed instance. Unless I am missing something it means I must be able to import fastparse.core which I cannot right now.
Should I add fastparse to module-info.java?

I am missing something in the JabRef building process, I'll see if I can figure out.

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 22, 2020 via email

@Siedlerchr Siedlerchr added the status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers label Dec 26, 2020
@Siedlerchr
Copy link
Member

Is this ready for review? (I think it looks good)

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

Is this ready for review? (I think it looks good)

Yup, it is ready.

Copy link
Member

@tobiasdiez tobiasdiez left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The change look good to me. Thanks a lot!

Could you please also add the examples from the issues as test to verify that the issue doesn't reoccur. Thanks!

@tobiasdiez tobiasdiez added the status: changes required Pull requests that are not yet complete label Dec 27, 2020
@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

k3KAW8Pnf7mkmdSMPHz27 commented Dec 27, 2020

@tobiasdiez good catch, apparently I did not examine my assumptions enough. My concerns regarding including it is,

1.

Based on #6706 (comment) and others,

(i) when using in the bibtex key generator, the non-ascii characters are converted to a "good" representation (as in your example: Göd -> Goe)
...
, because the bibtex tooling is not fully UTF8 (biblatex+biber is, but not the more commonly used bibtex tool).

my thoughts were that, to the largest extent possible, BibTeX citation keys should be ASCII/ANSI only, in which case #7199 should not be added as a test case. If {eBay} is successful, it is likely that #7199 will be as well (from BibTeX/Java´s point-of-view both contain only a last name that starts with a lowercase letter, and it was the institute abbreviation method that removed 杨秀群).

2.

I am not convinced that the currently generated citation key is correct, but you seem to know more about this particular issue than I, so feel free to correct me 😛
I based my decision on @xiaodongcentury's comment

"杨 秀群", but not "杨秀群", the former has a space between last name and first name of a Chinese name.

In my opinion, the "correct" citation key should be either 杨 or 秀群 (i.e., based on the documentation, the citation key shoul be the last name). With eBay I know the citation key is correct.


However, all of that being said (and my apologies about the rather lengthy reading), if this is how JabRef is used, perhaps the test case should be added anyway?

@tobiasdiez
Copy link
Member

As we have quite a lot of Chinese users, it would be good to support this as far as possible. Thus, if the user enters Chinese authors, then I would say we should generate a key with chinese characters. In the best case, one would actually replace the Chinese character by their romanization, but I'm not sure if there exist a convenient library for this.

For now, I would say we simply add 杨秀群 as a test (with whatever result it currently produces).

@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

For now, I would say we simply add 杨秀群 as a test (with whatever result it currently produces).

Sure!

In the best case, one would actually replace the Chinese character by their romanization, but I'm not sure if there exist a convenient library for this.

pinyin4j seem to be the most popular Java alternative. However, given that there isn't a need of high performance (clean up operation?), there are more popular javascript libraries. I don't know if there is an interest for this?

@tobiasdiez
Copy link
Member

Thanks, also for the research concerning pinyin libraries. Let's see if users report this as a feature request.

@tobiasdiez tobiasdiez merged commit 78b08b5 into JabRef:master Dec 28, 2020
@k3KAW8Pnf7mkmdSMPHz27
Copy link
Member Author

@Siedlerchr @tobiasdiez thank you both for your time, reviews and suggestions, they are appreciated!

@xiaodongcentury
Copy link

Thank you ALL!
The JabRef 5.3--2020-12-28--78b08b5 works perfectly with Chinese names.
Thanks again.

Siedlerchr added a commit that referenced this pull request Dec 29, 2020
…dtask

* upstream/master:
  Improved detection of long DOI's within text (#7260)
  Add missing author and fix name
  Fix style of highlighted checkboxes while searching in preferences (#7258)
  Updates to institution citation keys (#7210)
Siedlerchr added a commit that referenced this pull request Dec 30, 2020
* upstream/master: (201 commits)
  Only disable  move to file dir when path equals (#7269)
  Improved detection of long DOI's within text (#7260)
  Add missing author and fix name
  Fix style of highlighted checkboxes while searching in preferences (#7258)
  Updates to institution citation keys (#7210)
  Bump xmlunit-core from 2.8.1 to 2.8.2 (#7251)
  Bump classgraph from 4.8.97 to 4.8.98 (#7250)
  Bump bcprov-jdk15on from 1.67 to 1.68 (#7249)
  Bump xmlunit-matchers from 2.8.1 to 2.8.2 (#7252)
  Bump unirest-java from 3.11.06 to 3.11.09 (#7254)
  Bump org.beryx.jlink from 2.23.0 to 2.23.1 (#7253)
  Bump pascalgn/automerge-action from v0.12.0 to v0.13.0 (#7255)
  Added a check to integrate with the flatpak package (#7248)
  New translations JabRef_en.properties (Chinese Traditional) (#7247)
  Update code-howtos.md
  GitBook: [master] 5 pages and 25 assets modified
  New Crowdin updates (#7246)
  add language mapping for chinese
  remove chinese content
  fix hamcrest link
  ...

# Conflicts:
#	src/main/java/org/jabref/gui/JabRefFrame.java
#	src/main/java/org/jabref/gui/dialogs/AutosaveUiManager.java
#	src/main/java/org/jabref/gui/exporter/SaveAction.java
#	src/main/java/org/jabref/gui/exporter/SaveDatabaseAction.java
#	src/main/java/org/jabref/logic/autosaveandbackup/BackupManager.java
#	src/test/java/org/jabref/gui/exporter/SaveDatabaseActionTest.java
Siedlerchr added a commit that referenced this pull request Jan 1, 2021
* updateGradle7: (39 commits)
  try upgrading gradle
  Only disable  move to file dir when path equals (#7269)
  Improved detection of long DOI's within text (#7260)
  Add missing author and fix name
  Fix style of highlighted checkboxes while searching in preferences (#7258)
  Updates to institution citation keys (#7210)
  Bump xmlunit-core from 2.8.1 to 2.8.2 (#7251)
  Bump classgraph from 4.8.97 to 4.8.98 (#7250)
  Bump bcprov-jdk15on from 1.67 to 1.68 (#7249)
  Bump xmlunit-matchers from 2.8.1 to 2.8.2 (#7252)
  Bump unirest-java from 3.11.06 to 3.11.09 (#7254)
  Bump org.beryx.jlink from 2.23.0 to 2.23.1 (#7253)
  Bump pascalgn/automerge-action from v0.12.0 to v0.13.0 (#7255)
  Added a check to integrate with the flatpak package (#7248)
  New translations JabRef_en.properties (Chinese Traditional) (#7247)
  Update code-howtos.md
  GitBook: [master] 5 pages and 25 assets modified
  New Crowdin updates (#7246)
  add language mapping for chinese
  remove chinese content
  ...
Siedlerchr added a commit that referenced this pull request Jan 3, 2021
* upstream/master:
  fix checsktyle
  Fix for application dialogs opening in wrong displays (#7273)
  Only disable  move to file dir when path equals (#7269)
  Improved detection of long DOI's within text (#7260)
  Add missing author and fix name
  Fix style of highlighted checkboxes while searching in preferences (#7258)
  Updates to institution citation keys (#7210)
  Bump xmlunit-core from 2.8.1 to 2.8.2 (#7251)
  Bump classgraph from 4.8.97 to 4.8.98 (#7250)
  Bump bcprov-jdk15on from 1.67 to 1.68 (#7249)
  Bump xmlunit-matchers from 2.8.1 to 2.8.2 (#7252)
  Bump unirest-java from 3.11.06 to 3.11.09 (#7254)
  Bump org.beryx.jlink from 2.23.0 to 2.23.1 (#7253)
  Bump pascalgn/automerge-action from v0.12.0 to v0.13.0 (#7255)

# Conflicts:
#	src/main/java/org/jabref/gui/openoffice/OpenOfficePanel.java
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
status: changes required Pull requests that are not yet complete status: ready-for-review Pull Requests that are ready to be reviewed by the maintainers
Projects
None yet
4 participants