Add quality check and cleanup for problematic unicode characters #10506

Closed
tobiasdiez opened this issue Oct 16, 2023 · 16 comments
Assignees
Labels
cleanup-ops FirstTimeCodeContribution Triggers GitHub Greeter Workflow good first issue An issue intended for project-newcomers. Varies in difficulty. integrity-checker type: feature

Comments

@tobiasdiez
Member

tobiasdiez commented Oct 16, 2023

Is your suggestion for improvement related to a problem? Please describe.

Some Unicode characters cause problems even with biblatex support (e.g., pdflatex still does not fully support Unicode). For example, Garcı́a gives

Package inputenc Error: Unicode character ́ (U+0301)

A few of such problematic characters are:

Describe the solution you'd like

As these characters are hard to recognize, it would be nice to have an integrity check that warns about them, and an automatic cleanup that converts them to their unproblematic equivalents (e.g. U+0131 + U+0301 to U+00ED).
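
Because the combining characters are hard to spot by eye, a check first has to locate them. The following is a minimal sketch of such a scan, assuming plain Java without any JabRef classes; the class name `CombiningMarkFinder` and its method are hypothetical and only illustrate the idea.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

// Hypothetical helper, not part of JabRef: lists combining marks in a field value.
final class CombiningMarkFinder {

    /** Returns one entry such as "U+0301 at index 5" per combining mark found in the text. */
    static List<String> findCombiningMarks(String text) {
        List<String> hits = new ArrayList<>();
        int index = 0;
        while (index < text.length()) {
            int codePoint = text.codePointAt(index);
            int type = Character.getType(codePoint);
            // Unicode general categories Mn, Mc and Me cover combining marks
            if (type == Character.NON_SPACING_MARK
                    || type == Character.COMBINING_SPACING_MARK
                    || type == Character.ENCLOSING_MARK) {
                hits.add(String.format(Locale.ROOT, "U+%04X at index %d", codePoint, index));
            }
            index += Character.charCount(codePoint);
        }
        return hits;
    }

    public static void main(String[] args) {
        // The author name from the example: dotless i (U+0131) followed by a combining acute (U+0301)
        System.out.println(findCombiningMarks("Garc\u0131\u0301a")); // prints [U+0301 at index 5]
    }
}
```

Reporting the code point and position makes it possible to tell the user which character in which field triggers the pdflatex error.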

Additional context
Might be helpful: https://github.com/zepinglee/citeproc-lua/blob/ab3ce712cc92073f12be26ff0b22b30eb906092d/citeproc/citeproc-latex-data.lua#L517

@Siedlerchr
Member

Have you tried converting them to latex? We have latex2unicode and vice versa conversion already

@tobiasdiez
Member Author

It actually came from latex code that I converted to unicode (I want all my stuff in unicode). This is also not very helpful in recognizing which entry/field has the problematic character.

@koppor
Member

koppor commented Oct 28, 2023

The feature will be listed in the Check integrity dialog of JabRef.

The implementation will be similar to org.jabref.logic.integrity.AmpersandChecker.

@koppor koppor added the good first issue An issue intended for project-newcomers. Varies in difficulty. label Oct 28, 2023
@github-project-automation github-project-automation bot moved this to Free to take in Good First Issues Oct 28, 2023
@yuyan-z
Contributor

yuyan-z commented Nov 21, 2023

Hi koppor, thank you for suggesting this issue to me! I would like to take it.
I tried to reproduce this problem:

  1. create an example test.bib with a problematic unicode character
@Article{test,
  author = {Garcı́a},
  title  = {Test Article},
}
  2. import the test.bib into the library in JabRef. There's no error in this step
  3. create an example document.tex
\documentclass[12pt]{article}
{
	\begin{document}
		\begin{enumerate}
			\item Sample Citation: \cite{test}
		\end{enumerate}
		
		\bibliographystyle{apalike}
		\bibliography{test.bib}
	\end{document}
}
  4. build document.tex
$ pdflatex document.tex
$ bibtex document
$ pdflatex document.tex
$ pdflatex document.tex

This produces the error:

! LaTeX Error: Unicode character ́ (U+0301)
               not set up for use with LaTeX.

I wonder whether the goal is to automatically convert the problematic Unicode characters when importing or adding bib files in JabRef?

@ThiloteE ThiloteE moved this from Free to take to Reserved in Good First Issues Nov 21, 2023
@ThiloteE ThiloteE moved this from Free to take to Reserved in Candidates for University Projects Nov 21, 2023
@koppor
Member

koppor commented Nov 21, 2023

Perfectly reproduced! 👍

Did you see my comment #10506 (comment)?


  1. Click

image


  2. Issue appears

image

  • Side TODO: Please let JabRef focus the tab where the issue occurs

image


I think what @tobiasdiez would like to have is a warning at the field if the field fails an integrity check:

image

Note that the non-ASCII check should be enabled only in bibtex mode, not in biblatex mode.

Note that it should be possible to turn the integrity checks on and off per library (maybe too much for this PR).


If one wants to get it compiling:

Try biber instead of bibtex, or try bibtex8. The normal bibtex tool doesn't handle UTF-8 properly. See https://tex.stackexchange.com/a/34136/9075 for a hint.

@koppor
Member

koppor commented Nov 21, 2023

integrity check warning about them
...
and an automatic cleanup
..
It actually came from latex code that I converted to unicode (I want all my stuff in unicode). This is also not very helpful in recognizing which entry/field has the problematic character.

@tobiasdiez

  1. JabRef has a check for non-ASCII characters. See my screenshot at Add quality check and cleanup for problematic unicode characters #10506 (comment). I think this fulfills your "integrity check warning" wish. Could you retry with your JabRef?
  2. We have the unicode-to-latex conversion. We also do have automatic save. Please try to activate the converter "on save"
    image

On save, JabRef pops up "file was modified externally". Then, you even have a character diff.

Does that work for you?


@tobiasdiez @Siedlerchr I am not sure how to guide the student. I recommended that he put the checkers into the entry editor so that they run while typing, because it did not work there. Is this OK, or should we find yet another issue?

@ThiloteE
Member

The goal is not to automatically convert the symbols, because while Unicode engines like LuaTeX and XeTeX can read the Unicode characters, there are problems with older engines like pdfTeX. We can bridge the gap by detecting these characters in JabRef and hope that pdfTeX will eventually catch up or, more likely, that users will stop using pdfTeX.

Given the choice, I would assume most people would prefer not having to convert an à in their text to

\`{a}

just because their font engine can't read it. They probably would prefer an engine that simply works without having to do magic conversions. By the way it was hard to cite this non-precomposed character in markdown xD

I am not sure why we would force users that already use more modern Unicode engines to convert their precomposed Unicode characters like à back into non-precomposed characters like

\`{a}

in their entries. Manual conversion is fine, but no need for automatic conversion I think, no?

pdfTeX is still maintained, but there are not a lot of updates to their repo. See here: https://tug.org/applications/pdftex/. PostScript fonts, which are natively supported by pdfTeX, seem outdated and are being dropped by many operating systems and applications, so at some point the reason for pdfTeX's existence will fade away and people will move to other font engines. I think we should make it hard for users to stick to the outdated pdfTeX and incentivise switching to Unicode-compatible engines.

I propose the path forward for JabRef should be as follows:

  1. Have a (long) grace period with a warning that these characters are not supported by pdfTeX and offer converting characters to their unproblematic equivalents, but do not do so automatically, instead offer manual conversion in the cleanup dialogue. The warning should include pointing to alternative modern engines like LuaTeX or XeTeX that support unicode.
  2. In a future version of JabRef (very far in the future), drop support for manual conversion and only offer unicode characters.

@tobiasdiez
Member Author

tobiasdiez commented Nov 21, 2023

Note that the issue goes beyond the usual "bibtex is not compatible with unicode". As @ThiloteE correctly analyzed, the problem is the combination pdftex + biber (in particular the ascii check is not helpful).

The simplest solution would indeed be an automatic conversion of Unicode characters to Normalization Form C (NFC), or at least to combine Unicode characters if they have a single-character equivalent. So an already precomposed character can stay the same, but 0131 + 0301 is converted to 00ED (but not to its LaTeX code). Since by definition these Unicode representations are equivalent, lualatex/xetex will display the same character; it's just to help pdftex.
Alternatively, implement it as a save action that is on by default.
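
The conversion to Normal Form C can be sketched with the JDK's `java.text.Normalizer`. This is a minimal illustration, not JabRef code; the class name `NfcDemo` is hypothetical. One caveat worth checking before relying on plain NFC: Unicode canonical composition combines a regular i (U+0069) with U+0301 into U+00ED, but the dotless i (U+0131) has no canonical composite with U+0301, so the exact pair from the issue description would stay unchanged and would need extra handling.

```java
import java.text.Normalizer;

// Hypothetical demo class, not part of JabRef.
final class NfcDemo {

    /** Converts the input to Unicode Normalization Form C (canonical composition). */
    static String toNfc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        // Regular i followed by a combining acute accent is composed into U+00ED
        String decomposed = "Garci\u0301a";
        System.out.println(toNfc(decomposed).equals("Garc\u00EDa")); // prints true

        // Dotless i followed by a combining acute accent is left as-is by NFC
        String dotless = "Garc\u0131\u0301a";
        System.out.println(toNfc(dotless).equals(dotless)); // prints true
    }
}
```

Since NFC output is canonically equivalent to its input, Unicode-aware engines should render both forms identically, which matches the argument above that the conversion only helps pdftex.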

@koppor
Member

koppor commented Nov 21, 2023

Ah, I see.

Naively, this can be achieved by running unicode-to-latex and then latex-to-unicode, because our Unicode tables use Normal Form C. -- However, this is much effort (See #6155)

Similar functionality as org.jabref.logic.layout.format.ReplaceUnicodeLigaturesFormatter, but for the character "compression".

image


@tobiasdiez Do you propose a manual table like our org.jabref.logic.util.strings.UnicodeLigaturesMap#UnicodeLigaturesMap, but for Normal Form C? -- If yes, then this is a good first issue. Otherwise, I need to take back the assignment as good-first-issue.

@tobiasdiez
Member Author

@tobiasdiez Do you propose a manual table like our org.jabref.logic.util.strings.UnicodeLigaturesMap#UnicodeLigaturesMap, but for Normal Form C? -- If yes, then this is a good first issue. Otherwise, I need to take back the assignment as good-first-issue.

Yes. It also doesn't have to cover all known character compressions. The ones containing some of the problematic characters linked in the issue description should be good enough for now.
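
A manual table of this kind could look roughly like the sketch below. It assumes a plain map of combining sequences to precomposed characters; the class name `CombiningSequenceTable` is hypothetical and does not mirror the actual structure of `UnicodeLigaturesMap`. Only the first entry comes from the issue description; the others are illustrative assumptions.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical lookup table, not part of JabRef.
final class CombiningSequenceTable {

    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();

    static {
        // From the issue description: dotless i + combining acute -> i with acute
        REPLACEMENTS.put("\u0131\u0301", "\u00ED");
        // Illustrative assumptions: the same base letter with other combining accents
        REPLACEMENTS.put("\u0131\u0300", "\u00EC"); // combining grave
        REPLACEMENTS.put("\u0131\u0302", "\u00EE"); // combining circumflex
        REPLACEMENTS.put("\u0131\u0308", "\u00EF"); // combining diaeresis
    }

    /** Replaces every known combining sequence by its single-character equivalent. */
    static String replaceSequences(String input) {
        String result = input;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(replaceSequences("Garc\u0131\u0301a").equals("Garc\u00EDa")); // prints true
    }
}
```

A table like this can cover sequences that standard normalization leaves alone, at the cost of having to maintain the entries by hand.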

@Siedlerchr
Member

Siedlerchr commented Nov 21, 2023

Latex2Unicode library also uses NFC, or at least we use it

// Excerpt 1: falls back to NFC normalization when parsing yields no result
public static String format(String inField) {
    Objects.requireNonNull(inField);
    return parse(inField).orElse(Normalizer.normalize(inField, Normalizer.Form.NFC));
}

// Excerpt 2: normalizes successfully parsed text to NFC
if (parsingResult instanceof Parsed.Success) {
    String text = parsingResult.get().value();
    toFormat = Normalizer.normalize(text, Normalizer.Form.NFC);
    return Optional.of(UNDERSCORE_PLACEHOLDER_MATCHER.matcher(toFormat).replaceAll("_"));
}

@koppor
Member

koppor commented Dec 4, 2023

Two things to do

  1. New integrity check: String result = normalize-to-nfc(input); raise error if result != input
  2. New Cleanup/FieldFormatter/ ...: result = normalize-to-nfc(input);

@yuyan-z yuyan-z removed their assignment Jan 10, 2024
@harsh1898
Contributor

harsh1898 commented Jan 22, 2024

Hi @koppor
As you mentioned, we need to do two things. Could you clarify which result you are referring to, or elaborate on your last comment?

I have successfully reproduced the issue and figured it out with the help of the comments in this thread.

@koppor
Member

koppor commented Jan 22, 2024

@harsh1898 Do you know Ctrl+Shift+F in IntelliJ? Here, you can search for code.

Integrity Check

The class is org.jabref.logic.integrity.IntegrityCheck. With Alt+F1 and then Enter, you can navigate to the package in the project view. There, you find other integrity checks. I browsed around and found ValueChecker. I think the implementation is as follows:

  1. Implement UnicodeNormalFormCCheck in package org.jabref.logic.integrity. It implements interface ValueChecker.
  2. Add the UnicodeNormalFormCCheck to org.jabref.logic.integrity.FieldCheckers#getAllMap (in the biblatex mode branch).
  3. Check if it appears in the UI and test it with the example

New cleanup action

  1. Create a new formatter NormalizeUnicodeFormatter in org.jabref.logic.formatter.bibtexfields. Also create test cases
  2. Add it to org.jabref.logic.formatter.Formatters#getOthers.
  3. Check if it appears in the UI and test it with the example
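
The two pieces described above can be sketched as follows. This is a self-contained illustration under stated assumptions, not the merged implementation: `ValueCheckerSketch` is a hypothetical stand-in for JabRef's `ValueChecker` interface (whose real signature may differ), the formatter is shown as a plain class rather than a subclass of JabRef's formatter base type, and the warning text is made up.

```java
import java.text.Normalizer;
import java.util.Optional;

// Hypothetical stand-in for JabRef's ValueChecker; the real interface may differ.
interface ValueCheckerSketch {
    /** Returns a warning message if the value has a problem, otherwise an empty Optional. */
    Optional<String> checkValue(String value);
}

// Sketch of the integrity check: warn when the value is not already in NFC.
final class UnicodeNormalFormCCheckSketch implements ValueCheckerSketch {
    @Override
    public Optional<String> checkValue(String value) {
        if (value == null || value.isEmpty()) {
            return Optional.empty();
        }
        if (Normalizer.isNormalized(value, Normalizer.Form.NFC)) {
            return Optional.empty();
        }
        // Message text is an assumption for illustration only
        return Optional.of("value is not in Unicode Normalization Form C (NFC)");
    }
}

// Sketch of the cleanup formatter: rewrite the value into NFC.
final class NormalizeUnicodeFormatterSketch {
    String format(String value) {
        return Normalizer.normalize(value, Normalizer.Form.NFC);
    }
}

final class IntegritySketchDemo {
    public static void main(String[] args) {
        String author = "Garci\u0301a"; // i followed by a combining acute accent
        System.out.println(new UnicodeNormalFormCCheckSketch().checkValue(author).isPresent()); // prints true
        String cleaned = new NormalizeUnicodeFormatterSketch().format(author);
        System.out.println(new UnicodeNormalFormCCheckSketch().checkValue(cleaned).isPresent()); // prints false
    }
}
```

Keeping the check and the formatter on the same normalization form means a value that has been cleaned up no longer triggers the warning.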

@koppor koppor added the FirstTimeCodeContribution Triggers GitHub Greeter Workflow label Jan 22, 2024
Contributor

As general advice for newcomers: check out Contributing for a start. Also, the guidelines for setting up a local workspace are worth having a look at.

Feel free to ask here at GitHub if you have any issue-related questions. If you have questions about how to set up your workspace, use JabRef's Gitter chat. Try to open a (draft) pull request early on, so that people can see you are working on the issue and the direction the pull request is heading in. This way, you will likely receive valuable feedback.

@harsh1898
Contributor

Hi @koppor
Following your suggestion, I have tried to fix this issue with some changes to the code.

You can review my changes in pull request #10817.

harsh1898 pushed a commit to harsh1898/jabref that referenced this issue Jan 23, 2024
harsh1898 pushed a commit to harsh1898/jabref that referenced this issue Jan 23, 2024
harsh1898 pushed a commit to harsh1898/jabref that referenced this issue Jan 23, 2024
harsh1898 pushed a commit to harsh1898/jabref that referenced this issue Jan 24, 2024
@ThiloteE ThiloteE moved this from Reserved to In Progress in Candidates for University Projects Feb 28, 2024
github-merge-queue bot pushed a commit that referenced this issue Mar 19, 2024
* #10506 Added new integrity and clean up for non NFC

* #10506 Add key in JabRef_en.properties

* #10506 Fixed checkstyle and  markedown error

* #10506 fixed CHANGELOG.md error

* Update CHANGELOG.md

* Removed whitespaces unnecessary whitespace changes

* Removed unnecessary whitespace change

* Update PersonNamesChecker.java

* Update PersonNamesChecker.java

* #10506 Remove unnecessary comment line in FieldCheckers

* Fix issues

* Fix CHANGELOG.md

---------

Co-authored-by: Harshit.Gupta7 <[email protected]>
Co-authored-by: Harshit Gupta <[email protected]>
Co-authored-by: Carl Christian Snethlage <[email protected]>
@koppor koppor closed this as completed Mar 25, 2024
@github-project-automation github-project-automation bot moved this from Normal priority to Done in Features & Enhancements Mar 25, 2024
@github-project-automation github-project-automation bot moved this from Reserved to Done in Good First Issues Mar 25, 2024
6 participants