Add quality check and cleanup for problematic unicode characters #10506

tobiasdiez · 2023-10-16T16:46:44Z

Is your suggestion for improvement related to a problem? Please describe.

Some Unicode characters cause problems, even with biblatex support (e.g., pdflatex still does not completely support Unicode). For example, Garcı́a gives

Package inputenc Error: Unicode character ́ (U+0301)

A few of such problematic characters are:

Describe the solution you'd like

As these characters are hard to recognize, it would be nice to have an integrity check that warns about them, and an automatic cleanup that converts them to their unproblematic equivalents (e.g. U+0131 + U+0301 to U+00ED).
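To make such characters visible, a field value can be scanned for combining marks. The following is a minimal sketch, not JabRef code: the class name `CombiningMarkFinder` and its method are hypothetical, and it only flags code points whose general category is "non-spacing mark".

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical helper (not part of JabRef): lists combining marks in a field value. */
final class CombiningMarkFinder {

    /** Returns one "index: U+XXXX" entry per non-spacing combining mark in the text. */
    static List<String> findCombiningMarks(String text) {
        List<String> hits = new ArrayList<>();
        for (int i = 0; i < text.length(); ) {
            int codePoint = text.codePointAt(i);
            if (Character.getType(codePoint) == Character.NON_SPACING_MARK) {
                hits.add(String.format("%d: U+%04X", i, codePoint));
            }
            // Advance by one or two chars, depending on whether this is a surrogate pair
            i += Character.charCount(codePoint);
        }
        return hits;
    }

    public static void main(String[] args) {
        // The author name from the example: dotless i (U+0131) followed by combining acute (U+0301)
        String author = "Garc\u0131\u0301a";
        System.out.println(findCombiningMarks(author)); // prints [5: U+0301]
    }
}
```

Reporting the index alongside the code point is meant to help locate the character inside a long field, since the rendered text usually looks identical to the precomposed form.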

Additional context
Might be helpful: https://github.com/zepinglee/citeproc-lua/blob/ab3ce712cc92073f12be26ff0b22b30eb906092d/citeproc/citeproc-latex-data.lua#L517


Siedlerchr · 2023-10-16T17:07:04Z

Have you tried converting them to LaTeX? We already have latex2unicode conversion and vice versa.

tobiasdiez · 2023-10-16T17:15:07Z

It actually came from latex code that I converted to unicode (I want all my stuff in unicode). This is also not very helpful in recognizing which entry/field has the problematic character.

koppor · 2023-10-28T23:12:37Z

The feature will be listed in the Check integrity dialog of JabRef.

The implementation will be similar to org.jabref.logic.integrity.AmpersandChecker.

yuyan-z · 2023-11-21T10:36:46Z

Hi koppor, thank you for suggesting this issue to me! I hope to take it.
I tried to reproduce this problem:

Create an example test.bib with a problematic unicode character:

@Article{test,
  author = {Garcı́a},
  title  = {Test Article},
}

Import test.bib into the library in JabRef. There's no error in this step.
Create an example document.tex:

\documentclass[12pt]{article}
\begin{document}
	\begin{enumerate}
		\item Sample Citation: \cite{test}
	\end{enumerate}

	\bibliographystyle{apalike}
	\bibliography{test}
\end{document}

Build document.tex:

$ pdflatex document.tex
$ bibtex document
$ pdflatex document.tex
$ pdflatex document.tex

There's the error

! LaTeX Error: Unicode character ́ (U+0301)
               not set up for use with LaTeX.

I wonder if the goal is to automatically convert the problematic unicode character when importing or adding bib files in JabRef?

koppor · 2023-11-21T12:03:19Z

Perfectly reproduced! 👍

Did you see my comment #10506 (comment)?

Click

Issue appears

Side TODO: Please let JabRef focus the tab where the issue occurs

I think what @tobiasdiez would like to have is a warning at a field if the field fails an integrity check:

Note that the non-ascii check should be on only at bibtex mode, not in biblatex mode.

Note that the integrity checks should be turned on/off per library (maybe too much for this PR).

If one wants to get it compiling:

Try biber instead of bibtex, or try bibtex8. The normal bibtex tool doesn't handle UTF-8 properly. See https://tex.stackexchange.com/a/34136/9075 for a hint on using biber.

koppor · 2023-11-21T12:07:28Z

integrity check warning about them
...
and an automatic cleanup
..
It actually came from latex code that I converted to unicode (I want all my stuff in unicode). This is also not very helpful in recognizing which entry/field has the problematic character.

@tobiasdiez

JabRef has a check for non-ASCII characters. See my screenshot at #10506 (comment). I think this fulfills your "integrity check warning" wish. Could you retry with your JabRef?
We have the unicode-to-latex conversion. We also have automatic save. Please try to activate the converter "on save".

On save, JabRef pops up "file was modified externally". Then, you even have a character diff.

Does that work for you?

@tobiasdiez @Siedlerchr I am not sure how to guide the student. I recommended putting the checkers into the entry editor on type, because it did not work there. Is this OK, or should we find yet another issue?

ThiloteE · 2023-11-21T12:13:41Z

The goal is not to automatically convert the symbols, because while unicode engines like LuaTeX and XeTeX can read the unicode characters, there are problems with older engines like pdfTeX. We can bridge the gap by detecting these characters in JabRef and hope pdfTeX will eventually catch up or, more likely, that users will stop using pdfTeX.

Given the choice, I would assume most people would prefer not having to convert an à in their text to

\`{a}

just because their font engine can't read it. They probably would prefer an engine that simply works without having to do magic conversions. By the way it was hard to cite this non-precomposed character in markdown xD

I am not sure why we would force users that already use more modern unicode engines to convert their precomposed unicode characters like à back into non-precomposed characters like

\`{a}

in their entries. Manual conversion is fine, but no need for automatic conversion I think, no?

pdfTeX is still maintained, but there are not a lot of updates to its repo. See here: https://tug.org/applications/pdftex/. PostScript fonts, which are natively supported by pdfTeX, seem outdated and are being dropped by many operating systems and applications, so at some point the reason for pdfTeX's existence will fade away and people will move to other engines. I think we should make it hard for users to stick to the outdated pdfTeX and incentivise users to switch to unicode-compatible engines.

I propose the path forward for JabRef should be as follows:

Have a (long) grace period with a warning that these characters are not supported by pdfTeX and offer converting characters to their unproblematic equivalents, but do not do so automatically, instead offer manual conversion in the cleanup dialogue. The warning should include pointing to alternative modern engines like LuaTeX or XeTeX that support unicode.
In a future version of JabRef (very far in the future), drop support for manual conversion and only offer unicode characters.

tobiasdiez · 2023-11-21T13:51:32Z

Note that the issue goes beyond the usual "bibtex is not compatible with unicode". As @ThiloteE correctly analyzed, the problem is the combination pdftex + biber (in particular the ascii check is not helpful).

The simplest solution would indeed be an automatic conversion of unicode characters to Normal Form C, or at least to combine unicode characters if they have a single-character equivalent. So à can stay the same but 0131 + 0301 is converted to 00ED (but not to its latex code). Since by definition these unicode representations are equivalent, lualatex/xetex will display the same character; it's just to help pdftex.
Alternatively, implement it as a save-action that is on by default.
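The conversion to Normal Form C can be tried directly with `java.text.Normalizer` from the JDK. This is only an illustrative sketch (the class name `NfcDemo` is made up). It also points at something worth double-checking: NFC composes a base letter plus a combining accent when Unicode defines a precomposed equivalent, but the dotless i (U+0131) followed by U+0301 does not seem to have a canonical composition, so that particular pair would probably need separate handling.

```java
import java.text.Normalizer;

/** Illustrative only: shows what NFC normalization does to a few sample strings. */
final class NfcDemo {

    /** Composes combining sequences into precomposed characters where Unicode defines one. */
    static String toNfc(String input) {
        return Normalizer.normalize(input, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        String decomposed = "a\u0300";  // 'a' followed by combining grave accent
        String precomposed = "\u00E0";  // precomposed 'a with grave'

        System.out.println(toNfc(decomposed).equals(precomposed));  // true
        System.out.println(toNfc(precomposed).equals(precomposed)); // true: already NFC, unchanged

        // Dotless i + combining acute: no canonical composition exists,
        // so NFC leaves this sequence as it is.
        String dotless = "\u0131\u0301";
        System.out.println(toNfc(dotless).equals(dotless));         // true
    }
}
```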

koppor · 2023-11-21T14:01:23Z

Ah, I see.

Naively, this can be achieved, by running unicode-to-latex and latex-to-unicode, because our unicode tables use the normal form c. -- However, this is much effort (See #6155)

Similar functionality as org.jabref.logic.layout.format.ReplaceUnicodeLigaturesFormatter, but for the character "compression".

@tobiasdiez Do you propose a manual table as our org.jabref.logic.util.strings.UnicodeLigaturesMap#UnicodeLigaturesMap, but for Normal Form C? -- If yes, then this is a good first issue. Otherwise, I need to take back the assignment as good-first-issue.

tobiasdiez · 2023-11-21T14:11:03Z

@tobiasdiez Do you propose a manual table as our org.jabref.logic.util.strings.UnicodeLigaturesMap#UnicodeLigaturesMap, but for Normal Form C? -- If yes, then this is a good first issue. Otherwise, I need to take back the assignment as good-first-issue.

Yes. It also doesn't have to cover all known character compressions. The ones containing some of the problematic characters linked in the issue description should be good enough for now.
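A manual table of this kind could look roughly like the sketch below. It is not JabRef's UnicodeLigaturesMap: the class name `CombiningSequenceTable`, the method `compose`, and the second table entry are assumptions for illustration. Only the first mapping (0131 + 0301 to 00ED) comes from the issue description.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical manual replacement table for combining sequences; not JabRef's actual code. */
final class CombiningSequenceTable {

    private static final Map<String, String> REPLACEMENTS = new LinkedHashMap<>();

    static {
        // From the issue description: dotless i + combining acute -> i with acute (U+00ED)
        REPLACEMENTS.put("\u0131\u0301", "\u00ED");
        // Assumed extra entry for illustration: dotless i + combining grave -> i with grave (U+00EC)
        REPLACEMENTS.put("\u0131\u0300", "\u00EC");
    }

    /** Replaces every listed sequence with its single-character equivalent. */
    static String compose(String input) {
        String result = input;
        for (Map.Entry<String, String> entry : REPLACEMENTS.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }

    public static void main(String[] args) {
        String fixed = compose("Garc\u0131\u0301a");
        System.out.println(fixed.equals("Garc\u00EDa")); // true
    }
}
```

A small table like this only covers the sequences it lists, which matches the "good enough for now" scope; it could be applied before or after a general NFC pass.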

Siedlerchr · 2023-11-21T22:45:55Z

Latex2Unicode library also uses NFC, or at least we use it

jabref/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java

Lines 28 to 31 in 4718930

    
    public static String format(String inField) {
        Objects.requireNonNull(inField);
        return parse(inField).orElse(Normalizer.normalize(inField, Normalizer.Form.NFC));
    }

jabref/src/main/java/org/jabref/model/strings/LatexToUnicodeAdapter.java

Lines 43 to 46 in 4718930

    
    if (parsingResult instanceof Parsed.Success) {
        String text = parsingResult.get().value();
        toFormat = Normalizer.normalize(text, Normalizer.Form.NFC);
        return Optional.of(UNDERSCORE_PLACEHOLDER_MATCHER.matcher(toFormat).replaceAll("_"));

koppor · 2023-12-04T20:42:01Z

Two things to do

New integrity check: String result = normalize-to-nfc(input); raise error if result != input
New cleanup/FieldFormatter/...: result = normalize-to-nfc(input);

harsh1898 · 2024-01-22T06:25:58Z

Hi @koppor
As you mentioned, we need to do two things. Could you elaborate on which result you are referring to, or explain your last comment in more detail?

I have successfully reproduced the issue and understood it with the help of the comments in this thread.

koppor · 2024-01-22T07:32:35Z

@harsh1898 Do you know Ctrl+Shift+F in IntelliJ? Here, you can search for code.

Integrity Check

The class is org.jabref.logic.integrity.IntegrityCheck. With Alt+F1 and then Enter, you can navigate to the package in the project view. There, you find other integrity checks. I browsed around and found ValueChecker. I think the implementation is as follows:

Implement UnicodeNormalFormCCheck in package org.jabref.logic.integrity. It implements interface ValueChecker.
- See #10506 (comment) for an implementation hint.
- Use Ctrl+Shift+T to generate a skeleton of a test class. You can see other test classes outlining how to implement (e.g., org.jabref.logic.integrity.BibStringCheckerTest)
Add the UnicodeNormalFormCCheck to org.jabref.logic.integrity.FieldCheckers#getAllMap (in the biblatex mode branch).
Check if it appears in the UI and test it with the example
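The integrity check steps above might be sketched as follows. To keep the example self-contained, `ValueChecker` is declared here as a stand-in; the shape of JabRef's real interface (a single `checkValue` method returning an optional message) and the message text are assumptions, and in JabRef the message would normally go through localization.

```java
import java.text.Normalizer;
import java.util.Optional;

/** Stand-in for JabRef's ValueChecker; the real interface's exact shape is assumed here. */
interface ValueChecker {
    Optional<String> checkValue(String value);
}

/** Sketch of the proposed check: reports values that change under NFC normalization. */
final class UnicodeNormalFormCCheck implements ValueChecker {

    @Override
    public Optional<String> checkValue(String value) {
        if (value == null || value.isEmpty()) {
            return Optional.empty();
        }
        String normalized = Normalizer.normalize(value, Normalizer.Form.NFC);
        if (normalized.equals(value)) {
            return Optional.empty(); // already in NFC, nothing to report
        }
        // Hypothetical message text
        return Optional.of("value is not in Unicode Normal Form C (NFC)");
    }

    public static void main(String[] args) {
        ValueChecker checker = new UnicodeNormalFormCCheck();
        System.out.println(checker.checkValue("Garci\u0301a").isPresent()); // true: 'i' + combining acute
        System.out.println(checker.checkValue("Garc\u00EDa").isPresent());  // false: precomposed
    }
}
```

Comparing the normalized string with the input follows the "raise error if result != input" idea from the earlier comment; `Normalizer.isNormalized(value, Normalizer.Form.NFC)` would be an equivalent way to express the same test.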

New cleanup action

Create a new formatter NormalizeUnicodeFormatter in org.jabref.logic.formatter.bibtexfields. Also create test cases
Add it to org.jabref.logic.formatter.Formatters#getOthers.
Check if it appears in the UI and test it with the example
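The cleanup side could be sketched along these lines. `Formatter` below is a simplified stand-in, not JabRef's actual base class, and the name and key strings are hypothetical; only the class name NormalizeUnicodeFormatter and the NFC normalization come from the thread.

```java
import java.text.Normalizer;
import java.util.Objects;

/** Simplified stand-in for JabRef's Formatter base class (assumed shape). */
abstract class Formatter {
    abstract String getName();
    abstract String getKey();
    abstract String format(String value);
}

/** Sketch of the proposed cleanup: rewrites a field value into NFC. */
final class NormalizeUnicodeFormatter extends Formatter {

    @Override
    String getName() {
        return "Normalize Unicode"; // hypothetical display name
    }

    @Override
    String getKey() {
        return "normalize_unicode"; // hypothetical key
    }

    @Override
    String format(String value) {
        Objects.requireNonNull(value);
        return Normalizer.normalize(value, Normalizer.Form.NFC);
    }

    public static void main(String[] args) {
        Formatter formatter = new NormalizeUnicodeFormatter();
        String cleaned = formatter.format("Garci\u0301a");       // 'i' + combining acute
        System.out.println(cleaned.equals("Garc\u00EDa"));        // true
        System.out.println(formatter.format(cleaned).equals(cleaned)); // true: formatting twice changes nothing
    }
}
```

Because NFC normalization is idempotent, running the formatter repeatedly (for example as a save action) should not keep modifying the file.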

github-actions · 2024-01-22T07:33:29Z

As a general advice for newcomers: check out Contributing for a start. Also, guidelines for setting up a local workspace is worth having a look at.

Feel free to ask here at GitHub, if you have any issue related questions. If you have questions about how to setup your workspace use JabRef's Gitter chat. Try to open a (draft) pull-request early on, so that people can see you are working on the issue and so that they can see the direction the pull request is heading towards. This way, you will likely receive valuable feedback.

harsh1898 · 2024-01-23T06:35:40Z

Hi @koppor
As per your suggestion, I have tried to fix this issue with some changes to the code.

You can review my changes in pull request #10817.

* #10506 Added new integrity and clean up for non NFC
* #10506 Add key in JabRef_en.properties
* #10506 Fixed checkstyle and markedown error
* #10506 fixed CHANGELOG.md error
* Update CHANGELOG.md
* Removed whitespaces unnecessary whitespace changes
* Removed unnecessary whitespace change
* Update PersonNamesChecker.java
* Update PersonNamesChecker.java
* #10506 Remove unnecessary comment line in FieldCheckers
* Fix issues
* Fix CHANGELOG.md

---------

Co-authored-by: Harshit.Gupta7 <[email protected]>
Co-authored-by: Harshit Gupta <[email protected]>
Co-authored-by: Carl Christian Snethlage <[email protected]>

tobiasdiez added type: feature cleanup-ops integrity-checker labels Oct 16, 2023

tobiasdiez added this to Candidates for University Projects, Good First Issues and Features & Enhancements Oct 16, 2023

github-project-automation bot moved this to Free to take in Good First Issues Oct 16, 2023

github-project-automation bot moved this to Free to take in Candidates for University Projects Oct 16, 2023

github-project-automation bot moved this to Normal priority in Features & Enhancements Oct 16, 2023

Siedlerchr removed the status in Good First Issues Oct 16, 2023

Siedlerchr added the needs-refinement label Oct 16, 2023

Siedlerchr removed this from Good First Issues Oct 16, 2023

tobiasdiez removed the needs-refinement label Oct 17, 2023

koppor added the good first issue label Oct 28, 2023

koppor added this to Good First Issues Oct 28, 2023

github-project-automation bot moved this to Free to take in Good First Issues Oct 28, 2023

koppor mentioned this issue Nov 20, 2023

Entry editor should focus entry selected in the main table JabRef/jabref-koppor#541

Open

ThiloteE assigned yuyan-z Nov 21, 2023

ThiloteE moved this from Free to take to Reserved in Good First Issues Nov 21, 2023

ThiloteE moved this from Free to take to Reserved in Candidates for University Projects Nov 21, 2023

yuyan-z mentioned this issue Nov 21, 2023

Handling non-ascii characters in entry editor, related to issue#10506 #10662

Closed

6 tasks

yuyan-z removed their assignment Jan 10, 2024

koppor assigned harsh1898 Jan 22, 2024

koppor added the FirstTimeCodeContribution label Jan 22, 2024

harsh1898 pushed a commit to harsh1898/jabref that referenced this issue Jan 23, 2024

JabRef#10506 Added new integrity and clean up for non NFC

8aeace2

harsh1898 mentioned this issue Jan 23, 2024

Fix issue "Add quality check and cleanup for problematic unicode characters" #10817

Closed

6 tasks

harsh1898 pushed a commit to harsh1898/jabref that referenced this issue Jan 23, 2024

JabRef#10506 Add key in JabRef_en.properties

01496e2

harsh1898 pushed a commit to harsh1898/jabref that referenced this issue Jan 23, 2024

JabRef#10506 Fixed checkstyle and markedown error

5a90ec8

harsh1898 pushed a commit to harsh1898/jabref that referenced this issue Jan 23, 2024

JabRef#10506 fixed CHANGELOG.md error

0c5acfd

harsh1898 pushed a commit to harsh1898/jabref that referenced this issue Jan 24, 2024

JabRef#10506 Remove unnecessary comment line in FieldCheckers

ab88b55

ThiloteE moved this from Reserved to In Progress in Candidates for University Projects Feb 28, 2024

Siedlerchr mentioned this issue Mar 19, 2024

Add new integrity and clean up for non NFC #11056

Merged

6 tasks

koppor closed this as completed Mar 25, 2024

github-project-automation bot moved this from In Progress to Done in Candidates for University Projects Mar 25, 2024

github-project-automation bot moved this from Normal priority to Done in Features & Enhancements Mar 25, 2024

github-project-automation bot moved this from Reserved to Done in Good First Issues Mar 25, 2024
