Add quality check and cleanup for problematic unicode characters #10506
Have you tried converting them to LaTeX? We already have latex2unicode and vice-versa conversion.
It actually came from LaTeX code that I converted to Unicode (I want all my stuff in Unicode). This also does not help in recognizing which entry/field contains the problematic character.
The feature will be listed in JabRef's "Check integrity" dialog. The implementation will be similar to
Hi @koppor, thank you for suggesting this issue to me! I would like to take it.
Here is the error:
I wonder whether the goal is to automatically convert the problematic Unicode characters when importing or adding .bib files in JabRef?
Perfectly reproduced! 👍 Did you see my comment #10506 (comment)?
I think what @tobiasdiez would like to have is a warning at a field if that field fails an integrity check. Note that the non-ASCII check should be active only in BibTeX mode, not in biblatex mode. Note also that the integrity checks should be switchable on/off per library (maybe too much for this PR). If one wants to get it compiling: you can try to use biber. See https://tex.stackexchange.com/a/34136/9075 for a hint.
On save, JabRef pops up "file was modified externally". Then you even get a character diff. Does that work for you? @tobiasdiez @Siedlerchr I am not sure how to guide the student. I recommended that he put the checkers into the entry editor, triggered on typing, because the check did not work there. Is this OK, or should we file yet another issue?
The goal is not to automatically convert the symbols. While Unicode engines like LuaTeX and XeTeX can read these characters, older engines like pdfTeX have problems with them. We can bridge the gap by detecting these characters in JabRef and hope pdfTeX will eventually catch up, or, more likely, users will simply stop using pdfTeX. Given the choice, I assume most people would prefer not having to convert a character just because their font engine cannot read it. They would probably prefer an engine that simply works, without magic conversions. (By the way, it was hard to cite this non-precomposed character in Markdown xD.)

I am not sure why we would force users who already use modern Unicode engines to convert the precomposed Unicode characters in their entries. Manual conversion is fine, but there is no need for automatic conversion, I think. pdfTeX is still maintained, but there are not a lot of updates to its repository; see https://tug.org/applications/pdftex/. PostScript fonts, which pdfTeX supports natively, seem outdated and are being dropped by many operating systems and applications, so at some point the reason for pdfTeX's existence will fade away and people will move to other font engines. I think we should make it hard for users to stick to the outdated pdfTeX and incentivise switching to Unicode-compatible engines. I propose the path forward for JabRef should be as follows:
Note that the issue goes beyond the usual "BibTeX is not compatible with Unicode". As @ThiloteE correctly analyzed, the problem is the combination pdfTeX + biber (in particular, the ASCII check is not helpful). The simplest solution would indeed be an automatic conversion of Unicode characters to Normalization Form C (NFC), or at least combining Unicode characters when they have a single-character equivalent.
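To illustrate what NFC conversion does here, a minimal sketch using only the JDK's built-in java.text.Normalizer (this is not JabRef code; class name and strings are illustrative). It also shows the limitation relevant to this issue: a sequence with no precomposed code point, such as dotless i + combining acute, is already in NFC and is therefore left untouched, which is why a manual table is also discussed in the thread.

```java
import java.text.Normalizer;

public class NfcDemo {
    public static void main(String[] args) {
        // "i" (U+0069) followed by COMBINING ACUTE ACCENT (U+0301)
        String decomposed = "Garci\u0301a";
        String nfc = Normalizer.normalize(decomposed, Normalizer.Form.NFC);
        // NFC composes i + U+0301 into the single code point U+00ED ("í")
        System.out.println(nfc.contains("\u00ED")); // true

        // DOTLESS I (U+0131) + U+0301 has no precomposed form,
        // so NFC leaves the two-code-point sequence unchanged
        String dotless = "\u0131\u0301";
        String normalized = Normalizer.normalize(dotless, Normalizer.Form.NFC);
        System.out.println(normalized.equals(dotless)); // true
    }
}
```

Running `NfcDemo` prints `true` twice: the ordinary `i` + accent sequence composes, while the dotless-i sequence survives normalization unchanged.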
Ah, I see. Naively, this can be achieved by running unicode-to-latex and then latex-to-unicode, because our Unicode tables use Normalization Form C. However, this is much effort (see #6155). We need similar functionality to org.jabref.logic.layout.format.ReplaceUnicodeLigaturesFormatter, but for character "compression". @tobiasdiez, do you propose a manual table?
Yes. It also doesn't have to cover all known character compressions. The ones containing some of the problematic characters linked in the issue description should be good enough for now.
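A minimal sketch of such a manual table, in the spirit of ReplaceUnicodeLigaturesFormatter but for combining sequences that NFC cannot compose. The class name and the second mapping are hypothetical; only the U+0131 + U+0301 → U+00ED mapping comes from the issue description.

```java
import java.util.Map;

// Hypothetical formatter (not JabRef's actual class): replaces combining
// sequences for which no precomposed code point exists, using a manual table
public class ReplaceCombiningSequencesFormatter {

    // U+0131 (dotless i) + U+0301 (combining acute) -> U+00ED ("í"),
    // as suggested in the issue description; the grave-accent entry
    // below is an assumed analogous mapping
    private static final Map<String, String> SEQUENCES = Map.of(
            "\u0131\u0301", "\u00ED",
            "\u0131\u0300", "\u00EC");

    public String format(String value) {
        String result = value;
        for (Map.Entry<String, String> entry : SEQUENCES.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue());
        }
        return result;
    }
}
```

For example, `new ReplaceCombiningSequencesFormatter().format("Garc\u0131\u0301a")` yields "García" with a single precomposed í.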
The Latex2Unicode library also uses NFC, or at least that is how we use it
Two things to do:
Hi @koppor, I have successfully reproduced the bug/issue and figured it out with the help of the thread comments above.
@harsh1898 Do you know Ctrl+Shift+F in IntelliJ? With it, you can search for code.

Integrity check: the class is
New cleanup action
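To make the two tasks concrete, here is a sketch of what the integrity-check side could look like. The `checkValue` shape mirrors JabRef's integrity checkers in org.jabref.logic.integrity, but this class is written standalone and its name is an assumption, not the actual implementation.

```java
import java.text.Normalizer;
import java.util.Optional;

// Sketch of an integrity check flagging field values that are not in
// Unicode Normalization Form C (standalone; checker shape assumed)
public class NoNfcFormatChecker {

    public Optional<String> checkValue(String value) {
        if (value == null || value.isEmpty()) {
            return Optional.empty();
        }
        if (!Normalizer.isNormalized(value, Normalizer.Form.NFC)) {
            return Optional.of("Value is not in Unicode Normalization Form C (NFC)");
        }
        return Optional.empty();
    }
}
```

Caveat: sequences without a precomposed form, such as U+0131 + U+0301, are already in NFC and pass this check, so the cleanup action still needs the manual replacement table discussed above.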
As general advice for newcomers: check out Contributing for a start. The guidelines for setting up a local workspace are also worth a look. Feel free to ask here on GitHub if you have any issue-related questions. If you have questions about how to set up your workspace, use JabRef's Gitter chat. Try to open a (draft) pull request early on, so that people can see you are working on the issue and the direction the pull request is heading. This way, you will likely receive valuable feedback.
* #10506 Added new integrity check and cleanup for non-NFC
* #10506 Add key in JabRef_en.properties
* #10506 Fixed checkstyle and markdown errors
* #10506 Fixed CHANGELOG.md error
* Update CHANGELOG.md
* Removed unnecessary whitespace changes
* Removed unnecessary whitespace change
* Update PersonNamesChecker.java
* Update PersonNamesChecker.java
* #10506 Remove unnecessary comment line in FieldCheckers
* Fix issues
* Fix CHANGELOG.md

Co-authored-by: Harshit.Gupta7 <[email protected]>
Co-authored-by: Harshit Gupta <[email protected]>
Co-authored-by: Carl Christian Snethlage <[email protected]>
Is your suggestion for improvement related to a problem? Please describe.
Some Unicode characters cause problems, even with biblatex support (e.g., pdflatex still does not completely support Unicode). For example,
Garcı́a
gives an error. A few such problematic characters are:
Describe the solution you'd like
As these characters are hard to recognize, it would be nice to have an integrity check warning about them, and an automatic cleanup that converts them to their unproblematic equivalents (e.g., U+0131 + U+0301 to U+00ED).
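Why these characters are hard to recognize can be seen by dumping the code points of "Garcı́a": the accent is a separate combining code point that renders on top of the dotless i, so the string looks normal on screen. A small standalone sketch (class name is illustrative):

```java
// Show why "Garcı́a" is hard to spot: the accent is a separate code point
public class CodePointDemo {
    public static void main(String[] args) {
        // U+0131 is DOTLESS I, U+0301 is COMBINING ACUTE ACCENT
        "Garc\u0131\u0301a".codePoints()
                .forEach(cp -> System.out.printf("U+%04X%n", cp));
        // Prints, one per line:
        // U+0047 U+0061 U+0072 U+0063 U+0131 U+0301 U+0061
    }
}
```

Seven code points for a string that visually has six characters; the U+0131 U+0301 pair is exactly what the cleanup should collapse.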
Additional context
Might be helpful: https://github.com/zepinglee/citeproc-lua/blob/ab3ce712cc92073f12be26ff0b22b30eb906092d/citeproc/citeproc-latex-data.lua#L517