Valid utf-8 title flagged as non utf-8 by check integrity #8022

Closed
2 tasks done
crystalfp opened this issue Aug 23, 2021 · 11 comments · Fixed by #8359

Comments

@crystalfp

JabRef version

Latest development branch build (please note build date below)

Operating system

Windows

Details on version and operating system

JabRef 5.4--2021-08-21--644e48d Windows 10 10.0 amd64 Java 16.0.2 JavaFX 16+8

Checked with the latest development build

  • I made a backup of my libraries before testing the latest development version.
  • I have tested the latest development version and the problem persists

Steps to reproduce the behaviour

  1. Load the attached database (two entries) as biblatex after verifying that it is indeed UTF-8
  2. Run Quality -> Check integrity
  3. Atmatzidou2016 entry: field "title" is flagged as "not utf-8 field found" (but the quote after "students" is UTF-8)
  4. RosaZipitria2018 entry: fields "booktitle" and "author" are flagged as "not utf-8 field found" (but the quote after CSERC and the í in Zipitría are UTF-8)

To be honest, I tried only in the development version.

@Article{Atmatzidou2016,
  author       = {Soumela Atmatzidou and Stavros Demetriadis},
  date         = {2016-01},
  journaltitle = {Robotics and Autonomous Systems},
  title        = {Advancing students’ computational thinking skills through educational robotics: A study on age and gender relevant differences},
  doi          = {10.1016/j.robot.2015.10.008},
  pages        = {661--670},
  volume       = {75},
  langid       = {english},
  publisher    = {Elsevier {BV}},
}

@InProceedings{RosaZipitria2018,
  author    = {Sylvia {da Rosa Zipitría}},
  booktitle = {Proceedings of the 7th Computer Science Education Research Conference ({CSERC}’18)},
  date      = {2018-10},
  title     = {{Piaget and Computational Thinking}},
  doi       = {10.1145/3289406.3289412},
  location  = {Saint-Petersburg, Russia},
  publisher = {{ACM} Press},
  langid    = {english},
}

@Siedlerchr
Member

Hm, I cannot reproduce the issue; I only get warnings in BibTeX mode.
Have you checked that Library -> Library properties are indeed set to UTF-8 and biblatex?

@Siedlerchr added the status: waiting-for-feedback label on Sep 9, 2021
@crystalfp
Author

Yes indeed, the library is utf-8 biblatex. Here is the library that gives the problem:
UTF8error.txt

Rename it and load in:
JabRef 5.4--2021-08-23--cef4151
Windows 10 10.0 amd64
Java 16.0.2
JavaFX 16+8

Thanks for looking!

@Siedlerchr
Member

This is odd. It looks fine for me:
[screenshot]

@crystalfp
Author

I really don't know what to think. If the JabRef version and the library are the same, it could be something related to Java or the OS. Have you tried on Windows? Is your Java 32-bit or 64-bit? When I am in the office, I will try an installation on a Linux VM to see whether that makes a difference. Or maybe there is a way to output some debug information to help diagnose where the misclassification happens.

@ThiloteE
Member

I can confirm this for
the current release
JabRef 5.3--2021-07-05--50c96a2
Windows 10 10.0 amd64
Java 16.0.1
JavaFX 16+8

and for the development version
JabRef 5.4--2021-09-10--6c145fc
Windows 10 10.0 amd64
Java 16.0.2
JavaFX 17+18

[screenshot]

I tried Quality > Cleanup entries (Alt+F8) with the field formatters for all text fields set to "LaTeX to Unicode" as well as "HTML to Unicode", but that did not help either.

@Siedlerchr
Member

Thanks for the insight. I tested on macOS, so it may be a Windows-specific problem then.

@Siedlerchr added the integrity-checker and os: windows labels and removed the status: waiting-for-feedback label on Sep 10, 2021
@crystalfp
Author

Yes, seems it is Windows specific.

@k3KAW8Pnf7mkmdSMPHz27
Member

On Windows,

Charset charset = Charset.forName(System.getProperty("file.encoding"));
reports windows-1252
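
To illustrate why that default matters, here is a minimal sketch (not JabRef code; the class and helper names are made up for this example) that prints the bytes produced for the two flagged characters under windows-1252 and under UTF-8. It assumes the checker encodes field values with the platform default charset before testing them as UTF-8.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical demo class, not part of JabRef.
final class EncodedBytesDemo {

    // Render a byte array as space-separated upper-case hex pairs.
    static String hex(byte[] bytes) {
        StringBuilder sb = new StringBuilder();
        for (byte b : bytes) {
            if (sb.length() > 0) {
                sb.append(' ');
            }
            sb.append(String.format("%02X", b & 0xFF));
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        Charset cp1252 = Charset.forName("windows-1252");
        String quote = "\u2019";  // right single quotation mark, as in "students’"
        String iAcute = "\u00ED"; // í, as in "Zipitría"

        System.out.println(hex(quote.getBytes(cp1252)));                   // 92
        System.out.println(hex(quote.getBytes(StandardCharsets.UTF_8)));   // E2 80 99
        System.out.println(hex(iAcute.getBytes(cp1252)));                  // ED
        System.out.println(hex(iAcute.getBytes(StandardCharsets.UTF_8)));  // C3 AD
    }
}
```

In UTF-8, 0x92 is a continuation byte that cannot start a sequence, and 0xED is a lead byte that must be followed by two continuation bytes. So bytes produced with windows-1252 fail a strict UTF-8 decode even though the characters themselves are perfectly representable in UTF-8.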

@Siedlerchr
Member

Thanks for the investigation. A correct solution does not seem easy to find, but I think this one is better suited: https://stackoverflow.com/a/59167280/3450689

@k3KAW8Pnf7mkmdSMPHz27
Member

k3KAW8Pnf7mkmdSMPHz27 commented Dec 23, 2021

I am struggling a bit with the Gradle setup on Windows so I can't really run tests/debug at the moment. But, if we assume

CharsetDecoder decoder = StandardCharsets.UTF_8.newDecoder();
try {
    decoder.decode(ByteBuffer.wrap(data));
} catch (CharacterCodingException ex) {
    return false;
}
return true;

works, we can use bibDatabaseContext.getMetaData().getEncoding() instead of Charset charset = Charset.forName(System.getProperty("file.encoding"));.

We can also add those extra configs.
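
As a rough, self-contained sketch of that idea (not the actual JabRef implementation): the class name, `flaggedFields`, and the plain `Map` of fields are hypothetical stand-ins, and the `encoding` parameter stands in for whatever `bibDatabaseContext.getMetaData().getEncoding()` yields for the library.

```java
import java.nio.ByteBuffer;
import java.nio.charset.CharacterCodingException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not part of JabRef.
final class Utf8FieldCheckSketch {

    // Strict check: a fresh decoder reports malformed input by throwing.
    static boolean isValidUtf8(byte[] data) {
        try {
            StandardCharsets.UTF_8.newDecoder().decode(ByteBuffer.wrap(data));
            return true;
        } catch (CharacterCodingException ex) {
            return false;
        }
    }

    // Encode each field value with the given charset and collect the names
    // of the fields whose bytes are not valid UTF-8.
    static List<String> flaggedFields(Map<String, String> fields, Charset encoding) {
        List<String> flagged = new ArrayList<>();
        for (Map.Entry<String, String> field : fields.entrySet()) {
            if (!isValidUtf8(field.getValue().getBytes(encoding))) {
                flagged.add(field.getKey());
            }
        }
        return flagged;
    }

    public static void main(String[] args) {
        Map<String, String> fields = new LinkedHashMap<>();
        fields.put("author", "Sylvia {da Rosa Zipitr\u00EDa}");
        fields.put("booktitle", "Proceedings of the 7th Computer Science Education Research Conference ({CSERC}\u201918)");
        fields.put("title", "{Piaget and Computational Thinking}");

        // Platform default as reported on Windows: the two non-ASCII fields are flagged.
        System.out.println(flaggedFields(fields, Charset.forName("windows-1252"))); // [author, booktitle]
        // Library encoding (UTF-8 for this library): nothing is flagged.
        System.out.println(flaggedFields(fields, StandardCharsets.UTF_8));          // []
    }
}
```

Passing the charset in explicitly keeps the result independent of the machine's `file.encoding`, which would explain why the same library checks clean on macOS but not on Windows.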

@crystalfp
Copy link
Author

I confirm that it is solved in
JabRef 5.5--2021-12-27--32223a4
Windows 10 10.0 amd64
Java 16.0.2
JavaFX 17.0.1+1
Thanks a lot!

@koppor moved this to Done in Prioritization on Nov 10, 2022