-
Notifications
You must be signed in to change notification settings - Fork 15
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
New quality check and cleanup for & #585
New quality check and cleanup for & #585
Comments
Hi, seems like this is still free to take. We would like to take this as a first issue and post a PR soon :) |
Thank you for choosing JabRef! Looking forward to your PR! - Would it be possible to tell us which course recommended JabRef? 🤩 - Here or via email (which you find on my profile) As a general advice for newcomers: check out Contributing for a start. Also, guidelines for setting up a local workspace is worth having a look at. Feel free to ask here at GitHub, if you have any issue related questions. If you have questions about how to setup your workspace use JabRef's Gitter chat. Try to open a (draft) pull-request early on, so that people can see you are working on the issue and so that they can see the direction the pull request is heading towards. This way, you will likely receive valuable feedback. |
Quality check is now implemented: JabRef#9758 |
This is a part of JabRef#8712. When working on this, maybe completely fix JabRef#8712? |
Hello, looking into this, I just realized that the cleanup action for unescaped ampersands is already implemented in jabref. So I am definitely up for a new task. :) As far as I understand we need to add missing integrity checks for %, _, $ to warn the user if they are not escaped. We could easily do that extracting the ampersand check as a template. We additionally need to decide whether we alter the integrity check for # to be usable with the template. As of right now it functions differently from the & check "firing" only if there is an uneven amount of unescaped #. I guess when implementing it, it was assumed that the user rather wants the formatting functionality than the sign. Further, we need cleanup actions for % and # (whereas the # cleanup action should only be implemented if the integrity check follows the same pattern as the ampersand integrity check does: assuming the user wants plain signs over formatter functionalities). And as far as the integrity check for table 2 symbols goes. Do we check the latex commands or input unicode chars? I mean: Do we warn the user if he uses a symbol that's correctly displayed in jabref but not in latex (when inputting unicode chars), such as '○' or vice versa? Or do we check the latex commands for spelling mistakes? (I think I did not fully understand this part) |
Be careful with the "#" sign. The # is used for BibTex-Strings! https://docs.jabref.org/advanced/strings |
While you are talking about, it is more complicated:
Example: Now, the thing with Maybe, this detailed parsing is too much. Idea: A LaTeX parser should be used to check if the title parses correctly. SnuggleTeX (Example: https://github.com/davemckain/snuggletex/blob/development_1_2_x/snuggletex-core/src/main/java/uk/ac/ed/ph/snuggletex/samples/MinimalExample.java) seems to be good. A fork of it can be retrieved from maven central: https://github.com/rototor/snuggletex. Note: Playing around with that will take some time... In the long run, this could replace our latex2unicode converter. -- In the short run, it should only be checked if the content can be parsed using LaTeX.
This is a hard one. As Christoph stated:
Cleanup for No straight-forward cleanup for
I would leave that part as is. No checking whatever. I could neverthless try to understand it: Take an arbitrary command of the table. If it renders in JabRef correctly: No worries. If it does not render correctly: Add it to a check. -- However, this heavily depends on the latex2unicode converter. Instead of checking for things it cannot handle, the converter should be made better (or replaced by SnuggleTeX as outlined above). -- In other words: The user should not adapt their bibtex library because JabRef cannot handle it. |
Thank you for the detailed response. I will look into it and try out snuggletex this weekend.
In case we decide against snuggletex and need to check it manually, we can surely do that. But why check for spaces at the beginning and end? As far as I know they will not effect the functionality and the user may just leave them out. In any case, we can use the BibStringChecker ( |
I am sorry, initially I missunderstood the task: I thought the task was simply to warn the users whenever they use latex special characters, but now I realized that this would be very restricting and that the real task is to check whether it's valid latex to guarantee a pain free workflow for the user when integrating the bib files into his own documents. |
It's a mix of both. But I would start with the low hanging fruits:
For the moment I would not consider # and the more complex cases. That should be fine. |
Okay, but I would like to use a proper parser like snuggletex "from the get go", since the code I write without using it would need to be replaced eventually anyways. Further I just now had some time to try it out and really like it! :-) I have already tried some integration testing with snuggletex in jabref - it does not seem to immediately break any Tests: Using snuggletex it could be as simple as this (basicly takes care of all integrity checks):
This is jsut a mock and has to be improved but just for fun I edited one of the IntegrityChecks to use Snuggletex, seems promising: We'd need to copy and slightly adjust the libraries properties file into our own for custom translation, but that's no big deal. And Since If you want to try how snuggletex integrates in jabref yourself, you can add these lines to the
and this line to
If you @koppor , @Siedlerchr and the other maintainers allow it, I'd like to integrate it. That way, the integrity checks are easily implemented, and we do not rely on handwritten comparisons or checking edge cases, but instead use a proper parser. The only thing left to do would be cleanup actions. PS: I have not checked if all functions of it work, but tokenization and parsing seem fine. |
Thanks for the investigation. We can try, but I think we run into problems with jlink + jpackage and the method too large error. You can try this locally, run a gradlew jpackage and see if it works. https://bugs.openjdk.org/browse/JDK-8240567 If this doesn't work we need your "manual" fix.
|
The error messages are really great! Looking forward to see it integrated. Regarding the jlink issue, we need to spend energy again to get openjdk/jdk#10704 done. |
It did indeed fail, unfortunately. :( I see you are working on this fix since quite a while, just out of curiosity when do you hope to have this resolved? Regarding the snuggletex implementation, unfortunately it can not be used to handle all our use cases (for example it can not check for # out of the box) - but we have access to the parsed tokens which means we can implement our own functionality on top of that. (This also raises the question, whether we should consider rewriting the cleanup actions based on that.)
I am currently working on getting it cleaned up (as of now it's barely a hack).
Just started doing checks for _, %, $ and cleanup for $, I am aiming to get this ready over the upcoming holidays (~2 weeks) |
No worries regarding method too large. More motivation to get it fixed. Regarding #, maybe it's a SnuggleTex limitation? Is it able to parse |
No matter the constellation, if there is at least one # in the text, Snuggletex does treat it as error TTEG04, with this message: "Argument placeholder tokens (e.g. # 1) may only appear in command and environment definitions". It can parse it, but it treats # differently than we do. We can easily work around this since we have access to the tokens. For starters, since the # integrity check is already implemented, we may as well just exclude TTEG04. |
Excluding TTEG04 is OK for me. Can you file an issue at https://github.com/davemckain/snuggletex/issues? |
I've looked into the library, there is no bibtex reference anywhere. It does not seem like it was ever ment to support including / validating bibtex files. It's build just for plain latex and hence only allows # in "inside command/environment definitions" LaTeXTokeniser.java. Their website states: "SnuggleTeX’s parser pretends that TeX never happened and may behave slightly differently to what experienced LaTeX users might expect." And since this is a bibtex and no latex feature, I do not think this change should be made to the source, nor that the maintainer would do it. Should I still file the issue? |
Just created a PR with what is ready so far. It is by no means perfect and does break some unit tests (because of localization). But its functional. The error messages are ported from the original snuggletex repo, we may need to rewrite them and can probably exclude some. |
Ah, thank you, I was too quick in writing my thought. THank you for digging deeper! You must not pass the field value directly to SnuggleTeX. The handling of See test cases for examples
Seeing the code of org.jabref.logic.bibtex.FieldWriter#formatAndResolveStrings, you should just ignore checking if the field value contains Future work can provide a more proper data model for a BibTeX field |
"method too large" problem is fixed in the current development version. |
Thats good news! The requested test cases in the koppor repo for the snuggletex integration where also added. So I guess we are good to go? |
But we need to disable it for file field |
Addition to JabRef#8948
&
-- current user documentation: https://docs.jabref.org/finding-sorting-and-cleaning-entries/checkintegrity&
(when unescaped) -- current user documentation: https://docs.jabref.org/finding-sorting-and-cleaning-entries/cleanupentriesThe text was updated successfully, but these errors were encountered: