Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Unicode update for Julia, XML, HTML, YAML, CSS and Javascript lexers. #1537

Merged
merged 5 commits into from
Jul 4, 2020
Merged

Unicode update for Julia, XML, HTML, YAML, CSS and Javascript lexers. #1537

merged 5 commits into from
Jul 4, 2020

Conversation

BenjaminGalliot
Copy link
Contributor

@BenjaminGalliot BenjaminGalliot commented Jun 8, 2020

Hello,

As you asked me in this issue, I now make a PR. It was my first time to do some Ruby so I tried to be as conservative as possible, and following as much as possible the documentations and recommendations.

Edit: because it was not accepted yet, I added a minor correction.

Sincerely.

(This fixes #1534.)

@pyrmont pyrmont self-assigned this Jun 9, 2020
@pyrmont pyrmont added the needs-review The PR needs to be reviewed label Jun 9, 2020
Copy link
Contributor

@pyrmont pyrmont left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for going through and adding these! It looks like it must have been a bit of a pain >_<

A concern I have is the human readability of some of these identifiers. I like having Rouge be more correct but I also worry about maintenance. Is it possible to write simpler expressions (even if they may not be completely accurate) or does that cause things to break? Either by using POSIX bracket expressions or the more descriptive character properties (e.g. \p{Word})?

@pyrmont pyrmont added author-action The PR has been reviewed but action by the author is needed and removed needs-review The PR needs to be reviewed labels Jun 19, 2020
@BenjaminGalliot
Copy link
Contributor Author

BenjaminGalliot commented Jun 20, 2020

OK, I made a small commit between to have a corrected and more accurate version and the last commit is the simplified one (more permissive on rare characters).

I kept the small commit because it is a kind of flag if one day someone bumps into a problem due to simplified version (so around \p{Me} and \p{No}, not normally in more complete and longer regexes but added by \p{Word})…

If you think class subtraction is not too hard to read (it seems Ruby 1.9+ allows it), I can try, but it becomes quite longer…

Compare:
most accurate:
[\p{L}\p{Nl}\p{Mc}\p{Mn}\p{Nd}\p{Pc}\p{Cf}-]
simplified:
[\p{Word}\p{Cf}-]
most accurate with subtractions:
[[\p{Word}\p{Cf}-]&&[^\p{Me}\p{No}]]

@pyrmont pyrmont added needs-review The PR needs to be reviewed and removed author-action The PR has been reviewed but action by the author is needed labels Jul 4, 2020
@pyrmont
Copy link
Contributor

pyrmont commented Jul 4, 2020

@BenjaminGalliot Thanks for making those changes. This looks good and I think that reads a easier. I don't think the subtraction is necessary (at least not until we here from people hitting problems caused by it).

I'll merge and this will be part of the v3.21.0 release that's scheduled to be pushed to RubyGems on Tuesday 14 July 🎉

@pyrmont
Copy link
Contributor

pyrmont commented Jul 4, 2020

@BenjaminGalliot Sorry—in my haste, I realised that this is missing examples. Could you add some simple examples to the visual samples for each language? These files are in spec/visual/samples/ under the name of the lexer.

@pyrmont pyrmont added author-action The PR has been reviewed but action by the author is needed and removed needs-review The PR needs to be reviewed labels Jul 4, 2020
@BenjaminGalliot
Copy link
Contributor Author

BenjaminGalliot commented Jul 4, 2020

Added! Thank you, @pyrmont!

@pyrmont pyrmont merged commit 730208c into rouge-ruby:master Jul 4, 2020
@pyrmont pyrmont removed the author-action The PR has been reviewed but action by the author is needed label Jul 4, 2020
@pyrmont
Copy link
Contributor

pyrmont commented Jul 4, 2020

@BenjaminGalliot Thanks for all your work on this PR. I've merged it in and it will be part of v3.21.0 of Rouge. That's set for release on Tuesday 14 July. I'm not sure when GitLab will start using it but they're usually pretty good at updating to the latest gem so I don't expect it will take too long. Thanks again! 🎉

mattt pushed a commit to NSHipster/rouge that referenced this pull request May 19, 2021
…#1537)

Most of Rouge's lexers use rules that only match ASCII characters. This
is often not strictly correct as many languages support the use of
non-ASCII characters in their identifiers.

This commit adds support for non-ASCII characters to the CSS, HTML,
JavaScript, Julia, XML and YAML lexers. The regular expressions used
are more permissive than they should be if they were to be completely
correct but this is intentional. Ease of maintenance has been
prioritised over syntactic correctness.

Co-authored-by: Michael Camilleri <[email protected]>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Allow Unicode identifiers in several languages (Julia, XML, HTML, YML, CSS, JS, etc.)
2 participants