Perl highlighter broken for object with many keys #981
Ugh it's because it is interpreting the last sentence as a regex.
# Perl allows any non-whitespace character to delimit a regex when `m` is used.
rule %r(m(\S).*\1[msixpodualngc]*), re_tok
|
Thanks for finding this. I’ll take a look and see how to fix it.
…On Fri, Sep 7, 2018 at 11:39 PM María Inés Parnisari < ***@***.***> wrote:
Ugh it's because it is interpreting the last sentence as a regex.
use constant TESTING => {
hello => 'world',
cost => "22",
hello2 => 'world',
mov => 'vom',
cost2 => 2,
hello3 => 'hi'
}
# Perl allows any non-whitespace character to delimit a regex when `m` is used.
rule %r(m(\S).*\1[msixpodualngc]*), re_tok
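To make the failure concrete, here is a minimal sketch that runs the pattern from the rule above against one line of the sample hash. It assumes the lexer tries each rule at the current scan position, which the `\A` anchor imitates; the constant and helper names are made up for illustration and are not part of Rouge.

```ruby
# The broad pattern from the rule above, anchored to the start of the text
# to imitate a lexer rule being tried at the current scan position.
BROAD_M_RULE = /\Am(\S).*\1[msixpodualngc]*/

# Hypothetical helper: returns the matched text, or nil when nothing matches.
def matched_text(pattern, text)
  match = pattern.match(text)
  match && match[0]
end

# The `o` in `mov` is taken as the delimiter, the `o` in 'vom' closes the
# "regex", and the trailing `m` is read as a flag.
puts matched_text(BROAD_M_RULE, "mov => 'vom',").inspect  # => "mov => 'vom"

# A genuine match operator is matched too, as intended.
puts matched_text(BROAD_M_RULE, "m!foo!i").inspect        # => "m!foo!i"
```

So any identifier or hash key that starts with `m` and whose second character appears again later on the line is a candidate for being swallowed as a regex.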
|
This was introduced by #974 |
Yeah, reverting that PR fixes it. Let me know if there's any way I can help fix it. |
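I propose that we take a practical approach - who uses letters as delimiters? Nobody in their right mind. So, let's only allow symbols. I.e. instead of having
rule %r(m(\S).*\1[msixpodualngc]*), re_tok
We could have
rule %r(m([\/\\!\{\}\(\)@<>,;%&]).*\1[msixpodualngc]*), re_tok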
|
The change in #974 was to fix a catastrophic backtrack issue. Whether or
not many people use letters as delimiters, the problem is that if they do
and we don’t match it, it will go on to cause the catastrophic backtrack later.
See https://gitlab.com/gitlab-org/gitlab-ce/issues/49474 in GitLab.
…On Fri, Nov 23, 2018 at 10:20 PM María Inés Parnisari < ***@***.***> wrote:
I propose that we take a practical approach - who uses letters as
delimiters? Nobody in their right mind. So, let's only allow symbols.
I.e. instead of having
rule %r(m(\S).*\1[msixpodualngc]*), re_tok
We could have
rule %r(m([\/\\!\{\}\(\)@<>,;%&]).*\1[msixpodualngc]*), re_tok
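As a rough comparison of the two patterns quoted above, the sketch below runs both against a hash line and against a symbol-delimited regex. The constant names and sample strings are assumptions for illustration, and the `\A` anchor stands in for the lexer trying a rule at the current position.

```ruby
# Both patterns are copied from the rules quoted above; only the names and
# the \A anchor are added for this sketch.
BROAD_RULE       = /\Am(\S).*\1[msixpodualngc]*/
SYMBOL_ONLY_RULE = /\Am([\/\\!\{\}\(\)@<>,;%&]).*\1[msixpodualngc]*/

SAMPLES = ["mov => 'vom',", "m!foo!i", "m,foo,"].freeze

# For each sample, record which pattern would claim it as a regex.
RESULTS = SAMPLES.map do |text|
  [text, { broad: BROAD_RULE.match?(text), symbol_only: SYMBOL_ONLY_RULE.match?(text) }]
end.to_h

RESULTS.each do |text, verdict|
  puts format('%-16s broad=%-5s symbol_only=%s', text, verdict[:broad], verdict[:symbol_only])
end
```

The symbol-only pattern leaves the hash line alone while still matching the symbol-delimited forms. It says nothing about the backtracking concern raised in the reply, which is a separate question.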
|
I see. What if we move this section towards the very end of the root?
# Perl allows any non-whitespace character to delimit
# a regex when `m` is used.
rule %r(m(\S).*\1[msixpodualngc]*), re_tok
rule %r(((?<==~)|(?<=\())\s*/(\\\\|\\/|[^/])*/[msixpodualngc]*),
re_tok, :balanced_regex
I just tried this and it fixes the problem. EDIT: ah no wait it doesn't, it shows the regexes incorrectly...
|
So the current bug with incorrect syntax highlighting is super minor compared to the backtrack error, but it's affecting us at Booking.com using GitLab too. I agree that avoiding all kinds of catastrophic backtracks would be great though. I think @miparnisari's suggestion is great, and I've locally done the following
which fixes the following example
|
Related fun reading: https://perldoc.perl.org/perlop.html#Regexp-Quote-Like-Operators
> With the m you can use any pair of non-whitespace (ASCII) characters as
> delimiters. This is particularly useful for matching path names that
> contain "/" , to avoid LTS (leaning toothpick syndrome). If "?" is the
> delimiter, then a match-only-once rule applies, described in m?PATTERN?
> below. If "'" (single quote) is the delimiter, no variable interpolation is
> performed on the PATTERN. When using a delimiter character valid in an
> identifier, whitespace is required after the m.
I tried to find some code in the Perl source code to understand how they're differentiated from, but didn't have much luck. https://metacpan.org/pod/PPI#Background
I'll add some local samples for the Perl code though.
Fixes: rouge-ruby#981
Signed-off-by: William Stewart <[email protected]>
I don't see the point of moving these rules around if they break syntax highlighting for regexes. The whole point of those rules is to highlight regexes! :P |
So my suggestion, based on your first suggestion, still highlights regexes, and doesn't move things around too much. I still need to see how badly it fails if someone uses a really strange regex delimiter, like a letter that could also be used in a variable name, in a file as big as the one that caused the GitLab issue. I think that case is what your original version is trying to highlight. |
I'd much rather have no regex highlighting than have broken hash highlighting. |
As far as I can tell, the current rule matches all sorts of things that are then interpreted as starting a regex:
While Perl does allow any characters to delimit regexes, notice this from perlop:
For here:
And in fact, the whitespace is optional on all of the quote-like operators, like
When working on long legacy code files, the current state of highlighting is much more likely to be wrong than correct, as errors tend to pile up or misquote hundreds of lines. The current state here provides so much more noise than signal that removing Perl highlighting support entirely would make my code review work easier. But I think it's fixable. |
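The perlop rule mentioned above (a delimiter that is valid in an identifier needs whitespace after the `m`) can be sketched as two separate patterns. This is only an illustration under simplifying assumptions: the pattern and method names are hypothetical, bracket pairs such as `m{...}` are not handled, and these are not the rules that ended up in Rouge.

```ruby
# Hypothetical patterns for illustration; not the rules merged into Rouge.

# Symbol delimiter: it may follow `m` directly. The body allows backslash
# escapes, matches lazily, and (with the /m flag) may span lines.
SYMBOL_DELIMITED = /\Am([^\w\s])(?:\\.|(?!\1).)*?\1[msixpodualngc]*/m

# Word-character delimiter: whitespace is required between `m` and the
# delimiter, so identifiers such as `mov` can never start a match.
WORD_DELIMITED = /\Am\s+(\w)(?:\\.|(?!\1).)*?\1[msixpodualngc]*/m

# Returns the text of a match operator at the start of `text`, or nil.
def perl_match_operator(text)
  match = SYMBOL_DELIMITED.match(text) || WORD_DELIMITED.match(text)
  match && match[0]
end

[
  "mov => 'vom',",  # hash key, must not match
  'my $x = 1;',     # keyword, must not match
  'm!foo!i',        # symbol delimiter
  'm xfoox',        # word delimiter, separated by whitespace
  'm!a\!b!'         # escaped delimiter inside the pattern
].each do |sample|
  puts "#{sample.inspect} -> #{perl_match_operator(sample).inspect}"
end
```

Splitting the two cases means hash keys and identifiers that begin with `m` never look like a regex, while letter delimiters remain supported when they are written the way Perl itself requires.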
Hi all :) I wanted to apologise for having this be outstanding for so long. We've recently got things moving again on the project but have generally been focusing on PRs. I'm afraid I hadn't noticed this issue until @labster's comment. I will say the current behaviour sounds pretty bad. I'll have a look myself and try to report back ASAP on what approach looks best to take. Thanks for all the points raised so far; it's all been helpful for me in understanding better what's gone wrong. |
I should also note that in a lexer pretty much every |
Perl allows arbitrary non-whitespace delimiters in regular expressions. This commit fixes the rule to capture these regular expressions so that it does not capture identifiers, hash keys and other elements of syntax that begin with 'm' but are not regular expressions. It fixes rouge-ruby#981.
OK, there's a lot going on here. TLDR: I basically think @labster's suggestion is correct and it's been submitted as PR #1160.
Problem 1: Regular expressions with arbitrary delimiters
As has been discussed, Perl allows for regular expressions to be defined with arbitrary non-whitespace delimiters. The initial fix for this led to broken behaviour where it would capture certain elements of syntax that began with `m`.
One way of addressing this problem is to not extend Rouge's syntax highlighting to cover all these possible expressions. In particular, Rouge could only cover delimiters that are not word characters. This has the virtue of simplicity. Given how rarely these types of regular expressions are used, this seems like it might be a worthwhile trade-off. Chroma, the popular syntax highlighter written in Go, takes this approach.
However, this is really an admission of defeat and, moreover, an unnecessary one. The difficulty of lexing this kind of construct is being able to distinguish between regular expressions and other syntax. But this concern is based on a misunderstanding of how these regular expressions work in Perl. As @labster noted, delimiters that are valid characters for an identifier must be separated by whitespace from the `m`.
PR #1160 follows @labster's suggestion. It modifies that code to allow for backslash escape characters, to use non-greedy matches and to match across lines.
Problem 2: Rouge hanging on parsing backtick-quoted strings
As @dblessing noted, the impetus for the original fix was a report made to GitLab about source code that was causing Rouge to hang indefinitely. Honestly, that discussion is quite long and so I might be wrong about this, but it looked to me like the analysis misunderstood what was causing the problem. The file that caused Rouge to hang was
As best as I can determine, this problem is caused by the combination of character classes in the rule for backtick-quoted strings being mashed together in a single regular expression.
The problem can be solved by using a separate state for these strings. This is common in other lexers but is not the approach that the Perl lexer takes (neither does the one in Pygments on which Rouge's is based). This problem is not merely theoretical. If the PR proposed above is accepted, a pathological case can be fed through to Rouge by removing the space between a regular expression and a character that can be part of a valid identifier (eg.
For this reason, PR #1161 has been submitted that fixes this problem. I have not been able to craft a pathological case for single-quoted strings and double-quoted strings but I would presume they can be similarly affected. This also raises the question of whether any rule that tries to capture an expression using delimiters (whether a regular expression or a quoted string) would suffer from this issue. I'm not sure. I worry that it would, but if anyone else is confident about the answer to this, please let us know.
Conclusion
I have submitted PR #1160 to address the issue identified by @miparnisari and #1161 to provide a more comprehensive fix for #981. These PRs are intended to work, and be accepted, together. |
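The "separate state" idea can be sketched outside Rouge with Ruby's standard `StringScanner`. This is not the code in PR #1161; the method name and the exact sub-patterns are assumptions. The point is only that, once inside the string, each step uses a small pattern with no nested alternation, so the work grows linearly with the input.

```ruby
require 'strscan'

# Hypothetical helper: reads a backtick-quoted string from the start of
# `source` using a tiny "inside the string" state instead of one large
# regular expression. Returns the body, or nil if there is no complete string.
def scan_backtick_string(source)
  scanner = StringScanner.new(source)
  return nil unless scanner.scan(/`/)   # opening delimiter: enter the state

  body = +''
  until scanner.eos?
    if scanner.scan(/\\./m)             # escaped character, kept verbatim
      body << scanner.matched
    elsif scanner.scan(/`/)             # closing delimiter: leave the state
      return body
    elsif scanner.scan(/[^\\`]+/)       # run of ordinary characters
      body << scanner.matched
    else
      body << scanner.getch             # lone trailing backslash
    end
  end
  nil                                   # input ended before the closing backtick
end

puts scan_backtick_string('`ls -l`').inspect
puts scan_backtick_string('`echo \`date\``').inspect
puts scan_backtick_string('`' + 'a' * 100_000).inspect
```

An unterminated string, like the last call, simply runs to the end of the input once and returns `nil` rather than retrying alternative ways of splitting the text.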
Thanks for the quick response, at least from my point of view. I'd have proposed the PR myself, but I still need to learn Ruby, and I'm not quite up for that level of yak shaving today. I think that in general, there are still improvements to be made to the Perl lexer, which is still about half the length of the Ruby lexer. I may try out some small PRs there in the future, or maybe even attempt a Perl 6 lexer. But great progress for now, thanks! |
Perl allows arbitrary non-whitespace delimiters in regular expressions. This commit fixes the rule to capture these regular expressions so that it does not capture identifiers, hash keys and other elements of syntax that begin with 'm' but are not regular expressions. It fixes #981.
@labster Yeah, definitely think Perl doesn't get much love when it comes to a lot of the syntax highlighting libraries. Hopefully we can make it better :) My knowledge of Perl is undoubtedly worse than your knowledge of Ruby but please feel free to suggest any changes you think of! |
Thank you so much for the fixes, folks. So excited to see this issue closed. 🙏 Can we get a release, so I can make an MR to GitLab to upgrade their version of rouge? |
@zoidbergwill New release just went out :) |