-
Notifications
You must be signed in to change notification settings - Fork 444
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
consider providing \< and \> #469
Comments
Note that in PCRE2 (as discussed in BurntSushi/ripgrep#1151) PCRE2 does not support Just some food for thought. |
Grep supports them, as well as other gnu tools like sed, and emacs. Vim obviously supports them. The primary reason why I'd like them implemented isn't that \b doesn't work for most use cases but mostly because I habitually type < instead of \b due to use in vim which doesn't support \b.. I also find it easier to think about wrapping a word in < > mentally verses using the same boundary key for left and right side.. It's not a huge deal as much as it required retraining to use different sequences in different programs. It can be worked around well enough... I understand that specific word boundaries aren't all that useful because you generally follow up with either a word or non word character on at least one side. So yea they don't necessarily hold their weight in terms of adding useful features... It's more about compatibility with other tools. |
I think I'm just going to say no to this feature now. I'm happy to revisit it in the future, but I think I'd want something compelling. It is also worth nothing that even GNU grep supports
So it kind of seems like vim is the oddball here that doesn't support |
See this thread on HN for a discussion with a user that feels strongly enough about using |
I would like to argue for this, because there is a common workflow (for me at least) that this gets in the way:
Given there is a patch (the URL from HN points to hvdijk@511c648 to make it easier to find, I find HN to be a difficult to read wall of text), is there any reason why we couldn't merge that patch? Could I help? But also, if you really don't want to add |
@KoviRobi I'm not sure what you mean by ignoring them. Ignoring them would mean that, e.g., |
I'll re-open this for now. We can consider adding this once #656 is done. |
Ah I misunderstood the issue I linked, I thought it ignored the escape characters at all, so
Excellent, thank you -- let me know if I can be of any help |
In addition to
I like The I've thought for some time about how to implement something like One downside of this is that adding new look-around assertions will require breaking change releases of both My plan at this point is to think a bit more about working around this in ripgrep without adding the new word boundary assertions to the regex engine. If I can come up with something that makes me happy, then I might forgo this for now. |
I've been thinking about names because I really don't like |
This is in preparation for adding 8 new word boundary look-around assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along with Unicode and ASCII-only variants of each. Ref #469
This adds Ast and Hir support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469
This adds AST support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469
This builds on the previous commit to bring word boundary support to the HIR, and updates AST->HIR translation to produce them from the corresponding AST elements. Ref #469
This is in preparation for adding 8 new word boundary look-around assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along with Unicode and ASCII-only variants of each. Ref #469
This adds AST support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469
This builds on the previous commit to bring word boundary support to the HIR, and updates AST->HIR translation to produce them from the corresponding AST elements. Ref #469
This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469
I have this implemented in #1098 and I'm currently planning to include it in the
|
This is in preparation for adding 8 new word boundary look-around assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along with Unicode and ASCII-only variants of each. Ref #469
This adds AST support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469
This builds on the previous commit to bring word boundary support to the HIR, and updates AST->HIR translation to produce them from the corresponding AST elements. Ref #469
This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469
I haven't looked through the patch series yet, but this caught my eye. |
Thanks for checking! No, it doesn't. This change doesn't make
|
I guess the problem then is that the code linked above is going to strip the I would definitely recommend people rather use the But yeah, I concur that adding it as a meta-character would seem to be a mistake as it would become impossible to encode So I don't believe there are any major issues or objections on my side of this. |
Eh? No. The regex Is there some existing valid regex pattern whose behavior or interpretation changes as a result of this PR? Also, where and how are you using this crate? (Out of curiosity.) |
Sorry, for not making things explicitly clear, I'm dealing with 2 layers of escaping: quoting from 1
There is a simple testcase usage here2, and an example where we use I use the crates in my lex/yacc lsp implementation4, instead of codegen it builds parse tables at runtime and so you can modify grammars and modify/parse input sources at runtime from a traditional lex/yacc specification. Anyhow, from what I can tell the existing code I linked should be acceptable without changes, with the escaping woes stemming from the layer of escaping on top of regex. Footnotes
|
Yes, you shouldn't have to change anything. Nothing in this PR should change the behavior or semantics of any existing valid regex. And since |
This is in preparation for adding 8 new word boundary look-around assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along with Unicode and ASCII-only variants of each. Ref #469
This adds AST support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469
This builds on the previous commit to bring word boundary support to the HIR, and updates AST->HIR translation to produce them from the corresponding AST elements. Ref #469
This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469
Thank you for this, I was going to give it a try by patching ripgrep/$ cargo r -- \<foo\>
Finished dev [unoptimized + debuginfo] target(s) in 0.03s
Running `target/debug/rg '\<foo\>'`
regex parse error:
(?:\<foo\>)
^^
error: unrecognized escape sequence possibly I have just misunderstood something (my understanding is ripgrep uses this engine, but maybe it also has some syntax checking that needs to be patched?) |
@KoviRobi You might need to explicitly patch |
Note also that ripgrep's |
@BurntSushi thank you, I have updated https://github.com/KoviRobi/ripgrep/tree/use-word-boundary-regex by updating the ripgrep/$ cargo r -- \<oo\>
Finished dev [unoptimized + debuginfo] target(s) in 0.03s
Running `target/debug/rg '\<oo\>'`
crates/regex/src/word.rs
289: assert_eq!(Some((3, 7)), find(r"f?oo!?", "##\nfoo!##")); And being able to use |
Aye yeah, both |
This is in preparation for adding 8 new word boundary look-around assertions: \b{start}, \b{end}, \b{start-half} and \b{end-half}, along with Unicode and ASCII-only variants of each. Ref #469
This adds AST support for the following new assertions: \b{start}, \b{end}, \b{start-half}, \b{end-half}, \< and \>. The last two, \< and \>, are aliases for \b{start} and \b{end}. The parsing for this is a little suspect since there's a little ambiguity between, e.g., \b{5} and \b{start}, but we handle it by allowing the parser to look for one of the new special assertions, and then back-up if it fails to find one so that it can try to parse a counted repetition. Ref #469
This builds on the previous commit to bring word boundary support to the HIR, and updates AST->HIR translation to produce them from the corresponding AST elements. Ref #469
This was substantially easier. Coupling, private abstractions and slow code are so much easier to deal with. Ref #469
@KoviRobi You should be good to build ripgrep from master now! |
@BurntSushi thanks! Also just saw your last commit message about fixing up past commits being an effort, have you come across https://github.com/tummychow/git-absorb ? It makes the fixup workflow extremely simple |
Yes, I use it all the time. The issue there wasn't finding the right commit, but the conflicts that I guessed would result from inserting the change in the Cargo.lock file. (I didn't actually try it though, it was just my sense of things at the time.) |
The
\<
and\>
escape sequences correspond to "left word boundary" and "right word boundary," respectively. They are more specific variants of\b
.There was a request to add them to ripgrep, but in reality, this is more a request for the regex engine itself.
It's not clear whether these are worthwhile things to add, but they are present in other regex engines. There's a secondary question of whether their negation should be support too, although I don't think that's a common thing.
Since unrecognized escape sequences produce a syntax error today, these can be added in a backwards compatible way at any time.
The text was updated successfully, but these errors were encountered: