POSIX character classes supported outside of char ranges `[...]` #175

birkenfeld · 2016-02-22T11:55:43Z

This crate's regex syntax allows the POSIX-like character classes outside of brackets. In all other regex incarnations I know, you have to write [[:alpha:]]. This crate seems to allow [:alpha:].

I think this is an extension to usual syntax extension, but sadly it makes the syntax needlessly incompatible. I assume you don't allow [: to start a normal character range?

The text was updated successfully, but these errors were encountered:

BurntSushi · 2016-02-22T12:02:24Z

I'm laughing, because it looks like it is treated specially in the parser (so that, e.g., [:alphaa:] is just a regular character class). Here's the method in question:

    // Parses an ASCII class, e.g., `[:alnum:]+`.
    //
    // Start: `[`
    // End:   `+`
    //
    // Also supports negation, e.g., `[:^alnum:]`.
    //
    // This parsing routine is distinct from the others in that it doesn't
    // actually report any errors. Namely, if it fails, then the parser should
    // fall back to parsing a regular class.
    //
    // This method will only make progress in the parser if it succeeds.
    // Otherwise, the input remains where it started.
    fn maybe_parse_ascii(&mut self) -> Option<CharClass>

Honestly, I don't remember explicitly thinking that we should diverge from the rest of the world on this point. I think this is what happened:

I personally never ever use [[:alpha:]] and their ilk, so I have zero experience with them.
Given my abundance of experience, I think I just naturally assumed that [:alpha:] would work.

I think you are indeed right that it doesn't.

I feel like the downside here is pretty minimal. It's true that it technically makes the syntax incompatible, but if you're porting a regex from some other regex engine to this one, then you're unlikely to trip over it. The reverse direction on the other hand...

Is this something worth changing for 1.0? I feel like the only compelling reason is to be consistent with other regex engines because more consistency is better when possible.

birkenfeld · 2016-02-22T12:09:23Z

I agree that if it reverts to normal ranges, the incompatibility is minimal. The only instance where I could see [:alpha:] meant as a character range is in generated regexes. Otherwise nobody would put the a and the : in there twice.

Still it is an incompatibility -- if there were no restraints on this crate's backwards compatibility I would remove it. There probably are though, so - your call :)

If you keep it, please explicitly mention this in the docs though. It will trip up people familiar with other regex libs otherwise.

BurntSushi · 2016-02-22T12:12:56Z

I plan on making a small number of breaking changes for 1.0, but I think this is the first one that breaks the semantics of matching. Nevertheless, it's pretty small, and consistency is important, so I'll add it to the 1.0 milestone for now. :-)

(If it doesn't get changed in 1.0 then I'd we're probably stuck with this for good.)

BurntSushi · 2016-12-30T01:44:34Z

Interestingly, some regex engines (like Go and presumably RE2) support formulations like [[:upper:][:blank:]] while Python's does not. I think that's a fine thing to do.

I will still fix this particular issue by treating [:upper:] as a character class containing the characters :upper:.

regex 0.2 0.2.0 ===== This is a new major release of the regex crate, and is an implementation of the [regex 1.0 RFC](https://github.com/rust-lang/rfcs/blob/master/text/1620-regex-1.0.md). We are releasing a `0.2` first, and if there are no major problems, we will release a `1.0` shortly. For `0.2`, the minimum *supported* Rust version is 1.12. There are a number of **breaking changes** in `0.2`. They are split into two types. The first type correspond to breaking changes in regular expression syntax. The second type correspond to breaking changes in the API. Breaking changes for regex syntax: * POSIX character classes now require double bracketing. Previously, the regex `[:upper:]` would parse as the `upper` POSIX character class. Now it parses as the character class containing the characters `:upper:`. The fix to this change is to use `[[:upper:]]` instead. Note that variants like `[[:upper:][:blank:]]` continue to work. * The character `[` must always be escaped inside a character class. * The characters `&`, `-` and `~` must be escaped if any one of them are repeated consecutively. For example, `[&]`, `[\&]`, `[\&\&]`, `[&-&]` are all equivalent while `[&&]` is illegal. (The motivation for this and the prior change is to provide a backwards compatible path for adding character class set notation.) * A `bytes::Regex` now has Unicode mode enabled by default (like the main `Regex` type). This means regexes compiled with `bytes::Regex::new` that don't have the Unicode flag set should add `(?-u)` to recover the original behavior. Breaking changes for the regex API: * `find` and `find_iter` now **return `Match` values instead of `(usize, usize)`.** `Match` values have `start` and `end` methods, which return the match offsets. `Match` values also have an `as_str` method, which returns the text of the match itself. * The `Captures` type now only provides a single iterator over all capturing matches, which should replace uses of `iter` and `iter_pos`. Uses of `iter_named` should use the `capture_names` method on `Regex`. * The `replace` methods now return `Cow` values. The `Cow::Borrowed` variant is returned when no replacements are made. * The `Replacer` trait has been completely overhauled. This should only impact clients that implement this trait explicitly. Standard uses of the `replace` methods should continue to work unchanged. * The `quote` free function has been renamed to `escape`. * The `Regex::with_size_limit` method has been removed. It is replaced by `RegexBuilder::size_limit`. * The `RegexBuilder` type has switched from owned `self` method receivers to `&mut self` method receivers. Most uses will continue to work unchanged, but some code may require naming an intermediate variable to hold the builder. * The free `is_match` function has been removed. It is replaced by compiling a `Regex` and calling its `is_match` method. * The `PartialEq` and `Eq` impls on `Regex` have been dropped. If you relied on these impls, the fix is to define a wrapper type around `Regex`, impl `Deref` on it and provide the necessary impls. * The `is_empty` method on `Captures` has been removed. This always returns `false`, so its use is superfluous. * The `Syntax` variant of the `Error` type now contains a string instead of a `regex_syntax::Error`. If you were examining syntax errors more closely, you'll need to explicitly use the `regex_syntax` crate to re-parse the regex. * The `InvalidSet` variant of the `Error` type has been removed since it is no longer used. * Most of the iterator types have been renamed to match conventions. If you were using these iterator types explicitly, please consult the documentation for its new name. For example, `RegexSplits` has been renamed to `Split`. A number of bugs have been fixed: * [BUG #151](#151): The `Replacer` trait has been changed to permit the caller to control allocation. * [BUG #165](#165): Remove the free `is_match` function. * [BUG #166](#166): Expose more knobs (available in `0.1`) and remove `with_size_limit`. * [BUG #168](#168): Iterators produced by `Captures` now have the correct lifetime parameters. * [BUG #175](#175): Fix a corner case in the parsing of POSIX character classes. * [BUG #178](#178): Drop the `PartialEq` and `Eq` impls on `Regex`. * [BUG #179](#179): Remove `is_empty` from `Captures` since it always returns false. * [BUG #276](#276): Position of named capture can now be retrieved from a `Captures`. * [BUG #296](#296): Remove winapi/kernel32-sys dependency on UNIX. * [BUG #307](#307): Fix error on emscripten.

birkenfeld · 2017-01-01T21:27:36Z

Interestingly, some regex engines (like Go and presumably RE2) support formulations like [[:upper:][:blank:]] while Python's does not. I think that's a fine thing to do.

That's because Python's engine does not support character classes at all.

birkenfeld mentioned this issue Feb 22, 2016

implement matching on byte strings #85

Closed

BurntSushi added this to the 1.0 milestone Feb 22, 2016

BurntSushi mentioned this issue Sep 27, 2016

Clarify how to use ascii character classes in [..] #286

Closed

BurntSushi mentioned this issue Dec 31, 2016

regex 0.2 #310

Merged

bors closed this as completed in bc06024 Dec 31, 2016

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

POSIX character classes supported outside of char ranges `[...]` #175

POSIX character classes supported outside of char ranges `[...]` #175

birkenfeld commented Feb 22, 2016

BurntSushi commented Feb 22, 2016

birkenfeld commented Feb 22, 2016

BurntSushi commented Feb 22, 2016

BurntSushi commented Dec 30, 2016

birkenfeld commented Jan 1, 2017

POSIX character classes supported outside of char ranges [...] #175

POSIX character classes supported outside of char ranges [...] #175

Comments

birkenfeld commented Feb 22, 2016

BurntSushi commented Feb 22, 2016

birkenfeld commented Feb 22, 2016

BurntSushi commented Feb 22, 2016

BurntSushi commented Dec 30, 2016

birkenfeld commented Jan 1, 2017

POSIX character classes supported outside of char ranges `[...]` #175

POSIX character classes supported outside of char ranges `[...]` #175