LINQ-like regular expressions #471

Closed
DartBot opened this issue Nov 16, 2011 · 9 comments
Labels
area-language: Dart language related items (some items might be better tracked at github.com/dart-lang/language).
closed-not-planned: Closed as we don't intend to take action on the reported issue.
type-enhancement: A request for a change that isn't a bug.

Comments

@DartBot

DartBot commented Nov 16, 2011

This issue was originally filed by [email protected]


Regular expressions are cryptic. Everybody has problems trying to remember how to construct a regular expression and a day later they forget what the strange thing means. There is no argument about this. The best way to fix regular expressions that I can think of is:

var regExp = startsWith 3 letters
             andThen 6 numbers or "blah"
             andThen atLeast 2 "-";

Easily readable.

@floitschG
Contributor

Putting into 'Area-Language' for now (due to the LINQ reference). However the particular RegExp example is probably more library related than language related.


Removed Type-Defect label.
Added Type-Enhancement, Area-Language, Triaged labels.

@DartBot
Author

DartBot commented Nov 16, 2011

This comment was originally written by [email protected]


"Everybody has problems trying to remember how to construct a regular expression and a day later they forget what the strange thing means. There is no argument about this."

I have to disagree. Common regexp syntax is indeed very terse, but that is actually a good thing. A regexp is not cryptic at all; it is in fact precisely the string that you want to find (!!), with some special syntax for expressing that more variants are possible in certain places. A regular expression for matching the string "abc" is exactly this: "abc". A regular expression for matching one digit is "[0-9]", and for one or more digits "[0-9]+", and I really wouldn't like to write "atLeast 1 number" instead. (And what exactly does "number" mean here? An integer or a decimal number? A digit only? Every Unicode code point that denotes a digit in some language, or 0-9 only?)
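
For concreteness, here is a small sketch of the three patterns quoted above written as Dart `RegExp` objects, together with one possible answer to the "what is a number?" question. The variable names are made up for illustration, and reading "number" as "any Unicode decimal digit" (`\p{Nd}`) is only one assumption among the several listed above.

```dart
// The three patterns from the comment, as Dart RegExp objects.
// All variable names here are illustrative, not from the thread.
final literalAbc = RegExp('abc');
final oneDigit = RegExp('[0-9]');
final oneOrMoreDigits = RegExp('[0-9]+');

// Assumption: "number" read as "any Unicode decimal digit".
// The \p{Nd} property escape only works when unicode: true is set.
final unicodeDigits = RegExp(r'\p{Nd}+', unicode: true);

void main() {
  print(literalAbc.hasMatch('xxabcxx')); // true
  print(oneDigit.hasMatch('a1')); // true
  print(oneOrMoreDigits.stringMatch('order 1234 shipped')); // 1234

  // Arabic-Indic digits three and four (U+0663, U+0664).
  const arabicIndic = '\u0663\u0664';
  print(oneOrMoreDigits.hasMatch(arabicIndic)); // false
  print(unicodeDigits.hasMatch(arabicIndic)); // true
}
```

The last two lines show why the wording matters: the same input is or is not "digits" depending on which definition the pattern encodes.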

A lot of people have trouble constructing and understanding regexps, sure, but a lot of people do not. And from my experience, people get used to the regexp syntax pretty quickly. Some alternative means of constructing regexps might be good, but maybe allowing comments in regexps, as Perl does, would be good enough (but that would allow regexp literals, I think).
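
Dart's `RegExp` constructor has no Perl-style extended (`/x`) mode, but a similar effect can be approximated with adjacent string literals, which Dart concatenates at compile time, so ordinary `//` comments can sit between the pieces. The following is only a sketch of that idea; the date pattern and the name `isoDate` are hypothetical and not taken from the thread.

```dart
// A commented pattern built from adjacent raw string literals.
// The pattern itself (an ISO-style date) is an illustrative assumption.
final isoDate = RegExp(
  r'^'
  r'(\d{4})' // group 1: four-digit year
  r'-'
  r'(\d{2})' // group 2: two-digit month
  r'-'
  r'(\d{2})' // group 3: two-digit day
  r'$',
);

void main() {
  final match = isoDate.firstMatch('2011-11-16');
  print(match?.group(1)); // 2011
  print(match?.group(2)); // 11
  print(isoDate.hasMatch('16/11/2011')); // false
}
```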

@DartBot
Author

DartBot commented Nov 16, 2011

This comment was originally written by [email protected]


"but a lot of people do not"

http://stackoverflow.com/questions/tagged/regex

It is in fact one of the top tags on Stack Overflow.

"And from my experience, people get used to the regexp syntax pretty quickly"

Yes, but that only applies if you are using regex every single day; otherwise you will forget the vast array of special codes it uses.

@DartBot
Author

DartBot commented Dec 1, 2011

This comment was originally written by [email protected]


This should be a library (I imagine it using chaining), not a language construct. I highly disagree that your example is "easily readable". You're introducing new operators and precedence rules; your formatting suggests you read it one way, but I read it completely differently:

var regExp = (startsWith 3 letters
             andThen 6 numbers) or ("blah"
             andThen atLeast 2 "-");
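
To make the library suggestion and the precedence concern concrete, here is a rough sketch of a chaining builder in Dart. `RegExpBuilder` and all of its methods are hypothetical names invented for this illustration; no such class exists in the Dart SDK. The point is that explicit nesting in `anyOf` forces the author to pick one of the two readings.

```dart
/// Hypothetical fluent builder; not part of any Dart library.
class RegExpBuilder {
  final StringBuffer _source = StringBuffer();

  /// Anchors the pattern at the start of the input.
  RegExpBuilder startOfInput() {
    _source.write('^');
    return this;
  }

  /// Appends exactly [count] repetitions of the regex fragment [atom].
  RegExpBuilder exactly(int count, String atom) {
    _source.write('(?:$atom){$count}');
    return this;
  }

  /// Appends [count] or more repetitions of the regex fragment [atom].
  RegExpBuilder atLeast(int count, String atom) {
    _source.write('(?:$atom){$count,}');
    return this;
  }

  /// Appends [text] as a literal, escaping regex metacharacters.
  RegExpBuilder literal(String text) {
    _source.write(RegExp.escape(text));
    return this;
  }

  /// Appends a non-capturing group that matches any one alternative.
  RegExpBuilder anyOf(List<RegExpBuilder> alternatives) {
    final joined = alternatives.map((b) => b._source.toString()).join('|');
    _source.write('(?:$joined)');
    return this;
  }

  RegExp build() => RegExp(_source.toString());
}

// Assumption: "letters" and "numbers" mean ASCII letters and digits.
const letter = '[A-Za-z]';
const digit = '[0-9]';

/// Reading suggested by the original formatting:
/// 3 letters, then (6 digits or "blah"), then at least 2 "-".
RegExp readingA() => RegExpBuilder()
    .startOfInput()
    .exactly(3, letter)
    .anyOf([
      RegExpBuilder().exactly(6, digit),
      RegExpBuilder().literal('blah'),
    ])
    .atLeast(2, '-')
    .build();

/// Reading from the parenthesized version:
/// (3 letters then 6 digits) or ("blah" then at least 2 "-").
RegExp readingB() => RegExpBuilder().anyOf([
      RegExpBuilder().startOfInput().exactly(3, letter).exactly(6, digit),
      RegExpBuilder().literal('blah').atLeast(2, '-'),
    ]).build();

void main() {
  print(readingA().hasMatch('abc123456--')); // true
  print(readingA().hasMatch('blah--')); // false
  print(readingB().hasMatch('blah--')); // true
  print(readingA().hasMatch('abc123456')); // false
  print(readingB().hasMatch('abc123456')); // true
}
```

The two readings accept different strings, which is the ambiguity the comment describes; with a library, the grouping is visible in the call structure rather than implied by line breaks.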

@DartBot
Author

DartBot commented Dec 1, 2011

This comment was originally written by [email protected]


These are minor complaints that are easy to fix. I meant "digits" not numbers, which clears up the first problem mentioned. As for precedence, some kind of convention where it is understood that each line starts a new section would work. The operators are English words that are already understood and easier to remember. Regex is an entire alphabet of strange codes that nobody understands, so you can hardly say Regex isn't worse with respect to this issue.

Your high disagreement does not correspond to the issues you then brought up, which are minor issues about precedence. I am pretty sure you understood the operators even if they are "new".

Assuming precedence was cleared up, and the digits issue was cleared up, any programmer could determine if a string met the criteria in the expression. If you convert that to regex and ask people who don't use regex every day, they will not be able to tell you what it means.

@gbracha
Contributor

gbracha commented Dec 14, 2011

Set owner to @gbracha.
Added Accepted label.

@DartBot
Author

DartBot commented Jan 15, 2012

This comment was originally written by [email protected]


I have now created a specification of how the language could look if anyone is interested.

All of the character classes and so on in traditional regular expressions can obviously just be represented as sets. There may be an object containing all predefined expressions such as:

class Pre {
  Set<String> lowerLetters;
  Set<String> upperLetters;
  Set<String> digits;
  Set<String> whitespace;
}

or something like that. Here are some examples.

var regex = start "$"
            next min 1 digits
            next "."
            next 2 digits
            end;
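
For comparison, the proposed expression above corresponds roughly to the following conventional pattern, shown here as a Dart sketch with each piece labelled by the line of the proposal it mirrors. It assumes that "digits" means ASCII `0-9`; the name `price` is made up for the example.

```dart
// Conventional equivalent of the proposed expression, one piece per line.
// Assumption: "digits" means the ASCII digits 0-9.
final price = RegExp(
  r'^\$' // start "$"
  r'[0-9]+' // next min 1 digits
  r'\.' // next "."
  r'[0-9]{2}' // next 2 digits
  r'$', // end
);

void main() {
  print(price.hasMatch(r'$12.50')); // true
  print(price.hasMatch(r'$12.5')); // false
  print(price.hasMatch('12.50')); // false
}
```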

Quantifiers can be expressed with the following syntax:

Given the example "googledartgoogledartgoogledart", we could have:

Greedy quantifier that matches the entire string:

var regex = start min 1 letters
            until last "dart" inclusive;

Reluctant quantifier that matches the first "googledart":

var regex = start min 1 letters
            until first "dart" inclusive;
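
The greedy and reluctant cases above map onto the standard `+` and `+?` quantifiers. A minimal Dart sketch, assuming "letters" means lowercase ASCII letters for this input (the variable names are illustrative):

```dart
const input = 'googledartgoogledartgoogledart';

// Greedy: [a-z]+ takes as much as it can, so the match ends at the last "dart".
final greedy = RegExp(r'^[a-z]+dart');

// Reluctant: [a-z]+? takes as little as it can, so the match ends at the first "dart".
final reluctant = RegExp(r'^[a-z]+?dart');

void main() {
  print(greedy.stringMatch(input)); // googledartgoogledartgoogledart
  print(reluctant.stringMatch(input)); // googledart
}
```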

Possessive quantifier:

var googleLetters = Set.from(["g", "o", "l", "e"]);
var regex = start min 1 max 6 googleLetters
            until not googleLetters;

These are examples of lookaheads, even though the keyword is "before". Dart and JavaScript do not have lookbehinds.

var regex = "google" before "dart";

var regex = "google" before not "dart";
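
In conventional syntax these two expressions would be a positive and a negative lookahead. A short Dart sketch of both, with illustrative variable names:

```dart
// "google" before "dart": positive lookahead.
final googleBeforeDart = RegExp(r'google(?=dart)');

// "google" before not "dart": negative lookahead.
final googleNotBeforeDart = RegExp(r'google(?!dart)');

void main() {
  // The lookahead checks for "dart" but does not consume it.
  print(googleBeforeDart.stringMatch('googledart')); // google
  print(googleBeforeDart.hasMatch('googlemaps')); // false

  print(googleNotBeforeDart.hasMatch('googledart')); // false
  print(googleNotBeforeDart.hasMatch('googlemaps')); // true
}
```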

Some other things:

var regex = 1 not digits;

var regex = start 1 (digits - ["0", "1", "2", "3"])
            end 3 letters;

var regex = 3 letters
            next 2 digits next 3 "-"
            or
            next 10 digits next 1 "-"
            next (3 upperLetters) as discard;

"or" applies to one line above and below if on a line by itself.
    
Anyway, that was a fun exercise, back to more productive things :)

@DartBot
Author

DartBot commented Jan 16, 2012

This comment was originally written by [email protected]


Actually, that "or" idea is terrible :) Also, one could alternatively use

var regex = start min 3 letters
            until first "dart"
            next "dart"
            next "google"
            next 1 not letters;

to remove need for "inclusive" as a keyword.

@anders-sandholm
Contributor

May be suitable for a library - not a core language feature.


Added WontFix label.

@DartBot DartBot added Type-Enhancement area-language Dart language related items (some items might be better tracked at github.com/dart-lang/language). labels May 2, 2012
@kevmoo kevmoo added closed-not-planned Closed as we don't intend to take action on the reported issue type-enhancement A request for a change that isn't a bug and removed resolution-wont_fix labels Mar 1, 2016
copybara-service bot pushed a commit that referenced this issue Oct 31, 2022
Changes:
```
> git log --format="%C(auto) %h %s" 93d0eee..49eefd2
 https://dart.googlesource.com/markdown.git/+/49eefd2 Refactor AutolinkExtensionSyntax (#471)
 https://dart.googlesource.com/markdown.git/+/07e2683 Optimise TableSyntax (#472)
 https://dart.googlesource.com/markdown.git/+/9b61871 Make helper class private that should not have been exposed (#476)
 https://dart.googlesource.com/markdown.git/+/299964e Return list for link nodes creation (#452)
 https://dart.googlesource.com/markdown.git/+/aee6a40 validate code coverage on CI (#474)
 https://dart.googlesource.com/markdown.git/+/88f3f8a Fix html entity and numeric character references (#467)

```

Diff: https://dart.googlesource.com/markdown.git/+/93d0eee771f6355be6737c2a865f613f6b105bf1~..49eefd211e7840bac7e11257cd966435ae3cb07f/
Change-Id: I2a88d7c386f567738226701be4edcd7c4818744f
Reviewed-on: https://dart-review.googlesource.com/c/sdk/+/266760
Auto-Submit: Devon Carew <[email protected]>
Commit-Queue: Oleh Prypin <[email protected]>
Reviewed-by: Oleh Prypin <[email protected]>
This issue was closed.