Perl-style shorthands (like `\d`) not recognized, only POSIX ones (like `[[:digit:]]`) #36

asarkar · 2022-07-07T09:03:39Z

The pattern `\\d+|\\b[a-zA-Z']+\\b` fails to find the digits in the input "testing, 1, 2 testing". The regex is correct, as can be tested here: https://regex101.com/r/griuTm/1.

Changing the pattern to `\\b[0-9a-zA-Z']+\\b` works, but it changes the intent because it would make the input "123abc" valid. `\\b[0-9]+\\b|\\b[a-zA-Z']+\\b` works too.

andreasabel · 2022-07-07T11:47:39Z

Could you submit a small Haskell program demonstrating the problem?
Then it would be easy to compare the behavior of regex-tdfa to the other implementations, like regex-pcre, regex-posix etc.

asarkar · 2022-07-07T17:17:28Z

Perhaps this will help, taken from my StackOverflow question.

module WordCount (wordCount) where

import qualified Data.Char as C
import qualified Data.List as L
import Text.Regex.TDFA as R

wordCount :: String -> [(String, Int)]
wordCount xs =
  do
    let zs = R.getAllTextMatches (xs =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map C.toLower w | w <- zs]
    return (head g, length g)

andreasabel · 2022-07-07T18:42:22Z

What the others do:

regex-pcre does what you want (finds the digits).
regex-posix finds no matches

Concerning regex-tdfa, if you look up the documentation at https://hackage.haskell.org/package/regex-tdfa under section Special characters, \d is not included. Thus, you should not be surprised it is not supported.

It even says explicitly:

> regex-tdfa only supports a small set of special characters and is much less featureful than some other regex engines you might be used to, such as PCRE.

So, the easiest solution for you might be to use regex-pcre.
(Not sure what your intention with filing this report was, maybe you want to PR.)

asarkar · 2022-07-07T19:43:50Z

I found this library looking for a regex package, and saw it mentioned in the Haskell wiki, and in a blog that’s now part of the README. I compared various libraries based on their maintainability (last commit date) and popularity (GitHub stars, issues addressed promptly), and this one came out at the top. Because of that, I’m indeed surprised that something as common as `\d` isn’t supported. I’m a Haskell freshman and don’t have the skills yet to start making PRs on a general-purpose library.

andreasabel · 2022-07-14T11:55:31Z

Predefined character classes we could support are listed here:
https://en.wikipedia.org/w/index.php?title=Regular_expression&section=13#Character_classes

One could recognize them either directly in the parser:

regex-tdfa/lib/Text/Regex/TDFA/ReadRegex.hs, line 94 in 95d47cb:

    p_escaped = char '\\' >> anyChar >>= \c -> char_index >>= return . (`PEscape` c)

Maybe it is better to handle them in the translation:

regex-tdfa/lib/Text/Regex/TDFA/CorePattern.hs, lines 537 to 538 in 95d47cb:

    -- otherwise escape codes are just the escaped character
    PEscape {} -> one
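
To make the translation idea concrete, here is a minimal sketch of the lookup such a step might perform: given the character after the backslash, decide whether it is a shorthand and, if so, which POSIX class it stands for and whether it is negated. The names `Polarity` and `perlShorthand` are hypothetical and do not exist in regex-tdfa; the sketch also does not touch the library's real `PEscape` handling.

```haskell
-- Hypothetical sketch; none of these names exist in regex-tdfa.
-- Polarity records whether the shorthand includes or excludes the class.
data Polarity = Positive | Negated
  deriving (Eq, Show)

-- Map the character after a backslash to a polarity and a POSIX class name.
-- Nothing means "not a shorthand": the escape keeps its current meaning,
-- i.e. it stands for the escaped character itself.
perlShorthand :: Char -> Maybe (Polarity, String)
perlShorthand c = case c of
  'd' -> Just (Positive, "digit")
  'D' -> Just (Negated, "digit")
  's' -> Just (Positive, "space")
  'S' -> Just (Negated, "space")
  -- \w is left out on purpose: it is usually alnum plus '_',
  -- which is not a single POSIX class name.
  _ -> Nothing
```

Returning `Nothing` for everything else keeps the fallback explicit, which is where the backward-compatibility question would have to be decided.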

andreasabel · 2022-07-14T12:32:06Z

There seems to be already code for POSIX character classes:

regex-tdfa/lib/Text/Regex/TDFA/TNFA.hs, lines 798 to 805 in 95d47cb:

    -- | This returns the distinct ascending list of characters
    -- represented by [: :] values in legalCharacterClasses; unrecognized
    -- class names return an empty string
    decodeCharacterClass :: PatternSetCharacterClass -> String
    decodeCharacterClass (PatternSetCharacterClass s) =
      case s of
        "alnum" -> ['0'..'9']++['a'..'z']++['A'..'Z']
        "digit" -> ['0'..'9']

These can be given to Patterns PAny and PAnyNot:

regex-tdfa/lib/Text/Regex/TDFA/Pattern.hs, lines 45 to 46 in 95d47cb:

    | PAny    {getDoPa::DoPa,getPatternSet::PatternSet} -- Square bracketed things
    | PAnyNot {getDoPa::DoPa,getPatternSet::PatternSet} -- Inverted square bracketed things

@asarkar: The syntax accepted by regex-tdfa is [[:digit:]] instead of \d. ~~See https://regex101.com/r/griuTm/1 for your whole regex.~~
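
For reference, a sketch of the reporter's `wordCount` with `\d` replaced by `[[:digit:]]`. It assumes the `regex-tdfa` package is available and keeps the `\b` anchors from the original pattern; the list comprehension replacing `head` is a stylistic choice, not part of the original report.

```haskell
import Data.Char (toLower)
import Data.List (group, sort)
import Text.Regex.TDFA (getAllTextMatches, (=~))

-- Count case-insensitive occurrences of numbers and words.
-- The only change to the pattern is [[:digit:]] in place of \d.
wordCount :: String -> [(String, Int)]
wordCount xs =
  [ (w, length g) | g@(w : _) <- group (sort (map (map toLower) matches)) ]
  where
    matches =
      getAllTextMatches (xs =~ "[[:digit:]]+|\\b[a-zA-Z']+\\b") :: [String]
```

With the input from the report, `wordCount "testing, 1, 2 testing"` should now include entries for `"1"` and `"2"` alongside `"testing"`.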

asarkar · 2022-07-14T17:26:43Z

https://regex101.com/r/griuTm/1 shows \d, is that the correct link?

andreasabel · 2022-07-15T08:09:25Z

> regex101.com/r/griuTm/1 shows \d, is that the correct link?

No, \d should be replaced by [[:digit:]]. I updated the regex, but the link didn't update.

Supporting Perl-style regexes like \d would not be hard to implement, but it would be a backward-incompatible change, because currently \d means simply d. So, I am not sure whether it is worth it. While \d is quicker to type, [[:digit:]] is easier to comprehend if you look at a regex. What is your application of regex-tdfa?
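
One way to sidestep the compatibility concern without changing the library is an opt-in rewrite on the caller's side: expand the shorthands into POSIX bracket expressions before the pattern reaches regex-tdfa. The sketch below is only that, a sketch. `expandShorthands` is a hypothetical helper, not part of regex-tdfa; the `\w` expansion is an approximation of what Perl-style engines mean by it; and the function does not track bracket expressions, so a shorthand written inside `[...]` would be rewritten incorrectly.

```haskell
-- Hypothetical helper, not part of regex-tdfa: rewrite a few Perl-style
-- shorthands into POSIX bracket expressions. Patterns that never pass
-- through this function keep their current meaning.
expandShorthands :: String -> String
expandShorthands ('\\' : c : rest) = replacement ++ expandShorthands rest
  where
    replacement = case c of
      'd' -> "[[:digit:]]"
      'D' -> "[^[:digit:]]"
      's' -> "[[:space:]]"
      'S' -> "[^[:space:]]"
      'w' -> "[[:alnum:]_]"
      'W' -> "[^[:alnum:]_]"
      _ -> ['\\', c] -- any other escape, including \\, passes through unchanged
expandShorthands (x : rest) = x : expandShorthands rest
expandShorthands [] = []
```

Because escapes are consumed in pairs, an escaped backslash followed by `d` is left alone rather than being mistaken for `\d`.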

asarkar · 2022-07-15T08:19:44Z

I intend to use regex-tdfa to solve some exercises from https://exercism.org/tracks/haskell. An alternative is to use a parser combinator library like Megaparsec, but that is significantly harder.

For example, given below is a question that I previously solved in Rust using a regex library. The pattern I used was (?:[2-9][0-9]{2}){2}(?:[0-9]{4}).

They have a predefined list of packages they allow; regex-tdfa is not currently in that list, but I've submitted a PR to get it included.

If you're reluctant to make this change, and I'm not talking about \d only, I'll be happy to use any other regex package, but like I said before, it doesn't seem like there are a lot of great options.

Clean up user-entered phone numbers so that they can be sent SMS messages.

The North American Numbering Plan (NANP) is a telephone numbering system used by many countries in North America like the United States, Canada or Bermuda. All NANP-countries share the same international country code: 1.

NANP numbers are ten-digit numbers consisting of a three-digit Numbering Plan Area code, commonly known as area code, followed by a seven-digit local number. The first three digits of the local number represent the exchange code, followed by the unique four-digit number which is the subscriber number.

The format is usually represented as

(NXX)-NXX-XXXX
where N is any digit from 2 through 9 and X is any digit from 0 through 9.

Your task is to clean up differently formatted telephone numbers by removing punctuation and the country code (1) if present.

For example, the inputs

+1 (613)-995-0253
613-995-0253
1 613 995 0253
613.995.0253
should all produce the output

6139950253

Note: As this exercise only deals with telephone numbers used in NANP-countries, only 1 is considered a valid country code.

andreasabel · 2022-07-15T08:28:16Z

This exercise would be https://exercism.org/tracks/haskell/exercises/phone-number .

Please bear with me, I still have trouble understanding the importance of supporting \d etc.

> For example, given below is a question that I previously solved in Rust using a regex library. The pattern I used was (?:[2-9][0-9]{2}){2}(?:[0-9]{4}).

Ok, but this should be fine, as \d is here spelled out as [0-9].
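
As an illustration, a sketch of how the exercise might be approached with regex-tdfa using only `[0-9]`-style classes. It assumes the `regex-tdfa` package is available. `clean` is a hypothetical helper name, and the pattern is written without the `(?:...)` groups of the Rust version, on the assumption that a POSIX-style syntax does not offer non-capturing groups.

```haskell
import Data.Char (isDigit)
import Text.Regex.TDFA ((=~))

-- Hypothetical helper: strip punctuation and an optional country code 1,
-- then check the NANP shape NXX NXX XXXX with a POSIX-style pattern.
clean :: String -> Maybe String
clean input
  | (national =~ "^[2-9][0-9]{2}[2-9][0-9]{6}$" :: Bool) = Just national
  | otherwise = Nothing
  where
    digits = filter isDigit input
    -- Drop a leading 1 only when it is followed by exactly ten digits.
    national = case digits of
      '1' : rest | length rest == 10 -> rest
      _ -> digits
```

Filtering to digits first keeps the pattern itself short; the regex only has to validate the area code and exchange code constraints.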

> They have a predefined list of packages they allow; regex-tdfa is not currently in that list, but I've submitted a PR to get it included.

Please share the link to the PR if that's fine with you.

Would supporting \d etc. be a requirement to have regex-tdfa included?

asarkar · 2022-07-15T08:37:45Z

> Would supporting \d etc. be a requirement to have regex-tdfa included?

No, the PR's been merged.
exercism/haskell-test-runner#52

> the importance of supporting \d etc.

The importance, at least to me, is brevity and conciseness. If, in your opinion, what I said so far doesn't justify the change, I've nothing further to add to this discussion. Please make a decision, and either proceed to implement this ticket, or don't, I'm going to get my coat.

andreasabel · 2022-07-15T08:45:05Z

Ok, thanks for your input, @asarkar ! I need to balance between convenience and stability. I'll leave this open and see if other users chime in.

asarkar changed the title ~~Doesn't honor OR~~ Doesn't honor OR or predefined groups Jul 7, 2022

asarkar changed the title ~~Doesn't honor OR or predefined groups~~ Doesn't honor predefined groups Jul 7, 2022

andreasabel added the info-needed More information (like MWE) is needed (e.g. from reporter) label Jul 7, 2022

andreasabel added faq User question and removed info-needed More information (like MWE) is needed (e.g. from reporter) labels Jul 7, 2022

andreasabel changed the title ~~Doesn't honor predefined groups~~ Perl-style shorthands (like \d) not recognized, only POSIX ones (like [[:digit:]]) Jul 14, 2022

andreasabel self-assigned this Jul 14, 2022

andreasabel added the enhancement New feature or request label Jul 15, 2022
