Perl-style shorthands (like \d) not recognized, only POSIX ones (like [[:digit:]]) #36

Open
asarkar opened this issue Jul 7, 2022 · 12 comments
Assignees
Labels
enhancement New feature or request faq User question

Comments

@asarkar

asarkar commented Jul 7, 2022

Pattern \\d+|\\b[a-zA-Z']+\\b fails to find the digits in input "testing, 1, 2 testing". The regex is correct as can be tested here https://regex101.com/r/griuTm/1.

Changing the pattern to \\b[0-9a-zA-Z']+\\b works, but it changes the intent, because input "123abc" would then be valid. \\b[0-9]+\\b|\\b[a-zA-Z']+\\b works too.

@asarkar asarkar changed the title from "Doesn't honor OR" to "Doesn't honor OR or predefined groups" Jul 7, 2022
@asarkar asarkar changed the title from "Doesn't honor OR or predefined groups" to "Doesn't honor predefined groups" Jul 7, 2022
@andreasabel
Member

Could you submit a small Haskell program demonstrating the problem?
Then it would be easy to compare the behavior of regex-tdfa to the other implementations, like regex-pcre, regex-posix etc.

@andreasabel andreasabel added the info-needed More information (like MWE) is needed (e.g. from reporter) label Jul 7, 2022
@asarkar
Author

asarkar commented Jul 7, 2022

Perhaps this will help, taken from my StackOverflow question.

module WordCount (wordCount) where

import qualified Data.Char as C
import qualified Data.List as L
import Text.Regex.TDFA as R

wordCount :: String -> [(String, Int)]
wordCount xs =
  do
    let zs = R.getAllTextMatches (xs =~ "\\d+|\\b[a-zA-Z']+\\b") :: [String]
    g <- L.group $ L.sort [map C.toLower w | w <- zs]
    return (head g, length g)

@andreasabel
Member

What the others do:

  • regex-pcre does what you want (finds the digits).
  • regex-posix finds no matches

Concerning regex-tdfa, if you look up the documentation at https://hackage.haskell.org/package/regex-tdfa under section Special characters, \d is not included. Thus, you should not be surprised it is not supported.

It even says explicitly:

regex-tdfa only supports a small set of special characters and is much less featureful than some other regex engines you might be used to, such as PCRE.

So, the easiest solution for you might be to use regex-pcre.
(Not sure what your intention with filing this report was, maybe you want to PR.)

@andreasabel andreasabel added faq User question and removed info-needed More information (like MWE) is needed (e.g. from reporter) labels Jul 7, 2022
@asarkar
Author

asarkar commented Jul 7, 2022

I found this library while looking for a regex package; it is mentioned in the Haskell wiki and in a blog post that’s now part of the README. I compared various libraries based on their maintainability (last commit date) and popularity (GitHub stars, issues addressed promptly), and this one came out on top. Because of that, I’m indeed surprised that something as common as \d isn’t supported. I’m a Haskell freshman and don’t have the skills yet to start making PRs on a general-purpose library.

@andreasabel
Member

andreasabel commented Jul 14, 2022

Predefined character classes we could support are listed here:
https://en.wikipedia.org/w/index.php?title=Regular_expression&section=13#Character_classes

One could recognize them either directly in the parser:

p_escaped = char '\\' >> anyChar >>= \c -> char_index >>= return . (`PEscape` c)

Maybe it is better to handle them in the translation:
-- otherwise escape codes are just the escaped character
PEscape {} -> one

@andreasabel
Member

andreasabel commented Jul 14, 2022

There seems to be already code for POSIX character classes:

-- | This returns the distinct ascending list of characters
-- represented by [: :] values in legalCharacterClasses; unrecognized
-- class names return an empty string
decodeCharacterClass :: PatternSetCharacterClass -> String
decodeCharacterClass (PatternSetCharacterClass s) =
  case s of
    "alnum" -> ['0'..'9']++['a'..'z']++['A'..'Z']
    "digit" -> ['0'..'9']

These can be given to Patterns PAny and PAnyNot:
| PAny {getDoPa::DoPa,getPatternSet::PatternSet} -- Square bracketed things
| PAnyNot {getDoPa::DoPa,getPatternSet::PatternSet} -- Inverted square bracketed things
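To make the idea concrete, here is a minimal sketch of how a shorthand such as \d could be mapped onto the existing POSIX class lookup. The types `CharSet`, `AnyOf` and `NoneOf` and the functions `decodeClass` and `shorthand` are hypothetical stand-ins invented for this illustration; they are not regex-tdfa's actual `Pattern` or `PatternSet` definitions, and the choice of `_` plus "alnum" for \w is an assumption.

```haskell
-- Hypothetical, simplified stand-ins for illustration only; these are
-- NOT regex-tdfa's actual Pattern or PatternSet types.
data CharSet
  = AnyOf String  -- plays the role of PAny: match one of these characters
  | NoneOf String -- plays the role of PAnyNot: match any other character
  deriving (Eq, Show)

-- POSIX class lookup, mirroring the two cases quoted above;
-- unrecognized class names give an empty string.
decodeClass :: String -> String
decodeClass name = case name of
  "alnum" -> ['0' .. '9'] ++ ['a' .. 'z'] ++ ['A' .. 'Z']
  "digit" -> ['0' .. '9']
  _ -> ""

-- Map the character following a backslash to a character set,
-- or Nothing if it is not one of the shorthands handled here.
shorthand :: Char -> Maybe CharSet
shorthand c = case c of
  'd' -> Just (AnyOf (decodeClass "digit"))
  'D' -> Just (NoneOf (decodeClass "digit"))
  'w' -> Just (AnyOf ('_' : decodeClass "alnum")) -- assumed meaning of \w
  'W' -> Just (NoneOf ('_' : decodeClass "alnum"))
  _ -> Nothing

main :: IO ()
main = mapM_ (print . shorthand) "dDx"
```

Returning `Nothing` for other characters would let a translation step keep the current behavior (the escaped character itself) for every escape that is not a shorthand.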

@asarkar: The syntax accepted by regex-tdfa is [[:digit:]] instead of \d. See https://regex101.com/r/griuTm/1 for your whole regex.
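As a sketch of what this means for the `wordCount` function posted earlier in the thread, the version below swaps \d for [[:digit:]] and leaves the rest of the pattern alone. It assumes the regex-tdfa package is installed and that \b behaves as the reporter describes; the list comprehension is a stylistic rewrite of the original do block, not a required change.

```haskell
import Data.Char (toLower)
import Data.List (group, sort)
import Text.Regex.TDFA (getAllTextMatches, (=~))

-- Count case-insensitive occurrences of words and digit runs.
wordCount :: String -> [(String, Int)]
wordCount xs =
  [ (w, length g)
  | g@(w : _) <- group (sort (map (map toLower) ws))
  ]
  where
    -- All non-overlapping matches of the pattern in the input.
    ws = getAllTextMatches (xs =~ pat) :: [String]
    -- POSIX class [[:digit:]] in place of the Perl shorthand \d.
    pat = "[[:digit:]]+|\\b[a-zA-Z']+\\b" :: String

main :: IO ()
main = print (wordCount "testing, 1, 2 testing")
```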

@andreasabel andreasabel changed the title Doesn't honor predefined groups Perl-style shorthands (like \d) not recognized, only POSIX ones (like [[:digit:]]) Jul 14, 2022
@andreasabel andreasabel self-assigned this Jul 14, 2022
@asarkar
Author

asarkar commented Jul 14, 2022

https://regex101.com/r/griuTm/1 shows \d, is that the correct link?

@andreasabel
Member

> regex101.com/r/griuTm/1 shows \d, is that the correct link?

No, \d should be replaced by [[:digit:]]. I updated the regex, but the link didn't update.

Supporting Perl-style regexes like \d would not be hard to implement, but it would be a backward-incompatible change, because currently \d means simply d. So, I am not sure whether it is worth it. While \d is quicker to type, [[:digit:]] is easier to comprehend if you look at a regex. What is your application of regex-tdfa?
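One way to get the shorter spelling without changing the library is to rewrite the pattern on the user's side before handing it to regex-tdfa. The following is a minimal sketch using only base; `expandShorthands` and the `shorthands` table are hypothetical names, the POSIX equivalents chosen for \s and \w are assumptions, and the function does not look inside bracket expressions, so it assumes shorthands appear only outside of [...].

```haskell
-- Assumed POSIX equivalents of some common Perl-style shorthands.
shorthands :: [(Char, String)]
shorthands =
  [ ('d', "[[:digit:]]")
  , ('D', "[^[:digit:]]")
  , ('s', "[[:space:]]")
  , ('S', "[^[:space:]]")
  , ('w', "[[:alnum:]_]")
  , ('W', "[^[:alnum:]_]")
  ]

-- Rewrite \d, \s, \w (and their negations) into POSIX bracket
-- expressions; every other escape, such as \b, is copied unchanged.
-- Limitation: bracket expressions are not parsed, so a shorthand
-- written inside [...] would be expanded incorrectly.
expandShorthands :: String -> String
expandShorthands ('\\' : c : rest) =
  case lookup c shorthands of
    Just cls -> cls ++ expandShorthands rest
    Nothing -> '\\' : c : expandShorthands rest
expandShorthands (c : rest) = c : expandShorthands rest
expandShorthands [] = []

main :: IO ()
main = putStrLn (expandShorthands "\\d+|\\b[a-zA-Z']+\\b")
-- prints: [[:digit:]]+|\b[a-zA-Z']+\b
```

Because the backslash and the character after it are consumed together, an escaped backslash followed by a literal d is left alone.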

@asarkar
Author

asarkar commented Jul 15, 2022

I intend to use regex-tdfa to solve some exercises from https://exercism.org/tracks/haskell. An alternative is to use a parser combinator library like Megaparsec, but that is significantly harder.

For example, given below is a question that I previously solved in Rust using a regex library. The pattern I used was (?:[2-9][0-9]{2}){2}(?:[0-9]{4}).

They have a predefined list of packages they allow; regex-tdfa is not currently in that list, but I've submitted a PR to get it included.

If you're reluctant to make this change, and I'm not talking about \d only, I'll be happy to use any other regex package, but like I said before, it doesn't seem like there are a lot of great options.

Clean up user-entered phone numbers so that they can be sent SMS messages.

The North American Numbering Plan (NANP) is a telephone numbering system used by many countries in North America like the United States, Canada or Bermuda. All NANP-countries share the same international country code: 1.

NANP numbers are ten-digit numbers consisting of a three-digit Numbering Plan Area code, commonly known as area code, followed by a seven-digit local number. The first three digits of the local number represent the exchange code, followed by the unique four-digit number which is the subscriber number.

The format is usually represented as

(NXX)-NXX-XXXX
where N is any digit from 2 through 9 and X is any digit from 0 through 9.

Your task is to clean up differently formatted telephone numbers by removing punctuation and the country code (1) if present.

For example, the inputs

+1 (613)-995-0253
613-995-0253
1 613 995 0253
613.995.0253
should all produce the output

6139950253

Note: As this exercise only deals with telephone numbers used in NANP-countries, only 1 is considered a valid country code.

@andreasabel
Member

This exercise would be https://exercism.org/tracks/haskell/exercises/phone-number .

Please bear with me, I still have trouble understanding the importance of supporting \d etc.

> For example, given below is a question that I previously solved in Rust using a regex library. The pattern I used was (?:[2-9][0-9]{2}){2}(?:[0-9]{4}).

Ok, but this should be fine, as \d is here spelled out as [0-9].
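For illustration, a simplified take on the phone-number exercise with regex-tdfa might look like the sketch below. It assumes the regex-tdfa package is installed and that bounded repetition such as {2} is accepted. The non-capturing groups (?:...) from the Rust pattern are dropped because they are Perl syntax rather than POSIX, and this thread does not establish whether regex-tdfa accepts them. The function `number` is a hypothetical name and does not implement every rule of the actual exercise.

```haskell
import Data.Char (isDigit)
import Text.Regex.TDFA ((=~))

-- Strip punctuation and an optional leading country code 1, then
-- validate the remaining ten digits against the NXX NXX XXXX layout.
number :: String -> Maybe String
number input
  | local =~ nanp = Just local
  | otherwise = Nothing
  where
    digits = filter isDigit input
    -- Drop the country code only when exactly eleven digits were given.
    local = case digits of
      ('1' : rest) | length digits == 11 -> rest
      _ -> digits
    -- N is 2-9 and X is 0-9; anchors force a whole-string match.
    nanp = "^[2-9][0-9]{2}[2-9][0-9]{2}[0-9]{4}$" :: String

main :: IO ()
main = mapM_ (print . number)
  ["+1 (613)-995-0253", "613-995-0253", "1 613 995 0253", "613.995.0253"]
```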

> They have a predefined list of packages they allow; regex-tdfa is not currently in that list, but I've submitted a PR to get it included.

Please share the link to the PR if that's fine with you.

Would supporting \d etc. be a requirement to have regex-tdfa included?

@asarkar
Author

asarkar commented Jul 15, 2022

> Would supporting \d etc. be a requirement to have regex-tdfa included?

No, the PR's been merged.
exercism/haskell-test-runner#52

> the importance of supporting \d etc.

The importance, at least to me, is brevity and conciseness. If, in your opinion, what I said so far doesn't justify the change, I've nothing further to add to this discussion. Please make a decision, and either proceed to implement this ticket, or don't, I'm going to get my coat.

@andreasabel
Member

Ok, thanks for your input, @asarkar ! I need to balance between convenience and stability. I'll leave this open and see if other users chime in.

@andreasabel andreasabel added the enhancement New feature or request label Jul 15, 2022