Matches email addresses (within a larger string) with default options #5

surlyhacker · 2021-01-30T21:00:27Z

If you take this string: "Please send an email to [email protected], providing ONLY your first name", this module will return "test.com" as a match.

Is this intended behavior and I am just misunderstanding how it's supposed to work?

I see that the "auth" option (if set to true) may result in undesired matches to email addresses, but it defaults to false.
~~I don't see a test case in the tests for this scenario.~~ (see comment below)

What's interesting is that you commented on this exact problem in a relevant url-regex issue:

We're using this library to pull urls out of a user input string. We're currently running into a problem where it returns email addresses.
'This is an [email protected]'.match(urlRegex({ exact: false, strict: false }));
// returns ['[email protected]']
I think it should return no matches in this case.

...

This issue is fixed in my maintained and modern version of this package at https://github.com/niftylettuce/url-regex-safe. You should be able to switch from url-regex to url-regex-safe now. See the updated list of options as I added some new ones, and changed a few defaults to more sensible ones (since not everyone is parsing Markdown for instance).

Here are the relevant options that might impact this test, although they are all just their default values so wouldn't need to be explicitly set in a test:
exact: false
strict: false
auth: false

The text was updated successfully, but these errors were encountered:

surlyhacker · 2021-01-30T22:29:40Z

Actually there is a test similar to this that seems counterintuitive given my limited understanding of the interplay between the options and the aforementioned comment on that issue with url-regex:

test('parses similar to Gmail by default', (t) => {
  t.deepEqual(
    "[email protected] [foo]@bar.com foo bar @foob.com '[email protected], some text'".match(
      urlRegex()
    ),
    ['bar.com', 'bar.com', 'foob.com', 'example.com']
  );
});

Is that test case supposed to be finding:

URLs with an Authority component (hereafter referred to as "Authority URLs") that are missing a scheme and have no password? (e.g. normally http://[email protected])
or
Email address URLs that are missing a scheme? (e.g. normally mailto:[email protected])

I would suggest more detailed documentation to help eliminate ambiguity if this is in fact not a bug. Although if it is not a bug, that brings up the question of whether this module can currently be configured to behave as desired in this issue report, or if that would be an enhancement. Personally I see the value in use cases that separate email address from other types of URLs, and would want an option to explicitly include/exclude them.

Moreover (and perhaps most importantly), in a string like "Please use this Authority URL with username: [email protected] to access the site", I personally would not even consider that to be a valid Authority URL in any context whatsoever. A string with that syntax is an email address. In the modern world in the limited scenarios where Authority URLs are still used, they should always be be prefixed with a scheme and if they are not, they should not be considered Authority URLs regardless of what the spec (RFCs) say (in a context where email addresses may be present).

Edit: Also I don't see why "bar.com" is the matched substring for "[email protected]" either. If "[email protected]" is considered a URL, shouldn't the entire thing be matched and not just the part after the "@"? Another way of saying this is that if you think you have found a valid URL within a given string, but that URL is preceded by an "@" sign, then you have either not found a valid URL or you have found only a portion of a URL. There are very few non-whitespace characters that I would say are valid to be directly preceding a URL. Perhaps bracket/quote characters might be acceptable e.g. ( [ { ` ' " <

I guess it (characters that can precede a URL) depends on whether the string being parsed is generally human-readable stuff or if it's code or machine-readable.

niftylettuce · 2021-02-15T04:21:10Z

I think you're looking for email-regex-safe instead. Check out https://spamscanner.net and its source/tests for more insight into how I workaround parsing things out.

niftylettuce · 2021-02-15T04:21:35Z

You could use email-regex-safe to parse out the emails with blank spaces, e.g. .replace(EMAIL_REGEX, ' ') and then follow it up with your result.match(URL_REGEX)

niftylettuce closed this as completed Feb 15, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Matches email addresses (within a larger string) with default options #5

Matches email addresses (within a larger string) with default options #5

surlyhacker commented Jan 30, 2021 •

edited

Loading

surlyhacker commented Jan 30, 2021 •

edited

Loading

niftylettuce commented Feb 15, 2021

niftylettuce commented Feb 15, 2021 •

edited

Loading

Matches email addresses (within a larger string) with default options #5

Matches email addresses (within a larger string) with default options #5

Comments

surlyhacker commented Jan 30, 2021 • edited Loading

surlyhacker commented Jan 30, 2021 • edited Loading

niftylettuce commented Feb 15, 2021

niftylettuce commented Feb 15, 2021 • edited Loading

surlyhacker commented Jan 30, 2021 •

edited

Loading

surlyhacker commented Jan 30, 2021 •

edited

Loading

niftylettuce commented Feb 15, 2021 •

edited

Loading