Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Matches email addresses (within a larger string) with default options #5

Closed
surlyhacker opened this issue Jan 30, 2021 · 3 comments
Closed

Comments

@surlyhacker
Copy link

surlyhacker commented Jan 30, 2021

If you take this string: "Please send an email to [email protected], providing ONLY your first name", this module will return "test.com" as a match.

Is this intended behavior and I am just misunderstanding how it's supposed to work?

I see that the "auth" option (if set to true) may result in undesired matches to email addresses, but it defaults to false.
I don't see a test case in the tests for this scenario. (see comment below)

What's interesting is that you commented on this exact problem in a relevant url-regex issue:

We're using this library to pull urls out of a user input string. We're currently running into a problem where it returns email addresses.

'This is an [email protected]'.match(urlRegex({ exact: false, strict: false }));
// returns ['[email protected]']

I think it should return no matches in this case.

...

This issue is fixed in my maintained and modern version of this package at https://github.com/niftylettuce/url-regex-safe. You should be able to switch from url-regex to url-regex-safe now. See the updated list of options as I added some new ones, and changed a few defaults to more sensible ones (since not everyone is parsing Markdown for instance).

Here are the relevant options that might impact this test, although they are all just their default values so wouldn't need to be explicitly set in a test:
exact: false
strict: false
auth: false

@surlyhacker
Copy link
Author

surlyhacker commented Jan 30, 2021

Actually there is a test similar to this that seems counterintuitive given my limited understanding of the interplay between the options and the aforementioned comment on that issue with url-regex:

test('parses similar to Gmail by default', (t) => {
  t.deepEqual(
    "[email protected] [foo]@bar.com foo bar @foob.com '[email protected], some text'".match(
      urlRegex()
    ),
    ['bar.com', 'bar.com', 'foob.com', 'example.com']
  );
});

Is that test case supposed to be finding:

  1. URLs with an Authority component (hereafter referred to as "Authority URLs") that are missing a scheme and have no password? (e.g. normally http://[email protected])
    or
  2. Email address URLs that are missing a scheme? (e.g. normally mailto:[email protected])

I would suggest more detailed documentation to help eliminate ambiguity if this is in fact not a bug. Although if it is not a bug, that brings up the question of whether this module can currently be configured to behave as desired in this issue report, or if that would be an enhancement. Personally I see the value in use cases that separate email address from other types of URLs, and would want an option to explicitly include/exclude them.

Moreover (and perhaps most importantly), in a string like "Please use this Authority URL with username: [email protected] to access the site", I personally would not even consider that to be a valid Authority URL in any context whatsoever. A string with that syntax is an email address. In the modern world in the limited scenarios where Authority URLs are still used, they should always be be prefixed with a scheme and if they are not, they should not be considered Authority URLs regardless of what the spec (RFCs) say (in a context where email addresses may be present).

Edit: Also I don't see why "bar.com" is the matched substring for "[email protected]" either. If "[email protected]" is considered a URL, shouldn't the entire thing be matched and not just the part after the "@"? Another way of saying this is that if you think you have found a valid URL within a given string, but that URL is preceded by an "@" sign, then you have either not found a valid URL or you have found only a portion of a URL. There are very few non-whitespace characters that I would say are valid to be directly preceding a URL. Perhaps bracket/quote characters might be acceptable e.g. ( [ { ` ' " <

I guess it (characters that can precede a URL) depends on whether the string being parsed is generally human-readable stuff or if it's code or machine-readable.

@niftylettuce
Copy link
Collaborator

I think you're looking for email-regex-safe instead. Check out https://spamscanner.net and its source/tests for more insight into how I workaround parsing things out.

@niftylettuce
Copy link
Collaborator

niftylettuce commented Feb 15, 2021

You could use email-regex-safe to parse out the emails with blank spaces, e.g. .replace(EMAIL_REGEX, ' ') and then follow it up with your result.match(URL_REGEX)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants