Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Verify regex crash doesn't happen #2

Closed
bbottema opened this issue Feb 25, 2016 · 4 comments
Closed

Verify regex crash doesn't happen #2

bbottema opened this issue Feb 25, 2016 · 4 comments

Comments

@bbottema
Copy link
Owner

With the original Les version, there was a big performance issue (bbottema/simple-java-mail#3) where some addresses would make the regex parser freeze and blow up the server.

The below address should be parsed without issues, if this library is to have any merit in production-like environments:

[email protected]

@chconnor
Copy link
Contributor

Hi! -- Casey here. Glad to know that people are finding this code useful! Thanks for keeping it alive.

I can confirm that that address parses fine:

local part: 309d4696df38ff12c023600e3bc2bd4b
domain: fakedomain.com

The company I used to work for has parsed many tens of billions of addresses in the wild with this code, much gnarlier than that example. As mentioned in the stackoverflow answer, there are occasional stack overflows from truly crazy addresses. Frankly I'm not sure how solveable a problem that is: the regex is terribly complicated, RFC 2822 is pretty insane, and there may be some deep reason why it's impossible to prevent 100% of them. But we didn't worry too much about it because the addresses that caused this were always very ridiculous; e.g. some spammer with a buggy spam script that concatenates 400 addresses, each with unclosed braces, into a single string that this class then tries to parse as a single email address, etc. It's obviously much less than ideal that it occasionally breaks on an address, but debugging that stuff got very hairy and quickly wasn't worth the effort: parsing a very high percentage of addresses was more important to us than reducing the stack overflows to zero. It'd be great if someone wanted to take the time to iron it out.

I'm not aware of any remaining infinite hangs, though. IIRC, the stack overflows happen relatively quickly so they aren't burning CPU for a long time. Been a while since I dealt with this stuff, though, and I don't have example failing addresses, unfortunately.

In general the class is very robust (and is by far the best I'm aware of at extracting the desired info from spurious headers). E.g. I'm not aware of any legitimate addresses failing to parse in any way with this class, and the failures were addresses that no code anywhere could parse. If you like, you can throw a try/catch around it and think of the stack overflow as an ugly way of reporting a parsing failure. :-)

@bbottema
Copy link
Owner Author

If I remember correctly, the issue was with some regex look ahead modifier combination in which it found no solution... I'm not expert on the matter, but the regex scare me more than the actual RFC's.

It occurs to me that it should not be impossible to reduce the RFC to a lexical grammar with which you can parse email strings. If it is possible with an entire programming language syntax, it should be possible with an email format, right? The former I've done in the past before.... if only I had the time. Oh well.

@chconnor
Copy link
Contributor

Are you referring to an issue in Les' version or mine? The regexes were basically completely rewritten in my version, so I wouldn't draw any conclusions from Les' code in terms of this code or vice versa. (Les has also included 1.12 in his repo as EmailAddress2.java, but this repo is based off the latest (1.13), which does address a rare long-recursion bug that cropped up in 1.12.) But yeah, whatever stack overflows remain are certainly due to some kind of recursion issue as the regex traverses some nearly-infinite tree of possibilities. I fixed some by changing the greediness of certain matching groups and so forth, but didn't iron them all out.

The regex scares me too. :-) I was indeed advised long ago by Scott Kitterman, and believe his advice to be wise, that a lexer would be the smarter way to go on this, just as you suggest.

But if we could find an example address that causes the overflow, that would at least be a place for someone to start. Unfortunately I don't have access to the enormous corpus of spam messages I once did at that company. If someone does, they could run this against all the discovered addresses and see if they could make it break.

bbottema added a commit that referenced this issue Feb 26, 2016
@bbottema
Copy link
Owner Author

Verified as well in the new demo class.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants