Performance Issue: InternetAddressList.TryParse Slow with Large Recipient Lists #1106

Closed
kamilpastwa opened this issue Nov 26, 2024 · 6 comments

@kamilpastwa

Describe the problem
The InternetAddressList.TryParse method takes an extremely long time to complete when given large inputs. For example, parsing a list of 100,000 participants takes more than 10 minutes.

Here's how the method scales with the size of the input data:
[Graph: parse time vs. number of participants]

Platform:

  • OS: Windows 10
  • .NET Runtime: CoreCLR
  • .NET Framework: .NET 6.0
  • MimeKit Version: 4.8.0

To Reproduce
Run this unit test:

[TestCase(100000)]
public void RsmfLotOfParticipantsTest(int participantsCount)
{
	// Arrange

	// Prepare recipients 
	StringBuilder recipientsStringBuilder = new ();
	for (int i = 0; i < participantsCount; i++)
	{
		recipientsStringBuilder.Append($"\"Fake User{i+1} <>\"");
		if (i < participantsCount - 1)
		{
			recipientsStringBuilder.Append(", ");
		}
	}
	string recipients = recipientsStringBuilder.ToString();

	// Act
	var stopwatch = Stopwatch.StartNew();
	bool result = InternetAddressList.TryParse(new ParserOptions { AllowAddressesWithoutDomain = true }, recipients, out InternetAddressList addressList);

	// Assert
	Assert.IsTrue(result);
	Assert.AreEqual(addressList.Count, participantsCount);
	Console.WriteLine($"{participantsCount};{stopwatch.Elapsed}");
}

Expected behavior
The method should handle large input data efficiently without significant delays.

Additional context
I am attaching the original test file that led me to this problem:
100k-participants.zip

@jstedfast added the performance label (Improvements to speed or memory consumption) on Nov 26, 2024
@jstedfast
Owner

I took a quick look at this last night and nothing stood out as obviously O(N^2) (which is what your test results would indicate), but I did find at least 1 thing that could improve performance (so I'll be digging into that at some point).

That said, I have not yet stepped through this in the debugger to see what is going on. Maybe it's the performance issue that I mentioned, or it could be something else that I just didn't spot in my quick glance through the code.

Also, when I was looking through the code, it looked like the parser would fail to parse <> as a valid address, so that might be related. I know, for example, that there are places in the code where, if it fails to parse an address, it backtracks, treats what it parsed in the first pass as a display-name, and then tokenizes the next segment as the address.

That backtracking could easily cause an O(N^2) situation if that is indeed what is going on. I don't think it is, but I need to step through stuff in a debugger to truly know.

I'll try to get to actually debugging this as soon as I can. Hopefully I'll be able to carve out some time during the Thanksgiving holiday.

BTW, kudos for creating the graph. I love it :-)

@kamilpastwa
Author

First thing: the test results I presented in the issue were run on MimeKit 4.0.0. Apologies, that was my mistake.
I noticed a significant performance improvement after upgrading to 4.8.0:
[Graph: parse time vs. number of participants after upgrading to MimeKit 4.8.0]

However, to investigate the difference between a real email address and an empty one (<>), I think we should compare the performance of these two situations:

  1. Fake User1 [email protected]
  2. Fake User1 <>

Test code

[TestCase(1000, "")]
[TestCase(1000, "[email protected]")]
[TestCase(20000, "")]
[TestCase(20000, "[email protected]")]
[TestCase(40000, "")]
[TestCase(40000, "[email protected]")]
[TestCase(60000, "")]
[TestCase(60000, "[email protected]")]
public void ManyRecipientsTest(int participantsCount, string email)
{
	// Arrange
	StringBuilder recipientsStringBuilder = new();
	for (int i = 0; i < participantsCount; i++)
	{
		recipientsStringBuilder.Append($"\"Fake User{i + 1} <{email}>\"");
		if (i < participantsCount - 1)
		{
			recipientsStringBuilder.Append(", ");
		}
	}
	string recipients = recipientsStringBuilder.ToString();

	// Act
	var stopwatch = Stopwatch.StartNew();
	bool result = InternetAddressList.TryParse(new ParserOptions { AllowAddressesWithoutDomain = true }, recipients, out InternetAddressList addressList);

	// Assert
	Assert.IsTrue(result);
	Assert.AreEqual(addressList.Count, participantsCount);
	Console.WriteLine($"{participantsCount};{stopwatch.Elapsed}");
}

Results:

[Graph: parse time vs. participant count, empty address (<>) vs. real address]

So it doesn't look like the empty address (<>) is what causes the performance problem.

@jstedfast
Owner

Out of curiosity, how do the results change if you convert the recipients string into a byte[] using Encoding.UTF8.GetBytes() before starting the stopwatch?

The first thing that InternetAddressList.TryParse() does is convert the input string into a byte[] and then call the TryParse() overload that takes a byte[].
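
For what it's worth, a minimal sketch of that kind of measurement (a hedged illustration, not a prescribed benchmark): it reuses the recipients string built in the tests above, the same ParserOptions, and the byte[] overload that also appears later in this thread.

// Sketch: encode before starting the stopwatch so that only the parse itself is timed.
var options = new ParserOptions { AllowAddressesWithoutDomain = true };
byte[] buffer = Encoding.UTF8.GetBytes(recipients);    // done outside the timed region

var stopwatch = Stopwatch.StartNew();
bool result = InternetAddressList.TryParse(options, buffer, out InternetAddressList addressList);
stopwatch.Stop();
Console.WriteLine($"{addressList.Count};{stopwatch.Elapsed}");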

@jstedfast
Owner

jstedfast commented Nov 28, 2024

Okay, well, I'm stepping through this in the debugger and I see what is going on now... or at least part of what is going on.

Each address is fully quoted: recipientsStringBuilder.Append($"\"Fake User{i + 1} <{email}>\"");

This means that InternetAddress.TryParse() treats this entire string as a quoted-string token. Then the code notices that the very next char is a comma and decides that the sending client clearly constructed a malformed address header by not including the comma inside the quoted-string token, so it gobbles the comma up and proceeds to the next token, which is yet another quoted-string. InternetAddress.TryParse() continues like this, gobbling up every quoted-string token into one massive display-name, because it never finds a < character.

See the while loop here: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L617-L670

Once that completes, index == endIndex, so it proceeds into the if-block with the comment saying:

we've completely gobbled up an addr-spec w/o a domain

(which is technically inaccurate in this particular case, but it has the right idea).

Then the parser rewinds back to the beginning of the buffer before proceeding to try again, this time parsing the first token as the local-part of an addr-spec (an addr-spec token is just the email address without encapsulating <>'s in case you were wondering).

This explains the O(N^2) behavior.
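
To make the arithmetic concrete, here is a toy model of that backtracking (illustration only, not MimeKit's actual parser; the method name is made up). Because every address in the repro input is fully quoted, the <> sits inside the quoted-string and the parser never sees a bare <, so each parse attempt scans everything that remains before rewinding:

// Toy model of the backtracking described above -- not MimeKit code.
static long CountTokenExaminations(int addressCount)
{
	long examined = 0;
	for (int index = 0; index < addressCount; index++)
	{
		examined += addressCount - index; // first pass: gobble every remaining quoted-string looking for '<'
		examined += 1;                    // rewind, then re-parse the first token as a local-part
	}
	return examined; // ≈ addressCount^2 / 2
}

For 100,000 recipients that is on the order of 5 billion token examinations instead of 100,000, which lines up with the quadratic curves in the graphs above.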

@kamilpastwa
Author

Out of curiosity, how do the results change if you convert the recipients string into a byte[] using Encoding.UTF8.GetBytes() before starting the stopwatch?
The first thing that InternetAddressList.TryParse() does is convert the input string into a byte[] and then call the TryParse() overload that takes a byte[].

Let's try:

[TestCase(1000, true)]
[TestCase(1000, false)]
[TestCase(20000, true)]
[TestCase(20000, false)]
[TestCase(40000, true)]
[TestCase(40000, false)]
[TestCase(60000, true)]
[TestCase(60000, false)]
public void ManyRecipientsAndEncodingTest(int recipientsCount, bool encodeRecipientsBeforeStopwatch)
{
	// Arrange
	StringBuilder recipientsStringBuilder = new();
	for (int i = 0; i < recipientsCount; i++)
	{
		recipientsStringBuilder.Append($"\"Fake User{i + 1} <[email protected]>\"");
		if (i < recipientsCount - 1)
		{
			recipientsStringBuilder.Append(", ");
		}
	}
	string recipients = recipientsStringBuilder.ToString();
	// Encode ahead of the stopwatch only when requested; otherwise the encoding happens inside the timed section.
	byte[] encodedRecipients = encodeRecipientsBeforeStopwatch ? Encoding.UTF8.GetBytes(recipients) : null;

	// Act
	var stopwatch = Stopwatch.StartNew();
	encodedRecipients ??= Encoding.UTF8.GetBytes(recipients);
	bool result = InternetAddressList.TryParse(new ParserOptions { AllowAddressesWithoutDomain = true }, encodedRecipients, out InternetAddressList addressList);

	// Assert
	Assert.IsTrue(result);
	Assert.AreEqual(addressList.Count, recipientsCount);
	Console.WriteLine($"{recipientsCount};{stopwatch.ElapsedMilliseconds / 1000f / 60f} (encode={encodeRecipientsBeforeStopwatch})"); // minutes
}

[Graph: parse time vs. recipient count, with and without pre-encoding to byte[]]

So there is almost no difference.

To confirm, I ran this test:

[TestCase(20000)]
[TestCase(40000)]
[TestCase(60000)]
[TestCase(80000)]
[TestCase(100000)]
[TestCase(1000000)]
public void EncodingPerformanceTest(int recipientsCount)
{
	// Arrange
	StringBuilder recipientsStringBuilder = new();
	for (int i = 0; i < recipientsCount; i++)
	{
		recipientsStringBuilder.Append($"\"Fake User{i + 1} <[email protected]>\"");
		if (i < recipientsCount - 1)
		{
			recipientsStringBuilder.Append(", ");
		}
	}
	string recipients = recipientsStringBuilder.ToString();

	// Act
	var stopwatch = Stopwatch.StartNew();
	var encodedRecipients = Encoding.UTF8.GetBytes(recipients);

	// Assert
	Assert.AreEqual(encodedRecipients.Length, recipients.Length);
	Console.WriteLine($"{recipientsCount};{stopwatch.ElapsedMilliseconds}");
}

It takes only 37 ms for 1M recipients, so the encoding step is insignificant.

@jstedfast
Owner

Yea, once I spotted the real issue, I knew the encoding wasn't going to make a difference. I should have told you to ignore my previous comment.

jstedfast added a commit that referenced this issue Nov 29, 2024
In the case described in issue #1106, there are 1000's of "addresses"
which are really only quoted-strings separated by commas.

What this patch does is prevent the parser from consuming the entire string as
one display-name, then deciding that there's no address (it reached the end of
the string), and falling back to parsing the first qstring as a local-part instead.

This was causing the parser to be O(N^2), so the larger the recipient list got,
the worse the performance got.

Fixes issue #1106
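
As a contrast to the toy quadratic model earlier in the thread (and purely as an illustration of the idea in the commit message, not the actual patch), bounding the display-name pass so it stops at the comma instead of gobbling the rest of the buffer means each token is visited a constant number of times, and the parse becomes linear:

// Illustration only -- not the actual patch. Each address is accepted from the token
// already consumed, with no rescan of the remaining buffer.
static long CountTokenExaminationsBounded(int addressCount)
{
	long examined = 0;
	for (int index = 0; index < addressCount; index++)
		examined += 1; // accept the quoted-string already consumed as the (domain-less) address
	return examined;   // = addressCount, i.e. O(N)
}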