Performance Issue: InternetAddressList.TryParse Slow with Large Recipient Lists #1106

Closed
kamilpastwa opened this issue Nov 26, 2024 · 6 comments

@kamilpastwa

Describe the problem
The InternetAddressList.TryParse method takes an extremely long time to complete when given large inputs. For example, parsing a list of 100,000 participants takes more than 10 minutes.

Here's how the method scales with the size of the input data:
[Graph: parse time vs. number of participants]

Platform:

  • OS: Windows 10
  • .NET Runtime: CoreCLR
  • .NET Framework: .NET 6.0
  • MimeKit Version: 4.8.0

To Reproduce
Run this unit test:

[TestCase(100000)]
public void RsmfLotOfParticipantsTest(int participantsCount)
{
	// Arrange

	// Prepare recipients 
	StringBuilder recipientsStringBuilder = new ();
	for (int i = 0; i < participantsCount; i++)
	{
		recipientsStringBuilder.Append($"\"Fake User{i+1} <>\"");
		if (i < participantsCount - 1)
		{
			recipientsStringBuilder.Append(", ");
		}
	}
	string recipients = recipientsStringBuilder.ToString();

	// Act
	var stopwatch = Stopwatch.StartNew();
	bool result = InternetAddressList.TryParse(new ParserOptions { AllowAddressesWithoutDomain = true }, recipients, out InternetAddressList addressList);

	// Assert
	Assert.IsTrue(result);
	Assert.AreEqual(addressList.Count, participantsCount);
	Console.WriteLine($"{participantsCount};{stopwatch.Elapsed}");
}

Expected behavior
The method should handle large input data efficiently without significant delays.

Additional context
I am attaching the original test file that led me to this problem:
100k-participants.zip

@jstedfast added the performance label (Improvements to speed or memory consumption) on Nov 26, 2024
@jstedfast
Owner

I took a quick look at this last night and nothing stood out as obviously O(N^2) (which is what your test results would indicate), but I did find at least 1 thing that could improve performance (so I'll be digging into that at some point).

That said, I have not yet stepped through this in the debugger to see what is going on. Maybe it's the performance issue that I mentioned, or it could be something else that I just didn't spot in my quick glance through the code.

Also, when I was looking through the code, it looked like the parser would fail to parse <> as a valid address, so that might be related. I know, for example, that there are places in the code where, if it fails to parse an address, it backtracks, treats what it parsed in the first pass as a display-name, and then tokenizes the next segment as the address.

That backtracking could easily cause an O(N^2) situation if that is indeed what is going on. I don't think it is, but I need to step through stuff in a debugger to truly know.

I'll try to get to actually debugging this as soon as I can. Hopefully I'll be able to carve out some time during the Thanksgiving holiday.

BTW, kudos for creating the graph. I love it :-)

@kamilpastwa
Author

First thing: the test results I presented in the issue were run on MimeKit 4.0.0. Apologies, that was my mistake.
I noticed a significant performance improvement after upgrading to 4.8.0:
[Graph: parse time vs. number of participants after upgrading to MimeKit 4.8.0]

However, to investigate the difference between a real email address and an empty one (<>), I think we should compare the performance of these two situations:

  1. Fake User1 [email protected]
  2. Fake User1 <>

Test code

[TestCase(1000, "")]
[TestCase(1000, "[email protected]")]
[TestCase(20000, "")]
[TestCase(20000, "[email protected]")]
[TestCase(40000, "")]
[TestCase(40000, "[email protected]")]
[TestCase(60000, "")]
[TestCase(60000, "[email protected]")]
public void ManyRecipientsTest(int participantsCount, string email)
{
	// Arrange
	StringBuilder recipientsStringBuilder = new();
	for (int i = 0; i < participantsCount; i++)
	{
		recipientsStringBuilder.Append($"\"Fake User{i + 1} <{email}>\"");
		if (i < participantsCount - 1)
		{
			recipientsStringBuilder.Append(", ");
		}
	}
	string recipients = recipientsStringBuilder.ToString();

	// Act
	var stopwatch = Stopwatch.StartNew();
	bool result = InternetAddressList.TryParse(new ParserOptions { AllowAddressesWithoutDomain = true }, recipients, out InternetAddressList addressList);

	// Assert
	Assert.IsTrue(result);
	Assert.AreEqual(addressList.Count, participantsCount);
	Console.WriteLine($"{participantsCount};{stopwatch.Elapsed}");
}

Results:

[Graph: parse time vs. participant count, empty address (<>) vs. real address]

So it doesn't look like the empty address (<>) is what causes the performance problem.

@jstedfast
Owner

Out of curiosity, how do the results change if you convert the recipients string into a byte[] using Encoding.UTF8.GetBytes() before starting the stopwatch?

The first thing that InternetAddressList.TryParse() does is convert the input string into a byte[] and then call the TryParse() overload that takes a byte[].
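
For what it's worth, a minimal sketch of that kind of measurement (a hedged illustration, not a prescribed benchmark): it reuses the recipients string built in the tests above, the same ParserOptions, and the byte[] overload that also appears later in this thread.

// Sketch: encode before starting the stopwatch so that only the parse itself is timed.
var options = new ParserOptions { AllowAddressesWithoutDomain = true };
byte[] buffer = Encoding.UTF8.GetBytes(recipients);    // done outside the timed region

var stopwatch = Stopwatch.StartNew();
bool result = InternetAddressList.TryParse(options, buffer, out InternetAddressList addressList);
stopwatch.Stop();
Console.WriteLine($"{addressList.Count};{stopwatch.Elapsed}");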

@jstedfast
Owner

jstedfast commented Nov 28, 2024

Okay, well, I'm stepping through this in the debugger and I see what is going on now... or at least part of what is going on.

Each address is fully quoted: recipientsStringBuilder.Append($"\"Fake User{i + 1} <{email}>\"");

This means that InternetAddress.TryParse() treats this entire string as a quoted-string token. Then the code notices that the very next char is a comma and decides that the sending client clearly constructed a malformed address header by not including the comma inside the quoted-string token, so it gobbles the comma up and proceeds to the next token, which is yet another quoted-string. InternetAddress.TryParse() continues like this, gobbling up every quoted-string token into one massive display-name, because it never finds a < character.

See the while loop here: https://github.com/jstedfast/MimeKit/blob/master/MimeKit/InternetAddress.cs#L617-L670

Once that completes, index == endIndex, so it proceeds into the if-block with the comment saying:

we've completely gobbled up an addr-spec w/o a domain

(which is technically inaccurate in this particular case, but it has the right idea).

Then the parser rewinds back to the beginning of the buffer before proceeding to try again, this time parsing the first token as the local-part of an addr-spec (an addr-spec token is just the email address without encapsulating <>'s in case you were wondering).

This explains the O(N^2) behavior.
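
To make the arithmetic concrete, here is a toy model of that backtracking (illustration only, not MimeKit's actual parser; the method name is made up). Because every address in the repro input is fully quoted, the <> sits inside the quoted-string and the parser never sees a bare <, so each parse attempt scans everything that remains before rewinding:

// Toy model of the backtracking described above -- not MimeKit code.
static long CountTokenExaminations(int addressCount)
{
	long examined = 0;
	for (int index = 0; index < addressCount; index++)
	{
		examined += addressCount - index; // first pass: gobble every remaining quoted-string looking for '<'
		examined += 1;                    // rewind, then re-parse the first token as a local-part
	}
	return examined; // ≈ addressCount^2 / 2
}

For 100,000 recipients that is on the order of 5 billion token examinations instead of 100,000, which lines up with the quadratic curves in the graphs above.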

@kamilpastwa
Author

Out of curiosity, how do the results change if you convert the recipients string into a byte[] using Encoding.UTF8.GetBytes() before starting the stopwatch?
The first thing that InternetAddressList.TryParse() does is convert the input string into a byte[] and then call the TryParse() overload that takes a byte[].

Let's try:

[TestCase(1000, true)]
[TestCase(1000, false)]
[TestCase(20000, true)]
[TestCase(20000, false)]
[TestCase(40000, true)]
[TestCase(40000, false)]
[TestCase(60000, true)]
[TestCase(60000, false)]
public void ManyRecipientsAndEncodingTest(int recipientsCount, bool encodeRecipientsBeforeStopwatch)
{
	// Arrange
	StringBuilder recipientsStringBuilder = new();
	for (int i = 0; i < recipientsCount; i++)
	{
		recipientsStringBuilder.Append($"\"Fake User{i + 1} <[email protected]>\"");
		if (i < recipientsCount - 1)
		{
			recipientsStringBuilder.Append(", ");
		}
	}
	string recipients = recipientsStringBuilder.ToString();
	// Encode ahead of the stopwatch only when requested; otherwise the encoding happens inside the timed section.
	byte[] encodedRecipients = encodeRecipientsBeforeStopwatch ? Encoding.UTF8.GetBytes(recipients) : null;

	// Act
	var stopwatch = Stopwatch.StartNew();
	encodedRecipients ??= Encoding.UTF8.GetBytes(recipients);
	bool result = InternetAddressList.TryParse(new ParserOptions { AllowAddressesWithoutDomain = true }, encodedRecipients, out InternetAddressList addressList);

	// Assert
	Assert.IsTrue(result);
	Assert.AreEqual(addressList.Count, recipientsCount);
	Console.WriteLine($"{recipientsCount};{stopwatch.ElapsedMilliseconds / 1000f / 60f} (encode={encodeRecipientsBeforeStopwatch})"); // minutes
}

[Graph: parse time vs. recipient count, with and without pre-encoding to byte[]]

So there is almost no difference.

To confirm, I ran this test:

[TestCase(20000)]
[TestCase(40000)]
[TestCase(60000)]
[TestCase(80000)]
[TestCase(100000)]
[TestCase(1000000)]
public void EncodingPerformanceTest(int recipientsCount)
{
	// Arrange
	StringBuilder recipientsStringBuilder = new();
	for (int i = 0; i < recipientsCount; i++)
	{
		recipientsStringBuilder.Append($"\"Fake User{i + 1} <[email protected]>\"");
		if (i < recipientsCount - 1)
		{
			recipientsStringBuilder.Append(", ");
		}
	}
	string recipients = recipientsStringBuilder.ToString();

	// Act
	var stopwatch = Stopwatch.StartNew();
	var encodedRecipients = Encoding.UTF8.GetBytes(recipients);

	// Assert
	Assert.AreEqual(encodedRecipients.Length, recipients.Length);
	Console.WriteLine($"{recipientsCount};{stopwatch.ElapsedMilliseconds}");
}

It takes only 37 ms for 1M recipients, so the encoding step is insignificant.

@jstedfast
Owner

Yea, once I spotted the real issue, I knew the encoding wasn't going to make a difference. I should have told you to ignore my previous comment.

jstedfast added a commit that referenced this issue Nov 29, 2024
In the case described in issue #1106, there are 1000's of "addresses"
which are really only quoted-strings separated by commas.

What this patch does is prevent the parser from consuming the entire string as
one display-name, then deciding that there's no address (it reached the end of
the string), and falling back to parsing the first qstring as a local-part instead.

This was causing the parser to be O(N^2), so the larger the recipient list got,
the worse the performance got.

Fixes issue #1106
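
As a contrast to the toy quadratic model earlier in the thread (and purely as an illustration of the idea in the commit message, not the actual patch), bounding the display-name pass so it stops at the comma instead of gobbling the rest of the buffer means each token is visited a constant number of times, and the parse becomes linear:

// Illustration only -- not the actual patch. Each address is accepted from the token
// already consumed, with no rescan of the remaining buffer.
static long CountTokenExaminationsBounded(int addressCount)
{
	long examined = 0;
	for (int index = 0; index < addressCount; index++)
		examined += 1; // accept the quoted-string already consumed as the (domain-less) address
	return examined;   // = addressCount, i.e. O(N)
}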