
Improve Linux performance when listening on many ports #600

Merged (4 commits), Apr 18, 2024

Conversation

@LukeButters (Contributor) commented Apr 17, 2024

Background

While trying to get thousands of listening simulated Tentacles working, I was running into issues where Octopus was unable to communicate with the Tentacle. Eventually I discovered that the CPU usage of Tentacle Simulator was high despite it not doing anything: it turned out to be spending all its time in SpinWait.

This was done because TcpListener.Stop hangs if an Accept is in progress (dotnet/runtime#24513). That bug is claimed to be fixed, although the history of reverts there is such a mess that it is not clear whether it actually is.

Working on the assumption that TcpListener.Stop still hangs if an Accept is in progress, the new optional path is to:

  • await accepting a connection with the cancellation token, rather than polling;
  • when the SecureListener is disposed, cancel the CancellationToken first, then wait for the accept task to complete (which will surface some sort of cancellation exception); only then, once we know we are not in the middle of an Accept, call TcpListener.Stop (see the sketch after this list).
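
A minimal sketch of that shutdown-safe accept loop, with illustrative names (AcceptLoopAsync and HandleClientAsync are not the actual Halibut code):

using System.Net.Sockets;
using System.Threading;
using System.Threading.Tasks;

// Sketch only. AcceptTcpClientAsync(CancellationToken) exists on modern .NET,
// which is why the real change guards it with #if !NETFRAMEWORK.
async Task AcceptLoopAsync(TcpListener listener, CancellationToken cancellationToken)
{
    while (!cancellationToken.IsCancellationRequested)
    {
        try
        {
            // Awaiting costs nothing while idle, unlike spinning on listener.Pending().
            var client = await listener.AcceptTcpClientAsync(cancellationToken);
            _ = HandleClientAsync(client); // hand off to connection handling
        }
        catch (OperationCanceledException)
        {
            break; // disposal cancelled the token; exit before Stop() is called
        }
    }
}

static Task HandleClientAsync(TcpClient client) => Task.CompletedTask; // stand-in for real handling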

Note that the new behaviour must be opted into by setting:

UseAsyncListener = true;

on HalibutTimeoutsAndLimits.

Results

Before

CPU usage was about half of my machine's cores when listening on 1000 sockets.

After

CPU usage is near nothing.

Tentacle Simulator can actually be used with thousands of listening Tentacles.

How to review this PR

Quality ✔️

Pre-requisites

  • I have read How we use GitHub Issues for help deciding when and where it's appropriate to make an issue.
  • I have considered informing or consulting the right people, according to the ownership map.
  • I have considered appropriate testing for my change.

@LukeButters requested a review from a team as a code owner, April 17, 2024 06:15
@@ -49,6 +49,8 @@ public HalibutTimeoutsAndLimits Build()
         limits.MaximumActiveTcpConnectionsPerPollingSubscription = maximumActiveTcpConnectionsPerPollingSubscription.Value;
     }

+    limits.UseAsyncListener = true;
Contributor

This feature is optional, but we are turning it on by default for our tests.

I think it makes more sense to run our tests the way they will be used by Octopus (i.e., with this off).

In saying that, it would be nice to see all the tests work with this on as well. Would it be overkill to double the number of tests and create different test cases with this on and off? 🤔

Or do we just make a few choice tests with this on for now?

Contributor Author (@LukeButters, Apr 18, 2024)

I think the long-term direction is to have this on, so if anything we would want it on now, not off.

We will need to roll this out, and step one is getting signal here.

Do we want a one-off run with it off, to check that all the existing behaviour remains?

Contributor

If the plan is to move it towards being the default behaviour, then this is fine.

@@ -75,7 +75,12 @@ public static string CurrentTestHash()
 {
     using (SHA256 mySHA256 = SHA256.Create())
     {
-        return Convert.ToBase64String(mySHA256.ComputeHash(TestContext.CurrentContext.Test.FullName.GetUTF8Bytes()))
+        string s = TestContext.CurrentContext.Test.FullName;
Contributor

Is this running locally, or on TeamCity?

I'm guessing it won't cause any issues running them in parallel?

Contributor Author

Both. The advantage here is that the net48 and net6 test runs can run in parallel on your laptop without log file collisions.
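
A hedged guess at the shape of the change (mixing the framework into the hashed name is my inference from this thread, not a quote of the diff):

using System;
using System.Runtime.InteropServices;
using System.Security.Cryptography;
using System.Text;

// Assumption: the hash input now includes the running framework so that net48
// and net6 runs get distinct log file names. Illustrative only.
static string CurrentTestHash(string testFullName)
{
    var s = testFullName + RuntimeInformation.FrameworkDescription;
    using (var sha256 = SHA256.Create())
    {
        return Convert.ToBase64String(sha256.ComputeHash(Encoding.UTF8.GetBytes(s)))
            .Replace("/", "_").Replace("+", "-"); // Base64 can emit path-hostile characters
    }
}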

@@ -55,7 +55,7 @@ async Task WriteDataToFile()
     // So what we can write it down as one chunk.
     while (queue.TryDequeue(out var log)) list.Add(log);

-    using(var fileWriter = new FileStream(logFilePath, FileMode.Append, FileAccess.Write, FileShare.Delete | FileShare.ReadWrite))
+    using(var fileWriter = await OpenLogFile())
Contributor

Just wondering, was this an issue locally or on TeamCity?

Contributor Author

Locally. My guess is CrowdStrike combined with slightly different timings (I had turbo on my laptop turned off, so I suspect files were held open for longer, which resulted in the conflict).
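
OpenLogFile itself isn't shown in this hunk; a hypothetical sketch of a retrying version, assuming the failure mode is a transient IOException while another process briefly holds the file:

using System.IO;
using System.Threading.Tasks;

// Hypothetical: the real OpenLogFile is not shown in this PR. Retries transient
// sharing violations instead of failing the log write outright.
static async Task<FileStream> OpenLogFile(string logFilePath)
{
    for (var attempt = 1; ; attempt++)
    {
        try
        {
            return new FileStream(logFilePath, FileMode.Append, FileAccess.Write,
                FileShare.Delete | FileShare.ReadWrite);
        }
        catch (IOException) when (attempt < 10)
        {
            await Task.Delay(100); // back off and let the other handle close
        }
    }
}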

@@ -25,7 +25,7 @@ public class HalibutRuntime : IHalibutRuntime
     readonly ConcurrentDictionary<Uri, IPendingRequestQueue> queues = new();
     readonly IPendingRequestQueueFactory queueFactory;
     readonly X509Certificate2 serverCertificate;
-    readonly List<IDisposable> listeners = new();
+    readonly MutexedItem<List<IAsyncDisposable>> listeners = new MutexedItem<List<IAsyncDisposable>>(new List<IAsyncDisposable>());
Contributor

Just to verify, the motivation behind this change was to make disposal async, right?

Contributor Author

We needed to make Accept async (with a cancellation token). That meant that to finish the accept-loop task we had to await it, which meant IAsyncDisposable, which in turn meant this had to become a list of IAsyncDisposable. To preserve the only-one-thing-can-access-this-list-at-a-time behaviour, we then needed some way of locking in both async and sync contexts (lock does not work across await). MutexedItem ensures, assuming you don't try really hard to defeat it, that you cannot reach the list without mutually exclusive access.
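
MutexedItem is introduced in this PR; a minimal sketch of the idea, assuming a SemaphoreSlim-based wrapper (the real type may differ):

using System;
using System.Threading;
using System.Threading.Tasks;

// Sketch: the wrapped value is only reachable inside a callback that runs while
// the semaphore is held, so access is mutually exclusive from both sync and
// async code paths.
public class MutexedItem<T>
{
    readonly T item;
    readonly SemaphoreSlim mutex = new SemaphoreSlim(1, 1);

    public MutexedItem(T item) => this.item = item;

    // Synchronous access: blocks until the mutex is free.
    public void DoWith(Action<T> action)
    {
        mutex.Wait();
        try { action(item); }
        finally { mutex.Release(); }
    }

    // Asynchronous access: awaits the mutex instead of blocking a thread.
    public async Task DoWithAsync(Func<T, Task> action)
    {
        await mutex.WaitAsync();
        try { await action(item); }
        finally { mutex.Release(); }
    }
}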

try
{
#if !NETFRAMEWORK
client = await listener.AcceptTcpClientAsync(this.cancellationToken);
Contributor

I don't think I understand the subtlety here.

In the old one, we would wait until cts.IsCancellationRequested || listener.Pending() before calling accept.

Using the cancellation token removes the need for cts.IsCancellationRequested. But is waiting for listener.Pending() important? Or does that come for free when using the async AcceptTcpClientAsync?

Contributor Author

The whole wait-until-we-know-something-is-pending dance was done to avoid sitting in Accept at the same time as listener.Stop() is called, because of a bug (which may or may not still exist; honestly, I can't tell) where the call to listener.Stop() would hang if AcceptTcpClient was in the middle of accepting a new connection.

The big change is to remove the SpinWait that polls listener.Pending(), because that is very expensive :(.
Instead we use the newer (relative to when the old code was added) AcceptTcpClientAsync overload that takes a cancellation token. When we shut down on Linux, the order is always:

  1. Cancel the CTS.
  2. Wait for the accept loop to exit (which it does).
  3. Then and only then stop the listener (at this point we know we are NOT in Accept, so it cannot hang). This ordering is sketched below.
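
A sketch of that ordering; cts, acceptLoopTask, and listener are the fields a SecureListener-like type would hold (names are hypothetical):

// Sketch only: Stop() runs strictly after the accept task has finished, so it
// can never overlap an in-flight Accept and hit the hang described above.
public async ValueTask DisposeAsync()
{
    cts.Cancel();                      // 1. cancel the CTS
    try { await acceptLoopTask; }      // 2. wait for the accept loop to exit
    catch (OperationCanceledException) { }
    listener.Stop();                   // 3. only now is Stop() guaranteed safe from the hang
}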

@sburmanoctopus (Contributor) left a comment

Looking good.

I'd like to understand the comment in SecureListener regarding listener.Pending() before approval, though.
