Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[QUIC] Root listener/connection while waiting on new connection/stream event #74450

Merged
merged 2 commits into from
Aug 26, 2022

Conversation

ManickaP
Copy link
Member

Root cause

When awaiting reading from Channel (whether it's channel of connections in listener or channel of stream in connection) we're technically awaiting a native event from msquic. During this awaiting we were hitting another instance of the problem stated in https://devblogs.microsoft.com/pfxteam/keeping-async-methods-alive/. In other words, GC would collect listener/connection while we were waiting for NEW_CONNECTION/NEW_STREAM events.

As a side note, we did have the same problem before the API changes (with the State objects), even with more places where GC had an opportunity to collect Listener/Connection/Stream. My suspicion is timing changes due to API changes and/or other changes elsewhere just uncovered this.

Fix

Added rooting of listener/connection in AcceptIncoming... methods. Also added tests directly targeted at this problem (they were consistently failing before the fix, always on the first run).

Future

We have a lot of occurrences where we allocate and free GCHandle of the same object repeatedly in its lifetime. I don't know how expensive this is, but if it proves so, we might think of a better solution to this.

Note that the PR is done in 2 commits:

Fixes #73377

@ghost
Copy link

ghost commented Aug 23, 2022

Tagging subscribers to this area: @dotnet/ncl
See info in area-owners.md if you want to be subscribed.

Issue Details

Root cause

When awaiting reading from Channel (whether it's channel of connections in listener or channel of stream in connection) we're technically awaiting a native event from msquic. During this awaiting we were hitting another instance of the problem stated in https://devblogs.microsoft.com/pfxteam/keeping-async-methods-alive/. In other words, GC would collect listener/connection while we were waiting for NEW_CONNECTION/NEW_STREAM events.

As a side note, we did have the same problem before the API changes (with the State objects), even with more places where GC had an opportunity to collect Listener/Connection/Stream. My suspicion is timing changes due to API changes and/or other changes elsewhere just uncovered this.

Fix

Added rooting of listener/connection in AcceptIncoming... methods. Also added tests directly targeted at this problem (they were consistently failing before the fix, always on the first run).

Future

We have a lot of occurrences where we allocate and free GCHandle of the same object repeatedly in its lifetime. I don't know how expensive this is, but if it proves so, we might think of a better solution to this.

Note that the PR is done in 2 commits:

Fixes #73377

Author: ManickaP
Assignees: -
Labels:

area-System.Net.Quic

Milestone: -

@ManickaP ManickaP requested review from CarnaViire, wfurt and rzikm August 23, 2022 19:48
@@ -562,7 +564,6 @@ await using (stream.ConfigureAwait(false))
}

stream.Abort(QuicAbortDirection.Read, (long)Http3ErrorCode.StreamCreationError);
stream.Dispose();
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The stream is in await using which will handle the Dispose.

@dotnet dotnet deleted a comment from azure-pipelines bot Aug 23, 2022
Copy link
Member

@rzikm rzikm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! good catch!

Copy link
Member

@wfurt wfurt left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@wfurt wfurt requested a review from stephentoub August 24, 2022 06:24
@ManickaP ManickaP force-pushed the mapichov/fix-73377 branch from 395194e to 619899e Compare August 24, 2022 09:05
@ManickaP ManickaP force-pushed the mapichov/fix-73377 branch from 619899e to 9ab37f9 Compare August 24, 2022 09:20
@ManickaP
Copy link
Member Author

/azp run runtime-libraries-coreclr outerloop

@dotnet dotnet deleted a comment from azure-pipelines bot Aug 24, 2022
@azure-pipelines
Copy link

Azure Pipelines successfully started running 1 pipeline(s).

@dotnet dotnet deleted a comment from azure-pipelines bot Aug 24, 2022
@dotnet dotnet deleted a comment from azure-pipelines bot Aug 24, 2022
@ManickaP
Copy link
Member Author

@dotnet/area-infrastructure-libraries The outerloop tests seem to be broken (in several PRs and main as well) with:

'"C:\helix\work\correlation\dotnet.exe"' is not recognized as an internal or external command,
operable program or batch file.

Only 10% of the tests is passing.

Is this a known issue?

@ViktorHofer
Copy link
Member

Is this a known issue?

Yes: #74468

@wfurt
Copy link
Member

wfurt commented Aug 24, 2022

I was thinking about it more I'm wondering it is caller's responsibility to root the objects they need/use. For QuicListener it may not be problem for Kestrel as they probably hold reference and perhaps we use it wrong in our tests....?

We had similar problem in #72586. But what was not rooted was internal object so it was responsibility of HttpHandler e.g. product issue.

@rzikm
Copy link
Member

rzikm commented Aug 25, 2022

it is caller's responsibility to root the objects they need/use.

It may be rather difficult and impractical for the user to do so (and very difficult to explain why/how to do it). Consider the QuicStreamConnectedStreamConformanceTests.CreateConnectedStreamsAsync (where the hang occurred)

protected override async Task<StreamPair> CreateConnectedStreamsAsync()

Based on what the mentioned blogpost in PR description says, once we get to the AcceptConnectionAsync()

The entire async call stack is not rooted anywhere. the object graph that would be collected includes the QuicStreamConnectedStreamConformanceTests and all continuations on the CreateConnectedStreamsAsync, the test state machine (Parallel_ReadWriteMultipleStreamsConcurrently), and maybe the xUnit continuation which would run the next test...

Keeping the above in mind, I don't really see how to root the QuicListener the way you suggest.

@rzikm
Copy link
Member

rzikm commented Aug 25, 2022

One other thing I discovered while trying to understand the issue: Consider following code

(await QuicListener.Listen(/*...*/)).AcceptConnectionAsync()

I.e. calling AcceptConnectionAsync and abandoning both the listener and the returned task.

Unless there is an incoming connection, the GCHandle allocated in AcceptConnectionAsync is never freed and thus the code above creates a resource leak. We may possibly encounter similar state when user forgets to Dispose the QuicListener in some scenario.

@ManickaP
Copy link
Member Author

Unless there is an incoming connection, the GCHandle allocated in AcceptConnectionAsync is never freed and thus the code above creates a resource leak. We may possibly encounter similar state when user forgets to Dispose the QuicListener in some scenario.

And I don't see a way around it, unfortunately. If we do not root, GC might collect it. If we root, we don't have a way to know that user is not using the object anymore without them calling Dispose (Finalizer won't help here, no State-like middle man will help either). Any ideas? Any opinions?

@stephentoub
Copy link
Member

How is this different from Socket.AcceptAsync? If you call that, never connect, and drop all other references, isn't the Socket still kept alive? I'd expect it to be.

@ManickaP
Copy link
Member Author

How is this different from Socket.AcceptAsync? If you call that, never connect, and drop all other references, isn't the Socket still kept alive? I'd expect it to be.

So it's fine to keep it rooted and have a leak if the user doesn't dispose? And we have a precedent for this in sockets. I'm asking. I don't know what's right or wrong. I lived in the idea that we cannot have leaks even if users don't Dispose everything (e.g. using finalizers to catch these cases).

@rzikm
Copy link
Member

rzikm commented Aug 25, 2022

How is this different from Socket.AcceptAsync? If you call that, never connect, and drop all other references, isn't the Socket still kept alive? I'd expect it to be.

Just made a quick test and it seems that sockets do behave this way

using System.Net;
using System.Net.Sockets;

void DoStuff()
{
    var s = new Socket(AddressFamily.InterNetwork, SocketType.Stream, ProtocolType.Tcp);
    s.Bind(new IPEndPoint(IPAddress.Any, 50001));
    s.Listen();
    System.Console.WriteLine("Bound+Listening");
    _ = s.AcceptAsync(); // removing this line will allow the socket to be GC'd
}

DoStuff();

System.Console.ReadLine();
GC.Collect();

System.Console.WriteLine("Abandoned and GC'd?");
System.Console.ReadLine();

@stephentoub
Copy link
Member

If the code were instead:

static Socket CreateListener();
... 
new Thread(()  =>
{
    Socket server = CreateListener().Accept();
    Console.WriteLine(server);
}). Start();

you'd be really surprised if a GC caused that all to go away if one occurred before a client connected. It's the same here with async. The code is issuing a call to start an operation; the relevant state needs to be kept alive until the operation completes.

@ManickaP ManickaP merged commit 37eff15 into dotnet:main Aug 26, 2022
@ManickaP ManickaP deleted the mapichov/fix-73377 branch August 26, 2022 18:25
@wfurt
Copy link
Member

wfurt commented Aug 26, 2022

I'm wondering if we could use GC.KeepAlive(this) but I don't know if it works for the async continuations....

@stephentoub
Copy link
Member

wondering if we could use GC.KeepAlive(this)

Nope

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
6 participants