Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Test.Saga hangs test runner on Linux under certain conditions #135

Closed
jenspettersson opened this issue Feb 19, 2019 · 31 comments
Closed

Test.Saga hangs test runner on Linux under certain conditions #135

jenspettersson opened this issue Feb 19, 2019 · 31 comments

Comments

@jenspettersson
Copy link

Not 100% sure this is caused by Test.Sagabut we only experience this when running tests (in a Linux container) that contains Saga-tests that does some async operation.

Reproduced in https://github.com/jenspettersson/NSBSagaTest

Symptom

When running dotnet test inside a Linux container (in this repo microsoft/dotnet:2.2.104-sdk on Docker For Windows with Linux containers) the test executor hangs if you test a Saga from at least two different tests in parallel and the saga does some sort of async operation.

This is not the case when running tests with dotnet test on a Windows 10 computer or directly from Visual Studio.

How to reproduce

Clone the repo and and run docker build -t nsb-lab . in the root. This will try to run all the tests while running on microsoft/dotnet:2.2.104-sdk. The result should be:

Test run for /app/NSBSagaTest.Tests/bin/Debug/netcoreapp2.2/NSBSagaTest.Tests.dll(.NETCoreApp,Version=v2.2)
Microsoft (R) Test Execution Command Line Tool Version 15.9.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...

and it hangs "forever" on the last line.

I've included a xunit.runner.json file with both parallelizeAssembly and parallelizeTestCollections set to true (I think this is the default for the runner on Linux) and if you change it to:

{
  "parallelizeAssembly": true,
  "parallelizeTestCollections": false
}

and run the docker command again, the tests should pass.

Test run for /app/NSBSagaTest.Tests/bin/Debug/netcoreapp2.2/NSBSagaTest.Tests.dll(.NETCoreApp,Version=v2.2)
Microsoft (R) Test Execution Command Line Tool Version 15.9.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...

Total tests: 4. Passed: 4. Failed: 0. Skipped: 0.
Test Run Successful.
Test execution time: 6.5840 Seconds

Instead of changing the xunit.runner.json file, you can remove Another_Test_for_SagaOne and Another_Test_for_SagaTwo and the test runner won't hang.

@bording
Copy link
Member

bording commented Feb 19, 2019

@jenspettersson Have you tried running on Linux, but not inside a container?

@jenspettersson
Copy link
Author

@bording Nope, I don't have access to a Linux instance atm. Maybe I should have been more clear about that in the issue.

@jenspettersson
Copy link
Author

However, it happens on our build server as well and that's a Linux server building docker containers. But still, only built/run within a container.

@bording
Copy link
Member

bording commented Feb 19, 2019

@jenspettersson It would be good to know if it was specific to being inside a container. Do you have WSL set up on your Windows 10 machine? That would be an easy way to test it, and would provide another useful data point.

@jenspettersson
Copy link
Author

@bording No, I don't have WSL on my Windows 10 machine, it's my clients machine and I don't have permission to setup everything. I did just test this on my personal MacBook with .NET Core 2.2.104 SDK installed (no docker) and the same thing happened.

@bording
Copy link
Member

bording commented Feb 19, 2019

@jenspettersson I cloned your repo and have run the tests using dotnet test with the 2.2.104 SDK in the following environments:

  • Windows 10 1809
  • Ubuntu 18.04 WSL
  • Ubuntu 18.04 VM
  • docker running on the Ubuntu 18.04 VM

The tests are running properly and not hanging for all of those. It would seem that there must be something more going on because I am not able to repro the problem you're describing.

@jenspettersson
Copy link
Author

@bording Strange, but thanks for investigating. I'm currently testing this on Azure Dev Ops with a build pipeline that runs on "Hosted Ubuntu 1604" and it's also hangs.

I made that Azure DevOps project public and it's running build no. 71 as I'm writing this and it has been running for 4 minutes and is still not done: https://dev.azure.com/jenspettersson/nsb-saga-test/_build/results?buildId=71&_a=summary

@jenspettersson
Copy link
Author

I canceled the build after it's been trying to run the tests for 10 minutes:

2019-02-19T20:30:55.2729839Z Starting test execution, please wait...
2019-02-19T20:40:57.7701502Z ##[error]The operation was canceled.

@jenspettersson
Copy link
Author

Tested the same pipeline (build 72) but with build agent "Hosted VS2017" and it also hanged...

@jenspettersson
Copy link
Author

jenspettersson commented Feb 19, 2019

Last test for tonight: Created a Ubuntu 18.04 VM on Azure and cloned the repo and ran dotnet test and it also hangs.

Changed the xunit.runner.json to:

{
  "parallelizeAssembly": true,
  "parallelizeTestCollections": false
}

and ran dotnet test again and it succeded after a couple of seconds.

@bording
Copy link
Member

bording commented Feb 19, 2019

And yet I've tried all the same things, and it works fine. I think we're going to need more to go on to be able to figure it out.

@andreasohlund
Copy link
Member

I can confirm that this hangs on a Ubuntu 18.04 VM that @jenspettersson have created on Azure.

@bording would you be able to take a look if I send you the credentials?

@bording
Copy link
Member

bording commented Feb 20, 2019

@jenspettersson I can see the tests hanging on the VM you created, but they don't hang on mine. That means the problem has to be more than just "run on Linux". Perhaps it's related to the type/number of CPUs on the machine?

Also, I see that you updated your repo to include an NUnit-based test project. However, that isn't yet a fair comparison because you need to add an assembly-level Parallelizable attribute to get NUnit to run tests in parallel.

See https://github.com/Particular/NServiceBus/blob/develop/src/NServiceBus.AcceptanceTests/AssemblyInfo.cs for an example of what you need to add.

Once NUnit is also running tests in parallel, I'll be interested to see what the behavior is on your VM.

@jenspettersson
Copy link
Author

Yes, perhaps the problem might be more than just "run on Linux", that was just our initial thought after spending hours finally tracking down the specific tests that caused our whole test suite to never complete inside our docker containers.

I added the Parallelizable.Fixtures to the NUnit tests and they still works, but the xunit ones does not.

The VM I set up had only 1 vcpu (size: B1s in azure) but even as I changed it to two (B2s) it still behaved the same way (that is: xunit tests hangs, nunit tests works).

@bording
Copy link
Member

bording commented Feb 20, 2019

I was able to take a memory dump of the hanging test process, and after playing around with lldb for a while, it looks like most of the threads are sitting inside of the Task.Wait code. One thread was clearly doing that from inside the NServiceBus.Testing.

My current theory is that

sagaIsInvoked(saga, testContext).GetAwaiter().GetResult();
is the culprit.

I think we need to review all the places GetAwaiter().GetResult() is being called to synchronously wait for a Task to complete.

@andreasohlund
Copy link
Member

andreasohlund commented Feb 21, 2019 via email

@andreasohlund
Copy link
Member

Talking with @danielmarbach we don't see a quick fix for this. But we think that we've found a better workaround, add this before all your tests.

System.Threading.SynchronizationContext.SetSynchronizationContext(null);

That seemed to work when we changed it on the VM, can you verify?

Hopefully XUnit have some way to put that code in a single place.

@bording
Copy link
Member

bording commented Feb 21, 2019

If xUnit has a SynchronizationContext, I'm not sure it's a good idea to remove it?

@danielmarbach
Copy link
Contributor

It is there primarily for async void. You can restore it after the test in a finally block

@bording
Copy link
Member

bording commented Feb 21, 2019

Looks like xUnit has two that it uses:

The first one is for async void tests, and the second one is what controls the test concurrency, so it seems like trying to remove them would lead to undesired behavior. I've even found a GitHub issue saying that System.Threading.SynchronizationContext.SetSynchronizationContext(null) may not work.

@danielmarbach
Copy link
Contributor

As long as jens is aware that the tests using it break out from concurrency behavior and he doesn't apply it globally it should be fine fiven that the saga tests are independent

@danielmarbach
Copy link
Contributor

When we introduce an async fluent style we would have to release either suffixed methods or release a new major. Though already today we mention the AAA style should be preferred over the fluent style. See

https://docs.particular.net/nservicebus/testing/fluent

And

https://docs.particular.net/nservicebus/testing/#testing-a-saga

@bording
Copy link
Member

bording commented Feb 21, 2019

Though already today we mention the AAA style should be preferred over the fluent style.

Perhaps we should be more aggressive in moving to that being the only style, especially since the fluent style has these deadlock bombs hidden in it?

@andreasohlund
Copy link
Member

andreasohlund commented Feb 21, 2019 via email

@jenspettersson
Copy link
Author

I have absolutely no problems moving away from the fluent style testing. This particular service only have a few sagas with just a couple of tests, so no problems for us to rewrite those tests.

We haven't been using the fluent style testing for our regular handlers so it will be more consistent for us to change our saga tests.

Thanks for the information!

@andreasohlund andreasohlund self-assigned this Apr 3, 2019
@andreasohlund andreasohlund removed their assignment May 16, 2019
@simonfox
Copy link

simonfox commented Sep 6, 2019

We are just in the final stages of an upgrade from NSB5 to NSB7.

We have a suite of existing Fluent style tests which are suffering from the deadlock issue. I've read this xunit issue which confirms the problem as well as the position of the project in terms of doing anything to help (which is somewhat understandable but the tone is interesting). I can also confirm the suggested workarounds both work for me - disabling parallelization or removing the thread limit (however I'm not sure of the implications of that?).

The new AAA style is great, however converting our existing suites to that would be a massive undertaking and I can put up with slower execution. However I do feel like it would be feasible to convert .OnMessage to await .OnMessageAsync if it were available.

I guess in summary what I'm saying is

  • .OnMessageAsync would be great
  • maybe a warning in the upgrade guide would be useful to highlight that an upgraded Fluent style suite might fail on a runner doing parallelization (I was sweating that I'd done something wrong in async land when after finally completing the required changes to support the new api and running my tests they were deadlocking)

@danielmarbach
Copy link
Contributor

@simonfox would this satisfy your needs?

#148

@simonfox
Copy link

simonfox commented Sep 6, 2019

@danielmarbach absolutely!

Re the discussion on the PR - I would be fine with a major bump and replacement of sync entirely.

@danielmarbach
Copy link
Contributor

Closing this as the PR has been merged

@danielmarbach
Copy link
Contributor

@simonfox
Copy link

@danielmarbach perfect! Thanks so much.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

5 participants