
Massive xunit test performance degradation in .NET Core 2.1.700 and 2.2.300 #2775

Closed
JSkimming opened this issue May 22, 2019 · 29 comments · Fixed by microsoft/vstest#2024

@JSkimming

JSkimming commented May 22, 2019

TL;DR

Update: The issue has been found and is now tracked via PR microsoft/vstest#2024.

Update 2: The issue has now been fixed in the .NET SDK releases 2.1.701 & 2.2.301, as demonstrated by the following builds.

General

Operating System

  • OS: Linux - specifically containers
  • Versions: 2.1.700 and 2.2.300

I've found that the latest releases of .NET Core 2.1 & 2.2 have introduced a performance degradation when executing xunit tests via dotnet test.

I have found this with 3 different OSS projects I maintain, and across two different CI providers (CircleCI & Travis). The specific details are below.

FYI, I haven't raised this against xunit, because the only difference between the fast and slow runs is the .NET Core release.

The following are builds of identical source code run against four different releases of .NET Core in CircleCI, using containers.

Please see the compare-dotnet-releases branch of https://github.com/JSkimming/abioc.

As you can see, the test execution takes an order of magnitude longer.

Here is the same build in Travis. NOTE: the Mac build is fine because it is pinned to an earlier version of .NET Core.

Here are two completely different projects, both exhibiting the same issue in CircleCI and Travis:

https://circleci.com/gh/JSkimming/tesla-net/tree/master
https://travis-ci.org/JSkimming/tesla-net/builds/531160500
https://circleci.com/gh/JSkimming/Castle.Core.AsyncInterceptor/tree/master
https://travis-ci.org/JSkimming/Castle.Core.AsyncInterceptor/builds/530615608

Update

Latest builds with timings between steps

@karelz
Member

karelz commented May 22, 2019

Did you get a chance to collect traces and compare where the difference comes from?
Given that the latest version is fast, is it worth investigating?

cc @adamsitnik @billwert @brianrob

@JSkimming
Author

Given that the latest version is fast, is it worth investigating?

@karelz It's the latest versions that are slow.

Did you get a chance to collect traces and compare where the difference comes from?

Not yet, how should I go about collecting traces?

@karelz
Member

karelz commented May 22, 2019

Oh, I didn't notice it is minutes, not seconds. That is bad.
@adamsitnik can you please take a look? This looks serious.

@livarcocc @nguerrera is it a known issue?

@JSkimming
Author

JSkimming commented May 22, 2019

@karelz I'm unable to replicate it on my machine 😉

[works-on-my-machine meme]

I've tried on Windows, WSL and Docker (there's a helper Dockerfile in the root of the repo), but given it's happening on two different CI systems and three different OSS projects, I figured it was worth reporting before being able to replicate it on my laptop.

The fastest way to execute the tests is to run coverage.cmd on Windows and coverage.sh on Linux, though coverage.cmd only tests .NET Framework.
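
For example, on Linux that looks roughly like this (a sketch; it assumes the scripts sit at the repo root):

# Clone the repo and run the Linux test/coverage script.
git clone https://github.com/JSkimming/abioc.git
cd abioc
./coverage.sh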

@karelz
Member

karelz commented May 22, 2019

OK, that is interesting - I assume we will have similar problems replicating it.
Are you able to reproduce it on a similar machine setup in the cloud (a VM similar to those used by the CI systems)?

@JSkimming
Author

I assume we will have similar problems replicating it.

Quite possibly.

Are you able to reproduce it on a similar machine setup in the cloud (a VM similar to those used by the CI systems)?

Nope, I haven't tried that. I'm not in a position to try that at the moment either.

@brianrob
Member

I'm going to try the Dockerfile out and see if I can reproduce the regression.

@nguerrera
Contributor

@livarcocc @nguerrera is it a known issue?

First I've heard of it.

@nguerrera
Contributor

As you can see, the test execution takes an order of magnitude longer.

Strangely, the dotnet test runs are reporting mere seconds of total time while the overall job takes much longer. Is it possible to get timestamps at the start/end of each command?

@nguerrera
Contributor

CC @singhsarab

JSkimming added a commit to JSkimming/abioc that referenced this issue May 22, 2019
@JSkimming
Author

JSkimming commented May 22, 2019

I'm going to try the Dockerfile out and see if I can reproduce the regression.

@brianrob I've just pushed an updated Dockerfile, and I'm experimenting with tweaking the following:

docker build --cpuset-cpus=0 --cpuset-mems=0 --cpu-period=1000 --build-arg CACHE_BUST=1 .

You need to keep changing CACHE_BUST; Docker then reuses the cached layers up to the test-execution step, so only the tests re-run.

Is it possible to get timestamps at the start/end of each command?

@nguerrera I've just pushed an update to display the dates; check out the builds of this PR: JSkimming/abioc#82
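
For anyone curious, the wrapper amounts to something like this (a sketch; the actual script in that PR may differ, and SECONDS is the bash built-in that counts seconds since the shell started):

# Print a UTC timestamp and the elapsed time, then run the wrapped command.
step() {
  date -u
  echo "$((SECONDS / 60)) minutes and $((SECONDS % 60)) seconds elapsed."
  "$@"
}

step dotnet test --no-restore --no-build -f netcoreapp2.1 -c Release test/Abioc.Tests.Internal/Abioc.Tests.Internal.csproj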

@brianrob
Member

It seems that I'm not able to reproduce this either.

@karelz
Member

karelz commented May 22, 2019

@brianrob is there something we can ask @JSkimming to collect that we can then analyze?

@brianrob
Member

Yes. If you're able to reproduce the issue, please follow the instructions at https://github.com/dotnet/coreclr/blob/master/Documentation/project-docs/linux-performance-tracing.md to create a trace for each of the good and bad cases. You don't need to capture the whole thing - let's just start with 20 seconds for each.
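
For reference, the key commands from that doc look roughly like this (a sketch; see the linked doc for the authoritative steps):

# One-time setup: download perfcollect and install its prerequisites (perf, LTTng).
curl -OL https://aka.ms/perfcollect
chmod +x perfcollect
sudo ./perfcollect install

# Shell 1: start collecting (stop with Ctrl+C after ~20 seconds).
sudo ./perfcollect collect sampleTrace

# Shell 2: enable perf maps so managed frames resolve, then run the slow scenario.
export COMPlus_PerfMapEnabled=1
dotnet test ...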

@JSkimming
Author

@brianrob any tips on how to get that running in a hosted CI system?

So far I've only been able to replicate the issue on CircleCI and Travis.

Those instructions involve a lot of sudo and multiple shell windows open at once, neither of which I'm sure I can do easily. I believe I can configure Travis to use sudo, though I think that will change it from container builds to VMs, which may no longer exhibit the issue.

If it's something that can be enabled on Azure DevOps, I'm willing to set up the build there too, though until I try, I don't know if I'll be able to replicate things.

@JSkimming
Author

Strangely, the dotnet test runs are reporting mere seconds of total time while the overall job takes much longer. Is it possible to get timestamps at the start/end of each command?

@nguerrera I've updated the main description with links to the builds with timings between the commands.

@karelz
Member

karelz commented May 22, 2019

@JSkimming sadly we are not familiar with those environments.
If it is related only to a specific environment (a special container in the CI systems), it may very well be a problem in the environment setup ...

@JSkimming
Author

JSkimming commented May 23, 2019

OK, I think I've found the issue.

TL;DR

On a hunch, I changed the script to stop displaying the output of dotnet test, and things are back to normal. See here for CircleCI, and here for Travis.
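
For illustration, the change amounts to something like this in the CI scripts (a sketch; the real scripts and the log file name may differ):

# Send the noisy dotnet test output to a file instead of the CI log,
# and only dump it when the run fails.
dotnet test --no-restore --no-build -f netcoreapp2.1 -c Release \
  test/Abioc.Tests.Internal/Abioc.Tests.Internal.csproj \
  > dotnet-test.log 2>&1 || { cat dotnet-test.log; exit 1; }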

The Hunch

I noticed the test output flickered when displaying a message.

Test run in progress...

Here's an example:

[Animated capture of the flickering "Test run in progress..." output]

I get the impression the CLI is writing that message very frequently, which eventually results in the flickering because a buffer somewhere fills up.

Hosted CI systems capture the stdout of the tools they execute, but in their case all those writes go through code (which ultimately streams to the browser) rather than directly to a terminal, and the flood of stdout writes from dotnet test is overwhelming them.

This also explains the discrepancy between the test execution timings reported by dotnet test and the actual timings I inserted. See below: the CLI states the tests executed in 2.1397 seconds, whereas the real wall-clock time is 10m 49s.

Wed May 22 20:53:05 UTC 2019
0 minutes and 7 seconds elapsed.

$ dotnet test --no-restore --no-build -f netcoreapp2.1 -c Release /root/project/test/Abioc.Tests.Internal/Abioc.Tests.Internal.csproj --results-directory /root/project/test/TestResults/output/ --logger "trx;LogFileName=Abioc.Tests.Internal.trx"
Test run for /root/project/test/Abioc.Tests.Internal/bin/Release/netcoreapp2.1/Abioc.Tests.Internal.dll(.NETCoreApp,Version=v2.1)
Microsoft (R) Test Execution Command Line Tool Version 16.1.0
Copyright (c) Microsoft Corporation.  All rights reserved.

Starting test execution, please wait...
Test run in progress.Test run in progress ... Results File: /root/project/test/TestResults/output/Abioc.Tests.Internal.trx

Test Run Successful.
Total tests: 310
     Passed: 310
 Total time: 2.1397 Seconds

Wed May 22 21:03:54 UTC 2019
10 minutes and 56 seconds elapsed.

I'm not sure where to start looking for what changed in the dotnet CLI that caused all this. All scenarios are using v16.1.0 of vstest.

Actually, I misread the versions; the issue was introduced in v16.1.0. See below.

@karelz
Member

karelz commented May 23, 2019

Awesome to see we are closer to a root cause!
@nguerrera do you have an idea whether the CLI is responsible for the too-frequent text updates in this case?

@nguerrera
Contributor

Looks like microsoft/vstest#1964

@singhsarab

@JSkimming
Author

JSkimming commented May 23, 2019

@nguerrera Ahh, I misread the numbers. The earlier (working) release was 16.0.1, and those changes went into 16.1.0. So it is in vstest.

@JSkimming
Author

From a quick reading of the code I think I may have identified the issue.

The ProgressIndicator clears and redisplays the progress message whenever it receives successive calls to Pause() then Start().

This occurs whenever ConsoleLogger.TestMessageHandler() or ConsoleLogger.TestResultHandler() is called, regardless of the verbosity level.

As a result, the progress message is constantly being cleared and redrawn, even when the verbosity level means no log message would actually be written.

I'm going to have a look at a potential fix, and submit a PR later. I'll keep you updated on progress.

@nguerrera
Contributor

@karelz This should be moved to microsoft/vstest repo.

@nguerrera
Contributor

@JSkimming Awesome work tracking this down!

@JSkimming
Author

@karelz I've raised this PR: microsoft/vstest#2024. All feedback is welcome.

Do you want to raise this as an issue in the microsoft/vstest repo? I'm happy to do it if it helps.

@karelz
Member

karelz commented May 24, 2019

Awesome. Please open an issue there if they need it, but maybe the PR is enough. We can't move issues cross-org; ZenHub would just make a copy.

Awesome work, thanks a ton!!!

@karelz karelz closed this as completed May 24, 2019
@tremblaysimon

I observed a degradation of performance even in a Linux host environment (not only in a CI environment).

One of our test projects (around 2,800 tests) took about 30 seconds on 2.1.604, and now it takes about 1.87 minutes. That is about 4.5 times longer to run the tests than on the previous SDK.

Do you know if the PR involved will fix that too?

@JSkimming
Author

@karelz this issue is still not resolved.

The fix I implemented was merged on 28th May, and the NuGet package was released on 30th May, but until there is a .NET Core SDK release into which the latest version of VSTest is "inserted", the problem persists.

Is there any update on when a new SDK will be released incorporating the latest VSTest?

@JSkimming
Author

The .NET SDK releases 2.1.701 & 2.2.301 are now available and contain the fixes.
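
For anyone verifying locally, a quick way to confirm the fixed SDK is the one in use (assuming dotnet is on the PATH):

# List installed SDKs and show which version the current directory resolves to.
dotnet --list-sdks
dotnet --version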
