Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Numerics.Tensors tests timing out under DOTNET_JITMinOpts=1 on x86 #97629

Closed
jakobbotsch opened this issue Jan 29, 2024 · 8 comments · Fixed by #98155
Closed

System.Numerics.Tensors tests timing out under DOTNET_JITMinOpts=1 on x86 #97629

jakobbotsch opened this issue Jan 29, 2024 · 8 comments · Fixed by #98155
Labels
area-System.Numerics.Tensors blocking-clean-ci-optional Blocking optional rolling runs Known Build Error Use this to report build issues in the .NET Helix tab
Milestone

Comments

@jakobbotsch
Copy link
Member

jakobbotsch commented Jan 29, 2024

Error Blob

{
  "ErrorMessage": "'System.Numerics.Tensors.Tests' END OF WORK ITEM LOG: Command timed out, and was killed",
  "BuildRetry": false,
  "ErrorPattern": "",
  "ExcludeConsoleLog": false
}

Reproduction Steps

These tests are timing out when run with DOTNET_JITMinOpts=1.

Example run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=543546&view=ms.vss-test-web.build-test-results-tab
Example log: https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-07f52de3725e4b168a/System.Numerics.Tensors.Net8.Tests/1/console.78bd6b1b.log?helixlogtype=result

@tannergooding @stephentoub are these tests long-running enough for this to be expected? Should we just disable them in stress configurations?

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=543546
Error message validated: 'System.Numerics.Tensors.Tests' END OF WORK ITEM LOG: Command timed out, and was killed
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 1/29/2024 9:36:30 AM UTC

Report

Build Definition Test Pull Request
558298 dotnet/runtime System.Numerics.Tensors.Tests.WorkItemExecution
557623 dotnet/runtime System.Numerics.Tensors.Tests.WorkItemExecution #98096
557612 dotnet/runtime System.Numerics.Tensors.Tests.WorkItemExecution #98107
557374 dotnet/runtime System.Numerics.Tensors.Tests.WorkItemExecution #98048
556608 dotnet/runtime System.Numerics.Tensors.Tests.WorkItemExecution
554712 dotnet/runtime System.Numerics.Tensors.Tests.WorkItemExecution
553688 dotnet/runtime System.Numerics.Tensors.Tests.WorkItemExecution #97897
551611 dotnet/runtime System.Numerics.Tensors.Tests.WorkItemExecution #97897
548901 dotnet/runtime System.Numerics.Tensors.Tests.WorkItemExecution
544009 dotnet/runtime System.Numerics.Tensors.Tests.WorkItemExecution
543546 dotnet/runtime System.Numerics.Tensors.Tests.WorkItemExecution

Summary

24-Hour Hit Count 7-Day Hit Count 1-Month Count
4 8 11
@jakobbotsch jakobbotsch added blocking-clean-ci-optional Blocking optional rolling runs Known Build Error Use this to report build issues in the .NET Helix tab labels Jan 29, 2024
@dotnet-issue-labeler dotnet-issue-labeler bot added the needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners label Jan 29, 2024
@ghost ghost added the untriaged New issue has not been triaged by the area owner label Jan 29, 2024
@ghost
Copy link

ghost commented Jan 29, 2024

Tagging subscribers to this area: @dotnet/area-system-numerics-tensors
See info in area-owners.md if you want to be subscribed.

Issue Details

Error Blob

{
  "ErrorMessage": "'System.Numerics.Tensors.Tests' END OF WORK ITEM LOG: Command timed out, and was killed",
  "BuildRetry": false,
  "ErrorPattern": "",
  "ExcludeConsoleLog": false
}

Reproduction Steps

These tests are timing out when run with DOTNET_JITMinOpts=1.

Example run: https://dev.azure.com/dnceng-public/public/_build/results?buildId=543546&view=ms.vss-test-web.build-test-results-tab
Example log: https://helixre107v0xdcypoyl9e7f.blob.core.windows.net/dotnet-runtime-refs-heads-main-07f52de3725e4b168a/System.Numerics.Tensors.Net8.Tests/1/console.78bd6b1b.log?helixlogtype=result

@tannergooding @stephentoub are these tests long-running enough for this to be expected? Should we just disable them in stress configurations?

Known issue validation

Build: 🔎 https://dev.azure.com/dnceng-public/public/_build/results?buildId=543546
Error message validated: 'System.Numerics.Tensors.Tests' END OF WORK ITEM LOG: Command timed out, and was killed
Result validation: ✅ Known issue matched with the provided build.
Validation performed at: 1/29/2024 9:36:21 AM UTC

Author: jakobbotsch
Assignees: -
Labels:

area-System.Numerics.Tensors, untriaged, blocking-clean-ci-optional, Known Build Error, needs-area-label

Milestone: -

@jakobbotsch jakobbotsch removed untriaged New issue has not been triaged by the area owner needs-area-label An area label is needed to ensure this gets routed to the appropriate area owners labels Jan 29, 2024
@jakobbotsch jakobbotsch added this to the 9.0.0 milestone Jan 29, 2024
@stephentoub
Copy link
Member

stephentoub commented Jan 29, 2024

are these tests long-running enough for this to be expected

They take ~40s in release on my machine. They also rely heavily on generic specialization, inlining, and unused code being eliminated early. Might be worth a short investigation to understand what's causing the timeouts, eg is it just the tests taking longer, is the jit itself actually taking too long, etc

@jakobbotsch
Copy link
Member Author

I took a quick look at one of the timeouts; that one seems to make good progress until System.Numerics.Tensors.Tests.ConvertTests.ConvertTruncatingAndSaturating. We end up in code like

if (typeof(TOther) == typeof(double))
{
double actualValue = (double)(object)value;
result = (Half)actualValue;
return true;
}
else if (typeof(TOther) == typeof(short))
{
short actualValue = (short)(object)value;
result = (Half)actualValue;
return true;
}
else if (typeof(TOther) == typeof(int))
{
int actualValue = (int)(object)value;
result = (Half)actualValue;
return true;
}
else if (typeof(TOther) == typeof(long))
{
long actualValue = (long)(object)value;
result = (Half)actualValue;
return true;
}
else if (typeof(TOther) == typeof(Int128))
{
Int128 actualValue = (Int128)(object)value;
result = (Half)actualValue;
return true;
}
else if (typeof(TOther) == typeof(nint))
{
nint actualValue = (nint)(object)value;
result = (Half)actualValue;
return true;
}
else if (typeof(TOther) == typeof(sbyte))
{
sbyte actualValue = (sbyte)(object)value;
result = actualValue;
return true;
}
else if (typeof(TOther) == typeof(float))
{
float actualValue = (float)(object)value;
result = (Half)actualValue;
return true;
}
else
{
result = default;
return false;
}

and
if (typeof(TOther) == typeof(double))
{
double actualValue = (double)(object)value;
result = (actualValue >= +170141183460469231731687303715884105727.0) ? MaxValue :
(actualValue <= -170141183460469231731687303715884105728.0) ? MinValue : (Int128)actualValue;
return true;
}
else if (typeof(TOther) == typeof(Half))
{
Half actualValue = (Half)(object)value;
result = (actualValue == Half.PositiveInfinity) ? MaxValue :
(actualValue == Half.NegativeInfinity) ? MinValue : (Int128)actualValue;
return true;
}
else if (typeof(TOther) == typeof(short))
{
short actualValue = (short)(object)value;
result = actualValue;
return true;
}
else if (typeof(TOther) == typeof(int))
{
int actualValue = (int)(object)value;
result = actualValue;
return true;
}
else if (typeof(TOther) == typeof(long))
{
long actualValue = (long)(object)value;
result = actualValue;
return true;
}
else if (typeof(TOther) == typeof(nint))
{
nint actualValue = (nint)(object)value;
result = actualValue;
return true;
}
else if (typeof(TOther) == typeof(sbyte))
{
sbyte actualValue = (sbyte)(object)value;
result = actualValue;
return true;
}
else if (typeof(TOther) == typeof(float))
{
float actualValue = (float)(object)value;
result = (actualValue >= +170141183460469231731687303715884105727.0f) ? MaxValue :
(actualValue <= -170141183460469231731687303715884105728.0f) ? MinValue : (Int128)actualValue;
return true;
}
else
{
result = default;
return false;
}

which, when unfolded, invokes the JIT_GetRuntimeType fcall a number of times, which is itself quite slow in checked builds. A perfview trace for a couple of minutes when running this test single threaded gives

image

image

I didn't see any indication that something is deadlocked, so seems like we should just disable under minopts/stress.

@jakobbotsch
Copy link
Member Author

jakobbotsch commented Feb 8, 2024

That test System.Numerics.Tensors.Tests.ConvertTests.ConvertTruncatingAndSaturating takes a long time with a checked runtime, even without JITMinOpts=1 (multiple minutes on my machine). Should perhaps be made a bit more lightweight.

@stephentoub
Copy link
Member

That test System.Numerics.Tensors.Tests.ConvertTests.ConvertTruncatingAndSaturating takes a long time with a checked runtime, even without JITMinOpts=1 (multiple minutes on my machine). Should perhaps be made a bit more lightweight.

The whole test suite takes ~40s on my machine. I'm surprised that one test would be taking minutes. That's with a release runtime?

@jakobbotsch
Copy link
Member Author

That test System.Numerics.Tensors.Tests.ConvertTests.ConvertTruncatingAndSaturating takes a long time with a checked runtime, even without JITMinOpts=1 (multiple minutes on my machine). Should perhaps be made a bit more lightweight.

The whole test suite takes ~40s on my machine. I'm surprised that one test would be taking minutes. That's with a release runtime?

No, sorry (I edited my comment) -- with a checked runtime but with optimized codegen, I meant.

@ghost ghost added the in-pr There is an active PR which will close this issue when it is merged label Feb 8, 2024
@tannergooding
Copy link
Member

I'm not sure how much more light weight such tests can really be made either. We have complex functionality here that needs to be tested. Due to the API signature and the need to use generics, we need to type check to ensure efficiency, and we need to test a range of edge case behaviors.

Would splitting it out into more tests help here?

@ghost ghost removed the in-pr There is an active PR which will close this issue when it is merged label Feb 8, 2024
@jakobbotsch
Copy link
Member Author

Would splitting it out into more tests help here?

I think that would help as well.

I do wonder if the overall total of conversions could be reduced, however. Currently it is doing 17 * 17 (all from/to combinations) * 256 * 255 (different tensor lengths) conversions. Perhaps not every combination of from/to needs to try all possible tensor lengths in that range.

@github-actions github-actions bot locked and limited conversation to collaborators Mar 12, 2024
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-System.Numerics.Tensors blocking-clean-ci-optional Blocking optional rolling runs Known Build Error Use this to report build issues in the .NET Helix tab
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants