Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

pauseonstart test failures #38847

Closed
BruceForstall opened this issue Jul 7, 2020 · 8 comments · Fixed by #39807
Closed

pauseonstart test failures #38847

BruceForstall opened this issue Jul 7, 2020 · 8 comments · Fixed by #39807
Assignees
Labels
area-Tracing-coreclr blocking-outerloop Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs
Milestone

Comments

@BruceForstall
Copy link
Member

tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.cmd has failed in multiple CI runs:

R2R Windows_NT x64 Checked forcerelocs @ Windows.10.Amd64.Open

set RunCrossGen=1 
set COMPlus_TieredCompilation=0
set COMPlus_ForceRelocs=1

https://dev.azure.com/dnceng/public/_build/results?buildId=716943&view=ms.vss-test-web.build-test-results-tab&runId=22218608&paneView=dotnet-dnceng.dnceng-build-release-tasks.helix-test-information-tab&resultId=110533

Collected crash dump C:\cores\crashdump_6592.dmp
    tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.cmd [FAIL]
      
      cmdLine:C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.cmd Timed Out
      
      Return code:      -100
      Raw output file:      C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\Reports\tracing.eventpipe\pauseonstart\pauseonstart\pauseonstart.output.txt
      Raw output:
      BEGIN EXECUTION
              1 file(s) copied.
      " "C:\h\w\977C081E\p\crossgen.exe"  /Platform_Assemblies_Paths C:\h\w\977C081E\p;C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart /in C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.org /out C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart\\pauseonstart.dll"
      Microsoft (R) CoreCLR Native Image Generator - Version 5.0.0-ci
      Copyright (c) Microsoft Corporation.  All rights reserved.
      
      Native image C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart\\pauseonstart.dll generated successfully.
       "C:\h\w\977C081E\p\corerun.exe" pauseonstart.dll 
        0.0s: ::== Running test: TEST_RuntimeResumesExecutionWithCommand
        0.1s: Server name is 'DOTNET_TRACE_TESTS_4hx5s0ds.ucu'
        0.7s: running sub-process: C:\h\w\977C081E\p\CoreRun.exe C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.dll 0
        1.2s: subprocess started: True
        1.2s: subprocess PID: 5256
        2.8s: { Magic=System.Byte[]; ClrInstanceId=fe0342ed-a66a-d806-9784-9bed6388f8fc; ProcessId=5256; Unused=0; }
        2.9s: Sent: { Header={ Magic=System.Byte[]; Size=20; CommandSet=255; CommandId=1; Reserved=0 }; }
        3.1s: received: { Header={ Magic=System.Byte[]; Size=24; CommandSet=255; CommandId=0; Reserved=0 }; Payload=[ '0000' '0000' '0000' '0000'  ]}
        3.1s: Sending 'exit' to subprocess stdin
        3.9s: ================= Subprocess Exited =================
        3.9s: SubProcess exited - Exit code: 0
        3.9s: Subprocess stdout: 
        1.0s: The runtime has been configured to pause during startup and is awaiting a Diagnostics IPC ResumeStartup command from a server at 'DOTNET_TRACE_TESTS_4hx5s0ds.ucu'.
        3.1s: Subprocess started!  Waiting for input...
        3.1s: Received 'exit'.  Exiting...
        3.9s: Subprocess stderr: 
        4.0s: Test passed: True
        4.0s: 
        4.0s: ::== Running test: TEST_TracesHaveRelevantEvents
        4.1s: Server name is 'DOTNET_TRACE_TESTS_5oyavcdg.fvm'
        4.1s: running sub-process: C:\h\w\977C081E\p\CoreRun.exe C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.dll 0
        4.1s: subprocess started: True
        4.1s: subprocess PID: 2216
        4.1s: Standard Diagnostics Server connection not created yet -> try again in 100 ms
        4.3s: Standard Diagnostics Server connection not created yet -> try again in 100 ms
        4.5s: { Magic=System.Byte[]; ClrInstanceId=fd234556-3fe8-a13b-8e71-270d82f351c3; ProcessId=2216; Unused=0; }
        4.6s: Starting EventPipeSession over standard connection
        4.7s: Started EventPipeSession over standard connection with session id: 0x21112581990
        4.8s: Send ResumeRuntime Diagnostics IPC Command
        4.8s: Sent: { Header={ Magic=System.Byte[]; Size=20; CommandSet=255; CommandId=1; Reserved=0 }; }
        4.8s: received: { Header={ Magic=System.Byte[]; Size=24; CommandSet=255; CommandId=0; Reserved=0 }; Payload=[ '0000' '0000' '0000' '0000'  ]}
        6.8s: Stopping EventPipeSession over standard connection
        7.0s: Stopped EventPipeSession over standard connection
        7.0s: Sending 'exit' to subprocess stdin
        7.1s: ================= Subprocess Exited =================
        7.1s: SubProcess exited - Exit code: 0
        7.1s: Subprocess stdout: 
        0.3s: The runtime has been configured to pause during startup and is awaiting a Diagnostics IPC ResumeStartup command from a server at 'DOTNET_TRACE_TESTS_5oyavcdg.fvm'.
        1.8s: Subprocess started!  Waiting for input...
        3.0s: Received 'exit'.  Exiting...
        7.1s: Subprocess stderr: 
        7.8s: isStartupEventPresent: True
        7.8s: Test passed: True
        7.8s: 
        7.8s: ::== Running test: TEST_MultipleSessionsCanBeStartedWhilepaused
        7.8s: Server name is 'DOTNET_TRACE_TESTS_m2gvge2y.off'
        7.8s: running sub-process: C:\h\w\977C081E\p\CoreRun.exe C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.dll 0
        7.8s: subprocess started: True
        7.8s: subprocess PID: 7056
        7.8s: Standard Diagnostics Server connection not created yet -> try again in 100 ms
        8.1s: { Magic=System.Byte[]; ClrInstanceId=8c074577-3971-cf98-8fa7-a94ed870796a; ProcessId=7056; Unused=0; }
        8.1s: Starting EventPipeSession over standard connection
        8.2s: Started EventPipeSession over standard connection with session id: 0x1dc8bd901c0
        8.2s: Starting EventPipeSession over standard connection
        8.2s: Started EventPipeSession over standard connection with session id: 0x1dc8bd8ffc0
        8.2s: Starting EventPipeSession over standard connection
        8.2s: Started EventPipeSession over standard connection with session id: 0x1dc8bd8fd40
        8.2s: Send ResumeRuntime Diagnostics IPC Command
        8.2s: Sent: { Header={ Magic=System.Byte[]; Size=20; CommandSet=255; CommandId=1; Reserved=0 }; }
        8.2s: received: { Header={ Magic=System.Byte[]; Size=24; CommandSet=255; CommandId=0; Reserved=0 }; Payload=[ '0000' '0000' '0000' '0000'  ]}
       10.9s: Stopping EventPipeSession over standard connection
       11.2s: Stopped EventPipeSession over standard connection
       11.2s: Sending 'exit' to subprocess stdin
       11.3s: ================= Subprocess Exited =================
       11.3s: SubProcess exited - Exit code: 0
       11.3s: Subprocess stdout: 
        0.3s: The runtime has been configured to pause during startup and is awaiting a Diagnostics IPC ResumeStartup command from a server at 'DOTNET_TRACE_TESTS_m2gvge2y.off'.
        1.3s: Subprocess started!  Waiting for input...
        3.5s: Received 'exit'.  Exiting...
       11.3s: Subprocess stderr: 
       11.3s: nStartupEventsSeen: 3
       11.3s: Test passed: True
       11.3s: 
       11.3s: ::== Running test: TEST_CanStartAndStopSessionWhilepaused
       11.3s: Server name is 'DOTNET_TRACE_TESTS_eqlrpwut.bif'
       11.3s: running sub-process: C:\h\w\977C081E\p\CoreRun.exe C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.dll 0
       11.4s: subprocess started: True
       11.4s: subprocess PID: 2616
       11.4s: Standard Diagnostics Server connection not created yet -> try again in 100 ms
       11.7s: Standard Diagnostics Server connection not created yet -> try again in 100 ms
       11.9s: { Magic=System.Byte[]; ClrInstanceId=67f94336-c91d-3fa7-a884-938f26d09ca4; ProcessId=2616; Unused=0; }
       11.9s: Starting EventPipeSession over standard connection
       11.9s: Started EventPipeSession over standard connection with session id: 0x1dbd77632b0
       11.9s: Starting EventPipeSession over standard connection
       11.9s: Started EventPipeSession over standard connection with session id: 0x1dbd7762e30
       11.9s: Starting EventPipeSession over standard connection
       11.9s: Started EventPipeSession over standard connection with session id: 0x1dbd7762f30
       12.9s: Stopping EventPipeSession over standard connection
       12.9s: Stopped EventPipeSession over standard connection
       12.9s: Send ResumeRuntime Diagnostics IPC Command
       12.9s: Sent: { Header={ Magic=System.Byte[]; Size=20; CommandSet=255; CommandId=1; Reserved=0 }; }
       12.9s: received: { Header={ Magic=System.Byte[]; Size=24; CommandSet=255; CommandId=0; Reserved=0 }; Payload=[ '0000' '0000' '0000' '0000'  ]}
       12.9s: Sending 'exit' to subprocess stdin
       13.9s: ================= Subprocess Exited =================
       13.9s: SubProcess exited - Exit code: 0
       13.9s: Subprocess stdout: 
        0.3s: The runtime has been configured to pause during startup and is awaiting a Diagnostics IPC ResumeStartup command from a server at 'DOTNET_TRACE_TESTS_eqlrpwut.bif'.
        2.4s: Subprocess started!  Waiting for input...
        2.5s: Received 'exit'.  Exiting...
       13.9s: Subprocess stderr: 
       13.9s: Test passed: True
       13.9s: 
       13.9s: ::== Running test: TEST_DisabledCommandsError
       13.9s: Server name is 'DOTNET_TRACE_TESTS_r0gx5vln.r1j'
       13.9s: running sub-process: C:\h\w\977C081E\p\CoreRun.exe C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.dll 0
       13.9s: subprocess started: True
       13.9s: subprocess PID: 3304
       13.9s: Standard Diagnostics Server connection not created yet -> try again in 100 ms
       14.3s: { Magic=System.Byte[]; ClrInstanceId=b77c4a04-4e10-da4e-9134-cb0a04cf210e; ProcessId=3304; Unused=0; }
       14.3s: Send profiler attach Diagnostics IPC Command
       14.3s: Sent: { Header={ Magic=System.Byte[]; Size=20; CommandSet=3; CommandId=1; Reserved=0 }; }
       14.3s: received: { Header={ Magic=System.Byte[]; Size=24; CommandSet=255; CommandId=255; Reserved=0 }; Payload=[ '005B' '0013' '0013' '0080'  ]}
       14.6s: { Magic=System.Byte[]; ClrInstanceId=b77c4a04-4e10-da4e-9134-cb0a04cf210e; ProcessId=3304; Unused=0; }
       14.6s: Send ResumeRuntime Diagnostics IPC Command
       14.6s: Sent: { Header={ Magic=System.Byte[]; Size=20; CommandSet=255; CommandId=1; Reserved=0 }; }
      
      cmdLine:C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.cmd Timed Out
      Test Harness Exitcode is : -100
      To run the test:
      > set CORE_ROOT=C:\h\w\977C081E\p
      > C:\h\w\977C081E\w\A174094B\e\tracing\eventpipe\pauseonstart\pauseonstart\pauseonstart.cmd
      Expected: True
      Actual:   False

Also: all the Windows x86 legs of runtime-coreclr jitstress-isas-x86:
https://dev.azure.com/dnceng/public/_build/results?buildId=716582&view=ms.vss-test-web.build-test-results-tab&runId=22207570&resultId=110525&paneView=debug

@josalem

@BruceForstall
Copy link
Member Author

@BruceForstall BruceForstall added the blocking-outerloop Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs label Jul 7, 2020
BruceForstall added a commit to BruceForstall/runtime that referenced this issue Jul 7, 2020
@tommcdon tommcdon added p1 and removed untriaged New issue has not been triaged by the area owner labels Jul 7, 2020
@tommcdon tommcdon added this to the 5.0.0 milestone Jul 7, 2020
@josalem
Copy link
Contributor

josalem commented Jul 7, 2020

Taking a look at the history on this test, I think the first failures happened on Friday last week. This indicates to me that recent changes may have caused this failure. Let me look into recent changes and see if anything stands out.

@josalem
Copy link
Contributor

josalem commented Jul 21, 2020

I'll be tracking my research into this failure in this issue.

I've got a repro of this under the debugger locally. It looks like somehow, the polling loop rarely gets in a weird state on Windows x86 where the first two entries in the pHandles array fed to WaitForMultipleObjects both point to the server instead of one of them pointing to the reversed connection. The server doesn't have any input and it isn't waiting on the reverse connection's event handle, so the WaitForMultipleObjects isn't returning.

// call wait for multiple obj
DWORD dwWait = WaitForMultipleObjects(
nHandles, // count
pHandles, // handles
false, // Don't wait-all
timeoutMs);

The reverse connection's handle (0x01b8) is set and the server's handle (0x01b0) is not set, which is correct. Both values in pHandles are 0x01b0.

0:001> !handle 0x01b0 f
Handle 1b0
  Type         	Event
  Attributes   	0
  GrantedAccess	0x1f0003:
         Delete,ReadControl,WriteDac,WriteOwner,Synch
         QueryState,ModifyState
  HandleCount  	2
  PointerCount 	65535
  Name         	<none>
  Object Specific Information
    Event Type Manual Reset
    Event is Waiting
0:001> !handle 0x01b8 f
Handle 1b8
  Type         	Event
  Attributes   	0
  GrantedAccess	0x1f0003:
         Delete,ReadControl,WriteDac,WriteOwner,Synch
         QueryState,ModifyState
  HandleCount  	2
  PointerCount 	65535
  Name         	<none>
  Object Specific Information
    Event Type Manual Reset
    Event is Set

I'm not sure how this could happen, though. I'm going to see about adding some additional logging and seeing if I can pinpoint what happened.

@BruceForstall
Copy link
Member Author

@josalem
Copy link
Contributor

josalem commented Jul 21, 2020

I think we should turn this test off for Windows x86 until I can diagnose what this issue is.

@BruceForstall
Copy link
Member Author

@josalem
Copy link
Contributor

josalem commented Jul 22, 2020

Well, darn... That looks to be the same symptom. I hadn't seen it on x64 yet. It definitely seems more prevalent on x86. I'll turn the test off for all Windows legs, then spina draft PR with the tests turned back on to do further CI triaging.

@josalem
Copy link
Contributor

josalem commented Jul 23, 2020

I've been investigating this some more today and I've got a hypothesis of what's going on.

There is a missing else statement in this chunk of code that will result in the first element of pHandles not being set. When we call Read with overlapped IO, there is a possibility there was already data to be read and it succeeds. In this case, the first element isn't set and whatever data is in the heap at that location is left alone. If this is an invalid handle, WaitForMultipleObjects will error and things correct themselves on the next iteration. If, however, the data that's in the first element is any other valid waitable handle, then we will never return from WaitForMultipleObjects.

HANDLE *pHandles = new HANDLE[nHandles];
for (uint32_t i = 0; i < nHandles; i++)
{
rgIpcPollHandles[i].revents = 0; // ignore any inputs on revents
if (rgIpcPollHandles[i].pIpc != nullptr)
{
// SERVER
_ASSERTE(rgIpcPollHandles[i].pIpc->mode == DiagnosticsIpc::ConnectionMode::SERVER);
pHandles[i] = rgIpcPollHandles[i].pIpc->_oOverlap.hEvent;
}
else
{
// CLIENT
bool fSuccess = false;
DWORD dwDummy = 0;
if (!rgIpcPollHandles[i].pStream->_isTestReading)
{
// check for data by doing an asynchronous 0 byte read.
// This will signal if the pipe closes (hangup) or the server
// sends new data
fSuccess = ::ReadFile(
rgIpcPollHandles[i].pStream->_hPipe, // handle
nullptr, // null buffer
0, // read 0 bytes
&dwDummy, // dummy variable
&rgIpcPollHandles[i].pStream->_oOverlap); // overlap object to use
rgIpcPollHandles[i].pStream->_isTestReading = true;
if (!fSuccess)
{
DWORD error = ::GetLastError();
switch (error)
{
case ERROR_IO_PENDING:
pHandles[i] = rgIpcPollHandles[i].pStream->_oOverlap.hEvent;
break;
case ERROR_PIPE_NOT_CONNECTED:
// hangup
rgIpcPollHandles[i].revents = (uint8_t)PollEvents::HANGUP;
delete[] pHandles;
return -1;
default:
if (callback != nullptr)
callback("0 byte async read on client connection failed", error);
delete[] pHandles;
return -1;
}
}
}
else
{
pHandles[i] = rgIpcPollHandles[i].pStream->_oOverlap.hEvent;
}
}
}
// call wait for multiple obj
DWORD dwWait = WaitForMultipleObjects(
nHandles, // count
pHandles, // handles
false, // Don't wait-all
timeoutMs);

I have observed it happening a couple times locally and it looks like that is what is happening. The only reason I think that's the case, is I collected the following stress log:

// we put 0x01b0 (432) in pHandles[1], and the debugger has 0x01bc (444) in pHandles[0]
 7f4   0.348469973 : `DIAGNOSTICS_PORT`   warning (432): Added SERVER handle to pHandles.

 7f4   0.348466722 : `DIAGNOSTICS_PORT`   IpcStreamFactory::GetNextAvailableStream - Poll attempt: 1, timeout: -1ms.

 7f4   0.345858678 : `DIAGNOSTICS_PORT`   IpcStreamFactory::GetNextAvailableStream - NON :: Poll attempt: 2, connection 1 had no events.

 7f4   0.345858623 : `DIAGNOSTICS_PORT`   IpcStreamFactory::GetNextAvailableStream - SIG :: Poll attempt: 2, connection 0 signalled.

 7f4   0.345854798 : `DIAGNOSTICS_PORT`   warning (432): Added SERVER handle to pHandles.

 7f4   0.345854749 : `DIAGNOSTICS_PORT`   warning (444): Added CLIENT handle to pHandles.

 7f4   0.345854614 : `DIAGNOSTICS_PORT`   IpcStreamFactory::GetNextAvailableStream - Poll attempt: 2, timeout: -1ms.

 7f4   0.345854353 : `DIAGNOSTICS_PORT`   IpcStreamFactory::GetNextAvailableStream - NON :: Poll attempt: 1, connection 1 had no events.

 7f4   0.345854293 : `DIAGNOSTICS_PORT`   IpcStreamFactory::GetNextAvailableStream - NON :: Poll attempt: 1, connection 0 had no events.

 7f4   0.345854114 : `DIAGNOSTICS_PORT`   warning (6): WaitForMultipleObjects failed.

 7f4   0.345849419 : `DIAGNOSTICS_PORT`   warning (432): Added SERVER handle to pHandles.

 7f4   0.345846088 : `DIAGNOSTICS_PORT`   IpcStreamFactory::GetNextAvailableStream - Poll attempt: 1, timeout: -1ms.

In the hung state, the debugger says that the first element of pHandles is 0x01bc (444) which was the handle for the client from the previous call to GetNextAvailableStream. It's presumably been reused by the kernel since it is a valid handle again. My assertion is that the same memory location was given back to the call to new HANDLE[nHandles] from the previous call to Poll. Effectively, we race with the kernel recycling handles and the rare condition that malloc gives us either the same memory or memory containing junk data that happens to be a valid handle.

The fix would be to just put the else statement back in.

@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
area-Tracing-coreclr blocking-outerloop Blocking the 'runtime-coreclr outerloop' and 'runtime-libraries-coreclr outerloop' runs
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants