Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle success case for overlapped IO in Windows Diagnostics IPC PAL #39807

Merged

Conversation

josalem
Copy link
Contributor

@josalem josalem commented Jul 23, 2020

fixes #38847 (see the issue for more details on my investigation)

There is a missing else condition on the error checking for a call to Read with OverlappedIO. In this case, the value in pHandles[0] is unset. pHandles[0] could be:

  1. complete junk -> WaitForMultipleObjects will fail and everything corrects on the second call to Poll
  2. any other waitable handle -> WaitForMultipleObjects is now waiting on the forward server connection which won't signal and some other handle which presumably also won't signal.

In the second case, we will hang and timeout the test. The runtime is effectively single threaded in this state, because the only other thread is waiting in ceemain.cpp on the Pause On Start event.

CC @tommcdon @BruceForstall @dotnet/dotnet-diag

@josalem josalem added this to the 5.0.0 milestone Jul 23, 2020
@josalem josalem requested review from noahfalk and sywhang July 23, 2020 00:58
@josalem josalem self-assigned this Jul 23, 2020
@ghost
Copy link

ghost commented Jul 23, 2020

Tagging subscribers to this area: @tommcdon
See info in area-owners.md if you want to be subscribed.

@josalem
Copy link
Contributor Author

josalem commented Jul 23, 2020

AzDO says the legs marked "running" actually completed successfully, so the CI is actually all green. This happened on another PR of mine.

@josalem
Copy link
Contributor Author

josalem commented Jul 23, 2020

#39789 didn't automerge and turn the pauseonstart test off (due to the same GH issue I mentioned above). This worked out in my favor since a failure early this morning used the new dumping logic that I added to CI and the dump shows the same symptom that I believe this PR is fixing, i.e., WaitForMultipleObjects getting a "junk" value that is a valid handle from elsewhere in the process. I believe this PR should resolve the issue.

0:001> ?? pHandles[0]
void * 0x0000012c <----- SERVER handle
0:001> ?? pHandles[1]
void * 0x0000012c <----- SERVER handle
0:001> ?? rgIpcPollHandles[0]
struct IpcStream::DiagnosticsIpc::IpcPollHandle
   +0x000 pIpc             : (null) 
   +0x004 pStream          : 0x03550710 IpcStream
   +0x008 revents          : 0 ''
   +0x00c pUserData        : 0x0351ffd8 Void
0:001> dx -r1 ((coreclr!IpcStream *)0x3550710)
((coreclr!IpcStream *)0x3550710)                 : 0x3550710 [Type: IpcStream *]
    InfiniteTimeout  : -1 [Type: int]
    [+0x000] _hPipe           : 0x130 [Type: void *]
    [+0x004] _oOverlap        [Type: _OVERLAPPED]
    [+0x018] _isTestReading   : 1 [Type: int]
    [+0x01c] _mode            : CLIENT (0) [Type: IpcStream::DiagnosticsIpc::ConnectionMode]
0:001> dx -r1 (*((coreclr!_OVERLAPPED *)0x3550714))
(*((coreclr!_OVERLAPPED *)0x3550714))                 [Type: _OVERLAPPED]
    [+0x000] Internal         : 0x0 [Type: unsigned long]
    [+0x004] InternalHigh     : 0x0 [Type: unsigned long]
    [+0x008] Offset           : 0x0 [Type: unsigned long]
    [+0x00c] OffsetHigh       : 0x0 [Type: unsigned long]
    [+0x008] Pointer          : 0x0 [Type: void *]
    [+0x010] hEvent           : 0x138 [Type: void *] <------- CLIENT handle
0:001> ?? rgIpcPollHandles[1]
struct IpcStream::DiagnosticsIpc::IpcPollHandle
   +0x000 pIpc             : 0x035396f8 IpcStream::DiagnosticsIpc
   +0x004 pStream          : (null) 
   +0x008 revents          : 0 ''
   +0x00c pUserData        : 0x0351fef8 Void
0:001> dx -r1 ((coreclr!IpcStream::DiagnosticsIpc *)0x35396f8)
((coreclr!IpcStream::DiagnosticsIpc *)0x35396f8)                 : 0x35396f8 [Type: IpcStream::DiagnosticsIpc *]
    [+0x000] mode             : SERVER (1) [Type: IpcStream::DiagnosticsIpc::ConnectionMode]
    MaxNamedPipeNameLength : 0x100 [Type: unsigned int]
    [+0x004] _pNamedPipeName  : "\\.\pipe\dotnet-diagnostic-5352" [Type: char [256]]
    [+0x104] _hPipe           : 0x128 [Type: void *]
    [+0x108] _oOverlap        [Type: _OVERLAPPED]
    [+0x11c] _isListening     : true [Type: bool]
0:001> dx -r1 (*((coreclr!_OVERLAPPED *)0x3539800))
(*((coreclr!_OVERLAPPED *)0x3539800))                 [Type: _OVERLAPPED]
    [+0x000] Internal         : 0x103 [Type: unsigned long]
    [+0x004] InternalHigh     : 0x0 [Type: unsigned long]
    [+0x008] Offset           : 0x0 [Type: unsigned long]
    [+0x00c] OffsetHigh       : 0x0 [Type: unsigned long]
    [+0x008] Pointer          : 0x0 [Type: void *]
    [+0x010] hEvent           : 0x12c [Type: void *] <------- SERVER handle

@josalem
Copy link
Contributor Author

josalem commented Jul 23, 2020

Happened again: The CI is green according to AzDO, but GH hasn't updated to show the jobs as completed or successful.

@josalem
Copy link
Contributor Author

josalem commented Jul 30, 2020

GH isn't seeing the AzDO job has finished and succeeded:
image

CI is all green. I'm going to kick an outer loop to be doubly sure and then merge in the morning.

@josalem
Copy link
Contributor Author

josalem commented Jul 30, 2020

outerloop didn't seem to kick off correctly because the OSX leg hit a homebrew error so the tests never built for running on other platforms. Innerloop passed all green, so I'll merge this morning if there are no objections.

@hoyosjs
Copy link
Member

hoyosjs commented Jul 30, 2020

The OSX issue has been fixed, but it was after the PR was opened, so probably would require rebasing. Not sure it's justified.

@josalem
Copy link
Contributor Author

josalem commented Jul 30, 2020

Not sure it's justified.

Agreed. I'll give it another moment for people to air their opinions and then merge it in.

@davmason
Copy link
Member

:shipit:

Copy link
Contributor

@sywhang sywhang left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the changes LGTM but we should make sure OSX legs pass before we check this in

@josalem
Copy link
Contributor Author

josalem commented Jul 30, 2020

@sywhang I don't think any changes here would affect the OSX legs. It sounds like that's an infrastructure bug. I can rebase and rerun outer loop, but I'm not sure it would do much at this point.

@josalem josalem force-pushed the dev/josalem/windows-diagnosticsipc-improvements branch from 1b734e9 to df57001 Compare July 31, 2020 01:29
@ghost
Copy link

ghost commented Jul 31, 2020

Hello @josalem!

Because this pull request has the auto-merge label, I will be glad to assist with helping to merge this pull request once all check-in policies pass.

p.s. you can customize the way I help with merging this pull request, such as holding this pull request until a specific person approves. Simply @mention me (@msftbot) and give me an instruction to get started! Learn more here.

@josalem
Copy link
Contributor Author

josalem commented Jul 31, 2020

Windows failures are unrelated (Interop PInvoke CopyCtor test), and the mac failure is a known issue unrelated to these changes being investigated in #39979.

@josalem
Copy link
Contributor Author

josalem commented Jul 31, 2020

Remaining test failures are unrelated. I'll merge without if there's no further objection.

@josalem josalem merged commit e306489 into dotnet:master Jul 31, 2020
@josalem josalem deleted the dev/josalem/windows-diagnosticsipc-improvements branch July 31, 2020 22:24
Jacksondr5 pushed a commit to Jacksondr5/runtime that referenced this pull request Aug 10, 2020
…otnet#39807)

Prevents unintialized memory from making its way into the call to WaitForMultipleObjects.  That memory could contain a valid handle value and cause WFMO to hang since it isn't waiting on the correct handles.
@ghost ghost locked as resolved and limited conversation to collaborators Dec 8, 2020
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

pauseonstart test failures
4 participants