tracing\eventpipe\diagnosticport\diagnosticport\diagnosticport.cmd Intermittent Failure in CI #2044
I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.
Tagging subscribers to this area: @tommcdon
Did some preliminary triage on this that I wanted to document. All the failures are on win-x64 and present with the following failure in the logs:
The first failure is from October 29th, 2020, which is a full 2 months after that test was added and the day after the new managed thread pool was merged in. I've had 3 instances of

This test is specifically checking the functionality of setting

The logs indicate that the ResumeRuntime command was sent and the runtime returned the OK response, but the MRE hasn't been set, meaning we haven't seen the event yet. That means there are two possible errors happening:
There are dumps of both the parent and child processes from these failures, but neither seems to be in good condition. When I open them in windbg I see suspicious stacks: I think there may be issues with the DAC in the artifacts from the build I got the dumps from. Opening them in VS gives a better view, but the main thread stack is truncated: This is the child process, which should have sent the

The evidence points to the
I've left the test running in three consoles restricted to one core for >6 days and the issue has not reproduced locally. Since we are getting core dumps from the failure in CI, I'll look at turning on StressLog in these tests to see if I can get more information.
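For reference, here is a minimal sketch of what turning on StressLog for the launched child could look like, assuming the harness controls the child's environment. The child command line and the specific values are placeholders for illustration, not what the CI infrastructure necessarily uses.

```csharp
// Illustrative sketch only: enable the runtime's in-memory StressLog for a child
// process by setting the COMPlus_* logging variables before launch. Values here
// are placeholders.
using System.Diagnostics;

var psi = new ProcessStartInfo("dotnet", "ChildApp.dll"); // hypothetical child app
psi.Environment["COMPlus_StressLog"] = "1";          // turn on the stress log
psi.Environment["COMPlus_StressLogSize"] = "20000";  // per-thread buffer size (parsed as hex)
psi.Environment["COMPlus_LogFacility"] = "FFFFFFFF"; // facility bitmask (hex)
psi.Environment["COMPlus_LogLevel"] = "A";           // verbosity, 0xA is the most verbose
using var child = Process.Start(psi);
child.WaitForExit();
```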
Managed to validate with a different dump that the test is indeed waiting inside the
Since we're in managed code, that means the runtime resumed already. This implies that the EEStartupStarted event was either not fired or not received.
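For context, here is a simplified sketch (not the actual test code) of the reader-side pattern the test exercises; the provider name, the event handling, and the pid variable are assumptions for illustration. The parent attaches an EventPipe session to the suspended child, sends ResumeRuntime, and then waits on an MRE that is set once the expected event is observed.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Diagnostics.NETCore.Client;   // DiagnosticsClient, EventPipeProvider
using Microsoft.Diagnostics.Tracing;          // EventPipeEventSource, TraceEvent

int pid = 0; // hypothetical: id of the child process, launched suspended on its diagnostic port
var seenEvent = new ManualResetEvent(false);

var client = new DiagnosticsClient(pid);
using var session = client.StartEventPipeSession(
    new EventPipeProvider("Microsoft-Windows-DotNETRuntime",
                          System.Diagnostics.Tracing.EventLevel.Informational));

var source = new EventPipeEventSource(session.EventStream);
source.Dynamic.All += _ => seenEvent.Set();   // simplification: the real test waits for a specific event
_ = Task.Run(() => source.Process());         // dispatch incoming events on a background thread

client.ResumeRuntime();                       // in the failing runs this returns OK...

// ...but this wait times out, matching "runtime resumed, MRE never set".
bool sawEvent = seenEvent.WaitOne(TimeSpan.FromMinutes(1));
```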
Added

@josalem it seems like you're actively investigating; if you think the fix will still take some time (more than a couple of hours), can we disable the test in the meantime?
I'm hesitant to turn it off since it only seems to happen in CI, but we can if it is blocking things. I can disable it in the morning. I'm actively looking into this one.

I went digging through the session state for the app and it's in a weird state: Ignore the

The odd thing is that

Looking into the buffer manager logic, we should only ever deallocate buffers if they are empty or the session has ended. Looking at a couple of instances of the issue, the symptom appears to be the same. The runtime is successfully resumed and the

CC @sywhang
Jotting down a few notes:
A little more progress. The

I looked in the internal buffer for the reader and decoded the
This means that the reader is going to be stuck in this 6 byte read indefinitely because the writer app won't be sending any more events/data. This also means that the received events won't get dispatched, which means our test will remain waiting for them.

Why is this intermittent? Why did it not start failing till Oct 29th? This one is harder to rationalize with this hypothesis. It is possible that a difference in TSC speeds or values between Intel and AMD caused the

I am fairly confident that this means the runtime is actually behaving correctly. The StressLogs show that the server is in a good state at the timeout and that the runtime resumed correctly. The dump shows that EventPipe correctly sent the 2 expected events and is sleeping until more events are sent. The reader app has the full

The issue is that the events never get dispatched due to the attempt to read up to an 8 byte aligned offset into the internal buffer. I believe the issue is in TraceEvent and not in the runtime. This is still a hypothesis, but my confidence in this is growing, especially because the logic we are blocked in has the following comment on it:

```csharp
// PERF policy
// In the streaming (non-seekable) case we don't want to buffer any more data than was
// requested and needed for alignment because this might cause the thread to block waiting
// for the unneeded data to arrive. There is probably a better way to do this that can
// oportunistically buffer if the data is available but this code isn't that sophisticated
// yet.
//
// In the non-streaming (seekable) case we do want to buffer because that lets the
// reader achieve higher throughput.
int fillSize = inputStream.CanSeek ? bytes.Length : (position + minimum + (align-1)) & ~(align-1);
```

I think I could validate this by splicing together a trace by hand that has an
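To make the rounding concrete, here is a small arithmetic sketch of the streaming branch above; the variable names mirror the quoted snippet, and the specific numbers are hypothetical values chosen to show how the read can block.

```csharp
using System;

// Hypothetical numbers illustrating the streaming (non-seekable) branch of the
// fillSize calculation above: the reader only needs 6 more bytes, but the 8-byte
// alignment rounds the fill target up past what the writer will ever send.
int position = 0; // current offset into the reader's internal buffer
int minimum  = 6; // bytes actually needed to decode the next item
int align    = 8; // streaming fills are rounded up to an 8-byte boundary

int fillSize = (position + minimum + (align - 1)) & ~(align - 1);
Console.WriteLine(fillSize); // 8

// The diagnostics stream only produces data when the app emits more events, so a
// blocking read that insists on 8 bytes never completes, and the 6 bytes that are
// already available are never dispatched to the test's callbacks.
```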
Confirmed my hypothesis about the cause of the hang. I doctored a nettrace stream to have an EventBlock with a size not divisible by 8. Feeding this
* real fix needs to happen in TraceEvent
Failures since the band-aid fix all appear unrelated and are due to PRs testing linking (which is failing the test since they rely on
New failure is not a recurrence of the original issue. It is a failure in a PR with changes to floating point number parsing.
Runfo Tracking Issue: tracing\eventpipe\diagnosticport\diagnosticport\diagnosticport.cmd Intermittent Failure in CI
Build Result Summary