🔥🔥🔥 "Libraries Test Run checked coreclr Linux" timing out on all PRs #45061

Closed
jkotas opened this issue Nov 21, 2020 · 17 comments
Assignees: safern
Labels: area-Infrastructure, blocking-clean-ci, untriaged

@jkotas (Member) commented Nov 21, 2020

Examples: #44688, #44945, ...

@Dotnet-GitSync-Bot (Collaborator)

I couldn't figure out the best area label to add to this issue. If you have write-permissions please help me learn by adding exactly one area label.

@Dotnet-GitSync-Bot added the untriaged label Nov 21, 2020
@ghost commented Nov 21, 2020

Tagging subscribers to this area: @ViktorHofer
See info in area-owners.md if you want to be subscribed.

Issue Details
Author: jkotas
Assignees: -
Labels: area-Infrastructure, untriaged
Milestone: -

@jkotas added the blocking-clean-ci label Nov 21, 2020
@ViktorHofer (Member)

@safern can you please take a look at that one?

@ViktorHofer (Member)

Still happening in PRs, e.g. #45108.

@ericstj changed the title from "Libraries Test Run checked coreclr Linux" timing out on all PRs to 🔥🔥🔥 "Libraries Test Run checked coreclr Linux" timing out on all PRs Nov 23, 2020
@ericstj (Member) commented Nov 23, 2020

cc @aik-jahoda
Does anyone know if this is related to a code change, or do we think this is due to infrastructure (e.g. reduced agents in machine pools)?

@wfurt (Member) commented Nov 23, 2020

I was searching through Kusto and I don't even see a console link, so this may be infrastructure. All the cases I looked at are containers.

@SteveMCarroll

@jkotas let me know about this and I'm catching up.
My amateur sleuthing suggests this is not likely a repo-level issue.
Has this been reported to First Responders? Please cc me on this one.

@safern (Member) commented Nov 23, 2020

> @safern can you please take a look at that one?

Just catching up on this. I can help look at the data and follow up with FR if there is no thread already.

@stephentoub (Member)

> Has this been reported to First Responders? Please cc me on this one.

It was here:
#44980 (comment)

@safern (Member) commented Nov 23, 2020

Thanks @stephentoub. I'm taking over to drive closure on this one.

@safern safern self-assigned this Nov 23, 2020
@safern (Member) commented Nov 23, 2020

I just looked at the data for some of the jobs linked here, and it looks like the queue was either clogged or had a hiccup. The work items themselves run fine, taking less than 1 minute once they get a machine, but the average wait time in the queue was 11 hours 😮
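
A minimal sketch of that wait-time calculation, assuming the queued/started timestamps for each work item have already been exported (the sample values and layout below are placeholders, not the real telemetry):

```python
from datetime import datetime
from statistics import mean

# Placeholder export: one (queued, started) timestamp pair per work item.
# These values are illustrative only, not data from this incident.
work_items = [
    ("2020-11-21T02:10:00", "2020-11-21T13:05:00"),
    ("2020-11-21T02:12:00", "2020-11-21T13:40:00"),
]

FMT = "%Y-%m-%dT%H:%M:%S"

def wait_hours(queued: str, started: str) -> float:
    """Hours a work item spent waiting before a machine picked it up."""
    delta = datetime.strptime(started, FMT) - datetime.strptime(queued, FMT)
    return delta.total_seconds() / 3600

print(f"average queue wait: {mean(wait_hours(q, s) for q, s in work_items):.1f} hours")
```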

@safern (Member) commented Nov 23, 2020

The core-eng issues are: https://github.com/dotnet/core-eng/issues/11485 and https://github.com/dotnet/core-eng/issues/11468. @dotnet/dnceng says it is fixed.

I'm going to leave this open to see if we get more instances of this before EOD; if not, I will close it.

@jkotas (Member, Author) commented Nov 24, 2020

Still happening: #45137

@safern (Member) commented Nov 24, 2020

OK, I was just about to close this since the data suggested it wasn't happening anymore, but I found that the jobs where it happened are not showing up in Kusto. So I looked at the Swagger API, and the example just posted shows all work items as "waiting": https://helix.dot.net/api/jobs/48940d46-78a5-4bab-be97-a0f38db8c27a/details?api-version=2019-06-17
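
For reference, a minimal sketch of checking that endpoint from a script; the URL is the one quoted above, but the response keys are an assumption rather than the documented schema, so adjust them to whatever the API actually returns:

```python
import json
from urllib.request import urlopen

# Helix job-details endpoint quoted above (copied verbatim).
URL = ("https://helix.dot.net/api/jobs/"
       "48940d46-78a5-4bab-be97-a0f38db8c27a/details?api-version=2019-06-17")

with urlopen(URL) as resp:
    details = json.load(resp)

# Assumed shape: a "WorkItems" object with per-state counts
# (e.g. Waiting / Running / Finished). A clogged queue shows up here
# as a large "Waiting" count.
for state, count in details.get("WorkItems", {}).items():
    print(f"{state}: {count}")
```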

I pinged the FR thread and the issue on: https://github.com/dotnet/core-eng/issues/11468#issuecomment-732715076

Thanks for reporting the new instance!

@safern (Member) commented Nov 24, 2020

Update: the queue should be back at capacity. We're killing all jobs that started more than 2 hours ago to ease the queue.

Current issue to investigate why machines are suddenly going offline is here: https://github.com/dotnet/core-eng/issues/11503

@tarekgh (Member) commented Nov 24, 2020

Still happening in #45079. I reran the timed-out test a couple of times without any luck.

@safern (Member) commented Nov 26, 2020

I haven't seen this happen anymore. I looked at the queue health and it is pretty healthy, with average wait times of 15 minutes since yesterday. Please re-open if you see this happen again.

@safern closed this as completed Nov 26, 2020
@ghost locked as resolved and limited conversation to collaborators Dec 26, 2020