Agent is not responding in Windows tasks again #1232

Closed · ericoporto opened this issue Sep 10, 2023 · 8 comments
@ericoporto (Contributor) commented Sep 10, 2023

Expected Behavior

Windows tasks should build normally

Real Behavior

They don't start; I get an Agent is not responding! error instead.

The Agent is not responding! message didn't appear before 08/09/2023, so it's probably something that changed in the meantime.

Related Info

I know this looks similar to #1213, but we tried to force a rerun of the docker build and it didn't fix it.

Tried to force a re-cache by rebuilding the Windows docker image after changing the comments (which changes the cache hash), but it didn't work either: adventuregamestudio/ags#2128
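The cache-bust attempt was along these lines (a sketch; the Dockerfile path here is a placeholder, not the real one from the repo):

    # Any textual change to the Dockerfile changes the hash Cirrus uses for the
    # prebuilt image, so appending a throwaway comment forces a fresh build.
    # (placeholder path; substitute wherever the repo keeps its Windows Dockerfile)
    echo "# cache-bust $(date +%F)" >> ci/windows.Dockerfile
    git add ci/windows.Dockerfile
    git commit -m "Bust Windows docker image cache"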

The error has been happening in our recent builds since 08/09/2023:
https://cirrus-ci.com/github/adventuregamestudio/ags

It would help to have some better logging in case this is something wrong on our side, because right now we can't figure out any way to make it work.

I looked into Google Container Registry, and all our cached images for Windows have been there since 2021, which made things even more mysterious for me. I pulled the docker image locally on my Windows 10 machine, and as far as I can tell it runs normally.
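The local check was along these lines (a sketch; the image path is a placeholder, the actual name is whatever the registry console lists for the cache):

    # Pull the cached Windows image from the registry (placeholder path) and run
    # a trivial command to confirm the container actually starts and responds.
    docker pull gcr.io/example-project/example-windows-build-image:latest
    docker run --rm gcr.io/example-project/example-windows-build-image:latest cmd /c ver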

What I did notice is that Container Registry is deprecated! So we have until May 15, 2024 to migrate to Artifact Registry or something else.

Apparently if I retry many times I can also get the Instance got rescheduled! message.


I see the same issue happening in other Windows builds on Cirrus CI.

@ericoporto ericoporto added the bug label Sep 10, 2023
@ericoporto ericoporto changed the title Agent no responding in Windows tasks again Agent not responding in Windows tasks again Sep 10, 2023
@ericoporto ericoporto changed the title Agent not responding in Windows tasks again Agent is not responding in Windows tasks again Sep 11, 2023
@ericoporto (Contributor, Author)

I found that we are not the only repository where this issue is appearing. Unfortunately, a lot of repositories use Cirrus CI only for the FreeBSD or arm64 offering, so I didn't know of many repositories that included Windows builds.

@fkorotkov (Contributor)

The underlying issue seems to be that the Cirrus Agent inside a container is not able to resolve grpc.cirrus-ci.com. Looks like microsoft/Windows-Containers#217, but there is no workaround that I can find yet. Continuing to look...
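One way to reproduce the resolution failure locally would be something like this (a sketch; servercore:ltsc2022 is just a stand-in for whatever base image the task uses):

    # Try resolving the Cirrus gRPC endpoint from inside a Windows container.
    # If container DNS is broken, this fails even though the host resolves it fine.
    docker run --rm mcr.microsoft.com/windows/servercore:ltsc2022 nslookup grpc.cirrus-ci.com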

@fkorotkov (Contributor)

Fixed it by setting the --dns flag explicitly when a container is run, e.g. docker run --dns 8.8.8.8.
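For anyone hitting the same thing on a self-managed Windows Docker host, the equivalent workarounds would look roughly like this (a sketch, not the exact change made on the Cirrus side):

    # Per-container: pass an explicit DNS server when running the container.
    docker run --dns 8.8.8.8 --rm mcr.microsoft.com/windows/servercore:ltsc2022 nslookup grpc.cirrus-ci.com

    # Daemon-wide alternative: set "dns" in the Docker daemon config
    # (C:\ProgramData\docker\config\daemon.json on Windows) and restart the service:
    #   { "dns": ["8.8.8.8"] }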

@ericoporto (Contributor, Author)

Indeed, it's fixed, thank you!

@ericoporto (Contributor, Author) commented Sep 13, 2023

Oops, we have a new issue that says

Failed to get scheduled in timely manner!

Not sure what that means.

From here: https://cirrus-ci.com/task/6424294868123648

(it’s a Windows task too…)

edit: triggering a rebuild made that error go away.

@ericoporto (Contributor, Author) commented Feb 4, 2024

@fkorotkov (Contributor)

This is usually related to one of the layers of the prebuilt Windows container expiring. Please try to re-run the Prebuilt task next time. Cirrus's logic only checks the availability of the first few layers of an image in order to decide whether it's there or not. Since Windows containers are so huge and have so many layers, this logic fails more frequently for them.
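A rough way to check for this locally: a full pull has to fetch every layer, so it fails fast if any layer has expired (placeholder image path below):

    # Pulling the cached image touches every layer, so it surfaces expired layers
    # that the first-few-layers availability check misses. (placeholder image path)
    docker pull gcr.io/example-project/example-windows-build-image:latest

    # Listing the layer digests from the registry manifest, for comparison:
    docker manifest inspect gcr.io/example-project/example-windows-build-image:latest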

@ericoporto (Contributor, Author)

It would be nice to get either a better error message or some documentation with things to try out.

As far as I can tell, this was a 12-hour outage that eventually resolved after repeatedly re-running things.
