
Add safety net for unresponsive docker containers #5120

Merged
4 commits merged into main from fix/infinite-dockers on Nov 14, 2023
Conversation

@jorg-vr (Contributor) commented Nov 8, 2023

This pull request adds a rescue block around checking the Docker stats.

I am not 100% sure this is the root cause of the current issue, but it does seem likely.

Checking the stats has the potential to raise an error (e.g. a timeout if the container is unresponsive). This results in Rails crashing and never stopping the container.

I tried to be specific about the errors I rescue to avoid silently ignoring other causes.
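
Conceptually, the safety net looks something like the sketch below. This is not the literal diff: the `container` handle, the rescued `Docker::Error::TimeoutError` class (from the docker-api gem) and the logger call are illustrative assumptions.

```ruby
# Minimal sketch of the safety net, not the literal diff.
# Assumes the docker-api gem and an illustrative `container` handle.
def safe_stats(container)
  container.stats
rescue Docker::Error::TimeoutError => e
  # An unresponsive container can make the stats call hang and time out.
  # Rescuing only this specific error keeps other failures visible, so they
  # still surface as an internal error for the submission.
  Rails.logger.warn "Could not read stats for container #{container.id}: #{e.message}"
  nil
end
```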

I also fixed the propagation of errors from within the timer thread. We already have a task that notifies us when errors occur in the submission runner, but this was not triggered by errors occurring within the thread.
See https://stackoverflow.com/a/9095369
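
The gist of that answer, applied to our timer thread, is something like this (names such as `time_limit` and `container` are illustrative, not the exact code):

```ruby
# Illustrative sketch: let exceptions inside the timer thread abort the
# parent instead of dying silently when the thread is never joined.
timer = Thread.new do
  sleep time_limit
  container.stop
end
timer.abort_on_exception = true
```

With `abort_on_exception` set, an exception raised in the thread propagates to the main thread, so the existing submission-runner error reporting picks it up.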

@jorg-vr jorg-vr added the bug Something isn't working label Nov 8, 2023
@jorg-vr jorg-vr self-assigned this Nov 8, 2023
@jorg-vr jorg-vr requested a review from a team as a code owner November 8, 2023 10:20
@jorg-vr jorg-vr requested review from bmesuere and chvp and removed request for a team November 8, 2023 10:20
@chvp (Member) commented Nov 8, 2023

I would indeed also report when these errors happen so we have a backtrace and can investigate further in the future. (And maybe just rescue from any error, to see if there are other things we're missing.)

@bmesuere (Member) left a comment

Wrapping this with a rescue doesn't hurt, but I would also add more logging so we at least know something went wrong here.

@jorg-vr jorg-vr requested a review from bmesuere November 14, 2023 09:54
@bmesuere (Member) left a comment

The abort on exception is a good idea.
Logging a warning is better than nothing, but it is a bit hidden: we will probably never grep the log files to spot this warning. I assume it's hard to trigger a Slack notification?

@jorg-vr (Contributor, Author) commented Nov 14, 2023

It is not that hard to trigger a Slack notification, but I am not sure we would want one in this case.

This specific error was triggered by infinite loops in student code. With this fix, those submissions will now receive a timeout error as expected, so I do not see why we would want a notification every time this happens in the future.

Should another, unexpected error occur, it will now trigger an internal error state for the submission and thus a notification for us.

This is why I kept the rescued errors specific: to differentiate the interesting cases.

@bmesuere (Member) commented

But we haven't seen similar problems with other judges, so it is an issue with the judge, right?

@jorg-vr (Contributor, Author) commented Nov 14, 2023

I think this is more of a language issue than a judge-specific issue. The judge code does not seem to do much out of the ordinary. I think C might just be more capable of consuming all available resources, which makes the Docker container unresponsive.

@jorg-vr jorg-vr merged commit b121058 into main Nov 14, 2023
11 of 13 checks passed
@jorg-vr jorg-vr deleted the fix/infinite-dockers branch November 14, 2023 13:52