Add safety net for unresponsive docker containers #5120
I would indeed also report when these errors happen, so we have a backtrace and can investigate further in the future. (And maybe just rescue from any error, to see if there are other things we're missing.)
Wrapping this with a rescue doesn't hurt, but I would also add more logging so we at least know something went wrong here.
The abort on exception is a good idea.
Logging a warning is better than nothing, but it is a bit hidden. We will probably never grep the log files to see this warning. I assume it's hard to trigger a Slack notification?
It is not that hard to trigger a Slack notification, but I am not sure we would want one in this case. This specific error was triggered by infinite loops in the student code. With this fix, those submissions now receive a timeout error as expected, and I do not see why we would want a notification every time that happens in the future. Should another error occur, it will now trigger an internal error state for the submission and thus a notification for us. This is why I kept the caught errors specific: to separate out the interesting cases.
But we haven't seen similar problems with other judges, so it is an issue with the judge, right?
I think this is more of a language issue than a judge-specific issue. The judge code does not seem to do much out of the ordinary. I think C might just be more capable of taking all available resources, which makes the Docker container unresponsive.
This pull request adds a rescue block around checking the docker stats.
I am not 100% sure this is the root cause of the current issue, but it does seem likely.
Checking for stats has the potential to cause an error (e.g. a timeout if the container is unresponsive). This results in Rails crashing and never stopping the container.
I tried to be specific in the errors I catch to avoid ignoring other causes.
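As a rough illustration of the idea (not the actual code in this pull request), the sketch below rescues only a short list of errors around a stats check and always stops the container afterwards. `FakeContainer`, `peak_memory` and `run_with_safety_net` are hypothetical names, and the choice of `Timeout::Error` and `IOError` as the rescued classes is an assumption; the real runner talks to Docker and may see different error classes.

```ruby
require 'timeout'

# Hypothetical stand-in for a Docker container handle, so the sketch runs
# without Docker. An unresponsive container raises when asked for stats.
class FakeContainer
  attr_reader :stopped

  def initialize(responsive:)
    @responsive = responsive
    @stopped = false
  end

  def stats
    raise Timeout::Error, 'container did not answer' unless @responsive

    { 'memory_stats' => { 'max_usage' => 1024 } }
  end

  def stop
    @stopped = true
  end
end

# Rescue only the errors we expect from an unresponsive container.
# Anything else still propagates, so unexpected failures stay visible.
def peak_memory(container)
  container.stats.dig('memory_stats', 'max_usage')
rescue Timeout::Error, IOError => e
  warn "stats unavailable: #{e.class}: #{e.message}"
  nil
end

# The ensure block stops the container whether or not stats could be read.
def run_with_safety_net(container)
  peak_memory(container)
ensure
  container.stop
end
```

Keeping the rescue list narrow means a failed stats check no longer prevents the container from being stopped, while any other exception still surfaces to the caller.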
I also fixed the propagation of errors from within the timer thread. We already have a task that notifies us when errors occur in the submission runner, but this was not triggered by errors occurring within the thread.
See https://stackoverflow.com/a/9095369
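One way to make an exception inside a Ruby thread reach the main thread is `Thread#abort_on_exception`. The sketch below is an assumption about the general technique, prompted by the review comment on aborting on exception; it is not the code from this pull request, and `error_seen_by_main?` is a hypothetical helper.

```ruby
# Returns true if an error raised inside a background thread reaches the
# thread that created it (assumed here to be the main thread).
def error_seen_by_main?(abort:)
  Thread.new do
    # Silence the default "terminated with exception" report on stderr.
    Thread.current.report_on_exception = false
    # When true, an unhandled exception in this thread is re-raised
    # in the main thread instead of only killing this thread.
    Thread.current.abort_on_exception = abort
    raise ArgumentError, 'stats check failed'
  end

  # Give the background thread time to fail while we are still inside
  # the method body, where the rescue below can observe the error.
  sleep 0.2
  false
rescue ArgumentError
  true
end
```

Without `abort_on_exception`, the thread dies quietly and the caller only learns about the error if it later calls `join` or `value` on the thread, which is how errors in a timer thread can go unnoticed.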