Multiple/many parallel jobs lead to "random" failures #490
Comments
We recently received a similar report and I originally thought it may be related to Kubernetes 1.30 and the |
I can test with various EKS versions, however, I am not sure how to build a minimal example with |
Is there a stack trace? Or can the verbosity level be increased to produce one? If not, I think the error is being inadequately logged, and we need to figure out which line of code is generating the exception. Most likely, this is caused by a race condition between k8s modifying the job status and the runner attempting to read and modify the manifest itself. As mentioned earlier, the resulting hash collision of resourceVersion would cause this conflict. So if we re-queue the current task whenever this error is encountered, I would expect the runner thread to eventually fetch the latest version and succeed. |
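The re-queue idea above can be sketched as a read-modify-write loop that retries on a version conflict. This is only an illustration under stated assumptions: `FakeJobStore`, `ConflictError`, and `update_with_retry` are hypothetical names, the store is an in-memory stand-in for the Kubernetes API server, and none of this is taken from the Galaxy runner code.

```python
import random
import time


class ConflictError(Exception):
    """Stand-in for an HTTP 409 Conflict returned on a stale resourceVersion."""


class FakeJobStore:
    """Hypothetical in-memory stand-in for the Kubernetes API server."""

    def __init__(self):
        self.manifest = {"resourceVersion": 1, "labels": {}}

    def read(self):
        # Return a copy so callers cannot mutate the stored object directly.
        return {"resourceVersion": self.manifest["resourceVersion"],
                "labels": dict(self.manifest["labels"])}

    def replace(self, new_manifest):
        # Reject writes that were based on an outdated version.
        if new_manifest["resourceVersion"] != self.manifest["resourceVersion"]:
            raise ConflictError("object has been modified")
        self.manifest = {"resourceVersion": new_manifest["resourceVersion"] + 1,
                         "labels": dict(new_manifest["labels"])}

    def controller_update(self):
        # Simulates k8s itself bumping the version (e.g. a job status change).
        self.manifest["resourceVersion"] += 1


def update_with_retry(store, mutate, max_attempts=5, base_delay=0.01,
                      before_write=None):
    """Re-read the object and re-apply `mutate` until the write is accepted.

    Returns the number of attempts that were needed.
    `before_write` is a test hook used only to inject a concurrent update.
    """
    for attempt in range(1, max_attempts + 1):
        manifest = store.read()          # always start from the latest version
        mutate(manifest)
        if before_write is not None:
            before_write(attempt)
        try:
            store.replace(manifest)
            return attempt
        except ConflictError:
            if attempt == max_attempts:
                raise                    # give up and surface the conflict
            # Small backoff with jitter so competing writers do not stay in lockstep.
            time.sleep(base_delay * attempt + random.uniform(0, base_delay))
```

The key point is that each retry starts from a fresh read, so a conflict caused by a concurrent status update resolves itself on the next attempt instead of surfacing as a job failure.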
This is "the most" detailed log I get:
|
Thanks. That helps with narrowing things down. |
To change/update the |
How do you build the Is it building this as-is, or is there a "min" configuration somewhere? |
@mapk-amazon That's the right image. Building it as is will do the job. If you'd like to test the changes, please try this branch: galaxyproject/galaxy#18514 |
Fwiw @mapk-amazon , you can also use |
Thank you all. I used
|
Thanks @mapk-amazon, it sure looks like a race condition. How did you upload the 100 files? Through the UI, API, or other means (bioblend etc)? |
While this is shown as an error in the logs, I think that the behaviour of
the code is harmless. Do you actually see the failure in the UI? That is
why we added that "ignoring" part there.
|
When running hundreds of jobs, you are always bound to get some
arbitrary errors; we mitigate that in our setup with aggressive
resubmission policies.
|
Thank you for your input! @ksuderman I use the web interface. I can try the API if you think it makes a difference. |
But yes, I do see this error every now and then in our logs. Maybe I don't see it in the UI as an error because of the resubmissions. |
True, but we are getting reports of the @mapk-amazon no need to try the API, I just want to make sure I am using the same procedure when I try to recreate the problem. |
Update: I believe I now know what is happening. In my understanding, the aggressive "retries" are the root cause of the issues. For failing pods, the log of the job pod (the one scheduling the pods) shows that "Galaxy" receives the information about the pod twice.
Then it starts cleaning up (twice), and one cleanup fails because the other has already deleted the pod or started deleting it. Finally, it shows
It seems the first pass had already moved the data and the second could no longer find the file. The result is a technically successful job (the container finished) whose results were processed successfully once; the second, later pass then responds with an error, and Galaxy believes the job failed. |
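If the double notification described above were the cause, one common mitigation would be to make the finish step idempotent: only the first notification per job collects outputs, and deleting an already-deleted pod counts as success. The sketch below illustrates that pattern only; `FinishOnce`, `FakeCluster`, `NotFoundError`, and `handle_finished` are hypothetical names, not Galaxy or Kubernetes APIs.

```python
import threading


class NotFoundError(Exception):
    """Stand-in for an HTTP 404 returned when the object is already gone."""


class FakeCluster:
    """Hypothetical in-memory stand-in for the set of pods in a namespace."""

    def __init__(self, pods):
        self.pods = set(pods)

    def remove(self, pod_name):
        if pod_name not in self.pods:
            raise NotFoundError(pod_name)
        self.pods.remove(pod_name)


class FinishOnce:
    """Lets only the first 'job finished' notification per job proceed."""

    def __init__(self):
        self._lock = threading.Lock()
        self._seen = set()

    def claim(self, job_id):
        # Check and record under one lock so two threads cannot both win.
        with self._lock:
            if job_id in self._seen:
                return False
            self._seen.add(job_id)
            return True


def delete_pod(cluster, pod_name):
    """Delete a pod, treating 'already deleted' as success."""
    try:
        cluster.remove(pod_name)
    except NotFoundError:
        pass  # someone else cleaned up first; nothing left to do


def handle_finished(guard, cluster, job_id, pod_name, collected):
    """Collect outputs and clean up once, however many notifications arrive."""
    if not guard.claim(job_id):
        return "duplicate-ignored"
    collected.append(job_id)  # stands in for moving the job's output files
    delete_pod(cluster, pod_name)
    return "finished"
```

With this shape, a second notification for the same job returns early instead of looking for files that the first pass has already moved.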
Update 2: I believe I was wrong (yet again). Please take a look at the PR galaxyproject/galaxy#19001 :) |
Setup
The setup is deployed on AWS on EKS:
Issue
Galaxy "usually" deploys jobs just fine. We started importing with Batch files into Galaxy and have been experiencing random pod failures.
Logs
and
In the k8s log we also see that the pod was launched around that time:
Ideas/Hypothesis
The current idea is that the hash (e.g. f4b62) has a collision, which leads to resource conflicts for the pods and to failures of some jobs. Does the team have any experience with this? Any fixes? Thank you :)
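To gauge how plausible the collision hypothesis is, a birthday-problem estimate helps. The sketch below assumes the suffix is drawn uniformly at random from a fixed alphabet; the thread does not say how the hash is generated, so the alphabet size (16, i.e. hexadecimal) and the length (5, matching `f4b62`) are assumptions, and `collision_probability` is a hypothetical helper.

```python
def collision_probability(n_jobs, alphabet_size, suffix_length):
    """Probability that at least two of `n_jobs` random suffixes are equal.

    Assumes every suffix is drawn independently and uniformly at random.
    """
    space = alphabet_size ** suffix_length
    if n_jobs > space:
        return 1.0  # pigeonhole: more jobs than possible suffixes
    p_unique = 1.0
    for i in range(n_jobs):
        # The (i+1)-th suffix must avoid the i suffixes already taken.
        p_unique *= (space - i) / space
    return 1.0 - p_unique


if __name__ == "__main__":
    # 100 concurrent jobs, 5 hexadecimal characters (both values are assumptions).
    print(f"{collision_probability(100, 16, 5):.4%}")
```

Under these assumptions, a collision among 100 simultaneous jobs is a sub-1% event, which would make it an unlikely explanation for failures that show up regularly.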