Database deadlock #10245
Comments
That looks like the new triggers updating the history update time at the same time as the job-finish method updates something related to the history ... we probably don't want to auto-increment the history update time anymore? I'll take a look.
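For context on why such triggers can deadlock, here is a hedged sketch of the pattern; the table, column, and trigger names below are illustrative assumptions, not the exact DDL added in #8187:

```python
# Illustrative only: the point is the locking pattern, where every dataset
# update also takes a row lock on the parent history row inside the same
# transaction.
EXAMPLE_TRIGGER_SQL = """
CREATE FUNCTION bump_history_update_time() RETURNS trigger AS $$
BEGIN
    -- Locks the parent history row for the rest of the transaction.
    UPDATE history SET update_time = now() WHERE id = NEW.history_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trg_bump_history_update_time
AFTER UPDATE ON history_dataset_association
FOR EACH ROW EXECUTE PROCEDURE bump_history_update_time();
"""
# If the job-finish code touches the same history (or its datasets) in a
# second transaction that acquires those row locks in a different order,
# the two transactions wait on each other and Postgres aborts one with a
# deadlock error.
```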
Hmm, I tried to simulate some deadlock-prone conditions with workflows and failing intermediate jobs, but no luck (that's often the case with a deadlock: it depends on the length of the transaction, and they're easier to hit under high system load). I have noticed an occasional deadlock on the Jenkins servers, so we should definitely do something about this. The triggers were added in #8187, and @Nerdinacan mentioned an alternative that I think is feasible.
A quick and easy way is to inject that logic in https://github.com/mvdbeek/galaxy/blob/4a822ad1a083348e0627b9c8222ba1bb03e40513/lib/galaxy/model/base.py#L64, where we already do something similar. I think doing it there is less deadlock-prone because we can update the history update time just once for all dirty objects in a session, and we can isolate this from the larger transaction that may modify many rows. I'm not sure this is really going to help, though; without being able to consistently trigger the deadlock it's a shot in the dark.
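A minimal sketch of that consolidation idea, assuming SQLAlchemy's `before_flush` hook and illustrative model names; this is not the actual patch:

```python
from datetime import datetime, timezone

from sqlalchemy import event

def install_update_time_hook(SessionFactory, History, HistoryDatasetAssociation):
    """Bump each affected history's update_time once per flush."""

    @event.listens_for(SessionFactory, "before_flush")
    def bump_history_update_time(session, flush_context, instances):
        history_ids = set()
        for obj in list(session.new) + list(session.dirty):
            if isinstance(obj, HistoryDatasetAssociation) and obj.history_id is not None:
                history_ids.add(obj.history_id)
        if history_ids:
            # One UPDATE for all affected histories, issued from the
            # application rather than one trigger firing per modified row.
            session.execute(
                History.__table__.update()
                .where(History.__table__.c.id.in_(history_ids))
                .values(update_time=datetime.now(timezone.utc))
            )
```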
I tried the patch, but it did not appear to solve the problem. Error log:
Confirmation of the patch in the running instance:
This is consistently occurring during my workflow invocation.
Just a sanity check: you also ran the migration, right?
It runs during deployment, and I believe Galaxy errors out during init if the database is not at the correct version. I will rerun it and see what happens.
Reported by the uwsgi app:
After rerunning the workflow with the confirmed database migration, it continues to fail. It is not the same job/step every time, but it does appear to be the same tool: the failing jobs are all on the awk tool. I assume this is because these jobs are very fast, with very short runtimes, and there are many of them.
They fail because the output file wasn't collected.
That can happen when the filesystem hasn't fully synced by the time the job is read back; you can tune that behaviour. Still, we shouldn't ever deadlock here, at least not reproducibly and at this volume.
You are right, that does coincide with every deadlock. I wasn't aware of that configuration option. I will increase it and see if it resolves my issue for now.
Actually, this particular error happens within galaxy/lib/galaxy/jobs/__init__.py (line 1278 at ebcf9e6).
The thing that happens before the deadlock is:
Yeah, I am not sure why it is looking for a file there; every job throws that error. I assume it has something to do with an upcoming feature.
I don't follow; the dataset collection error is downstream of the actual error that failed the job (which is a failing metadata job). But it seems that more than one process is updating the dataset state at the same time, causing the deadlock.
Is it possible the workers are not getting an actual lock on the job? I can scale back down to one worker and one workflow scheduler and try again. This is backed by an AWS RDS Postgres server. Is the uwsgi config above sensible?
I'm sure I've seen this deadlock at least once in the Jenkins API tests, which run with just a single process of everything, in a test that checks that datasets produced downstream of a job get paused. So I'm thinking it's not really related to the config you've set up; maybe we're putting things into a threading queue when they shouldn't go there.
Any chance you could add mvdbeek@3f6464b on top of the branch you're running right now? It'll print a stacktrace for every flush; I hope that will help us narrow down where the conflict is happening.
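For reference, a rough sketch of what a print-a-stacktrace-per-flush debugging patch could look like (a sketch, not the actual commit):

```python
import logging
import traceback

from sqlalchemy import event
from sqlalchemy.orm import Session

log = logging.getLogger(__name__)

# Listening on the Session class applies to every session in the process.
@event.listens_for(Session, "before_flush")
def log_flush_origin(session, flush_context, instances):
    log.debug(
        "flush: %d new, %d dirty, %d deleted\n%s",
        len(session.new),
        len(session.dirty),
        len(session.deleted),
        "".join(traceback.format_stack(limit=25)),
    )
```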
There's no deadlock in my case, but it looks like ...
That traceback is odd; the traceback indicates this is in a ...
Sorry, I ninja-deleted that error. I compared the patch to the file, and wiggle decided to inject the code into a completely random spot without any sort of warning. 20.05 does not have the versioned_session function in base.py.
562cd58 should apply without conflicts.
I am noticing something that could be related as the uwsgi app inits:
It lists the workflow schedulers twice. Could the schedulers be duplicated? Some additional errors:
DB log:
Finally, it finished (errored) the workflow. That patch caused a serious performance issue with the uwsgi app or database during the workflow run. Here is the log around the deadlock:
logs-from-galaxy-worker-in-galaxy-worker-6db9967dd4-lstsz-2.txt
Alright, so there appears to be a conflict for dataset 90853, and it looks like the kubernetes runner is attempting to set the job state to RUNNING after the job has already failed ... and that does happen after the deadlock is reported, which could just be a delay in logging or in query execution of the transaction that is not rolled back. Can you try this without the retries? Unfortunately the traceback logs got a bit tangled; I'll see if I can prefix them with an event id.
If you're doing another round of testing, can you include mvdbeek@bac9fe7? That would make it easier to decipher the mangled logs.
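A sketch of the kind of per-event log prefixing that helps untangle interleaved logs (an assumed approach, not necessarily what mvdbeek@bac9fe7 does):

```python
import logging
import threading
import uuid

_local = threading.local()

def start_event():
    """Call at the start of a unit of work (e.g. handling one job)."""
    _local.event_id = uuid.uuid4().hex[:8]
    return _local.event_id

class EventIdFilter(logging.Filter):
    """Stamps every record with the current thread's event id."""

    def filter(self, record):
        record.event_id = getattr(_local, "event_id", "-")
        return True

handler = logging.StreamHandler()
handler.addFilter(EventIdFilter())
handler.setFormatter(logging.Formatter("[%(event_id)s] %(levelname)s %(name)s %(message)s"))
logging.getLogger().addHandler(handler)
```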
Does max_pod_retries need to be explicitly set? I do not have it configured.
Yes, it does.
Looks like there are several duplicate params: galaxy/lib/galaxy/jobs/runners/kubernetes.py, lines 412 to 423 at ae2f670.
To clarify, would you like me to set it to 0, or to 3? It defaults to 1 in the linked code.
I went ahead and reran with the patch and max_pod_retries unset (default value). Here is the log around the deadlock:
0; the goal is to eliminate any retries.
This is also a different deadlock from the one we saw without the triggers (it's the one in the initial post). Is this on the branch with or without the triggers?
The branch I am working with is the accumulation of all the patches you have suggested. I will rerun with max_pod_retries set to 0.
I reran it and it failed again with the dataset error "Unable to finish job". The logs are now saturated, and the deadlock exception fell off the end of the log before I could capture it.
I reran the workflow with #10315; it continues to deadlock. The deadlock appeared in the workflow scheduler log on one run. I reran again and it reappeared in both the job handler log and the workflow scheduler log. Job handler log:
Workflow scheduler log:
Upgraded to 20.09, removing all patches except the k8s id patch. Reran the job and got a scheduler deadlock much earlier in the workflow execution than before. Scheduler log:
Job handler log:
To keep a record here: we talked on Gitter and are testing dropping the triggers to try to isolate the issue, using https://gist.github.com/00119ec94ef74574fe3173807429a14e (so far it seems likely it's the triggers, and we need to think of a clean way to avoid them locking).
We've already tried removing the triggers; that's what #10254 was about.
20.09 plus the patch may have been the magic sauce. I can't actually confirm that it worked, because any time I try to access the history, Galaxy locks up for half an hour.
I guess if it works with Dannon's patch but not with #10254, it doesn't really matter whether it's a trigger or observing the SQLAlchemy session. Which is cool; we can send the manual update-time statement over the message queue and do some debouncing there.
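A sketch of what that debouncing could look like; the interfaces and names here are assumptions for illustration, not Galaxy's actual message-queue API:

```python
import threading
import time
from datetime import datetime, timezone

from sqlalchemy import bindparam, text

class HistoryUpdateTimeDebouncer:
    """Collect 'touch history X' requests and flush them in batches."""

    def __init__(self, engine, interval=5.0):
        self.engine = engine
        self.interval = interval
        self._pending = set()
        self._lock = threading.Lock()
        threading.Thread(target=self._flush_loop, daemon=True).start()

    def touch(self, history_id):
        # Called once per queue message; no database work here.
        with self._lock:
            self._pending.add(history_id)

    def _flush_loop(self):
        stmt = text(
            "UPDATE history SET update_time = :now WHERE id IN :ids"
        ).bindparams(bindparam("ids", expanding=True))
        while True:
            time.sleep(self.interval)
            with self._lock:
                ids, self._pending = self._pending, set()
            if ids:
                # One short transaction per batch, instead of bumping
                # update_time inside every job/dataset transaction.
                with self.engine.begin() as conn:
                    conn.execute(stmt, {"now": datetime.now(timezone.utc), "ids": list(ids)})
```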
Waiting on #10322 to confirm the deadlock issue is resolved.
The deadlock is resolved by the patch and the upgrade to 20.09.
I think this has been resolved in an unpatched 20.09 now. Can I close this?
What was the relevant PR?
Give me a week to test this; with so many database schema changes/patches, I mangled my database and need to wipe it and start over before testing.
We definitely improved it, but I think I heard reports that this is still happening at EU.
Rerunning again:
So I looked carefully at your last log there; part of what it shows is weird. Looking back at the thread, did you actually try whether #10254 resolved the issue?
The traceback is also triggered at https://github.com/mvdbeek/galaxy/blob/d8358c602c1ce23119b4b44cfa2f124674a47cde/lib/galaxy/tools/actions/__init__.py#L417, which implies you're using legacy_eager_objectstore_initialization.
My last exception report was with #10360. legacy_eager_objectstore_initialization was set to true for some reason while we were testing things.
That PR fixed something that broke in 20.09, so I'm not surprised you're still seeing the deadlocks you also reported against 20.05. If things run smoothly with @dannon's patch, I think it would be safe to run that in production. I don't think we actually use the update_time that those triggers set in our current UI.
#10360 is part of 20.09. It doesn't conflict with #10254 or dannon's patch, except that the migration number should be higher now. I will update #10254 to match the current migration, but #10254 just moves the trigger logic into SQLAlchemy-space, so it might still have the same problems (but maybe it doesn't; it would be awesome to know that). I can also create a variant that is like the dropped-trigger migration but that also removes the additional update-time queries, which I don't think we need before merging the big history PR.
I've added #10821, which drops all manual update_time triggers and manipulation on HDAs and HDCAs (their update_time is set with onupdate anyway); that should solve the problem. I'm not sure we'll actually merge it into 20.09, since it would require a migration, which we try to avoid for minor updates.
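For reference, the "set with onupdate anyway" part refers to the SQLAlchemy pattern sketched below; the column definitions here are illustrative, not the actual Galaxy mapping:

```python
from datetime import datetime, timezone

from sqlalchemy import Column, DateTime, Integer, MetaData, Table

def now():
    return datetime.now(timezone.utc)

metadata = MetaData()

# Because update_time has an onupdate default, any ORM/core UPDATE of the row
# refreshes it automatically; no database trigger or extra statement needed.
history_dataset_association = Table(
    "history_dataset_association",
    metadata,
    Column("id", Integer, primary_key=True),
    Column("create_time", DateTime, default=now),
    Column("update_time", DateTime, default=now, onupdate=now),
)
```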
Is the goal simply to test that it works? If I use that patch in a production system, then the next release will interfere with the migration count.
It would be good to confirm this solves the deadlocks, yes. Dropping the triggers is also something we could do outside of the migration, I think. Let me see what we can do there, and then maybe you could give this a go next week?
Sure.
The following exception occurs while running a very large workflow:
The dataset reports "Unable to finish job" as the error. Galaxy 20.05 (autoscaled uwsgi app, 3 job handlers, 3 workflow schedulers). I am not sure how to prevent this.