Database deadlock #10245

Closed
innovate-invent opened this issue Sep 18, 2020 · 55 comments

@innovate-invent
Contributor

innovate-invent commented Sep 18, 2020

The following exception occurs while running a very large workflow:

galaxy.metadata DEBUG 2020-09-18 19:35:46,971 setting metadata externally failed for HistoryDatasetAssociation 54263: Metadata results could not be read from '/data/jobs_directory/024/24438/metadata/metadata_results_outfile'
galaxy.jobs.runners ERROR 2020-09-18 19:35:48,127 (24438/galaxy-islandcompare-24438) Job wrapper finish method failed
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 25865 waits for ShareLock on transaction 646521; blocked by process 4054.
Process 4054 waits for ShareLock on transaction 646527; blocked by process 25743.
Process 25743 waits for ExclusiveLock on tuple (11,36) of relation 17171 of database 16396; blocked by process 25865.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (11,36) in relation "history"
SQL statement "UPDATE history
                    SET update_time = (now() at time zone 'utc')
                    WHERE id = NEW.history_id OR id = OLD.history_id"
PL/pgSQL function update_history_update_time() line 9 at SQL statement


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1713, in finish
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2637, in _flush
    transaction.rollback(_capture_exception=True)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2597, in _flush
    flush_context.execute()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 984, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1097, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1287, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 25865 waits for ShareLock on transaction 646521; blocked by process 4054.
Process 4054 waits for ShareLock on transaction 646527; blocked by process 25743.
Process 25743 waits for ExclusiveLock on tuple (11,36) of relation 17171 of database 16396; blocked by process 25865.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (11,36) in relation "history"
SQL statement "UPDATE history
                    SET update_time = (now() at time zone 'utc')
                    WHERE id = NEW.history_id OR id = OLD.history_id"
PL/pgSQL function update_history_update_time() line 9 at SQL statement

[SQL: UPDATE history_dataset_association SET update_time=%(update_time)s, info=%(info)s, blurb=%(blurb)s, peek=%(_peek)s, tool_version=%(tool_version)s, metadata=%(_metadata)s, version=%(version)s WHERE history_dataset_association.id = %(history_dataset_association_id)s]
[parameters: {'update_time': datetime.datetime(2020, 9, 18, 19, 35, 47, 120934), 'info': '', 'blurb': '2,897 lines, 98 comments', '_peek': '#\n#\n# Core-specific parameters:\n#\n#    Menu parameters:\n', 'tool_version': 'GNU Awk 5.0.1, API: 2.0\n', '_metadata': <psycopg2.extensions.Binary object at 0x7f51075e1ed0>, 'version': 2, 'history_dataset_association_id': 54263}]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.jobs ERROR 2020-09-18 19:35:48,180 fail(): Missing output file in working directory: [Errno 2] No such file or directory: '/data/jobs_directory/024/24438/outputs/galaxy_dataset_02b9521f-28a9-4cb2-9d9c-40f3a8088fd3.dat'

The dataset reports "Unable to finish job" as the error.

Galaxy 20.05 (autoscaled uwsgi app, 3 job handlers, 3 workflow schedulers)

uwsgi:
      buffer-size: 16384
      processes: 1
      threads: 4
      offload-threads: 2

I am not sure how to prevent this.

@mvdbeek
Member

mvdbeek commented Sep 21, 2020

psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL: Process 25865 waits for ShareLock on transaction 646521; blocked by process 4054.
Process 4054 waits for ShareLock on transaction 646527; blocked by process 25743.
Process 25743 waits for ExclusiveLock on tuple (11,36) of relation 17171 of database 16396; blocked by process 25865.
HINT: See server log for query details.
CONTEXT: while updating tuple (11,36) in relation "history"
SQL statement "UPDATE history
SET update_time = (now() at time zone 'utc')
WHERE id = NEW.history_id OR id = OLD.history_id"
PL/pgSQL function update_history_update_time() line 9 at SQL statement

That looks like the new triggers updating the history update time at the same time as the job finish method is updating something related to the history ... we probably don't want to auto-update the history update time anymore? I'll take a look.
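
For reference, the trigger implied by the traceback is along these lines. The UPDATE body is quoted from the error context; the surrounding CREATE statements and the trigger name are a reconstruction for illustration, not necessarily the actual migration:

# Hypothetical reconstruction of the trigger from the traceback above; the
# UPDATE body is taken from the error context, the CREATE statements are assumed.
import sqlalchemy as sa

engine = sa.create_engine("postgresql://galaxy@localhost/galaxy")  # assumed URL

TRIGGER_DDL = """
CREATE OR REPLACE FUNCTION update_history_update_time() RETURNS TRIGGER AS $$
BEGIN
    -- Every change to a history content row also touches the parent
    -- history row, which is the lock the deadlock report is about.
    UPDATE history
        SET update_time = (now() at time zone 'utc')
        WHERE id = NEW.history_id OR id = OLD.history_id;
    RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER trigger_hda_update_history_update_time
    AFTER UPDATE ON history_dataset_association
    FOR EACH ROW EXECUTE PROCEDURE update_history_update_time();
"""

with engine.begin() as conn:
    conn.execute(sa.text(TRIGGER_DDL))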

@mvdbeek
Member

mvdbeek commented Sep 21, 2020

Hmm, I tried to simulate some somewhat deadlock-prone conditions with workflows and failing intermediate jobs, but no luck (that's often the case with deadlocks: they depend on the length of the transactions, and you hit them more easily under high system load ...). I have noticed an occasional deadlock on the Jenkins servers, so we should definitely do something about this.

The triggers were added in #8187, and @Nerdinacan mentioned an alternative that I think is feasible?

Sifting through the server-side code to each place we update a dataset and modify it to do the equivalent work of the trigger. This could still work, but on the whole triggers will be more performant and maintaining timestamps is probably the best use of a trigger. (Except maybe audit trails.)

A quick and easy way is to inject that logic in https://github.com/mvdbeek/galaxy/blob/4a822ad1a083348e0627b9c8222ba1bb03e40513/lib/galaxy/model/base.py#L64, where we already do something similar anyway. I think doing this there is less deadlock-prone because we can update the history update time just once for all dirty objects in a session, and we can isolate it from the larger transaction that may modify many rows. Not sure if this is really going to help, and without being able to consistently trigger the deadlock it's a shot in the dark ☹️ .
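
A minimal sketch of that idea (assumptions: a before_flush event hook, a history_id attribute on dirty content objects, and an expanding IN bind; the helper name is made up):

# Sketch of the before_flush idea: bump each affected history's update_time
# once per flush, instead of once per modified row via the trigger.
# Illustrative only -- not the actual patch.
from datetime import datetime
from itertools import chain

from sqlalchemy import bindparam, event, text


def install_history_update_time_hook(Session):
    @event.listens_for(Session, "before_flush")
    def bump_history_update_time(session, flush_context, instances):
        # Collect each affected history exactly once for the whole flush.
        history_ids = {
            obj.history_id
            for obj in chain(session.new, session.dirty, session.deleted)
            if getattr(obj, "history_id", None) is not None
        }
        if history_ids:
            stmt = text(
                "UPDATE history SET update_time = :now WHERE id IN :ids"
            ).bindparams(bindparam("ids", expanding=True))
            session.execute(stmt, {"now": datetime.utcnow(), "ids": list(history_ids)})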

@innovate-invent
Contributor Author

innovate-invent commented Sep 23, 2020

I tried the patch but it did not appear to solve the problem.

Error log
galaxy.metadata DEBUG 2020-09-23 08:18:37,802 setting metadata externally failed for HistoryDatasetAssociation 75133: Metadata results could not be read from '/data/jobs_directory/033/33491/metadata/metadata_results_outfile'
galaxy.jobs.runners ERROR 2020-09-23 08:18:42,330 (33491/galaxy-islandcompare-33491) Job wrapper finish method failed
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 7619 waits for ShareLock on transaction 957712; blocked by process 11270.
Process 11270 waits for ShareLock on transaction 957711; blocked by process 7619.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (890,92) in relation "dataset"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1713, in finish
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2637, in _flush
    transaction.rollback(_capture_exception=True)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2597, in _flush
    flush_context.execute()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 984, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1097, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1287, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 7619 waits for ShareLock on transaction 957712; blocked by process 11270.
Process 11270 waits for ShareLock on transaction 957711; blocked by process 7619.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (890,92) in relation "dataset"

[SQL: UPDATE dataset SET update_time=%(update_time)s, state=%(state)s, file_size=%(file_size)s WHERE dataset.id = %(dataset_id)s]
[parameters: {'update_time': datetime.datetime(2020, 9, 23, 8, 18, 40, 681763), 'state': 'ok', 'file_size': 285, 'dataset_id': 56360}]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.jobs ERROR 2020-09-23 08:18:42,377 fail(): Missing output file in working directory: [Errno 2] No such file or directory: '/data/jobs_directory/033/33491/outputs/galaxy_dataset_e68e6668-a26f-4e5e-9c3f-3b020d7adf6b.dat'
galaxy.tools.error_reports DEBUG 2020-09-23 08:19:00,612 Bug report plugin <galaxy.tools.error_reports.plugins.sentry.SentryPlugin object at 0x7f834baa4b20> generated response None

Confirmation of the patch in the running instance:

grep 'objects_requiring_history_updates' /srv/galaxy/lib/galaxy/model/__init__.py
def objects_requiring_history_updates(iter):
        for hda, hdca in objects_requiring_history_updates(chain(session.dirty, session.deleted)):

This is consistently occurring during my workflow invocation.

@mvdbeek
Member

mvdbeek commented Sep 23, 2020

Just a sanity check, you also ran the migration, right?

@innovate-invent
Contributor Author

It runs during deployment, and I believe that Galaxy errors out during init if the database is not at the correct version. I will rerun it and see what happens.

@innovate-invent
Contributor Author

Reported by the uwsgi app:

migrate.versioning.script.base DEBUG 2020-09-23 20:02:25,535 [p:1,w:0,m:0] [MainThread] Script lib/galaxy/model/migrate/versions/0165_add_content_update_time.py loaded successfully
migrate.versioning.script.base DEBUG 2020-09-23 20:02:25,535 [p:1,w:0,m:0] [MainThread] Loading script lib/galaxy/model/migrate/versions/0166_drop_update_time_triggers.py...
migrate.versioning.script.base DEBUG 2020-09-23 20:02:25,535 [p:1,w:0,m:0] [MainThread] Script lib/galaxy/model/migrate/versions/0166_drop_update_time_triggers.py loaded successfully
migrate.versioning.repository DEBUG 2020-09-23 20:02:25,535 [p:1,w:0,m:0] [MainThread] Repository lib/galaxy/model/migrate loaded successfully
migrate.versioning.repository DEBUG 2020-09-23 20:02:25,536 [p:1,w:0,m:0] [MainThread] Config: {'db_settings': {'repository_id': 'Galaxy', 'version_table': 'migrate_version', 'required_dbs': '[]'}}
galaxy.model.migrate.check INFO 2020-09-23 20:02:25,814 [p:1,w:0,m:0] [MainThread] At database version 166
galaxy.model.tool_shed_install.migrate.check INFO 2020-09-23 20:02:27,669 [p:1,w:0,m:0] [MainThread] At database version 17
galaxy.config INFO 2020-09-23 20:02:27,670 [p:1,w:0,m:0] [MainThread] Install database using its own connection sqlite:////data/config/toolshed.sqlite?isolation_level=IMMEDIATE

After rerunning the workflow with the database migration confirmed, it continues to fail, though not on the same job/step every time. It does appear to be the same tool, though: the failing jobs are all awk jobs. I assume this is because these jobs are very fast, with very short runtimes, and there are many of them.

@mvdbeek
Member

mvdbeek commented Sep 24, 2020

They fail because the output file wasn't collected:

galaxy.jobs ERROR 2020-09-18 19:35:48,180 fail(): Missing output file in working directory: [Errno 2] No such file or directory: '/data/jobs_directory/024/24438/outputs/galaxy_dataset_02b9521f-28a9-4cb2-9d9c-40f3a8088fd3.dat'

That can happen when the filesystem hasn't fully synced by the time the job is read back. You can tune retry_job_output_collection to get around this.
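
For example, in galaxy.yml (5 attempts is an arbitrary choice):

galaxy:
  retry_job_output_collection: 5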

Still we shouldn't ever deadlock here, at least not reproducibly and at a high volume.

@innovate-invent
Contributor Author

You are right, that does coincide with every deadlock. I wasn't aware of that configuration option. I will increase it and see if it resolves my issue for now.

@mvdbeek
Member

mvdbeek commented Sep 24, 2020

Actually, this particular error happens within the fail() method:

log.error("fail(): Missing output file in working directory: %s", unicodify(e))

so the job had already errored; the missing output file could be a red herring.

The thing that happens before the deadlock is

galaxy.metadata DEBUG 2020-09-18 19:35:46,971 setting metadata externally failed for HistoryDatasetAssociation 54263: Metadata results could not be read from '/data/jobs_directory/024/24438/metadata/metadata_results_outfile'

@innovate-invent
Contributor Author

Yeah, I am not sure why it is looking for a file there; every job throws that error. I assume it has something to do with an upcoming feature.

@mvdbeek
Member

mvdbeek commented Sep 24, 2020

I don't follow; the dataset collection error is downstream of the actual error that failed the job (which is a failing metadata job). But it seems that more than one process is updating the dataset state at the same time, causing the deadlock.
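
The reported wait cycle ("Process A waits for ShareLock on transaction B ...") is the classic shape of two transactions updating the same rows in opposite order. A self-contained illustration, assuming a reachable Postgres (the connection URL is made up):

# Illustration of the reported deadlock shape: two transactions each update
# a row the other has already locked, so each waits on the other's
# transaction. Not Galaxy code.
import threading

from sqlalchemy import create_engine, text

engine = create_engine("postgresql://galaxy@localhost/galaxy")


def tx(first_id, second_id):
    with engine.begin() as conn:  # one transaction per thread
        conn.execute(text("UPDATE dataset SET update_time = now() WHERE id = :i"),
                     {"i": first_id})
        # A longer transaction (e.g. a big flush) widens the race window here.
        conn.execute(text("UPDATE dataset SET update_time = now() WHERE id = :i"),
                     {"i": second_id})


t1 = threading.Thread(target=tx, args=(1, 2))
t2 = threading.Thread(target=tx, args=(2, 1))
t1.start(); t2.start()
t1.join(); t2.join()  # Postgres aborts one side with "deadlock detected"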

@innovate-invent
Contributor Author

innovate-invent commented Sep 24, 2020

Is it possible workers are not getting an actual lock on the job? I can scale back down to one worker and one workflow scheduler and try again. This is backed by an AWS RDS Postgres server.

Is the uwsgi config above sensible?

@mvdbeek
Member

mvdbeek commented Sep 24, 2020

I am sure I've seen this deadlock at least once in the Jenkins API tests, which run with just a single process of everything, in a test that checks that datasets produced downstream of a job are paused. So I'm thinking it's not really related to your config; maybe we're putting things into a threading queue when they shouldn't go there.

@mvdbeek
Member

mvdbeek commented Sep 24, 2020

Any chance you could add mvdbeek@3f6464b on top of the branch you're running right now?

It'll print a stack trace for every flush; I hope that will help us narrow down where the conflict is happening.
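The hook is essentially a before_flush listener along these lines (reconstructed from the frames it prints; see the output below):

# Reconstruction of the debugging hook, based on the output it produces:
# log the session's dirty objects and print the current thread's stack on
# every flush.
import logging
import sys
import threading
import traceback

log = logging.getLogger(__name__)


def before_flush(session, flush_context, instances):
    log.debug("Dirty objects: %s", session.dirty)
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
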
Just running this on my dev instance, I get this, for instance:

galaxy.jobs.runners.local DEBUG 2020-09-24 12:16:37,228 (27024) executing job script: /Users/mvandenb/src/galaxy/database/job_working_directory1/027/27024/galaxy_27024.sh
galaxy.model.base DEBUG 2020-09-24 12:16:37,253 Dirty objects: IdentitySet([<galaxy.model.Job(27024) at 0x149534cd0>, <galaxy.model.Dataset(125384) at 0x14941d490>, <galaxy.model.HistoryDatasetAssociation(138240) at 0x149540c70>])
  File "/Users/mvandenb/.vscode/extensions/ms-python.python-2020.8.109390/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 823, in __call__
    ret = self.original_func(*self.args, **self.kwargs)
  File "/Users/mvandenb/.pyenv/versions/3.8.2/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/Users/mvandenb/.pyenv/versions/3.8.2/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/mvandenb/.pyenv/versions/3.8.2/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/jobs/runners/local.py", line 118, in queue_job
    job_wrapper.change_state(model.Job.states.RUNNING, job=job)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/jobs/__init__.py", line 1399, in change_state
    self.sa_session.flush()
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 163, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2523, in flush
    self._flush(objects)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2552, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/model/base.py", line 68, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
galaxy.jobs.runners.util.process_groups DEBUG 2020-09-24 12:16:39,962 check_pg(): No process found in process group 95877
galaxy.jobs.runners.local DEBUG 2020-09-24 12:16:39,963 execution finished: /Users/mvandenb/src/galaxy/database/job_working_directory1/027/27024/galaxy_27024.sh
...
galaxy.model.base DEBUG 2020-09-24 12:16:40,034 Dirty objects: IdentitySet([<galaxy.model.Dataset(125384) at 0x1495116d0>, <galaxy.model.HistoryDatasetAssociation(138240) at 0x14949ac70>])
  File "/Users/mvandenb/.vscode/extensions/ms-python.python-2020.8.109390/pythonFiles/lib/python/debugpy/_vendored/pydevd/_pydev_bundle/pydev_monkey.py", line 823, in __call__
    ret = self.original_func(*self.args, **self.kwargs)
  File "/Users/mvandenb/.pyenv/versions/3.8.2/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/Users/mvandenb/.pyenv/versions/3.8.2/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/Users/mvandenb/.pyenv/versions/3.8.2/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/jobs/runners/local.py", line 153, in queue_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=job_id)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/jobs/__init__.py", line 1714, in finish
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 163, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2523, in flush
    self._flush(objects)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2552, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/Users/mvandenb/src/galaxy/.venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/Users/mvandenb/src/galaxy/lib/galaxy/model/base.py", line 68, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
...

There's no deadlock in my case, but it looks like change_state could conflict with the job finish method. If we could see that in the context of a deadlock, it would probably be very helpful (and we can link the dataset ids here).

@mvdbeek
Member

mvdbeek commented Sep 24, 2020

That traceback is odd; it indicates this is in an __init__ method now? Let me put the patch on top of the backported branch.

@innovate-invent
Contributor Author

Sorry, I ninja-deleted that error. I compared the patch to the file, and wiggle decided to inject the code into a completely random spot without any warning. 20.05 does not have the versioned_session function in base.py.

@mvdbeek
Member

mvdbeek commented Sep 24, 2020

562cd58 should apply without conflicts

@innovate-invent
Contributor Author

innovate-invent commented Sep 24, 2020

I am noticing something that could be related while the uwsgi app initializes:

galaxy.web_stack.handlers DEBUG 2020-09-24 20:58:44,672 [p:1,w:0,m:0] [MainThread] Read definition for handler 'workflow-scheduler0'
galaxy.web_stack.handlers DEBUG 2020-09-24 20:58:44,673 [p:1,w:0,m:0] [MainThread] Read definition for handler 'workflow-scheduler1'
galaxy.web_stack.handlers DEBUG 2020-09-24 20:58:44,673 [p:1,w:0,m:0] [MainThread] Read definition for handler 'workflow-scheduler2'
galaxy.web_stack.handlers DEBUG 2020-09-24 20:58:44,673 [p:1,w:0,m:0] [MainThread] Read definition for handler 'workflow-scheduler3'
galaxy.web_stack.handlers DEBUG 2020-09-24 20:58:44,673 [p:1,w:0,m:0] [MainThread] Read definition for handler 'workflow-scheduler0'
galaxy.web_stack.handlers DEBUG 2020-09-24 20:58:44,673 [p:1,w:0,m:0] [MainThread] Read definition for handler 'workflow-scheduler1'
galaxy.web_stack.handlers DEBUG 2020-09-24 20:58:44,673 [p:1,w:0,m:0] [MainThread] Read definition for handler 'workflow-scheduler2'
galaxy.web_stack.handlers DEBUG 2020-09-24 20:58:44,673 [p:1,w:0,m:0] [MainThread] Read definition for handler 'workflow-scheduler3'

It lists the workflow schedulers twice. Could the schedulers be duplicated?

Some additional errors:
*** uWSGI is running in multiple interpreter mode ***
spawned uWSGI master process (pid: 1)
spawned uWSGI worker 1 (pid: 10, cores: 4)
running "unix_signal:2 gracefully_kill_them_all" (master-start)...
running "unix_signal:15 gracefully_kill_them_all" (master-start)...
spawned 2 offload threads for uWSGI worker 1
Exception ignored in: <function _after_at_fork_child_reinit_locks at 0x7f826adba700>
Traceback (most recent call last):
  File "/usr/local/lib/python3.8/logging/__init__.py", line 260, in _after_at_fork_child_reinit_locks
    _releaseLock()  # Acquired by os.register_at_fork(before=.
  File "/usr/local/lib/python3.8/logging/__init__.py", line 228, in _releaseLock
    _lock.release()
RuntimeError: cannot release un-acquired lock
galaxy.web_stack DEBUG 2020-09-24 20:58:58,684 [p:10,w:1,m:0] [MainThread] Calling postfork function: <bound method ApplicationStack.set_postfork_server_name of 
galaxy.web_stack DEBUG 2020-09-24 20:58:58,689 [p:10,w:1,m:0] [MainThread] server_name set to: main.web.1
galaxy.web_stack DEBUG 2020-09-24 20:58:58,689 [p:10,w:1,m:0] [MainThread] Calling postfork function: <bound method JobManager.start of 
galaxy.web_stack DEBUG 2020-09-24 20:58:58,689 [p:10,w:1,m:0] [MainThread] Calling postfork function: <bound method DatabaseHeartbeat.start of 
galaxy.web_stack DEBUG 2020-09-24 20:58:58,698 [p:10,w:1,m:0] [MainThread] Calling postfork function: <bound method UWSGIApplicationStack.start of 
galaxy.web_stack DEBUG 2020-09-24 20:58:58,698 [p:10,w:1,m:0] [MainThread] Calling postfork function: <bound method GalaxyQueueWorker.bind_and_start of 
galaxy.queue_worker INFO 2020-09-24 20:58:58,698 [p:10,w:1,m:0] [MainThread] Binding and starting galaxy control worker for main.web.1
galaxy.web_stack DEBUG 2020-09-24 20:58:58,714 [p:10,w:1,m:0] [MainThread] Calling postfork function: <function UniverseApplication.__init__. at 0x7f8255817280>
galaxy.queue_worker INFO 2020-09-24 20:58:58,715 [p:10,w:1,m:0] [MainThread] Queuing async task rebuild_toolbox_search_index for main.web.1.
galaxy.model DEBUG 2020-09-24 20:59:01,194 [p:10,w:1,m:0] [database_heartbeart_main.web.1.thread] Dirty objects: IdentitySet([])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/model/database_heartbeat.py", line 100, in send_database_heartbeat
    self.update_watcher_designation()
  File "/srv/galaxy/lib/galaxy/model/database_heartbeat.py", line 89, in update_watcher_designation
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
galaxy.web_stack DEBUG 2020-09-24 20:59:25,064 [p:10,w:1,m:0] [MainThread] Calling postfork function: <built-in function register>
galaxy.web_stack DEBUG 2020-09-24 20:59:25,064 [p:10,w:1,m:0] [MainThread] Calling postfork function: <function postfork_setup at 0x7f825945f3a0>
galaxy.web_stack INFO 2020-09-24 20:59:25,064 [p:10,w:1,m:0] [MainThread] Galaxy server instance 'main.web.1' is running
Starting server in PID 1.
serving on uwsgi://127.0.0.1:9000
chdir() to /srv/galaxy
galaxy.queue_worker INFO 2020-09-24 20:59:25,604 [p:10,w:1,m:0] [Thread-1] Instance 'main.web.1' received 'rebuild_toolbox_search_index' task, executing now.
galaxy.tools.search DEBUG 2020-09-24 20:59:25,604 [p:10,w:1,m:0] [Thread-1] Starting to build toolbox index.
galaxy.tools.search DEBUG 2020-09-24 20:59:26,321 [p:10,w:1,m:0] [Thread-1] Toolbox index finished (717.055 ms)
galaxy.model DEBUG 2020-09-24 21:00:04,084 [p:10,w:1,m:0] [database_heartbeart_main.web.1.thread] Dirty objects: IdentitySet([<galaxy.model.WorkerProcess object at 0x7f8254262700>])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/model/database_heartbeat.py", line 100, in send_database_heartbeat
    self.update_watcher_designation()
  File "/srv/galaxy/lib/galaxy/model/database_heartbeat.py", line 89, in update_watcher_designation
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
galaxy.model DEBUG 2020-09-24 21:01:04,119 [p:10,w:1,m:0] [database_heartbeart_main.web.1.thread] Dirty objects: IdentitySet([<galaxy.model.WorkerProcess object at 0x7f8254262c40>])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/model/database_heartbeat.py", line 100, in send_database_heartbeat
    self.update_watcher_designation()
  File "/srv/galaxy/lib/galaxy/model/database_heartbeat.py", line 89, in update_watcher_designation
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
galaxy.model DEBUG 2020-09-24 21:02:09,223 [p:10,w:1,m:0] [database_heartbeart_main.web.1.thread] Dirty objects: IdentitySet([<galaxy.model.WorkerProcess object at 0x7f8254051580>])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/model/database_heartbeat.py", line 100, in send_database_heartbeat
    self.update_watcher_designation()
  File "/srv/galaxy/lib/galaxy/model/database_heartbeat.py", line 89, in update_watcher_designation
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])

DB log:

2020-09-24 20:43:19 UTC::@:[16188]:LOG: automatic analyze of table "galaxy.public.history_dataset_association" system usage: CPU: user: 1.04 s, system: 217.08 s, elapsed: 1147.17 s
2020-09-24 20:43:24 UTC::@:[4967]:LOG: checkpoint complete: wrote 2684 buffers (11.0%); 0 WAL file(s) added, 0 removed, 1 recycled; write=269.156 s, sync=0.072 s, total=269.900 s; sync files=133, longest=0.002 s, average=0.000 s; distance=76973 kB, estimate=76973 kB
2020-09-24 20:43:54 UTC::@:[4967]:LOG: checkpoint starting: time
2020-09-24 20:45:05 UTC::@:[24569]:LOG: automatic vacuum of table "galaxy.public.galaxy_user": index scans: 1
pages: 2 removed, 1 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 11 removed, 4 remain, 1 are dead but not yet removable, oldest xmin: 1393946
buffer usage: 335 hits, 0 misses, 6 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.003 MB/s
system usage: CPU: user: 0.00 s, system: 0.16 s, elapsed: 18.54 s
2020-09-24 20:49:25 UTC::@:[26431]:LOG: automatic vacuum of table "galaxy.public.history": index scans: 1
pages: 0 removed, 34 remain, 0 skipped due to pins, 0 skipped frozen
tuples: 52 removed, 87 remain, 7 are dead but not yet removable, oldest xmin: 1393946
buffer usage: 387 hits, 0 misses, 2 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.00 s, system: 1.37 s, elapsed: 35.80 s
2020-09-24 20:57:45 UTC:10.0.1.145(32926):galaxy@galaxy:[17517]:LOG: could not receive data from client: Connection reset by peer
2020-09-24 20:57:45 UTC:10.0.1.145(59932):galaxy@galaxy:[15871]:LOG: could not receive data from client: Connection reset by peer
2020-09-24 20:57:45 UTC:10.0.1.145(59940):galaxy@galaxy:[15876]:LOG: could not receive data from client: Connection reset by peer
2020-09-24 20:57:45 UTC:10.0.1.145(32924):galaxy@galaxy:[17516]:LOG: could not receive data from client: Connection reset by peer
2020-09-24 20:57:45 UTC:10.0.1.145(59936):galaxy@galaxy:[15872]:LOG: could not receive data from client: Connection reset by peer
2020-09-24 20:58:20 UTC::@:[4967]:LOG: checkpoint complete: wrote 5367 buffers (22.1%); 0 WAL file(s) added, 0 removed, 0 recycled; write=864.685 s, sync=0.001 s, total=865.135 s; sync files=138, longest=0.001 s, average=0.000 s; distance=63369 kB, estimate=75612 kB
2020-09-24 20:58:58 UTC::@:[30383]:LOG: automatic vacuum of table "galaxy.public.worker_process": index scans: 0
pages: 0 removed, 23 remain, 1 skipped due to pins, 0 skipped frozen
tuples: 0 removed, 129 remain, 52 are dead but not yet removable, oldest xmin: 1393946
buffer usage: 137 hits, 0 misses, 0 dirtied
avg read rate: 0.000 MB/s, avg write rate: 0.000 MB/s
system usage: CPU: user: 0.00 s, system: 0.00 s, elapsed: 36.11 s
2020-09-24 20:58:58 UTC::@:[4967]:LOG: checkpoint starting: time
2020-09-24 20:59:51 UTC:10.0.1.145(32862):galaxy@galaxy:[17189]:LOG: could not receive data from client: Connection reset by peer
2020-09-24 20:59:51 UTC:10.0.1.145(32862):galaxy@galaxy:[17189]:LOG: unexpected EOF on client connection with an open transaction

Finally, the workflow finished (with errors). That patch caused a serious performance issue in the uwsgi app or the database during the workflow run.

Here is the log around the deadlock:
urllib3.connectionpool DEBUG 2020-09-24 22:56:43,497 https://quay.io:443 "GET /api/v1/repository/biocontainers/gawk HTTP/1.1" 200 1936
galaxy.tool_util.deps.containers INFO 2020-09-24 22:56:43,500 Checking with container resolver [MulledDockerContainerResolver[namespace=biocontainers]] found description [ContainerDescription[identifier=quay.io/biocontainers/gawk:5.0.1,type=docker]]
galaxy.model DEBUG 2020-09-24 22:56:43,505 Dirty objects: IdentitySet([<galaxy.model.Job(53244) at 0x7efc7008b190>])
urllib3.connectionpool DEBUG 2020-09-24 22:56:43,596 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-53244 HTTP/1.1" 404 242
urllib3.connectionpool DEBUG 2020-09-24 22:56:43,603 https://172.20.0.1:443 "POST /apis/batch/v1/namespaces/islandcompare/jobs HTTP/1.1" 201 1770
galaxy.model DEBUG 2020-09-24 22:56:43,611 Dirty objects: IdentitySet([<galaxy.model.Job(53244) at 0x7efc7008b190>])
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1399, in change_state
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/jobs/handler.py", line 235, in __monitor
    self.__monitor_step()
  File "/srv/galaxy/lib/galaxy/jobs/handler.py", line 253, in __monitor_step
    self.__handle_waiting_jobs()
  File "/srv/galaxy/lib/galaxy/jobs/handler.py", line 388, in __handle_waiting_jobs
    self.dispatcher.put(self.job_wrappers.pop(job.id))
  File "/srv/galaxy/lib/galaxy/jobs/handler.py", line 987, in put
    self.job_runners[runner_name].put(job_wrapper)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 152, in put
    job_wrapper.enqueue()
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1462, in enqueue
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 111, in queue_job
    if not self.prepare_job(job_wrapper,
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 236, in prepare_job
    job_wrapper.prepare()
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1119, in prepare
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 111, in queue_job
    if not self.prepare_job(job_wrapper,
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 237, in prepare_job
    job_wrapper.runner_command_line = self.build_command_line(
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 273, in build_command_line
    container = self._find_container(job_wrapper)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 455, in _find_container
    job_wrapper.set_container(container)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 2285, in set_container
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 131, in queue_job
    self.__get_k8s_job_spec(ajs)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 220, in __get_k8s_job_spec
    k8s_job_spec = {"template": self.__get_k8s_job_spec_template(ajs),
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 235, in __get_k8s_job_spec_template
    "containers": self.__get_k8s_containers(ajs),
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 272, in __get_k8s_containers
    "image": self._find_container(ajs.job_wrapper).container_id,
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 455, in _find_container
    job_wrapper.set_container(container)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 2285, in set_container
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 164, in queue_job
    job_wrapper.set_external_id(k8s_job_name)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1434, in set_external_id
    self.sa_session.flush()
urllib3.connectionpool DEBUG 2020-09-24 22:56:43,659 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53241 HTTP/1.1" 200 1930
urllib3.connectionpool DEBUG 2020-09-24 22:56:43,665 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53244 HTTP/1.1" 200 1930
galaxy.model DEBUG 2020-09-24 22:56:43,724 Dirty objects: IdentitySet([<galaxy.model.HistoryDatasetAssociation(118292) at 0x7efc606919a0>, <galaxy.model.Dataset(90856) at 0x7efc603adeb0>, <galaxy.model.Job(53244) at 0x7efc7031ae50>])
urllib3.connectionpool DEBUG 2020-09-24 22:56:44,752 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53241 HTTP/1.1" 200 1930
urllib3.connectionpool DEBUG 2020-09-24 22:56:44,759 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53244 HTTP/1.1" 200 1930
urllib3.connectionpool DEBUG 2020-09-24 22:56:45,767 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53241 HTTP/1.1" 200 1930
urllib3.connectionpool DEBUG 2020-09-24 22:56:45,778 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53244 HTTP/1.1" 200 1930
urllib3.connectionpool DEBUG 2020-09-24 22:56:46,786 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53241 HTTP/1.1" 200 None
urllib3.connectionpool DEBUG 2020-09-24 22:56:46,801 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53244 HTTP/1.1" 200 1930
galaxy.model DEBUG 2020-09-24 22:56:47,017 Dirty objects: IdentitySet([<galaxy.model.Job(53241) at 0x7efc703ba7c0>])
galaxy.jobs DEBUG 2020-09-24 22:56:47,170 finish(): Moved /data/jobs_directory/053/53241/outputs/galaxy_dataset_d1fb5f67-06a0-4cd9-ac96-7a05a03e9f1f.dat to /data/database/files/090/dataset_90853.dat
galaxy.metadata DEBUG 2020-09-24 22:56:47,286 setting metadata externally failed for HistoryDatasetAssociation 118289: Metadata results could not be read from '/data/jobs_directory/053/53241/metadata/metadata_results_outfile'
galaxy.model DEBUG 2020-09-24 22:56:47,398 Dirty objects: IdentitySet([<galaxy.model.HistoryDatasetCollectionAssociation(1543) at 0x7efc3e71dfa0>, <galaxy.model.HistoryDatasetAssociation(118289) at 0x7efc72676b80>, <galaxy.model.Dataset(90853) at 0x7efc3e6ac280>])
urllib3.connectionpool DEBUG 2020-09-24 22:56:47,809 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53244 HTTP/1.1" 200 1930
galaxy.jobs.runners ERROR 2020-09-24 22:56:48,449 (53241/galaxy-islandcompare-53241) Job wrapper finish method failed
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 30664 waits for ShareLock on transaction 1399438; blocked by process 23950.
Process 23950 waits for ShareLock on transaction 1399440; blocked by process 30664.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (1353,14) in relation "dataset"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1713, in finish
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2637, in _flush
    transaction.rollback(_capture_exception=True)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2597, in _flush
    flush_context.execute()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 984, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1097, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1287, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 30664 waits for ShareLock on transaction 1399438; blocked by process 23950.
Process 23950 waits for ShareLock on transaction 1399440; blocked by process 30664.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (1353,14) in relation "dataset"

[SQL: UPDATE dataset SET update_time=%(update_time)s, state=%(state)s, file_size=%(file_size)s WHERE dataset.id = %(dataset_id)s]
[parameters: {'update_time': datetime.datetime(2020, 9, 24, 22, 56, 47, 415991), 'state': 'ok', 'file_size': 1434, 'dataset_id': 90853}]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.jobs ERROR 2020-09-24 22:56:48,484 fail(): Missing output file in working directory: [Errno 2] No such file or directory: '/data/jobs_directory/053/53241/outputs/galaxy_dataset_d1fb5f67-06a0-4cd9-ac96-7a05a03e9f1f.dat'
galaxy.model DEBUG 2020-09-24 22:56:48,501 Dirty objects: IdentitySet([<galaxy.model.Job(53241) at 0x7efc603ad2b0>, <galaxy.model.Dataset(90853) at 0x7efc3e6ac280>])
urllib3.connectionpool DEBUG 2020-09-24 22:56:48,818 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53244 HTTP/1.1" 200 1930
galaxy.model DEBUG 2020-09-24 22:56:49,498 Dirty objects: IdentitySet([<galaxy.model.HistoryDatasetAssociation(118289) at 0x7efc72676b80>, <galaxy.model.Dataset(90853) at 0x7efc3e6ac280>])
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 708, in monitor
    self.check_watched_items()
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 735, in check_watched_items
    new_async_job_state = self.check_watched_item(async_job_state)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 447, in check_watched_item
    job_state.job_wrapper.change_state(model.Job.states.RUNNING)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1399, in change_state
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 528, in _finish_or_resubmit_job
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1713, in finish
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 543, in _finish_or_resubmit_job
    job_wrapper.fail("Unable to finish job", exception=True)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1284, in fail
    dataset.state = dataset.states.ERROR
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 2532, in set_dataset_state
    object_session(self).flush()  # flush here, because hda.flush() won't flush the Dataset object
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    traceback.print_stack(sys._current_frames()[threading.get_ident()])
  File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
  File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
  File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
  File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 543, in _finish_or_resubmit_job
    job_wrapper.fail("Unable to finish job", exception=True)
galaxy.model DEBUG 2020-09-24 22:56:49,544 Dirty objects: IdentitySet([<galaxy.model.Job(53241) at 0x7efc603ad2b0>])
galaxy.tools.error_reports DEBUG 2020-09-24 22:56:49,602 Bug report plugin <galaxy.tools.error_reports.plugins.sentry.SentryPlugin object at 0x7efc73c04df0> generated response None
urllib3.connectionpool DEBUG 2020-09-24 22:56:49,764 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53241 HTTP/1.1" 200 None
urllib3.connectionpool DEBUG 2020-09-24 22:56:49,772 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-53241 HTTP/1.1" 200 1992
urllib3.connectionpool DEBUG 2020-09-24 22:56:49,780 https://172.20.0.1:443 "PATCH /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-53241 HTTP/1.1" 200 1992
urllib3.connectionpool DEBUG 2020-09-24 22:56:49,789 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-53241 HTTP/1.1" 200 1992
urllib3.connectionpool DEBUG 2020-09-24 22:56:49,802 https://172.20.0.1:443 "DELETE /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-53241 HTTP/1.1" 200 192
urllib3.connectionpool DEBUG 2020-09-24 22:56:49,829 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53244 HTTP/1.1" 200 1930
urllib3.connectionpool DEBUG 2020-09-24 22:56:50,841 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-53244 HTTP/1.1" 200 None
galaxy.model DEBUG 2020-09-24 22:56:50,956 Dirty objects: IdentitySet([<galaxy.model.Job(53244) at 0x7efc7031ab20>])
galaxy.jobs DEBUG 2020-09-24 22:56:51,027 finish(): Moved /data/jobs_directory/053/53244/outputs/galaxy_dataset_bb66e505-cbea-400b-b1fa-ad581e4cb988.dat to /data/database/files/090/dataset_90856.dat
galaxy.metadata DEBUG 2020-09-24 22:56:51,123 setting metadata externally failed for HistoryDatasetAssociation 118292: Metadata results could not be read from '/data/jobs_directory/053/53244/metadata/metadata_results_outfile'
galaxy.model DEBUG 2020-09-24 22:56:51,234 Dirty objects: IdentitySet([<galaxy.model.Dataset(90856) at 0x7efc606004c0>, <galaxy.model.HistoryDatasetAssociation(118292) at 0x7efc606007c0>, <galaxy.model.HistoryDatasetCollectionAssociation(1543) at 0x7efc70181400>])
galaxy.model DEBUG 2020-09-24 22:56:51,331 Dirty objects: IdentitySet([<galaxy.model.Job(53244) at 0x7efc603adcd0>])

logs-from-galaxy-worker-in-galaxy-worker-6db9967dd4-lstsz-2.txt
logs-from-galaxy-worker-in-galaxy-worker-6db9967dd4-7k4mg-2.txt
logs-from-galaxy-worker-in-galaxy-worker-6db9967dd4-26l6v-2.txt

@mvdbeek
Member

mvdbeek commented Sep 25, 2020

Alright, so there appears to be a conflict for dataset 90853, and it looks like the Kubernetes runner is attempting to set the job state to RUNNING after the job has already failed ... and that does happen after the deadlock is reported, which could just be a delay in logging or in executing the queries of the transaction that is not rolled back. Can you try this with max_pod_retries="0" in your job config (i.e. without pod retries)? If that solves it I think we may have to place a lock around changing job states?
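
For illustration, the kind of lock I have in mind would look something like the sketch below (purely hypothetical; neither `_state_change_lock` nor `change_state_locked` exists in Galaxy):

import threading

# Hypothetical sketch: serialize job state transitions within one handler
# process so that a late RUNNING update cannot race a concurrent
# finish()/fail() on the same job.
_state_change_lock = threading.Lock()

def change_state_locked(job_wrapper, state):
    # Only one thread at a time may flush a state change.
    with _state_change_lock:
        job_wrapper.change_state(state)

A process-local lock like this only serializes threads inside a single handler, though; the attached logs come from three separate worker pods, so contention across processes would need a database-level lock instead.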

Unfortunately the traceback logs got a bit tangled; I'll see if I can prefix them with an event id.

@mvdbeek
Member

mvdbeek commented Sep 25, 2020

If you're doing another round of testing, can you include mvdbeek@bac9fe7? That would make it easier to decipher the mangled logs.
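
Judging from the uuid-prefixed stacks in the logs further down (emitted from before_flush at galaxy/lib/galaxy/model/__init__.py line 6196), that commit roughly amounts to a session event listener like the following reconstruction (names and details approximate):

import logging
import sys
import threading
import traceback
import uuid

from sqlalchemy import event
from sqlalchemy.orm import Session

log = logging.getLogger(__name__)

def before_flush(session, flush_context, instances):
    # Tag each flush with a unique event id so the stacks of concurrent
    # threads can be told apart even when their log lines interleave.
    event_id = uuid.uuid4()
    stack = traceback.format_list(
        traceback.extract_stack(sys._current_frames()[threading.get_ident()]))
    # Prefix each stack frame with the event id, matching the log format
    # seen below ("<uuid>:   File ...").
    log.debug("%s: Dirty objects: %s, stack: %s",
              event_id, session.dirty, ("%s: " % event_id).join(stack))

event.listen(Session, "before_flush", before_flush)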

@innovate-invent
Contributor Author

innovate-invent commented Sep 25, 2020

Does max_pod_retries need to be explicitly set? I do not have it configured.

@mvdbeek
Member

mvdbeek commented Sep 25, 2020

Yes, k8s_pod_retrials is the right option. I see in your logs that a retrial happens (or else I don't understand why the job status is being set to RUNNING after the job error). If you don't set it, it defaults to 3.

@innovate-invent
Contributor Author

Looks like there are several duplicate params:

if 'max_pod_retries' in job_destination.params:
    max_pod_retries = int(job_destination.params['max_pod_retries'])
elif 'k8s_pod_retries' in self.runner_params:
    max_pod_retries = int(self.runner_params['k8s_pod_retries'])
elif 'max_pod_retrials' in job_destination.params:
    # For backward compatibility
    max_pod_retries = int(job_destination.params['max_pod_retrials'])
elif 'k8s_pod_retrials' in self.runner_params:
    # For backward compatibility
    max_pod_retries = int(self.runner_params['max_pod_retrials'])
else:
    max_pod_retries = 1
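
For illustration, those four spellings could be resolved through a single helper (hypothetical, not in Galaxy), which would also make the mismatch in the last branch obvious: it checks 'k8s_pod_retrials' but reads 'max_pod_retrials'.

def resolve_max_pod_retries(job_destination, runner_params, default=1):
    # Look up the retry count under each accepted spelling, in priority order.
    for params, key in (
        (job_destination.params, 'max_pod_retries'),
        (runner_params, 'k8s_pod_retries'),
        (job_destination.params, 'max_pod_retrials'),  # backward compatibility
        (runner_params, 'k8s_pod_retrials'),           # backward compatibility
    ):
        if key in params:
            return int(params[key])
    return default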

To clarify, would you like me to set it to 0, or set it to 3? It defaults to 1 in the linked code.

@innovate-invent
Contributor Author

I went ahead and reran with the patch and max_pod_retries unset (default value).

Here is the log around the deadlock:
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
galaxy.tool_util.deps.container_resolvers.mulled INFO 2020-09-25 17:47:14,487 Call to `docker images` failed, configured container resolution may be broken
galaxy.tool_util.deps.containers INFO 2020-09-25 17:47:14,488 Checking with container resolver [CachedMulledDockerContainerResolver[namespace=biocontainers]] found description [None]
Cannot connect to the Docker daemon at unix:///var/run/docker.sock. Is the docker daemon running?
galaxy.tool_util.deps.container_resolvers.mulled INFO 2020-09-25 17:47:14,564 Call to `docker images` failed, configured container resolution may be broken
galaxy.tool_util.deps.containers INFO 2020-09-25 17:47:14,565 Checking with container resolver [CachedMulledDockerContainerResolver[namespace=local]] found description [None]
urllib3.connectionpool DEBUG 2020-09-25 17:47:14,569 Starting new HTTPS connection (1): quay.io:443
urllib3.connectionpool DEBUG 2020-09-25 17:47:14,730 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-58581 HTTP/1.1" 200 None
galaxy.model DEBUG 2020-09-25 17:47:14,851 389547c0-b58f-4c0f-94e3-b599f20af340: Dirty objects: IdentitySet([<galaxy.model.Job(58581) at 0x7f29a06ebca0>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
389547c0-b58f-4c0f-94e3-b599f20af340:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
389547c0-b58f-4c0f-94e3-b599f20af340:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
389547c0-b58f-4c0f-94e3-b599f20af340:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
389547c0-b58f-4c0f-94e3-b599f20af340:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
389547c0-b58f-4c0f-94e3-b599f20af340:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
389547c0-b58f-4c0f-94e3-b599f20af340:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 528, in _finish_or_resubmit_job
    self.sa_session.flush()
389547c0-b58f-4c0f-94e3-b599f20af340:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
389547c0-b58f-4c0f-94e3-b599f20af340:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
389547c0-b58f-4c0f-94e3-b599f20af340:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
389547c0-b58f-4c0f-94e3-b599f20af340:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
389547c0-b58f-4c0f-94e3-b599f20af340:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.jobs DEBUG 2020-09-25 17:47:14,925 finish(): Moved /data/jobs_directory/058/58581/outputs/galaxy_dataset_08bb74d0-4c78-40ef-b6a7-5887176a90df.dat to /data/database/files/000/101/dataset_101056.dat
urllib3.connectionpool DEBUG 2020-09-25 17:47:14,926 https://quay.io:443 "GET /api/v1/repository/biocontainers/gawk HTTP/1.1" 200 1936
galaxy.tool_util.deps.containers INFO 2020-09-25 17:47:14,931 Checking with container resolver [MulledDockerContainerResolver[namespace=biocontainers]] found description [ContainerDescription[identifier=quay.io/biocontainers/gawk:5.0.1,type=docker]]
galaxy.model DEBUG 2020-09-25 17:47:14,936 457a50ef-e75d-4a89-8f1e-dc01816902f3: Dirty objects: IdentitySet([<galaxy.model.Job(58582) at 0x7f29c00e68e0>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 131, in queue_job
    self.__get_k8s_job_spec(ajs)
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 220, in __get_k8s_job_spec
    k8s_job_spec = {"template": self.__get_k8s_job_spec_template(ajs),
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 235, in __get_k8s_job_spec_template
    "containers": self.__get_k8s_containers(ajs),
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 272, in __get_k8s_containers
    "image": self._find_container(ajs.job_wrapper).container_id,
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 455, in _find_container
    job_wrapper.set_container(container)
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 2285, in set_container
    self.sa_session.flush()
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
457a50ef-e75d-4a89-8f1e-dc01816902f3:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

urllib3.connectionpool DEBUG 2020-09-25 17:47:14,954 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-58582 HTTP/1.1" 404 242
galaxy.metadata DEBUG 2020-09-25 17:47:14,962 setting metadata externally failed for HistoryDatasetAssociation 131275: Metadata results could not be read from '/data/jobs_directory/058/58581/metadata/metadata_results_outfile'
urllib3.connectionpool DEBUG 2020-09-25 17:47:14,963 https://172.20.0.1:443 "POST /apis/batch/v1/namespaces/islandcompare/jobs HTTP/1.1" 201 1770
galaxy.model DEBUG 2020-09-25 17:47:14,971 2d721fb0-a441-4208-9261-15c8c583c476: Dirty objects: IdentitySet([<galaxy.model.Job(58582) at 0x7f29c00e68e0>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
2d721fb0-a441-4208-9261-15c8c583c476:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
2d721fb0-a441-4208-9261-15c8c583c476:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
2d721fb0-a441-4208-9261-15c8c583c476:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
2d721fb0-a441-4208-9261-15c8c583c476:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 164, in queue_job
    job_wrapper.set_external_id(k8s_job_name)
2d721fb0-a441-4208-9261-15c8c583c476:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1434, in set_external_id
    self.sa_session.flush()
2d721fb0-a441-4208-9261-15c8c583c476:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
2d721fb0-a441-4208-9261-15c8c583c476:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
2d721fb0-a441-4208-9261-15c8c583c476:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
2d721fb0-a441-4208-9261-15c8c583c476:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
2d721fb0-a441-4208-9261-15c8c583c476:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.model DEBUG 2020-09-25 17:47:15,095 b724c0ac-2c27-427b-b61b-292e50431da6: Dirty objects: IdentitySet([<galaxy.model.Dataset(101056) at 0x7f29c02fe3d0>, <galaxy.model.HistoryDatasetCollectionAssociation(1567) at 0x7f29a0052460>, <galaxy.model.HistoryDatasetAssociation(131275) at 0x7f29a06eb1c0>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1713, in finish
    self.sa_session.flush()
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
b724c0ac-2c27-427b-b61b-292e50431da6:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.model DEBUG 2020-09-25 17:47:15,150 0023498e-43ca-48eb-ac56-77a8f40a122c: Dirty objects: IdentitySet([<galaxy.model.Job(58581) at 0x7f29c02fed30>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1725, in finish
    self.discover_outputs(job, inp_data, out_data, out_collections, final_job_state=final_job_state)
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1797, in discover_outputs
    self.tool.discover_outputs(
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/lib/galaxy/tools/__init__.py", line 1879, in discover_outputs
    collected = output_collect.collect_primary_datasets(
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/lib/galaxy/job_execution/output_collect.py", line 448, in collect_primary_datasets
    job_context.flush()
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/lib/galaxy/job_execution/output_collect.py", line 213, in flush
    self.sa_session.flush()
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
0023498e-43ca-48eb-ac56-77a8f40a122c:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.model DEBUG 2020-09-25 17:47:15,178 e9f0687f-0a9b-46a5-8271-134491b633c0: Dirty objects: IdentitySet([<galaxy.model.Dataset(101056) at 0x7f29a0052070>, <galaxy.model.Job(58581) at 0x7f29c02fed30>, <galaxy.model.User(2) at 0x7f29c032dc40>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1755, in finish
    self.sa_session.flush()
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
e9f0687f-0a9b-46a5-8271-134491b633c0:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.jobs INFO 2020-09-25 17:47:15,227 Collecting metrics for Job 58581 in /data/jobs_directory/058/58581
galaxy.model DEBUG 2020-09-25 17:47:15,230 7604f5a1-4211-43fd-b3f1-e862b5bde185: Dirty objects: IdentitySet([<galaxy.model.Job(58581) at 0x7f29c02fed30>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1765, in finish
    self.sa_session.flush()
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
7604f5a1-4211-43fd-b3f1-e862b5bde185:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.jobs DEBUG 2020-09-25 17:47:15,436 job_wrapper.finish for job 58581 executed (580.296 ms)
urllib3.connectionpool DEBUG 2020-09-25 17:47:15,442 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-58581 HTTP/1.1" 200 None
urllib3.connectionpool DEBUG 2020-09-25 17:47:15,448 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-58581 HTTP/1.1" 200 1992
urllib3.connectionpool DEBUG 2020-09-25 17:47:15,456 https://172.20.0.1:443 "PATCH /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-58581 HTTP/1.1" 200 1992
urllib3.connectionpool DEBUG 2020-09-25 17:47:15,461 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-58581 HTTP/1.1" 200 1992
urllib3.connectionpool DEBUG 2020-09-25 17:47:15,472 https://172.20.0.1:443 "DELETE /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-58581 HTTP/1.1" 200 192
urllib3.connectionpool DEBUG 2020-09-25 17:47:15,739 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-58582 HTTP/1.1" 200 1930
galaxy.model DEBUG 2020-09-25 17:47:15,767 3ad62da7-2eba-4c80-a477-5baf237b10ed: Dirty objects: IdentitySet([<galaxy.model.Dataset(101057) at 0x7f29c3c384f0>, <galaxy.model.HistoryDatasetAssociation(131276) at 0x7f29c02fee50>, <galaxy.model.Job(58582) at 0x7f29a0052700>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
3ad62da7-2eba-4c80-a477-5baf237b10ed:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
3ad62da7-2eba-4c80-a477-5baf237b10ed:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
3ad62da7-2eba-4c80-a477-5baf237b10ed:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 708, in monitor
    self.check_watched_items()
3ad62da7-2eba-4c80-a477-5baf237b10ed:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 735, in check_watched_items
    new_async_job_state = self.check_watched_item(async_job_state)
3ad62da7-2eba-4c80-a477-5baf237b10ed:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 447, in check_watched_item
    job_state.job_wrapper.change_state(model.Job.states.RUNNING)
3ad62da7-2eba-4c80-a477-5baf237b10ed:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1399, in change_state
    self.sa_session.flush()
3ad62da7-2eba-4c80-a477-5baf237b10ed:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
3ad62da7-2eba-4c80-a477-5baf237b10ed:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
3ad62da7-2eba-4c80-a477-5baf237b10ed:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
3ad62da7-2eba-4c80-a477-5baf237b10ed:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
3ad62da7-2eba-4c80-a477-5baf237b10ed:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

urllib3.connectionpool DEBUG 2020-09-25 17:47:16,913 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-58582 HTTP/1.1" 200 1930
urllib3.connectionpool DEBUG 2020-09-25 17:47:17,920 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-58582 HTTP/1.1" 200 None
galaxy.model DEBUG 2020-09-25 17:47:18,019 e60a1fb4-4953-4a0b-b122-4d0cef294ab6: Dirty objects: IdentitySet([<galaxy.model.Job(58582) at 0x7f29c02f7220>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
e60a1fb4-4953-4a0b-b122-4d0cef294ab6:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
e60a1fb4-4953-4a0b-b122-4d0cef294ab6:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
e60a1fb4-4953-4a0b-b122-4d0cef294ab6:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
e60a1fb4-4953-4a0b-b122-4d0cef294ab6:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
e60a1fb4-4953-4a0b-b122-4d0cef294ab6:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
e60a1fb4-4953-4a0b-b122-4d0cef294ab6:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 528, in _finish_or_resubmit_job
    self.sa_session.flush()
e60a1fb4-4953-4a0b-b122-4d0cef294ab6:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
e60a1fb4-4953-4a0b-b122-4d0cef294ab6:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
e60a1fb4-4953-4a0b-b122-4d0cef294ab6:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
e60a1fb4-4953-4a0b-b122-4d0cef294ab6:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
e60a1fb4-4953-4a0b-b122-4d0cef294ab6:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.jobs DEBUG 2020-09-25 17:47:18,073 finish(): Moved /data/jobs_directory/058/58582/outputs/galaxy_dataset_a1d12061-27a4-477e-a32d-2eff2e98ef20.dat to /data/database/files/000/101/dataset_101057.dat
galaxy.metadata DEBUG 2020-09-25 17:47:18,132 setting metadata externally failed for HistoryDatasetAssociation 131276: Metadata results could not be read from '/data/jobs_directory/058/58582/metadata/metadata_results_outfile'
galaxy.model DEBUG 2020-09-25 17:47:18,286 283c865d-a9d2-4efa-8fa5-5996b5b6831f: Dirty objects: IdentitySet([<galaxy.model.HistoryDatasetCollectionAssociation(1567) at 0x7f29a06eb100>, <galaxy.model.HistoryDatasetAssociation(131276) at 0x7f29c032d6a0>, <galaxy.model.Dataset(101057) at 0x7f29c032dc40>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1713, in finish
    self.sa_session.flush()
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
283c865d-a9d2-4efa-8fa5-5996b5b6831f:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.jobs.runners ERROR 2020-09-25 17:47:19,318 (58582/galaxy-islandcompare-58582) Job wrapper finish method failed
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 27568 waits for ShareLock on transaction 1519662; blocked by process 23942.
Process 23942 waits for ShareLock on transaction 1519661; blocked by process 27568.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2,55) in relation "history"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1713, in finish
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2637, in _flush
    transaction.rollback(_capture_exception=True)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2597, in _flush
    flush_context.execute()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 984, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1097, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1287, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 27568 waits for ShareLock on transaction 1519662; blocked by process 23942.
Process 23942 waits for ShareLock on transaction 1519661; blocked by process 27568.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2,55) in relation "history"

[SQL: UPDATE history SET update_time=%(update_time)s WHERE history.id = %(history_id)s]
[parameters: {'update_time': datetime.datetime(2020, 9, 25, 17, 47, 18, 289458), 'history_id': 81}]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.jobs ERROR 2020-09-25 17:47:19,377 fail(): Missing output file in working directory: [Errno 2] No such file or directory: '/data/jobs_directory/058/58582/outputs/galaxy_dataset_a1d12061-27a4-477e-a32d-2eff2e98ef20.dat'
galaxy.model DEBUG 2020-09-25 17:47:19,393 cce4eefb-dc93-433f-9aa3-e8ae67657b95: Dirty objects: IdentitySet([<galaxy.model.Job(58582) at 0x7f29c032d070>, <galaxy.model.Dataset(101057) at 0x7f29c032dc40>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 543, in _finish_or_resubmit_job
    job_wrapper.fail("Unable to finish job", exception=True)
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1284, in fail
    dataset.state = dataset.states.ERROR
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 2532, in set_dataset_state
    object_session(self).flush()  # flush here, because hda.flush() won't flush the Dataset object
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
cce4eefb-dc93-433f-9aa3-e8ae67657b95:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.model DEBUG 2020-09-25 17:47:20,313 035a49ce-8bec-45b0-90a0-3598ca0f09ca: Dirty objects: IdentitySet([<galaxy.model.Dataset(101057) at 0x7f29c032dc40>, <galaxy.model.HistoryDatasetAssociation(131276) at 0x7f29c032d6a0>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 543, in _finish_or_resubmit_job
    job_wrapper.fail("Unable to finish job", exception=True)
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1297, in fail
    self.sa_session.flush()
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
035a49ce-8bec-45b0-90a0-3598ca0f09ca:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.model DEBUG 2020-09-25 17:47:20,343 f3109993-956b-40eb-a408-9ff423dbb470: Dirty objects: IdentitySet([<galaxy.model.Job(58582) at 0x7f29c032d070>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
f3109993-956b-40eb-a408-9ff423dbb470:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
f3109993-956b-40eb-a408-9ff423dbb470:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
f3109993-956b-40eb-a408-9ff423dbb470:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 137, in run_next
    method(arg)
f3109993-956b-40eb-a408-9ff423dbb470:   File "/srv/galaxy/lib/galaxy/jobs/runners/kubernetes.py", line 584, in finish_job
    super(KubernetesJobRunner, self).finish_job(job_state)
f3109993-956b-40eb-a408-9ff423dbb470:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 780, in finish_job
    self._finish_or_resubmit_job(job_state, stdout, stderr, job_id=galaxy_id_tag, external_job_id=external_job_id)
f3109993-956b-40eb-a408-9ff423dbb470:   File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 543, in _finish_or_resubmit_job
    job_wrapper.fail("Unable to finish job", exception=True)
f3109993-956b-40eb-a408-9ff423dbb470:   File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1309, in fail
    self.sa_session.flush()
f3109993-956b-40eb-a408-9ff423dbb470:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
f3109993-956b-40eb-a408-9ff423dbb470:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
f3109993-956b-40eb-a408-9ff423dbb470:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
f3109993-956b-40eb-a408-9ff423dbb470:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
f3109993-956b-40eb-a408-9ff423dbb470:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.tools.error_reports DEBUG 2020-09-25 17:47:20,369 Bug report plugin <galaxy.tools.error_reports.plugins.sentry.SentryPlugin object at 0x7f29c55c1e20> generated response None
urllib3.connectionpool DEBUG 2020-09-25 17:47:20,511 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=app%3Dgalaxy-islandcompare-58582 HTTP/1.1" 200 None
urllib3.connectionpool DEBUG 2020-09-25 17:47:20,517 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-58582 HTTP/1.1" 200 1992
urllib3.connectionpool DEBUG 2020-09-25 17:47:20,527 https://172.20.0.1:443 "PATCH /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-58582 HTTP/1.1" 200 1992
urllib3.connectionpool DEBUG 2020-09-25 17:47:20,533 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-58582 HTTP/1.1" 200 1992
urllib3.connectionpool DEBUG 2020-09-25 17:47:20,546 https://172.20.0.1:443 "DELETE /apis/batch/v1/namespaces/islandcompare/jobs/galaxy-islandcompare-58582 HTTP/1.1" 200 192
galaxy.model DEBUG 2020-09-25 17:47:30,455 1f3340e6-8358-4f9c-a86b-cd59593abf81: Dirty objects: IdentitySet([<galaxy.model.WorkerProcess object at 0x7f29c3c381f0>]), stack:   File "/usr/local/lib/python3.8/threading.py", line 890, in _bootstrap
    self._bootstrap_inner()
1f3340e6-8358-4f9c-a86b-cd59593abf81:   File "/usr/local/lib/python3.8/threading.py", line 932, in _bootstrap_inner
    self.run()
1f3340e6-8358-4f9c-a86b-cd59593abf81:   File "/usr/local/lib/python3.8/threading.py", line 870, in run
    self._target(*self._args, **self._kwargs)
1f3340e6-8358-4f9c-a86b-cd59593abf81:   File "/srv/galaxy/lib/galaxy/model/database_heartbeat.py", line 100, in send_database_heartbeat
    self.update_watcher_designation()
1f3340e6-8358-4f9c-a86b-cd59593abf81:   File "/srv/galaxy/lib/galaxy/model/database_heartbeat.py", line 89, in update_watcher_designation
    self.sa_session.flush()
1f3340e6-8358-4f9c-a86b-cd59593abf81:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
1f3340e6-8358-4f9c-a86b-cd59593abf81:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
1f3340e6-8358-4f9c-a86b-cd59593abf81:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2525, in _flush
    self.dispatch.before_flush(self, flush_context, objects)
1f3340e6-8358-4f9c-a86b-cd59593abf81:   File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/event/attr.py", line 261, in __call__
    fn(*args, **kw)
1f3340e6-8358-4f9c-a86b-cd59593abf81:   File "/srv/galaxy/lib/galaxy/model/__init__.py", line 6196, in before_flush
    stack = traceback.format_list(traceback.extract_stack(sys._current_frames()[threading.get_ident()]))

galaxy.jobs.handler DEBUG 2020-09-25 17:47:57,812 Grabbed job(s): 58590
galaxy.jobs.mapper DEBUG 2020-09-25 17:47:57,853 (58590) Mapped job to destination id: k8s-tiny
galaxy.jobs.handler DEBUG 2020-09-25 17:47:57,874 (58590) Dispatching to k8s runner
galaxy.jobs DEBUG 2020-09-25 17:47:57,884 (58590) Persisting job destination (destination id: k8s-tiny)
galaxy.jobs DEBUG 2020-09-25 17:47:57,928 (58590) Working directory for job is: /data/jobs_directory/058/58590

@mvdbeek
Member

mvdbeek commented Sep 25, 2020

0; the goal is to eliminate any retries.

@mvdbeek
Member

mvdbeek commented Sep 25, 2020

This is also a different deadlock from the one we saw without the triggers (it's the one in the initial post). Is this on the branch with or without the triggers?

@innovate-invent
Contributor Author

innovate-invent commented Sep 25, 2020

The branch I am working with is the accumulation of all the patches you have suggested.

changes.txt

I will rerun setting max_pod_retries to 0.

@innovate-invent
Copy link
Contributor Author

I reran it and it failed again with the dataset error "Unable to finish job". The logs are now saturated, and the deadlock exception scrolled off the end of the log before I could capture it.

@innovate-invent
Copy link
Contributor Author

innovate-invent commented Sep 29, 2020

I reran the workflow with #10315; it continues to deadlock. The deadlock appeared in the workflow scheduler log on one run. I reran again and it appeared in both the job handler log and the workflow scheduler log.

Job Handler Log
galaxy.metadata DEBUG 2020-09-29 19:26:35,122 setting metadata externally failed for HistoryDatasetAssociation 205516: Metadata results could not be read from '/data/jobs_directory/089/89710/metadata/metadata_results_outfile'
galaxy.jobs.runners ERROR 2020-09-29 19:26:36,364 (89710/c26511b5-a1a9-43b1-9257-9a4bb0b8c371) Job wrapper finish method failed
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 4656 waits for ShareLock on transaction 2328545; blocked by process 22858.
Process 22858 waits for ShareLock on transaction 2328546; blocked by process 4656.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (1,25) in relation "history"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 540, in _finish_or_resubmit_job
    job_wrapper.finish(tool_stdout, tool_stderr, exit_code, check_output_detected_state=check_output_detected_state, job_stdout=job_stdout, job_stderr=job_stderr)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1713, in finish
    self.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2637, in _flush
    transaction.rollback(_capture_exception=True)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2597, in _flush
    flush_context.execute()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 984, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1097, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1287, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 4656 waits for ShareLock on transaction 2328545; blocked by process 22858.
Process 22858 waits for ShareLock on transaction 2328546; blocked by process 4656.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (1,25) in relation "history"

[SQL: UPDATE history SET update_time=%(update_time)s WHERE history.id = %(history_id)s]
[parameters: {'update_time': datetime.datetime(2020, 9, 29, 19, 26, 35, 353693), 'history_id': 94}]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.jobs ERROR 2020-09-29 19:26:36,422 fail(): Missing output file in working directory: [Errno 2] No such file or directory: '/data/jobs_directory/089/89710/outputs/galaxy_dataset_5f244287-4df8-494c-a327-60a65dfb285a.dat'
galaxy.tools.error_reports DEBUG 2020-09-29 19:26:36,658 Bug report plugin <galaxy.tools.error_reports.plugins.sentry.SentryPlugin object at 0x7fc22f2ace80> generated response None

Workflow Scheduler Log
galaxy.tools.actions INFO 2020-09-29 19:46:06,452 Setup for job Job[unflushed,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0] complete, ready to be enqueued (1.449 ms)
galaxy.web_stack.handlers INFO 2020-09-29 19:46:26,791 Flushed transaction for Job[id=89939,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0] (20338.910 ms)
galaxy.web_stack.handlers INFO 2020-09-29 19:46:26,791 (Job[id=89939,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.tools.execute DEBUG 2020-09-29 19:46:26,791 Tool toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0 created job 89939 (23095.373 ms)
galaxy.tools.actions INFO 2020-09-29 19:47:06,328 Handled output named outfile for tool toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0 (560.909 ms)
galaxy.tools.actions INFO 2020-09-29 19:47:06,853 Added output datasets to history (524.583 ms)
galaxy.tools.actions INFO 2020-09-29 19:47:06,854 Setup for job Job[unflushed,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0] complete, ready to be enqueued (1.405 ms)
galaxy.web_stack.handlers INFO 2020-09-29 19:47:08,083 Flushed transaction for Job[id=89940,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0] (1228.760 ms)
galaxy.web_stack.handlers INFO 2020-09-29 19:47:08,083 (Job[id=89940,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.tools.execute DEBUG 2020-09-29 19:47:08,084 Tool toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0 created job 89940 (2706.416 ms)
galaxy.tools.actions INFO 2020-09-29 19:47:40,899 Handled output named outfile for tool toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0 (2088.549 ms)
galaxy.tools.actions INFO 2020-09-29 19:47:43,708 Added output datasets to history (2809.126 ms)
galaxy.tools.actions INFO 2020-09-29 19:47:43,710 Setup for job Job[unflushed,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0] complete, ready to be enqueued (1.826 ms)
galaxy.web_stack.handlers INFO 2020-09-29 19:47:43,945 Flushed transaction for Job[id=89941,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0] (234.214 ms)
galaxy.web_stack.handlers INFO 2020-09-29 19:47:43,945 (Job[id=89941,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.tools.execute DEBUG 2020-09-29 19:47:43,945 Tool toolshed.g2.bx.psu.edu/repos/brinkmanlab/awkscript/awkscript/1.0 created job 89941 (7986.451 ms)
galaxy.workflow.run ERROR 2020-09-29 19:47:50,834 Failed to schedule Workflow[id=105], problem occurred on WorkflowStep[index=4,type=tool].
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 1903 waits for ShareLock on transaction 2333951; blocked by process 4656.
Process 4656 waits for ShareLock on transaction 2333952; blocked by process 1903.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2165,13) in relation "dataset"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 190, in invoke
    incomplete_or_none = self._invoke_step(workflow_invocation_step)
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 263, in _invoke_step
    incomplete_or_none = invocation_step.workflow_step.module.execute(self.trans,
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1745, in execute
    execution_tracker = execute(
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 102, in execute
    execute_single_job(execution_slice, completed_jobs[i])
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 74, in execute_single_job
    execution_tracker.record_success(execution_slice, job, result)
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 412, in record_success
    self.job_callback(job)
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1755, in <lambda>
    job_callback=lambda job: self._handle_post_job_actions(step, job, invocation.replacement_dict),
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1815, in _handle_post_job_actions
    self.trans.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2637, in _flush
    transaction.rollback(_capture_exception=True)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2597, in _flush
    flush_context.execute()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 984, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1097, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1287, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 1903 waits for ShareLock on transaction 2333951; blocked by process 4656.
Process 4656 waits for ShareLock on transaction 2333952; blocked by process 1903.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2165,13) in relation "dataset"

[SQL: UPDATE dataset SET update_time=%(update_time)s, file_size=%(file_size)s WHERE dataset.id = %(dataset_id)s]
[parameters: {'update_time': datetime.datetime(2020, 9, 29, 19, 47, 48, 407165), 'file_size': 132950, 'dataset_id': 158121}]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.workflow.run ERROR 2020-09-29 19:47:52,984 Failed to schedule Workflow[id=114,name=IslandCompare unpacked w. BLAST (imported from uploaded file) (imported from uploaded file)], problem occurred on WorkflowStep[index=17,type=subworkflow].
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 1903 waits for ShareLock on transaction 2333951; blocked by process 4656.
Process 4656 waits for ShareLock on transaction 2333952; blocked by process 1903.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2165,13) in relation "dataset"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 190, in invoke
    incomplete_or_none = self._invoke_step(workflow_invocation_step)
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 263, in _invoke_step
    incomplete_or_none = invocation_step.workflow_step.module.execute(self.trans,
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 514, in execute
    subworkflow_invoker.invoke()
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 190, in invoke
    incomplete_or_none = self._invoke_step(workflow_invocation_step)
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 263, in _invoke_step
    incomplete_or_none = invocation_step.workflow_step.module.execute(self.trans,
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1745, in execute
    execution_tracker = execute(
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 102, in execute
    execute_single_job(execution_slice, completed_jobs[i])
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 74, in execute_single_job
    execution_tracker.record_success(execution_slice, job, result)
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 412, in record_success
    self.job_callback(job)
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1755, in <lambda>
    job_callback=lambda job: self._handle_post_job_actions(step, job, invocation.replacement_dict),
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1815, in _handle_post_job_actions
    self.trans.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2637, in _flush
    transaction.rollback(_capture_exception=True)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2597, in _flush
    flush_context.execute()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 984, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1097, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1287, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 1903 waits for ShareLock on transaction 2333951; blocked by process 4656.
Process 4656 waits for ShareLock on transaction 2333952; blocked by process 1903.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2165,13) in relation "dataset"

[SQL: UPDATE dataset SET update_time=%(update_time)s, file_size=%(file_size)s WHERE dataset.id = %(dataset_id)s]
[parameters: {'update_time': datetime.datetime(2020, 9, 29, 19, 47, 48, 407165), 'file_size': 132950, 'dataset_id': 158121}]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.workflow.run ERROR 2020-09-29 19:47:52,984 Failed to execute scheduled workflow.
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 1903 waits for ShareLock on transaction 2333951; blocked by process 4656.
Process 4656 waits for ShareLock on transaction 2333952; blocked by process 1903.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2165,13) in relation "dataset"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 83, in __invoke
    outputs = invoker.invoke()
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 190, in invoke
    incomplete_or_none = self._invoke_step(workflow_invocation_step)
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 263, in _invoke_step
    incomplete_or_none = invocation_step.workflow_step.module.execute(self.trans,
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 514, in execute
    subworkflow_invoker.invoke()
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 190, in invoke
    incomplete_or_none = self._invoke_step(workflow_invocation_step)
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 263, in _invoke_step
    incomplete_or_none = invocation_step.workflow_step.module.execute(self.trans,
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1745, in execute
    execution_tracker = execute(
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 102, in execute
    execute_single_job(execution_slice, completed_jobs[i])
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 74, in execute_single_job
    execution_tracker.record_success(execution_slice, job, result)
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 412, in record_success
    self.job_callback(job)
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1755, in <lambda>
    job_callback=lambda job: self._handle_post_job_actions(step, job, invocation.replacement_dict),
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1815, in _handle_post_job_actions
    self.trans.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2637, in _flush
    transaction.rollback(_capture_exception=True)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2597, in _flush
    flush_context.execute()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 984, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1097, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1287, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1247, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 1903 waits for ShareLock on transaction 2333951; blocked by process 4656.
Process 4656 waits for ShareLock on transaction 2333952; blocked by process 1903.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (2165,13) in relation "dataset"

[SQL: UPDATE dataset SET update_time=%(update_time)s, file_size=%(file_size)s WHERE dataset.id = %(dataset_id)s]
[parameters: {'update_time': datetime.datetime(2020, 9, 29, 19, 47, 48, 407165), 'file_size': 132950, 'dataset_id': 158121}]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.workflow.scheduling_manager DEBUG 2020-09-29 19:47:54,286 Workflow invocation [313] scheduled
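
When the HINT says "See server log for query details", the competing statements can also be inspected live from pg_locks and pg_stat_activity while the processes are still waiting. A minimal psycopg2 sketch (the DSN is a placeholder; this is a generic monitoring query, not something Galaxy ships):

import psycopg2

conn = psycopg2.connect("dbname=galaxy")  # placeholder DSN
with conn, conn.cursor() as cur:
    # List waiting lock requests together with the granted locks (and the
    # queries holding them) that block each one.
    cur.execute("""
        SELECT blocked.pid        AS blocked_pid,
               blocked_act.query  AS blocked_query,
               blocking.pid       AS blocking_pid,
               blocking_act.query AS blocking_query
        FROM pg_locks blocked
        JOIN pg_stat_activity blocked_act ON blocked_act.pid = blocked.pid
        JOIN pg_locks blocking
          ON blocking.locktype = blocked.locktype
         AND blocking.database      IS NOT DISTINCT FROM blocked.database
         AND blocking.relation      IS NOT DISTINCT FROM blocked.relation
         AND blocking.page          IS NOT DISTINCT FROM blocked.page
         AND blocking.tuple         IS NOT DISTINCT FROM blocked.tuple
         AND blocking.transactionid IS NOT DISTINCT FROM blocked.transactionid
         AND blocking.pid != blocked.pid
        JOIN pg_stat_activity blocking_act ON blocking_act.pid = blocking.pid
        WHERE NOT blocked.granted AND blocking.granted;
    """)
    for row in cur.fetchall():
        print(row)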

@innovate-invent
Copy link
Contributor Author

innovate-invent commented Sep 30, 2020

Upgraded to 20.09, removing all patches except the k8s id patch. I reran the job and got a scheduler deadlock much earlier in the workflow execution than before:

Scheduler Log
galaxy.tools.actions INFO 2020-09-30 19:09:16,866 Setup for job Job[unflushed,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0] complete, ready to be enqueued (1.702 ms)
galaxy.web_stack.handlers INFO 2020-09-30 19:09:16,923 Flushed transaction for Job[id=90270,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0] (56.255 ms)
galaxy.web_stack.handlers INFO 2020-09-30 19:09:16,923 (Job[id=90270,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.tools.execute DEBUG 2020-09-30 19:09:16,923 Tool toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0 created job 90270 (194.642 ms)
galaxy.workflow.run ERROR 2020-09-30 19:09:18,008 Failed to schedule Workflow[id=114,name=IslandCompare unpacked w. BLAST (imported from uploaded file) (imported from uploaded file)], problem occurred on WorkflowStep[index=6,type=tool].
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1227, in _execute_context
    self.dialect.do_executemany(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 857, in do_executemany
    cursor.executemany(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 8872 waits for ShareLock on transaction 2363021; blocked by process 20256.
Process 20256 waits for ShareLock on transaction 2363020; blocked by process 8872.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (26,50) in relation "history_dataset_collection_association"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 190, in invoke
    incomplete_or_none = self._invoke_step(workflow_invocation_step)
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 263, in _invoke_step
    incomplete_or_none = invocation_step.workflow_step.module.execute(self.trans,
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1745, in execute
    execution_tracker = execute(
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 102, in execute
    execute_single_job(execution_slice, completed_jobs[i])
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 74, in execute_single_job
    execution_tracker.record_success(execution_slice, job, result)
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 412, in record_success
    self.job_callback(job)
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1755, in <lambda>
    job_callback=lambda job: self._handle_post_job_actions(step, job, invocation.replacement_dict),
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1815, in _handle_post_job_actions
    self.trans.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2637, in _flush
    transaction.rollback(_capture_exception=True)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2597, in _flush
    flush_context.execute()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 984, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1097, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1287, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1227, in _execute_context
    self.dialect.do_executemany(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 857, in do_executemany
    cursor.executemany(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 8872 waits for ShareLock on transaction 2363021; blocked by process 20256.
Process 20256 waits for ShareLock on transaction 2363020; blocked by process 8872.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (26,50) in relation "history_dataset_collection_association"

[SQL: UPDATE history_dataset_collection_association SET name=%(name)s, update_time=%(update_time)s WHERE history_dataset_collection_association.id = %(history_dataset_collection_association_id)s]
[parameters: ({'name': 'ERR375071.gbk Record Info', 'update_time': datetime.datetime(2020, 9, 30, 19, 9, 16, 953215), 'history_dataset_collection_association_id': 1717}, {'name': 'ERR375071.gbk as Genbank', 'update_time': datetime.datetime(2020, 9, 30, 19, 9, 16, 953219), 'history_dataset_collection_association_id': 1718})]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.workflow.run ERROR 2020-09-30 19:09:18,013 Failed to execute scheduled workflow.
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1227, in _execute_context
    self.dialect.do_executemany(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 857, in do_executemany
    cursor.executemany(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 8872 waits for ShareLock on transaction 2363021; blocked by process 20256.
Process 20256 waits for ShareLock on transaction 2363020; blocked by process 8872.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (26,50) in relation "history_dataset_collection_association"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 83, in __invoke
    outputs = invoker.invoke()
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 190, in invoke
    incomplete_or_none = self._invoke_step(workflow_invocation_step)
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 263, in _invoke_step
    incomplete_or_none = invocation_step.workflow_step.module.execute(self.trans,
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1745, in execute
    execution_tracker = execute(
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 102, in execute
    execute_single_job(execution_slice, completed_jobs[i])
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 74, in execute_single_job
    execution_tracker.record_success(execution_slice, job, result)
  File "/srv/galaxy/lib/galaxy/tools/execute.py", line 412, in record_success
    self.job_callback(job)
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1755, in <lambda>
    job_callback=lambda job: self._handle_post_job_actions(step, job, invocation.replacement_dict),
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1815, in _handle_post_job_actions
    self.trans.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 162, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2496, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2637, in _flush
    transaction.rollback(_capture_exception=True)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2597, in _flush
    flush_context.execute()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 422, in execute
    rec.execute(self)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 984, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 293, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1097, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1287, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1481, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1227, in _execute_context
    self.dialect.do_executemany(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 857, in do_executemany
    cursor.executemany(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 8872 waits for ShareLock on transaction 2363021; blocked by process 20256.
Process 20256 waits for ShareLock on transaction 2363020; blocked by process 8872.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (26,50) in relation "history_dataset_collection_association"

[SQL: UPDATE history_dataset_collection_association SET name=%(name)s, update_time=%(update_time)s WHERE history_dataset_collection_association.id = %(history_dataset_collection_association_id)s]
[parameters: ({'name': 'ERR375071.gbk Record Info', 'update_time': datetime.datetime(2020, 9, 30, 19, 9, 16, 953215), 'history_dataset_collection_association_id': 1717}, {'name': 'ERR375071.gbk as Genbank', 'update_time': datetime.datetime(2020, 9, 30, 19, 9, 16, 953219), 'history_dataset_collection_association_id': 1718})]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.workflow.scheduling_manager DEBUG 2020-09-30 19:09:18,046 Workflow invocation [318] scheduled
galaxy.queue_worker INFO 2020-09-30 19:23:47,562 Instance 'workflow-scheduler3' received 'reconfigure_watcher' task, executing now.
Job handler log
urllib3.connectionpool DEBUG 2020-09-30 19:09:04,295 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=controller-uid%3Df5a0f234-01eb-4b25-b277-31017493efe4 HTTP/1.1" 200 None
urllib3.connectionpool DEBUG 2020-09-30 19:09:04,309 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=controller-uid%3D957f7d2f-446f-40f2-b56b-3bd6600e54ae HTTP/1.1" 200 None
galaxy.jobs.handler ERROR 2020-09-30 19:09:04,320 failure running job 90236
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1283, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 20210 waits for ShareLock on transaction 2362395; blocked by process 8872.
Process 8872 waits for ShareLock on transaction 2362396; blocked by process 20210.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (29,94) in relation "history_dataset_collection_association"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/jobs/handler.py", line 407, in __handle_waiting_jobs
    self.dispatcher.put(self.job_wrappers.pop(job.id))
  File "/srv/galaxy/lib/galaxy/jobs/handler.py", line 1006, in put
    self.job_runners[runner_name].put(job_wrapper)
  File "/srv/galaxy/lib/galaxy/jobs/runners/__init__.py", line 152, in put
    job_wrapper.enqueue()
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1445, in enqueue
    self.change_state(model.Job.states.QUEUED, flush=False, job=job)
  File "/srv/galaxy/lib/galaxy/jobs/__init__.py", line 1385, in change_state
    job.update_output_states()
  File "/srv/galaxy/lib/galaxy/model/__init__.py", line 1262, in update_output_states
    sa_session.execute(statement, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 1291, in execute
    return self._connection_for_bind(bind, close_with_result=True).execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1020, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1133, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1323, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1517, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1283, in _execute_context
    self.dialect.do_execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/default.py", line 590, in do_execute
    cursor.execute(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 20210 waits for ShareLock on transaction 2362395; blocked by process 8872.
Process 8872 waits for ShareLock on transaction 2362396; blocked by process 20210.
HINT:  See server log for query details.
CONTEXT:  while updating tuple (29,94) in relation "history_dataset_collection_association"

[SQL: 
            UPDATE history_dataset_collection_association
            SET update_time = %(update_time)s
            WHERE id in (
                SELECT hdca.id
                FROM history_dataset_collection_association hdca
                INNER JOIN implicit_collection_jobs icj
                    on icj.id = hdca.implicit_collection_jobs_id
                INNER JOIN implicit_collection_jobs_job_association icjja
                    on icj.id = icjja.implicit_collection_jobs_id
                WHERE icjja.job_id = %(job_id)s
                UNION
                SELECT hdca2.id
                FROM history_dataset_collection_association hdca2
                WHERE hdca2.job_id = %(job_id)s
            );
        ]
[parameters: {'update_time': datetime.datetime(2020, 9, 30, 19, 9, 3, 313290), 'job_id': 90236}]
(Background on this error at: http://sqlalche.me/e/e3q8)
urllib3.connectionpool DEBUG 2020-09-30 19:09:04,334 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=controller-uid%3D53a2e44d-6bd6-43af-a02f-b6f24052b521 HTTP/1.1" 200 None
urllib3.connectionpool DEBUG 2020-09-30 19:09:04,364 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=controller-uid%3D9e642325-ab39-4c2e-bab3-d034e1e10ce7 HTTP/1.1" 200 None
urllib3.connectionpool DEBUG 2020-09-30 19:09:04,381 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=controller-uid%3D15751f10-492a-43a1-bf11-80f80c99a932 HTTP/1.1" 200 None
urllib3.connectionpool DEBUG 2020-09-30 19:09:04,396 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=controller-uid%3Dd4f49b61-bf2b-4eee-a1d7-b8b00bbcdde9 HTTP/1.1" 200 None
urllib3.connectionpool DEBUG 2020-09-30 19:09:04,411 https://172.20.0.1:443 "GET /apis/batch/v1/namespaces/islandcompare/jobs?labelSelector=controller-uid%3D59fdfcb9-f502-4585-9aef-d0890da50cb7 HTTP/1.1" 200 None
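
The pattern in these traces is two transactions updating overlapping sets of history/HDCA rows in different orders, so each ends up holding a row lock the other needs. One generic mitigation, sketched below, is to take the row locks in a deterministic order before updating; this is an illustration of the technique, not the fix Galaxy adopted, and it assumes an open psycopg2 cursor inside a transaction:

def touch_hdcas(cur, hdca_ids):
    """Lock then update the given HDCA rows in ascending id order.

    cur is an open psycopg2 cursor inside a transaction; hdca_ids is any
    iterable of integer ids (both assumptions for this sketch).
    """
    ids = sorted(hdca_ids)
    # Locking in a fixed (ascending) order means two concurrent transactions
    # touching overlapping sets queue up instead of deadlocking.
    cur.execute(
        "SELECT id FROM history_dataset_collection_association"
        " WHERE id = ANY(%s) ORDER BY id FOR UPDATE",
        (ids,),
    )
    cur.execute(
        "UPDATE history_dataset_collection_association"
        " SET update_time = (now() at time zone 'utc') WHERE id = ANY(%s)",
        (ids,),
    )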

@dannon
Copy link
Member

dannon commented Sep 30, 2020

To keep a log here: we talked in Gitter and are testing dropping the triggers to try to isolate the issue, using https://gist.github.com/00119ec94ef74574fe3173807429a14e

(so far, seems likely it's the triggers and we need to think of a clean way to avoid them locking)
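
The gist above drops the update_time triggers so runs can be compared with and without them. The shape of that is roughly as follows; the trigger name here is a placeholder, since the real names live in the gist and in Galaxy's migrations:

import psycopg2

ASSOCIATION_TABLES = [
    "history_dataset_association",
    "history_dataset_collection_association",
]

conn = psycopg2.connect("dbname=galaxy")  # placeholder DSN
with conn, conn.cursor() as cur:
    for table in ASSOCIATION_TABLES:
        # "update_time_trigger" is illustrative; substitute the trigger
        # names actually created by the migration.
        cur.execute(f"DROP TRIGGER IF EXISTS update_time_trigger ON {table};")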

@mvdbeek
Copy link
Member

mvdbeek commented Oct 1, 2020

We've already tried removing the triggers; that's what #10254 was about.

@innovate-invent
Copy link
Contributor Author

20.09 + the patch may have been the magic sauce. I can't actually confirm that it worked, because any time I try to access the history, Galaxy locks up for half an hour.

@mvdbeek
Copy link
Member

mvdbeek commented Oct 1, 2020

I guess if it works with Dannon's patch but not with #10254, it doesn't really matter whether it's a trigger or observing the sqlalchemy session. Which is cool: we can send the manual update_time statement over the message queue and do some debouncing there.
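
The debouncing idea, sketched out: route update_time writes through a queue and have a single consumer collapse them, so each history is touched at most once per interval instead of once per job event. A toy illustration only, assuming a plain in-process queue rather than Galaxy's actual message-queue machinery:

import time
from queue import Queue, Empty

updates: Queue = Queue()  # producers put history ids here instead of flushing

def debounced_writer(conn, interval=1.0):
    """Single consumer: batch ids for `interval` seconds, then issue one UPDATE.

    conn is assumed to be an open psycopg2 connection.
    """
    while True:
        pending = set()
        deadline = time.monotonic() + interval
        while True:
            timeout = deadline - time.monotonic()
            if timeout <= 0:
                break
            try:
                pending.add(updates.get(timeout=timeout))
            except Empty:
                break
        if pending:
            with conn, conn.cursor() as cur:
                cur.execute(
                    "UPDATE history SET update_time = (now() at time zone 'utc')"
                    " WHERE id = ANY(%s)",
                    (sorted(pending),),
                )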

@innovate-invent
Copy link
Contributor Author

Waiting on #10322 to confirm the deadlock issue is resolved.

@innovate-invent
Copy link
Contributor Author

The deadlock is resolved by the patch and upgrading to 20.09.
I will leave this issue open until a permanent solution is merged.

@jmchilton
Copy link
Member

I think this has been resolved in an unpatched 20.09 now. Can I close this?

@innovate-invent
Copy link
Contributor Author

What was the relevant PR?

@jmchilton
Copy link
Member

#10360

@innovate-invent
Copy link
Contributor Author

Give me a week to test this; with so many database schema changes/patches I mangled my database and need to wipe and start over before testing.

@jmchilton
Copy link
Member

We definitely improved it, but I think I heard reports that this is still happening at EU.

@innovate-invent
Copy link
Contributor Author

Rerunning again:

galaxy.tools ERROR 2020-11-18 19:39:19,936 Exception caught while attempting to execute tool with id 'toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0':
Traceback (most recent call last):
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1263, in _execute_context
    self.dialect.do_executemany(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 873, in do_executemany
    cursor.executemany(statement, parameters)
psycopg2.errors.DeadlockDetected: deadlock detected
DETAIL:  Process 9150 waits for ShareLock on transaction 2271001; blocked by process 14817.
Process 14817 waits for ShareLock on transaction 2270999; blocked by process 9150.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (13,49) in relation "history_dataset_collection_association"


The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/tools/__init__.py", line 1647, in handle_single_execution
    rval = self.execute(
  File "/srv/galaxy/lib/galaxy/tools/__init__.py", line 1745, in execute
    return self.tool_action.execute(self, trans, incoming=incoming, set_output_hid=set_output_hid, history=history, **kwargs)
  File "/srv/galaxy/lib/galaxy/tools/actions/__init__.py", line 524, in execute
    handle_output(name, output)
  File "/srv/galaxy/lib/galaxy/tools/actions/__init__.py", line 417, in handle_output
    trans.sa_session.flush()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/scoping.py", line 163, in do
    return getattr(self.registry(), name)(*args, **kwargs)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2523, in flush
    self._flush(objects)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2664, in _flush
    transaction.rollback(_capture_exception=True)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/langhelpers.py", line 68, in __exit__
    compat.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/session.py", line 2624, in _flush
    flush_context.execute()
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 419, in execute
    n.execute_aggregate(self, set_)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 521, in execute_aggregate
    self.execute(uow)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/unitofwork.py", line 586, in execute
    persistence.save_obj(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 230, in save_obj
    _emit_update_statements(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/orm/persistence.py", line 994, in _emit_update_statements
    c = cached_connections[connection].execute(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1020, in execute
    return meth(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/sql/elements.py", line 298, in _execute_on_connection
    return connection._execute_clauseelement(self, multiparams, params)
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1133, in _execute_clauseelement
    ret = self._execute_context(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1323, in _execute_context
    self._handle_dbapi_exception(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1517, in _handle_dbapi_exception
    util.raise_(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/util/compat.py", line 178, in raise_
    raise exception
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/engine/base.py", line 1263, in _execute_context
    self.dialect.do_executemany(
  File "/srv/galaxy/venv/lib/python3.8/site-packages/sqlalchemy/dialects/postgresql/psycopg2.py", line 873, in do_executemany
    cursor.executemany(statement, parameters)
sqlalchemy.exc.OperationalError: (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 9150 waits for ShareLock on transaction 2271001; blocked by process 14817.
Process 14817 waits for ShareLock on transaction 2270999; blocked by process 9150.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (13,49) in relation "history_dataset_collection_association"

[SQL: UPDATE history_dataset_collection_association SET name=%(name)s, update_time=%(update_time)s WHERE history_dataset_collection_association.id = %(history_dataset_collection_association_id)s]
[parameters: ({'name': 'ERR375017.gbk Record Info', 'update_time': datetime.datetime(2020, 11, 18, 19, 39, 18, 910437), 'history_dataset_collection_association_id': 810}, {'name': 'ERR375017.gbk as Genbank', 'update_time': datetime.datetime(2020, 11, 18, 19, 39, 18, 910440), 'history_dataset_collection_association_id': 811})]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.tools.execute WARNING 2020-11-18 19:39:19,940 There was a failure executing a job for tool [toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0] - Error executing tool with id 'toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0': (psycopg2.errors.DeadlockDetected) deadlock detected
DETAIL:  Process 9150 waits for ShareLock on transaction 2271001; blocked by process 14817.
Process 14817 waits for ShareLock on transaction 2270999; blocked by process 9150.
HINT:  See server log for query details.
CONTEXT:  while locking tuple (13,49) in relation "history_dataset_collection_association"

[SQL: UPDATE history_dataset_collection_association SET name=%(name)s, update_time=%(update_time)s WHERE history_dataset_collection_association.id = %(history_dataset_collection_association_id)s]
[parameters: ({'name': 'ERR375017.gbk Record Info', 'update_time': datetime.datetime(2020, 11, 18, 19, 39, 18, 910437), 'history_dataset_collection_association_id': 810}, {'name': 'ERR375017.gbk as Genbank', 'update_time': datetime.datetime(2020, 11, 18, 19, 39, 18, 910440), 'history_dataset_collection_association_id': 811})]
(Background on this error at: http://sqlalche.me/e/e3q8)
galaxy.web_stack.handlers INFO 2020-11-18 19:39:20,238 (Job[id=98858,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.web_stack.handlers INFO 2020-11-18 19:39:20,245 (Job[id=98859,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.web_stack.handlers INFO 2020-11-18 19:39:20,252 (Job[id=98860,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.web_stack.handlers INFO 2020-11-18 19:39:20,259 (Job[id=98861,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.web_stack.handlers INFO 2020-11-18 19:39:20,265 (Job[id=98862,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.web_stack.handlers INFO 2020-11-18 19:39:20,271 (Job[id=98863,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.web_stack.handlers INFO 2020-11-18 19:39:20,278 (Job[id=98864,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.web_stack.handlers INFO 2020-11-18 19:39:20,284 (Job[id=98865,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.web_stack.handlers INFO 2020-11-18 19:39:20,284 (Job[id=98866,tool_id=toolshed.g2.bx.psu.edu/repos/brinkmanlab/biopython_convert/biopython-convert/1.0]) Handler '_default_' assigned using 'db-skip-locked' assignment method
galaxy.workflow.run ERROR 2020-11-18 19:39:20,530 Failed to schedule Workflow[id=42,name=IslandCompare unpacked w. BLAST], problem occurred on WorkflowStep[index=6,type=tool].
Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 190, in invoke
    incomplete_or_none = self._invoke_step(workflow_invocation_step)
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 263, in _invoke_step
    incomplete_or_none = invocation_step.workflow_step.module.execute(self.trans,
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1786, in execute
    raise Exception(message)
Exception: Failed to create one or more job(s) for workflow step.
galaxy.workflow.run ERROR 2020-11-18 19:39:20,531 Failed to execute scheduled workflow.
Traceback (most recent call last):
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 83, in __invoke
    outputs = invoker.invoke()
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 190, in invoke
    incomplete_or_none = self._invoke_step(workflow_invocation_step)
  File "/srv/galaxy/lib/galaxy/workflow/run.py", line 263, in _invoke_step
    incomplete_or_none = invocation_step.workflow_step.module.execute(self.trans,
  File "/srv/galaxy/lib/galaxy/workflow/modules.py", line 1786, in execute
    raise Exception(message)
Exception: Failed to create one or more job(s) for workflow step.
galaxy.workflow.scheduling_manager DEBUG 2020-11-18 19:39:20,556 Workflow invocation [104] scheduled
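
Worth noting for context: PostgreSQL resolves a deadlock by aborting one victim transaction and expects the client to re-run it. Galaxy didn't take this route here, but a generic retry wrapper for a unit of work looks roughly like this (a sketch; the unit of work must rebuild its changes from scratch after each rollback):

import time

import psycopg2.errors
from sqlalchemy.exc import OperationalError

def run_with_deadlock_retry(session, unit_of_work, attempts=3, backoff=0.5):
    """Re-run a whole unit of work if PostgreSQL picks it as the deadlock victim."""
    for attempt in range(attempts):
        try:
            unit_of_work(session)  # must recreate its changes on every call
            session.commit()
            return
        except OperationalError as exc:
            session.rollback()  # discards the failed attempt entirely
            is_deadlock = isinstance(exc.orig, psycopg2.errors.DeadlockDetected)
            if not is_deadlock or attempt == attempts - 1:
                raise
            time.sleep(backoff * 2 ** attempt)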

@mvdbeek
Copy link
Member

mvdbeek commented Nov 27, 2020

So I looked carefully at your last log there. What's weird is that (Background on this error at: http://sqlalche.me/e/e3q8) appears twice there, and also that this is now happening while creating jobs, not finishing them.
So I'm not sure if 2 handlers are picking up the same invocations simultaneously (the only way I can think of this happening is if they have the same server name), or if this is also still a problem with the triggers.

Looking back at the thread, did you actually check whether #10254 resolved the issue?
If not, can you run with Dannon's patch for now (assuming that resolved the issue), and we'll work on eliminating the triggers for 21.01?

@mvdbeek
Copy link
Member

mvdbeek commented Nov 27, 2020

The traceback is also triggered at https://github.com/mvdbeek/galaxy/blob/d8358c602c1ce23119b4b44cfa2f124674a47cde/lib/galaxy/tools/actions/__init__.py#L417, which implies you're using legacy_eager_objectstore_initialization. Is there a reason for setting this? It should really be avoided for good performance.

@innovate-invent
Copy link
Contributor Author

innovate-invent commented Nov 27, 2020

My last exception report was with #10360. legacy_eager_objectstore_initialization was set to true for some reason while we were testing things.
If you need to verify any of my configuration settings, they are here: https://github.com/brinkmanlab/galaxy-container/blob/master/roles/galaxy_app/defaults/main/galaxy.yml

@mvdbeek
Copy link
Member

mvdbeek commented Nov 27, 2020

That PR fixed something that broke in 20.09, so I'm not surprised you're still seeing deadlocks that you also reported against 20.05. If things run smoothly with @dannon's patch I think it would be safe to run that in production. I don't think we actually use the update_time that those triggers set in our current UI.

@innovate-invent
Copy link
Contributor Author

Doesn't #10360 conflict with #10254? Should I be using both for 20.09?

@mvdbeek
Copy link
Member

mvdbeek commented Nov 27, 2020

#10360 is part of 20.09. It doesn't conflict with #10254 or dannon's patch, except that the migration number should be higher now. I will update #10254 to match the current migration, but #10254 just moves the trigger logic into sqlalchemy-space, so it might still have the same problems (though maybe it doesn't; it would be awesome to know). I can also create a variant that is like the dropped migration, but that also removes the additional update_time queries, which I don't think we need before merging the big history PR.

@mvdbeek
Copy link
Member

mvdbeek commented Nov 27, 2020

I've added #10821, which drops all manual update_time triggers and manipulation on HDAs and HDCAs (their update_time is set with onupdate anyway); that should solve the problem. I'm not sure we'll actually merge it into 20.09, since it would require a migration, which we try to avoid for minor updates.
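
The onupdate behaviour referred to there, sketched with plain SQLAlchemy (the column layout mirrors the table names in this thread, not Galaxy's model definitions verbatim): any UPDATE the ORM emits for the row refreshes update_time as part of that same statement, so no trigger or extra UPDATE is needed.

from datetime import datetime

from sqlalchemy import Column, DateTime, Integer
from sqlalchemy.ext.declarative import declarative_base

Base = declarative_base()

class HistoryDatasetCollectionAssociation(Base):
    __tablename__ = "history_dataset_collection_association"
    id = Column(Integer, primary_key=True)
    create_time = Column(DateTime, default=datetime.utcnow)
    # Refreshed automatically whenever SQLAlchemy emits an UPDATE for the
    # row, making separate manual update_time statements redundant.
    update_time = Column(DateTime, default=datetime.utcnow, onupdate=datetime.utcnow)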

@innovate-invent
Copy link
Contributor Author

innovate-invent commented Nov 27, 2020

Is the goal simply to test that it works? If I use that patch in a production system, the next release will interfere with the migration count.

@mvdbeek
Copy link
Member

mvdbeek commented Nov 27, 2020

It would be good to confirm this solves the deadlocks, yes. Dropping the triggers is also something we could do outside of the migration, I think. Let me see what we can do there, and then maybe you could give this a go next week?

@innovate-invent
Copy link
Contributor Author

sure

@mvdbeek
Copy link
Member

mvdbeek commented Jun 15, 2021

This should be resolved for good with #11849 and #11914 🤞

@mvdbeek mvdbeek closed this as completed Jun 15, 2021