Redeployment unable to startup again #166

Closed
andrewklau opened this issue Mar 16, 2017 · 16 comments

@andrewklau

I updated the resource limits for a postgresql-persistent 9.5 deployment, and the redeployed pod failed to start:

pg_ctl: another server might be running; trying to start server anyway
waiting for server to start....LOG:  redirecting log output to logging collector process
HINT:  Future log output will appear in directory "pg_log".
 done
server started
ERROR:  tuple already updated by self

It seems the first pod did not shut down cleanly and left its PID in /var/lib/pgsql/data/userdata/postmaster.pid on the volume, preventing the container from starting up again without manual intervention.

Perhaps an edge case, as this is the first time I have seen this across many other postgresql deployments.

@pedro-dlfa

pedro-dlfa commented May 25, 2018

Hello, I'm facing the same problem here: the container is unable to start, with the same output as detailed above.
Has anyone found any solution or workaround for this problem? I quickly tried removing the /var/lib/pgsql/data/userdata/postmaster.pid file, but when starting the container I get the same issue.

EDIT: I double checked, and in my case the output is:

 pg_ctl: another server might be running; trying to start server anyway
 waiting for server to start....LOG:  redirecting log output to logging collector process
 HINT:  Future log output will appear in directory "pg_log".
 ... done
 server started
 => sourcing /usr/share/container-scripts/postgresql/start/set_passwords.sh ...
 ERROR:  tuple already updated by self

@praiskup
Contributor

Thanks for the report. Interesting, @pedro-dlfa: so you manually dropped the pid file, and immediately after that the container again refused to start (because of the pid file)? It smells like pg_ctl stop isn't really doing what we expect.
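
For anyone debugging this, two quick checks show what the lock file actually records; a sketch, assuming the data directory path from the report above:

    pg_ctl status -D /var/lib/pgsql/data/userdata          # reports whether a postmaster is alive for this data dir
    head -n1 /var/lib/pgsql/data/userdata/postmaster.pid   # first line is the PID recorded by the last postmaster

If the recorded PID is 1, a restarted container cannot tell that entry apart from its own postmaster.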

@martin123218

Hello, I am facing the same issue as @pedro-dlfa. I tried to delete the file manually and redeploy the pod, but with no success. My workaround is to recreate the pod. I am using PostgreSQL 9.5.

@praiskup
Contributor

If you are affected by this, can you confirm that the deployment strategy is Recreate?

@martin123218

martin123218 commented Oct 19, 2018

Hi,
My strategy is Rolling, and I was affected again last week.

@pkubatrh
Member

Hi @martin123218

The problem with the Rolling strategy is that it tells OpenShift to first create a new pod with the same data volume as the old one, and only shut down the original pod once the new pod is up and running. Since two pods are, for a time, accessing (and presumably writing to) the same data volume, you can run into this issue.

Please use the Recreate strategy instead. There will be some downtime since the new pod is only started after the old pod gets shut down but you should not run into this issue anymore.
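
If your existing DeploymentConfig still uses Rolling, it can be switched in place; a minimal sketch, assuming the DeploymentConfig is named postgresql:

    oc patch dc/postgresql -p '{"spec":{"strategy":{"type":"Recreate"}}}'

The next deployment will then shut the old pod down before the new one mounts the volume.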

@bkabrda

bkabrda commented Nov 16, 2018

I've also just run into this issue. Is there a way to make this work with the "Rolling" strategy, to get zero-downtime upgrades?

@praiskup
Contributor

Not with this trivial layout. This problem is equivalent to the non-container scenario where you do dnf update postgresql-server. You have to shut down the old server, and start a new one. I.e. you cannot let two servers write into the same data directory.

Btw., the PostgreSQL server has a guard against the "multiple servers writing to the same data directory" situation, but unfortunately, in the container scenario, the server has a deterministic PID (PID=1). So a concurrent PostgreSQL server (in a different container) checks the pid/lock file, compares the recorded PID with its own PID, and assumes "I'm PID=1, so the PID file is a leftover from a previous run". It then removes the PID file and continues with data directory modifications. This has disaster potential.
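
In shell terms, the guard amounts to roughly this (a conceptual sketch only, not the actual postmaster C code; $PGDATA stands for the data directory, /var/lib/pgsql/data/userdata here):

    # First line of postmaster.pid is the PID of the postmaster that created it.
    old_pid=$(head -n1 "$PGDATA/postmaster.pid")
    if [ "$old_pid" = "$$" ]; then
        # In a container the new postmaster is also PID 1, so this branch is taken
        # even though another postmaster (in another container) may still be alive.
        rm "$PGDATA/postmaster.pid"    # assumed to be a leftover from a previous run
    fi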

Our templates only support the Recreate strategy. The fact that Rolling "mostly" works is a matter of luck: the old server happens not to be under heavy load.

That said, the zero-downtime problem needs to be solved on a higher logical layer.

@bkabrda

bkabrda commented Nov 16, 2018

Ok, that makes sense, thanks. If I wanted to solve this on a higher logical layer, how would I go about it? Do you have any good pointers?

@praiskup
Contributor

praiskup commented Mar 4, 2019

At this point, you'd have to start thinking about pgpool or a similar tool (I'd prefer to have a separate issue for such RFEs, to not go off-topic in this bug report).

@praiskup
Contributor

This issue seems to be caused by a concurrent run of multiple postgresql
containers against the same data directory (persistent VOLUME), e.g.
caused by the Rolling strategy in OpenShift.

I've heard an idea that it could also happen if OpenShift happens to be
moving the container to the idle state (because the HA proxy decided so)
while, during that time, some traffic causes the container to be woken up
(i.e. a new container is started even before the old one was successfully
moved to the idle state). Is anyone able to confirm that this could happen?

Anyway, I'd like to hear opinions on how to handle this situation properly;
how to protect against over-mounting the same storage, since detecting
this reasonably from within the container seems to be close to a hard
problem. The only way that comes to my mind is implementing a "best effort"
guard via some daemon implementing a "leader election" mechanism. Any links
to how others do this?

We might delegate this to OpenShift operators, but I suspect that
templates will have to stay supported anyway, or at least that
postgresql-container should also be usable from (some) templates; and thus
the problem won't disappear from non-operator use-cases, or plain "docker"
and "podman" use-cases.

@flo-ryan

flo-ryan commented Apr 9, 2020

Hi, I'm facing the same issue while using the Recreate strategy. Deleting postmaster.pid also did not help; I got the same error at the next pod startup.
Any idea on how to fix or work around this?

@ShaunDave

Had this problem after an issue with the underlying node caused it to terminate very ungracefully. A new pod got spun up (as it is supposed to) on a new node, but the container got stuck in a crash loop with this exact error message. Surely there needs to be an automated way around this problem? Especially because only a single replica is supported, there is not a lot of wiggle room for high availability if the container can't start.

@drobus

drobus commented Sep 18, 2020

This is an old issue, but I just faced the same with the Recreate strategy. The following articles explain how to reanimate the failing pod, and they helped me:
https://pathfinder-faq-ocio-pathfinder-prod.pathfinder.gov.bc.ca/DB/PostgresqlCrashLoopTupleError.html
https://serverfault.com/questions/942743/postgres-crash-loop-caused-by-a-tuple-concurrently-updated-error

We use only one database pod, so this may not solve the high-availability issue, but at least the database works again with one pod. Maybe it will be useful for somebody.
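
A rough sketch of a manual recovery of this kind (the linked articles have the full details; the DeploymentConfig name postgresql is an assumption, and the data path is taken from earlier in this thread):

    oc scale dc/postgresql --replicas=0               # stop the crash-looping pod
    oc debug dc/postgresql                            # start a debug pod with the same volume mounted
    # inside the debug pod:
    rm /var/lib/pgsql/data/userdata/postmaster.pid    # drop the stale lock file
    exit
    oc scale dc/postgresql --replicas=1               # start a fresh pod

As earlier comments note, dropping the lock file alone is not always sufficient, so check the linked articles for the full procedure.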

@hhorak
Member

hhorak commented Sep 7, 2021

We've also had an off-line discussion with Daniel Messer from RH, who has hit this problem in his team as well. After changing the strategy to Recreate, the problem seems to disappear, but a good point was raised: we should start testing the crash scenario in the CI tests (run the OpenShift template, then kill the pod or the postgres daemon directly). This seems like a good addition to our test coverage.
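
Such a crash-scenario test step could look roughly like this; a sketch, where the template name, label selector, and DeploymentConfig name are assumptions:

    oc new-app --template=postgresql-persistent                   # deploy from the template under test
    oc delete pod -l name=postgresql --grace-period=0 --force     # simulate an unclean kill of the running pod
    oc rollout status dc/postgresql                               # the replacement pod must come up cleanly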

@phracek
Member

phracek commented Feb 28, 2024

@drobus We changed the DeploymentConfig -> Deployment here: https://github.com/sclorg/postgresql-container/blob/master/examples/postgresql-persistent-template.json, and the strategy used there is 'Recreate'. I am therefore closing this issue.

In case it is not yet fixed, feel free to re-open it.

@phracek phracek closed this as completed Feb 28, 2024