Add logic for upgrading from Postgres 13 to 15 #122

rooftopcellist · 2023-08-15T03:25:23Z

SUMMARY

Upgrading from postgres 13 to 15.

Related PR:

Upgrading to PostgreSQL 15 and moving to sclorg images awx-operator#1486

ISSUE TYPE

New or Enhanced Feature

roles/backup/vars/main.yml

roles/postgres/tasks/check_postgres_version.yml

roles/postgres/tasks/scale_down_deployment.yml

rooftopcellist · 2024-03-11T22:58:37Z

I can't figure out why CI is failing when running this line:

pulp content list

@aknochow @dsavineau any ideas?

dsavineau · 2024-03-12T17:12:55Z

> I can't figure out why CI is failing when running this line:
> pulp content list
> @aknochow @dsavineau any ideas?

I think this comment should have landed on the galaxy-operator repository instead of the EDA operator, right?

dsavineau · 2024-03-21T18:09:08Z

Not sure if I'm testing this correctly, because this should be the same code as awx-operator, but the upgrade fails when searching for the previous PostgreSQL pod.

TASK [Set info for previous postgres pod] **************************************
task path: /opt/ansible/roles/postgres/tasks/check_postgres_version.yml:22

ok: [localhost] => {
    "ansible_facts": {
        "sorted_old_postgres_pods": "<list_reverseiterator object at 0x7f539eaa5610>"
    },
    "changed": false
}

TASK [Set info for previous postgres pod] **************************************
task path: /opt/ansible/roles/postgres/tasks/check_postgres_version.yml:29

        "old_postgres_pod": "<"
    },
    "changed": false
}
(...)
TASK [postgres : Get old PostgreSQL version] ***********************************
task path: /opt/ansible/roles/postgres/tasks/check_postgres_version.yml:56

fatal: [localhost]: FAILED! => {
    "msg": "The task includes an option with an undefined variable. The error was: 'ansible.utils.unsafe_proxy.AnsibleUnsafeText object' has no attribute 'metadata'\n\nThe error appears to be in '/opt/ansible/roles/postgres/tasks/check_postgres_version.yml': line 56, column 7, but may\nbe elsewhere in the file depending on the exact syntax problem.\n\nThe offending line appears to be:\n\n\n    - name: Get old PostgreSQL version\n      ^ here\n"

Also, I noticed that postgres_keep_pvc_after_upgrade is used but is defined in neither the defaults nor the CRD.

dsavineau · 2024-03-21T19:06:51Z

Alright, this seems related to the ansible version (and/or python version) used by the operator.

EDA operator uses operator-sdk v1.27.0:
- ansible 2.9.27
- python 3.8

AWX operator uses operator-sdk v1.34.0:
- ansible-core 2.15.8
- python 3.9

If I switch to v1.34.0, I don't have the issue anymore. I'll need to look further into why it fails with the old version, because we're likely to still use ansible 2.9 downstream.

dsavineau · 2024-03-21T19:16:39Z

The reverse filter in ansible 2.9 returns an iterator, so we need to cast it to a list to avoid this issue.
With ansible-core 2.15 we don't have to do that, but keeping the list filter isn't a problem.
I would say that we should add the list filter even if we plan to update the operator-sdk to a newer image, because ansible 2.9 is still used downstream.
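
To illustrate the failure mode, here is a minimal Python sketch. It assumes the reverse filter in the older ansible simply passes through the lazy object that Python's built-in reversed() returns, which is consistent with the list_reverseiterator text in the log above. The pod names are hypothetical.

```python
# Hypothetical pod names; the real task sorts Kubernetes pod objects.
pods = ["eda-postgres-13-0", "eda-postgres-13-1"]

# Assumption: the filter hands back the lazy object from reversed(),
# matching the "list_reverseiterator" text seen in the task output.
lazy = reversed(sorted(pods))

# Rendering the iterator as text yields its repr, not the pod names.
as_text = str(lazy)
print(as_text)

# Taking the first element of that string returns a single character,
# which would explain the "<" reported for old_postgres_pod.
first_broken = as_text[0]
print(first_broken)

# Casting to a list first (the "| list" filter) restores normal indexing.
fixed = list(reversed(sorted(pods)))
first_fixed = fixed[0]
print(first_fixed)
```

Once the value is a plain string, any later attribute lookup such as .metadata on it fails, which matches the "has no attribute 'metadata'" error in the later task.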

dsavineau · 2024-03-22T00:53:12Z

@rooftopcellist I've pushed some fixes for this PR so don't forget to fetch the changes if you want to push new things

With those fixes, the error I'm facing occurs during the postgresql dump/restore on the pg 15 pod

pg_restore: error: could not execute query: ERROR:  must be member of role "postgres"
Command was: ALTER SCHEMA public OWNER TO postgres;

pg_restore: warning: errors ignored on restore: 1

Not sure why we have this. I'll continue to debug it next week and try a similar pg upgrade for the awx operator to see whether we have the same symptoms there.

dsavineau · 2024-03-25T19:01:46Z

@rooftopcellist with the latest changes I pushed, I was able to upgrade from 13 to 15 without issue.

However, the change that temporarily grants the postgres role to the eda postgresql user seems weird to me, because we never had to do this in other operators for other postgresql upgrades (12 or 13).

c2b13fd
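
As a rough illustration of the grant-then-revoke pattern described here, the following Python sketch wraps a restore step so that the role is removed again even if the restore fails. It is not the operator's actual implementation (which lives in Ansible tasks); the temporary_role helper, the run_sql callable, and the user name "eda" are all hypothetical.

```python
from contextlib import contextmanager


@contextmanager
def temporary_role(run_sql, role, user):
    """Grant `role` to `user` for the duration of the block, then revoke it."""
    run_sql(f'GRANT "{role}" TO "{user}";')
    try:
        yield
    finally:
        # Revoke even if the restore fails, so the extra privilege never lingers.
        run_sql(f'REVOKE "{role}" FROM "{user}";')


# Stand-in executor that only records statements; a real one would call psql.
executed = []

with temporary_role(executed.append, "postgres", "eda"):
    executed.append("-- pg_restore runs here")

print(executed)
```

The try/finally is the point of the sketch: the membership exists only while pg_restore needs it for ALTER SCHEMA public OWNER TO postgres.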

kurokobo · 2024-03-27T06:40:59Z

ansible/galaxy-operator#80 (comment)

dsavineau · 2024-03-27T13:22:16Z

ansible/galaxy-operator#80 (comment)

I have fewer concerns about the postgresql upgrade from 13 to 15 for the EDA operator, because it was already using the postgresql image from sclorg as the default.

rooftopcellist · 2024-04-09T00:13:26Z

@kurokobo I tested out a fresh install on k3s following similar steps to what you've shown in your awx-on-k3s repo and was not able to reproduce the permission error in the postgres pod that caused us to add these changes to the awx-operator:

Add postgres init container to resolve permissions for some k3s deployments awx-operator#1805

I pinned the redis image to c9s and will cut an eda-server-operator 1.0.2 release so that users have a working release before the PostgreSQL 15 upgrade (which we will likely put out as a 2.0.0 release). @kurokobo I plan to wait on releasing the PG15 changes until you and/or some other community folks get a chance to try these out.

Since all of the comments have been addressed and tests are passing in every scenario I can think of, I plan to merge this shortly.

Testing

On OpenShift:

I ran through a fresh install, backup, restore, and upgrade from main to the pg15 branch, all without error.

On k3s:
I did a fresh install, backup, restore and upgrade with no operator errors and no errors in the logs of the resulting deployments.

Notes for follow-up:
I noticed two things about how the pods scale down before upgrading the database:

- Currently, it loops through deployments and scales them down one by one, which is slow. It would be faster to scale down all of the deployments at once if possible. But this is only for postgresql upgrades, which are rare, so I'm inclined to leave it the way it is.
- We may want to scale down ui and redis as well while upgrading (just so that errors don't show in the eda-ui pod).

With ansible 2.9.27 (operator-sdk v1.27.0), the reverse filter returns an iterator, so we need to cast it to a list. This behavior doesn't occur with a more recent operator-sdk version such as v1.34.0 (ansible-core 2.15.8), but using the list filter on that version works too (even if it isn't needed).

"sorted_old_postgres_pods": "<list_reverseiterator object at 0x7f539eaa5610>"

Signed-off-by: Dimitri Savineau <[email protected]>

During the postgresql upgrade we scale down the EDA deployments, which requires the proper permissions to be configured (patch on deployments/scale).

message: deployments.apps "eda-default-worker" is forbidden: User "system:serviceaccount:aap:eda-server-operator-controller-manager" cannot patch resource "deployments/scale" in API group "apps" in the namespace aap
reason:Forbidden

Signed-off-by: Dimitri Savineau <[email protected]>

The variable for the postgresql version had a typo. It was using supported_postgres_version rather than supported_pg_version.

fatal: [localhost]: FAILED! => { "msg": "The task includes an option with an undefined variable. The error was: 'supported_postgres_version' is undefined }

Signed-off-by: Dimitri Savineau <[email protected]>

The postgres_keep_pvc_after_upgrade variable wasn't defined in the postgres role, leading to the following error:

"msg": "The conditional check 'postgres_keep_pvc_after_upgrade' failed. The error was: error while evaluating conditional (postgres_keep_pvc_after_upgrade): 'postgres_keep_pvc_after_upgrade' is undefined

Signed-off-by: Dimitri Savineau <[email protected]>

During the postgresql upgrade, we need to temporarily grant the postgres role to the eda postgresql user and remove it once the pg_restore is over.

pg_restore: error: could not execute query: ERROR: must be member of role "postgres"
Command was: ALTER SCHEMA public OWNER TO postgres;

Signed-off-by: Dimitri Savineau <[email protected]>


kurokobo · 2024-04-09T03:45:08Z

@rooftopcellist
Thanks, I will test 1.0.2 first, then test main branch, and let you know the results.
F.Y.I., I don't know if you know this, but my repository already contains a guide to deploying EDA Server using EDA Server Operator. I added it last summer when I started contributing to EDA Server Operator: https://github.com/kurokobo/awx-on-k3s/tree/main/rulebooks

rooftopcellist · 2024-04-09T19:06:55Z

I hadn't seen that yet, thanks for the link! I'll check out the similar one for galaxy. Thanks for checking out the changes!

kurokobo · 2024-04-11T14:56:10Z

@rooftopcellist
Sorry for my delay, and sorry for adding a comment on a closed PR.
I've tested both 1.0.2 and the main branch, and I can confirm that both work as expected in scenarios for fresh installation and upgrading.

One concern is that the postgres_data_volume_init option implemented in AWX Operator is not implemented in EDA Server Operator.
As long as we use sclorg's PSQL, the permission error for UID:26 that occurred in AWX Operator can also occur in this Operator if users are using specific backends for PVs, e.g. hostPath, Longhorn, etc.
I suspect that the reason you could not reproduce it in your environment is that you reused the directory you created for AWX, which was already chowned to 26:0.

To reproduce, please follow my guide again, but skip sudo chown 26:0 /data/eda/postgres-13/data. Currently my guide deploys 1.0.2: https://github.com/kurokobo/awx-on-k3s/blob/main/rulebooks/README.md

I think it is a good idea to implement postgres_data_volume_init at this time, as the same issue will probably be reported as the number of users grows. I guess the reason it has not been reported yet is simply that there are not many users of this Operator. Also, users who follow my guide to deploy EDA Server would not have faced this issue, since I added the chown for UID:26 to the guide from the beginning.

Any thoughts?

rooftopcellist · 2024-04-11T18:53:09Z

@kurokobo ahh you are absolutely correct. I did have the chown 26 line in my reproducer script because I saw it in your repo instructions.

But I agree, it's better to implement postgres_data_volume_init here as well, so that users won't need that extra line for k3s hostPath, Longhorn, etc.

rooftopcellist force-pushed the pg15 branch 4 times, most recently from 7845b42 to 3977146 Compare August 15, 2023 21:55

rooftopcellist requested review from dsavineau and rcarrillocruz August 24, 2023 14:55

rooftopcellist force-pushed the pg15 branch from 3977146 to a3a7815 Compare March 7, 2024 17:53

dsavineau reviewed Mar 7, 2024

roles/backup/vars/main.yml Outdated

rooftopcellist force-pushed the pg15 branch from 1d495b8 to 99a52eb Compare March 8, 2024 02:44

rcarrillocruz reviewed Mar 8, 2024

roles/postgres/tasks/check_postgres_version.yml

dsavineau reviewed Mar 8, 2024

roles/postgres/tasks/scale_down_deployment.yml

rooftopcellist force-pushed the pg15 branch from 99a52eb to c5473ba Compare March 11, 2024 19:35

rooftopcellist requested review from dsavineau and rcarrillocruz March 11, 2024 19:37

dsavineau force-pushed the pg15 branch from 809ac57 to c2b13fd Compare March 25, 2024 18:10

rooftopcellist force-pushed the pg15 branch from 6447cb8 to 76ddc53 Compare April 1, 2024 16:49

rooftopcellist force-pushed the pg15 branch from 76ddc53 to 10f706a Compare April 8, 2024 21:32

rooftopcellist added 3 commits April 8, 2024 20:30

Add logic for upgrading from Postgres 13 to 15

9eb5643

Add keepalive logic to avoid timeout when upgrading postgres

d74a909

Cast supported_pg_version to int where needed and delete old pg pvc

43ecb08

dsavineau and others added 7 commits April 8, 2024 20:30

Add Upgrade docs and PostgreSQL upgrade considerations

84b6f48

Remove the ability to configure the postgres_data_path parameter

e409043

- This was removed because it is not respected by the sclorg PostgreSQL image

rooftopcellist force-pushed the pg15 branch from 10f706a to 5511f93 Compare April 9, 2024 00:30

rooftopcellist added 2 commits April 8, 2024 20:32

Add database configuration docs

54222d0

Fix bug when injecting override database_secret during backup

1b76110

rooftopcellist force-pushed the pg15 branch from 5511f93 to 1b76110 Compare April 9, 2024 00:32

rooftopcellist merged commit 3c441ff into ansible:main Apr 9, 2024

kurokobo mentioned this pull request Apr 11, 2024

Implement init container for PSQL to avoid permission error on specific implementation of PVs #192

Closed
