Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Upgrading to PostgreSQL 15 and moving to sclorg images #80

Merged
merged 27 commits into from
Apr 30, 2024

Conversation

rooftopcellist
Copy link
Member

@rooftopcellist rooftopcellist commented Mar 7, 2024

SUMMARY

Upgrading from PostgreSQL 13 to 15.

Major changes

  • Use new v1.34.1 ansible-operator base image and operator_sdk.util 0.5.0 - commit
  • The boolean for deleting the old Postgres PVC by default after Postgres upgrade is now fixed - commit
    • postgres_keep_pvc_after_upgrade: false means the old PG13 PVC will be deleted after upgrade by default
  • Add checksum for secrets and configmaps to deployments so containers - commit
  • Set new postgres configuration secret if managed database. This means you no longer need to delete existing postgres_configuration secrets in the namespace before restoring - commit
  • Add initContainer to initial Postgres data volume permissions if needed - commit

This is for k8s deployments only.

Add postgres init container if postgres_data_volume_init is true.

This is aimed to solve the issue where users may need to chmod or chown the postgres data volume for user 26, which is the user that is running postgres in the sclorg image.

For example, one can now set the follow on the AWX spec:

spec:
  postgres_data_volume_init: true
  postgres_init_container_commands: |
    chown 26:0 /var/lib/pgsql/data
    chmod 700 /var/lib/pgsql/data

Minor changes

  • Update scale down utility playbook to scale down the correct deployments - commit
  • Improved logic for checking PG_VERSION paths for both postgres and sclorg postgresql image commit
  • Set POSTGRES_* env vars from secrets so when pods cycle values are updated - commit
  • Scale up web and content replicas after upgrade and restore (previously it waited for the next reconciliation loop which took more time - commit
  • Specify Redis 7 image explicitly so that when we bump the version it will be intentional - commit
  • PostgreSQL: Cast sorted_old_postgres_pods as list for compatiblity with Ansible 2.9.x and 2.15.x - commit
  • PostgreSQL: Grant postgres role to galaxy for compatibility with new sclorg image - commit
  • Change default postgres_data_path to that of the new sclorg image (/var/lib/pgsql/data/userdata) - commit
  • Add database configuration docs - commit
  • Refactor backup content logic into k8s job to enable custom securityContext - commit
    • Set runAsUser to 1000 (galaxy user) for management pods in k8s
      so that it can access the contents of /var/lib/pulp
    • Add initContainer for copying content from /var/lib/pulp during
      backups
    • Add separate k8s job for copying content for file storage during
      restores
    • add rbac rules for cronjobs and jobs to operator SA
    • In k8s, set content pod user to 1000 like in the api deployment
    • Set UID 1000 in k8s for backup and restore management pods
    • Add a ttl for the k8s jobs so that the content PVC can be deleted if
      desired without ownerReference issues
  • Fix a bug where ! characters in the PGPASSWORD are not parsed correctly - commit
  • Add --ansible-log-events flag to Dockerfile to make it easier to change verbosity - commit
  • Rewrite backup and restore cr_object handling to store in yaml instead of json that can be invalid in some scenarios - commit
    • Refactor how cluster_name is set so that the period is excluded if set
      to an empty string.
  • Wait for the Postgres Service to be ready & create server secret earlier - commit
    • If the content or worker pods come up too fast and the postgres
      service is not ready, we get errors in those containers and it crash
      loops. This resolves that.
    • Move all of the configuration and variables needed to create the
      galaxy-server secret to the common role. Prior to this, we would see
      errors when applying the content and worker deployments because the
      server-secret did not yet exist.
    • Before applying the content and worker deployments, check to make sure
      the galaxy-server secret exists.
  • Refactor: Split Galaxy Server secret into a dedicated role - commit
    • without this change, nodeport deployments failed because the web
      service was node created by the time get_node_ip.yml was run.
    • Re-enable all PR checks
  • During backup, use the deployed application image specified on the CR for the backup k8s job that copies the contents of /var/lib/pulp - commit
    • Add ability to configure backup_resource_requirements and
      restore_resource_requirements
    • Deprecate backup_pvc_namespace parameter
    • Only add file-storage-pvc checksum if file_storage is enabled
  • Always specify api_version with k8s tasks using the Pod resource - commit

@rooftopcellist rooftopcellist marked this pull request as draft March 7, 2024 22:33
@rooftopcellist rooftopcellist requested a review from aknochow March 8, 2024 02:41
@rooftopcellist rooftopcellist marked this pull request as ready for review March 8, 2024 02:41
@rooftopcellist rooftopcellist requested a review from dsavineau March 8, 2024 13:59
@rooftopcellist rooftopcellist force-pushed the pg15 branch 3 times, most recently from 5726c68 to fccdc09 Compare March 8, 2024 19:06
@rooftopcellist
Copy link
Member Author

I am about to be OOO for 2 weeks, so just a note to anyone who merges this. We should not squash the commits here as some of the changes here we may want to port to other operators (specifically the change that checks both paths for PG_VERSION).

@rooftopcellist rooftopcellist force-pushed the pg15 branch 4 times, most recently from 3c8b29e to a10ad34 Compare March 11, 2024 18:31
roles/postgres/defaults/main.yml Outdated Show resolved Hide resolved
roles/postgres/tasks/upgrade_postgres.yml Outdated Show resolved Hide resolved
roles/redis/defaults/main.yml Outdated Show resolved Hide resolved
@aknochow
Copy link
Member

Looking into why those 2 checks are failing...

@dsavineau
Copy link
Contributor

@rooftopcellist I rebased this PR and added two commits (same as eda operator) in order to get the upgrade working

@kurokobo
Copy link
Contributor

Hi, thanks for working on this!

In changing PSQL from docker.io to sclorg, some considerations need to be made.

AWX Operator 2.13.0 has adopted sqlorg's PSQL, but there are a number of open issues, so it may be safer for Galaxy Operator and EDA Server Operator to collectively address them after things settle down.
EDA Server Operator has been using sclorg's PSQL 13 from the start, so the postgres_data_path problem seems unlikely.

@rooftopcellist
Copy link
Member Author

Thank you for the patches @dsavineau !

@kurokobo Thank for your notes. I think I have addressed all of these now:

  • The default postgres data path is now /var/lib/pgsql/data/userdata
  • Since the PGDATA environment variable is no longer respected by the sclorg image, I removed the ability to configure it as we did in the awx-operator. If a user has postgres_data_path defined on their Galaxy spec, it will be ignored (the k8s validator omits it silently when Galaxy CR is applied).

Note: I confirmed that in the sclorg and registry.redhat.io provided PostgreSQL images, the data dir path that is actually used is /var/lib/pgsql/data/userdata/, even though PGDATA gets set to the value passed into the StatefulSet yaml. So we no longer specify the PGDATA env var to avoid confusion, since it is not respected by the PostgreSQL image anyways.

I still need to try to reproduce the UID 26 permissions issue you mentioned (link), I don't see it on my openshift cluster, but that makes sense because permissions are handled differently there. I'll try to reproduce on a k8s cluster.

@kurokobo
Copy link
Contributor

kurokobo commented Apr 2, 2024

Just added link to my comment on the issue: ansible/awx-operator#1770 (comment)

@rooftopcellist
Copy link
Member Author

I have now added in the postgres_data_path related changes from this PR:

I also just included the changes to make it possible initialize the postgresql data volume with the correct permissions if running on k8s with a hostMounted volume.

@rooftopcellist rooftopcellist force-pushed the pg15 branch 2 times, most recently from 568f3a5 to 98536bf Compare April 4, 2024 19:08
@rooftopcellist rooftopcellist force-pushed the pg15 branch 3 times, most recently from a7f8f83 to e99c6b9 Compare April 30, 2024 00:39
…ontext

- Set runAsUser to 1000 (galaxy user) for management pods in k8s
  so that it can access the contents of /var/lib/pulp
- Add initContainer for copying content from /var/lib/pulp during
  backups
- Add separate k8s job for copying content for file storage during
  restores
- add rbac rules for cronjobs and jobs to operator SA
- In k8s, set content pod user to 1000 like in the api deployment
- Set UID 1000 in k8s for backup and restore management pods
- Add a ttl for the k8s jobs so that the content PVC can be deleted if
  desired without ownerReference issues
* This makes fixes a bug where ! characters in the PGPASSWORD are not parsed correctly
…nvalid json

* Refactor how cluster_name is set so that the period is excluded if set
  to an empty string.
* If the content or worker pods come up too fast and the postgres
  service is not ready, we get errors in those containers and it crash
  loops. This resolves that.
* Move all of the configuration and variables needed to create the
  galaxy-server secret to the common role. Prior to this, we would see
  errors when applying the content and worker deployments because the
  server-secret did not yet exist.
* Before applying the content and worker deployments, check to make sure
  the galaxy-server secret exists.
- without this change, nodeport deployments failed because the web
  service was node created by the time get_node_ip.yml was run.
- Re-enable all PR checks
@rooftopcellist
Copy link
Member Author

@dsavineau @kurokobo @aknochow This PR is now ready for merge IMO. I was able to do a fresh install, upgrade (main --> pg15 branch) and backup and restore on k3s using mounted volumes. There are details in the commit history for the changes needed and why.

* Add ability to configure backup_resource_requirements and
  restore_resource_requirements
* Deprecate backup_pvc_namespace parameter
* Only add file-storage-pvc checksum if file_storage is enabled
@rooftopcellist
Copy link
Member Author

Ok, testing went well. There is one caveat, upstream users on k3s will have a manual step:

If you see the following error in the postgres pod's logs upon upgrade, or when running a restore after the fact, then you are likely running on k3s and need to enable the init container to set permissions on the postgresql directory.

$ kubectl -n galaxy logs statefulset/new-galaxy-postgres-15
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied

Follow these steps to remediate the issue by setting the postgres_data_volume_init parameter true and deleting the new postgresql StatefulSet.

# Patch your Galaxy custom resource
kubectl -n galaxy patch galaxy <deployment-name> --type=merge -p '{"spec": {"postgres_data_volume_init": true}}'

# Delete the new postgres stateful set
kubectl -n galaxy delete statefulset <deployment-name>-postgres-15

cc @kurokobo

I will be sure to include a note in the release notes.

As we are ready to merge this, I will cut an upstream release first since a few things have merged since the last one, so that community members could get those fixes without doing the database upgrade if desired.

@rooftopcellist rooftopcellist merged commit 9035df1 into ansible:main Apr 30, 2024
5 checks passed
@kurokobo
Copy link
Contributor

kurokobo commented May 1, 2024

@rooftopcellist
Thanks for pointing that out. Currently my guide does not include any upgrade guide for Galaxy Operator (I have only for AWX Operator), but you are right, some manualy interventions are required.
For AWX Operator, I was provided the guide to create new PV for PSQL 15 that includes chown 26:0 command, so if the users read my guide carefully (I believe so 😃 ), I believe users never face the issue: https://github.com/kurokobo/awx-on-k3s/blob/main/tips/upgrade-operator.md

Anyway I will give new Galaxy Operator try in next days. Thanks!

@rooftopcellist
Copy link
Member Author

Ahh, even better! And should they miss a step and end up needing the initContainer to set the permissions to get the deployment back in a good state, they have the remediation steps on my earlier comment, which I'll include in the release notes.

Thanks for taking a look!

@rooftopcellist
Copy link
Member Author

I plan to cut another release including the changes in this PR tomorrow morning.

@kurokobo
Copy link
Contributor

kurokobo commented May 1, 2024

@rooftopcellist
Made a minimal test and seems both 2024.4.30 and 2024.5.1 work fine😃 I've updated my guide to support 2024.5.1.
Thanks for working hard and releasing this with detailed release notes!

@rooftopcellist
Copy link
Member Author

Wonderful news! Thanks for testing that out @kurokobo

@@ -14,66 +14,27 @@

- name: Set custom resource spec variable from backup
set_fact:
cr_spec: "{{ cr_object.stdout }}"
cr_spec_from_backup: "{{ cr_object.stdout }}"
cr_spec_strip: "{ "
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should remove this cr_spec_strip var, it looks like it is no longer used.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same with the other variables here that seem to have been for building the cr_spec... unless I've misunderstood this.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants