
Postgres 15 pod: "cannot create directory '/var/lib/pgsql/data/userdata': Permission denied" #1770

Closed
Tfinn92 opened this issue Mar 13, 2024 · 56 comments

@Tfinn92

Tfinn92 commented Mar 13, 2024

Please confirm the following

  • I agree to follow this project's code of conduct.
  • I have checked the current issues for duplicates.
  • I understand that the AWX Operator is open source software provided for free and that I might not receive a timely response.

Bug Summary

Updating to 2.13.1 through helm results in the postgres-15 pod having the following error:
cannot create directory '/var/lib/pgsql/data/userdata': Permission denied

AWX Operator version

2.13.1

AWX version

24.0.0

Kubernetes platform

kubernetes

Kubernetes/Platform version

Rancher RKE2 v1.26.8+rke2r1 and another on v1.27.10+rke2r1

Modifications

no

Steps to reproduce

Have a cluster with 2.12.2 installed and run helm upgrade awx-operator awx-operator/awx-operator

Expected results

pods come up no problem

Actual results

postgres15 pod CrashLoopBackOff
Logs show "mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied"

Additional information

No response

Operator Logs

No response

@kurokobo
Contributor

@Tfinn92
The new PSQL runs as UID 26, so the PV has to be writable by this UID.
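
For a hostPath or local-path PV, for example, that means something like the following on the node backing the volume (a minimal sketch; the path is a placeholder and depends on your PV definition):

# On the node; /data/postgres-15/data is a hypothetical hostPath
sudo chown 26:26 /data/postgres-15/data
sudo chmod 700 /data/postgres-15/data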

@kurokobo
Contributor

@TheRealHaoLiu @fosterseth
The same consideration applies to backups: UID 26 has to have write permission on the PV to create the backup directory.

@mooky31

mooky31 commented Mar 14, 2024

Same problem here

@TheRealHaoLiu
Member

@kurokobo Where should that modification be made? Through a root-user init container, or is there something that could be done when setting up the PV?

@mooky31

mooky31 commented Mar 14, 2024

@kurokobo I think your advice is only valid if you use a PV on local storage. Since I use rook-ceph, I can't set permissions on the filesystem.
Note: it was working with 2.13.0.

@JSN-1

JSN-1 commented Mar 14, 2024

I had to create the volume, scale down the deployment/statefulset, mount the volume into another pod, and run:

mkdir userdata
chown 26:26 userdata

After that, the pod started and the upgrade continued.
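
Roughly, the kubectl side of that sequence might look like this (a sketch; resource names assume a default install in the awx namespace, with the PVC mounted at /pvc and the statefulset using the data subPath):

# Stop the crashing Postgres statefulset
kubectl -n awx scale statefulset awx-postgres-15 --replicas=0

# Mount the PVC into a throwaway pod (see the pod manifests later in this
# thread), then inside that pod:
mkdir /pvc/data/userdata
chown 26:26 /pvc/data/userdata

# Bring Postgres back up
kubectl -n awx scale statefulset awx-postgres-15 --replicas=1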

@TheRealHaoLiu
Member

An init container that runs as root and does the chown?

@Tfinn92
Author

Tfinn92 commented Mar 15, 2024

@Tfinn92 The new PSQL runs as UID 26, so the PV has to be writable by this UID.

While this is true, the old Postgres 13 container that the operator deployed before was using root as its user, so it seems like the devs got used to that freedom and tried applying the same logic in the 15 container, which, as we are seeing, fails.

[screenshot]

@Tfinn92
Author

Tfinn92 commented Mar 15, 2024

Heck, even looking in the PG13 container, the permissions expected now wouldn't be possible without the root user:
[screenshot]
Obviously the pathing is a little different as well, but I imagine the same principles could be applied to the PG15 container.

@jyanesnotariado

This issue is not limited to upgrades. I'm trying to start a new AWX instance from scratch and ran into the same problem.

@mooky31

mooky31 commented Mar 15, 2024

I confirm, this was also a new install for me.

@fosterseth
Member

fosterseth commented Mar 15, 2024

https://github.com/ansible/awx-operator/compare/devel...fosterseth:add_postgres_init_container?expand=1

@mooky31 @jyanesancert @Tfinn92 maybe something like that could help?

You can deploy the image quay.io/fosterseth/awx-operator:postgres_init, which has that change.

To use it, add whatever commands you want to your AWX spec, e.g.:

  init_postgres_extra_commands: |
    sudo touch /var/lib/pgsql/data/foo
    sudo touch /var/lib/pgsql/data/bar
    chown 26:26 /var/lib/pgsql/data/foo
    chown root:root /var/lib/pgsql/data/bar

So in your case, maybe mkdir /var/lib/pgsql/data/userdata and chmod/chown it for user 26.

If that works for you, let me know and we can get this change into devel.
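
Applied to the userdata error from this issue, the spec might look like this (an untested sketch based on the suggestion above):

  init_postgres_extra_commands: |
    mkdir -p /var/lib/pgsql/data/userdata
    chown 26:26 /var/lib/pgsql/data/userdata
    chmod 700 /var/lib/pgsql/data/userdata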

@deepblue868

deepblue868 commented Mar 15, 2024

For a new install, adding this to the spec fixed it for me. It was supposed to be the default in the previous version:
postgres_data_path: /var/lib/postgresql/data/pgdata
The install went through, but Postgres still uses /var/lib/pgsql/data/userdata, which is not on the PV.

@kurokobo
Contributor

@TheRealHaoLiu
Sorry for my delayed response.

where should that modification be made? through a root user init container or is there something that could be done when setting up the PV

Since the images under sclorg are mostly maintained by Red Hat, I think Red Hat should have best practices on this matter, rather than me 😞 Anyway, as @fosterseth is trying, using an init container with root is a possible solution.

Another well-known non-root PSQL implementation is Bitnami's (by VMware), which has almost the same restriction:

NOTE: As this is a non-root container, the mounted files and directories must have the proper permissions for the UID 1001.
https://hub.docker.com/r/bitnami/postgresql/

In their chart for this PSQL, there are params to control an initContainer that invokes chown / mkdir / chmod. If we enable this, PSQL gets an initContainer with runAsUser: 0 by default:

$ helm install bitnami/postgresql --generate-name --set volumePermissions.enabled=true
...

$ kubectl get statefulset postgresql-1710598237 -o yaml
...
      initContainers:
      - command:
        - /bin/sh
        - -ec
        - |
          chown 1001:1001 /bitnami/postgresql
          mkdir -p /bitnami/postgresql/data
          chmod 700 /bitnami/postgresql/data
          find /bitnami/postgresql -mindepth 1 -maxdepth 1 -not -name "conf" -not -name ".snapshot" -not -name "lost+found" | \
            xargs -r chown -R 1001:1001
          chmod -R 777 /dev/shm
        image: docker.io/bitnami/os-shell:12-debian-12-r16
        imagePullPolicy: IfNotPresent
        name: init-chmod-data
        resources: {}
        securityContext:
          runAsGroup: 0
          runAsNonRoot: false
          runAsUser: 0
          seccompProfile:
            type: RuntimeDefault
        terminationMessagePath: /dev/termination-log
        terminationMessagePolicy: File
        volumeMounts:
        - mountPath: /tmp
          name: empty-dir
          subPath: tmp-dir
        - mountPath: /bitnami/postgresql
          name: data
        - mountPath: /dev/shm
          name: dshm
...

There are related docs by Bitnami as well.

@craph
Contributor

craph commented Mar 18, 2024

How can we solve this kind of issue when the default storage class is Longhorn? 🤔

@kurokobo
Contributor

@craph
Try the workaround #1770 (comment) by @fosterseth, or deploy a temporary working pod that mounts the same PVC as the PSQL for AWX and modify the permissions.

@craph
Contributor

craph commented Mar 18, 2024

Hi @kurokobo ,

Thank you very much for the update

@craph Try the workaround #1770 (comment) by @fosterseth, or deploy a temporary working pod that mounts the same PVC as the PSQL for AWX and modify the permissions.

I have just created a new temporary pod and changed the permissions on the data as requested.

But it looks like the previous data hasn't been migrated. I can't log in to AWX anymore.

I can see a job awx-demo-migration and a pod awx-demo-migration-24.0.0 in state Completed, but I can't log in anymore.

And the old StatefulSet for Postgres 13 doesn't exist anymore, but I still have the old PVC for Postgres 13.

@craph
Contributor

craph commented Mar 18, 2024

@kurokobo here is the log of the migration. But now I can't log in to my AWX instance:

Operations to perform:
  Apply all migrations: auth, conf, contenttypes, dab_resource_registry, main, oauth2_provider, sessions, sites, social_django, sso
Running migrations:
  Applying contenttypes.0001_initial... OK
  Applying contenttypes.0002_remove_content_type_name... OK
  Applying auth.0001_initial... OK
  Applying main.0001_initial... OK
  Applying main.0002_squashed_v300_release... OK
  Applying main.0003_squashed_v300_v303_updates... OK
  Applying main.0004_squashed_v310_release... OK
  Applying conf.0001_initial... OK
  Applying conf.0002_v310_copy_tower_settings... OK
  Applying main.0005_squashed_v310_v313_updates... OK
  Applying main.0006_v320_release... OK
  Applying main.0007_v320_data_migrations... OK
  Applying main.0008_v320_drop_v1_credential_fields... OK
  Applying main.0009_v322_add_setting_field_for_activity_stream... OK
  Applying main.0010_v322_add_ovirt4_tower_inventory... OK
  Applying main.0011_v322_encrypt_survey_passwords... OK
  Applying main.0012_v322_update_cred_types... OK
  Applying main.0013_v330_multi_credential... OK
  Applying auth.0002_alter_permission_name_max_length... OK
  Applying auth.0003_alter_user_email_max_length... OK
  Applying auth.0004_alter_user_username_opts... OK
  Applying auth.0005_alter_user_last_login_null... OK
  Applying auth.0006_require_contenttypes_0002... OK
  Applying auth.0007_alter_validators_add_error_messages... OK
  Applying auth.0008_alter_user_username_max_length... OK
  Applying auth.0009_alter_user_last_name_max_length... OK
  Applying auth.0010_alter_group_name_max_length... OK
  Applying auth.0011_update_proxy_permissions... OK
  Applying auth.0012_alter_user_first_name_max_length... OK
  Applying conf.0003_v310_JSONField_changes... OK
  Applying conf.0004_v320_reencrypt... OK
  Applying conf.0005_v330_rename_two_session_settings... OK
  Applying conf.0006_v331_ldap_group_type... OK
  Applying conf.0007_v380_rename_more_settings... OK
  Applying conf.0008_subscriptions... OK
  Applying conf.0009_rename_proot_settings... OK
  Applying conf.0010_change_to_JSONField... OK
  Applying dab_resource_registry.0001_initial... OK
  Applying dab_resource_registry.0002_remove_resource_id... OK
  Applying dab_resource_registry.0003_alter_resource_object_id... OK
  Applying sessions.0001_initial... OK
  Applying main.0014_v330_saved_launchtime_configs... OK
  Applying main.0015_v330_blank_start_args... OK
  Applying main.0016_v330_non_blank_workflow... OK
  Applying main.0017_v330_move_deprecated_stdout... OK
  Applying main.0018_v330_add_additional_stdout_events... OK
  Applying main.0019_v330_custom_virtualenv... OK
  Applying main.0020_v330_instancegroup_policies... OK
  Applying main.0021_v330_declare_new_rbac_roles... OK
  Applying main.0022_v330_create_new_rbac_roles... OK
  Applying main.0023_v330_inventory_multicred... OK
  Applying main.0024_v330_create_user_session_membership... OK
  Applying main.0025_v330_add_oauth_activity_stream_registrar... OK
  Applying oauth2_provider.0001_initial... OK
  Applying oauth2_provider.0002_auto_20190406_1805... OK
  Applying oauth2_provider.0003_auto_20201211_1314... OK
  Applying oauth2_provider.0004_auto_20200902_2022... OK
  Applying oauth2_provider.0005_auto_20211222_2352... OK
  Applying main.0026_v330_delete_authtoken... OK
  Applying main.0027_v330_emitted_events... OK
  Applying main.0028_v330_add_tower_verify... OK
  Applying main.0030_v330_modify_application... OK
  Applying main.0031_v330_encrypt_oauth2_secret... OK
  Applying main.0032_v330_polymorphic_delete... OK
  Applying main.0033_v330_oauth_help_text... OK
2024-03-18 12:37:32,320 INFO     [-] rbac_migrations Computing role roots..
2024-03-18 12:37:32,321 INFO     [-] rbac_migrations Found 0 roots in 0.000113 seconds, rebuilding ancestry map
2024-03-18 12:37:32,321 INFO     [-] rbac_migrations Rebuild ancestors completed in 0.000004 seconds
2024-03-18 12:37:32,321 INFO     [-] rbac_migrations Done.
  Applying main.0034_v330_delete_user_role... OK
  Applying main.0035_v330_more_oauth2_help_text... OK
  Applying main.0036_v330_credtype_remove_become_methods... OK
  Applying main.0037_v330_remove_legacy_fact_cleanup... OK
  Applying main.0038_v330_add_deleted_activitystream_actor... OK
  Applying main.0039_v330_custom_venv_help_text... OK
  Applying main.0040_v330_unifiedjob_controller_node... OK
  Applying main.0041_v330_update_oauth_refreshtoken... OK
2024-03-18 12:37:33,605 INFO     [-] rbac_migrations Computing role roots..
2024-03-18 12:37:33,606 INFO     [-] rbac_migrations Found 0 roots in 0.000108 seconds, rebuilding ancestry map
2024-03-18 12:37:33,606 INFO     [-] rbac_migrations Rebuild ancestors completed in 0.000004 seconds
2024-03-18 12:37:33,606 INFO     [-] rbac_migrations Done.
  Applying main.0042_v330_org_member_role_deparent... OK
  Applying main.0043_v330_oauth2accesstoken_modified... OK
  Applying main.0044_v330_add_inventory_update_inventory... OK
  Applying main.0045_v330_instance_managed_by_policy... OK
  Applying main.0046_v330_remove_client_credentials_grant... OK
  Applying main.0047_v330_activitystream_instance... OK
  Applying main.0048_v330_django_created_modified_by_model_name... OK
  Applying main.0049_v330_validate_instance_capacity_adjustment... OK
  Applying main.0050_v340_drop_celery_tables... OK
  Applying main.0051_v340_job_slicing... OK
  Applying main.0052_v340_remove_project_scm_delete_on_next_update... OK
  Applying main.0053_v340_workflow_inventory... OK
  Applying main.0054_v340_workflow_convergence... OK
  Applying main.0055_v340_add_grafana_notification... OK
  Applying main.0056_v350_custom_venv_history... OK
  Applying main.0057_v350_remove_become_method_type... OK
  Applying main.0058_v350_remove_limit_limit... OK
  Applying main.0059_v350_remove_adhoc_limit... OK
  Applying main.0060_v350_update_schedule_uniqueness_constraint... OK
  Applying main.0061_v350_track_native_credentialtype_source... OK
  Applying main.0062_v350_new_playbook_stats... OK
  Applying main.0063_v350_org_host_limits... OK
  Applying main.0064_v350_analytics_state... OK
  Applying main.0065_v350_index_job_status... OK
  Applying main.0066_v350_inventorysource_custom_virtualenv... OK
  Applying main.0067_v350_credential_plugins... OK
  Applying main.0068_v350_index_event_created... OK
  Applying main.0069_v350_generate_unique_install_uuid... OK
  Applying main.0070_v350_gce_instance_id... OK
  Applying main.0071_v350_remove_system_tracking... OK
  Applying main.0072_v350_deprecate_fields... OK
  Applying main.0073_v360_create_instance_group_m2m... OK
  Applying main.0074_v360_migrate_instance_group_relations... OK
  Applying main.0075_v360_remove_old_instance_group_relations... OK
  Applying main.0076_v360_add_new_instance_group_relations... OK
  Applying main.0077_v360_add_default_orderings... OK
  Applying main.0078_v360_clear_sessions_tokens_jt... OK
  Applying main.0079_v360_rm_implicit_oauth2_apps... OK
  Applying main.0080_v360_replace_job_origin... OK
  Applying main.0081_v360_notify_on_start... OK
  Applying main.0082_v360_webhook_http_method... OK
  Applying main.0083_v360_job_branch_override... OK
  Applying main.0084_v360_token_description... OK
  Applying main.0085_v360_add_notificationtemplate_messages... OK
  Applying main.0086_v360_workflow_approval... OK
  Applying main.0087_v360_update_credential_injector_help_text... OK
  Applying main.0088_v360_dashboard_optimizations... OK
  Applying main.0089_v360_new_job_event_types... OK
  Applying main.0090_v360_WFJT_prompts... OK
  Applying main.0091_v360_approval_node_notifications... OK
  Applying main.0092_v360_webhook_mixin... OK
  Applying main.0093_v360_personal_access_tokens... OK
  Applying main.0094_v360_webhook_mixin2... OK
  Applying main.0095_v360_increase_instance_version_length... OK
  Applying main.0096_v360_container_groups... OK
  Applying main.0097_v360_workflowapproval_approved_or_denied_by... OK
  Applying main.0098_v360_rename_cyberark_aim_credential_type... OK
  Applying main.0099_v361_license_cleanup... OK
  Applying main.0100_v370_projectupdate_job_tags... OK
  Applying main.0101_v370_generate_new_uuids_for_iso_nodes... OK
  Applying main.0102_v370_unifiedjob_canceled... OK
  Applying main.0103_v370_remove_computed_fields... OK
  Applying main.0104_v370_cleanup_old_scan_jts... OK
  Applying main.0105_v370_remove_jobevent_parent_and_hosts... OK
  Applying main.0106_v370_remove_inventory_groups_with_active_failures... OK
  Applying main.0107_v370_workflow_convergence_api_toggle... OK
  Applying main.0108_v370_unifiedjob_dependencies_processed... OK
2024-03-18 12:37:54,433 INFO     [-] rbac_migrations Unified organization migration completed in 0.0183 seconds
2024-03-18 12:37:54,452 INFO     [-] rbac_migrations Unified organization migration completed in 0.0184 seconds
2024-03-18 12:37:55,391 INFO     [-] rbac_migrations Rebuild parentage completed in 0.003237 seconds
  Applying main.0109_v370_job_template_organization_field... OK
  Applying main.0110_v370_instance_ip_address... OK
  Applying main.0111_v370_delete_channelgroup... OK
  Applying main.0112_v370_workflow_node_identifier... OK
  Applying main.0113_v370_event_bigint... OK
  Applying main.0114_v370_remove_deprecated_manual_inventory_sources... OK
  Applying main.0115_v370_schedule_set_null... OK
  Applying main.0116_v400_remove_hipchat_notifications... OK
  Applying main.0117_v400_remove_cloudforms_inventory... OK
  Applying main.0118_add_remote_archive_scm_type... OK
  Applying main.0119_inventory_plugins... OK
  Applying main.0120_galaxy_credentials... OK
  Applying main.0121_delete_toweranalyticsstate... OK
  Applying main.0122_really_remove_cloudforms_inventory... OK
  Applying main.0123_drop_hg_support... OK
  Applying main.0124_execution_environments... OK
  Applying main.0125_more_ee_modeling_changes... OK
  Applying main.0126_executionenvironment_container_options... OK
  Applying main.0127_reset_pod_spec_override... OK
  Applying main.0128_organiaztion_read_roles_ee_admin... OK
  Applying main.0129_unifiedjob_installed_collections... OK
  Applying main.0130_ee_polymorphic_set_null... OK
  Applying main.0131_undo_org_polymorphic_ee... OK
  Applying main.0132_instancegroup_is_container_group... OK
  Applying main.0133_centrify_vault_credtype... OK
  Applying main.0134_unifiedjob_ansible_version... OK
  Applying main.0135_schedule_sort_fallback_to_id... OK
  Applying main.0136_scm_track_submodules... OK
  Applying main.0137_custom_inventory_scripts_removal_data... OK
  Applying main.0138_custom_inventory_scripts_removal... OK
  Applying main.0139_isolated_removal... OK
  Applying main.0140_rename... OK
  Applying main.0141_remove_isolated_instances... OK
  Applying main.0142_update_ee_image_field_description... OK
  Applying main.0143_hostmetric... OK
  Applying main.0144_event_partitions... OK
  Applying main.0145_deregister_managed_ee_objs... OK
  Applying main.0146_add_insights_inventory... OK
  Applying main.0147_validate_ee_image_field... OK
  Applying main.0148_unifiedjob_receptor_unit_id... OK
  Applying main.0149_remove_inventory_insights_credential... OK
  Applying main.0150_rename_inv_sources_inv_updates... OK
  Applying main.0151_rename_managed_by_tower... OK
  Applying main.0152_instance_node_type... OK
  Applying main.0153_instance_last_seen... OK
  Applying main.0154_set_default_uuid... OK
  Applying main.0155_improved_health_check... OK
  Applying main.0156_capture_mesh_topology... OK
  Applying main.0157_inventory_labels... OK
  Applying main.0158_make_instance_cpu_decimal... OK
  Applying main.0159_deprecate_inventory_source_UoPU_field... OK
  Applying main.0160_alter_schedule_rrule... OK
  Applying main.0161_unifiedjob_host_status_counts... OK
  Applying main.0162_alter_unifiedjob_dependent_jobs... OK
  Applying main.0163_convert_job_tags_to_textfield... OK
  Applying main.0164_remove_inventorysource_update_on_project_update... OK
  Applying main.0165_task_manager_refactor... OK
  Applying main.0166_alter_jobevent_host... OK
  Applying main.0167_project_signature_validation_credential... OK
  Applying main.0168_inventoryupdate_scm_revision... OK
  Applying main.0169_jt_prompt_everything_on_launch... OK
  Applying main.0170_node_and_link_state... OK
  Applying main.0171_add_health_check_started... OK
  Applying main.0172_prevent_instance_fallback... OK
  Applying main.0173_instancegroup_max_limits... OK
  Applying main.0174_ensure_org_ee_admin_roles... OK
  Applying main.0175_workflowjob_is_bulk_job... OK
  Applying main.0176_inventorysource_scm_branch... OK
  Applying main.0177_instance_group_role_addition... OK
2024-03-18 12:38:18,686 INFO     [-] awx.main.migrations Initiated migration from Org admin to use role
  Applying main.0178_instance_group_admin_migration... OK
  Applying main.0179_change_cyberark_plugin_names... OK
  Applying main.0180_add_hostmetric_fields... OK
  Applying main.0181_hostmetricsummarymonthly... OK
  Applying main.0182_constructed_inventory... OK
  Applying main.0183_pre_django_upgrade... OK
  Applying main.0184_django_indexes... OK
  Applying main.0185_move_JSONBlob_to_JSONField... OK
  Applying main.0186_drop_django_taggit... OK
  Applying main.0187_hop_nodes... OK
  Applying main.0188_add_bitbucket_dc_webhook... OK
  Applying main.0189_inbound_hop_nodes... OK
  Applying main.0190_alter_inventorysource_source_and_more... OK
  Applying sites.0001_initial... OK
  Applying sites.0002_alter_domain_unique... OK
  Applying social_django.0001_initial... OK
  Applying social_django.0002_add_related_name... OK
  Applying social_django.0003_alter_email_max_length... OK
  Applying social_django.0004_auto_20160423_0400... OK
  Applying social_django.0005_auto_20160727_2333... OK
  Applying social_django.0006_partial... OK
  Applying social_django.0007_code_timestamp... OK
  Applying social_django.0008_partial_timestamp... OK
  Applying social_django.0009_auto_20191118_0520... OK
  Applying social_django.0010_uid_db_index... OK
  Applying social_django.0011_alter_id_fields... OK
  Applying social_django.0012_usersocialauth_extra_data_new... OK
  Applying social_django.0013_migrate_extra_data... OK
  Applying social_django.0014_remove_usersocialauth_extra_data... OK
  Applying social_django.0015_rename_extra_data_new_usersocialauth_extra_data... OK
  Applying sso.0001_initial... OK
  Applying sso.0002_expand_provider_options... OK
  Applying sso.0003_convert_saml_string_to_list... OK

@craph
Contributor

craph commented Mar 18, 2024

No more data 😢 and the password has been reinitialized... Why hasn't the data been migrated when the log says OK?
[screenshot]

@kurokobo
Contributor

@craph
Could you provide:

kubectl -n <namespace> get pod
kubectl -n <namespace> get pod <psql pod> -o yaml

@craph
Contributor

craph commented Mar 18, 2024

@kurokobo,

> kubectl -n awx get pod
NAME                                               READY   STATUS      RESTARTS   AGE
awx-demo-migration-24.0.0-rc45z                    0/1     Completed   0          41m
awx-demo-postgres-15-0                             1/1     Running     0          42m
awx-demo-task-676cbb9bb5-wm6db                     4/4     Running     0          42m
awx-demo-web-7cfb6d6d8-9f4gs                       3/3     Running     0          42m
awx-operator-controller-manager-865d646cd8-k7ldz   2/2     Running     0          3d5h

and

> kubectl -n awx get pod awx-demo-postgres-15-0 -o yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
    cattle.io/timestamp: "2024-03-18T12:35:22Z"
    cni.projectcalico.org/containerID: c8c74a832a87163b6835a4b62528e56c94e573a778c4494770c94ddc5b825cca
    cni.projectcalico.org/podIP: 10.42.42.169/32
    cni.projectcalico.org/podIPs: 10.42.42.169/32
  creationTimestamp: "2024-03-18T12:35:45Z"
  generateName: awx-demo-postgres-15-
  labels:
    app.kubernetes.io/component: database
    app.kubernetes.io/instance: postgres-15-awx-demo
    app.kubernetes.io/managed-by: awx-operator
    app.kubernetes.io/name: postgres-15
    app.kubernetes.io/part-of: awx-demo
    controller-revision-hash: awx-demo-postgres-15-7fb855c556
    statefulset.kubernetes.io/pod-name: awx-demo-postgres-15-0
  name: awx-demo-postgres-15-0
  namespace: awx
  ownerReferences:
  - apiVersion: apps/v1
    blockOwnerDeletion: true
    controller: true
    kind: StatefulSet
    name: awx-demo-postgres-15
    uid: 335a3c7f-db87-470c-ba3f-04a3e43c1368
  resourceVersion: "218222804"
  uid: 78ddf50b-d558-4105-ab0a-3ae409a0c610
spec:
  containers:
  - env:
    - name: POSTGRESQL_DATABASE
      valueFrom:
        secretKeyRef:
          key: database
          name: awx-demo-postgres-configuration
    - name: POSTGRESQL_USER
      valueFrom:
        secretKeyRef:
          key: username
          name: awx-demo-postgres-configuration
    - name: POSTGRESQL_PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: awx-demo-postgres-configuration
    - name: POSTGRES_DB
      valueFrom:
        secretKeyRef:
          key: database
          name: awx-demo-postgres-configuration
    - name: POSTGRES_USER
      valueFrom:
        secretKeyRef:
          key: username
          name: awx-demo-postgres-configuration
    - name: POSTGRES_PASSWORD
      valueFrom:
        secretKeyRef:
          key: password
          name: awx-demo-postgres-configuration
    - name: PGDATA
      value: /var/lib/pgsql/data/pgdata
    - name: POSTGRES_INITDB_ARGS
      value: --auth-host=scram-sha-256
    - name: POSTGRES_HOST_AUTH_METHOD
      value: scram-sha-256
    image: quay.io/sclorg/postgresql-15-c9s:latest
    imagePullPolicy: IfNotPresent
    name: postgres
    ports:
    - containerPort: 5432
      name: postgres-15
      protocol: TCP
    resources:
      requests:
        cpu: 10m
        memory: 64Mi
    terminationMessagePath: /dev/termination-log
    terminationMessagePolicy: File
    volumeMounts:
    - mountPath: /var/lib/pgsql/data
      name: postgres-15
      subPath: data
    - mountPath: /var/run/secrets/kubernetes.io/serviceaccount
      name: kube-api-access-gr59c
      readOnly: true
  dnsPolicy: ClusterFirst
  enableServiceLinks: true
  hostname: awx-demo-postgres-15-0
  nodeName: myk8sw1
  preemptionPolicy: PreemptLowerPriority
  priority: 0
  restartPolicy: Always
  schedulerName: default-scheduler
  securityContext: {}
  serviceAccount: default
  serviceAccountName: default
  subdomain: awx-demo
  terminationGracePeriodSeconds: 30
  tolerations:
  - effect: NoExecute
    key: node.kubernetes.io/not-ready
    operator: Exists
    tolerationSeconds: 300
  - effect: NoExecute
    key: node.kubernetes.io/unreachable
    operator: Exists
    tolerationSeconds: 300
  volumes:
  - name: postgres-15
    persistentVolumeClaim:
      claimName: postgres-15-awx-demo-postgres-15-0
  - name: kube-api-access-gr59c
    projected:
      defaultMode: 420
      sources:
      - serviceAccountToken:
          expirationSeconds: 3607
          path: token
      - configMap:
          items:
          - key: ca.crt
            path: ca.crt
          name: kube-root-ca.crt
      - downwardAPI:
          items:
          - fieldRef:
              apiVersion: v1
              fieldPath: metadata.namespace
            path: namespace
status:
  conditions:
  - lastProbeTime: null
    lastTransitionTime: "2024-03-18T12:35:45Z"
    status: "True"
    type: Initialized
  - lastProbeTime: null
    lastTransitionTime: "2024-03-18T12:35:57Z"
    status: "True"
    type: Ready
  - lastProbeTime: null
    lastTransitionTime: "2024-03-18T12:35:57Z"
    status: "True"
    type: ContainersReady
  - lastProbeTime: null
    lastTransitionTime: "2024-03-18T12:35:45Z"
    status: "True"
    type: PodScheduled
  containerStatuses:
  - containerID: containerd://bdccd6a0896147020f969a6f3830a03c10e78772edb75457d7b8b2cb2b0b34a3
    image: quay.io/sclorg/postgresql-15-c9s:latest
    imageID: quay.io/sclorg/postgresql-15-c9s@sha256:0a88d11f9d15cf10014c25a8dab4be33a1f9b956f4ab1fbb51522ab10c70bdca
    lastState: {}
    name: postgres
    ready: true
    restartCount: 0
    started: true
    state:
      running:
        startedAt: "2024-03-18T12:35:57Z"
  hostIP: 10.80.98.196
  phase: Running
  podIP: 10.42.42.169
  podIPs:
  - ip: 10.42.42.169
  qosClass: Burstable
  startTime: "2024-03-18T12:35:45Z"

@craph
Contributor

craph commented Mar 18, 2024

I still have the old Postgres 13 PVC. Is it possible to redeploy awx-operator in version 2.12.2 to use the old PVC?

> kubectl get pvc -n awx
NAME                                 STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
postgres-13-awx-demo-postgres-13-0   Bound    pvc-eac8a8d5-6d74-4d37-819b-ee89154cd60a   8Gi        RWO            longhorn       213d
postgres-15-awx-demo-postgres-15-0   Bound    pvc-1ebb7d11-d2f5-4f50-9bc6-ca2e045f6031   8Gi        RWO            longhorn       3d6h

@kurokobo
Contributor

@craph
Seems the mounted path is correct, so indeed the DB was not migrated and was initialized as a fresh install.
The Operator repeats reconciliation loops until the playbook completes without failed tasks, but if the playbook runs without the old statefulsets present, data migration will not occur. I have not been able to check the implementation in detail, but it may be a situation where migration is deemed unnecessary in the next loop, depending on which task failed (not sure).

Anyway, I recommend that you first get a backup of the 13 PVC in some way: pg_dump, or just deploy a working pod, make a tar.gz, and copy it to hand with kubectl cp.

I assume that just deploying AWX with 2.12.2 will reuse the old PVC, but if not, you should be able to get the data back by temporarily setting kubectl scale ... --replicas=0 for the Operator, Task, and Web deployments after the fresh deployment, then restoring PSQL and setting the replicas back to 1.
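
As a rough sketch of that scale-down/restore sequence (deployment names are taken from the pod list above; adjust to your instance name):

# Stop reconciliation and the app so nothing writes to the DB
kubectl -n awx scale deployment awx-operator-controller-manager --replicas=0
kubectl -n awx scale deployment awx-demo-task awx-demo-web --replicas=0

# ... restore the PSQL data onto the PVC here (pg_restore, tar extract, etc.) ...

# Bring everything back
kubectl -n awx scale deployment awx-demo-task awx-demo-web --replicas=1
kubectl -n awx scale deployment awx-operator-controller-manager --replicas=1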

@craph
Contributor

craph commented Mar 18, 2024

@kurokobo how can I do a pg_dump on the old 13 PVC? Any advice?

@rooftopcellist
Member

Could you give this PR a try and see if it solves your issue?

@RaceFPV

RaceFPV commented Mar 28, 2024

You can recover and roll back to version 2.12.2 if your PostgreSQL 13 statefulset is still online: after changing the version back in helm, edit the secret awx-postgres-configuration, changing host: awx-postgres-15 to host: awx-postgres-13. You may need to restart your pods after doing so.
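
The secret edit can also be done with a patch (a sketch; the secret key and deployment names are assumptions based on the comment above and a default install):

# Point AWX back at the old Postgres 13 service
kubectl -n awx patch secret awx-postgres-configuration \
  --type merge -p '{"stringData":{"host":"awx-postgres-13"}}'

# Restart the app pods so they pick up the change
kubectl -n awx rollout restart deployment awx-task awx-web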

@kzinas-adv

A fresh install of awx-operator 2.14.0 still has this issue.

@rooftopcellist
Member

Was anyone able to test the PR I linked?

I am unable to reproduce this issue on OpenShift and minikube. Could someone who is seeing this issue please share their k8s cluster type, cluster version, awx-operator version, storage class, and cloud provider used, if applicable?

@wonkyooh

wonkyooh commented Apr 2, 2024

k8s cluster type: on-prem
cluster version:

Client Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.5", GitCommit:"93e0d7146fb9c3e9f68aa41b2b4265b2fcdb0a4c", GitTreeState:"clean", BuildDate:"2023-08-24T00:48:26Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"} Kustomize Version: v5.0.1 Server Version: version.Info{Major:"1", Minor:"27", GitVersion:"v1.27.5", GitCommit:"93e0d7146fb9c3e9f68aa41b2b4265b2fcdb0a4c", GitTreeState:"clean", BuildDate:"2023-08-24T00:42:11Z", GoVersion:"go1.20.7", Compiler:"gc", Platform:"linux/amd64"}

awx-operator version: quay.io/ansible/awx-operator:2.13.1
storageclass: rook-cephfs
cloud provider : N/A

Solved this issue by adding

postgres_security_context_settings:
  fsGroup: 26

option to the AWX CR (cc @Rory-Z).

If you have already deployed, try editing the postgres statefulset and adding fsGroup: 26 to its securityContext.
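
In CR form that option sits under spec, e.g. (a sketch; note the follow-up below reporting that this setting can be ignored depending on where the operator applies it):

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
spec:
  postgres_security_context_settings:
    fsGroup: 26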

@kurokobo
Contributor

kurokobo commented Apr 2, 2024

The default permissions and owners of PVs and their subPaths depend on the storage provisioner implementation for the storage class.
Also, securityContext.fsGroup may not be effective in all environments, as it is ignored for some types of PVs, such as hostPath and nfs.

@rooftopcellist
The default storage provisioner for minikube creates directories with mode 777 for PVCs, so this issue can't be reproduced there.
It should be possible to reproduce it if you explicitly configure hostPath on minikube:

  • Create /data/demo on the minikube instance (in the docker container or VM, depending on your driver)
  • Create PV
    ---
    apiVersion: v1
    kind: PersistentVolume
    metadata:
      name: awx-postgres-15-volume
    spec:
      accessModes:
        - ReadWriteOnce
      persistentVolumeReclaimPolicy: Retain
      capacity:
        storage: 8Gi
      storageClassName: awx-postgres-volume
      hostPath:
        path: /data/demo
  • Create AWX CR with postgres_storage_class: awx-postgres-volume

Alternatively, following my guide but ignoring the chown and chmod for /data/postgres-15/data can also reproduce this.
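
Conversely, pre-creating the hostPath with the right owner should avoid the crash in that minikube setup (a sketch, assuming UID 26 as discussed above):

# Run before applying the PV and the AWX CR
minikube ssh -- "sudo mkdir -p /data/demo && sudo chown 26:26 /data/demo && sudo chmod 700 /data/demo"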

I've made minimal tests on #1799 and I can confirm that once my comments in #1799 are resolved, it appears to work as expected.

@hhk7734

hhk7734 commented Apr 2, 2024

@wonkyooh

security_context_settings applies to the web and task PodSecurityContext (pod.spec.securityContext), but postgres_security_context_settings applies to the SecurityContext of the postgresql container (pod.spec.containers[].securityContext). This confuses users.

When I added postgres_security_context_settings: {"fsGroup":26} to the AWX CR, it was ignored, since fsGroup is only meaningful at the pod level.
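
To check where the setting actually landed, inspecting the rendered statefulset can help (assuming the default statefulset name):

kubectl -n awx get statefulset awx-postgres-15 \
  -o jsonpath='{.spec.template.spec.securityContext}{"\n"}{.spec.template.spec.containers[0].securityContext}'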

@craph
Contributor

craph commented Apr 2, 2024

Was anyone able to test the PR I linked?

I am unable to reproduce this issue on OpenShift and minikube. Could someone who is seeing this issue please share their k8s cluster type, cluster version, awx-operator version, storage class, and cloud provider used, if applicable?

@rooftopcellist you have all the details here too if needed: #1775 (comment)

AWX Operator version
2.13.1

AWX version
24.0.0
Kubernetes platform
Kubernetes self-hosted with Rancher

Kubernetes/Platform version
v1.25.16+rke2r1

Storage Class
Longhorn

Upgrade from 2.12.2 to 2.13.1

@kennethacurtis

I'm also getting this issue when going from 2.10.0 to 2.14.0. I'm using AKS.

@rooftopcellist here are my details

storage class (default in this case means Azure Disk):

$ kubectl get pvc postgres-13-awx-postgres-13-0 -o jsonpath='{.spec.storageClassName}' -n awx
default

When doing an upgrade, the postgres 15 pod crashes:

kubectl get pods -n awx
NAME                                              READY   STATUS             RESTARTS      AGE
awx-operator-controller-manager-cb46cc5dd-qv5db   2/2     Running            0             13m
awx-postgres-13-0                                 1/1     Running            0             3d23h
awx-postgres-15-0                                 0/1     CrashLoopBackOff   7 (45s ago)   12m

Logs in the postgres 15 pod:

kubectl logs awx-postgres-15-0 -n awx
mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied

Here are my deployment details. Kustomization file (when trying to upgrade from 2.10.0 to 2.14.0):

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.14.0
  - awx.yml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.14.0

# Specify a custom namespace in which to install AWX
namespace: awx

And here's my awx.yml file. I'm using the AGIC:

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx
  labels:
    app: awx
spec:
  service_type: clusterip
  ingress_type: ingress
  ingress_path: /
  ingress_path_type: Exact
  ingress_tls_secret: tlssecret
  hostname: awx.example.org
  projects_storage_size: 500Gi
  ingress_annotations: |
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/appgw-ssl-certificate: tlssecret
    appgw.ingress.kubernetes.io/health-probe-path: /api/v2/ping
    appgw.ingress.kubernetes.io/backend-protocol: http
    appgw.ingress.kubernetes.io/backend-hostname: awx.example.org

---
apiVersion: v1
kind: Service
metadata:
  name: awx-service
spec:
  selector:
    app: awx
  ports:
    - protocol: TCP
      port: 80
      targetPort: 8052

---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  labels:
    app: awx
  name: awx-ingress
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/appgw-ssl-certificate: tlssecret
    appgw.ingress.kubernetes.io/health-probe-path: /api/v2/ping
    appgw.ingress.kubernetes.io/backend-protocol: http
    appgw.ingress.kubernetes.io/backend-hostname: awx.example.org
spec:
  rules:
    - host: awx.example.org
      http:
        paths:
          - path: /
            backend:
              service:
                name: awx
                port:
                  number: 80
            pathType: Exact

One thing I did notice is that when the PVC is created for Postgres 15, it doesn't allocate the amount of storage specified for projects_storage_size; not sure if that is related or not.

kubectl get pvc -n awx
NAME                            STATUS   VOLUME                                     CAPACITY   ACCESS MODES   STORAGECLASS   AGE
postgres-13-awx-postgres-13-0   Bound    pvc-a69fc0f3-929b-4ba8-8c72-8ca1ad15b8af   500Gi      RWO            default        166d
postgres-15-awx-postgres-15-0   Bound    pvc-114e9ae9-9376-496c-b59d-edbe8b5ce4d5   8Gi        RWO            default        20m

I was able to recover by deleting AWX, the Postgres 15 pod and PVC, and redeploying with operator 2.10.

@rooftopcellist
Member

Please weigh in on which PR approach you like better:

spec:
  postgres_data_volume_init: true
  init_postgres_extra_commands: |
    chown 26:0 /var/lib/pgsql/data
    chmod 700 /var/lib/pgsql/data 

@kurokobo
Contributor

kurokobo commented Apr 3, 2024

+1 for postgres_data_volume_init

@craph
Contributor

craph commented Apr 3, 2024

Please weigh in on which PR approach you like better:

spec:
  postgres_data_volume_init: true
  init_postgres_extra_commands: |
    chown 26:0 /var/lib/pgsql/data
    chmod 700 /var/lib/pgsql/data 

👍 PR #1805 will provide a better user experience, I think.

@rooftopcellist
Member

Thanks to all for weighing in and for the review of the PR. There is one more potential issue to resolve because of the removal of the postgres_init_container_resource_requirements parameter. More details on the PR.

@rooftopcellist
Member

This was resolved by #1805, which just merged.

@daneov

daneov commented Apr 5, 2024

Awesome work. I'm hitting this as well. I'm using Kustomize, but referring to the commit SHA doesn't seem to change anything.

Any tips on how to include this fix without manual fiddling in the cluster?

@fubz

fubz commented Apr 11, 2024

How does one fix their environment if they already went to version 2.12? I waited for 2.15 in hopes that the Operator would fix the issue; however, the environment is currently down due to this issue and I am unsure how to correct it. What steps need to be done to correct the broken environment? I see some mentions of init_postgres_extra_commands but am unsure of where values for this parameter need to be placed.

@miki-akamai

miki-akamai commented Apr 12, 2024

I had the same issue; you need to spawn the following pod:

apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspector
  namespace: awx-prod
spec:
  containers:
  - image: busybox
    name: pvc-inspector
    command: ["tail"]
    args: ["-f", "/dev/null"]
    volumeMounts:
    - mountPath: /pvc
      name: pvc-mount
  volumes:
  - name: pvc-mount
    persistentVolumeClaim:
      claimName: postgres-15-awx-postgres-15-0

Shell into it and run chown -R 26:26 /pvc/data/
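
Or run it without an interactive shell (pod and namespace names from the manifest above):

kubectl -n awx-prod exec pvc-inspector -- chown -R 26:26 /pvc/data/
kubectl -n awx-prod delete pod pvc-inspector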

Later on, you will also need to update the CRDs with kubectl apply -n 'awx-prod' --server-side -k "github.com/ansible/awx-operator/config/crd?ref=2.15.0" --force-conflicts

@nan0viol3t

Having the same issue with the Postgres 15 pod. While troubleshooting, I accidentally removed the whole namespace (by executing "kustomize delete -k ."). I only noticed later, while troubleshooting Postgres DB connectivity problems, that kustomize also deletes the namespace itself.

My task pods won't start and the web pod is saying:
"awx.main.utils.encryption Failed to decrypt.... ....check that your 'SECRET_KEY' value is correct".

I'm sure that "awx-app-secret-key" was rewritten by the kustomize run, and I don't have a backup of the old secret.
I can connect to the Postgres DB instance and to the AWX DB as well, but have no valid awx-secret-key.

Is there a way to retrieve it from the DB itself, or is it not stored there anywhere? In other words, is this instance lost by losing "awx-secret-key"?
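
Note that the SECRET_KEY lives in that Kubernetes secret, not in the database (the database only holds data encrypted with it), so going forward it is worth backing the secret up right after deployment (a sketch; the secret name is taken from this comment):

kubectl -n <namespace> get secret awx-app-secret-key -o yaml > awx-app-secret-key.backup.yaml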

@DerPhysikeR

DerPhysikeR commented Jun 28, 2024

I just deployed a new AWX instance in my k3s cluster and also stumbled upon the same problem using version 2.19.0.

To clarify, the postgres 15 pod is in CrashLoopBackOff with the error mkdir: cannot create directory '/var/lib/pgsql/data/userdata': Permission denied.

This is my kustomization.yaml:

apiVersion: kustomize.config.k8s.io/v1beta1
kind: Kustomization
resources:
  # Find the latest tag here: https://github.com/ansible/awx-operator/releases
  - github.com/ansible/awx-operator/config/default?ref=2.19.0
  - awx-demo.yaml

# Set the image tags to match the git version from above
images:
  - name: quay.io/ansible/awx-operator
    newTag: 2.19.0

and this is my awx-demo.yaml:

---
apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-demo
---
apiVersion: traefik.containo.us/v1alpha1
kind: IngressRoute
metadata:
  name: awx-demo
  namespace: awx
spec:
  entryPoints:
    - websecure
  routes:
    - kind: Rule
      match: Host(`awx.cluster.lan`)
      services:
        - name: awx-demo-service
          port: 80

The associated persistent volume was successfully provisioned, so that is not the issue.

@nan0viol3t

nan0viol3t commented Jun 28, 2024

Based on the comments, I added the lines below in the "spec" section of the deployment file (in your case awx-demo.yaml):

spec:
  ...
  postgres_data_volume_init: true

  init_postgres_extra_commands: |
      chown 26:0 /var/lib/pgsql/data
      chmod 700 /var/lib/pgsql/data 

This solved the problem for me when I had "create directory" issues in the past; you can try it.
Strangely, from the comments it looks like it should not appear anymore / was addressed already...

@r2DoesInc

r2DoesInc commented Jul 1, 2024

I am still seeing this issue on a clean install of 2.19.0

I had the same issue; you need to spawn the following pod:

apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspector
  namespace: awx-prod
spec:
  containers:
  - image: busybox
    name: pvc-inspector
    command: ["tail"]
    args: ["-f", "/dev/null"]
    volumeMounts:
    - mountPath: /pvc
      name: pvc-mount
  volumes:
  - name: pvc-mount
    persistentVolumeClaim:
      claimName: postgres-15-awx-postgres-15-0

Shell into it and run chown -R 26:26 /pvc/data/

Later on, you will also need to update the CRDs with kubectl apply -n 'awx-prod' --server-side -k "github.com/ansible/awx-operator/config/crd?ref=2.15.0" --force-conflicts

You can actually just set the chown command in the pod directly, so there's no need to shell in:

apiVersion: v1
kind: Pod
metadata:
  name: pvc-inspector
  namespace: awx
spec:
  restartPolicy: Never # << Don't restart after completion
  containers:
  - image: busybox
    name: pvc-inspector
    securityContext:
      runAsUser: 0
    command: # << chown in the pod directly
    - /bin/chown
    - -R
    - "26:26"
    - /pvc/data
    volumeMounts:
    - mountPath: /pvc
      name: pvc-mount
  volumes:
  - name: pvc-mount
    persistentVolumeClaim:
      claimName: postgres-15-awx-postgres-15-0
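
Usage might then look like this (a sketch; assumes the manifest above is saved as pvc-inspector.yaml, and the jsonpath wait needs a reasonably recent kubectl):

kubectl -n awx apply -f pvc-inspector.yaml
kubectl -n awx wait --for=jsonpath='{.status.phase}'=Succeeded pod/pvc-inspector --timeout=120s
kubectl -n awx delete pod pvc-inspector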

@Cristiano-Rosa

It worked for me in 2.19.1

AWX:
  enabled: true 
  name: awx
  spec:
    admin_user: admin
    service_type: LoadBalancer  
    postgres_data_volume_init: true
    postgres_init_container_commands: |
      chown 26:0 /var/lib/pgsql/data
      chmod 700 /var/lib/pgsql/data

@alexandrud

This should ensure the running user has access to mounted volumes:

spec.template.spec.securityContext:
  fsGroup: 26

@kcjones91

kcjones91 commented Dec 9, 2024

apiVersion: awx.ansible.com/v1beta1
kind: AWX
metadata:
  name: awx-demo
spec:
  admin_user: admin
  service_type: nodeport
  postgres_security_context_settings:
    fsGroup: 26
  postgres_data_volume_init: true
  postgres_init_container_commands: |
    chown 26:0 /var/lib/pgsql/data
    chmod 700 /var/lib/pgsql/data

I am running into this on my end. Probably an easy fix?

kubectl logs awx-demo-postgres-15-0 -n awx-dev -c init
chown: changing ownership of '/var/lib/pgsql/data': Permission denied
chmod: changing permissions of '/var/lib/pgsql/data': Permission denied
swipe@swipe-worker-1:/mnt$ ls -la /mnt/data/postgres/
total 0
drwxrwxrwx. 2 swipe swipe  6 Dec  9 15:48 .
drwxr-xr-x. 3 swipe swipe 22 Dec  9 13:17 ..
