AtlasMigration k8s deployment issues #3232

Closed
talsuk5 opened this issue Nov 25, 2024 · 5 comments

talsuk5 commented Nov 25, 2024

Hi Atlas team!
I'm trying to set up an AtlasMigration in my k8s cluster as per this guide.

My YAML definition looks like this:

apiVersion: db.atlasgo.io/v1alpha1
kind: AtlasMigration
metadata:
  name: migration
  annotations:
    argocd.argoproj.io/sync-wave: "30"
spec:
  cloud:
    project: ### Redacted ###
    tokenFrom:
      secretKeyRef:
        name: atlas-credentials
        key: token
  credentials:
    host: postgresql-cluster-pooler-rw.cnpg-system.svc.cluster.local
    port: 5432
    userFrom:
      secretKeyRef:
        key: username
        name: postgresql-{{.Values.environment}}-{{ .Values.clusterName }}-secrets
    passwordFrom:
      secretKeyRef:
        key: password
        name: postgresql-{{.Values.environment}}-{{ .Values.clusterName }}-secrets
    database: app
    scheme: postgres
    parameters:
      sslmode: disable
  dir:
    remote:
      name: ### Redacted ###
      tag: {{ .Values.migrationTag }}

The resource then creates a deployment for the dev db, but it fails with:

The files belonging to this database system will be owned by user "postgres".
This user must also own the server process.
The database cluster will be initialized with locale "en_US.utf8".
The default database encoding has accordingly been set to "UTF8".
The default text search configuration will be set to "english".
Data page checksums are disabled.
fixing permissions on existing directory /var/lib/postgresql/data ... ok
creating subdirectories ... ok
selecting dynamic shared memory implementation ... posix
selecting default "max_connections" ... 100
selecting default "shared_buffers" ... 128MB
selecting default time zone ... Etc/UTC
creating configuration files ... ok
running bootstrap script ... ok
performing post-bootstrap initialization ... ok
initdb: warning: enabling "trust" authentication for local connections
initdb: hint: You can change this by editing pg_hba.conf or using the option -A, or --auth-local and --auth-host, the next time you run initdb.
syncing data to disk ... ok
Success. You can now start the database server using:
pg_ctl -D /var/lib/postgresql/data -l logfile start
waiting for server to start....2024-11-25 12:06:35.587 UTC [35] LOG: starting PostgreSQL 17.2 (Debian 17.2-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-11-25 12:06:35.588 UTC [35] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-11-25 12:06:35.593 UTC [38] LOG: database system was shut down at 2024-11-25 12:06:35 UTC
2024-11-25 12:06:35.598 UTC [35] LOG: database system is ready to accept connections
done
server started
/usr/local/bin/docker-entrypoint.sh: ignoring /docker-entrypoint-initdb.d/*
waiting for server to shut down...2024-11-25 12:06:35.717 UTC [35] LOG: received fast shutdown request
.2024-11-25 12:06:35.719 UTC [35] LOG: aborting any active transactions
2024-11-25 12:06:35.721 UTC [35] LOG: background worker "logical replication launcher" (PID 41) exited with exit code 1
2024-11-25 12:06:35.721 UTC [36] LOG: shutting down
2024-11-25 12:06:35.722 UTC [36] LOG: checkpoint starting: shutdown immediate
2024-11-25 12:06:35.729 UTC [36] LOG: checkpoint complete: wrote 3 buffers (0.0%); 0 WAL file(s) added, 0 removed, 0 recycled; write=0.001 s, sync=0.002 s, total=0.009 s; sync files=2, longest=0.002 s, average=0.001 s; distance=0 kB, estimate=0 kB; lsn=0/14E4F98, redo lsn=0/14E4F98
2024-11-25 12:06:35.733 UTC [35] LOG: database system is shut down
done
server stopped
PostgreSQL init process complete; ready for start up.
2024-11-25 12:06:35.844 UTC [1] LOG: starting PostgreSQL 17.2 (Debian 17.2-1.pgdg120+1) on x86_64-pc-linux-gnu, compiled by gcc (Debian 12.2.0-14) 12.2.0, 64-bit
2024-11-25 12:06:35.844 UTC [1] LOG: listening on IPv4 address "0.0.0.0", port 5432
2024-11-25 12:06:35.844 UTC [1] LOG: listening on IPv6 address "::", port 5432
2024-11-25 12:06:35.847 UTC [1] LOG: listening on Unix socket "/var/run/postgresql/.s.PGSQL.5432"
2024-11-25 12:06:35.852 UTC [49] LOG: database system was shut down at 2024-11-25 12:06:35 UTC
2024-11-25 12:06:35.857 UTC [1] LOG: database system is ready to accept connections
2024-11-25 12:06:44.293 UTC [59] FATAL: role "postgres" does not exist
2024-11-25 12:11:35.938 UTC [47] LOG: checkpoint starting: time
2024-11-25 12:11:40.261 UTC [47] LOG: checkpoint complete: wrote 46 buffers (0.3%); 0 WAL file(s) added, 0 removed, 0 recycled; write=4.309 s, sync=0.005 s, total=4.323 s; sync files=11, longest=0.004 s, average=0.001 s; distance=269 kB, estimate=269 kB; lsn=0/15286B0, redo lsn=0/1528658

Looking at the operator logs:

Error: postgres: scanning system variables: pq: unsupported startup parameter: search_path {"type": "Warning", "object": {"kind":"AtlasMigration","namespace":"postgres-migrator","name":"migration","uid":"854292d1-aced-473e-8ce7-b39c745a78ef","apiVersion":"db.atlasgo.io/v1alpha1","resourceVersion":"38119678"}, "reason": "TransientErr"}
2024-11-25T13:26:37Z INFO atlas_migration.reconcile reconciling migration {"env": "kubernetes"}
2024-11-25T13:26:37Z INFO atlas_migration.reconcile applying pending migrations {"count": 4}
2024-11-25T13:26:37Z DEBUG events Error: deployment rate limit exceeded, try again later
Error: postgres: scanning system variables: pq: unsupported startup parameter: search_path {"type": "Warning", "object": {"kind":"AtlasMigration","namespace":"postgres-migrator","name":"migration","uid":"854292d1-aced-473e-8ce7-b39c745a78ef","apiVersion":"db.atlasgo.io/v1alpha1","resourceVersion":"38119678"}, "reason": "TransientErr"}
2024-11-25T13:26:42Z INFO atlas_migration.reconcile reconciling migration {"env": "kubernetes"}
2024-11-25T13:26:43Z INFO atlas_migration.reconcile applying pending migrations {"count": 4}
2024-11-25T13:26:43Z DEBUG events Error: deployment rate limit exceeded, try again later
Error: postgres: scanning system variables: pq: unsupported startup parameter: search_path {"type": "Warning", "object": {"kind":"AtlasMigration","namespace":"postgres-migrator","name":"migration","uid":"854292d1-aced-473e-8ce7-b39c745a78ef","apiVersion":"db.atlasgo.io/v1alpha1","resourceVersion":"38119678"}, "reason": "TransientErr"}
2024-11-25T13:26:48Z INFO atlas_migration.reconcile reconciling migration {"env": "kubernetes"}
2024-11-25T13:26:48Z INFO atlas_migration.reconcile applying pending migrations {"count": 4}
2024-11-25T13:26:48Z DEBUG events Error: deployment rate limit exceeded, try again later
Error: postgres: scanning system variables: pq: unsupported startup parameter: search_path {"type": "Warning", "object": {"kind":"AtlasMigration","namespace":"postgres-migrator","name":"migration","uid":"854292d1-aced-473e-8ce7-b39c745a78ef","apiVersion":"db.atlasgo.io/v1alpha1","resourceVersion":"38119678"}, "reason": "TransientErr"}
2024-11-25T13:26:53Z INFO atlas_migration.reconcile reconciling migration {"env": "kubernetes"}
2024-11-25T13:26:54Z INFO atlas_migration.reconcile applying pending migrations {"count": 4}
2024-11-25T13:26:54Z DEBUG events Error: deployment rate limit exceeded, try again later
Error: postgres: scanning system variables: pq: unsupported startup parameter: search_path {"type": "Warning", "object": {"kind":"AtlasMigration","namespace":"postgres-migrator","name":"migration","uid":"854292d1-aced-473e-8ce7-b39c745a78ef","apiVersion":"db.atlasgo.io/v1alpha1","resourceVersion":"38119678"}, "reason": "TransientErr"}

I already talked to your support, and they lifted the run limit for the trial period.

As for the FATAL: role "postgres" does not exist error that is coming from the dev db pod, I would love some guidance on how to resolve it.

Thanks,
Tal

@ariga-peretz

Hi Tal, let's start with a couple of turn-it-off-and-on kind of things and work from there ;-). Can you check the following?

Issue 1: FATAL: role "postgres" does not exist

This error happens because the default postgres role doesn't exist. Ensure the role is created during initialization:

  1. Add this to your PostgreSQL initdb.d script:

    CREATE ROLE postgres WITH LOGIN SUPERUSER PASSWORD 'yourpassword';

    Place the script in /docker-entrypoint-initdb.d/.

  2. Ensure your Deployment sets valid environment variables like POSTGRES_USER.
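For illustration, here is a minimal sketch of such an env block for the official postgres image; the image tag and the values below are placeholders, not the operator's actual defaults:

containers:
  - name: postgres
    image: postgres:17
    env:
      - name: POSTGRES_USER        # role created by the image's init script
        value: postgres
      - name: POSTGRES_PASSWORD    # password for that role
        value: yourpassword
      - name: POSTGRES_DB          # database to create; defaults to POSTGRES_USER if omitted
        value: postgres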


Issue 2: unsupported startup parameter: search_path

This happens if search_path is misconfigured.

  • Check your YAML for unnecessary search_path parameters and remove them (see the sketch after this list).
  • To set it for a specific schema, log into the database and run:
    ALTER ROLE your_role SET search_path TO 'schema1,schema2';
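For example, here is a hypothetical connection-parameters block where a search_path entry slipped in; the layout mirrors the AtlasMigration credentials block above, and the search_path line is purely illustrative:

credentials:
  # ...
  parameters:
    sslmode: disable
    # A search_path entry here would be passed to the server as a startup
    # parameter, which connection poolers such as PgBouncer may reject;
    # remove it if present.
    # search_path: public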

giautm commented Nov 25, 2024

Hello @talsuk5, what version of the atlas-operator are you running? We already fixed the issue with the dev db's role in v0.6.3, which now uses the default PG role.

talsuk5 commented Nov 25, 2024

Hi @ariga-peretz, thanks for the reply.
As for Issue #2, I found that the culprit was using the PgBouncer svc. See here:
https://stackoverflow.com/questions/71798409/postgressql-org-postgresql-util-psqlexception-error-unsupported-startup-param
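For anyone hitting the same thing, a minimal sketch of the workaround is to point the credentials at the direct CNPG read-write service instead of the pooler; the host below is an assumption based on CNPG's <cluster>-rw naming convention, not my actual service name:

credentials:
  host: postgresql-cluster-rw.cnpg-system.svc.cluster.local  # direct rw service, bypasses PgBouncer
  port: 5432
  database: app
  scheme: postgres
  parameters:
    sslmode: disable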

As for Issue #1, just to be clear: I think this is coming from your deployment.
Applying the migration works from my local Mac using this command:
atlas migrate apply --env sqlalchemy --url "<my_private_postgres_conn_string>" --baseline <my_baseline>

querying my db for

select * from pg_user;

and

select * from pg_roles;
[screenshots of the pg_user and pg_roles query results]

You can see that there's a postgres user and role (which is also a superuser/superrole created by my PostgreSQL system; disclosure: I'm using CNPG).

But like I said, I think it's related to your inner deployment:

apiVersion: apps/v1
kind: ReplicaSet
metadata:
  annotations:
    deployment.kubernetes.io/desired-replicas: '1'
    deployment.kubernetes.io/max-replicas: '2'
    deployment.kubernetes.io/revision: '2'
  creationTimestamp: '2024-11-25T13:48:17Z'
  generation: 1
  labels:
    app.kubernetes.io/created-by: controller-manager
    app.kubernetes.io/instance: migration-atlas-dev-db
    app.kubernetes.io/name: atlas-dev-db
    app.kubernetes.io/part-of: atlas-operator
    atlasgo.io/engine: postgres
    pod-template-hash: 7dfd65997c
  name: migration-atlas-dev-db-7dfd65997c
  namespace: postgres-migrator
  ownerReferences:
    - apiVersion: apps/v1
      blockOwnerDeletion: true
      controller: true
      kind: Deployment
      name: migration-atlas-dev-db
      uid: 2e1c7c24-07dc-4590-91e6-e2a2af0cecb2
  resourceVersion: '38134509'
  uid: 73c76563-bdd9-4b92-a666-f4d539499dcc
spec:
  replicas: 1
  selector:
    matchLabels:
      app.kubernetes.io/created-by: controller-manager
      app.kubernetes.io/instance: migration-atlas-dev-db
      app.kubernetes.io/name: atlas-dev-db
      app.kubernetes.io/part-of: atlas-operator
      atlasgo.io/engine: postgres
      pod-template-hash: 7dfd65997c
  template:
    metadata:
      annotations:
        atlasgo.io/conntmpl: postgres://root:pass@localhost:5432/postgres?sslmode=disable
        kubectl.kubernetes.io/restartedAt: '2024-11-25T13:48:17Z'
      creationTimestamp: null
      labels:
        app.kubernetes.io/created-by: controller-manager
        app.kubernetes.io/instance: migration-atlas-dev-db
        app.kubernetes.io/name: atlas-dev-db
        app.kubernetes.io/part-of: atlas-operator
        atlasgo.io/engine: postgres
        pod-template-hash: 7dfd65997c
    spec:
      containers:
        - env:
            - name: POSTGRES_DB
              value: postgres
            - name: POSTGRES_USER
              value: root
            - name: POSTGRES_PASSWORD
              value: pass
          image: postgres:latest
          imagePullPolicy: Always
          name: postgres
          ports:
            - containerPort: 5432
              name: postgres
              protocol: TCP
          resources: {}
          securityContext:
            allowPrivilegeEscalation: false
            capabilities:
              drop:
                - ALL
            runAsNonRoot: true
            runAsUser: 999
          startupProbe:
            exec:
              command:
                - pg_isready
            failureThreshold: 30
            periodSeconds: 10
            successThreshold: 1
            timeoutSeconds: 1
          terminationMessagePath: /dev/termination-log
          terminationMessagePolicy: File
      dnsPolicy: ClusterFirst
      restartPolicy: Always
      schedulerName: default-scheduler
      securityContext: {}
      terminationGracePeriodSeconds: 30
status:
  availableReplicas: 1
  fullyLabeledReplicas: 1
  observedGeneration: 1
  readyReplicas: 1
  replicas: 1

Also, is there any way I can change the inner deployment image to not be latest, or change the pull policy to IfNotPresent? I think there's some room for improvement here 😃 Please LMK if that's on your roadmap 🙏

talsuk5 commented Nov 25, 2024

Hello @talsuk5, what version of the atlas-operator are you running? We already fixed the issue with the dev db's role in v0.6.3, which now uses the default PG role.

Hi @giautm, it's 0.6.1. I will try to upgrade and report back.

talsuk5 commented Nov 25, 2024

Hello @talsuk5, what version of the atlas-operator are you running? We already fixed the issue with the dev db's role in v0.6.3, which now uses the default PG role.

Hi @giautm, it's 0.6.1. I will try to upgrade and report back.

Hi @giautm, I upgraded the operator to version 0.6.5 and the error is gone. Hooray! 🎉

giautm closed this as completed Dec 17, 2024