Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add -ness checks and refactor migrations #1674

Merged
merged 1 commit into from
Mar 6, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
80 changes: 80 additions & 0 deletions config/crd/bases/awx.ansible.com_awxs.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -1571,6 +1571,86 @@ spec:
description: Number of task instance replicas
type: integer
format: int32
web_liveness_initial_delay:
description: Initial delay before starting liveness checks on web pod
type: integer
default: 5
format: int32
task_liveness_initial_delay:
description: Initial delay before starting liveness checks on task pod
type: integer
default: 5
format: int32
web_liveness_period:
description: Time period in seconds between each liveness check for the web pod
type: integer
default: 0
format: int32
task_liveness_period:
description: Time period in seconds between each liveness check for the task pod
type: integer
default: 0
format: int32
web_liveness_failure_threshold:
description: Number of consecutive failure events to identify failure of web pod
type: integer
default: 3
format: int32
task_liveness_failure_threshold:
description: Number of consecutive failure events to identify failure of task pod
type: integer
default: 3
format: int32
web_liveness_timeout:
description: Number of seconds to wait for a probe response from web pod
type: integer
default: 1
format: int32
task_liveness_timeout:
description: Number of seconds to wait for a probe response from task pod
type: integer
default: 1
format: int32
web_readiness_initial_delay:
description: Initial delay before starting readiness checks on web pod
type: integer
default: 20
format: int32
task_readiness_initial_delay:
description: Initial delay before starting readiness checks on task pod
type: integer
default: 20
format: int32
web_readiness_period:
description: Time period in seconds between each readiness check for the web pod
type: integer
default: 0
format: int32
task_readiness_period:
description: Time period in seconds between each readiness check for the task pod
type: integer
default: 0
format: int32
web_readiness_failure_threshold:
description: Number of consecutive failure events to identify failure of web pod
type: integer
default: 3
format: int32
task_readiness_failure_threshold:
description: Number of consecutive failure events to identify failure of task pod
type: integer
default: 3
format: int32
web_readiness_timeout:
description: Number of seconds to wait for a probe response from web pod
type: integer
default: 1
format: int32
task_readiness_timeout:
description: Number of seconds to wait for a probe response from task pod
type: integer
default: 1
format: int32
garbage_collect_secrets:
description: Whether or not to remove secrets upon instance removal
default: false
Expand Down
11 changes: 11 additions & 0 deletions config/rbac/role.yaml
Original file line number Diff line number Diff line change
Expand Up @@ -78,6 +78,17 @@ rules:
- patch
- update
- watch
- apiGroups:
- batch
resources:
- jobs
verbs:
- get
- list
- create
- patch
- update
- watch
- apiGroups:
- monitoring.coreos.com
resources:
Expand Down
40 changes: 40 additions & 0 deletions docs/user-guide/advanced-configuration/container-probes.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,40 @@
#### Container Probes
These parameters control the usage of liveness and readiness container probes for
the web and task containers.

#### Web / Task Container Liveness Check

The liveness probe queries the status of the supervisor daemon of the container. The probe will fail if it
detects one of the services in a state other than "RUNNING".

| Name | Description | Default |
| web_liveness_period | Time period in seconds between each probe check. The value of 0 disables the probe. | 0 |
| web_liveness_initial_delay | Initial delay before starting probes in seconds | 5 |
| web_liveness_failure_threshold| Number of consecutive failure events to identify failure of container | 3 |
| web_liveness_timeout | Number of seconds to wait for a probe response from container | 1 |
| task_liveness_period | Time period in seconds between each probe check. The value of 0 disables the probe. | 0 |
| task_liveness_initial_delay | Initial delay before starting probes in seconds | 5 |
| task_liveness_failure_threshold| Number of consecutive failure events to identify failure of container | 3 |
| task_liveness_timeout | Number of seconds to wait for a probe response from container | 1 |

#### Web Container Readiness Check

This is a HTTP check against the status endpoint to confirm the system is still able to respond to web requests.

| Name | Description | Default |
| -------------| ---------------------------------- | ------- |
| web_readiness_period | Time period in seconds between each probe check. The value of 0 disables the probe. | 0 |
| web_readiness_initial_delay | Initial delay before starting probes in seconds | 5 |
| web_readiness_failure_threshold| Number of consecutive failure events to identify failure of container | 3 |
| web_readiness_timeout | Number of seconds to wait for a probe response from container | 1 |

#### Task Container Readiness Check

This is a command probe using the builtin check command of the awx-manage utility.

| Name | Description | Default |
| -------------| ---------------------------------- | ------- |
| task_readiness_period | Time period in seconds between each probe check. The value of 0 disables the probe. | 0 |
| task_readiness_initial_delay | Initial delay before starting probes in seconds | 5 |
| task_readiness_failure_threshold| Number of consecutive failure events to identify failure of container | 3 |
| task_readiness_timeout | Number of seconds to wait for a probe response from container | 1 |
32 changes: 16 additions & 16 deletions roles/installer/tasks/initialize_django.yml
Original file line number Diff line number Diff line change
Expand Up @@ -2,8 +2,8 @@
- name: Check if there are any super users defined.
k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_task_pod_name }}"
container: "{{ ansible_operator_meta.name }}-task"
pod: "{{ awx_web_pod_name }}"
container: "{{ ansible_operator_meta.name }}-web"
command: >-
bash -c "echo 'from django.contrib.auth.models import User;
nsu = User.objects.filter(is_superuser=True, username=\"{{ admin_user }}\").count();
Expand All @@ -16,8 +16,8 @@
- name: Create super user via Django if it doesn't exist.
k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_task_pod_name }}"
container: "{{ ansible_operator_meta.name }}-task"
pod: "{{ awx_web_pod_name }}"
container: "{{ ansible_operator_meta.name }}-web"
command: awx-manage createsuperuser --username={{ admin_user | quote }} --email={{ admin_email | quote }} --noinput
register: result
changed_when: "'That username is already taken' not in result.stderr"
Expand All @@ -28,8 +28,8 @@
- name: Update Django super user password
k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_task_pod_name }}"
container: "{{ ansible_operator_meta.name }}-task"
pod: "{{ awx_web_pod_name }}"
container: "{{ ansible_operator_meta.name }}-web"
command: awx-manage update_password --username='{{ admin_user }}' --password='{{ admin_password }}'
register: result
changed_when: "'Password updated' in result.stdout"
Expand All @@ -39,8 +39,8 @@
- name: Check if legacy queue is present
k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_task_pod_name }}"
container: "{{ ansible_operator_meta.name }}-task"
pod: "{{ awx_web_pod_name }}"
container: "{{ ansible_operator_meta.name }}-web"
command: >-
bash -c "awx-manage list_instances | grep '^\[tower capacity=[0-9]*\]'"
register: legacy_queue
Expand All @@ -50,8 +50,8 @@
- name: Unregister legacy queue
k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_task_pod_name }}"
container: "{{ ansible_operator_meta.name }}-task"
pod: "{{ awx_web_pod_name }}"
container: "{{ ansible_operator_meta.name }}-web"
command: >-
bash -c "awx-manage unregister_queue --queuename=tower"
when: "'[tower capacity=' in legacy_queue.stdout"
Expand All @@ -74,8 +74,8 @@
- name: Register default execution environments (without authentication)
k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_task_pod_name }}"
container: "{{ ansible_operator_meta.name }}-task"
pod: "{{ awx_web_pod_name }}"
container: "{{ ansible_operator_meta.name }}-web"
command: >-
bash -c "awx-manage register_default_execution_environments"
register: ree
Expand All @@ -95,8 +95,8 @@
- name: Register default execution environments (with authentication)
k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_task_pod_name }}"
container: "{{ ansible_operator_meta.name }}-task"
pod: "{{ awx_web_pod_name }}"
container: "{{ ansible_operator_meta.name }}-web"
command: >-
bash -c "awx-manage register_default_execution_environments
--registry-username='{{ default_execution_environment_pull_credentials_user }}'
Expand All @@ -111,8 +111,8 @@
- name: Create preload data if necessary. # noqa 305
k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_task_pod_name }}"
container: "{{ ansible_operator_meta.name }}-task"
pod: "{{ awx_web_pod_name }}"
container: "{{ ansible_operator_meta.name }}-web"
command: >-
bash -c "awx-manage create_preload_data"
register: cdo
Expand Down
46 changes: 4 additions & 42 deletions roles/installer/tasks/install.yml
Original file line number Diff line number Diff line change
Expand Up @@ -94,51 +94,13 @@
- name: Include resources configuration tasks
dhageman marked this conversation as resolved.
Show resolved Hide resolved
include_tasks: resources_configuration.yml

- name: Check for pending migrations
k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_task_pod_name }}"
container: "{{ ansible_operator_meta.name }}-task"
command: >-
bash -c "awx-manage showmigrations | grep -v '[X]' | grep '[ ]' | wc -l"
changed_when: false
when: awx_task_pod_name != ''
register: database_check

- name: Migrate the database if the K8s resources were updated # noqa 305
k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_task_pod_name }}"
container: "{{ ansible_operator_meta.name }}-task"
command: |
bash -c "
function end_keepalive {
rc=$?
rm -f \"$1\"
kill $(cat /proc/$2/task/$2/children 2>/dev/null) 2>/dev/null || true
wait $2 || true
exit $rc
}
keepalive_file=\"$(mktemp)\"
while [[ -f \"$keepalive_file\" ]]; do
echo 'Database schema migration in progress...'
sleep 60
done &
keepalive_pid=$!
trap 'end_keepalive \"$keepalive_file\" \"$keepalive_pid\"' EXIT SIGINT SIGTERM
echo keepalive_pid: $keepalive_pid
awx-manage migrate --noinput
echo 'Successful'
"
register: migrate_result
when:
- awx_task_pod_name != ''
- database_check is defined
- (database_check.stdout|trim) != '0'
- name: Migrate database to the latest schema
include_tasks: migrate_schema.yml
when: awx_web_pod_name != ''

- name: Initialize Django
include_tasks: initialize_django.yml
kdelee marked this conversation as resolved.
Show resolved Hide resolved
when: awx_task_pod_name != ''
when: awx_web_pod_name != ''

- name: Update status variables
include_tasks: update_status.yml
Expand Down
57 changes: 57 additions & 0 deletions roles/installer/tasks/migrate_schema.yml
Original file line number Diff line number Diff line change
@@ -0,0 +1,57 @@
---

- name: Check for pending migrations
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

👍
This task ensures that the k8s job is only run if there are new schema changes to be migrated.

And the unique name based on the version and the random hash ensure that there are no conflicts with other k8s jobs from previous migrations.

k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_web_pod_name }}"
container: "{{ ansible_operator_meta.name }}-web"
command: >-
bash -c "awx-manage showmigrations | grep -v '[X]' | grep '[ ]' | wc -l"
changed_when: false
when: awx_web_pod_name != ''
register: database_check

- block:
- name: Get version of controller for tracking
k8s_exec:
namespace: "{{ ansible_operator_meta.namespace }}"
pod: "{{ awx_web_pod_name }}"
container: "{{ ansible_operator_meta.name }}-web"
command: >-
bash -c "awx-manage --version"
changed_when: false
register: version_check

- name: Update instance version
set_fact:
version: "{{ version_check.stdout | trim }}"

# It is possible to do a wait on this task to create the job and wait
# until it completes. Unfortunately, if the job doesn't wait finish within
# the timeout period that is considered an error. We only want this to
# error if there is an issue with creating the job.
- name: Create kubernetes job to perform the migration
k8s:
apply: yes
definition: "{{ lookup('template', 'jobs/migration.yaml.j2') }}"
register: migrate_result

# This task is really only necessary for new installations. We need to
# ensure the database has a schema loaded before continuing with the
# initialization of admin user, etc.
- name: Watch for the migration job to finish
k8s_info:
kind: Job
namespace: "{{ ansible_operator_meta.namespace }}"
name: "{{ ansible_operator_meta.name }}-migration-{{ version }}"
register: result
until:
- result.resources[0].status.succeeded is defined
- result.resources[0].status.succeeded == 1
retries: 180
delay: 5
ignore_errors: true

when:
- database_check is defined
- (database_check.stdout|trim) != '0'
Loading
Loading