Containerized Full RP Dev Automation #3764

Open
wants to merge 11 commits into
base: master

Collaborator

@razo7 razo7 commented Aug 8, 2024

This PR automates the full RP dev (INT-like) deployment procedure using a Dockerfile and a Makefile target.

Which issue this PR addresses:

Fixes ARO-9327

Procedure Steps and Timing:

On average, the full automation takes 65 minutes in eastus. The procedure steps and their timing (following the design) are:

  1. Verify full RP succeeded ~3 seconds
  2. Download shared secrets ~5 seconds
  3. Setup dev-config.yaml ~30 seconds
  4. Pre-deploy resources ~4 minutes
  5. add_hive ~32 minutes (dev-vpn 20 minutes, aks-dev 12 minutes)
  6. Install Hive ~1 minute
  7. mirror_images ~14 minutes
  8. prepare_RP_deployment ~1 minute
  9. fully_deploy_resources ~13 minutes

What are these bash scripts?

  • rp_dev_helper.sh is a bash script with functions for checking and using Azure resources. Any SRE can use them, independently of the Full RP automation.
  • full_rp_funcs.sh has functions that correspond to the designed steps of the Full RP creation. They are triggered by full_rp_deploy.sh inside your local container.

What this PR does / why we need it:

The PR adds the capability of quickly and easily provisioning INT-like development environments without manual steps using a local containerized process.

How to run the ARO RP automation?

  1. Log in to Azure
  2. Run AZURE_PREFIX=aaa RP_LOCATION=eastus SKIP_DEPLOYMENTS=true make full-rp-dev

Comments:

  • We copy local Azure credentials to the container, so make sure they are at ${HOME}/.azure.
    You can run az account show --query state -o tsv to check whether your Azure CLI login is successful. Look for Enabled in the output.
  • Use AZURE_PREFIX with unique characters (otherwise your resources could collide with those of other developers). See also Optionally Use USER Environment Variable for Azure Resources #3681 (comment) for the motivation behind AZURE_PREFIX.
  • The automation is skipped when the RP and GWY VMSSs have already been provisioned successfully and their final deployment succeeded.
  • (optional) Set the RP_LOCATION var to your preferred Azure location (by default it is eastus).
  • (optional) Set the SKIP_DEPLOYMENTS var to false (by default it is true) when you prefer to deploy Azure resources regardless of whether they already exist.
  • (optional) You may run the target with RP_FULL_DEV_IMAGE var (e.g., RP_FULL_DEV_IMAGE=YOUR_REPO_AND_TAG) in case you want to push the container to your registry.
  • Carefully clean Azure resources by running AZURE_PREFIX=aaa clean_rp_dev_env eastus or AZURE_PREFIX=aaa RP_LOCATION=eastus make full-rp-dev-clenup (based on the above creation command).
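
The credential and login checks described in the comments above could be wrapped in a small pre-flight helper that runs before the make target. This is only a sketch: the function names azure_login_ok and azure_dir_present are hypothetical and are not part of the scripts in this PR. The only real command it refers to is az account show --query state -o tsv, which is mentioned above.

```shell
#!/usr/bin/env bash
# Hypothetical pre-flight helpers (illustration only, not from the PR).

# Succeeds when the given subscription state is exactly "Enabled".
# The state would normally come from:
#   az account show --query state -o tsv
azure_login_ok() {
    local state="$1"
    [ "$state" = "Enabled" ]
}

# Succeeds when the Azure CLI directory that is copied into the
# container exists under the given home directory.
azure_dir_present() {
    local home_dir="$1"
    [ -d "${home_dir}/.azure" ]
}

# Example usage (commented out so the sketch has no side effects):
#   if azure_dir_present "$HOME" &&
#      azure_login_ok "$(az account show --query state -o tsv)"; then
#       AZURE_PREFIX=aaa RP_LOCATION=eastus make full-rp-dev
#   fi
```

Passing the state in as an argument, rather than calling az inside the function, keeps the check easy to exercise without an Azure login.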

Which changes does it include?

  • Add the dockerfile Dockerfile.rp-full-dev to build an image with the packages required to automate the full RP dev (INT-like) deployment.
  • Add a target that creates the container and runs the full RP dev creation.
  • Add two bash scripts under hack/rp-dev/ for automating https://github.com/Azure/ARO-RP/blob/master/docs/deploy-full-rp-service-in-dev.md#deploying-an-int-like-development-rp and a helper script hack/devtools/rp-dev-helper.sh. Run usage_rp_funcs and usage_rp_dev to get the functions' usage help.
  • Add a target to clean up the 4 resource groups and 4 keyvaults that are needed per USER/AZURE_PREFIX.
  • Separate the deploy target into three targets: pre-deploy-full, pre-deploy-no-aks, and deploy. This is needed because the rpServiceKeyvaultDynamic deployment depends on aro-aks-cluster-001, which has not been set up yet at that point.
  • Workaround for azsec-monitor problem
  • Don't wait on Hive installation, similar to Refactor Hive Directory #3765
  • Use self-signed certificates because the old certs have expired.
  • Check the state of Azure resources prior to recreation, and skip recreation when SKIP_DEPLOYMENTS=true.
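
The skip behaviour in the last item can be pictured as a single decision function. The sketch below is an assumption about the shape of that logic, not the code in this PR: should_deploy is a hypothetical name, and it assumes the resource state is the Azure provisioningState string (for example Succeeded), passed in as an argument.

```shell
#!/usr/bin/env bash
# Hypothetical decision helper (illustration only, not from the PR).
#
# should_deploy STATE [SKIP]
#   STATE: provisioning state of the existing resource, or "" if missing
#   SKIP:  value of SKIP_DEPLOYMENTS, defaulting to "true"
# Succeeds (exit 0) when the resource should be (re)deployed.
should_deploy() {
    local state="$1"
    local skip="${2:-true}"
    if [ "$skip" = "true" ] && [ "$state" = "Succeeded" ]; then
        return 1   # resource exists and is healthy, so skip it
    fi
    return 0       # missing, unhealthy, or skipping is disabled
}

# Example usage (commented out so the sketch has no side effects):
#   if should_deploy "$current_state" "$SKIP_DEPLOYMENTS"; then
#       echo "deploying"
#   else
#       echo "skipping"
#   fi
```

With SKIP_DEPLOYMENTS=false the function always asks for a deployment, which matches the description of deploying resources regardless of their existence.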

How to review?

1. Check out a new branch (e.g., git checkout -b NEW_BRANCH)
2. Add your changes, set the ARO_RP_BRANCH var to NEW_BRANCH, and commit them (git add . and then git commit -m "MESSAGE")
3. Push the changes (e.g., git push origin NEW_BRANCH)

Test plan for issue:

Run AZURE_PREFIX=aaa RP_LOCATION=eastus SKIP_DEPLOYMENTS=true make full-rp-dev to verify that it fully deploys your ARO RP.

Limitation/Known-issues:

  • Sometimes the RP deployment fails with the following error after three metric rules fail, even though the RP and GWY VMSSs have been provisioned successfully

{"code":"DeploymentFailed","message":"At least one resource deployment operation failed. Please list deployment operations for details. Please see https://aka.ms/arm-deployment-operations for usage details.","details":[{"code":"BadRequest","message":"Couldn't find a metric named DipAvailability. Make sure the name is correct. Activity ID: 01984677-d0df-4c08-a733-c01a8cb6ed30."},{"code":"BadRequest","message":"Couldn't find a metric named VipAvailability. Make sure the name is correct. Activity ID: efa0cfbc-705c-4b8b-9d02-a202d99a6a37."},{"code":"BadRequest","message":"Couldn't find a metric named DipAvailability. Make sure the name is correct. Activity ID: 1dd4d996-6d70-43e3-bacf-92a1702bc973."}]}

  • Mirroring the OCP images (step 10.3) can take a long time, or even fail, on a slow internet connection, while on a fast connection it can finish in minutes.

Is there any documentation that needs to be updated for this PR?

Yes, at https://github.com/Azure/ARO-RP/blob/master/docs/deploy-full-rp-service-in-dev.md#deploying-an-int-like-development-rp

How do you know this will function as expected in production?

It has been tested many times, and it is not planned to be used in production. It exists only so that SREs can get a fully running RP in an automated way.

Collaborator Author

razo7 commented Aug 13, 2024

/azp run ci, e2e

Azure Pipelines successfully started running 2 pipeline(s).

Collaborator Author

razo7 commented Sep 22, 2024

I added some fixes due to moving to podman when building the container, and some small gotchas (e.g., the usage of -ojson). CC @tiguelu

Collaborator Author

razo7 commented Sep 22, 2024

/azp run ci,e2e

Azure Pipelines successfully started running 2 pipeline(s).

It respects .dockerignore, thus the .bingo directory should be included. When untarring the secrets archive we should use --no-same-owner so that ownership is not modified. There is no need to check whether dev-config.yaml exists when it is already ignored. Furthermore, the new build runs with no cache, logging in to ACR uses the -ojson flag, and we compare the repo tag against each existing tag.
@razo7 razo7 force-pushed the razo7/ARO-9327 branch 3 times, most recently from 575d05d to 9a78b8f on September 22, 2024 15:32
Contributor

@jaitaiwan jaitaiwan left a comment

I've reviewed several files but I need to come back and review more with fresh eyes.

ehvs

This comment was marked as off-topic.

Collaborator Author

razo7 commented Sep 23, 2024

/azp run ci,e2e

Azure Pipelines successfully started running 2 pipeline(s).

…y fix

Using .dockerignore does not pass all the files from the repo, resulting in a dirty git status and in running the deploy target with a dirty suffix, regardless of the git repo state. Allow the user import access to the 3 KeyVaults where we store the certs.
Collaborator Author

razo7 commented Sep 26, 2024

/azp run ci,e2e

Azure Pipelines successfully started running 2 pipeline(s).

Collaborator Author

razo7 commented Sep 26, 2024

There is an issue with importing certs to 3 keyvaults when running the Full RP automation:

At the moment, the automation fails on the first try because the DipAvailability metric is missing, and it succeeds on the second try with SKIP_DEPLOYMENTS=true. This issue can be addressed in a follow-up.

check_vmss: 🟢🖥️ VMSS 'rp-vmss-69c86c8' in Resource group 'xxx-aro-eastus' has been provisioned successfully. DELETE_VMSS:false
check_vmss: 🟢🖥️ VMSS 'gateway-vmss-69c86c8' in Resource group 'xxx-gwy-eastus' has been provisioned successfully. DELETE_VMSS:false
fully_deploy_resources: Success step 8 ✅ - fully deploy all the resources for ARO RP and GWY VMSSs

Contributor

@tiguelu tiguelu left a comment

I still need to review the rest of the changes more closely. I just happened to see that the change in the deploy code is a blocker for merging, imo.

if err != nil {
	return err
}

// Must be last step so we can be sure there are no RPs at older versions
// still serving
return deployer.SaveVersion(ctx)
After the split of the original function, SaveVersion has ended up in two places: here and in the new deploy function. It should not be called before we have deployed the RP and Gateway. Doing so will cause problems in Production, because the new version can be stored even if the actual deploy fails while we still keep the older deploy.

Also, this split is being reflected only in the Makefile, but this refactoring will have implications also in the Production pipelines because we will be only calling aro deploy there. Hence, this invalidates the deployment test we did in INT.
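
The ordering concern can be illustrated with a small shell sketch. It is not the Go code under review: deploy_then_save_version and the commands passed to it are hypothetical stand-ins for the deploy steps and for SaveVersion, and the only point is that the version is recorded strictly after the deploy step has succeeded.

```shell
#!/usr/bin/env bash
# Hypothetical illustration of "save the version last" (not the PR's code).
#
# deploy_then_save_version DEPLOY_CMD SAVE_CMD
#   DEPLOY_CMD: command or function that deploys the RP and Gateway
#   SAVE_CMD:   command or function that records the new version
deploy_then_save_version() {
    local deploy_cmd="$1"
    local save_cmd="$2"
    # If the deploy fails, return early so the old version stays recorded.
    "$deploy_cmd" || return 1
    # Reached only after a successful deploy.
    "$save_cmd"
}

# Example usage (commented out; both names are placeholders):
#   deploy_then_save_version deploy_rp_and_gateway save_version
```

If the save step ran in an earlier phase, a failed deploy would leave the new version recorded while the older deploy is still serving, which is the Production problem described above.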

We should not save the version before we have deployed the RP and Gateway. This will cause problems in Production, because the new version can be stored even if the actual deploy fails while we still keep the older deploy.
Collaborator Author

razo7 commented Sep 27, 2024

/azp run ci,e2e

Azure Pipelines successfully started running 2 pipeline(s).

Collaborator Author

razo7 commented Oct 1, 2024

/azp run e2e

Azure Pipelines successfully started running 1 pipeline(s).

@github-actions github-actions bot added the needs-rebase (branch needs a rebase) label Oct 9, 2024
github-actions bot commented Oct 9, 2024

Please rebase pull request.

Labels
needs-rebase branch needs a rebase
9 participants