
Adding Wave Autoscale to the Partners #321

Merged · 19 commits into aws-samples:main · Dec 12, 2024
Conversation

@Ari-suhyeon (Contributor)

Description of changes:
This pull request integrates Wave Autoscale (STCLab) into the AWS EKS testing framework. It includes the following artifacts:

  • FluxCD configuration
  • External Secret

The deployment requires two keys in Parameter Store; they are included in the self-assessment spreadsheet we sent:

  • GHRC_TOKEN
  • WA_LICENSE

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

@elamaran11 (Contributor) left a comment


The Functional Test job is missing. Please check other partner submissions for more details. Also, please send us your external secrets with the full license via email.

@mikemcd3912 (Contributor)

Thanks for the PR!

I have loaded the secrets you provided into our testing account. However, when I went to test it, the solution had trouble scheduling on our testing environments due to its resource demands. Is the wave-autoscale-autopilot container meant to request 3 CPU and 3000Mi of memory?
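For context, the scheduling trouble described here comes from Kubernetes resource requests: the scheduler will only place a pod on a node with that much unreserved capacity. A hypothetical excerpt of the kind of container spec being asked about, using the values quoted above (the container name is from this thread; the surrounding manifest is an assumption):

```yaml
# Hypothetical excerpt; field names are standard Kubernetes,
# values are the ones quoted in the comment above.
containers:
  - name: wave-autoscale-autopilot
    resources:
      requests:
        cpu: "3"        # three full cores reserved per pod
        memory: 3000Mi  # ~3 GiB reserved per pod
```

A request this large will stay Pending on small test nodes even if actual usage is far lower, which is why the question distinguishes "meant to request" from what the workload really needs.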

@mikemcd3912 (Contributor) commented Dec 3, 2024

> Functional Test job is missing. Please check other partner submission for more details

Hi @Ari-suhyeon,

I was able to get the solution deployed and tested across our EKS environments. I noticed that the functional testing focuses on endpoint health checks, and the test was still completing as successful even while pods were in a Pending state (point 3 of the functional job test requirements). Can we update this to also include a test of the functionality, to make sure those pods are both up and functioning as intended?

@Ari-suhyeon (Contributor, Author)

Hi @mikemcd3912

I have identified an issue in the status-check logic within the cronjob.
The logic has been updated to validate the status by checking the response values.
Additionally, I have added StatefulSet check logic that communicates with the Kubernetes API.

Please review the changes at your convenience.
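The two checks described above can be sketched roughly as follows. This is a minimal illustration, not the actual change in this PR: the service endpoint, port, JSON field, namespace, and resource names are all assumptions.

```yaml
# Hypothetical cronjob container command implementing the described checks.
command:
  - /bin/sh
  - -c
  - |
    # 1) Validate the response value, not just HTTP reachability.
    status=$(curl -s http://wave-autoscale.wave-autoscale.svc/api/status | jq -r '.status')
    [ "$status" = "ok" ] || exit 1
    # 2) Ask the Kubernetes API whether the StatefulSet is fully ready.
    ready=$(kubectl get statefulset wave-autoscale -n wave-autoscale -o jsonpath='{.status.readyReplicas}')
    desired=$(kubectl get statefulset wave-autoscale -n wave-autoscale -o jsonpath='{.spec.replicas}')
    [ "$ready" = "$desired" ] || exit 1
```

Checking `readyReplicas` against `spec.replicas` is what distinguishes "fully running" from the Pending-pods-still-pass situation raised earlier in the thread.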

@mikemcd3912 (Contributor) commented Dec 6, 2024

Hi @Ari-suhyeon,

For the functional testing, what we're looking for is a test of the main function of the solution, to show that it is up and running properly, so the health check against the K8s API is still not quite as thorough as we need. One example is partner Tetrate's solution, whose function is to manage networking and ingress: its test creates and then subsequently destroys ingress resources to exercise the product's functionality.

Based on my understanding of your solution, ideally we'd like to see a test that validates that the intelligent scaling is working as expected: for example, creating a temporary pod that is resource constrained from the start, then checking whether Wave Autoscale performs the necessary autoscaling operation on it, before cleaning up the testing resources.

Let me know if you have any questions or if I can provide additional clarification!
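The suggested test could be sketched as a deliberately CPU-constrained Deployment like the one below. This is an illustration, not the submitted artifact: the name `wa-test-dp` does appear later in this thread, but the image, labels, and resource values here are assumptions.

```yaml
# Hypothetical scale-out test target: a busy-loop container that pins CPU
# against a small limit, so a working Wave Autoscale should scale the
# Deployment out, which a verifier job can then assert on.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: wa-test-dp
spec:
  replicas: 1
  selector:
    matchLabels:
      app: wa-test-dp
  template:
    metadata:
      labels:
        app: wa-test-dp
    spec:
      containers:
        - name: stress
          image: busybox
          # Busy-loop to saturate the small CPU limit below.
          command: ["/bin/sh", "-c", "while true; do :; done"]
          resources:
            requests:
              cpu: 50m
              memory: 32Mi
            limits:
              cpu: 100m
              memory: 64Mi
```

A verifier could then poll `kubectl get deploy wa-test-dp -o jsonpath='{.status.replicas}'` and expect the count to exceed 1 within a timeout, before deleting the test resources.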

@Ari-suhyeon (Contributor, Author) commented Dec 9, 2024

Hi @mikemcd3912 ,

Understood.
To test the auto-scaling functionality we need to collect metrics, and for this an agent must be installed as a DaemonSet.
I hope this is acceptable to you.
As my schedule does not allow for it this week, I will proceed with the work next week and push the changes accordingly.

@Ari-suhyeon (Contributor, Author)

Hi @mikemcd3912 ,
We have added a script to verify the scale-out functionality of Wave Autoscale.
This includes deploying a DaemonSet-based agent for metric collection and adding the target Deployment for scaling out.

Additionally, with the version update of Wave Autoscale, it is necessary to remove the previously installed PV and PVC. Please ensure this step is checked carefully.
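The PV/PVC cleanup mentioned above might look roughly like this; the label selector, namespace, and PV name are placeholders, not the actual resource names from this PR.

```shell
# Assumed names: adjust the namespace and selector to the actual
# wave-autoscale installation. PVCs are namespaced; PVs are cluster-scoped.
kubectl delete pvc -n wave-autoscale -l app.kubernetes.io/name=wave-autoscale
kubectl delete pv <bound-pv-name>   # only needed if the reclaim policy is Retain
```

Deleting the PVC first lets the old volume release cleanly before the new version recreates its storage.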

@elamaran11 (Contributor)

Looks like your deployment wa-test-dp is failing.
Specifically, it looks like these pods are failing:

  • Pod: wa-test-dp-549f5f7f75-2mtcl.
  • Pod: wa-test-dp-549f5f7f75-6sn6n.
  • Pod: wa-test-dp-549f5f7f75-6zcw5.
  • Pod: wa-test-dp-549f5f7f75-c4j8x.
  • Pod: wa-test-dp-549f5f7f75-cgfn8.
  • Pod: wa-test-dp-549f5f7f75-f5k85.
  • Pod: wa-test-dp-549f5f7f75-g78sx.
  • Pod: wa-test-dp-549f5f7f75-hgmrw.
  • Pod: wa-test-dp-549f5f7f75-jfrbx.
  • Pod: wa-test-dp-549f5f7f75-kktgg.
  • Pod: wa-test-dp-549f5f7f75-ljn4p.
  • Pod: wa-test-dp-549f5f7f75-mc4dt.
  • Pod: wa-test-dp-549f5f7f75-mzhf5.
  • Pod: wa-test-dp-549f5f7f75-prfw6.
  • Pod: wa-test-dp-549f5f7f75-qrxd9.
  • Pod: wa-test-dp-549f5f7f75-sswh8.
  • Pod: wa-test-dp-549f5f7f75-thb79.
  • Pod: wa-test-dp-549f5f7f75-tzvsc.
  • Pod: wa-test-dp-549f5f7f75-v4zv6.
  • Pod: wa-test-dp-549f5f7f75-v7bgn.
  • Pod: wa-test-dp-549f5f7f75-wsdh2.
  • Pod: wa-test-dp-549f5f7f75-ww4ld.

@Ari-suhyeon (Contributor, Author)

Hi @elamaran11
Please let us know what specifically you mean by it failing.
Logs from the cronjob would be helpful too.
Or are you saying that the wa-test-dp Deployment failed, not the cronjob?
If so, please provide logs from the pods as well.
Also, did you remove the PV and PVC of the existing wave-autoscale installation and redeploy it?

@elamaran11 (Contributor)

@mikemcd3912 Can you report more details on this error?

@mikemcd3912 (Contributor)

Hi @Ari-suhyeon,

Thanks for the updates! The message about all those failed pods was part of an automation, and the failures appear to have been remedied by the redeployment of the solution/volumes after those updates, so I'll keep you updated as I complete the rest of the testing. So far our VMware environment test was successful, so I believe the current iteration is looking promising.

@elamaran11 (Contributor)

Looks like your deployment wa-test-dp is failing.
Specifically, it looks like these pods are failing:

  • Pod: wa-test-dp-8ccbc64df-5rxrf.

@mikemcd3912 (Contributor)

> Looks like your deployment wa-test-dp is failing. Specifically, it looks like these pods are failing:
>
> * Pod: `wa-test-dp-8ccbc64df-5rxrf`.

Feel free to ignore this; it appears that we're reaching CPU capacity on one of our test clusters, and that's causing some automated messages to go out when the scheduler doesn't have space for new pods.

@mikemcd3912 (Contributor) left a comment


VMware - Success
Hybrid - Success
Baremetal - Success
Cloud Bottlerocket x86 - Success
Cloud Bottlerocket ARM - Success
Auto Mode - Success

Pods Deploy and testers complete successfully in all environments - LGTM

@mikemcd3912 mikemcd3912 dismissed elamaran11’s stale review December 12, 2024 22:33

Functional job has been added and completes successfully in all environments

@mikemcd3912 mikemcd3912 merged commit e82551a into aws-samples:main Dec 12, 2024
1 check passed