Create a CI workflow that creates new AMIs using packer #258

Closed
gaiksaya opened this issue Mar 13, 2023 · 14 comments · Fixed by #263
Labels: enhancement, good first issue

Comments

@gaiksaya
Member

gaiksaya commented Mar 13, 2023

Is your feature request related to a problem? Please describe

Currently, the AMIs used by agent nodes use a specific base image that may go out of date or need updates as new kernel updates come in.
This happens as often as once per quarter. Even though we run yum update, apt update, etc., we still need to reboot the EC2 instance to apply those updates, which does not fit the Jenkins agent nodes' lifecycle management: if the SSH connection is lost (as it is when we reboot), a new agent is brought up.

Describe the solution you'd like

To apply regular updates to the base AMI image, we need to build a new AMI.
With Packer this is a fairly straightforward process: https://github.com/opensearch-project/opensearch-ci/tree/main/packer

Below are 2 possible approaches:

  1. Use a GitHub Actions (GHA) workflow that creates new AMIs and opens a pull request to update them in this repository.
  2. Use a Jenkins workflow that does the same.

Please keep in mind that this needs to be a blue-green deployment, so old AMIs should be deprecated (made private) only after confirming that the new AMIs are working fine. This can be a manual process to start with, but it can also be automated via GHA if we maintain a list somewhere.
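
A minimal sketch of the blue-green rule described above, assuming the "list somewhere" is a set of records with a platform, a creation date, and a flag set once the new AMI is confirmed working. `AmiRecord`, its fields, and `amis_safe_to_deprecate` are hypothetical names for illustration; nothing here calls AWS.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AmiRecord:
    """One entry in a hypothetical AMI tracking list."""
    ami_id: str
    platform: str   # illustrative label, e.g. "al2-x64"
    verified: bool  # True once agents built from this AMI are confirmed working
    created: str    # ISO date, so string comparison orders chronologically


def amis_safe_to_deprecate(records):
    """Return the IDs of AMIs that can be made private.

    An AMI qualifies only when a newer, verified AMI exists for the
    same platform, so an old AMI is never retired before its
    replacement has been confirmed.
    """
    newest_verified = {}
    for rec in records:
        if rec.verified:
            current = newest_verified.get(rec.platform)
            if current is None or rec.created > current.created:
                newest_verified[rec.platform] = rec
    return sorted(
        rec.ami_id
        for rec in records
        if rec.platform in newest_verified
        and rec.created < newest_verified[rec.platform].created
    )
```

With an old verified AMI and a new unverified one, the function returns an empty list; once the new AMI is marked verified, the old one is returned as safe to make private.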

Describe alternatives you've considered

Do the entire process manually. However, building the AMIs, even with Packer, takes more than half a day for the number of AMIs we have.

Additional context

No response

@gaiksaya gaiksaya added enhancement New feature or request untriaged Issues that have not yet been triaged good first issue Good for newcomers and removed untriaged Issues that have not yet been triaged labels Mar 13, 2023
@peterzhuamazon
Member

Either way would work, and I prefer using Jenkins.
The only thing we need is the ability to assume a role.

@peterzhuamazon
Member

peterzhuamazon commented Mar 21, 2023

  1. Need to update one Docker CI image to include Packer.
     We can use the docker-builder image for Packer:
     https://developer.hashicorp.com/packer/downloads
  2. Need to create a Jenkins workflow to build the Packer templates here: https://github.com/opensearch-project/opensearch-ci/tree/main/packer

@peterzhuamazon
Member

peterzhuamazon commented Mar 21, 2023

More to consider:

  1. An IAM role to assume, with either full EC2 access or just the access needed for EC2 instance creation and AMI creation.
  2. The SG needs to allow access from the SG of the Jenkins main node on ports 22 / 5985 for the EC2 instance connection during the build.

@peterzhuamazon
Member

We need 3 secrets to hold these values:

  1. The VPC of the Jenkins production cluster
  2. The public subnet of the Jenkins production cluster
  3. The SG mentioned above, preferably the Agent Node SG, as it has all the requirements
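
As a rough sketch of how these three secrets could feed a Packer run: `packer build -var 'key=value'` is the standard way to pass user variables, but the environment variable names and the Packer variable names below are assumptions, not names taken from the opensearch-ci templates.

```python
import os

# Hypothetical names: neither the environment variables nor the Packer
# user variables below come from the opensearch-ci templates.
SECRET_TO_PACKER_VAR = {
    "PACKER_VPC_ID": "vpc_id",
    "PACKER_SUBNET_ID": "subnet_id",
    "PACKER_SECURITY_GROUP_ID": "security_group_id",
}


def packer_build_command(template, env=None):
    """Build the argv list for `packer build` from the three secrets.

    Raises KeyError if any secret is missing or empty, so the job
    fails before Packer launches an instance in the wrong network.
    """
    env = os.environ if env is None else env
    missing = [name for name in SECRET_TO_PACKER_VAR if not env.get(name)]
    if missing:
        raise KeyError("missing secrets: " + ", ".join(sorted(missing)))
    cmd = ["packer", "build"]
    for secret_name, var_name in SECRET_TO_PACKER_VAR.items():
        cmd += ["-var", f"{var_name}={env[secret_name]}"]
    cmd.append(template)
    return cmd
```

Returning an argv list rather than a shell string means the result can be handed to `subprocess.run` without shell quoting, which keeps the secret values out of any shell interpolation.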

@peterzhuamazon
Member

There might be another problem: the node that runs Packer is the source and needs to connect to the destination on ports 22/5985. This means that if the workflow runs on an agent node, the connection would be agent -> agent, whereas our existing SG only allows main -> agent connections.

Either we add a new SG to allow agent -> agent connections (which is strongly discouraged for security reasons), or we restrict the AMI/Packer builder workflow to run only on the main node (main -> agent).
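
To make the source/destination constraint concrete, here is a small model of the check, not a query against real AWS security groups. `IngressRule`, `can_run_packer`, and the SG labels "main" and "agent" are hypothetical; the only facts taken from the thread are the two ports (22 for SSH, 5985 for WinRM) and the existing main -> agent rule.

```python
from dataclasses import dataclass

BUILD_PORTS = (22, 5985)  # SSH and WinRM, the two ports named in this thread


@dataclass(frozen=True)
class IngressRule:
    """Simplified ingress rule: traffic from source_sg to dest_sg on one port."""
    source_sg: str
    dest_sg: str
    port: int


def can_run_packer(runner_sg, builder_sg, rules):
    """Return True if the node running Packer can reach the build instance.

    The runner must be allowed into the build instance's SG on every
    port in BUILD_PORTS; one missing port is enough to fail.
    """
    allowed = {(r.source_sg, r.dest_sg, r.port) for r in rules}
    return all((runner_sg, builder_sg, port) in allowed for port in BUILD_PORTS)


# Existing setup as described above: only main -> agent is allowed.
existing_rules = [IngressRule("main", "agent", port) for port in BUILD_PORTS]
```

Under these assumed rules, `can_run_packer("main", "agent", existing_rules)` is true and `can_run_packer("agent", "agent", existing_rules)` is false, which is the agent -> agent gap described in this comment.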

@peterzhuamazon
Member

Adding @gaiksaya @rishabh6788 @prudhvigodithi to the conversation on the issues above ^^.

Thanks.

@gaiksaya
Member Author

gaiksaya commented Mar 21, 2023

Why are we using Jenkins? GHA can do all of this using roles. All you need to provide is the right VPC and subnet, right?
Whatever we use just needs the right credentials to build the AMI and push it to prod, right?

@peterzhuamazon
Member

> Why are we using Jenkins? GHA can do all of this using roles. All you need to provide is the right VPC and subnet, right? Whatever we use just needs the right credentials to build the AMI and push it to prod, right?

In our discussion yesterday we were already talking about doing this in Jenkins.
If we are OK with using GHA I have no issues, but @prudhvigodithi raised the point that GHA can throttle if the run is too long.

The average Mac build time is 2+ hours and the average Windows build time is 1+ hour, which causes inconsistency in the build overall.

Thanks.

@prudhvigodithi
Member

@gaiksaya AMI building is an expensive task: it requires resources and, above all, can take a lot of time to complete end to end. GH runners would run into the same issues we had with the manifest workflow failure, so it is better to use a Jenkins job.

@gaiksaya
Member Author

Got it! I forgot about the resources aspect. Even so, all a machine needs is the right credentials, which has nothing to do with agent or main node. If the agent node or AMI build is given the right credentials, we should be good. Running anything on the main node is restricted as a security measure, so we cannot and should not run this on the main node.
Regarding

> The SG needs to allow access from the SG of the Jenkins main node on ports 22 / 5985 for the EC2 instance connection during the build.

We already have that in place.
https://github.com/opensearch-project/opensearch-ci/blob/main/lib/security/ci-security-groups.ts#L45-L47

@peterzhuamazon
Member

> Got it! I forgot about the resources aspect. Even so, all a machine needs is the right credentials, which has nothing to do with agent or main node. If the agent node or AMI build is given the right credentials, we should be good. Running anything on the main node is restricted as a security measure, so we cannot and should not run this on the main node. Regarding
>
> > The SG needs to allow access from the SG of the Jenkins main node on ports 22 / 5985 for the EC2 instance connection during the build.
>
> We already have that in place. https://github.com/opensearch-project/opensearch-ci/blob/main/lib/security/ci-security-groups.ts#L45-L47

See #258 (comment)

@zelinh
Member

zelinh commented Mar 21, 2023

New Docker image ubuntu2004-x64-docker-buildx0.6.3-qemu5.0-awscli1.22-jdk11-v2, which includes Packer, has been built and pushed here.

@peterzhuamazon
Member

We will use the same agent node SG after making sure the Jenkins agents are all running on a private subnet.

@peterzhuamazon
Member

This is completed.

@peterzhuamazon peterzhuamazon linked a pull request Jul 10, 2024 that will close this issue
4 participants