Jenkins Multi Master support #373

Closed
jordarlu opened this issue Nov 13, 2023 · 11 comments
Labels: enhancement (New feature or request)

Comments

@jordarlu
Contributor

jordarlu commented Nov 13, 2023

Is your feature request related to a problem? Please describe

The existing Jenkins CI infrastructure serves as the exclusive system for executing a diverse range of critical tasks, including Gradle checks for Pull Requests (PR), release processes, benchmark tests, and various other functions.

Recently, Jenkins performance degraded on several occasions, possibly due to an escalating workload or other factors, which ultimately caused the Jenkins master node to go down and led to Jenkins service downtime. Details about the most recent incident and the steps taken to restore Jenkins can be found at opensearch-project/opensearch-build#4130.

We need a long-term solution that will be capable of handling the growing workload to prevent future instances of Jenkins failure.

Describe the solution you'd like

At a high level, the proposal is to split Jenkins into multiple Jenkins masters, with each master handling one set (category) of workloads and isolated from the other masters and their workloads.

Describe alternatives you've considered

In addition to the proposal above, we are open to any other proposals and ideas from the community to make Jenkins even better. Please feel free to comment and describe your suggestions.

Additional context

This issue serves as the main issue for implementing Jenkins Multi Master support.

As we progress, we will keep adding and updating comments, discussions, designs, and relevant issues and PRs to track all activity.

@jordarlu added the enhancement and untriaged labels and removed the untriaged label on Nov 13, 2023
@jordarlu self-assigned this on Nov 13, 2023

@peterzhuamazon
Member

I still have a question about this, given that each master will handle a portion of the workflows.

Let's say the master for the build workflows goes offline: will another master be able to pick up the workflows, or do we have to wait for the original one to come back online?

Thanks.

@rishabh6788
Collaborator

@jordarlu Thank you for taking this up. I am wondering if we can just add one more master node in the existing code, with settings similar to the existing one except for name and labels, and then register it as a new target group under the existing load balancer. We would then route traffic based on URL path: for example, ci.opensearch.org routes to the existing master, and ci.opensearch.org/performance routes to the new master.
@gaiksaya @prudhvigodithi @peterzhuamazon
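
The path-based routing described above can be sketched roughly as follows. This is only an illustration of the matching rule, not the project's load balancer configuration; the target names `jenkins-main` and `jenkins-performance` and the function `pick_target` are hypothetical.

```python
# Hypothetical routing table: URL path prefix -> target group name.
# Only the /performance prefix comes from the discussion; the names are made up.
ROUTES = {
    "/performance": "jenkins-performance",
}

# Hypothetical name for the existing master, used when no prefix matches.
DEFAULT_TARGET = "jenkins-main"


def pick_target(path: str) -> str:
    """Return the target for a request path, preferring the longest matching prefix."""
    best = ""
    for prefix in ROUTES:
        # Match the prefix itself or anything below it, on a segment boundary.
        matches = path == prefix or path.startswith(prefix + "/")
        if matches and len(prefix) > len(best):
            best = prefix
    return ROUTES[best] if best else DEFAULT_TARGET
```

Matching on a whole path segment keeps a path such as /performance-old from being sent to the new master by accident.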

@jordarlu
Contributor Author

I still have a question about this, given that each master will handle a portion of the workflows.

Let's say the master for the build workflows goes offline: will another master be able to pick up the workflows, or do we have to wait for the original one to come back online?

Thanks.

Hopefully, once we split Jenkins by job category (for example, one Jenkins for 'build', another for 'gradle-check', and another for 'benchmark'), we won't face this master-down issue anymore (if the root cause of the master going down was the workload). But that is a good point about having HA on the master.

@rishabh6788
Collaborator

I still have a question about this, given that each master will handle a portion of the workflows.

Let's say the master for the build workflows goes offline: will another master be able to pick up the workflows, or do we have to wait for the original one to come back online?

Thanks.

I would suggest keeping both masters mutually exclusive of each other and using them to distribute our jobs based on their functionality.

@gaiksaya
Member

Hey!

Just wondering: did we research whether having 2 masters will cause split-brain issues? Some time back I read about this on the Jenkins forum. It is worth researching a bit and experimenting with a local setup before we move to implementation. AFAIK Jenkins is not supposed to have more than one master, but I might be wrong and the technology might have evolved since I last read about it, so please do confirm.

@jordarlu
Contributor Author

jordarlu commented Nov 13, 2023

@jordarlu Thank you for taking this up. I am wondering if we can just add one more master node in the existing code, with settings similar to the existing one except for name and labels, and then register it as a new target group under the existing load balancer. We would then route traffic based on URL path: for example, ci.opensearch.org routes to the existing master, and ci.opensearch.org/performance routes to the new master. @gaiksaya @prudhvigodithi @peterzhuamazon

:) Wonderful! That is also the direction I understand we are moving toward. As an end result, we may end up having https://build.ci.opensearch.org/build/ for 'build', https://build.ci.opensearch.org/benchmark/ for 'benchmark', and https://build.ci.opensearch.org/gradlecheck/ for 'Gradle Check' (just to name a few as examples; we will certainly discuss how we want to categorize them). Thanks for the good suggestion.

@jordarlu
Contributor Author

Hey!

Just wondering: did we research whether having 2 masters will cause split-brain issues? Some time back I read about this on the Jenkins forum. It is worth researching a bit and experimenting with a local setup before we move to implementation. AFAIK Jenkins is not supposed to have more than one master, but I might be wrong and the technology might have evolved since I last read about it, so please do confirm.

Understood, thanks for bringing this up, @gaiksaya. Let me do more research on that. The original idea was to distribute the load across separate Jenkins masters (based on the assumption that last month's master downtimes were caused by the increasing workload) while continuing to use the same access FQDN. But if we can find a way to do HA on the master (without causing the issue you mentioned), that will be even better, I believe. I appreciate the consideration of all the possible downsides of HA and the experience sharing.

@prudhvigodithi
Member

Jenkins does not support multi-master with active-active load distribution; I assume there is some load balancing in the enterprise version: https://www.cloudbees.com/capabilities/continuous-integration. However, we have two options here.

  1. Active-passive Jenkins masters with HAProxy (load balancer) in front.
  2. Separate Jenkins masters for sets of builds.

I would go for option 2, as it has many advantages, such as Jenkins job-level isolation, easier upgrades, a smaller blast radius, easier management, and more. https://welltempereddeveloper.com/ci/cd/2019/04/08/jenkins-ha-multiple-masters.html
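
For comparison, the selection rule behind option 1 (active-passive) can be sketched as below. This is a simplified illustration under stated assumptions, not a working HA setup: the node names and the `is_healthy` callback are hypothetical, and a real deployment would leave this decision to the load balancer's health checks (HAProxy, for instance, lets a server be marked as `backup`).

```python
from typing import Callable, Optional, Sequence


def pick_active(nodes: Sequence[str], is_healthy: Callable[[str], bool]) -> Optional[str]:
    """Return the first healthy node in priority order, or None if all are down.

    `nodes` lists the primary first and the standby after it, so the standby
    only receives traffic while the primary fails its health check.
    """
    for node in nodes:
        if is_healthy(node):
            return node
    return None


# Hypothetical usage: the primary is down, so the standby is selected.
down = {"jenkins-primary"}
active = pick_active(["jenkins-primary", "jenkins-standby"], lambda n: n not in down)
```

Only one master serves traffic at a time under this rule, which is what distinguishes it from the active-active distribution that Jenkins does not support.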

@jordarlu
Contributor Author

jordarlu commented Nov 14, 2023

Jenkins does not support multi-master with active-active load distribution; I assume there is some load balancing in the enterprise version: https://www.cloudbees.com/capabilities/continuous-integration. However, we have two options here.

  1. Active-passive Jenkins masters with HAProxy (load balancer) in front.
  2. Separate Jenkins masters for sets of builds.

I would go for option 2, as it has many advantages, such as Jenkins job-level isolation, easier upgrades, a smaller blast radius, easier management, and more. https://welltempereddeveloper.com/ci/cd/2019/04/08/jenkins-ha-multiple-masters.html

Thanks for the insight, @prudhvigodithi. Should we explore both of the options you mentioned above, since they do not interfere with each other? While we separate the Jenkins masters per category of workload, could we still have a 'sort of' HA on each master to prevent a single point of failure?

@prudhvigodithi
Member

Sure, Jeff. Once the Jenkins masters are split, we can take that up as a new enhancement to add an active/passive mechanism; it should be easy, as the underlying data store is EFS.

@jordarlu
Contributor Author

jordarlu commented Jan 5, 2024

I am closing this issue as we are moving on to creating multiple Jenkins instances instead of splitting the master node, which should hopefully avoid confusion between the two. I will also create a new issue to track the multiple Jenkins instances feature.
