| title | layout | categories |
|---|---|---|
| Creating the build infrastructure | main | infrastructure |
The {{site.experiment}} build infrastructure consists of two kinds of nodes: masters, responsible for scheduling jobs, and agents, responsible for executing jobs and services. They are generally provisioned using the CERN OpenStack infrastructure and configured using the CERN Puppet / Foreman setup.
Masters belong to the Puppet hostgroup `{{site.master_hostgroup}}`, while agents belong to `{{site.slave_hostgroup}}`. The configuration of those hostgroups can be found in the Git repository for `it-puppet-hostgroup-alibuild`, in particular in:

- `/code/manifests/{{site.master_hostgroup}}.pp` for the master.
- `/code/manifests/{{site.slave_hostgroup}}.pp` for the slaves.
In particular, we have three masters, each running in a separate OpenStack availability zone. They work in High Availability (HA) mode, which allows the ensemble to keep working correctly and scheduling jobs even if one of the machines goes down. A diagram of the services running on the masters can be found below:
In particular, the masters run the following services:

- The Mesos Master service: Mesos is used to schedule some of the Jenkins jobs automatically on the cluster and to automate the deployment of some of the services, in particular using the Marathon setup.
- ZooKeeper: the backend which keeps track of the Mesos distributed state, actually providing the HA setup.
- The Marathon service: a simple Platform as a Service (PaaS), implemented as a Mesos framework, which allows defining, launching and monitoring long-running services on the slaves. It relies on Mesos to do the resource management.
- Getting access to the OpenStack / Puppet infrastructure
- Creating a master
- Rebuild a master
- Creating an agent
- Reboot an agent
- Delete an agent
{: #setup}
First of all, make sure you have all the rights to create machines in OpenStack and to administer them via Puppet.
To get the OpenStack access rights, you should ask to become a member of the `alice-vm-admin` egroup. To get the Puppet rights, you should ask to become a member of the `alice-agile-admin` egroup. This can be done using the usual egroups interface.
Once you have the rights to use OpenStack, log in to the CERN OpenStack administration machines, `aiadm.cern.ch`, and obtain the OpenStack credentials by running:

    eval $(ai-rc "{{site.openstack_project}}")
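The `eval $(...)` idiom works because the inner command prints shell statements that the current shell then executes. The sketch below illustrates the pattern with a hypothetical stand-in, `fake_ai_rc`, which is not the real `ai-rc`; the variable it prints is an assumption chosen for illustration, and the real tool's output may differ.

```shell
# Hypothetical stand-in for ai-rc: it only prints an export statement.
# The variable name is an assumption made for this illustration.
fake_ai_rc() {
    printf 'export OS_PROJECT_NAME="%s"\n' "$1"
}

# eval executes the printed statement in the current shell, so the
# exported variable is still set after the command returns.
eval "$(fake_ai_rc "ALICE Release Testing")"
echo "$OS_PROJECT_NAME"
```

Running the command without `eval` would only print the statement and leave the environment of the current shell unchanged.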
You can now execute the various OpenStack commands using the CLI tool called `openstack`. An exhaustive list of all the available options can be obtained via `openstack help -h`, but for the process of spawning new machines you probably only care about:

- `openstack server list`: list the machines in the {{site.openstack_project}} project.
- `openstack image list`: list the OS images you can use. In particular, the build nodes should use the latest {{site.openstack_image}} ones.
- `openstack flavor list`: list the available flavors of virtual machines (i.e. how many CPUs, how much RAM).
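When scripting around these commands, the plain-value output of the client is easier to process than the default table. The sketch below assumes the client's output formatter options `-f value -c Name -c Status`; the helper name `not_active` is hypothetical, and the example runs on canned input rather than a live query.

```shell
# Print the names of servers whose status is not ACTIVE.
# Expects "<name> <status>" lines on stdin, as produced by:
#   openstack server list -f value -c Name -c Status
not_active() {
    awk '$2 != "ACTIVE" { print $1 }'
}

# Example with canned input instead of a live cloud query:
printf 'alibuild01 ACTIVE\nalibuild02 SHUTOFF\n' | not_active
```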
Further information on how CERN OpenStack cloud works can be found here.
Note that you will have to log in as `root` on all the machines.
If there are issues with one of the masters, go through the following checklist:

- Check on the OpenStack Dashboard that the machine is up and running.
- Check in Foreman whether there are any Puppet errors.
- Ping the machine.
- SSH into the machine.
- Check that Docker is running and that it has at least the following containers: `mesos-master`, `zookeeper`, `marathon`, `aurora-scheduler`, `mesos-dns`.
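The last item of the checklist can be scripted. The following is a minimal sketch, assuming the containers carry exactly the names listed above; the helper `missing_containers` is hypothetical and reads the running container names from standard input, so it can be tried without Docker.

```shell
# Containers expected on a master, taken from the checklist above.
REQUIRED="mesos-master zookeeper marathon aurora-scheduler mesos-dns"

# Read running container names (one per line) on stdin and
# print every required name that is absent.
missing_containers() (
    running=$(cat)
    for name in $REQUIRED; do
        printf '%s\n' "$running" | grep -qxF -- "$name" || echo "$name"
    done
)

# Example with canned input; here aurora-scheduler and mesos-dns
# are reported as missing.
printf 'mesos-master\nzookeeper\nmarathon\n' | missing_containers
```

On a master, the names could be fed in with `docker ps | awk 'NR > 1 { print $NF }'`, which relies on the name being the last column of the default `docker ps` output.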
{: #create-master}
The creation of masters in the CERN Foreman setup is described at http://cern.ch/config/nodes/createnode.html. The short recipe for a build machine is:

- Log in to `aiadm.cern.ch`.
- Set up your OpenStack environment by running:

      eval $(ai-rc "{{site.experiment}} Release Testing")

- To spawn a machine you need to use the `ai-bs` wrapper, which takes care of provisioning the machine and registering it in Foreman, so that it receives its Puppet configuration from there:

      MACHINE_NAME=<{{site.exp_prefix}}mesosXX>
      ZONE=cern-geneva-<X>   # <X> needs to be in {a,b,c}
                             # make sure you use different zones
                             # to improve availability.
      ai-bs -g {{site.master_hostgroup}} \
            --foreman-environment alibuild_devel \
            --{{site.openstack_image | downcase}} \
            --nova-sshkey {{site.builduser}} \
            --nova-availabilityzone $ZONE \
            --nova-flavor {{site.openstack_master_flavor}} \
            --landb-mainuser alice-agile-admin \
            --landb-responsible alice-agile-admin \
            --nova-attach-new-volume vdb=200GB \
            $MACHINE_NAME
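To keep the masters spread over different availability zones, the zone can be derived from the master number instead of being typed by hand. This is only a sketch: the helper `zone_for_master` is hypothetical, and the round-robin mapping is an assumption, not an established convention of this setup.

```shell
# Map a master number to one of the three zones named in the recipe,
# so that masters 1, 2 and 3 end up in zones a, b and c.
# The argument must be a plain decimal number without leading zeros.
zone_for_master() {
    case $(( ($1 - 1) % 3 )) in
        0) echo "cern-geneva-a" ;;
        1) echo "cern-geneva-b" ;;
        2) echo "cern-geneva-c" ;;
    esac
}

for n in 1 2 3; do
    echo "master $n -> $(zone_for_master "$n")"
done
```

The result could then be assigned to `ZONE` before calling `ai-bs`.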
{: #create-agent}
The creation of Mesos agents in the CERN Foreman setup is described at http://cern.ch/config/nodes/createnode.html. The short recipe for a build machine is:

- Log in to `aiadm.cern.ch`.
- Set up your OpenStack environment by running:

      eval $(ai-rc "ALICE Release Testing")

- Specify a few parameters for the machine you want to spawn:

      MACHINE_NAME=<alibuildXX>

- To spawn a machine you need to use the `ai-bs` wrapper, which takes care of provisioning the machine and registering it in Foreman, so that it receives its Puppet configuration from there:

      ai-bs -g alibuild/mesos/slave \
            --foreman-environment alibuild_devel \
            --cc7 \
            --nova-sshkey alibuild \
            --nova-flavor m2.2xlarge \
            --landb-mainuser alice-agile-admin \
            --landb-responsible alice-agile-admin \
            --nova-attach-new-volume vdb=500GB:type=io1 \
            $MACHINE_NAME
This will spawn a new machine. You can check the boot status either in the OpenStack GUI or via `openstack server list`. Of course, you should change the name of the machine (`<alibuildXX>` in the example).

- In order to make sure that the machine is correctly up and running, you should:
  - ping it
  - ssh to it
  - run `puppet agent -t -v` until no errors are reported
  - execute `docker pull alisw/slc7-builder` to force pulling the builder image.
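The "run until no errors are reported" step can be written as a bounded retry loop. The sketch below assumes that `puppet agent -t` reports detailed exit codes, where 0 means no changes and 2 means changes were applied without failures; any other code is treated as a failed run. The helper `run_until_clean` and the stub `fake_puppet` are hypothetical names used only for this illustration.

```shell
# Run a command up to MAX times; succeed as soon as it exits with
# 0 (no changes) or 2 (changes applied, no failures).
# Usage: run_until_clean MAX command [args...]
run_until_clean() {
    max=$1
    shift
    attempt=1
    while [ "$attempt" -le "$max" ]; do
        rc=0
        "$@" || rc=$?
        if [ "$rc" -eq 0 ] || [ "$rc" -eq 2 ]; then
            return 0
        fi
        attempt=$((attempt + 1))
    done
    return 1
}

# Stand-in for a Puppet run that applied changes without failures.
fake_puppet() { return 2; }

if run_until_clean 3 fake_puppet; then
    echo "configuration converged"
fi
```

On a freshly created machine the call would look like `run_until_clean 5 puppet agent -t -v`; the limit of 5 attempts is an arbitrary choice.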
{: #rebuild-master}
Rebuilding a master is a potentially disruptive operation, since our Mesos setup requires at least 2 masters to be up and running in order to schedule new jobs. Therefore, before you actually decide to rebuild one, you should:

- Discuss with your colleagues whether that's a good idea.
- Make sure that the other two masters are functioning properly.
- If the master is the currently leading master, force a leadership transition to one of the other two machines before the rebuild (optional, since failover will take care of that).
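The "at least 2 masters" requirement matches a simple majority of the three masters. The arithmetic can be sketched as follows, assuming the quorum is configured as a majority of the ensemble; the helper names are hypothetical.

```shell
# Majority quorum for an ensemble of N masters: floor(N / 2) + 1.
quorum() {
    echo $(( $1 / 2 + 1 ))
}

# Succeed if UP masters out of TOTAL still form a quorum.
# Usage: has_quorum TOTAL UP
has_quorum() {
    [ "$2" -ge "$(quorum "$1")" ]
}

total=3
echo "quorum for $total masters: $(quorum "$total")"
if has_quorum "$total" 2; then
    echo "scheduling continues while one master is rebuilt"
fi
```

With three masters, taking one down for a rebuild leaves exactly the quorum, which is why the other two must be healthy before starting.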
In order to perform the rebuild you need to:

- Log in to `aiadm.cern.ch`.
- Set up your OpenStack environment by running:

      eval $(ai-rc "{{site.experiment}} Release Testing")

- Actually rebuild the machine:

      ai-rebuild-vm --cc7 {{site.exp_prefix}}mesosXX

  It can take up to one hour for the process to complete.
- In order to make sure that the machine is correctly up and running, you should:
  - ping it
  - ssh to it
  - run `puppet agent -t -v` until no errors are reported
{: #rebuild-master}
YOU SHOULD NEVER REBUILD ALIBUILD03 and ALIBUILD09
Rebuilding an agent is potentially a problem, since the Mesos machine might be doing something, e.g. building a release, which should in general not be interrupted. Therefore you need to:

- Discuss with your colleagues whether that's a good idea.
- Verify that the machine is not running any particularly important task, by looking at the report in the Mesos GUI. If in doubt, ask.
In order to perform the rebuild you need to:

- Log in to `aiadm.cern.ch`.
- Set up your OpenStack environment by running:

      eval $(ai-rc "ALICE Release Testing")

- Actually rebuild the machine:

      ai-rebuild-vm --cc7 alibuildXX

- In order to make sure that the machine is correctly up and running, you should:
  - ping it
  - ssh to it
  - run `puppet agent -t -v` until no errors are reported. If you keep having errors after a few runs, report them.
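Since two machines must never be rebuilt, a small guard in front of the rebuild command can catch mistakes. This is a sketch only: the helper `may_rebuild` is hypothetical, and the protected list simply repeats the two hosts named in the warning of this section.

```shell
# Hosts that must never be rebuilt, per the warning in this section.
PROTECTED="alibuild03 alibuild09"

# Succeed only if the given host is not in the protected list.
# The comparison ignores case and any domain suffix.
may_rebuild() {
    host=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
    host=${host%%.*}
    for p in $PROTECTED; do
        if [ "$host" = "$p" ]; then
            return 1
        fi
    done
    return 0
}

for h in alibuild03 alibuild12; do
    if may_rebuild "$h"; then
        echo "$h: ok to rebuild"
    else
        echo "$h: protected, refusing"
    fi
done
```

A wrapper script could call `may_rebuild "$MACHINE_NAME"` and only then run `ai-rebuild-vm`.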
{: #delete-agent}
WHATEVER YOU DO, NEVER DELETE ALIBUILD03 OR ITS ATTACHED VOLUME
Documentation on deleting a VM can be found at http://configdocs.web.cern.ch/configdocs/nodes/deletenode.html. The recipe for destroying agents is:

- Ask yourself why you are deleting the VM. Do it only if you want to get rid of it for good. If you want to recreate it immediately afterwards, e.g. to handle some irreversible fault in the installation, you might first want to try rebuilding it, since that will be much faster.
- Log in to `aiadm.cern.ch`.
- Set up your OpenStack environment by running:

      eval $(ai-rc "{{site.experiment}} Release Testing")

- Delete the VM with:

      ai-kill <{{site.builduser}}XX>

- Delete the previously attached volumes.
{: #reboot-server}
If there is an issue with any of the agents, a hard reboot can be attempted to bring it back to a working state. This can be done via the OpenStack GUI, in the Instances tab, or, in case the GUI is not functional, from the command line:

- Log in to `aiadm.cern.ch`.
- Set up your OpenStack environment by running:

      eval $(ai-rc "{{site.experiment}} Release Testing")

- Actually reboot `<server name>`:

      openstack server reboot --hard <server name>