Skip to content

Working example of getting a production like OKD4 environment up and running in a Vagrant environment. Useful for learning and testing

Notifications You must be signed in to change notification settings

sc0ttes/vagrant_ansible_okd4

 
 

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

43 Commits
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

Mac OSX > Vagrant > Ansible > OKD4.12

OKD4 is the upstream community version of Red Hat's OpenShift Container Platform. This system, which is based on Kubernetes and containers, enables you to run you own Platform-as-a-Service on your own hardware or deploy it into a cloud platform such as Google Cloud or AWS.

These instructions are based on @timhughes upstream repository and https://docs.okd.io/4.12/installing/installing_platform_agnostic/installing-platform-agnostic.html

I have built this as a learning experience and potential use as a playground environment for Systems Engineers who manage OKD4 clusters. At the end you should end up with a UPI install OKD4 cluster with:

  • One router/load-balancer/dhcp/dns system
  • Three control-plane systems
  • Two worker systems

If you run into any issues with these install instructions, it is highly recommended that you check the official OKD docs as these may quickly become out of date with newer versions as previous instructions have done.

Security

Don't expect this to be secure in any way. There are several things in here that I know to be insecure but this is for setting up a test environment. You have been warned.

Prerequisites

*** NOTE *** Because of autogenerated certificate expiry dates and such, you will run into issues if not running this deployment straight through. Leaving the installation one day to pick it up the next day may cause unexpected issues and the install may fail. To ensure success, after troubleshooting and fixing any issues you may run into during the install, wipe everything and start fresh again.

This walkthrough is based on my working environment.

  • Mac OSX Ventura 13.2.1
  • VMWare Fusion 12.2.5
  • Vagrant 2.3.4
  • Ansible 2.14.3

Hardware: 16 core processor, 64GB RAM, 1TB SSD

*** IMPORTANT *** Set your OSX time to GMT/UTC by going into the Settings -> General -> Date & Time -> Unset "Set time zone automatically" and put in a GMT timezone city in "Closest city" ( "Reykjavik - Iceland" should work)

*** IMPORTANT *** The machine requirements (https://docs.openshift.com/container-platform/4.12/installing/installing_bare_metal/installing-bare-metal.html#installation-minimum-resource-requirements_installing-bare-metal) must be adhered to or your master/cp nodes will not boot up correctly. You may be able to skimp on storage space but not CPU/RAM. Every VM in the Vagrantfile has been overprovisioned beyond minimums simply to speed up deployment. After the machine has booted up, the resource consumption falls to expected levels.

You will also need some way of accessing the dns inside the cluster. This will manually add mappings to your hosts file. Alternatively, setting up dnsmasq on your OSX box as @timhughes post suggested and following the instructions here https://passingcuriosity.com/2013/dnsmasq-dev-osx/ will likely work better than the below command if you plan on doing anything more than just simply OKD setup.

sudo sh -c 'echo "192.168.100.2 api.kube1.vm.test console-openshift-console.apps.kube1.vm.test oauth-openshift.apps.kube1.vm.test" >> /etc/hosts'

Get the code

Step one is to clone the code which is available on Github.

git clone https://github.com/sc0ttes/vagrant_ansible_okd4
cd vagrant_ansible_okd4

The working directory for for all the commands is assumed to be the root directory of the cloned git repository.

Getting the CLI tools and preparing

Make some directories to use.

export PATH=${PWD}/bin/:$PATH

Generating an SSH keypair

ssh-keygen -o -a 100 -t ed25519 -f ssh_key/id_ed25519  -C "example okd key"

Discover the revision of the latest OKD release and set it as a variable we can use throughout the rest of the process. If you open another terminal you will need to run this again.

export OKD_RELEASE=$(curl --silent "https://api.github.com/repos/okd-project/okd/releases/latest" | grep -E -o '"tag_name": ".*"' | sed -r 's/"tag_name": "(.*)"/\1/g')

Download the installation file and command line tool oc on a local computer.

cd tmp/
curl -LO https://github.com/okd-project/okd/releases/download/${OKD_RELEASE}/openshift-install-mac-${OKD_RELEASE}.tar.gz
curl -LO https://github.com/okd-project/okd/releases/download/${OKD_RELEASE}/openshift-client-mac-${OKD_RELEASE}.tar.gz
cd ..

Extract the installation program and put it somewhere on your PATH.

tar xvf tmp/openshift-install-mac-${OKD_RELEASE}.tar.gz --directory tmp/
mv tmp/openshift-install bin/
tar xvf tmp/openshift-client-mac-${OKD_RELEASE}.tar.gz --directory tmp/
mv tmp/{oc,kubectl} bin/

Create an installation directory

mkdir -p webroot/os_ignition

If you have used the directory before then you need to remove the old files.

rm -rf webroot/os_ignition/*
rm -rf webroot/os_ignition/.openshift*

Creating the boot configs

Create an install-config.yaml inside the installation directory and customize it as per the documentation on https://docs.okd.io/latest/installing/installing_bare_metal/installing-bare-metal.html#installation-bare-metal-config-yaml_installing-bare-metal Note that you do not need valid value for pullSecret because that is for the Red Hat supported OCP:

cat <<-EOF > webroot/os_ignition/install-config.yaml
apiVersion: v1
baseDomain: vm.test
compute:
- hyperthreading: Enabled
  name: worker
  replicas: 0
controlPlane:
  hyperthreading: Enabled
  name: master
  replicas: 3
metadata:
  name: kube1
networking:
  clusterNetwork:
  - cidr: 10.128.0.0/14
    hostPrefix: 23
  networkType: OpenShiftSDN
  serviceNetwork:
  - 192.168.101.0/24  # different to host network
platform:
  none: {}
fips: false
pullSecret: '{"auths":{"xxxxxxx": {"auth": "xxxxxx","email": "xxxxxx"}}}'
sshKey: '$(cat ssh_key/id_ed25519.pub)'
EOF

You can add a corporate CA certificate and proxy servers in the install-config.yaml if required:

apiVersion: v1
baseDomain: my.domain.com
proxy:
    httpProxy: http://<username>:<pswd>@<ip>:<port>
    httpsProxy: http://<username>:<pswd>@<ip>:<port>
    noProxy: example.com
additionalTrustBundle: |
    -----BEGIN CERTIFICATE-----
    <MY_TRUSTED_CA_CERT>
    -----END CERTIFICATE-----

Creating the Kubernetes manifest and Ignition config files:

openshift-install create manifests --dir=webroot/os_ignition

If you don't want pods to run on your contorl plane master nodes, run the below command. This forces any newly created pods to run on only worker nodes.

sed -i.bak 's/  mastersSchedulable: true/  mastersSchedulable: false/g' webroot/os_ignition/manifests/cluster-scheduler-02-config.yml

And finally:

openshift-install create ignition-configs --dir=webroot/os_ignition #--log-level=debug

Besides the ignition files this also create the kubernetes authentication files and they shouldn't be uploaded to a web servers.

#TODO copy the ignition files to a web server with out the auth files

Download the install images and sig files into webroot/images/. You can do this by hand or use the following command to grab the latest automatically:

cd webroot/images
curl -s "https://builds.coreos.fedoraproject.org/streams/stable.json"|jq '.architectures.x86_64.artifacts.metal.formats| .pxe,."raw.xz"|.[].location' | xargs -n 1 curl -LO
curl -s "https://builds.coreos.fedoraproject.org/streams/stable.json"|jq '.architectures.x86_64.artifacts.metal.formats| .pxe,."raw.xz"|.[].location,.[].signature'| xargs -n 1 curl -LO
cd ../../

Update webroot/boot.ipxe.cfg to match the filename and versions you downloaded.

set okd-kernel fedora-coreos-[VERSION]-live-kernel-x86_64
set okd-initrd fedora-coreos-[VERSION]-live-initramfs.x86_64.img
set okd-image fedora-coreos-[VERSION]-metal.x86_64.raw.xz

Setup a private network for VMWare Fusion

Run the following commands to configure vmnet7 (or choose an alternative vmnet to configure)

sudo /Applications/VMware\ Fusion.app/Contents/Library/vmnet-cfgcli vnetcfgadd VNET_7_DHCP no
sudo /Applications/VMware\ Fusion.app/Contents/Library/vmnet-cfgcli vnetcfgadd VNET_7_DISPLAY_NAME okd-internal
sudo /Applications/VMware\ Fusion.app/Contents/Library/vmnet-cfgcli vnetcfgadd VNET_7_HOSTONLY_SUBNET 192.168.100.0
sudo /Applications/VMware\ Fusion.app/Contents/Library/vmnet-cfgcli vnetcfgadd VNET_7_HOSTONLY_NETMASK 255.255.255.0
sudo /Applications/VMware\ Fusion.app/Contents/Library/vmnet-cfgcli vnetcfgadd VNET_7_VIRTUAL_ADAPTER yes
sudo /Applications/VMware\ Fusion.app/Contents/Library/vmnet-cli --configure
sudo /Applications/VMware\ Fusion.app/Contents/Library/vmnet-cli --stop
sudo /Applications/VMware\ Fusion.app/Contents/Library/vmnet-cli --start

Some helpful commands you may use regularly and troubleshooting tips

  • Wipe all vagrant boxes: vagrant destroy -f /cp[0-9]/ /worker[0-9]/ bootstrap lb
  • sudo may fix your issue
  • journalctl just to see if anything is happening on the machine in question
  • ss -ant to see what network connections are currently established
  • oc get clusteroperator/network check OKD network status
  • oc describe node cp0 check cp0/node's status

The load-balancer system

Start the first vagrant system that will provide the load-balancer and DHCP/DNS services. In a production environment this would be a part of your infrastructure. The following vagrant command should build the lb system and provision it using Ansible:

vagrant up lb --provider vmware_fusion

The lb system is ready when you can access the HAProxy web interface.

Username: test Password: test

You can access the lb system via vagrant:

vagrant ssh lb

*** NOTE *** If you plan on running pods on your master nodes and aren't deploying any worker nodes, you'll need to edit the /etc/haproxy/haproxy.cfg file accordingly. Add the master server lines to the backend router_https and backend router_http like so server master-0 cp0.kube1.vm.test:443 check with the correct server numbers and ports.

*** NOTE *** If you are planning to setup persistent volumes via NFS (likely), it is suggested that you add a second hard drive to this machine with 100GB+ so that it can be served to OKD via NFS. Here are some quick and dirty instructions:

  1. Shutdown the LB
  2. Add an additional HDD with 100GB+
  3. Power on the LB
  4. Create the partition for /dev/sdb and format it: https://www.digitalocean.com/community/tutorials/create-a-partition-in-linux
  5. Mount the drive to a folder
  6. Enable the NFS server for the folder: https://dev.to/prajwalmithun/setup-nfs-server-client-in-linux-and-unix-27id
  7. Verify your NFS mount: https://www.cyberciti.biz/faq/howto-see-shares-on-nfs-server-exported-filesystems/

Creating iPXE drivers for VMWare Fusion

VMWare Fusion does not support iPXE out of the box apparently. To support iPXE, you need to specify an iPXE driver in the VMX file. This repo has a precompiled iPXE driver included but if you find yourself needing to build from scratch, running the following commands on the load balancer machine that was just created above will create them.

sudo yum install -y git gcc gcc-c++ make zlib-devel binutils-devel xz-devel
cd /tmp
git clone https://github.com/ipxe/ipxe.git
cd ipxe/src/
make vmware

Pull the e1000e driver(s) from the LB box to OSX running this command in the base directory of this repo on OSX:

scp -i ./.vagrant/machines/lb/vmware_fusion/private_key [email protected]:/tmp/ipxe/src/bin/808610d3.mrom .

Building the Bootstrap system

Start a web server locally with ./webroot as the root directory. The simplest way is to use the webserver built into python. This web server serves the ipxe configs and all the installation files. 🔥 the ./webroot/os_ignition/auth directory contains files that dont need to be and should never be accessable, I have just been lazy here since it is a network local to my workstation.

python -m http.server --bind 192.168.100.1 --directory ./webroot 8000

*** IMPORTANT *** VMWare Fusion will sit in a PXE network boot loop until it's pulled out of it. After the machine has entered iPXE boot but before you see the machine restart and begin looking to PXE boot again (after "Configuring (net0 )..."), go into the settings for the machine and click Startup Disk. Change the boot device to the Hard Disk.

Start the bootstrap system. This should ipxe boot over http.

vagrant up bootstrap --provider vmware_fusion

After a short while you should see the following in the weblogs.

Serving HTTP on 192.168.100.1 port 8000 (http://192.168.100.1:8000/) ...
192.168.100.5 - - [16/Jul/2020 17:33:05] "GET /bootstrap.ipxe HTTP/1.1" 200 -
192.168.100.5 - - [16/Jul/2020 17:33:05] "GET /boot.ipxe.cfg HTTP/1.1" 200 -
192.168.100.5 - - [16/Jul/2020 17:33:05] "GET /images/fedora-coreos-32.20200629.3.0-live-kernel-x86_64 HTTP/1.1" 200 -
192.168.100.5 - - [16/Jul/2020 17:33:05] "GET /images/fedora-coreos-32.20200629.3.0-live-initramfs.x86_64.img HTTP/1.1" 200 -
192.168.100.5 - - [16/Jul/2020 17:33:24] "GET /os_ignition/bootstrap.ign HTTP/1.1" 200 -
192.168.100.5 - - [16/Jul/2020 17:33:25] "GET /images/fedora-coreos-32.20200629.3.0-metal.x86_64.raw.xz.sig HTTP/1.1" 200 -
192.168.100.5 - - [16/Jul/2020 17:33:25] "GET /images/fedora-coreos-32.20200629.3.0-metal.x86_64.raw.xz HTTP/1.1" 200 -

Change the Startup Disk to the Hard Disk

Leave the bootstrap VM at the login screen. Things are happening in the background which will pop up on the screen as they happen. This install bit takes a while.

If you'd like to view the progress:

ssh -o StrictHostKeyChecking=no -i ssh_key/id_ed25519 [email protected]

# to view the progress logs
journalctl -b -f -u bootkube.service

The bootstrap system is ready when it becomes available in the load-balancer and Vagrant has finished running successfully. You can examine if the required services are available by looking at the load-balancer stats page. Make sure that the bootstrap system is available under both the kubernetes_api and machine_config backend headers. They should go green.

Control Plane systems and Worker systems

*** IMPORTANT *** As above, you will need to change all the Startup Disk to Hard Disk for all of these workers being started below

Start the rest of the systems. This requires 3 control plane (cp) systems and at least 2 worker systems. The following commands will start these up, three of each.

vagrant up /cp[0-9]/ --provider vmware_fusion

vagrant up /worker[0-9]/ --provider vmware_fusion

To run this a bit quicker in parallel, you could run these in multiple terminals/tabs or run them in the background like so:

vagrant up cp0 --provider vmware_fusion & vagrant up cp1 --provider vmware_fusion & vagrant up cp2 --provider vmware_fusion &

You need to remove the bootstrap system when it has finished doing the initial setup of the cp systems. The following command will monitor the bootstrap progress and report when it is complete.

$ openshift-install --dir=webroot/os_ignition wait-for bootstrap-complete --log-level=debug
DEBUG OpenShift Installer 4.4.6
DEBUG Built from commit 99e1dc030910ccd241e5d2563f27725c0d3117b0
INFO Waiting up to 20m0s for the Kubernetes API at https://api.kube1.vm.test:6443...
INFO API v1.17.1+f63db30 up
INFO Waiting up to 40m0s for bootstrapping to complete...
DEBUG Bootstrap status: complete
INFO It is now safe to remove the bootstrap resources

You can also follow along the logs on the bootstrap system. After what feels like a year the logs on bootstrap will have something like this appear.

[core@bootstrap ~]$ journalctl -b -f -u bootkube.service

Jun 22 21:20:58 bootstrap bootkube.sh[7751]: All self-hosted control plane components successfully started
Jun 22 21:20:59 bootstrap bootkube.sh[7751]: Sending bootstrap-success event.Waiting for remaining assets to be created.
# a whole lot more crap
Jun 22 21:22:36 bootstrap bootkube.sh[7751]: bootkube.service complete

Destroy the bootstrap system

vagrant destroy bootstrap

Looking in the HAProxy page http://192.168.100.2:8404/stats you will see that the cp systems have all gone green and the bootstrap system is now red

Access the cluster with oc

export KUBECONFIG=webroot/os_ignition/auth/kubeconfig


$ watch -n5 oc get nodes
NAME   STATUS   ROLES    AGE   VERSION
cp0    Ready    master   26m   v1.17.1
cp1    Ready    master   25m   v1.17.1
cp2    Ready    master   25m   v1.17.1

For a new worker system to join the cluster it will need it's certificate approved. This can be automated in a cloud provider but for a bare-metal cluster you need to do it by hand or automate it in some way. Keep an eye on this while the workers are building as they have 2 sets of certificates which need signing.

$ oc get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-6jxjn   36m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8fztn   10m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-8hmrl   10m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-8xzzz   36m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-cktk4   36m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-kktkw   36m   kubernetes.io/kubelet-serving                 system:node:cp2                                                             Approved,Issued
csr-lkxhx   10m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Pending
csr-qlhmk   35m   kubernetes.io/kubelet-serving                 system:node:cp0                                                             Approved,Issued
csr-wgjdd   36m   kubernetes.io/kubelet-serving                 system:node:cp1                                                             Approved,Issued

You can sign them one at a time but this does it all at once.

oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-8fztn approved
certificatesigningrequest.certificates.k8s.io/csr-8hmrl approved
certificatesigningrequest.certificates.k8s.io/csr-lkxhx approved

This will allow the nodes to register and they will almost immediatly request a certificate of their own which will need signing.

$ oc get csr
NAME        AGE   SIGNERNAME                                    REQUESTOR                                                                   CONDITION
csr-6jxjn   37m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8fztn   11m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8hmrl   11m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-8xzzz   37m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-cktk4   37m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-fb9qk   34s   kubernetes.io/kubelet-serving                 system:node:worker-192-168-100-228.kube1.vm.test                            Pending
csr-hjqv6   33s   kubernetes.io/kubelet-serving                 system:node:worker-192-168-100-230.kube1.vm.test                            Pending
csr-j42cq   34s   kubernetes.io/kubelet-serving                 system:node:worker-192-168-100-238.kube1.vm.test                            Pending
csr-kktkw   37m   kubernetes.io/kubelet-serving                 system:node:cp2                                                             Approved,Issued
csr-lkxhx   11m   kubernetes.io/kube-apiserver-client-kubelet   system:serviceaccount:openshift-machine-config-operator:node-bootstrapper   Approved,Issued
csr-qlhmk   36m   kubernetes.io/kubelet-serving                 system:node:cp0                                                             Approved,Issued
csr-wgjdd   37m   kubernetes.io/kubelet-serving                 system:node:cp1                                                             Approved,Issued

Because there are still come pending you will need to sign them as well.

$ oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs oc adm certificate approve
certificatesigningrequest.certificates.k8s.io/csr-fb9qk approved
certificatesigningrequest.certificates.k8s.io/csr-hjqv6 approved
certificatesigningrequest.certificates.k8s.io/csr-j42cq approved

Watch the cluster operators start up.

watch -n5 oc get clusteroperators

Eventually, everything will go to True in the AVAILABLE column.

NAME                                       VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE
authentication                             4.4.6     True        False         False      30m
cloud-credential                           4.4.6     True        False         False      80m
cluster-autoscaler                         4.4.6     True        False         False      46m
console                                    4.4.6     True        False         False      30m
csi-snapshot-controller                    4.4.6     True        False         False      38m
dns                                        4.4.6     True        False         False      65m
etcd                                       4.4.6     True        False         False      65m
image-registry                             4.4.6     True        False         False      57m
ingress                                    4.4.6     True        False         False      40m
insights                                   4.4.6     True        False         False      57m
kube-apiserver                             4.4.6     True        False         False      64m
kube-controller-manager                    4.4.6     True        False         False      64m
kube-scheduler                             4.4.6     True        False         False      64m
kube-storage-version-migrator              4.4.6     True        False         False      39m
machine-api                                4.4.6     True        False         False      66m
machine-config                             4.4.6     True        False         False      34m
marketplace                                4.4.6     True        False         False      57m
monitoring                                 4.4.6     True        False         False      29m
network                                    4.4.6     True        False         False      67m
node-tuning                                4.4.6     True        False         False      68m
openshift-apiserver                        4.4.6     True        False         False      40m
openshift-controller-manager               4.4.6     True        False         False      47m
openshift-samples                          4.4.6     True        False         False      22m
operator-lifecycle-manager                 4.4.6     True        False         False      66m
operator-lifecycle-manager-catalog         4.4.6     True        False         False      66m
operator-lifecycle-manager-packageserver   4.4.6     True        False         False      51m
service-ca                                 4.4.6     True        False         False      68m
service-catalog-apiserver                  4.4.6     True        False         False      68m
service-catalog-controller-manager         4.4.6     True        False         False      68m
storage                                    4.4.6     True        False         False      57m

Use the following command to check when the install is completed.

openshift-install --dir=webroot/os_ignition wait-for install-complete

The output will hang and wait for the installation to complete. Eventually you should end up with output similar to the following:

INFO Waiting up to 30m0s for the cluster at https://api.kube1.vm.test:6443 to initialize...
INFO Waiting up to 10m0s for the openshift-console route to be created...
INFO Install complete!
INFO To access the cluster as the system:admin user when using 'oc', run 'export KUBECONFIG=/home/timhughes/git/vagrant_ansible_okd4/webroot/os_ignition/auth/kubeconfig'
INFO Access the OpenShift web-console here: https://console-openshift-console.apps.kube1.vm.test
INFO Login to the console with user: kubeadmin, password: xxxxx-xxxxx-xxxxx-xxxxx

Access the web interface.

To login for the first time you should use the user kubeadmin and the password shown in the last section. If you have missed that then it is available in the configs you set up at the beginning.

cat webroot/os_ignition/auth/kubeadmin-password

Tags: OKD4, OCP4, OpenShift4, OKD4.12, Docker Compose, Vagrant, Mac, OSX, install

About

Working example of getting a production like OKD4 environment up and running in a Vagrant environment. Useful for learning and testing

Resources

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published

Languages

  • Shell 100.0%