- Master Node
- API server - face of the K8s master; all communication goes through the API server
- Scheduler - schedules workloads (pods) onto worker nodes
- Controller manager - compares the state mentioned in a request [desired] with the actual state, then acts to reconcile them
- Etcd - distributed key value store - only stateful component - source of truth
- Worker Nodes
- Kubelet - takes requests from the master and fulfils them, reports node and pod status back to the master node
- Container runtime (Docker here) - to run containers - OCI compliant container engine, deals with the container abstraction
- Kube proxy - manages n/w rules on each worker node so that traffic sent to a Service reaches the right pods; pod IPs themselves are assigned by the CNI provider
- Pods
- Flow
- Client sends a request describing the state the infra should be kept in
- API server receives the request and saves it to Etcd
- Controller manager keeps watching (through the API server) for differences b/w current state and desired state
- Once it has been decided which pods need to be created or changed, the scheduler assigns each new pod to a worker node
- Kubelet on the worker node keeps watching the API server on the master node
- Kubelet uses the container runtime (Docker here) to spin up the containers of new pods with the mentioned configuration
- Pod IPs are assigned by the CNI provider; kube-proxy sets up the routes (iptables rules) that direct Service traffic to those pods
- In many setups (e.g. kubeadm, minikube) even Kubernetes components like the API server, controller manager, scheduler and kube-proxy run as pods
- CLI to communicate with k8 api server
- Restful communication
- kubectl [command] [type] [name] [flags]
- Commands - get, patch, delete
- Type - pods, services, jobs
- Flags - -o (wide)
- Connects to API server of K8 master node
- Use rest apis to do that
- Kubeconfig - info related to :
- Cluster info
- User info
- Namespace
- Default loc of kubeconfig - $HOME/.kube/config
- KUBECONFIG env var - points kubectl to a kubeconfig file other than the default
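A kubeconfig file ties the cluster, user and namespace together through a context. The following is only a sketch: the cluster name, user name, namespace, server URL and certificate paths are all hypothetical placeholders, not values from these notes.

```yaml
apiVersion: v1
kind: Config
current-context: dev-user@my-cluster          # context kubectl uses by default
clusters:
  - name: my-cluster                          # cluster info
    cluster:
      server: https://192.168.49.2:8443       # hypothetical API server URL
      certificate-authority: /path/to/ca.crt  # hypothetical path
users:
  - name: dev-user                            # user info
    user:
      client-certificate: /path/to/dev-user.crt  # hypothetical path
      client-key: /path/to/dev-user.key          # hypothetical path
contexts:
  - name: dev-user@my-cluster                 # context = cluster + user + namespace
    context:
      cluster: my-cluster
      user: dev-user
      namespace: dev
```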
- kubectl version
- kubectl version --short - client version and server version
- kubectl get nodes
- kubectl get nodes -o wide
- kubectl config view - to get cluster info, user info and namespace
- kubectl config get-contexts
- kubectl get pods
- kubectl get pods -A -o wide
- kubectl apply -f file.yaml
- kubectl delete <type>/<name>
- kubectl describe <type>/<name>
- kubectl get pods --show-labels
- kubectl get svc
- kubectl get endpoints
- kubectl describe endpoints svc-name
- kubectl rollout history deployment/<deployment-name>
- kubectl rollout undo deployment/<deployment-name> --to-revision=1
- kubectl cordon node-name -> no further pod will be scheduled here -> STATUS: SchedulingDisabled
- kubectl replace -f <file> -> Replaces the existing object with the one in the file; unlike apply, the object must already exist
- kubectl scale <type>/<name> --replicas=6
- minikube ip
- minikube ssh - to connect to minikube
- eval $(minikube docker-env) - to make docker point to minikube docker context
- -o json -> in json formatted API object
- -o name -> only name of the resource
- -o wide -> additional info in plain-text format
- -o yaml -> YAML formatted API object
- Persistent entities in the K8s system that represent the state of the system
- Includes:
- Spec - desired/requested state
- Status - current state
- Also called API resources
- Smallest deployable unit - pods
- Abstraction on top of pods - replica-set, stateful-set, daemon-set, job and cron-job, services and ingress
- Abstraction on top of Replicaset - Deployment
- Volumes, PVC,PV, Storage Class
- ConfigMap and Secrets
- Object descriptor YAML - to communicate our desired state
- Parts of object descriptor file:
- apiVersion,
- kind [of object],
- metadata [info about object, name - unique identifier, labels]
- Spec - actual specification of the object to be created
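As an illustration of the four parts, here is a minimal Pod descriptor. The object name and label are hypothetical; `nginx` is used only as a convenient public image.

```yaml
apiVersion: v1        # API version that defines this kind of object
kind: Pod             # kind of object to create
metadata:             # info about the object
  name: nginx-pod     # hypothetical name, unique within the namespace
  labels:
    app: nginx        # hypothetical label
spec:                 # desired state of the object
  containers:
    - name: nginx
      image: nginx
```

Applying it with `kubectl apply -f file.yaml` asks the API server to bring the cluster to this desired state.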
- Replication controller (same purpose as Replica Set) -
- Replica set is recommended,
- Replication controller is an older concept
- Replication controller's 'selector' under spec is optional and only equality based, whereas Replica Set requires it and also supports set-based selectors
- Selector helps a Replica Set adopt already running pods whose labels match, as well as matching pods that are started individually in future
- Smallest unit
- Run inside nodes
- Can run multiple pods in 1 node
- Pods are a wrapper over containers
- Multiple containers in a pod are possible and they share the same environment, but best practice is to run 1 container/pod unless the other containers are helpers such as monitoring/tracking apps
- Ring-fenced env
- Network stack
- Volume mounts
- Kernel namespace
- High level Pod lifecycle -
- Kubectl -> API server
- API server -> Etcd
- Scheduler (watching the API server) picks a node -> Node [kubelet/worker]
- Pod - Pending
- Pod - Running
- Pod - Succeeded / Failed
- Intra pod communication
- Containers within pod talk to each other via localhost
- They share the same n/w namespace, hence the same IP and port space
- Containers within a pod must listen on different ports to avoid port binding errors
- Inter pod communication
- Each pod gets its own private IP from the k8 cluster's pod network
- Container specs tags
- name
- image
- command
- args
- workingDir
- ports
- env
- resources
- volumeMounts
- livenessProbe
- readinessProbe
- lifecycle
- terminationMessagePath
- imagePullPolicy
- securityContext
- stdin
- stdinOnce
- tty
- Abstraction over pods, which ensures that a particular no. of pods is always running in the cluster
- Uses Reconciliation control loop -> Current state - Desired State - Observe-Diff-Act
- Ensures that a pod or homogeneous set of pods are always available
- Maintains desired no. of pods:
- Excess pods - killed
- Launch new pod - in case of fail/deleted/terminated
- Associated with pods via matching labels
- Labels: Key-Value pair attached to objects like pod - user defined
- Selectors: Help identify objects in cluster - equality based / set based
- apiVersion - apps/v1
- kind - ReplicaSet
- metadata - name, labels…
- spec -
- replicas
- selector - matchLabels - app
- template - pod specification - prevents specifying separate pod yaml
- Pods are spread across nodes by the scheduler (an even spread is attempted, not guaranteed)
- Deleting replica set -> deletes associated pods as well
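Putting the fields above together, a ReplicaSet descriptor might look like the sketch below. The names, the label `app: web` and the replica count are hypothetical.

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: web-rs
  labels:
    app: web
spec:
  replicas: 3              # desired number of pods
  selector:
    matchLabels:
      app: web             # pods carrying this label belong to the ReplicaSet
  template:                # pod specification, so no separate pod YAML is needed
    metadata:
      labels:
        app: web           # must match the selector above
    spec:
      containers:
        - name: web
          image: nginx
```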
These diagnostics (probes) are performed periodically by the kubelet - defined per container in the template section of replicaset/deployments - httpGet [path] / exec [command] - tuned with initialDelaySeconds and periodSeconds
- readinessProbe - indicates if the container is ready to serve requests; no new requests are sent to the pod until the probe succeeds
- livenessProbe - indicates whether the container is running healthily; if it fails, the container is declared unhealthy and restarted
- startupProbe - protects slow-starting containers; liveness and readiness checks are held back until it succeeds
- httpGet - e.g. a /health endpoint
- exec - shell script or command that must exit successfully with return code 0
- tcpSocket - a socket to the container on the specified port must open successfully
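A sketch of how the three probe types can be combined on one container. The pod name, image, port and `/health` path are hypothetical; the timing values are only examples.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: slow-start-app
spec:
  containers:
    - name: app
      image: slow-start-app      # hypothetical image
      ports:
        - containerPort: 8080
      startupProbe:              # gives the app up to 30 * 10 = 300 s to start
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 10
        failureThreshold: 30
      livenessProbe:             # container is restarted if this keeps failing
        httpGet:
          path: /health
          port: 8080
        periodSeconds: 5
      readinessProbe:            # pod receives traffic only while this succeeds
        tcpSocket:
          port: 8080
        initialDelaySeconds: 5
        periodSeconds: 5
```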
- Pods are ephemeral
- They are recreated and not resurrected
- A Service is an abstraction that exposes an app running on a set of pods as a reliable network service.
- Exposes pod over a reliable IP, Port, DNS
- Associated with pods via matching labels
- Also used for inter pod communication
- Client -> service [DNS/IP] -> Endpoint object [list of all pod IP address associated with svc, keeps getting updated]
- Types:
- ClusterIP - default - cluster-internal IP, accessible only within the cluster n/w
- NodePort - exposes the service on a static port on every node - NodeIP:NodePort
- LoadBalancer - exposes the service publicly through a cloud provider's load balancer
- apiVersion - v1
- kind - Service
- metadata - name
- spec - type, selector - app [must match the pod labels, i.e. replicaset/template/metadata/labels or pod/metadata/labels]
- ports - protocol, port, targetPort
- Deleting pods or replica sets does not affect the svc but just removes them from its endpoints. Upon new spin ups, the service updates the endpoints based on the label-selector
- Readiness probes also affect the endpoints: a pod that is not ready is removed from them (a failing liveness probe affects them only indirectly, through the container restart)
- How to deploy a new version of app?
- How to roll back?
- Is replica set good enough?
- Change in rs.pod spec - no effect on already running pods
- Delete and re-deploy rs - change takes effect
- Updates with zero downtime
- Rollbacks
- A higher level of abstraction over replica set, provides declarative way of upgrading and rollbacks to pods
- Flow:
- Current state - RS 1
- Client -> Revision 2 -> API server
- Scheduler + Controller Manager -> spin up RS 2, pods created
- Pods in RS 1 are terminated
- RS 1 still persists -> so that it can be reused during a rollback
- In the YAML, the main diff b/w replica-set and deployment is the kind
- Default strategy - RollingUpdate - maxSurge, maxUnavailable
- Recreate strategy -> all old pods are killed before new ones are created -> downtime
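A Deployment descriptor with an explicit RollingUpdate strategy could be sketched as below. The names, label, replica count, image tag and the maxSurge/maxUnavailable values are hypothetical examples.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-deployment
spec:
  replicas: 4
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1          # at most 1 pod above the desired count during an update
      maxUnavailable: 1    # at most 1 pod below the desired count during an update
  selector:
    matchLabels:
      app: web
  template:
    metadata:
      labels:
        app: web
    spec:
      containers:
        - name: web
          image: nginx:1.25   # changing the pod template (e.g. this tag) creates a new revision
```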
- Containers are ephemeral
- We require persistent storage
- Types:
- emptyDir -
- No data at start,
- created when pods get created,
- mounted and accessible across all containers in the pod
- Help sharing data across containers
- spec -> volumes/name : html, volumes/emptyDir: {}
- spec/containers -> volumeMounts/name : html, volumeMounts/mountPath:
- Good option to share data b/w container but data is lost once pod goes down
- hostPath -
- Storage from backing Node [Host] is mounted inside container [Pod]
- Data retained on Node even after Pod goes down
- Data not available if Pod is scheduled on another Node
- Cannot protect data from a Node outage
- spec -> volumes/name : html, volumes/hostPath/path: , volumes/hostPath/type: Directory
- spec/containers -> volumeMounts/name : html, volumeMounts/mountPath:
- Good option to share data across pods on a Node
- Cloud volume type -
- awsElasticBlockStore
- gcePersistentDisk
- azureDisk
- nfs
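The emptyDir fragments above can be assembled into a full pod as in the sketch below, where two containers share one volume. The pod name, container names, images and mount paths are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-html
spec:
  volumes:
    - name: html
      emptyDir: {}             # created empty when the pod starts, lost when the pod goes down
      # For hostPath, this entry would instead be (hypothetical path):
      # hostPath:
      #   path: /data/html
      #   type: Directory
  containers:
    - name: web
      image: nginx
      volumeMounts:
        - name: html
          mountPath: /usr/share/nginx/html
    - name: content-writer
      image: busybox
      # writes the current date into the shared volume every 5 seconds
      command: ["sh", "-c", "while true; do date > /html/index.html; sleep 5; done"]
      volumeMounts:
        - name: html
          mountPath: /html
```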
- Abstracts how storage is provided and how storage is consumed
- PV
- Represent actual volume
- Provisioned by Admin or dynamically provisioned using StorageClass
- Lifecycle is independent of any individual Pod that uses it
- PVC
- Represent request for volume by user
- Abstract the storage resource without exposing details how those volumes are implemented
- Claims are fulfilled by PV hence PVC is linked with PV
- Retain - Actual volume is retained even after PV and PVC is deleted
- Delete - Actual physical storage is deleted; default for dynamically provisioned PVs
- Recycle - Deprecated
- Access modes
- ReadWriteOnce - RWO - volume can be mounted as read-write by a single node
- ReadOnlyMany - ROX - read-only by many nodes
- ReadWriteMany - RWX - read-write by many nodes
- Provisioning
- Static:
- Admin creates a number of PVs
- Cluster matches one of the PVs to a PVC
- A PV can be bound to only one PVC
- Dynamic:
- Allows storage volumes to be created on-demand as per the request
- Claims are fulfilled by PV, hence PVC are linked to PV
- Helps create dynamic on-demand PVs
- PVC refers to a storage class, the storage class provisions a PV on demand, and the Deployment/ReplicaSet/Pod mounts the PV via the PVC
- Basically storage classes are templates for PVs
- Provisioners - cloud service providers
- Parameters - specific to provisioners
- If a PVC is deleted, its PV is also gone, unless the reclaim policy is set to 'Retain'
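A sketch of static provisioning end to end: an admin-created PV, a PVC that claims it, and a pod that mounts the claim. All names, sizes and the hostPath location are hypothetical; hostPath is used here only because it needs no cloud provider.

```yaml
apiVersion: v1
kind: PersistentVolume
metadata:
  name: pv-demo
spec:
  capacity:
    storage: 1Gi
  accessModes:
    - ReadWriteOnce
  persistentVolumeReclaimPolicy: Retain   # keep the data after the claim is deleted
  hostPath:
    path: /data/pv-demo                   # hypothetical directory on the node
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: pvc-demo
spec:
  storageClassName: ""      # empty string: bind to a pre-created PV, no dynamic provisioning
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 500Mi        # any PV with enough capacity and matching access mode can satisfy this
---
apiVersion: v1
kind: Pod
metadata:
  name: pvc-consumer
spec:
  containers:
    - name: app
      image: nginx
      volumeMounts:
        - name: data
          mountPath: /usr/share/nginx/html
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: pvc-demo   # the pod only knows the claim, not the underlying volume
```

For dynamic provisioning, the claim would instead set `storageClassName` to the name of an existing StorageClass and no PV would be created by hand.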
- Link to K8 commands compilation: https://www.evernote.com/shard/s645/sh/18a2e56b-3451-90a2-75b5-2f91ec5ac6ef/3e5b88d59f5bb686d5fb7350cf823e63
- resource address format: ...cluster.local
- kubectl create -f --namespace=
- Also, namespace can be mentioned in metadata of the resource
- kubectl create namespace
- kubectl config set-context $(kubectl config current-context) --namespace=
    apiVersion: v1
    kind: ResourceQuota
    metadata:
      name: compute-quota
      namespace: dev
    spec:
      hard:
        pods: "10"
        requests.cpu: "4"
        requests.memory: 5Gi
        limits.cpu: "10"
        limits.memory: 10Gi
- --dry-run=client -> resource won't be created, instead will tell if resource would be created or not
- -o yaml -> resource definition in YAML format
- kubectl run nginx --image=nginx --dry-run=client -o yaml : will not create the resource 'pod' but will give pod declarative definition
- kubectl create deployment --image=nginx nginx --dry-run=client -o yaml : will not create the resource 'deployment' but will give deployment declarative definition
- kubectl create deployment nginx --image=nginx --dry-run=client -o yaml > nginx-deployment.yaml : saves definition to a file
- kubectl expose pod redis --port=6379 --name redis-service --dry-run=client -o yaml : will not create the resource 'service' but will give service declarative definition
- CMD vs ENTRYPOINT - command line args passed to docker run replace CMD, whereas they get appended to ENTRYPOINT
- A default can be specified by having both CMD and ENTRYPOINT - the CMD values are appended to ENTRYPOINT as default arguments
- ENTRYPOINT (docker) -> command (k8)
- CMD (docker) -> args (k8)
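The mapping can be illustrated with a pod that overrides both values. This sketch assumes a hypothetical image `ubuntu-sleeper` whose Dockerfile declares `ENTRYPOINT ["sleep"]` and `CMD ["5"]`.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: ubuntu-sleeper
spec:
  containers:
    - name: ubuntu-sleeper
      image: ubuntu-sleeper   # hypothetical image: ENTRYPOINT ["sleep"], CMD ["5"]
      command: ["sleep"]      # overrides the image's ENTRYPOINT
      args: ["10"]            # overrides the image's CMD, so the container runs "sleep 10"
```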
- Specifications of an existing POD, CANNOT be edited other than the below:
- spec.containers[*].image
- spec.initContainers[*].image
- spec.activeDeadlineSeconds
- spec.tolerations
- The environment variables, service accounts, resource limits of a running pod cannot be edited
- There are 2 options to achieve this though:
- Approach 1:
- kubectl edit pod <pod-name> -> This will open up the pod specification in a vi editor
- Change the specifications and try to save -> this will throw an error but will save the changed specifications in a temp file
- delete the existing pod:
kubectl delete pod <pod-name>
- create the changed pod:
kubectl create -f <tmp file path>
- Approach 2:
- Extract the pod definition in YAML format to a file using the command:
kubectl get pod <pod-name> -o yaml > my-new-pod.yaml
- vi my-new-pod.yaml: change the specifications and save
- kubectl delete pod <pod-name>
- kubectl create -f my-new-pod.yaml
- For deployments, the pod template can be edited directly:
kubectl edit deployment my-deployment
The new changes will be applied to the pods (running pods will be terminated and new pods with the latest specifications will be created)
- In pod specifications, env vars are set under the 'env' attribute. This is an array of name & value pairs.
- Other ways of specifying env vars are: ConfigMap and Secrets
- Example of direct key-value pair under 'env'
    env:
      - name: APP_COLOR
        value: pink
- Example of config-map under 'env'
    env:
      - name: APP_COLOR
        valueFrom:
          configMapKeyRef:
            name: <config-map-name>
            key: <key-in-config-map>
- Example of secret under 'env'
    env:
      - name: APP_COLOR
        valueFrom:
          secretKeyRef:
            name: <secret-name>
            key: <key-in-secret>
- Centralized way of configuring configuration data in the form of key-value pairs.
- When pods are created, these configuration data are injected to the apps inside the container inside the pod for usage
- Phases: Create config map, inject them into pod
- Imperative ways of creating a config map
kubectl create configmap <config-map-name> --from-literal=key1=value1 --from-literal=key2=value2
kubectl create configmap <config-map-name> --from-file=<path-to-file>
- Declarative way of creating a config map: apiVersion, kind, metadata, data (key-value pairs)
kubectl apply -f <config-map-definition-file-path>
- kubectl get configmaps
- kubectl describe configMaps
- Map config map to pod definition/template, either as env vars (under the container) or as a volume (under the pod spec)
    envFrom:
      - configMapRef:
          name: <config-map-name>
    volumes:
      - name: <volume-name>
        configMap:
          name: <config-map-name>
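Both phases, creating the config map and injecting it, can be sketched declaratively as below. The config map name `app-config`, its keys and the pod/image names are hypothetical.

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: app-config
data:                     # plain key-value pairs
  APP_COLOR: pink
  APP_MODE: prod
---
apiVersion: v1
kind: Pod
metadata:
  name: simple-webapp
spec:
  containers:
    - name: simple-webapp
      image: simple-webapp   # hypothetical image
      envFrom:
        - configMapRef:
            name: app-config # every key in the config map becomes an env var
```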
- Imperative way to create a secret:
kubectl create secret generic <secret-name> --from-literal=<key>=<value>
kubectl create secret generic <secret-name> --from-file=<path-to-file>
- Declarative way to create a secret
kubectl create -f <secret-file-name>
- Data values in a secret definition are base64-encoded, not encrypted. Since encoding alone is not enough, it is better to also use encryption, e.g. through a KMS
- kubectl get secrets
- kubectl describe secrets
- kubectl get secret <secret-name> -o yaml : to view the encoded secret values
- Map secret to pod definition/template, either as env vars (under the container) or as a volume (under the pod spec)
    envFrom:
      - secretRef:
          name: <secret-name>
    volumes:
      - name: <volume-name>
        secret:
          secretName: <secret-name>
If a secret is used as a volume mount, each attribute in the secret creates its own file, with the value as its contents
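A declarative secret definition might look like the sketch below. The secret name and keys are hypothetical, and the values are simply the base64 encodings of the sample strings shown in the comments.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: app-secret
type: Opaque
data:
  DB_USER: cm9vdA==            # base64 of "root"
  DB_PASSWORD: cGFzc3dvcmQ=    # base64 of "password"
```

Because base64 is reversible by anyone who can read the object, this file should be treated as sensitive as the plain values themselves.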
- Host itself runs a set of processes, docker daemon, ssh-server, etc.
- Docker containers, unlike VMs, share the same Linux kernel as the host but are separated by namespaces
- Container has its own namespace and host has its own
- All processes run on container in fact run on host itself but in a different namespace (namespace of container)
- A Docker container can see only its own processes
- Listing processes in a container (ps aux) will only show processes within container
- Listing processes in the host (ps aux) will show all processes within and out of container(s)
- A Docker container has a root user and a set of non-root users
- By default, docker runs processes within the container as the root user
- The user can be changed with the --user flag while running the container:
docker run --user=1000 ubuntu sleep 1000
- Another way to set the user is to create a custom image from an existing image and set the user in the Dockerfile itself. Example Dockerfile:
FROM ubuntu
USER 1000
Build the above custom image:
docker build -t my-ubuntu-image .
Then run the image w/o specifying the user; the process runs as user 1000
- If we run a container as the root user, is it not dangerous?
1. Docker implements a set of security features that limit the capabilities of the root user within the container
2. The root user within the container is not really the same as the root user on the host
3. Docker uses Linux capabilities to achieve this
4. The root user is the most powerful user in a system and can perform operations such as: CHOWN, DAC_OVERRIDE, KILL, SETGID, SETUID, NET_ADMIN, etc.
5. A process running as the root user on the host likewise has unrestricted access to the system
6. Docker's root user by default has limited capabilities; it does not have all the privileges
7. We can add more capabilities to the container's user while running it: docker run --cap-add KILL ubuntu
8. We can drop capabilities of the container's user while running it: docker run --cap-drop MAC_ADMIN ubuntu
9. We can run the container with all privileges as well: docker run --privileged ubuntu
- Configuring the user id of a container and adding/removing privileges of a user is also possible in k8
- Security settings can be configured at container/pod level (capabilities only at container level)
- If we set at pod level the settings will be applied to all containers within pod
- If we set at both pod and container level, then settings of container level will take precedence over pod settings
- Configuration
    apiVersion: v1
    kind: Pod
    metadata:
      name: web-app
    spec:
      securityContext:
        runAsUser: 1000 # all containers within this pod will run with user id 1000
      containers:
        - name: ubuntu
          image: ubuntu
          command: ["sleep", "1000"]
          securityContext:
            runAsUser: 2000 # the user id for this container would be 2000, overriding 1000
            capabilities:
              add: ["MAC_ADMIN", "KILL"]
- Two types of account in K8: User a/c and Service a/c.
- User account: used by humans, Service account: for automated tasks(by machines)
- User account types (not limited to): Admin (to perform admin tasks), Developer(to access the cluster and deploy apps)
- Service accounts are used by an app to interact with the k8 cluster, examples:
- A monitoring app like Prometheus uses service a/c to poll k8 metrics/logs to come up with performance metrics
- An automated build tool like Jenkins uses service a/c to deploy app on the cluster
- To create a service a/c:
kubectl create serviceaccount <account-name>
- To view all service a/c:
kubectl get serviceaccount
- On creation of a service a/c a token is created automatically (in K8s versions before 1.24; newer versions issue tokens on request via kubectl create token <account-name>):
kubectl describe serviceaccount <account-name>
- see Tokens - The above token can be used by external apps as a bearer token to authenticate against the kube-api.
- Token is stored as a secret object.
- To view the secret object:
kubectl describe secret <secret-name>
- Steps:
- create a service a/c
- assign role based permissions/access control mechanisms
- export the token
- use it in external app while making kube api requests
- If the external app itself is hosted in K8 cluster, the exporting can be made simpler by mounting the secret as a volume to the application.
- To view the secret files in the pod (which has secret mounted as volume):
- exec into the pod: kubectl exec -it <pod-name> -- <shell>
- ls /var/run/secrets/kubernetes.io/serviceaccount -> ca.crt, namespace, token
- cat /var/run/secrets/kubernetes.io/serviceaccount/token
- The default service account, which has limited permissions, is mounted automatically into every pod.
- To assign a service account: spec/serviceAccountName: <service a/c name>
- To prevent k8 from automatically mounting default service a/c : spec/automountServiceAccountToken: false
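The two spec fields above sit at pod level, next to `containers`. A sketch, with a hypothetical service account `dashboard-sa` and a hypothetical pod and image `my-dashboard`:

```yaml
apiVersion: v1
kind: ServiceAccount
metadata:
  name: dashboard-sa
---
apiVersion: v1
kind: Pod
metadata:
  name: my-dashboard
spec:
  serviceAccountName: dashboard-sa        # use this account instead of "default"
  # automountServiceAccountToken: false   # uncomment to skip mounting the token
  containers:
    - name: my-dashboard
      image: my-dashboard                 # hypothetical image
```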
- Scheduler decides which node the pod goes to.
- Scheduler takes into consideration the amount of resources requested by a pod and their availability on the node.
- If there are no sufficient resources available on any of the nodes, K8 keeps the pod in pending state with event reason as insufficient CPU/memory/disk
- Defaults such as CPU: 0.5, MEM: 256 Mi (Resource Request) apply only if a LimitRange is defined in the namespace; otherwise K8 sets no default request
- spec/containers:
    resources:
      requests:
        memory: "1Gi"
        cpu: 1
- cpu 0.1 means 100m (m -> milli)
- cpu can be requested as low as 1m
- 1 cpu equivalent to
- 1 AWS vCPU
- 1 GCP core
- 1 Azure core
- 1 Hyperthread
- 1Gi memory means 1 Gibibyte while 1G means 1 Gigabyte
- set limits under spec/containers/resources, to prevent a pod from consuming too many resources and suffocating other pods
    limits:
      memory: "2Gi"
      cpu: 2
- when a container tries to go beyond its cpu limit, k8 throttles the cpu so that it cannot consume more
- when a container tries to go beyond its memory limit, k8 terminates it
- The status OOMKilled indicates that the container was terminated because it ran out of memory. To fix it, check the memory limit set on the pod
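The request and limit fragments combine under one `resources` key per container. The sketch below also shows a LimitRange that supplies namespace-wide defaults; the object names, the `dev` namespace and all values are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: simple-webapp
spec:
  containers:
    - name: simple-webapp
      image: simple-webapp     # hypothetical image
      resources:
        requests:              # what the scheduler reserves on the node
          memory: "1Gi"
          cpu: 1
        limits:                # upper bound enforced at runtime
          memory: "2Gi"
          cpu: 2
---
apiVersion: v1
kind: LimitRange
metadata:
  name: default-resources
  namespace: dev
spec:
  limits:
    - type: Container
      defaultRequest:          # applied when a container specifies no requests
        cpu: 500m
        memory: 256Mi
      default:                 # applied when a container specifies no limits
        cpu: "1"
        memory: 512Mi
```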
- Taints and tolerations are used to set restrictions on what pods can be scheduled on which node.
- They have nothing to do with security.
- Let's take a use case:
- We have 4 pods: A, B, C, D
- We have 3 nodes: Node1, Node2, Node3
- If there are no taints and tolerations configured, then A, B, C, D will be placed on nodes via load balancing/resource management
- But suppose we want only pods like D (running the same app as D) to be scheduled on Node1
- Then we apply a taint on Node1; since until now none of the pods have any tolerations configured, none of them will be scheduled on Node1
- Now we can enable pod D to be placed on Node1, by adding a toleration on pod D.
- Taints are placed on nodes and Tolerations are placed on pods.
- Apply Taints to nodes:
kubectl taint node <node-name> <key>=<value>:<taint-effect>
- Taint-Effect determine what happens to the pod if they DO NOT TOLERATE this taint, there are 3 taint-effects
- NoSchedule: Pods will not be scheduled
- PreferNoSchedule: K8 will try not to schedule pods but with no guarantee
- NoExecute: New pods will not be scheduled, and pods already running on the node that do not tolerate the taint will be evicted.
- Apply Tolerations to pods (@ spec, at the same level as containers):
    tolerations:
      - key: "app"
        operator: "Equal"
        value: "blue"
        effect: "NoSchedule"
- Taints and tolerations do not guarantee that certain pods will be scheduled on certain nodes only. They enable nodes to accept certain pods, but those pods can very well be placed on other nodes as well.
- Scheduler does not place any pod on master node: because when K8 cluster is first set up a taint is applied on the master node automatically that prevents placing of other pods on master node.
- To see the above taint in master node:
kubectl describe node kubemaster | grep Taint
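For the use case above, a sketch of pod D with its toleration in place. It assumes Node1 was tainted with the hypothetical key-value pair `app=blue` and effect `NoSchedule`; the pod name and image are also placeholders.

```yaml
# Assumed taint on the node: kubectl taint node <node-name> app=blue:NoSchedule
apiVersion: v1
kind: Pod
metadata:
  name: pod-d
spec:
  containers:
    - name: app
      image: nginx
  tolerations:               # sibling of containers, not nested inside a container
    - key: "app"
      operator: "Equal"
      value: "blue"
      effect: "NoSchedule"   # must match the effect of the taint being tolerated
```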
- There might be use cases where we need to place certain pods only on certain nodes.
- For example,
- There are 3 nodes (2 nodes with low resources and 1 node with high resources).
- We would like to place pods running high processing apps in node with higher resources.
- The default setup places pods in nodes based on load balancing and resource availability strategy.
- Also, with taints and tolerations, we can restrict which pods a node accepts, but we cannot guarantee that pods are placed on certain nodes.
- A simple way to achieve this is using Node Selectors.
- An example of Pod configuration using node selector
    apiVersion: v1
    kind: Pod
    metadata:
      name: myapp-pod
    spec:
      containers:
        - name: data-processor
          image: data-processor
      nodeSelector:
        size: Large
- The key value pair (size: Large) are in fact labels assigned to nodes. Scheduler uses these to assign pods to specific Nodes.
- To label a node:
kubectl label nodes <node-name> <key>=<value>
- Limitations:
- Cannot serve complex requirements, e.g. placing a pod on large or medium nodes, or on any node that is not small.
- Node affinity is the solution here.
- Complex requirements can be executed in Node Affinity.
- The example used in Node Selectors can be re-defined as this:
    apiVersion: v1
    kind: Pod
    metadata:
      name: myapp-pod
    spec:
      containers:
        - name: data-processor
          image: data-processor
      affinity:
        nodeAffinity:
          requiredDuringSchedulingIgnoredDuringExecution:
            nodeSelectorTerms:
              - matchExpressions:
                  - key: size
                    operator: In # NotIn, Exists,...
                    values:
                      - Large
- If node affinity does not match any of the rules:
- Node affinity types:
- requiredDuringSchedulingIgnoredDuringExecution: Pod will not be scheduled if rules do not match (Pods remain in pending state), but pods already running are ignored (irrespective of the rules).
- preferredDuringSchedulingIgnoredDuringExecution: Pod will be scheduled in available node if rules do not match, and pods already running are ignored (irrespective of the rules).
- requiredDuringSchedulingRequiredDuringExecution (planned type, not available yet): Pod will not be scheduled if rules do not match (Pods remain in pending state), and pods already running are evicted if rules do not match.
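The preferred type uses a slightly different structure from the required one: a weighted list of preferences instead of nodeSelectorTerms. A sketch reusing the `size` label from the earlier example; the value `Medium` and the weight are hypothetical.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: myapp-pod
spec:
  containers:
    - name: data-processor
      image: data-processor
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
        - weight: 1                 # 1-100, higher weight means stronger preference
          preference:
            matchExpressions:
              - key: size
                operator: In
                values:             # prefer large or medium nodes, fall back to any node
                  - Large
                  - Medium
```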
- Let's take a use case
- There are 3 nodes: Red, blue and green. There are other nodes as well.
- There are 3 pods: Red, blue and green. There are other pods as well.
- Our aim is to put red pod in red node, green pod in green node and blue pod in blue node.
- We also do not want any other pods to be placed in our (red, green and blue) nodes.
- We also do not want our pods to be placed on other nodes.
- How to achieve this:
- Let's try with Taints and Tolerations first
- We apply taints red, blue and green to nodes.
- Then we apply tolerations red, blue and green to pods.
- This ensures that only pods with the matching toleration end up on each tainted node, but it does not prevent our pods from ending up on nodes that have no taints.
- Let's try with Node Affinity
- We apply key-value pair labels on nodes.
- We then configure pods with the appropriate node affinity.
- This will help us in placing pods on the appropriate nodes, but other pods might also end up on our nodes.
- So a combination of both Taints and Toleration and node affinity is used.
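A sketch of the combined approach for the blue pod. It assumes the blue node carries both a taint and a label using the hypothetical key `color`; the pod name and image are placeholders.

```yaml
# Assumed node setup:
#   kubectl taint node <blue-node> color=blue:NoSchedule   (keeps other pods out)
#   kubectl label nodes <blue-node> color=blue             (lets affinity find the node)
apiVersion: v1
kind: Pod
metadata:
  name: blue-pod
spec:
  containers:
    - name: app
      image: nginx
  tolerations:                   # allows the pod onto the tainted blue node
    - key: "color"
      operator: "Equal"
      value: "blue"
      effect: "NoSchedule"
  affinity:
    nodeAffinity:                # keeps the pod off every node that is not blue
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
          - matchExpressions:
              - key: color
                operator: In
                values:
                  - blue
```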
- Microservices enable us to develop small, independent, reusable code.
- Also, it helps us in scaling them.
- However, at times two services are required to work together such as a web server and a log agent.
- We want a web server and a log agent paired together, we do not want to merge them and bloat the code though.
- So we need multi-container pods that share same lifecycle, network space and storage volumes.
- An example of multi-container setup looks something like below:
    apiVersion: v1
    kind: Pod
    metadata:
      name: simple-webapp
      labels:
        name: simple-webapp
    spec:
      containers:
        - name: simple-webapp
          image: simple-webapp
          ports:
            - containerPort: 8080
        - name: log-agent
          image: log-agent
- Common design patterns:
- SIDECAR: we can run a logging agent along with the main app that will push logs on to a centralized logs-storage
- ADAPTER: sometimes each application produces different format of logs and hence we need to format them before pushing them to centralized system
- AMBASSADOR: very often, it is required to connect to different databases based on env. So based on the env we connect to that DB instance. This logic can be extracted out to an ambassador container which can act as a proxy.
- A pod has a pod status.
- The pod status states where the pod is in its lifecycle.
- When a pod is first created, it is in the Pending state. This is when the scheduler tries to figure out where to place the pod.
- If the scheduler cannot find a node to place the pod, it remains in the Pending state.
- Once the pod is scheduled, it goes into the ContainerCreating status; this is when the image is pulled and the containers are created.
- Once all the containers in the pod start, the pod status changes to Running.
- The pod stays in the Running state until the program in the container completes or the pod is terminated.
- So Completed and Terminating are the other pod statuses.
- Pod conditions
- PodScheduled
- Initialized
- ContainersReady
- Ready - indicate app inside the pod is running and ready to accept requests
- Container could be running various apps within them
- A Simple script performing a job, a db service, or a large web server serving end users.
- The script may take few milliseconds to get ready
- The db service may take a few milliseconds to connect to the db and run migration scripts
- The webserver might require some seconds to powerup before serving requests
- So the apps are not yet ready for those milliseconds to serve any requests
- W/o readiness probe, the pod continues to indicate being ready even though the underlying containers are powering up
- So readiness probes are important to let k8s know of the actual state of the containers
- If Pod is not ready k8s service will not divert request on to it because k8s service relies on pod's ready state to route traffic
- As developers, we know exactly when the app is ready to serve requests
- So we need a way to tie up the actual app's ready state with k8s status indicating ready or not
- There are a few ways to do so:
- HTTP test: /api/ready is responding with correct status code or not
- TCP test: TCP socket is up or not
- exec command: if command gets executed successfully or not
- Example of HTTP test readiness probe:
    apiVersion: v1
    kind: Pod
    metadata:
      name: simple-webapp
      labels:
        name: simple-webapp
    spec:
      containers:
        - name: simple-webapp
          image: simple-webapp
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /api/ready
              port: 8080
- Example of TCP test:
    readinessProbe:
      tcpSocket:
        port: 3306
- Example of Exec Command test:
    readinessProbe:
      exec:
        command:
          - cat
          - /app/is_ready
- We can add an initial delay to the probe when the app is known to take some time to start, so that the readiness probe is first run only after that time. This can be achieved with 'initialDelaySeconds':
    readinessProbe:
      httpGet:
        path: /api/ready
        port: 8080
      initialDelaySeconds: 10
- The probe runs periodically and the state of the container changes based on it. How often it runs can be set with 'periodSeconds':
    readinessProbe:
      httpGet:
        path: /api/ready
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
- By default, after 3 consecutive failed attempts the container is marked not ready and the pod will not be sent requests. The number of failed attempts can be configured with 'failureThreshold':
    readinessProbe:
      httpGet:
        path: /api/ready
        port: 8080
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 8
- A liveness probe is very similar to a readiness probe, but in this case the container is killed upon failure and restarted.
- The configurations stay similar to readiness probe
- HTTP test
    apiVersion: v1
    kind: Pod
    metadata:
      name: simple-webapp
      labels:
        name: simple-webapp
    spec:
      containers:
        - name: simple-webapp
          image: simple-webapp
          ports:
            - containerPort: 8080
          livenessProbe:
            httpGet:
              path: /api/healthy
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 5
            failureThreshold: 8
- TCP test
    livenessProbe:
      tcpSocket:
        port: 3306
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 8
- Exec command test
    livenessProbe:
      exec:
        command:
          - cat
          - /app/is_ready
      initialDelaySeconds: 10
      periodSeconds: 5
      failureThreshold: 8
- to view logs of a container (the -f option streams logs live):
kubectl logs -f <pod-name>
- if multiple containers are running in a single pod, the container name must be specified, else the command fails:
kubectl logs -f <pod-name> <container-name>
- What to Monitor:
- Count of nodes in cluster
- Healthy nodes count
- Performance metrics: CPU usage, memory, n/w and disk utilization
- Pod level metrics: number of them and performance metrics of each pod
- Tools to integrate with k8s:
- Metrics server
- Prometheus
- Elastic stack
- Datadog
- Dynatrace
- Heapster - Original project to enable monitoring and analytics on k8s objects - deprecated
- Metrics server
- A trimmed down version of Heapster
- 1 Metrics server per cluster
- Gets metrics from each node, pods, aggregates them and stores them
- In-memory monitoring solution - no historical data
- Kubelet runs on each node
- it has a sub-component called cAdvisor
- cAdvisor is responsible for retrieving performance metrics from pods and exposing them through the kubelet API
- minikube addons enable metrics-server
- git clone https://github.com/kubernetes-incubator/metrics-server.git - download the deployment manifests
- kubectl create -f deploy/1.8+/ - creates set of pods, services and roles to enable metric server to poll for performance metrics of cluster
- kubectl top node - to view the metrics of nodes
- kubectl top pod - to view the metrics of pods
- Ability to group kubernetes objects together and filter them based on needs is achieved using labels and selectors.
- Labels are basically properties attached to each item.
- Selectors help us filter kubernetes objects based on the attached properties (labels).
- An example of labels and selectors would be:
- When we create pods, we attach some labels.
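- And then when we create a service to redirect requests to the pods, we give it a selector whose key-value pairs match those labels, which links the service to the pods (ReplicaSets and Deployments do the same through selector/matchLabels)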
- An example of a pod with labels is as below (here app: mock-app and function: backend are the labels):
apiVersion: v1
kind: Pod
metadata:
  name: simple-webapp
  labels:
    app: mock-app
    function: backend
spec:
  containers:
    - name: simple-webapp
      image: simple-webapp
      ports:
        - containerPort: 8080
- After creating a pod with certain labels, we can filter it by:
kubectl get pods --selector app=mock-app
- An example of a service using selector to attach itself to pods (here app: mock-app and function: backend under spec/selector are the selectors)
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  selector:
    app: mock-app
    function: backend
  ports:
    - protocol: TCP
      port: 80
      targetPort: 9376
- Having one label in the selector is enough unless further filtering is required. When several are listed, a pod must carry all of them to be selected.
- Annotations are used to record other details for informational purposes, such as build information, name or contact details.
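- To show labels, a matchLabels selector and annotations side by side, here is a minimal sketch of a ReplicaSet. The buildVersion and contact annotations and their values are hypothetical, chosen only for illustration:

```yaml
apiVersion: apps/v1
kind: ReplicaSet
metadata:
  name: simple-webapp
  labels:                      # labels of the ReplicaSet object itself
    app: mock-app
  annotations:                 # informational only, never used for selection
    buildVersion: "1.34"       # hypothetical value
    contact: "team@example.com"  # hypothetical value
spec:
  replicas: 3
  selector:
    matchLabels:               # must match the pod template labels below
      app: mock-app
      function: backend
  template:
    metadata:
      labels:
        app: mock-app
        function: backend
    spec:
      containers:
        - name: simple-webapp
          image: simple-webapp
# Filtering on several labels at once (all must match):
#   kubectl get pods --selector app=mock-app,function=backend
```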
- When we first create a deployment, it creates a rollout.
- A new rollout creates a new revision.
- In the future, when the deployment (of the same name) is updated, a new rollout is triggered and a new revision with an increased number is created.
- This helps us keep track of the changes made and enables us to roll back to a previous version of the deployment.
- To check status of rollout:
kubectl rollout status deployment myapp-deployment
- To check the history, revision and change-cause of rollout:
kubectl rollout history deployment myapp-deployment
- Deployment strategies:
- Recreate:
- Suppose there are 5 instances of your app running
- When deploying a new version, we can destroy the 5 instances of older version and then deploy 5 instances of newer version
- The issue is there will be a downtime
- This is mainly used for major changes, breaking changes or when backward compatibility is not possible
- This is not the default strategy
- Rolling update
- In this strategy, we do not drop all the already running instances
- We drop instances by a certain percentage at a time and simultaneously spawn equal percentage of newer version pods.
- This is the default strategy
- This has no downtime
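- The strategy is chosen in the deployment spec. A minimal sketch, assuming a deployment named myapp-deployment whose pods carry the label app: myapp (the label is an assumption); the 25% values shown are the Kubernetes defaults and can be tuned:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deployment
spec:
  replicas: 5
  strategy:
    type: RollingUpdate        # set to "Recreate" (and drop rollingUpdate) for the other strategy
    rollingUpdate:
      maxUnavailable: 25%      # how many old pods may be down at a time
      maxSurge: 25%            # how many extra new pods may be created at a time
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: nginx
          image: nginx:1.7.1
```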
- Example of applying an update:
- suppose there is an already existing deployment running 3 replicas of a pod with image nginx:1.7.0
- now you wish to change the version of the image
- this can be done by changing the version of the image in deployment file and running the command:
kubectl apply -f <deployment file path>
- this can also be done by:
kubectl set image deployment myapp-deployment nginx=nginx:1.7.1
- but with the second approach (kubectl set image), the deployment definition in the cluster no longer matches the actual file
- run command:
kubectl describe deployment <deployment name>
to see the details of deployment, and notice the difference in both strategies
- How upgrades work under the hood:
- When a deployment is applied, it creates a replica-set and spins up pods with number of instances as mentioned in the deployment configuration
- Then, when the deployment is re-applied with changes, it creates another replica-set, spins up pods in it up to the number of instances mentioned in the deployment configuration, and simultaneously drops pods from the older replica-set.
- But the thing to note is, the older replica-set still exists, which will be used for rollback if required
- To rollback a deployment:
kubectl rollout undo deployment myapp-deployment
- this runs in a similar sequence to the upgrade
- After rollback the new replica-set still persists.
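- Remember: in order to see the change cause of historical revisions, we need to add the --record flag while editing/applying deployments (needs to be set once per deployment). Note that --record is deprecated in recent kubectl versions; setting the kubernetes.io/change-cause annotation on the deployment records the same information.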
- When we do a rollback, the revision to which the rollback happens is removed from history and a new entry is made in the history instead.
- If any error occurs during upgrade, kubernetes will proactively stop the upgrade and stop dropping previously running instances
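- The change cause shown in the rollout history is read from an annotation on the deployment. A minimal sketch that sets it directly in the manifest; the annotation text and the app: myapp label are illustrative assumptions:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-deployment
  annotations:
    # Appears in the CHANGE-CAUSE column of "kubectl rollout history".
    kubernetes.io/change-cause: "update nginx image to 1.7.1"
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: nginx
          image: nginx:1.7.1
# Inspect one revision, or roll back to a specific one:
#   kubectl rollout history deployment myapp-deployment --revision=2
#   kubectl rollout undo deployment myapp-deployment --to-revision=1
```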
- There are broadly 2 types of workloads:
- Longer running time workloads: DB, Services, Web-servers, etc. Manually stopped if required.
- Short runtime workloads: Batch processing, analytics, reporting, etc. Stops after finishing the task.
- Let us create a pod definition file (simple-sum.yaml) to do some computational work
apiVersion: v1
kind: Pod
metadata:
  name: math-pod
spec:
  containers:
    - name: math-add
      image: ubuntu
      command: ['expr', '3', '+', '2']
- now run command:
kubectl apply -f simple-sum.yaml
- status of pod (kubectl get pods) changes from creating -> running -> completed
- But the problem is, as soon as the pod goes to completed state (since it is done with the operation), kubernetes restarts its container and the cycle continues
- This is because kubernetes wants to keep pods running forever by default. There is a property called restartPolicy which is set to Always by default
- We can override this property to either 'Never' or 'OnFailure'.
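- A minimal sketch of the same math-pod with the restart policy overridden, so the container is not restarted after it exits:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: math-pod
spec:
  restartPolicy: Never         # pod-level field; "OnFailure" restarts only on non-zero exit
  containers:
    - name: math-add
      image: ubuntu
      command: ['expr', '3', '+', '2']
```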
- We want to make sure that all pods doing some computational work get created, complete their task successfully and are then dropped. For this we need a manager, which is known as a Job.
- A ReplicaSet ensures that pods keep running forever, while a Job ensures that pods are created and complete their assigned tasks successfully
- An example of Job
apiVersion: batch/v1
kind: Job
metadata:
  name: math-add-job
spec:
  completions: 3
  parallelism: 3
  template:
    spec:
      containers:
        - name: math-add
          image: ubuntu
          command: ['expr', '3', '+', '2']
      restartPolicy: Never
- 'completions' is analogous to 'replicas'.
- If one of the pods fails, the job keeps spinning up new pods until the required number of completions is met
- 'parallelism' sets how many pods of a job may run at the same time; without it, pods are created one after another
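- A minimal sketch of how completions and parallelism interact. The backoffLimit field is not covered in these notes; it caps the number of retries for failed pods, and 6 is its default:

```yaml
apiVersion: batch/v1
kind: Job
metadata:
  name: math-add-job
spec:
  completions: 3               # the job is done after 3 successful pods
  parallelism: 2               # at most 2 pods run at once: 2 first, then the remaining 1
  backoffLimit: 6              # assumption for illustration: give up after 6 failed retries
  template:
    spec:
      containers:
        - name: math-add
          image: ubuntu
          command: ['expr', '3', '+', '2']
      restartPolicy: Never
```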
- A Job that runs on a schedule is called a CronJob
- Template of CronJob is as follows:
apiVersion: batch/v1beta1  # use batch/v1 on Kubernetes v1.21 and later
kind: CronJob
metadata:
  name: reporting-cron-job
spec:
  schedule: "*/1 * * * *"
  jobTemplate:
    spec:
      completions: 3
      parallelism: 3
      template:
        spec:
          containers:
            - name: math-add
              image: ubuntu
              command: ['expr', '3', '+', '2']
          restartPolicy: Never
- For example, schedule "30 21 * * *" implies that the job will run at 2130 hrs every day (the template above uses "*/1 * * * *", i.e. every minute).
- One thing to notice is that it has 3 'spec's. 1st spec is for the CronJob itself. 2nd spec is for the Job (because CronJob is an abstraction over Job). 3rd spec is for the underlying pod and its containers.
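- A minimal sketch of the 2130 hrs schedule with the five cron fields spelled out, assuming a cluster on Kubernetes v1.21 or later where CronJob is served from batch/v1:

```yaml
apiVersion: batch/v1
kind: CronJob
metadata:
  name: reporting-cron-job
spec:
  # Fields: minute hour day-of-month month day-of-week
  #           30    21       *         *       *      -> 21:30 every day
  schedule: "30 21 * * *"
  jobTemplate:
    spec:                      # spec of the Job
      template:
        spec:                  # spec of the pod
          containers:
            - name: math-add
              image: ubuntu
              command: ['expr', '3', '+', '2']
          restartPolicy: Never
```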
- Services enable communication between components within and outside the applications.
- Services enable applications to connect with other resources like: db pods, other services (frontend/backend)
- They kind of enable loose coupling b/w microservices in our application setup.
- Let's understand a default setup w/o services:
- Our pod (let's say it is a FE app that says Hello World!) is within a K8s Node.
- Node IP is 192.168.1.2, Node uses the same n/w as our system.
- So our system IP will also fall in the same IP range: 192.168.1.10
- But the Pod has different n/w (say 10.244.0.0).
- So the Pod IP can be 10.244.0.2.
- In order to access the application which runs in the Pod, we have to ssh into the Node and then do a curl http://10.244.0.2
- But this only works from inside the K8s cluster. We need to be able to access the app from our own system by doing curl http://192.168.1.2.
- So we need something in the middle of Node and Pod to redirect the request.
- This is where K8s services come into play.
- K8s services are objects like any other K8s object. One of the use cases of a service is to listen on a port of the Node and forward the requests to a target port on the pod.
- This type of service is called a NodePort service, as it listens on a port of the Node and forwards to a pod port.
- ClusterIP: This type of service creates a virtual IP inside the cluster to enable communication b/w sets of services within the cluster itself.
- LoadBalancer: This type of service provisions a load balancer (on supported cloud providers) that distributes the load across the web servers it caters to.
- A template of service looks like this:
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: NodePort
  ports:
    - targetPort: 80
      port: 80
      nodePort: 30008
  selector:
    app: myapp
    type: frontend
- 'port' under spec/ports is the only mandatory field. If 'targetPort' under spec/ports is not defined, then it is defaulted to the value of 'port'.
- If 'nodePort' under spec/ports is not defined, then a free port in the range 30000 to 32767 is assigned automatically
- The selector is used to link services to the pods.
- The key-value pairs under selectors should match the labels of the pod.
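- A minimal sketch of a NodePort service that relies on the defaults: only 'port' is given, so targetPort takes the same value and the nodePort is picked from the 30000 to 32767 range:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: my-service
spec:
  type: NodePort
  ports:
    - port: 80                 # targetPort defaults to 80; nodePort is auto-assigned
  selector:                    # must match the labels on the target pods
    app: myapp
    type: frontend
```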
- To view services:
kubectl get services
- Now we can use the port '30008' on the Node IP to access the app:
curl http://192.168.1.2:30008
- By default, requests are distributed across the matching pods at random, and sessionAffinity is None (it can be set to ClientIP to pin a client to one pod).
- If pods are distributed across Nodes, K8s automatically makes the service span all the nodes and maps the target port to the same nodePort on every node, so the app is reachable through any node's IP.
- Another use case for internal communication:
- Suppose we have multiple frontend pods
- We also have multiple backend pods
- We have multiple db pods too
- frontend pods need to interact with backend pods, and in turn backend pods need to interact with db pods.
- Now each frontend pod does not know exactly which backend pod to connect to. A similar issue also exists b/w backend and db pods.
- Even if we somehow map IPs, the pods are ephemeral and their IPs keep changing.
- Hence, ClusterIP services provide a single interface that groups pods of a similar type together and gives access to them.
- A template of clusterIP service looks like this:
apiVersion: v1
kind: Service
metadata:
  name: backend-service
spec:
  type: ClusterIP
  ports:
    - targetPort: 80
      port: 80
  selector:
    app: myapp
    type: backend
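- Other pods can then reach the backend through the service name instead of individual pod IPs. A minimal sketch of a frontend pod doing so, assuming cluster DNS is running and both objects are in the same namespace; the BACKEND_URL variable is hypothetical and the image is reused from earlier examples as a placeholder:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: frontend
  labels:
    app: myapp
    type: frontend
spec:
  containers:
    - name: frontend
      image: simple-webapp     # placeholder image
      env:
        - name: BACKEND_URL    # hypothetical variable read by the app
          # The service name resolves to the ClusterIP of backend-service.
          value: "http://backend-service:80"
```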
- Take a look at why an ingress is required. Then come back here to see some configuration details.
- Ingress Controller:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: nginx-ingress-controller
spec:
  replicas: 1
  selector:
    matchLabels:
      name: nginx-ingress
  template:
    metadata:
      labels:
        name: nginx-ingress
    spec:
      containers:
        - name: nginx-ingress-controller
          image: quay.io/kubernetes-ingress-controller/nginx-ingress-controller:0.21.0
          args:
            - /nginx-ingress-controller
            - --configmap=$(POD_NAMESPACE)/nginx-configuration
          env:
            - name: POD_NAME
              valueFrom:
                fieldRef:
                  fieldPath: metadata.name
            - name: POD_NAMESPACE
              valueFrom:
                fieldRef:
                  fieldPath: metadata.namespace
          ports:
            - name: http
              containerPort: 80
            - name: https
              containerPort: 443
---
apiVersion: v1
kind: ConfigMap
metadata:
  name: nginx-configuration
---
apiVersion: v1
kind: Service
metadata:
  name: nginx-ingress
spec:
  type: NodePort
  ports:
    - port: 80
      targetPort: 80
      protocol: TCP
      name: http
    - port: 443
      targetPort: 443
      protocol: TCP
      name: https
  selector:
    name: nginx-ingress
---
apiVersion: v1
kind: ServiceAccount
metadata:
  name: nginx-ingress-serviceaccount
- There are four K8s objects involved in setting an ingress controller:
- Deployment.
- Service: To expose the Deployment.
- ConfigMap: To feed nginx configuration data like sslprotocol, logpath, etc.
- ServiceAccount: To apply Ingress resource configurations. The service accounts must have right set of roles, clusterroles and rolebindings configured.
- Ingress Resource:
apiVersion: extensions/v1beta1
kind: Ingress
metadata:
  name: ingress-service
spec:
  rules:
    - host: my-apparelstore.com
      http:
        paths:
          - path: /app1
            backend:
              serviceName: app1
              servicePort: 8080
    - host: my-apparelstore.com
      http:
        paths:
          - path: /app2
            backend:
              serviceName: app2
              servicePort: 8080
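- The extensions/v1beta1 Ingress API was removed in Kubernetes v1.22. A minimal sketch of the same routing rules in networking.k8s.io/v1, where each path needs a pathType and the backend is written as service/name and service/port. The ingressClassName value nginx is an assumption about how the controller is registered in the cluster:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ingress-service
spec:
  ingressClassName: nginx      # assumption: name of the IngressClass for the nginx controller
  rules:
    - host: my-apparelstore.com
      http:
        paths:
          - path: /app1
            pathType: Prefix   # matches /app1 and anything below it
            backend:
              service:
                name: app1
                port:
                  number: 8080
          - path: /app2
            pathType: Prefix
            backend:
              service:
                name: app2
                port:
                  number: 8080
```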