When was the last time one of your K8s masters went down? What did you do about it, and do you know why it went down in the first place?
In this blog post, we will walk through exactly such a scenario that happened in our k8s cluster and dig into the details of the issue.
On the day of the issue, we got a page reporting that we had fewer nodes than expected in our k8s cluster. We started investigating right away, so that the same thing would not start happening to our other nodes and leave us, in no time, with no nodes in the cluster at all. Before going into the details of the issue, here is a quick overview of the affected k8s setup:
- 2 K8s masters
- 5 ETCD Pods
- k8s version: v1.12
Some other relevant details:
- We use Kubespray to deploy the k8s cluster
- We use the same set of worker nodes for hosting k8s master pods and ETCD pods. Reason: Economics
- The node which went down was hosting master k8s pods, so after it went down we were left with a single k8s master, which is not at all desirable given our HA guarantees
Why ???
We started investigating what exactly went wrong with the node. These are the steps we took to narrow down the problem.
Let’s Check the status of the node via kubectl
kubectl describe node ${relevantNode}
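The full describe output is fairly long, so if you only want the disk-related signal, a couple of filters like the ones below can help. This is a minimal sketch; ${relevantNode} is the same placeholder used above, and the exact reason strings can vary between k8s versions.
# Show just the node conditions; under disk pressure, the DiskPressure condition reports Status=True
kubectl describe node ${relevantNode} | grep -A 10 "Conditions:"
# Or pull the DiskPressure condition directly
kubectl get node ${relevantNode} -o jsonpath='{.status.conditions[?(@.type=="DiskPressure")]}'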
In the describe output above, we can clearly see that a NodeHasDiskPressure event was recorded for this node, followed by another event stating that the kubelet is trying to reclaim ephemeral storage.
After this, we went to the node itself to look at the kubelet logs.
Let’s Check Kubelet Logs
journalctl -u kubelet
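The kubelet logs can be quite noisy, so a rough filter for eviction-related messages is handy. This is a sketch of the kind of filtering we did; the exact wording of kubelet log lines differs between versions.
# Filter the kubelet logs for eviction and disk pressure messages
journalctl -u kubelet --no-pager | grep -iE "evict|disk pressure"
# Or follow the most recent lines while debugging live
journalctl -u kubelet -n 300 -f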
In the logs, we can see clearly that kube-apiserver was evicted in order to reclaim ephemeral storage.
Let’s Check Disk Space for the Volume Mounted on /var
df -h
In the output, we can see that the volume /dev/mapper/centos-var, which is mounted on /var, has less than 10% of its disk space available. According to our configuration, the hard eviction threshold was set to 10%, which means that if the available disk space drops below 10%, the kubelet starts evicting pods. That is exactly what it did, as we saw in the logs above.
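For reference, this threshold comes from the kubelet's --eviction-hard setting. Below is a quick way to double-check what your kubelet is actually running with; the file paths are assumptions based on a typical Kubespray deployment and may differ in your environment.
# Look up the eviction thresholds configured for the kubelet
# (file locations are an assumption; adjust for your setup)
grep -r "eviction-hard" /etc/kubernetes/kubelet.env /etc/systemd/system/kubelet.service.d/ 2>/dev/null
# Or read them straight off the running process
ps aux | grep "[k]ubelet" | tr ' ' '\n' | grep eviction
# Typically this shows something like: --eviction-hard=nodefs.available<10%,...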
Explanation
The kubelet continuously monitors disk space on its volumes via cAdvisor. If the available disk space drops below the configured threshold, the kubelet starts evicting pods to reclaim it. To decide which pods to evict, it ranks them based on certain conditions (such as whether a pod's usage exceeds its requests, its priority, and its actual consumption) and then evicts pods in that order until the pressure condition is resolved.
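Pod priority and QoS class feed into this ranking, so it is worth checking where your control plane pods stand. A small sketch follows; the pod name assumes the usual static pod naming convention kube-apiserver-<nodeName>, which may differ in your cluster.
# QoS class and priority of the apiserver pod (both influence eviction ordering)
kubectl get pod -n kube-system kube-apiserver-${relevantNode} \
  -o jsonpath='{.status.qosClass}{"\n"}{.spec.priority}{"\n"}'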
The kubelet log snapshot above shows the same thing happening in our case: the kubelet built an eviction ranking of the pods, kube-apiserver ended up at the top of the list, and so this control plane component was the first to be evicted. Because of this eviction, here is what happened next:
- The kubelet reports its node status through the kube-apiserver, which is how it lets the rest of the cluster know the node is available. In our case, the kube-apiserver and the kubelet were running on the same node, so the kubelet was talking to that local kube-apiserver. But as we know, the kubelet had already evicted that very kube-apiserver pod because of low disk space, which means the kubelet could no longer communicate with the rest of the cluster. That is why, as we saw in the initial diagram, the node was being shown as NotReady.
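With the local apiserver gone, kubectl pointed at that node will not respond, but you can still verify the state of the static control plane pods from the node itself. A minimal sketch, assuming a Kubespray/kubeadm-style layout with static pod manifests under /etc/kubernetes/manifests and Docker as the container runtime:
# Static pod manifests the kubelet is supposed to run (assumed default path)
ls /etc/kubernetes/manifests/
# Check whether the kube-apiserver container is actually running on this node
docker ps | grep kube-apiserver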
Fixing the problem
- SSH to the affected node
- Clear up disk space so that the available space is well above 10%
- After cleaning up /var/log, restart the kubelet with "sudo systemctl restart kubelet" (see the sketch after this list)
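Here is a rough sketch of the cleanup and restart. What is safe to delete depends entirely on your environment, so treat these commands as examples rather than a recipe.
# See what is actually eating the space under /var (-x stays on this filesystem)
sudo du -xh /var 2>/dev/null | sort -rh | head -20
# Trim the systemd journal, a common culprit under /var/log
sudo journalctl --vacuum-size=500M
# Confirm we are comfortably above the 10% eviction threshold again
df -h /var
# Then restart the kubelet
sudo systemctl restart kubelet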
After restarting the kubelet, all the relevant critical k8s pods, including kube-apiserver, started up again on the affected node.
Once the pods were back, the node was marked Ready again because the kubelet resumed publishing node status and metrics, and other pods and daemonsets were rescheduled onto this node.
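To confirm the recovery, a couple of quick checks (again a sketch, using the same ${relevantNode} placeholder as above):
# The node should report Ready again
kubectl get node ${relevantNode}
# Control plane pods, including kube-apiserver, should be Running on it
kubectl get pods -n kube-system -o wide --field-selector spec.nodeName=${relevantNode}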