The solutions in this post are geared toward those who are hosting their cluster on AWS and managing it with KOPS. However, the troubleshooting steps apply to most scenarios. I am sharing this in the hopes of saving others the stress I experienced the first time this happened to me. If you have not yet had a Kubernetes worker node go into the `notReady` state, read on, because you will.
If you are here because you have a worker node in the `notReady` state right now and you are using AWS and KOPS, follow the troubleshooting steps below. I will discuss them afterwards.
- Run `kubectl get nodes` to get the names of the nodes in the `notReady` state.
  - At this point I recommend provisioning additional nodes with KOPS to relieve pressure on your other nodes and give the pods on the `notReady` nodes a place to go (see the kops scaling sketch after this list).
    - Run `kops edit ig nodes` to bring up the editor.
    - Set the `maxSize` and `minSize` values.
    - Preview the changes with `kops update cluster <clustername>`.
    - Apply the changes with `kops update cluster <clustername> --yes`.
    - Detailed instructions are available here.
- Use `kubectl describe node <node name>` to get the status of the node.
  - A handy shortcut for the two steps above is `kubectl get nodes | grep '^.*NotReady.*$' | awk '{print $1}' | xargs kubectl describe node`.
- Look for the `Conditions` heading and check the `NetworkUnavailable`, `OutOfDisk`, `MemoryPressure`, and `DiskPressure` conditions (see the node-inspection sketch after this list).
  - If the statuses of those conditions are helpful, begin troubleshooting them.
  - If there are memory or disk issues, there is a good chance that you have a pod or a number of pods wreaking havoc. Fix the problems with those pods if you can; otherwise, prevent them from being scheduled onto other nodes by deleting them, or by using `kubectl cordon <node name>` on your healthy nodes to prevent new pods from being scheduled to them.
- SSH into the unhealthy nodes. If you cannot SSH into the nodes, skip ahead. Otherwise, use `ps -eaf` to determine whether the docker daemon and kubelet are running.
- If you have determined that kubelet and the docker daemon are not running on the node, or you are not able to determine this, use the AWS console or AWS CLI to terminate the node (see the terminate-and-scale-down sketch after this list). The autoscale group created by KOPS will provision a new node.
  - At this time, if you provisioned additional nodes that you would like to remove, use `kubectl drain <node-name>` to drain the node. Once it has been drained, update the cluster configuration with KOPS to reduce the size of the cluster, and then terminate the node.
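Here is what the kops scale-up from the first step looks like, sketched out. The cluster name and sizes are placeholders, and it assumes your state store is already configured (for example through the `KOPS_STATE_STORE` environment variable).

```sh
# Scale up the "nodes" instance group so displaced pods have somewhere to land.
kops edit ig nodes --name <clustername>
# In the editor, raise the group's size, for example:
#   spec:
#     minSize: 5
#     maxSize: 5

kops update cluster <clustername>        # preview the changes
kops update cluster <clustername> --yes  # apply them
```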
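For the node-inspection step, this is a quick way to see just the conditions without scrolling through the full `kubectl describe` output; the node names are placeholders.

```sh
# Print each condition (type, status, reason) for the unhealthy node.
kubectl get node <node name> \
  -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}'

# If misbehaving pods are the cause, keep new pods off a healthy node while you clean up:
kubectl cordon <node name>
```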
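And the terminate-and-scale-down sketch for the last two steps. It assumes the node name matches the EC2 instance's private DNS name, which is the default with KOPS on AWS; the instance IDs and cluster name are placeholders.

```sh
# Find and terminate the EC2 instance behind the unhealthy node;
# the KOPS-managed autoscaling group will provision a replacement.
aws ec2 describe-instances \
  --filters "Name=private-dns-name,Values=<node name>" \
  --query 'Reservations[].Instances[].InstanceId' --output text
aws ec2 terminate-instances --instance-ids <instance-id>

# Later, to retire the extra node you provisioned earlier:
kubectl drain <node-name> --ignore-daemonsets
kops edit ig nodes --name <clustername>   # lower minSize and maxSize again
kops update cluster <clustername> --yes
aws ec2 terminate-instances --instance-ids <instance-id>
```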
Of course, if you aren’t using KOPS and an autoscale group, these steps won’t be as helpful to you. In general, I think it is important to quickly diagnose whether or not the docker daemon and kubelet are running on the affected nodes. If the docker daemon is down and cannot be restarted, a quick and simple solution might be replacing the node, assuming you can do that quickly and safely. After all, once the docker daemon is down, your pods aren’t working anyway.
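If you can get a shell on an affected node, a quick check along these lines tells you whether the two daemons are up; the `systemctl` calls assume the node's OS uses systemd, as the default KOPS images do.

```sh
# Are dockerd and kubelet running at all?
ps -eaf | grep -E 'dockerd|kubelet' | grep -v grep

# On a systemd-based node, check their status and try a restart:
systemctl status docker kubelet
sudo systemctl restart docker
```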
If you have not had the need to fix a node in the `notReady` state, now is a good time to plan what you will do when you encounter this situation. For my hobby cluster that I host with Linode, I will follow roughly the same procedure that I describe above, except I won't be able to terminate the node and expect it to be replaced automatically. Instead, I will use kube-linode to provision an additional node to transfer the orphaned pods to. Then, I will use `kubectl delete node <node name>` to remove the node object, and finally remove the node from my Linode account.
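Sketched out with placeholder node names (provisioning the replacement through kube-linode is its own step and not shown):

```sh
# Move the workload off the unhealthy node, then remove it from the cluster.
kubectl drain <unhealthy-node> --ignore-daemonsets
kubectl delete node <unhealthy-node>
# Finally, delete the underlying machine from the Linode account by hand.
```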
Unfortunately, the steps above do not identify what caused the node to go into the `notReady` state. Additionally, once the node is gone, you will probably lose any logs that you might have been able to snoop through to figure out what happened. Again, now is a good time to plan how you will deal with this situation when it comes up. I recommend monitoring your nodes with node_exporter and Prometheus, and using some type of log aggregation tool. At work, I use LogDNA.
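As a starting point on the monitoring side, a query along these lines flags node_exporter targets that have stopped reporting; the Prometheus address and the `job` label are assumptions about your particular setup.

```sh
# Ask Prometheus which node_exporter targets are currently down.
curl -sG 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=up{job="node_exporter"} == 0'
```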
If you are looking for a place to get started with Kubernetes, take a look at Kubernetes: Up and Running.