# Solution: Worker Node Failure

Let's have a look at the Practice Test of the Worker Node Failure.
## Solution

### Fix the broken cluster

#### Fix `node01`

1. Check the nodes

    ```bash
    kubectl get nodes
    ```

    We see that `node01` has a status of `NotReady`. This usually means that communication with the node's kubelet has been lost.
2. Go to the node and investigate

    ```bash
    ssh node01
    ```

3. Check kubelet status

    ```bash
    systemctl status kubelet
    ```

    We can see from the output that kubelet is not running; in fact it has exited. Therefore we should try starting it.
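As a side note, the status string systemd prints can also be checked programmatically. A minimal sketch against a hard-coded sample status line (in the lab you would feed it the real `systemctl status kubelet` output):

```shell
# Sample status line (hypothetical); a healthy unit would contain "active (running)"
status_line="Active: inactive (dead) since Mon 2024-01-01 10:00:00 UTC"

case "$status_line" in
  *"active (running)"*) echo "kubelet OK" ;;
  *)                    echo "kubelet needs attention" ;;
esac
# → kubelet needs attention
```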
4. Start kubelet

    ```bash
    systemctl start kubelet
    ```

5. Now check it is OK

    ```bash
    systemctl status kubelet
    ```

    Now we can see it is `active (running)`, which is good.
6. Return to controlplane

    ```bash
    exit
    ```

7. Check nodes again

    ```bash
    kubectl get nodes
    ```

    It is good!
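On a cluster with more nodes, it can help to filter out just the `NotReady` ones. A small sketch against simulated `kubectl get nodes` output (node names and versions are illustrative):

```shell
# Simulated 'kubectl get nodes' output saved to a file for illustration
cat <<'EOF' > /tmp/nodes.txt
NAME           STATUS     ROLES           AGE   VERSION
controlplane   Ready      control-plane   10d   v1.30.0
node01         NotReady   <none>          10d   v1.30.0
EOF

# Print the names of nodes whose STATUS column is NotReady
awk '$2 == "NotReady" {print $1}' /tmp/nodes.txt
# → node01
```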
### The cluster is broken again. Investigate and fix the issue.

#### Fix cluster

1. Check the nodes

    ```bash
    kubectl get nodes
    ```

    We see that `node01` has a status of `NotReady`. This usually means that communication with the node's kubelet has been lost.
2. Go to the node and investigate

    ```bash
    ssh node01
    ```

3. Check kubelet status

    ```bash
    systemctl status kubelet
    ```

    We can see from the output that it is crashlooping: `activating (auto-restart)`. Therefore this is likely a configuration issue.

4. Check kubelet logs

    ```bash
    journalctl -u kubelet
    ```

    There is a lot of information, however the error we are interested in, which is the cause of all other errors, is this one:

    ```
    "failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/pki/WRONG-CA-FILE.crt: open /etc/kubernetes/pki/WRONG-CA-FILE.crt: no such file or directory"
    ```

    If kubelet cannot load its certificates, then it cannot authenticate with the API server. This is a fatal error, so kubelet exits.
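Rather than scrolling through the full journal, you can grep for the fatal line. A sketch against a simulated log excerpt (in the lab the input would come from `journalctl -u kubelet`; the sample lines below are invented for illustration):

```shell
# Simulated excerpt of kubelet journal output
cat <<'EOF' > /tmp/kubelet.log
kubelet[1234]: Starting kubelet
kubelet[1234]: "failed to construct kubelet dependencies: unable to load client CA file /etc/kubernetes/pki/WRONG-CA-FILE.crt"
EOF

# Isolate the part of the message naming the missing file
grep -o 'unable to load client CA file [^"]*' /tmp/kubelet.log
# → unable to load client CA file /etc/kubernetes/pki/WRONG-CA-FILE.crt
```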
5. Check the indicated directory for certificates

    ```bash
    ls -l /etc/kubernetes/pki
    ```

    We see it contains `ca.crt`, which we will assume is the correct certificate. Therefore we need to find the kubelet configuration file and correct the error there.
6. Locate kubelet's configuration file

    kubelet is an operating system service, so its service unit file will give us that info:

    ```bash
    systemctl cat kubelet
    ```

    Note this line:

    ```
    Environment="KUBELET_CONFIG_ARGS=--config=/var/lib/kubelet/config.yaml"
    ```

    There is the config YAML file.
7. Fix configuration

    ```bash
    vi /var/lib/kubelet/config.yaml
    ```

    ```yaml
    apiVersion: kubelet.config.k8s.io/v1beta1
    authentication:
      anonymous:
        enabled: false
      webhook:
        cacheTTL: 0s
        enabled: true
      x509:
        clientCAFile: /etc/kubernetes/pki/WRONG-CA-FILE.crt   # <- Fix this
    authorization:
      mode: Webhook
    ```

    Note that you can perform the same edit with a single `sed` command. This is quicker than editing in vi:

    ```bash
    sed -i 's/WRONG-CA-FILE.crt/ca.crt/g' /var/lib/kubelet/config.yaml
    ```
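If you are unsure about a `sed` expression, you can try it on a throwaway copy first. A sketch using a sample line standing in for the real config file:

```shell
# Sample line standing in for /var/lib/kubelet/config.yaml
printf 'clientCAFile: /etc/kubernetes/pki/WRONG-CA-FILE.crt\n' > /tmp/config-sample.yaml

# Same substitution as above, applied to the sample copy
sed -i 's/WRONG-CA-FILE.crt/ca.crt/g' /tmp/config-sample.yaml

cat /tmp/config-sample.yaml
# → clientCAFile: /etc/kubernetes/pki/ca.crt
```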
8. Check status

    Wait a few seconds; kubelet will be auto-restarted.

    ```bash
    systemctl status kubelet
    ```

    Now we can see it is `active (running)`, which is good. If it is not, then you made a mistake when editing the config file: you probably broke the YAML syntax or did not edit the certificate filename correctly. Return to step 7 above and fix it.
9. Return to controlplane

    ```bash
    exit
    ```

10. Check nodes again

    ```bash
    kubectl get nodes
    ```

    It is good!
### The cluster is broken again. Investigate and fix the issue.

#### Fix cluster

1. Check the nodes

    ```bash
    kubectl get nodes
    ```

    We see that `node01` has a status of `NotReady`. This usually means that communication with the node's kubelet has been lost.
2. Go to the node and investigate

    ```bash
    ssh node01
    ```

3. Check kubelet status

    ```bash
    systemctl status kubelet
    ```

    We can see it is `active (running)`, however the API server still thinks there is an issue, so we must again go to the kubelet logs.

4. Check kubelet logs

    ```bash
    journalctl -u kubelet
    ```

    There is a lot of information, however the error we are interested in, which is the cause of all other errors, is this one:

    ```
    "Unable to register node with API server" err="Post \"https://controlplane:6553/api/v1/nodes\": dial tcp 192.10.46.12:6553: connect: connection refused" node="node01"
    ```

    What do you know about the usual port for the API server? It's not `6553`! kubelet uses a kubeconfig file to connect to the API server just like kubectl does, so we need to locate and fix that.
5. Locate kubelet's kubeconfig file

    kubelet is an operating system service, so its service unit file will give us that info:

    ```bash
    systemctl cat kubelet
    ```

    Note this line:

    ```
    Environment="KUBELET_KUBECONFIG_ARGS=--bootstrap-kubeconfig=/etc/kubernetes/bootstrap-kubelet.conf --kubeconfig=/etc/kubernetes/kubelet.conf"
    ```

    There are two kubeconfigs. The first one is used when a node is created and is joining the cluster. The second one is used for normal operation. It is therefore the second one we are interested in.
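To see at a glance which API server address kubelet will use, you can extract the `server:` line. A sketch against a trimmed sample kubeconfig (in the lab you would run the same `awk` against `/etc/kubernetes/kubelet.conf` itself):

```shell
# Trimmed sample kubeconfig for illustration
cat <<'EOF' > /tmp/kubelet-sample.conf
clusters:
- cluster:
    server: https://controlplane:6553
  name: default-cluster
EOF

# Print the API server URL kubelet would connect to
awk '/server:/ {print $2}' /tmp/kubelet-sample.conf
# → https://controlplane:6553
```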
6. Fix the kubeconfig

    Port should be `6443`.

    ```bash
    vi /etc/kubernetes/kubelet.conf
    ```

    ```yaml
    apiVersion: v1
    clusters:
    - cluster:
        certificate-authority-data: REDACTED
        server: https://controlplane:6553   # <- Fix this
      name: default-cluster
    contexts:
    - context:
        cluster: default-cluster
        namespace: default
        user: default-auth
      name: default-context
    current-context: default-context
    kind: Config
    preferences: {}
    users:
    - name: default-auth
      user:
        client-certificate: /var/lib/kubelet/pki/kubelet-client-current.pem
        client-key: /var/lib/kubelet/pki/kubelet-client-current.pem
    ```

    Note that you can perform the same edit with a single `sed` command. This is quicker than editing in vi:

    ```bash
    sed -i 's/6553/6443/g' /etc/kubernetes/kubelet.conf
    ```
7. Restart kubelet

    Since kubelet is already running (not crashlooping), we need to restart it so it picks up the updated kubeconfig:

    ```bash
    systemctl restart kubelet
    ```

8. Check status

    ```bash
    systemctl status kubelet
    ```

    Now we can see it is `active (running)`, which is good. If it is not, then you made a mistake when editing the kubeconfig, probably broke the YAML syntax. Return to step 6 above and fix it.
9. Return to controlplane

    ```bash
    exit
    ```

10. Check nodes again

    ```bash
    kubectl get nodes
    ```

    It is good! If it is not, then you probably made a mistake setting the port number. Return to `node01` and redo from step 6 above.
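All three scenarios above follow the same triage loop. As a closing summary, here is that loop sketched as a shell function: illustrative only, since actually running it requires being on a real cluster node with a broken kubelet.

```shell
# Generic kubelet triage pattern distilled from the three fixes above.
# Defining the function is harmless; invoking it requires a real cluster node.
troubleshoot_kubelet() {
  systemctl status kubelet            # 1. running, exited, or crashlooping?
  journalctl -u kubelet | tail -n 50  # 2. find the fatal error near the end
  # 3. fix whatever the log points at:
  #    /var/lib/kubelet/config.yaml, /etc/kubernetes/kubelet.conf, certificates
  systemctl restart kubelet           # 4. pick up the fix
  systemctl status kubelet            # 5. confirm active (running)
}
```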