Cluster Node Management
General things to look at when rebooting a node
These checks can be run with kubectl; example commands are sketched after this list.
- Does the node have any `rook-ceph-osd-*` pods? Check the corresponding Ceph cluster's health, and make sure you bring down only one node of that cluster at a time.
- Does the node have any `haproxy-ingress-*` pods? If the node is coming down for a long time, disable its record in Constellix DNS.
- Does the node have the `nautilus.io/linstor-server` label? This node is the linstor server. Some are redundant, some are not.
- Does the node have the `nautilus.io/bgp-speaker` label? There are 2 nodes in the cluster used for MetalLB IPs; keep at least one of them alive.
- Does the node have the `node-role.kubernetes.io/master` label and is not an admiralty virtual node? Rebooting this node will make the cluster inaccessible.
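A minimal sketch of these checks, where `<node-name>` is a placeholder for the actual node name:

# List pods running on the node to spot rook-ceph-osd-* or haproxy-ingress-* pods
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node-name>
# Show the node's labels to check for nautilus.io/linstor-server, nautilus.io/bgp-speaker, or node-role.kubernetes.io/master
kubectl get node <node-name> --show-labels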
Prerequisites
- Install Ansible on a local computer.
- Clone the repo of ansible playbooks.
- Pull the latest updates from the playbook repo (example commands for all three steps are sketched below).
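A rough sketch of these steps, where `<playbook-repo-url>` and `<playbook-repo-dir>` are hypothetical placeholders for the actual Ansible playbooks repository and its local checkout:

# Install Ansible locally (pip shown here; an OS package manager also works)
pip install ansible
# Clone the ansible playbooks repo
git clone <playbook-repo-url> <playbook-repo-dir>
# Pull the latest updates before running any playbook
cd <playbook-repo-dir>
git pull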
Reboot a node due to GPU failure
Special instruction to reboot Ceph nodes
If multiple nodes within a Ceph cluster need to be rebooted, reboot only one node at a time so that enough OSDs stay up to maintain redundancy.
Run this command to enter the rook-ceph-tools pod shell, where `<namespace>` is one of (`rook`, `rook-east`, `rook-pacific`, `rook-haosu`, `rook-suncave`):
kubectl exec -it -n <namespace> $(kubectl get pods -n <namespace> --selector=app=rook-ceph-tools --output=jsonpath={.items..metadata.name}) -- bash
In the pod shell, run `ceph health detail`.
Wait for `[WRN] OSD_DOWN: 1 osds down` to disappear from the `ceph health detail` output before rebooting the next node.
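A small loop like the following can be run inside the rook-ceph-tools pod shell to wait until no OSDs are reported down (a sketch; the 30-second polling interval is an arbitrary choice, not part of the original procedure):

# Poll cluster health until the OSD_DOWN warning disappears
while ceph health detail | grep -q OSD_DOWN; do
  echo "OSDs still down, waiting..."
  sleep 30
done
echo "No OSDs down; safe to reboot the next Ceph node."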