Ganeti, dealing with node failure
- Get paged
- Stop panicking
- Try to log into the broken node to verify that it actually died. If its VMs are still running correctly and the problem is only networking, bringing them up again elsewhere puts the cluster into a bad state known as ‘Split Brain’. This is difficult to recover from, so verify that the dead node is truly dead before going further.
- If more than 1 node is left, try logging into the cluster IP (kvm.infra.scl1.mozilla.com, as opposed to an individual node such as kvm1.infra.scl1.mozilla.com)
- If there is only 1 remaining node, voting won’t work, so ganeti-masterd will have to be started by hand:
root@vm1-1:~# ganeti-masterd --no-voting
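For the "verify it actually died" step in the checklist above, a rough reachability check can be scripted. The following is only a sketch: the function names `classify_node` and `probe_node` are hypothetical helpers, not part of Ganeti, and the timeouts are arbitrary assumptions. A node that is unreachable from your workstation may still be running its VMs behind a network fault, so treat the result as a hint and confirm over an out-of-band console where one exists.

```shell
#!/bin/sh
# Hypothetical helpers, not part of Ganeti.

# classify_node PING_STATUS SSH_STATUS
# Takes the exit statuses of a ping probe and an SSH probe (0 = success)
# and prints a one-word summary.
classify_node() {
    p_rc=$1
    s_rc=$2
    if [ "$p_rc" -eq 0 ] && [ "$s_rc" -eq 0 ]; then
        echo "reachable"      # node answers: do NOT restart its VMs elsewhere
    elif [ "$p_rc" -ne 0 ] && [ "$s_rc" -ne 0 ]; then
        echo "unreachable"    # possibly dead, possibly only cut off from us
    else
        echo "partial"        # mixed result: investigate by hand
    fi
}

# probe_node HOSTNAME
# Runs both probes against the given host and prints the classification.
# The timeouts (2 s per ping reply, 5 s SSH connect) are assumed values.
probe_node() {
    node=$1
    ping -c 3 -W 2 "$node" >/dev/null 2>&1
    ping_status=$?
    ssh -o ConnectTimeout=5 -o BatchMode=yes "root@$node" true >/dev/null 2>&1
    ssh_status=$?
    classify_node "$ping_status" "$ssh_status"
}
```

After sourcing the file, run `probe_node` followed by the failed node's hostname. Only an "unreachable" result that is also confirmed by other means makes it safe to continue with the failover.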
Once your master node is online, we need to set the failed node to Offline mode.