Maverick

Ganeti, dealing with node failure

  • Get paged
  • Stop panicking
  • Log into the broken node to verify that it actually died. If its VMs are still running correctly and the failure is only a networking problem, bringing those VMs up again elsewhere creates a bad state known as ‘Split Brain’. This is difficult to recover from, so confirm that the dead node is truly dead before you proceed.
  • If more than 1 node is left, try logging into the cluster IP (kvm.infra.scl1.mozilla.com, as opposed to an individual node such as kvm1.infra.scl1.mozilla.com)
  • If there is only 1 remaining node, voting won’t work, so ganeti-masterd will have to be started by hand:
root@vm1-1:~# ganeti-masterd --no-voting

Once the master node is online, the failed node needs to be set to Offline mode.
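
As a rough sketch of that step, the small wrapper below prints the command it would run on the master before anything is changed. The function name mark_offline, the DRY_RUN variable, and the node name node2.example.com are placeholders invented for this illustration. The underlying call is assumed to be gnt-node modify with its --offline option, so check it against the gnt-node man page for your installed Ganeti version.

```shell
#!/bin/sh
# Hypothetical helper: mark a failed Ganeti node as offline.
# Assumption: run on the master node, where gnt-node is available.
mark_offline() {
    node="$1"
    if [ -z "$node" ]; then
        echo "usage: mark_offline NODE" >&2
        return 1
    fi
    if [ "${DRY_RUN:-1}" = "1" ]; then
        # Default: only print the command so it can be reviewed first.
        echo "gnt-node modify --offline yes $node"
    else
        # DRY_RUN=0: actually flag the node as offline.
        gnt-node modify --offline yes "$node"
    fi
}

# Placeholder node name; substitute the node that really failed.
mark_offline node2.example.com
```

With DRY_RUN left unset the script only echoes the command. Running it with DRY_RUN=0 executes it, which should only happen after the node has been confirmed dead as described in the list above.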