Ganeti, dealing with node failure

  • Get paged
  • Stop panicking
  • Log into the broken node to verify that it actually died. If its VMs are still running correctly and the problem is only networking, bringing them up again on another node produces a bad state known as ‘Split Brain’, where two copies of the same instance run at once. This is difficult to recover from, so make sure the dead node is truly dead.
  • If more than one node is left, try logging into the cluster IP (kvm.infra.scl1.mozilla.com rather than kvm1.infra.scl1.mozilla.com)
  • If there is only 1 remaining node, voting won’t work, so ganeti-masterd will have to be started by hand:
root@vm1-1:~# ganeti-masterd --no-voting
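
The "verify it actually died" step above can be partly scripted before anyone touches the cluster. What follows is only a rough sketch, not part of the original procedure: the function name node_looks_dead and the PROBE_CMD override are made up for illustration, and the default probe assumes the Linux iputils ping. A node that does not answer may still be alive behind a network partition, so this does not replace checking the console or power controller.

```shell
# Hypothetical helper: report whether a node fails a reachability probe.
# PROBE_CMD is an illustrative override; the default (3 pings, 2 s timeout
# each) assumes Linux iputils ping.
PROBE_CMD=${PROBE_CMD:-"ping -c 3 -W 2"}

node_looks_dead() {
    node=$1
    # Word-splitting of PROBE_CMD is intentional so it can carry options.
    if $PROBE_CMD "$node" >/dev/null 2>&1; then
        echo "$node still answers; do NOT fail over yet"
        return 1
    fi
    echo "$node did not answer; confirm via console before failing over"
    return 0
}
```

For example, `node_looks_dead vm1-2.labs.sjc1.mozilla.com` returns non-zero as long as the node still answers, which is the case where failing over would risk Split Brain.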

Once your master node is online, we need to set the failed node to Offline mode.

root@vm1-1:~# gnt-node modify --offline yes vm1-2
Fri Apr 22 07:44:13 2011 - WARNING: Communication failure to node vm1-2.labs.sjc1.mozilla.com: Connection failed (113: No route to host)
Modified node vm1-2
- offline -> True
- master_candidate -> auto-demotion due to offline

Afterwards, assess the situation by running gnt-cluster verify. You should see it throw MANY errors about inactive DRBD devices and instances living on the offline node:

root@vm1-1:~# gnt-cluster verify
Fri Apr 22 07:44:42 2011 * Verifying global settings
Fri Apr 22 07:44:42 2011 * Gathering data (2 nodes)
Fri Apr 22 07:44:42 2011 * Verifying node status
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 0 of instance vm1.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 1 of instance vm2.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 2 of instance vm3.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 3 of instance vm4.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 4 of instance vm5.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 5 of instance vm6.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 6 of instance vm7.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 8 of instance vm8.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 9 of instance vm9.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 10 of instance vm10.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 12 of instance vm11.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 13 of instance vm12.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 14 of instance vm13.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 15 of instance vm14.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 16 of instance vm15.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 17 of instance vm16.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 18 of instance vm17.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 19 of instance vm18.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 20 of instance vm19.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 21 of instance vm20.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 22 of instance vm21.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 23 of instance vm22.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 24 of instance vm23.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 25 of instance vm24.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 * Verifying instance status
Fri Apr 22 07:44:42 2011 - ERROR: instance vm1.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm2.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm3.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm4.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm5.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm6.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm7.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm8.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm9.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm10.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm11.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm12.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm13.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm14.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm15.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm16.vm1.labs.sjc1.mozilla.com: instance not running on its primary node vm1-1.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm17.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm18.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm19.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm20.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm21.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm22.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm23.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm24.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm25.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm26.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm27.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 * Verifying orphan volumes
Fri Apr 22 07:44:42 2011 * Verifying orphan instances
Fri Apr 22 07:44:42 2011 * Verifying N+1 Memory redundancy
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-2.labs.sjc1.mozilla.com: not enough memory on to accommodate failovers should peer node vm1-1.labs.sjc1.mozilla.com fail
Fri Apr 22 07:44:42 2011 * Other Notes
Fri Apr 22 07:44:42 2011 - NOTICE: 1 offline node(s) found.
Fri Apr 22 07:44:42 2011 * Hooks Results
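
With this many lines it can help to collapse the output into a count of errors per subject. A minimal sketch, assuming the line format shown above ("- ERROR: node ..." and "- ERROR: instance ..."); summarize_verify is a hypothetical helper, not a Ganeti command:

```shell
# Hypothetical helper: count ERROR lines from `gnt-cluster verify` output,
# grouped by what they refer to ("node" or "instance").
summarize_verify() {
    awk '/ - ERROR: (node|instance) / {
             sub(/^.* - ERROR: /, "")   # drop the timestamp and ERROR marker
             count[$1]++                # $1 is now "node" or "instance"
         }
         END { for (k in count) print k, count[k] }' | sort
}
```

It reads the verify output on standard input, for example `gnt-cluster verify 2>&1 | summarize_verify`.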

Great. It looks pissed, because Mister Ganeti’s left arm just got cut off. Let’s get a more concise look at the state of things:

root@vm1-1:~# gnt-instance list
Instance Hypervisor OS Primary_node Status Memory
vm1.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-1.labs.sjc1.mozilla.com running 512M
vm2.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm3.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm4.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm5.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm6.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm7.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm8.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm9.vm1.labs.sjc1.mozilla.com kvm debootstrap+default vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm10.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm11.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm12.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm13.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-1.labs.sjc1.mozilla.com ADMIN_down -
vm14.vm1.labs.sjc1.mozilla.com kvm debootstrap+default vm1-1.labs.sjc1.mozilla.com ADMIN_down -
vm15.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm16.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-1.labs.sjc1.mozilla.com ERROR_down -
vm17.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm18.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm19.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm20.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm21.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm22.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm23.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm24.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm25.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm26.vm1.labs.sjc1.mozilla.com kvm image+lucid_lamp vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
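
To see at a glance how many instances are in each state, the listing can be tallied by its Status column. This is a sketch under the assumption that the default column order shown above is in use (Status is the fifth column); count_by_status is a hypothetical name:

```shell
# Hypothetical helper: tally instances per status from `gnt-instance list`.
# Assumes the default columns shown above, with Status in the 5th position.
count_by_status() {
    awk 'NR > 1 && NF >= 5 { count[$5]++ }   # NR > 1 skips the header row
         END { for (s in count) print s, count[s] }' | sort
}
```

Usage would be `gnt-instance list | count_by_status`; the ERROR_nodeoffline count is the number of instances that still need a failover.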

We can bring up one instance at a time, like this:

root@vm1-1:~# gnt-instance failover --ignore-consistency vm26
Failover will happen to image vm26. This requires a shutdown of the
instance. Continue?
y/[n]/?: y
Fri Apr 22 07:47:18 2011 * checking disk consistency between source and target
Fri Apr 22 07:47:18 2011 - WARNING: Can't find disk on node vm1-1.labs.sjc1.mozilla.com
Fri Apr 22 07:47:18 2011 * shutting down instance on source node
Fri Apr 22 07:47:18 2011 - WARNING: Could not shutdown instance www1.vm1.labs.sjc1.mozilla.com on node vm1-2.labs.sjc1.mozilla.com. Proceeding anyway. Please make sure node vm1-2.labs.sjc1.mozilla.com is down. Error details: Node is marked offline
Fri Apr 22 07:47:18 2011 * deactivating the instance's disks on source node
Fri Apr 22 07:47:18 2011 - WARNING: Could not shutdown block device disk/0 on node vm1-2.labs.sjc1.mozilla.com: Node is marked offline
Fri Apr 22 07:47:18 2011 * activating the instance's disks on target node
Fri Apr 22 07:47:18 2011 - WARNING: Could not prepare block device disk/0 on node vm1-2.labs.sjc1.mozilla.com (is_primary=False, pass=1): Node is marked offline
Fri Apr 22 07:47:19 2011 * starting the instance on the target node

But doing that for each instance would be horrifically slow, so we’ll write a quick and dirty Bash for loop:

root@vm1-1:~# for i in $(gnt-instance list|grep ERROR_nodeoffline|cut -f1 --delimiter=' '); do gnt-instance failover --ignore-consistency -f "$i"; done
Fri Apr 22 07:49:04 2011 * checking disk consistency between source and target
Fri Apr 22 07:49:04 2011 - WARNING: Can't find disk on node vm1-1.labs.sjc1.mozilla.com
Fri Apr 22 07:49:04 2011 * shutting down instance on source node
Fri Apr 22 07:49:04 2011 - WARNING: Could not shutdown instance vm2.vm1.labs.sjc1.mozilla.com on node vm1-2.labs.sjc1.mozilla.com. Proceeding anyway. Please make sure node vm1-2.labs.sjc1.mozilla.com is down. Error details: Node is marked offline
Fri Apr 22 07:49:04 2011 * deactivating the instance's disks on source node
Fri Apr 22 07:49:04 2011 - WARNING: Could not shutdown block device disk/0 on node vm1-2.labs.sjc1.mozilla.com: Node is marked offline
Fri Apr 22 07:49:04 2011 * activating the instance's disks on target node
Fri Apr 22 07:49:05 2011 - WARNING: Could not prepare block device disk/0 on node vm1-2.labs.sjc1.mozilla.com (is_primary=False, pass=1): Node is marked offline
Fri Apr 22 07:49:05 2011 * starting the instance on the target node

All your instances should be running now!
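
If you would rather not trust a one-liner during an outage, the same loop can be written so that it reports which failovers did not succeed. This is a sketch, not the procedure used above: failover_offline and the GNT_INSTANCE override are hypothetical names, and it assumes your Ganeti version accepts `gnt-instance list --no-headers -o name,status`.

```shell
# Hypothetical wrapper around the failover loop above.
# GNT_INSTANCE is an illustrative override so the command can be stubbed.
GNT_INSTANCE=${GNT_INSTANCE:-gnt-instance}

failover_offline() {
    failed=0
    # Select only the instances whose status is ERROR_nodeoffline.
    names=$("$GNT_INSTANCE" list --no-headers -o name,status |
            awk '$2 == "ERROR_nodeoffline" { print $1 }')
    for name in $names; do
        echo "failing over $name"
        if ! "$GNT_INSTANCE" failover --ignore-consistency -f "$name"; then
            echo "FAILED: $name" >&2
            failed=$((failed + 1))
        fi
    done
    # Succeed only if every failover succeeded.
    [ "$failed" -eq 0 ]
}
```

Run `failover_offline` on the master node, then check `gnt-instance list` again; any instance reported as FAILED needs to be looked at by hand.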