Ganeti, dealing with node failure

This morning at Mozilla, we had our first Ganeti node failure, which meant we got to learn how to deal with one! So far all of our Ganeti clusters have been 2-node clusters, which keeps things relatively simple but adds an extra quirk when bringing everything back up. Thankfully we used the iallocator so as not to overallocate nodes on the cluster. Here’s how the process went.

  • Get paged
  • Stop panicking
  • Log into the broken node to verify that it actually died. If its VMs are still running correctly and the problem is only networking, bringing them up again on another node puts the cluster into a bad state known as ‘split brain’, which is difficult to recover from. Make sure the dead node is truly dead.
  • If there is more than 1 node left, try logging into the cluster IP (for example kvm.infra.scl1.mozilla.com rather than the node address kvm1.infra.scl1.mozilla.com)
  • If there is only 1 remaining node, voting won’t work, so ganeti-masterd will have to be started by hand:


root@vm1-1:~# ganeti-masterd --no-voting

Once the master node is online, set the failed node to offline mode:

root@vm1-1:~# gnt-node modify --offline yes vm1-2
Fri Apr 22 07:44:13 2011 - WARNING: Communication failure to node vm1-2.labs.sjc1.mozilla.com: Connection failed (113: No route to host)
Modified node vm1-2
- offline -> True
- master_candidate -> auto-demotion due to offline

Afterwards, assess the situation by running gnt-cluster verify. You should see it throw MANY errors about instances living on the offline node:

root@vm1-1:~# gnt-cluster verify
Fri Apr 22 07:44:42 2011 * Verifying global settings
Fri Apr 22 07:44:42 2011 * Gathering data (2 nodes)
Fri Apr 22 07:44:42 2011 * Verifying node status
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 0 of instance vm1.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 1 of instance vm2.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 2 of instance vm3.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 3 of instance vm4.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 4 of instance vm5.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 5 of instance vm6.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 6 of instance vm7.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 8 of instance vm8.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 9 of instance vm9.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 10 of instance vm10.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 12 of instance vm11.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 13 of instance vm12.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 14 of instance vm13.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 15 of instance vm14.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 16 of instance vm15.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 17 of instance vm16.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 18 of instance vm17.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 19 of instance vm18.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 20 of instance vm19.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 21 of instance vm20.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 22 of instance vm21.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 23 of instance vm22.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 24 of instance vm23.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-1.labs.sjc1.mozilla.com: drbd minor 25 of instance vm24.vm1.labs.sjc1.mozilla.com is not active
Fri Apr 22 07:44:42 2011 * Verifying instance status
Fri Apr 22 07:44:42 2011 - ERROR: instance vm1.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm2.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm3.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm4.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm5.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm6.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm7.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm8.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm9.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm10.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm11.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm12.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm13.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm14.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm15.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm16.vm1.labs.sjc1.mozilla.com: instance not running on its primary node vm1-1.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm17.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm18.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm19.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm20.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm21.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm22.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm23.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm24.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm25.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm26.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 - ERROR: instance vm27.vm1.labs.sjc1.mozilla.com: instance lives on offline node(s) vm1-2.labs.sjc1.mozilla.com
Fri Apr 22 07:44:42 2011 * Verifying orphan volumes
Fri Apr 22 07:44:42 2011 * Verifying orphan instances
Fri Apr 22 07:44:42 2011 * Verifying N+1 Memory redundancy
Fri Apr 22 07:44:42 2011 - ERROR: node vm1-2.labs.sjc1.mozilla.com: not enough memory on to accommodate failovers should peer node vm1-1.labs.sjc1.mozilla.com fail
Fri Apr 22 07:44:42 2011 * Other Notes
Fri Apr 22 07:44:42 2011 - NOTICE: 1 offline node(s) found.
Fri Apr 22 07:44:42 2011 * Hooks Results
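
With this many lines, a rough tally can be easier to read than the raw output. The following is a minimal sketch rather than part of the procedure we actually ran: summarize_verify is a hypothetical helper name, and it assumes the verify output keeps the "- ERROR: node" and "- ERROR: instance" prefixes seen above.

```shell
#!/usr/bin/env bash
# Hypothetical helper: tally the ERROR lines of `gnt-cluster verify`.
# Reads the verify output on stdin and prints one count per category.
summarize_verify() {
  awk '
    / - ERROR: node /     { node++ }   # per-node problems (e.g. inactive DRBD minors)
    / - ERROR: instance / { inst++ }   # per-instance problems (e.g. offline primary)
    END {
      printf "node errors: %d\n", node
      printf "instance errors: %d\n", inst
    }
  '
}

# Intended use on the master node (not run here):
#   gnt-cluster verify 2>&1 | summarize_verify
```

The counts only say how loud the cluster is being; the individual lines still matter when something other than the offline node is wrong.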

Great. It looks pissed, because Mister Ganeti’s left arm just got cut off. Let’s get a more concise look at the state of things:


root@vm1-1:~# gnt-instance list
Instance Hypervisor OS Primary_node Status Memory
vm1.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-1.labs.sjc1.mozilla.com running 512M
vm2.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm3.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm4.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm5.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm6.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm7.vm1.labs.sjc1.mozilla.com kvm image+lucid_django vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm8.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm9.vm1.labs.sjc1.mozilla.com kvm debootstrap+default vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm10.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm11.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm12.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm13.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-1.labs.sjc1.mozilla.com ADMIN_down -
vm14.vm1.labs.sjc1.mozilla.com kvm debootstrap+default vm1-1.labs.sjc1.mozilla.com ADMIN_down -
vm15.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm16.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-1.labs.sjc1.mozilla.com ERROR_down -
vm17.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm18.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm19.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm20.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm21.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm22.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm23.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm24.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm25.vm1.labs.sjc1.mozilla.com kvm image+lucid_lpk vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
vm26.vm1.labs.sjc1.mozilla.com kvm image+lucid_lamp vm1-2.labs.sjc1.mozilla.com ERROR_nodeoffline (node down)
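
To see at a glance how many instances are in each state, the listing can be grouped by its Status column. This is only a sketch: count_by_status is a hypothetical helper name, and it assumes the default column layout shown above, where the status is the fifth whitespace-separated field and the first line is the header.

```shell
#!/usr/bin/env bash
# Hypothetical helper: count instances per status.
# Reads default `gnt-instance list` output on stdin (header on line 1,
# status in column 5) and prints "STATUS COUNT" lines in sorted order.
count_by_status() {
  awk 'NR > 1 && NF >= 5 { count[$5]++ }
       END { for (s in count) print s, count[s] }' | LC_ALL=C sort
}

# Intended use on the master node (not run here):
#   gnt-instance list | count_by_status
```

Everything reported as ERROR_nodeoffline is a candidate for failover; ADMIN_down instances were stopped on purpose and can stay that way.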

We can bring up one instance at a time, like this:


root@vm1-1:~# gnt-instance failover --ignore-consistency vm26
Failover will happen to image vm26. This requires a shutdown of the
instance. Continue?
y/[n]/?: y
Fri Apr 22 07:47:18 2011 * checking disk consistency between source and target
Fri Apr 22 07:47:18 2011 - WARNING: Can't find disk on node vm1-1.labs.sjc1.mozilla.com
Fri Apr 22 07:47:18 2011 * shutting down instance on source node
Fri Apr 22 07:47:18 2011 - WARNING: Could not shutdown instance www1.vm1.labs.sjc1.mozilla.com on node vm1-2.labs.sjc1.mozilla.com. Proceeding anyway. Please make sure node vm1-2.labs.sjc1.mozilla.com is down. Error details: Node is marked offline
Fri Apr 22 07:47:18 2011 * deactivating the instance's disks on source node
Fri Apr 22 07:47:18 2011 - WARNING: Could not shutdown block device disk/0 on node vm1-2.labs.sjc1.mozilla.com: Node is marked offline
Fri Apr 22 07:47:18 2011 * activating the instance's disks on target node
Fri Apr 22 07:47:18 2011 - WARNING: Could not prepare block device disk/0 on node vm1-2.labs.sjc1.mozilla.com (is_primary=False, pass=1): Node is marked offline
Fri Apr 22 07:47:19 2011 * starting the instance on the target node

But doing that for each instance would be horrifically slow, so we’ll write a quick and dirty Bash for loop (-f skips the confirmation prompt):

root@vm1-1:~# for i in $(gnt-instance list | grep ERROR_nodeoffline | cut -f1 --delimiter=" "); do gnt-instance failover --ignore-consistency -f "$i"; done

Fri Apr 22 07:49:04 2011 * checking disk consistency between source and target
Fri Apr 22 07:49:04 2011 - WARNING: Can't find disk on node vm1-1.labs.sjc1.mozilla.com
Fri Apr 22 07:49:04 2011 * shutting down instance on source node
Fri Apr 22 07:49:04 2011 - WARNING: Could not shutdown instance vm2.vm1.labs.sjc1.mozilla.com on node vm1-2.labs.sjc1.mozilla.com. Proceeding anyway. Please make sure node vm1-2.labs.sjc1.mozilla.com is down. Error details: Node is marked offline
Fri Apr 22 07:49:04 2011 * deactivating the instance's disks on source node
Fri Apr 22 07:49:04 2011 - WARNING: Could not shutdown block device disk/0 on node vm1-2.labs.sjc1.mozilla.com: Node is marked offline
Fri Apr 22 07:49:04 2011 * activating the instance's disks on target node
Fri Apr 22 07:49:05 2011 - WARNING: Could not prepare block device disk/0 on node vm1-2.labs.sjc1.mozilla.com (is_primary=False, pass=1): Node is marked offline
Fri Apr 22 07:49:05 2011 * starting the instance on the target node

 

All your instances should be running now!
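
One way to confirm that nothing was missed is to list whatever is still stuck on the offline node. The snippet below is a hedged sketch, not what we ran that morning: instances_on_offline_node is a hypothetical helper name, and it assumes gnt-instance list is asked for just the name and status fields, one instance per line.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the names of instances whose status is
# ERROR_nodeoffline. Reads "name status" pairs on stdin, one per line.
instances_on_offline_node() {
  awk '$2 == "ERROR_nodeoffline" { print $1 }'
}

# Intended use on the master node (not run here):
#   left=$(gnt-instance list --no-headers -o name,status | instances_on_offline_node)
#   if [ -z "$left" ]; then
#     echo "nothing left on the offline node"
#   else
#     printf 'still needs failover: %s\n' $left
#   fi
```

Matching on the status field exactly, instead of grepping the whole line, avoids picking up an instance just because the string appears elsewhere in its row.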
