Sunday, November 25, 2012

Disaster Recovery

So when I set up my shiny new Fedora 16 box, I was careful to set up a pair of disks as RAID 1.  Any disk failure and I would be covered, right?  Wrong.  What I neglected to do was put the GRUB2 boot loader in the MBR (Master Boot Record) of my second disk.  So, even though I had all my data mirrored onto the second disk, if the first disk failed, my system would not boot because the boot loader was only in the MBR of the first disk.  Did I say if the first disk failed?  I meant to say when the first disk failed.  And sure enough, it failed.  Unbootable system.  I tried swapping disks around, etc., but with no boot loader on the second disk there was no boot.  Here's what I did to recover.

The scenario: 2 disks, sda and sdb, set up as a RAID 1 mirror.  Only sda has GRUB2 in its MBR.  sda failed, leaving me with a good sdb, but the system could not be booted.

First, I got a trusty Knoppix boot DVD.  At first I'd tried to use the Knoppix CD, but it only boots in 32-bit mode and I couldn't chroot to my installed system (on the second disk) because that system is 64-bit.  Knoppix 7.0.4 only has grub and not grub2, so I needed to use the grub2 code that's on my working sdb.  So, Knoppix DVD, available at http://www.knopper.net/knoppix/index-en.html.  I tried messing around creating a boot USB on my Mac using unetbootin, etc., but ultimately wasn't able to create anything that booted.  (Had to use my Mac because obviously my Linux box was not working...)

Once I had a working Knoppix DVD, I booted from it (boot parameter "knoppix64") and accessed my working sdb (which Knoppix had brought up as sda, since the failed sda had been removed by this point and disk names are assigned in the order the disks are detected).  Since the disk is a member of an md (software RAID) array, I had to assemble the array before I could mount anything:

mdadm --assemble --auto=yes /dev/md0 /dev/sda2

sda1 is swap.  sda2 is a large partition that holds the RAID member, with LVM on top.  The mdadm --assemble brought up the md device md0 with member device sda2.  mdadm was kind enough to bring up the array even though it was missing half of the mirror.  So nice.
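
To confirm that the array really is running on half a mirror, you can look at /proc/mdstat.  Here is a minimal sketch of a helper that lists degraded arrays.  The function name degraded_arrays is made up for illustration, and it assumes the usual /proc/mdstat layout, where a missing member shows up as an underscore in the status brackets (for example [_U]).

```shell
#!/bin/sh
# Print the name of every md array that is missing a member.
# Assumption: /proc/mdstat marks a missing member with "_" in the status
# brackets, e.g. [_U] or [U_], while a healthy two-disk mirror shows [UU].
# degraded_arrays is a hypothetical helper name, not a standard tool.
degraded_arrays() {
    awk '
        /^md[0-9]+ :/      { name = $1 }    # remember the array being described
        /\[[U_]*_[U_]*\]/  { print name }   # status brackets contain a "_"
    '
}
```

On the rescue system it would be used as "degraded_arrays < /proc/mdstat".  The command "mdadm --detail /dev/md0" gives a more verbose view of the same state.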

Next, to mount the filesystems:

mkdir /media/lvm
mount /dev/vg00/lvslash /media/lvm
mount /dev/vg00/lvboot /media/lvm/boot

These commands mount the LVM logical volumes (use the "pvs", "vgs", and "lvs" commands, with no arguments, to display the various LVM components).  The original / and /boot are now mounted under /media/lvm.
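
If the logical volumes do not show up under /dev/vg00 after the array is assembled, the volume group may need to be activated first.  I can't say whether Knoppix does this automatically, so treat the following as a sketch under that assumption.  The volume group name vg00 comes from my setup, the helper names are made up, and DRY_RUN defaults to 1 so the commands are only printed.

```shell
#!/bin/sh
# Sketch: scan for and activate an LVM volume group on the rescue system.
# Assumption: the volume group is called vg00, as in this post.
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

activate_lvm() {
    vg="$1"
    run vgscan               # look for volume groups on all block devices
    run vgchange -ay "$vg"   # activate the logical volumes in that group
    run lvs "$vg"            # list them, to see the names to mount
}
```

Calling "activate_lvm vg00" prints the three commands; run it as root with DRY_RUN=0 to execute them.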

Now at this point I wanted to run grub2 to get it to re-install the MBR.  But no amount of messing around with various components or options could get it to run, see things, and install the MBR.  It kept complaining that it couldn't find /boot/something/grub2 stage1.  But the file was there at the path it was looking at.  My only guess was that it was trying to find it by reading the filesystem directly and couldn't, since the LVM or grub on Knoppix was different.

Regardless, I needed to do a chroot to fake out the system into thinking it was running for real on my disks and not Knoppix.  The chroot was easy enough, but then grub2 would refuse to run because it couldn't find running device info (which is on /dev but not the chroot /dev).  ARGH!  The fix is to remount /dev and /proc from the real system onto the chroot system.  Surprisingly this just works:

mount --bind /dev/ /media/lvm/dev
mount --bind /proc /media/lvm/proc

That done, I can chroot to the filesystem:

chroot /media/lvm

and then run grub2 to install the new MBR:

grub2-install --recheck /dev/sda
grub2-install /dev/sda
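
The bind-mount, chroot, and install steps above can be sketched as a single helper that also cleans up after itself.  This is not what I actually ran, just the same commands wrapped in a function.  The names reinstall_mbr and run are made up, /media/lvm and /dev/sda are the paths from my setup, and DRY_RUN defaults to 1 so nothing is touched until you change it.

```shell
#!/bin/sh
# Sketch of the chroot repair as one function.
# Assumptions: the installed system is mounted at the path given as $1 and
# the disk given as $2 should receive the boot code (values from this post:
# /media/lvm and /dev/sda). DRY_RUN=1 (the default) only prints the commands.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

reinstall_mbr() {
    root="$1"
    disk="$2"

    # Make the live system's device and process info visible in the chroot.
    run mount --bind /dev "$root/dev" || return 1
    run mount --bind /proc "$root/proc" || { run umount "$root/dev"; return 1; }

    # Run the installed system's own grub2-install inside the chroot.
    run chroot "$root" grub2-install --recheck "$disk"
    status=$?

    # Undo the bind mounts in reverse order, even if the install failed.
    run umount "$root/proc"
    run umount "$root/dev"
    return "$status"
}
```

"reinstall_mbr /media/lvm /dev/sda" prints the five commands in order.  Some setups also bind-mount /sys into the chroot; I didn't need it, so it's left out here.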

After this was done, I exited the chroot, unmounted the filesystems, and rebooted from the new MBR.  Like magic, I'm good to go; the working disk is fine.  Next up: get a replacement disk, set it up for the mirror, and resynchronize across the two disks.  And also put the MBR on the second disk :)
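
For that last step, here is a minimal sketch of installing the boot code on every disk in the mirror so that either one can boot on its own.  The disk names /dev/sda and /dev/sdb are assumptions based on my two-disk setup, the helper names are made up, and DRY_RUN defaults to 1 so the commands are only printed.

```shell
#!/bin/sh
# Sketch: put the GRUB2 boot code in the MBR of every disk in the mirror.
# Assumption: the mirror members live on the disks passed as arguments
# (for this post, /dev/sda and /dev/sdb).
# DRY_RUN=1 (the default here) prints the commands instead of running them.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

install_boot_code() {
    for disk in "$@"; do
        run grub2-install "$disk"
    done
}
```

"install_boot_code /dev/sda /dev/sdb" shows what would run; as root on the repaired system, DRY_RUN=0 does it for real.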