Dailydave mailing list archives

The Small Company's Guide to Hard Drive Failure and Linux


From: Dave Aitel <dave () immunitysec com>
Date: Thu, 18 Nov 2004 09:49:09 -0500

So I learned a lesson about reliability and I thought I'd share. Recently, the main hard drive, a little 40 gigger that runs www.immunitysec.com, which also happens to be mail.immunitysec.com and dns.immunitysec.com, started to display read errors in the kernel logs (viewable via "dmesg" if you are root). This also caused data corruption in a few cases, and some other badness such as long pauses during writes. Jeremy says that he's never had a hard drive fail on him, but it happens all the time, so it'll probably happen to you, probably the day after you sign a big contract with someone and need to do something other than mess with your hard drive.
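A minimal sketch of pulling the disk complaints out of the kernel log. The function name disk_errors and the pattern list are assumptions for illustration; the exact wording of the messages depends on your kernel and driver, so adjust the patterns to what your own dmesg shows.

```shell
# Print only the kernel log lines that look like disk read errors.
# Reads log text on stdin, so it works with dmesg or a saved log file.
# The patterns are examples of typical IDE error messages, not a complete list.
disk_errors() {
    grep -i -E 'I/O error|UncorrectableError|DriveStatusError|SeekComplete Error'
}

# Typical use, as root:
#   dmesg | disk_errors
```

If this prints anything for the drive you care about, start planning the replacement.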

Once this starts happening, you have a few options. The first one is to use fsck -c to mark blocks bad (this requires you to boot to single user mode). This is a stop-gap measure while you go prepare a new drive, since the bad sectors are a sign that your drive is about to die completely.

So your real option is to back up the hard drive and replace it. It's nearly impossible to maintain a "GOLD" hard drive which you can insta-replace your linux box with. Lots of things happen on a minute-to-minute basis - mail, cvs checkins, log entries, etc. These all need to be backed up and restored perfectly on the new drive. So some downtime is probably necessary.

One might think you could use dd to duplicate your drive. I initially tried this, and my results were not good. I did remember to use dd if=/dev/hda of=/dev/hdc conv=noerror (the noerror flag is important). However, this takes forever and a day. Basically it'll take all night. So be prepared for that, even on a small drive and a fast system. They sell devices that can do it faster, I think, but you don't have one, do you?
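A sketch of that dd run wrapped in a function, with two additions that are not in the command above and should be treated as assumptions: "sync" alongside "noerror", so that a failed read is padded with zeros and everything after a bad sector stays at the same offset, and an explicit block size, since a larger block size generally makes dd faster. The function name clone_disk and the 64k value are made up for illustration.

```shell
# Clone SRC onto DST block by block, continuing past read errors.
# noerror: keep going after a failed read instead of aborting.
# sync:    pad each short or failed read with zeros up to the block size,
#          so data after a bad sector keeps its original offset.
clone_disk() {
    dd if="$1" of="$2" bs=64k conv=noerror,sync
}

# Typical use, as root, with the device names from the text:
#   clone_disk /dev/hda /dev/hdc
```

The block size is a tradeoff: with noerror,sync a read error zero-fills the whole block, so a smaller bs loses less data around each bad sector but runs slower.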

The other issue with dd is that it typically copies every sector on the disk. So you'll need a new disk at least as large as the previous disk. My new disk was about one gig smaller (40.0 gigs instead of 40.9 gigs). This was an annoying problem.
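Before starting a whole-disk copy, it is worth comparing the two drives' sizes in bytes rather than trusting the label on the box. A small sketch, assuming the sizes come from something like "blockdev --getsize64" run as root; the function name fits_on is hypothetical.

```shell
# Succeed only if a whole-disk image of the source fits on the destination.
# Both arguments are sizes in bytes, for example:
#   fits_on "$(blockdev --getsize64 /dev/hda)" "$(blockdev --getsize64 /dev/hdc)"
fits_on() {
    src_bytes=$1
    dst_bytes=$2
    [ "$dst_bytes" -ge "$src_bytes" ]
}
```

If the check fails, a sector-for-sector copy would run off the end of the new drive, and a file-level copy is the better route.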

So instead of that, one nice option is just to get your new drive, make the partitions manually with fdisk on it to replicate the original drive, and then use tar to copy the contents across. I used knoppix heavily here.

The command for tar is:
cd /mnt/hda1 # drive to copy from
tar cf - . | (cd ../hdd1 && tar xpf -) # && stops the extract if the cd fails; p keeps permissions
This will maintain the users and permissions and stuff. After you're done copying all the partitions (you don't do the swap partitions, obviously), you then double check them to make sure they're right. Mounting them and doing a df -k is useful so you can make sure they look the same. (This method will allow you to expand or resize partitions as well, and is faster than dd, although still slow).
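The copy and the double-check can be sketched as a pair of shell functions. The names copy_tree and same_paths are made up for illustration, and the check is deliberately cheap: it compares path names only, not file contents, so treat it as a first sanity check next to df -k rather than proof of a good copy.

```shell
# Copy everything under SRC to DST, preserving permissions.
# Run as root so that file ownership can be restored as well.
copy_tree() {
    (cd "$1" && tar cf - .) | (cd "$2" && tar xpf -)
}

# Succeed only if both trees contain the same set of paths.
# This compares names only, not file contents.
same_paths() {
    a=$(cd "$1" && find . | sort)
    b=$(cd "$2" && find . | sort)
    [ "$a" = "$b" ]
}

# Typical use, as root, one partition at a time:
#   copy_tree /mnt/hda1 /mnt/hdd1 && same_paths /mnt/hda1 /mnt/hdd1
```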

The next step is to re-lilo (or grub) your hd. (hahaha, that sentence made no sense, but bear with me here). Anyways, you want to use the lilo on your hd, not the lilo that comes with knoppix, which doesn't seem to work. The trick to this is to use lilo -r /mnt/hda1 (your root partition, which has /etc/ and /boot/ on it). However, this won't work with knoppix, since the default mount points have the "nodev" option set. You'll need to remount them with the dev option set before you can run lilo on them.
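The remount-then-lilo sequence can be sketched like this. The function name relilo and the DRY_RUN switch are assumptions added for illustration; with DRY_RUN=1 the commands are only printed, which lets you review them before running anything as root.

```shell
# Reinstall LILO from the copied root filesystem mounted at ROOT.
# Step 1: remount with "dev", since Knoppix mounts with "nodev" and lilo
#         needs working device nodes under ROOT/dev.
# Step 2: lilo -r chroots into ROOT and uses its /etc/lilo.conf and /boot.
# With DRY_RUN=1 the commands are echoed instead of executed.
relilo() {
    root=$1
    run=${DRY_RUN:+echo}
    $run mount -o remount,dev "$root" && $run lilo -r "$root"
}

# Typical use, as root from Knoppix:
#   DRY_RUN=1; relilo /mnt/hda1    # print the commands
#   DRY_RUN=;  relilo /mnt/hda1    # actually run them
```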

After that, the drive should be bootable. I'd move it to the "master" drive, if you haven't already, and test that part out.

One thing I did that you might also do is just go to the co-lo (your hardware IS at a co-lo, right?), take the drive out, and bring it home. Like many cheap boxes, the box in the co-lo didn't even have a cdrom, and wasn't the fastest box. Bringing it home actually saved time in the long run, since I could use a real desktop (with swappable drive bays - always get those) to do the work. You'll want to get a few spare drives, since the first drive I tried to restore onto was bad as well. This is an important note - never buy recertified drives. Always get spankin' new drives. It might be fun to do a strings on some of those recertified drives, but I didn't have time.

One thing my lilo did that was weird was rewrite the fstab to use "LABEL=/" instead of /dev/hda1. If you happen to be hosted at Pilosoft (or another co-lo that is run by someone on the local linux users group list - very good idea!) they might jump in and save your butt when you load it up and it doesn't work and you're too tired to figure out why.

The next step after doing all this is typically to make a plan that involves not having to ever do this again. For those of you not in the know - you want a hardware-supported (get a good modern motherboard) RAID-1 solution and you want to be able to swap out one of your two mirrored drives when Linux tells you that one is bad. You also want to have some sort of backup solution running (of course), and you want to have a secondary DNS server and a backup machine somewhere in another state (or country) that can take over if your main co-lo goes under or something. Something that can provide basic mail and web services is nice. It might be good to hire an admin who is not you.

It's not uncommon for a linux machine not to work properly when you reboot. This is because it's probably been a few years since you rebooted, and you probably redid a lot of libraries in the meantime, some of which aren't in the right places. So after you do all this, it's good to test out all your services and make sure they are, in fact, doing what you think they are. Your co-lo (whom you bought the computer from, most likely) often provides a guarantee of hard drives (Pilosoft does). Don't save the 50 bucks by taking them up on this, since that bad hard drive still has all your corporate data on it. You should be getting swamped with email now, since your mail was down and now it's up and the interweb is resending things for you.

Oh, and always run grsecurity kernels. SELinux is just a pale imitation. Recently grsecurity has added things like brute force prevention, which detects exploit attempts and stops them. I assume he's doing a check to see if eip is pointing to a writable page, and if so, counting down from 5 or something, and if 0, then not allowing a fork or exec. Either way, it's a good thing.

Dave Aitel
Immunity, Inc.



_______________________________________________
Dailydave mailing list
Dailydave () lists immunitysec com
https://lists.immunitysec.com/mailman/listinfo/dailydave

