Dailydave mailing list archives

The Small Company's Guide to Hard Drive Failure and Linux

From: Dave Aitel <dave () immunitysec com>
Date: Thu, 18 Nov 2004 09:49:09 -0500

So I learned a lesson about reliability and I thought I'd share. Recently, the main hard drive, a little 40 gigger that runs www.immunitysec.com, which also happens to be mail.immunitysec.com and dns.immunitysec.com, started to display read errors in the kernel logs (viewable via "dmesg" if you are root). This also caused data corruption in a few cases, and some other badness such as long pauses during writes. Jeremy says that he's never had a hard drive fail on him, but it happens all the time, so it'll probably happen to you, probably the day after you sign a big contract with someone and need to do something other than mess with your hard drive.
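A rough sketch of how you might keep an eye on those kernel messages. The function name count_disk_errors and the list of patterns are assumptions made up for illustration; real drivers word their errors in many different ways, so treat the pattern list as a starting point only.

```shell
# Count kernel log lines that look like disk read trouble.
# Reads log text on stdin, so it works with dmesg output or a saved log.
# The patterns below are illustrative guesses, not a complete list.
count_disk_errors() {
    # grep -c prints the number of matching lines; "|| true" keeps the
    # exit status at 0 when that number is zero.
    grep -ciE 'i/o error|read error|medium error|seekcomplete error' || true
}

# Typical use, as root:
#   dmesg | count_disk_errors
```

Anything above zero is worth a look with plain dmesg before deciding what to do next.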

Once this starts happening, you have a few options. The first one is to use fsck -c to mark blocks bad (this requires you to boot to single user mode). This is a stop-gap measure while you go prepare a new drive, since the bad sectors are a sign that your drive is about to die completely.

So your real option is to back up the hard drive and replace it. It's nearly impossible to maintain a "GOLD" hard drive which you can insta-replace your linux box with. Lots of things happen on a minute-to-minute basis - mail, cvs checkins, log entries, etc. These all need to be backed up and restored perfectly on the new drive. So some downtime is probably necessary.

One might think you could use dd to duplicate your drive. I initially tried this, and my results were not good. I did remember to use dd if=/dev/hda of=/dev/hdc conv=noerror (the noerror flag is important). However, this takes forever and a day. Basically it'll take all night. So be prepared for that, even on a small drive and a fast system. They sell devices that can do it faster, I think, but you don't have one, do you?
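A minimal sketch of that dd invocation wrapped in a function. Two things here are not from the post and are assumptions: the sync conversion, which pads a block that failed to read with zeros so everything after it stays at the same offset, and the 64k block size, which is picked only for illustration (dd defaults to 512-byte blocks, which is likely part of why a plain run crawls). The name clone_image is hypothetical.

```shell
# clone_image SRC DST: copy SRC to DST, carrying on past read errors.
#   conv=noerror  keep going after a failed read instead of stopping
#   conv=sync     pad short or failed blocks with zeros to the block size
#   bs=64k        assumed block size; a bigger block is faster, but one
#                 bad sector then costs you the whole block
clone_image() {
    dd if="$1" of="$2" bs=64k conv=noerror,sync
}

# On the real hardware this would be something like:
#   clone_image /dev/hda /dev/hdc
```

Note that sync also pads the final block, so an image of a regular file whose size is not a multiple of the block size comes out slightly longer than the original.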

The other issue with dd is that it typically replaces every sector on the disk. So you'll need a disk EXACTLY the same size as the previous disk (or bigger). My new disk was about a gig smaller (40.0 gigs instead of 40.9 gigs). This was an annoying problem.
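Checking sizes before starting an all-night dd would have caught this. Here is a small sketch, assuming a Linux box with blockdev from util-linux available; the function names size_in_bytes and target_fits are made up for this example.

```shell
# size_in_bytes PATH: size of a block device or of a regular file.
size_in_bytes() {
    if [ -b "$1" ]; then
        blockdev --getsize64 "$1"      # Linux only, needs root
    else
        wc -c < "$1" | tr -d ' '       # lets you try it out on plain files
    fi
}

# target_fits SRC DST: succeed only if DST is at least as big as SRC.
target_fits() {
    [ "$(size_in_bytes "$2")" -ge "$(size_in_bytes "$1")" ]
}

# Example:
#   target_fits /dev/hda /dev/hdc || echo "new drive is too small for dd"
```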

So instead of that, one nice option is just to get your new drive, make the partitions manually with fdisk on it to replicate the original drive, and then use tar to copy the contents across. I used knoppix heavily here.


The command for tar is:
cd /mnt/hda1   # partition to copy from
tar cpf - . | (cd ../hdd1 && tar xpf -)   # p keeps permissions; && skips the extract if the cd fails

This will maintain the users and permissions and stuff. After you're done copying all the partitions (you don't do the swap partitions, obviously), you then double check them to make sure they're right. Mounting them and doing a df -k is useful so you can make sure they look the same. (This method will allow you to expand or resize partitions as well, and is faster than dd, although still slow).
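The copy and the double check can be sketched as two small functions. The names copy_tree and same_file_list are hypothetical, and comparing file lists is just one cheap sanity check to run alongside df -k; it does not prove the file contents match.

```shell
# copy_tree SRC DST: tar pipe from one mounted partition to another.
# Run as root so owners are restored; p asks tar to keep permissions.
copy_tree() {
    (cd "$1" && tar cpf - .) | (cd "$2" && tar xpf -)
}

# same_file_list A B: succeed if both trees contain the same set of paths.
same_file_list() {
    [ "$(cd "$1" && find . | sort)" = "$(cd "$2" && find . | sort)" ]
}

# Example, with both partitions already mounted:
#   copy_tree /mnt/hda1 /mnt/hdd1
#   same_file_list /mnt/hda1 /mnt/hdd1 && echo "file lists match"
```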

The next step is to re-lilo (or grub) your hd. (hahaha, that sentence made no sense, but bear with me here). Anyways, you want to use the lilo on your hd, not the lilo that comes with knoppix, which doesn't seem to work. The trick to this is to use lilo -r /mnt/hda1 (your root partition, which has /etc/ and /boot/ on it). However, this won't work with knoppix, since the default mount points have the "nodev" option set. You'll need to remount them with the dev option set before you can run lilo on them.
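Put together, the remount and the lilo run look roughly like the sketch below. It only prints the commands so you can read them before running anything as root; the function name relilo_commands is made up, and /mnt/hda1 is simply the mount point used in this post.

```shell
# relilo_commands ROOT: print the two commands for a root partition
# mounted at ROOT. Nothing is executed here.
relilo_commands() {
    # Knoppix mounts with nodev, so turn device nodes back on first.
    echo "mount -o remount,dev $1"
    # -r makes lilo chroot into ROOT and use the lilo and /etc/lilo.conf there.
    echo "lilo -r $1"
}

# Review the output, then run it as root if it looks right:
#   relilo_commands /mnt/hda1
#   relilo_commands /mnt/hda1 | sh
```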

After that, the drive should be bootable. I'd move it to the "master" drive, if you haven't already, and test that part out.

One thing I did that you might also do is just go to the co-lo (your hardware IS at a co-lo, right?), take the drive out, and bring it home. Like many cheap boxes, the box in the co-lo didn't even have a cdrom, and wasn't the fastest box. Bringing it home actually saved time in the long run, since I could use a real desktop (with swappable drive bays - always get those) to do the work. You'll want to get a few spare drives, since the first drive I tried to restore onto was bad as well. This is an important note - never buy recertified drives. Always get spankin' new drives. It might be fun to do a strings on some of those recertified drives, but I didn't have time.

One thing my lilo did that was weird was rewrite the fstab to use "LABEL=/" instead of /dev/hda1. If you happen to be hosted at Pilosoft (or another co-lo that is run by someone on the local linux users group list - very good idea!) they might jump in and save your butt when you load it up and it doesn't work and you're too tired to figure out why.

The next step after doing all this is typically to make a plan that involves not having to ever do this again. For those of you not in the know - you want a hardware supported (get a good modern motherboard) RAID-1 solution and you want to be able to swap out one of your two drives (mirrored) when Linux tells you that one is bad. You also want to have some sort of backup solution running (of course), and you want to have a secondary DNS server and a backup machine somewhere in another state (or country) that can take over if your main co-lo goes under or something. Something that can provide basic mail and web services is nice. It might be good to hire an admin who is not you.

It's not uncommon for a linux machine not to work properly when you reboot. This is because it's probably been a few years since you rebooted, and you probably redid a lot of libraries in the meantime, some of which aren't in the right places. So after you do all this, it's good to test out all your services and make sure they are, in fact, doing what you think they are. Your co-lo (whom you bought the computer from, most likely) often provides a guarantee of hard drives (Pilosoft does). Don't save the 50 bucks by taking them up on this, since that bad hard drive still has all your corporate data on it. You should be getting swamped with email now, since your mail was down and now it's up and the interweb is resending things for you.

Oh, and always run grsecurity kernels. SELinux is just a pale imitation. Recently grsecurity has added things like brute force prevention, which detects exploit attempts and stops them. I assume he's doing a check to see if eip is pointing to a writable page, and if so, counting down from 5 or something, and if 0, then not allowing a fork or exec. Either way, it's a good thing.


Dave Aitel
Immunity, Inc.



_______________________________________________
Dailydave mailing list
Dailydave () lists immunitysec com
https://lists.immunitysec.com/mailman/listinfo/dailydave
