This is from memory, as it happened about a year ago, but I figured I’d document it, in case it helps someone.
I own a Promise UltraTrak100 TX8 SCSI to IDE RAID array. If it helps, here are local mirrors of the product manual and specifications.
One time I shut the array down, and when I powered it back on, one of the drives did not spin back up.
Situation:
disk0: Good
disk1: Good
disk2: Did not Spin up
disk3: Good
disk4: Good
disk5: Good
disk6: Good
disk7: Good
The array (configured as 8 disk RAID5, Maxtor 120 gig drives, 800.5 GB of formatted disk space) of course started beeping, so I grumbled, yanked a cold spare off the shelf, put it in place of the “failed” drive, and went to sleep.
Situation:
disk0: Good
disk1: Good
disk2: replaced with a good one (original put onto the shelf), resyncing
disk3: Good
disk4: Good
disk5: Good
disk6: Good
disk7: Good
About 3 hours into the RAID resync (no, UltraTrak100s are not really speedy), the array switched from short beeps to raising a ruckus, and its crying woke me up. Turned out that another drive had failed during the resync. So the nightmare happened: there were two failed drives in a RAID5, and of course the array is not designed to handle this.
Situation:
disk0: Good
disk1: Good
disk2: resyncing
disk3: Good
disk4: Good
disk5: failed with bad sectors
disk6: Good
disk7: Good
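For anyone wondering why a second failure is the end of the road for RAID5, here is a minimal sketch of XOR parity on one stripe. It is a toy model and not the UltraTrak's actual on-disk layout; the block contents and the helper name xor_blocks are made up for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

# One stripe on a toy 8-disk array: 7 data blocks plus 1 parity block.
data = [bytes([i] * 4) for i in range(1, 8)]
stripe = data + [xor_blocks(data)]

# One disk missing: XOR of the 7 survivors gives back the lost block.
lost = 2
survivors = [b for i, b in enumerate(stripe) if i != lost]
rebuilt = xor_blocks(survivors)

# Two disks missing: the 6 survivors only yield the XOR of the two
# lost blocks, which is not enough to recover either one on its own.
lost_pair = (2, 5)
survivors = [b for i, b in enumerate(stripe) if i not in lost_pair]
combined = xor_blocks(survivors)
```

With one block gone there is one unknown and one parity equation, so the rebuild works. With two gone there are two unknowns and still only one equation.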
I had no backups. As an aside, when you have 800 gigs of on-line storage, all used, how do you back it up? DLT7K (which I also have) would take maybe 3 days, and at that point, do I trust the tapes? After all, when you have 20 tapes, the chance that every one of them reads back cleanly is the per-tape success rate raised to the 20th power, so the odds of a failed restore climb quickly. Then there is dust in the drive, SCSI cables (differential SCSI in my case), power fluctuations, etc. The only way to back up 1 TB is to put a second 1 TB array near it, mirror them, and start using filesystem snapshots (like NetApp does, or Solaris 8 and newer). Anyway, backups deserve a rant of their own.
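To put rough numbers on the tape worry, here is a small sketch. It assumes every tape fails independently with the same probability, which is a simplification, and the per-tape success rates are invented for illustration, not measured DLT figures.

```python
def chance_of_bad_restore(per_tape_success, tapes):
    """Probability that at least one tape in the set fails to read back."""
    return 1 - per_tape_success ** tapes

# Hypothetical per-tape success rates for a 20-tape backup set.
for p in (0.999, 0.99, 0.95):
    risk = chance_of_bad_restore(p, 20)
    print(f"per-tape success {p:.3f}: {risk:.1%} chance of a failed restore")
```

Even at a 99% per-tape success rate, a 20-tape set has roughly an 18% chance of at least one bad tape.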
So I grumbled, and cursed, but went ahead and examined the original drive, the one that didn’t spin up. SMART was complaining that the drive takes too long to spin up, but in the end I managed to convince it to spin up. So now I had an array with two “bad” drives, yet one of them was actually “good”, only marked as bad in the NVRAM of the UltraTrak.
After a while on long distance calls to Promise, I got to talk to a Chinese guy who was actually one of the developers. He told me of a magic trick to try as a last resort.
So don’t do this at home, this is serious evil, etc.
He told me to turn the array off, yank all the drives out of the array, and put one new drive into it.
Upon power on, the array would complain about the lack of the original drives. Then he told me to delete the existing configuration, and power the array off.
After that, he told me to put the drives, including the drive that originally had problems spinning up, back into the array in the original order:
Situation:
disk0: Good
disk1: Good
disk2: Drive that did not spin up originally, but got convinced to spin up again
disk3: Good
disk4: Good
disk5: failed with bad sectors
disk6: Good
disk7: Good
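The “original order” part matters more than it might seem. Here is a toy sketch of why, assuming the freshly written configuration maps data purely by slot position (my assumption, not something Promise confirmed). XOR parity does not care about order, so a shuffled array can still look consistent while handing back scrambled data. The helper names and block contents are invented for illustration.

```python
from functools import reduce

def xor_blocks(blocks):
    """XOR equal-length byte strings together, byte by byte."""
    return bytes(reduce(lambda a, b: a ^ b, column) for column in zip(*blocks))

def read_stripe(disks, parity_slot):
    """Concatenate every slot except the parity slot, in slot order."""
    return b"".join(b for i, b in enumerate(disks) if i != parity_slot)

# Toy 4-slot stripe: three data blocks, parity in the last slot.
data = [b"AAAA", b"BBBB", b"CCCC"]
disks = data + [xor_blocks(data)]

# Same drives, but two of them put back into each other's slots.
swapped = disks.copy()
swapped[0], swapped[2] = swapped[2], swapped[0]

# XOR across the whole stripe is still all zeros, so parity "checks out".
parity_ok = xor_blocks(swapped) == bytes(4)

original = read_stripe(disks, 3)     # b"AAAABBBBCCCC"
scrambled = read_stripe(swapped, 3)  # b"CCCCBBBBAAAA"
```

In other words, nothing in the parity math would warn you that the drives went back in the wrong slots. Label the drives before pulling them.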
Then he told me to go and configure the array again from scratch, RAID5, whole disks, etc.
At the moment when I was to commit the configuration of the array, I had to be careful. Essentially, at that point all of the lights on the disks in the array would flash in sequence, as the configuration of the array was written to the disks. After that there would be a 1 second pause. During it, I had to turn the array off.
This is a one-time shot. If one doesn’t turn the array off during this 1 second interval, the array proceeds with formatting the disks, and all of the data is lost.
As I did it, the array wrote its configuration to the disks, matching the configuration that I had before, but did not re-initialize the array. So the data was still there.
When I powered the array on, it spun up all the drives, and proceeded to claim that it was fully functional.
So I manually failed disk5, the one that had bad sectors on it, by yanking it out of the array and replacing it with a cold spare.
About 10 hours later the array finished rebuilding. Then I failed disk2, the one that had issues spinning up, and replaced it. The array rebuilt again.
You have no idea how stressed I was until the first rebuild was done.
Anyway, maybe this will help someone. Obviously this is not exactly a technique for the faint of heart, and it is not supported by Promise. But it saved my ass. If you have spare disks, try building a test array (of like 2 disks) and practice on it first. And have good backups.
This should work on UltraTrak100 TX4 as well, but I have no idea about any other models. Probably not. Talk to Promise, they can be nice to you.