|
The HP ProLiant ML150 G2 server fiasco.
Server installed over the weekend of March 17, 2006. Very soon after the server was installed (actually on Monday, March 19), things started to go bad. Corrupt disk. Lost files. Damaged files. Lost folders. Data was disappearing faster than I could back it up!
After moving all data off the raid and onto a simple volume, I installed the HP Storage Manager. When first installed, it gave no indication that there was anything wrong with the raid. But after running for a couple of days, it displayed the status shown below. One of the raid 5 drives had actually entered a 100% failed state. This drive had probably been dying for days, but because the missing I2C cable left the drives and the raid controller unable to communicate, the raid behaved as if nothing was wrong when things were, in fact, VERY VERY wrong!
After installing the I2C cable, notice the changes below. There is now an enclosure management device that was not there before! Also, the failed drive has been replaced, and the raid 5 is being rebuilt.
Below is a picture of the I2C cable as it came with the machine, still in its original packaging. I had no clue what it was for. I thought it might be an unused audio cable for the DVD drive and thought no more of it. It really should have been installed before the server was delivered to us. I believe that the absence of this cable led to the chaos that was experienced with this machine.

Sunday 4.9.2006
This weekend, prior to installing another new piece of hardware (a raid backplane board), I decided to put together a test procedure that would attempt (by writing great gobs of data) to break the raid system. The test was run on Sunday morning, and the raid did indeed break once again, with Windows reporting that the disk was corrupt. This breakage occurred within only a few minutes of the test beginning.
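For anyone wanting to try something similar, here is a rough sketch of what a "write great gobs of data" test could look like. This is not the actual procedure I ran; it is a minimal illustration in Python, and the function names (write_test_files, verify_files), the file names, and the sizes are all made up for the example. It writes files of random data, remembers a SHA-256 checksum for each one, then reads everything back and reports any file whose contents no longer match. Keep in mind that a read-back done right after the write may be served from the operating system's cache rather than from the disks, so a real test would want far more data than the server has memory.

```python
import hashlib
import os
import tempfile


def write_test_files(target_dir, file_count=8, file_size=4 * 1024 * 1024,
                     chunk_size=1024 * 1024):
    """Write files of random data and return {path: sha256 hex digest}."""
    expected = {}
    for i in range(file_count):
        # Hypothetical file naming, chosen only for this sketch.
        path = os.path.join(target_dir, f"stress_{i:04d}.bin")
        digest = hashlib.sha256()
        remaining = file_size
        with open(path, "wb") as f:
            while remaining > 0:
                chunk = os.urandom(min(chunk_size, remaining))
                f.write(chunk)
                digest.update(chunk)
                remaining -= len(chunk)
            # Ask the OS to push the data out to the volume.
            f.flush()
            os.fsync(f.fileno())
        expected[path] = digest.hexdigest()
    return expected


def verify_files(expected, chunk_size=1024 * 1024):
    """Re-read each file and return the paths whose checksum changed."""
    mismatches = []
    for path, want in expected.items():
        digest = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                digest.update(chunk)
        if digest.hexdigest() != want:
            mismatches.append(path)
    return mismatches


if __name__ == "__main__":
    # A temporary directory stands in for a folder on the raid volume.
    with tempfile.TemporaryDirectory() as scratch:
        checksums = write_test_files(scratch, file_count=4,
                                     file_size=1024 * 1024)
        print("mismatched files:", verify_files(checksums))
```

To aim it at a real volume, pass a directory on that volume instead of the temporary one and raise file_count and file_size until the total is well beyond what the cache can hold. Running verify_files again some hours later would also catch data that goes bad after it has been written.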
I looked at the server system logs and could see the failure points. I looked at the HP Storage Manager and it reported that all drives and systems were optimal. I thought that this was weird: the hardware monitoring system reported no problems, while the server operating system did. The problems were being discovered at the operating system's file system level, not at the hardware level.
I programmatically removed the logical drive that was the raid and then reintroduced it to the server with a complete reformat. I copied all system data from the temporary location back onto the raid. No problems with that massive copy of data. Again, and actually a few times, I ran my "break the raid" procedure. So far, I have been unable to break it! This is a good thing, I believe. I will run the procedure on it again during evening hours this week in an attempt to cause the system to fail. If I cannot get it to fail, I think that soon I will once again make it the live repository for college system data.
I believe that the fact that the raid was originally formatted before the I2C cable was installed caused the format itself to be flaky! I think we finally have a stable server after having reformatted the raid (and replacing a failed drive and installing the I2C cable, of course). I think this sounds right.
Sunday 4.9.2006
Reintroduced user data to raid and set up sharing/security. We'll see if we can get a week's work out of this raid without it eating data!
|