Post Mortem

Anyways, it’s alive.  It may have taken a little longer than expected, but it’s back.  Hopefully. 

I’ve rebuilt all the virtual machines, mostly from backups, so I didn’t actually lose anything, but I’ve also changed a lot of the layout behind the scenes.  This, along with ordering new parts, and the rest of life, has kept the site offline longer than I would’ve liked, but such is life without enterprise-level machines and support.

So now it’s time for a post mortem on all this fun stuff.

This story starts the week of April 17th.  Basically, the website kept going down and the server hosting it became unresponsive to everything but ping.  I couldn’t SSH into the box or even log in AT the box or anything.  So, I’d simply restart it.  After this happened a few times, I started scouring the logs to see what exactly was going on, and I couldn’t find anything.  As you may remember from a previous post, I thought I had the problem licked.  However, I had never actually seen an error message or anything telling me exactly what was going on.  I was just going on gut instinct.

So after figuring I’d fixed the problem, I went on with life, and it did work for quite a few days.  And then it started happening again.  So I decided to reinstall ESX, thinking it might be a problem with that.  It still hung a few times, and since I couldn’t actually log in at the box, I decided to log in as soon as I rebooted the server and just leave it logged in.  Maybe something was being written to the display before it hung.  Well, the server worked for a while, and then sometime on Sunday the 28th it went down again.

I wasn’t at home at the time, and had to wait until I got back, which was around 10 PM.  I went to the machine, and sure enough, there was the first actual error I’d seen:

SCSI Host 0 Reset (PID 0)
Time Out Again ----

So, looking at the error, I thought it might be the hard drive on SCSI ID 0.  Looking back, this was the first sign of what was actually wrong.  I replaced the hard drive and booted it back up.  The machine didn’t go anywhere.  No ESX, nothing but an attempt to boot from the NIC.  Definitely not a good sign.  This was a RAID 5 setup; it should’ve rebuilt the array, and everything should’ve been fine after I replaced the hard drive.  Well, apparently it didn’t want to do that, but it was too late now.  This was sign number two of what the true problem was.  It was now 2 AM Sunday night, with work the following day, so I turned everything off and gave up for the night.
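
For the curious, here’s roughly why I expected a single dead drive to be a non-event: RAID 5 keeps a parity block per stripe, so any one missing block can be recomputed by XORing the survivors.  A toy sketch of the idea in Python (made-up block contents; this is just the concept, not what the MegaRAID controller actually does internally):

    from functools import reduce

    def xor_blocks(*blocks):
        # XOR the corresponding bytes of every block together.
        return bytes(reduce(lambda a, b: a ^ b, group) for group in zip(*blocks))

    # Three 8-byte data blocks that would live on three different drives.
    d0 = b"virtual "
    d1 = b"machines"
    d2 = b" backup!"

    # The parity block stored on the fourth drive.
    parity = xor_blocks(d0, d1, d2)

    # Pretend the drive holding d0 died: rebuild its block from the rest.
    rebuilt = xor_blocks(parity, d1, d2)
    assert rebuilt == d0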

The following day I attempted to fix it again.  Since I still wasn’t sure what was going on, and I wanted to rule out the RAM, I ran MemTest86+ on the machine for a few hours.  No problems found.  I tried to do an upgrade install of ESX, but ESX told me it couldn’t find any of the old partitions or installs.  Great.  Well, maybe it was just the partition table that was gone, and not all the data.  I found this great utility CD called the Ultimate Boot CD.  On it there’s a program called TestDisk, which can salvage Linux partition tables.  After having to mess with the boot CD for a while to get the MegaRAID SCSI drivers installed on it, I was off and running.  Needless to say, that didn’t work: no partitions found.

Well, that meant all the data was essentially gone, since I was definitely not going to pay someone to get it back.  Thankfully I had started doing backups not more than two weeks before all this happened!

The rebuild of all the virtual machines then commenced.  However, with the server still hanging, it took quite a while to get everything back up and running.  What made it even more interesting was the myriad of errors each hang would produce.  Honestly, I don’t think I saw the same error more than twice the whole time it was down.

During this time I also redid the setup to put all my machines on an internal network behind an ISA server.  Right now there’s the external network (the internet), a perimeter network (my workstation, Binford’s workstation, and some miscellaneous machines that don’t need security), and then the internal network (all my enterprise-level machines).  There’s still one big problem with this, but I’m working on it, and it’s not a showstopper.  Basically, from my workstation and Binford’s workstation you can’t ping the internal network unless the machine you’re trying to ping pings out first.  It’s something to do with our messed-up physical infrastructure, but hopefully I can fix it.

Basically, this whole time went toward trying to get the site and back-end up to where they were before the problems, and toward fixing whatever was wrong.  The more it happened, the more I thought it was the SCSI card.  So I changed the channel that all the drives were on and rebuilt the array.  Needless to say, that didn’t help much, so this past Sunday I bought a new Dell PERC 3/DC card on eBay for $61 shipped.  Yesterday it came in, and last night I migrated all the virtual machines off, installed the new card, rebuilt the array, reinstalled ESX, migrated the virtual machines back, and brought everything back online.

Right now we’re flying on the new PERC card, which has 112 MB more cache and can initialize an array in under 5 seconds as opposed to 100 minutes.  Hopefully we don’t see a hang.  Let’s all hold our breath, mkay?