ZOMG!!! WHAT A DAY!

Today was one of those days... when everything goes wrong. Believe me... Everything. Went. Wrong.

3am – Pages started coming in that our equipment up on the tank in Bloomington Hills was having packet loss, going up and down, etc. It appears the backhaul must have blown out of alignment with the fun winds we’ve been having. I noticed our Candy Cane backhaul had been down since midnight. (Sorry, tomp.) I was tempted to go up to the Candy Cane and see what I could see, but it was dark, and climbing rocks at night in the wind is scary. I opted to wait until the morning.
5am – Page about BGP peering going down, then back up 5 minutes later.
6am – Pages coming in about power outages on our UPSes. Then pages a few seconds later about power going back to normal. This and that, for a few minutes. Annoying. Then our spam database (PostgreSQL) server rebooted (his name is Gibraltar). We determined that this is due to the 2-4ms that it takes our UPSes to switch to battery during an outage. This server is a power-hungry monster (16x SAS 15k RPM drives, dual Woodcrest CPUs, 8GB RAM, tons of fans). We figured it must not be able to handle a 2-4ms switchover (lack of capacitors? Who knows). So it took some hand-holding to make sure that server came up right. Otherwise, mail wouldn’t flow.
6:30am – Once Gibraltar came up, I started ‘cvsup-ping’ the OS changes. It stopped after the first file, so I logged into the CVSup server (Mirrors) and noticed that the ethernet had been going up and down. What was really odd was the “ was using my IP!” error. For some reason, this box (Mirrors) had stolen the IP from Quillo (our old 1U web server box). There is no way he could have done that, since it had to have happened when I had started cvsup-ping. Soo weird. Anyway, I fixed that issue on Mirrors, Gibraltar started cvsup-ping, and all was good.
7:30am – Left home. Went to the office to get a ladder. Called Cory @ SGCity to get keys to the gate so I could drive my 4WD truck up. When I arrived at the office right before 8am, Dan (one of our techs) said all the machines had been powered off and were having issues logging in. After a reboot of his machine, it resolved itself.
8:00am – Had Randy call SG Water to get up on the tank to check alignment. Drove up to Candy Cane. It appears that the backhaul unit that feeds our Candy Cane access had blown over in the wind. Six concrete lag screws that were put into the concrete roof of our little ‘shack’ had all popped out. I tried rigging it back up with rocks and dumb tie-down wire, which worked until Randy came up with a “non-penetrating roof mount” around 9am.

It was one thing after another this morning. I ordered an “online” UPS (Minuteman brand) to fix that 2-4ms problem for the one server that can’t handle it. Hopefully, it’ll be here before another “reboot”. The weird thing about that issue is that I have dual power supplies, plugged into different power strips, plugged into different batteries, on different power circuits. Yet I still have the issue. I’m hoping to plug one of those power supplies into this new “online” UPS and that’ll solve my problem. *fingers crossed*.
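For the curious, here is a back-of-the-envelope sketch of the “lack of capacitors?” guess. It estimates how long a power supply’s bulk capacitor could carry the load with no input at all, and compares that to the UPS transfer time. Everything in it is an assumption for illustration: the function names are made up, and the capacitance, bus voltages, load and efficiency are placeholder numbers, not measurements from Gibraltar. I have not opened up the power supplies to check.

```python
def holdup_time_ms(capacitance_uf, v_start, v_min, load_watts, efficiency=0.85):
    """Rough hold-up time in milliseconds for a PSU bulk capacitor.

    Usable energy is what the capacitor gives up while sagging from
    v_start to v_min: E = 1/2 * C * (v_start^2 - v_min^2).
    All inputs are hypothetical; efficiency is a guessed converter efficiency.
    """
    capacitance_f = capacitance_uf * 1e-6
    usable_joules = 0.5 * capacitance_f * (v_start ** 2 - v_min ** 2)
    return usable_joules * efficiency / load_watts * 1000.0


def rides_through(holdup_ms, transfer_ms):
    """True if the supply should survive a UPS transfer of transfer_ms."""
    return holdup_ms > transfer_ms


if __name__ == "__main__":
    # Placeholder values: 470 uF bulk cap, 380 V bus sagging to 300 V, 400 W load.
    t = holdup_time_ms(470, 380, 300, 400)
    print(f"estimated hold-up: {t:.1f} ms, survives 4 ms gap: {rides_through(t, 4.0)}")
```

With those placeholder numbers the supply would coast through a 4ms gap easily, so if the real box still reboots, something else is going on (a smaller or tired capacitor, a much bigger load, or a longer transfer than advertised). Either way, an online (double-conversion) UPS runs the load from its inverter all the time, so there is no switchover gap to ride through.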

Oh, and our installer quit yesterday... well, we think. He just dropped off the truck in the parking lot with the phone/laptop/keys inside, without telling anybody. Soo, congrats tomp on starting Friday. May your future be filled with many-a-non-problematic-install.

1 comment so far

Wassup Cassidy?! I just checked out my brutha’s blog today and noticed a link to yours and thought I’d say hey. I really like your guys’ blogs. Tom seriously got the hookup with the new job — I’m a little jealous :p Reading a few of your posts reminded me so much of working there… always hearing about the newest equipment and upgrades and stuff. I loved always being able to ask you guys tons of questions. Man… I’d totally forgotten how much fun that was, heh. Anywho, hope you’re doing well. Talk to ya later

zebulun
June 6th, 2007 at 10:51 pm
