Best Interviewee Ever..

So this old guy comes in to interview for our entry level Support Technician position. He’s been in construction his whole life and done computers on his own, on the side.

During the interview his cell phone rings. As the phone keeps ringing he pulls it out of his holster and looks at it, all the while continuing to talk to us and then puts it back. He does NOT disable the ring nor put it on vibrate. So it continues to ring like 5 or 6 times. Loud and annoying! All the while Jackie and I are trying to interview him.

Five minutes later, the same thing happens. His phone rings.. this time he doesn’t pull it out but just lets it ring.. 10 or 12 times until it goes to voicemail.

Wonder why he’s not going to get the job.


Power Troubles

At work we have a Cisco 6513 switch. It’s pretty huge. Has 13 bays for cards to go in and do their thing.

Unfortunately, I only have 120V power feeding the beast. I have two 2500W power supplies, but
thanks to the lousy 120V power, I can only use about 1200W of each. Sucks, I know.

So, here comes Cassidy with a new card he wants to pop into slot 12. It’s a 16 port GBIC blade,
basically to replace the 8-port one on slot 13 (see below) with added fancy stuff. Unfortunately,
it doesn’t look like I’ll have enough power for it.

To top it all off, I ordered a 16-port copper Gigabit blade as well today for this chassis. There’s
no way it’ll get power either. So something has to be done.

Option #1: Convert to 208V power.
Option #2: Get DC power supplies and supply the 6513 with power via DC.
Option #3: Cry and try to keep the tears away from the power.
Option #4: Pop in all the blades I want so it uses both power supplies (not redundant),
but if one of the power supplies dies, there won’t be enough power for everybody.

Here’s the output of the “show power” command for all to see:

coreswitch>sho power
system power redundancy mode = redundant
system power total =     1153.32 Watts (27.46 Amps @ 42V)
system power used =      1099.98 Watts (26.19 Amps @ 42V)
system power available =   53.34 Watts ( 1.27 Amps @ 42V)
                        Power-Capacity PS-Fan Output Oper
PS   Type               Watts   A @42V Status Status State
---- ------------------ ------- ------ ------ ------ -----
1    WS-CAC-2500W       1153.32 27.46  OK     OK     on
2    WS-CAC-2500W       1153.32 27.46  OK     OK     on
                        Pwr-Requested  Pwr-Allocated  Admin Oper
Slot Card-Type          Watts   A @42V Watts   A @42V State State
---- ------------------ ------- ------ ------- ------ ----- -----
1    WS-X6K-SUP2-2GE     145.32  3.46   145.32  3.46  on    on
2    WS-X6K-SUP2-2GE     145.32  3.46   145.32  3.46  on    on
3    WS-X6148-RJ-21      100.38  2.39   100.38  2.39  on    on
4    WS-X6548-RJ-21      121.80  2.90   121.80  2.90  on    on
5    WS-X6548-RJ-21      121.80  2.90   121.80  2.90  on    on
6    WS-X6548-RJ-21      121.80  2.90   121.80  2.90  on    on
7    WS-X6500-SFM2       129.78  3.09   129.78  3.09  on    on
8    WS-X6500-SFM2       129.78  3.09   129.78  3.09  on    on
13   WS-X6408A-GBIC       84.00  2.00    84.00  2.00  on    on
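
For anyone who wants to sanity-check the numbers, here is a rough sketch that adds up the per-slot allocations from a saved copy of that output and works out what is left. The function names and the show_power.txt file are made up for illustration, and the field positions assume the exact column layout shown above.

```shell
#!/bin/sh
# Sketch only: the function names and the show_power.txt file are made up
# for illustration; the field positions assume the layout shown above.

# Sum the "Pwr-Allocated Watts" column (5th field) across the per-slot rows.
# Slot rows are recognized by a card type starting with "WS-X"; the power
# supply rows (WS-CAC-...) and the header lines are skipped.
sum_allocated_watts() {
    awk '$2 ~ /^WS-X/ { total += $5 } END { printf "%.2f\n", total }'
}

# Watts left over after subtracting what is allocated from a budget.
# $1 = budget in watts, $2 = allocated watts
power_headroom() {
    awk -v budget="$1" -v used="$2" 'BEGIN { printf "%.2f\n", budget - used }'
}

# Example usage (assumes the output above was saved to show_power.txt):
#   used=$(sum_allocated_watts < show_power.txt)
#   power_headroom 1153.32 "$used"
```

Run against the listing above, the slot rows add up to 1099.98 watts, which matches the "system power used" line and leaves 53.34 watts on the 1153.32 watt budget.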

PPPoE 678 Errors

We run a fairly large wireless network. Instead of giving out DHCP and letting everybody at everybody else’s throats, we end up running PPPoE to the client’s machine or router. This has its own set of issues, but the benefits have outweighed them since we’ve started.

We’ve recently started getting some “678 errors” on some clients off of a particular tower. This is the error that Windows XP users see when they try to connect. Turns out the issue was due to a default PPPoE limit that Cisco has.

We currently have an NPE-G1 that we terminate our PPPoE sessions on. Probably around 10-20 VLANs with anywhere from 10 to 120 sessions per VLAN.

So for those that do PPPoE termination on a large scale, be warned. In the “bba-group pppoe” profile you will want to set the “sessions per-vlan limit” to something GREATER than 100. 100 is the default. Quite a low default, and was a pain to figure out. I’ve since set all of our bba-groups to 1024.
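
For reference, the change looks roughly like this in IOS configuration mode. This is only a sketch: the profile name "global" is an assumption (use whatever your bba-group is actually called), and 1024 is simply the value chosen above, not a recommendation.

```
! Sketch only: "global" is an assumed profile name.
bba-group pppoe global
 sessions per-vlan limit 1024
```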

So see-yah PPPoE 678 errors. It’s been fun. Truly it has.

The unmentioned secret about APC UPSes…

UPSes are things that I deal with on a daily basis at work. They are the devices that keep my equipment up in case of brief power outages, brown-outs, low voltage coming in from the power company, all of the above.

I have had good luck with the APC 3000RMXL over the years. It’s a 3U rack mount UPS with expandability, swappable backplates (so you can outfit it with a 20 or 30amp receptacle instead of the default 15amp ones), and also lets you daisy chain it with battery units to extend the lifetime. It’s heavy, but has good reliability.

One of the issues I’ve come to notice is with my Spam Database server. This is a 3U, 24-drive SAS machine with dual power supplies, and it wants gobs of power. I have the two power supplies plugged into different APC power strips (the kind that let me remotely switch each port on, off, etc.), which are fed by different APC UPSes on different circuits. Even so, the database server sometimes reboots, and I’ve come to figure out that it happens during brief power outages.

You’d think an “Uninterruptible” device would mean “Uninterrupted” power to the devices it powers. Unfortunately, this is not the case. According to APC’s website, there is a transfer time of 2 to 4ms while the UPS switches over to battery on utility failure. Typically, the capacitors in my power supplies can ride out an interruption that brief. However, in the case of this database server, I’ve found that it’s not good enough. I needed a UPS that was ALWAYS ‘on-line’.

Some Google searches later and I found out about Minuteman UPS. They make reasonably priced UPSes that offer an “always on” solution. APC does not have any that I can find in the “SmartUPS” line. I ended up getting an Endeavor ED1500RM2U to power a single power supply on my database server. This should fix the issues I’ve had with my database server rebooting during brown-outs… I hope. I installed it over a month ago and have had no issues since. Here’s hoping that it’ll fix the issue for good.

So here’s hoping that APC might offer an ‘online’ edition to their SmartUPS line. 2-4ms just didn’t cut it in this case.

ZOMG!!! WHAT A DAY!

Today was one of those days.. when everything goes wrong. Believe me…. Everything. Went. Wrong.

3am – pages started coming in that our equipment up on the tank in Bloomington Hills was having packet loss, going up and down, etc. Appears the backhaul must have blown out of alignment with the fun winds we’ve been having. I noticed our Candy Cane backhaul had been down since midnight. (sorry tomp.) I was tempted to go up to the candy cane and see what I could see, however, it was dark, and scary climbing rocks at night, with the wind. I opted to wait until the morning.
5am – page about BGP peering going down.. back up 5 minutes later.
6am – pages coming in about power outages on our UPSes. Then pages a few seconds later about power going back to normal. This and that. For a few minutes. Annoying. Then our Spam database (PostgreSQL) server rebooted (his name is Gibraltar). We determined that this is due to the 2-4ms time that it takes our UPSes to switch to battery during an outage. This server is a power hungry monster (16xSAS 15kRPM drives, dual Woodcrest CPUs, 8GB RAM, tons of fans). We figured it must not be able to handle a 2-4ms switchover (lack of capacitors? who knows). So, it took some hand holding to make sure that server came up right. Otherwise, mail wouldn’t flow.
6:30am – Once Gibraltar came up, I started ‘cvsup-ping’ the OS changes. It stopped after the first file and I logged into the CVSup server (mirrors). Noticed that the ethernet had been going up and down. What was really odd, was the “ was using my IP!” error. For some reason, this box (mirrors) had stolen the IP from Quillo (our old 1U web server box). There is no way he could have done that, since it had to have happened when I had started cvsup-ping. Soo, weird. Anyway, I fixed that issue on Mirrors, Gibraltar started cvsupping and all was good.
7:30am – Left home. Went to office to get ladder. Called Cory @ SGCity to get keys to the gate so I could drive my 4WD truck up. Upon arriving at the office right before 8am, Dan (one of our techs) said all the machines had been powered off and were having issues logging in. After a reboot of his machine, it resolved itself.
8:00am – Had Randy call SG Water to get up on the tank to check alignment. Drove up to candy cane. It appears that our backhaul unit that feeds our CandyCane access had blown over in the wind. Six concrete lag screws that were put into the concrete roof of our little ‘shack’ had all popped out. I tried rigging it back up with rocks and dumb tie down wire.. which worked until Randy came up with a “non-penetrating roof mount” around 9am.

It was one thing after another this morning. I ordered an “online” UPS (Minuteman brand) to fix that 2-4ms problem for the one server that can’t handle it. Hopefully, it’ll be here before another “reboot”. The weird thing about that issue is that I have dual power supplies, plugged into different power strips, plugged into different batteries, on different power circuits. Yet, I still have the issue. I’m hoping to plug one of those power supplies into this new “online UPS” and that’ll solve my problem. *fingers crossed*.

Oh and our installer quit yesterday…well, we think. He just dropped off the truck with the phone/laptop/keys inside of the truck in the parking lot, without telling anybody. Soo congrats tomp on starting Friday. May your future be filled with many-a-non-problematic-install.

New hardware..

I had ordered two new servers (non-RAID, diskless) to replace two servers that like to crash and reboot whenever they darn well please. (see previous post).

Early this morning (12am to 1.30am) I swapped them out. The first server was ‘Zahara’, one of our mailbox storage servers. It’s where 1/4 of our mailboxes reside. A quick RAID card change, and disk swap and it was back up. No issues.

The second server was ‘Cobre’, our CPanel hosting server. This one was running an older (3ware 8506) RAID card, and I wanted to upgrade to a newer (3ware 9500S) RAID card to get better everything. Luckily for me 3ware released a ‘convert.exe’ file that will convert your old 8500 series RAID arrays into 9500/9550SX arrays. All it took was downloading the convert.exe to a floppy, making a boot floppy, and rebooting the box (before I swapped chassis and RAID cards) and running ‘convert *’. It marked the RAID array as workable on the new 9500 card. Yeah, I know. Too technical. But it was pretty amazing that it worked. I had tried it earlier at the office to make sure it would save my data, and it did.. but there’s always that chance that it might not work and then there goes my morning.. rebuilding a server from backups. Ugh.

Anywho.. after I ran the convert.exe script, I swapped the 4 drives into the new server, powered it up and all was good. Until it kernel panicked. Yeah, I know. It sucked. Sucked hard. I was worried that it’d still panic on startup each time. So I rebooted into single-user mode, fsck’d my drives, and did a ‘make installworld’, which I had not done since installing my new kernel with support for the new RAID card. After that, another reboot and all came up. Well.. except my big Catalyst 6513 had some weird ARP issues with the IPs that the cpanel server was asking for. It just didn’t end, really. Well.. it did.. around 1.30am. After a fun hour and a half in a cold room working with hardware.

Sometimes I just want to be a chef.

Sunday Morning MySQL issues..

You’d think I get to sleep in most Sunday mornings. Well, I do. Most Sundays, that is. However, this Sunday is a different story. Our hosting server (cpanel) has had issues for quite some time and likes to reboot itself at random times.. usually it comes up fine. This time it did not.

The server runs ‘fsck’ to scan the drives for any errors when it reboots, since it doesn’t unmount them on the crash. I know, it sucks. This time, my data in the /var/db/mysql/mysql folder had some issues. That’s where the MySQL permissions tables reside. So MySQL refused to start. What a joke.

So I moved the mysql folder to mysql_old, re-ran the mysql_install_db script to regenerate the permissions table. Then set my mysql ‘root’ password. Lucky for me, cpanel backs up the MySQL user privileges in the users backup directory every night (or is it morning?).
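
The first two steps can be sketched roughly as below. This is an illustration, not the exact commands that were run: the rebuild_grant_tables function and the DRY_RUN switch are made up, and the --user=mysql option assumes the server runs as a "mysql" account. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch only: rebuild_grant_tables and the DRY_RUN switch are made up for
# illustration. With DRY_RUN unset or set to 1, commands are printed, not run.

run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "+ $*"
    else
        "$@"
    fi
}

# $1 = MySQL data directory (e.g. /var/db/mysql)
rebuild_grant_tables() {
    datadir=$1
    # Keep the damaged permissions database around instead of deleting it.
    run mv "$datadir/mysql" "$datadir/mysql_old"
    # Recreate empty grant tables in the data directory.
    # Assumption: the server runs as the "mysql" user.
    run mysql_install_db --user=mysql --datadir="$datadir"
}

# Example usage (dry run): rebuild_grant_tables /var/db/mysql
# Afterwards, start MySQL and set the root password, for example with
# "mysqladmin -u root password ..." (mysqladmin needs the server running).
```

Keeping the old folder as mysql_old means nothing is lost if the regenerated tables turn out to be missing something.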

So some quick bash scripting later:

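# Replay each account's nightly MySQL privilege dump into the fresh mysql database.
for dump in /backup/cpbackup/daily/*/mysql.sql; do
    [ -e "$dump" ] || continue    # skip if the glob matched nothing
    echo "Restoring.. $dump"
    mysql --password=secretmysqlpassword mysql < "$dump"
done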

And I have all my user permissions back in the database. It probably would have been easier had I had a recent backup of the mysql database. But this way, I have the passwords intact.

Nothing like writing some bash scripting, and fixing server issues at 6:30am on a nice Sunday morning.

Back to bed..

Power cables

Yeah. That’s right. Power Cables.

In my day-to-day job, I have to wire and manage a bunch of servers. Usually this involves providing power to these servers. You know the power cords I’m speaking of. 5-6 foot. Black. Standard Issue power cord for a computer. Unfortunately, the majority of my servers are in a rack stacked on top of each other. This causes a bit of a tangled mess when you’re going from a power strip in the rear of the rack into the server with 5-6 feet of power cords. Then multiply that by 30-40 1U servers in a single rack. You get the idea.
Originally, my design was to coil the excess power cable into a nice tight ring. This got hairy after plugging more than 24 of these cables in. The new solution? 1 foot power cords. Yeah that’s right. 1 foot. $1.70 each at sfcable.com. I found them elsewhere for $5-$7 each and was disgusted with the price.. Luckily sfcable came along and made me enjoy wiring power cables again.

If you manage a bunch of servers in a rack, and have a power strip less than a foot away from the rear of your servers, get these short cables. They’re easier to wire, look cleaner and will make your job a lot easier keeping things neat and pretty. Oh and they also come in 2ft, 2.46ft and 3ft lengths. Perfect for any rack mount server.

Some people’s wires

The day before Thanksgiving I went with Randy and Kelly to install a hotspot at a new store opening today. Unfortunately, it was a wiring nightmare. Whoever wired this place did a sweet job of wiring the telco phone. You can see their method of patching into the punchdown block here. They also split each Cat5 cable into two sets of two pairs so that two computer devices could share one cable. Talk about problematic. It was a pain. Hopefully, things like this aren’t being wired every day by your average joe-electrician.

When it all goes bad…

So last night (Sunday) I went by our Datacenter to install two BBUs (Battery Backup Units) in two of our storage servers for mail. BBUs are nice for when the power goes out, so you don’t lose the data that’s in your write cache. Anyway, we have 3ware 9500 RAID cards.

So, I installed the BBUs, which was kind of tight with the LED monitoring cables (which is a whole different story), and came home. About midnight I started the BBU test, which disables write cache on the card while it drains the battery.. then it’ll recharge the battery after it’s done and then re-enable write caching.
Around 6am I woke up, as usual, and checked the status on both machines. Both were working fine and still ‘testing’.. went back to bed. Around 9am I got a text message from a co-worker saying mail was down. Sure enough, ‘ubrique’, one of the servers, was dead in the water. The console showed some SCSI (SATA RAID) errors. I tried rebooting it remotely, but the machine did not come back up.. nada..

Driving over was a pain. But I ended up swapping cards.. finding out FreeBSD 5.4 didn’t have 3ware 9550 drivers working right yet out of the box.. and brought down ‘guadix’, another of our servers, to get the storage server back up.. anyway, I’m overnighting a new 9500 card from ASA. Hopefully, this won’t happen again. I’m looking at getting a cold-spare 9500 card.. as well as a diskless backup server so I could swap drives in a pinch.

Here’s to a fun morning.