7/26/16
I preface this entry with a disclaimer. There was no way this day was going to go well. I started it knowing that it would probably be unpleasant, but I did not expect tech-hell to be the reason for a ruined day. I expected that dealing with a car insurance company, a collision repair shop, and an unhappy girlfriend (since she's diabetic, we'll call her Wilford) at 8 in the morning would ruin my day... Instead, all of that went perfectly fine! I departed the collision shop at 8:10 and was ready for the day to get started. I was happy -- too happy.
Because I was loaning my car to my girlfriend, whose car was in the shop, I used the street entrance into the office instead of the sky bridge. When I waved my badge in front of the RFID reader, the door started turning and I walked inside -- SPLAT -- "Security violation... Security violation" howled from the annoying speaker in the revolving door. My glasses were bent to hell, my nose hurt, and the guy behind me was giggling. I stepped back, let the door think about what it had done, removed my bent glasses, and swiped my badge again. Cautiously this time, I walked into the chamber of doom and made it into the bowels of my ridiculously secure-looking insecure building. The guy I rode the elevator with insisted on kibitzing, so I obliged and we talked about how great the weather was and how 'bout that sports ball team while I tried to bend my very necessary glasses back into some sort of usable form.
Tired from the small talk with Jimmy Cayne the banker, I walked into the office with big plans to begin developing a foolproof procedure for upgrading the small supercomputer I maintain. People hate it when you have to take the cluster down for maintenance, so I try to avoid it and aggregate all maintenance into a few windows per year. I sat down and started my 3-minute morning routine of checking the price of Bitcoin and a few stocks, then logging into the supercomputer to see what's going on in the world of VASP (remind me to tell you about VASP some time). Anyways, I guess overnight some M$ patches had deployed and rebooted my workstation for me. I got my browser open and all my favorite equities and Bitcoin were down (as expected). I decided it was time to check on the supercomputer. I opened up a shell and sshed to the login node. It asked me for my password. This didn't feel right. I thought I'd set up public/private key authentication... I guess not. I typed my password quickly, ready to get on with it, and it asked me again. I figured I'd fat-fingered the password and tried again more methodically. It prompted again: "Password:"
"What the fuck?"
I sent Rain Man, my boss, a text alerting him of my discovery and began work.
Okay. LDAP isn't working... Interesting -- several months ago I set up 2 haproxy servers with corosync and 2 replicated LDAP servers to prevent this... Interesting.
How about I log in to vCenter and see what's going on. "vSphere Client could not connect to ."
"What the fucking fuck!?"
Okay... I just checked the price of Bitcoin, so DNS (which is a VM) is working. Interesting... Let's log in to all of our hypervisors directly. I found a list of 50ish VMs --
! ! ! ! ! !
All of them with scary warning signs next to their names. And so it begins...
We had one hypervisor that by design didn't use the same storage as all of our other hypervisors. His VMs were running just fine. Suddenly it occurred to me that one of our LDAP servers and one of our haproxy servers should probably be living on that hypervisor. We'll put that on the list of shit to do after we fix this.
It was clear now that we had a problem with the NAS that all of the VMs live on. Just yesterday, I'd noticed that we had 1.5TB of space left out of 5TB for our VM storage pool. But that was plenty, and I wasn't expecting us to outgrow that overnight. WTF! I logged in to our NAS and discovered that snapshots were turned on. I distinctly remember deciding not to use file system snapshots on the NAS, but I guess I was dreaming. I deleted a few snapshots and saw the available space on the NAS increment upward a bit... but not by as much data as I was deleting. My internal voice said: "Oh! I have dedup on! It makes sense that deleting snapshots wouldn't free up a ton of space because we're only deleting the differences." Perfect. I have 500GB available.
I started rebooting the VMs that had crashed and running file system checks and repairs on them. I focused first on my haproxy servers, web servers, and LDAP servers. These were customer-facing and really everything depends on them. Once those were up, I took a moment to look into when we went down. 8:17am... Just about the time my face was getting smashed by a revolving door. I could see that this day was going to be a dandy.
I skipped the self-pity and went straight to work on the rest of the down VMs. By 10am everything was back up and running as normal. I went back to the storage issue and deleted a few VMs that I didn't need anymore. I deleted some VMware snapshots that I was never going to revert to. In all, I had about 1TB of space available on the NAS again. I couldn't believe that we'd grown 1.5TB overnight, but I suppose anything is possible.
After lunch I checked in on things and noticed that available space had inched downward by a little bit. I also noticed some new hourly snapshots (which were limited to 5, so it shouldn't be creating any more than exist right now). I migrated a few VMs to another storage appliance we have and saw that we were back up into that 1TB area.
Throughout the day I checked in on storage and everything seemed to be settled down. I joined a friend for a run to our favorite liquor store (I call it Mecca) and bought a giant bottle of Elijah Craig Bourbon. I went home to an energized and needy dog, took a shower, and decided to do a little coding. At 8:12pm I went to submit a job to the supercomputer to test the code I'd been working on... FUUUUUCK!
Where is all of my space going? I moved more VMs, totaling about 1TB, to our other NAS. I brought everything back online again and emailed all of the users expressing my deepest apologies. At 10pm I poured myself a glass of bourbon.