It’s been a while since I’ve done any sort of O&M, but I have always taken pride in my ability to troubleshoot difficult problems. I was having an issue with one of my Ubuntu VMs. Everything “appeared” to be fine, but in hindsight, I now realize it wasn’t. After checking everything I thought might be the problem (CPU usage on the VM, network issues, and anything else I could think of), I was at a loss as to why I had continual PHP timeouts on an Apache web server (yes, the one that hosts this blog). I decided, as a last-ditch effort, to bring my entire infrastructure down and bring it back up cold. It was only after I did this that I realized I had been looking in the wrong place for the cause.
As you can see in the charts, the CPU, memory, and disk utilization on my ESX server were not only high, but pretty erratic. If I had noticed this before the reboot, I could have investigated the cause, or at least used top to figure out what was chewing up the CPUs. Now I will never know what was causing the problem, unless of course it returns. If it does, I will be better prepared to solve the issue, as I will be armed with all the information. The moral of this story is to start troubleshooting at the most basic level before assuming the problem exists higher up. Approach the problem in a logical manner, working from the bottom up. I hope you learn from my mistakes, as I know I sure have.
Funny, I had a similar story over the last week or so. Nice practice in keeping troubleshooting skills sharp.