2010/05/07

Parallel xargs for gathering Xen/VMware performance and storage performance data

I have a problem where certain virtual machines drop offline for brief periods of time.

Why? I don't know, but I suspect an I/O related issue (NFS / network hiccup). There's nothing in the logs.

How can I test this? Well, the built-in tools (sar/sysstat reports, vCenter, or XenCenter) don't keep historical data down to the second, so I will miss any hiccup that lasts only a few seconds. Furthermore, long-term (1 minute, 5 minute, etc.) stats/graphs tend to average out the spikes, so a spike that lasts a few seconds won't show up unless it's big enough to register at a time resolution an order of magnitude coarser.

I know that when local disk I/O is blocked, iostat shows the CPU spending 100% of its time in iowait, and an individual device queue maxes out (the disk device may also go to 100% utilization, with no transactions or data actually read/written). So, how do I get and keep historical data to compare against my Nagios alerts? How about gathering data with 1 second granularity, and then, when an alert comes through, comparing its timestamp against what I've collected in the logs?

How to do that in a quick, parallel, automated fashion without manually logging in to 50 servers? Parallel execution with xargs and ssh:
First, distribute a public key for ssh to use, so you don't have to manually type the password on each box. Then,

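printf '%s\n' xenhost1 xenhost2 xenhost3 vm1 vm2 vm3 vm4 vm5 | xargs -P 100 --replace="TARGET" bash -c 'ssh -i /root/.ssh/ssh_keyfile TARGET "iostat -xnt 1 || iostat -nt 1" > TARGET.log'

(one host per line via printf, rather than echo with -d" ", because echo's trailing newline would otherwise end up glued to the last hostname)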

(the || is because Citrix XenServer 5.5.0 uses a different iostat version and syntax than the other RHEL/CentOS systems I have)

When I get an alert that "vm xyz is unavailable", I can look at the log file for each of the Xen hosts, the NFS server, and the VM itself to see whether there was an I/O, CPU load, or other problem.
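
Rather than paging through each log by eye, that lookup could be scripted. Here is a rough sketch, assuming the log has the usual sysstat layout: a timestamp line (from -t), then an "avg-cpu:" header, then a line of values. The function name iowait_spikes and the default threshold of 50 are made up for this example, and iostat versions differ in how they print the timestamp, so check it against your own logs first.

```shell
# Print the timestamp and %iowait of every sample above a threshold.
# Usage: iowait_spikes LOGFILE [THRESHOLD]   (hypothetical helper)
iowait_spikes() {
    awk -v limit="${2:-50}" '
        # Header line: work out which column holds %iowait. The last
        # non-empty line seen before it is assumed to be the timestamp.
        /^avg-cpu:/ {
            col = 0
            for (i = 1; i <= NF; i++) if ($i == "%iowait") col = i - 1
            stamp = prev
            want = 1
            next
        }
        # The line right after the header holds the values.
        want {
            want = 0
            if (col > 0 && $col + 0 > limit + 0) print stamp " iowait=" $col
        }
        # Remember the most recent non-empty line.
        NF { prev = $0 }
    ' "$1"
}

# Example: iowait_spikes xenhost1.log 90
```

The column is looked up by name from the header instead of being hard-coded, since the avg-cpu columns are not the same across sysstat versions.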

Also, once I have all the data, I can grep out the lines that I care about and import them into Excel to get trending data with 1 second granularity, e.g.:
egrep "nfsserver:/export/home" xenhost1.log > foo.csv
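
The grep output is still whitespace-separated, so Excel has to be told to split on spaces at import time. One way around that, as a sketch, is to let awk re-join the fields with commas. The helper name to_csv is invented for this example; the mount and file names are the ones from the command above.

```shell
# Keep lines containing a fixed string and re-join their fields with commas.
# Usage: to_csv PATTERN LOGFILE   (to_csv is a hypothetical helper name)
to_csv() {
    # grep -F treats the pattern as a literal string, not a regex.
    # Assigning $1 to itself makes awk rebuild the line using OFS.
    grep -F -- "$1" "$2" | awk -v OFS=, '{ $1 = $1; print }'
}

# Example: to_csv "nfsserver:/export/home" xenhost1.log > foo.csv
```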

Of course, you could use this xargs and logfile method to snapshot many aspects of a system over time, either to troubleshoot a problem or for your own records (process list, network connections, etc.): things that may not be caught by syslogd, dmesg, cacti, etc. I did something like this to capture total IOPS across our enterprise when we were planning to consolidate a bunch of local disk to a central NAS+SAN. I had to capture IOPS per server over a period of a week. It made for a 30MB spreadsheet, but in the end I knew my numbers were accurate for the peaks and valleys of our I/O load.
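
For commands that don't repeat on their own the way iostat does (ps, netstat and so on), the same idea could look something like the sketch below: a small loop that stamps each sample with the time and then runs the command. The function name snapshot and its arguments are made up for this example.

```shell
# Run a command COUNT times, INTERVAL seconds apart, printing a timestamp
# line before each sample so it can be matched against alert times later.
# Usage: snapshot COUNT INTERVAL COMMAND [ARGS...]   (hypothetical helper)
snapshot() {
    local count="$1" interval="$2" i=0
    shift 2
    while [ "$i" -lt "$count" ]; do
        echo "=== $(date '+%Y-%m-%d %H:%M:%S') ==="
        "$@"
        i=$((i + 1))
        # No need to sleep after the final sample.
        if [ "$i" -lt "$count" ]; then
            sleep "$interval"
        fi
    done
}

# Example: one process list per second for a minute.
# snapshot 60 1 ps -eo pid,pcpu,pmem,stat,comm >> "$(hostname).ps.log"
```

Run through the same xargs and ssh line, this would produce one log per host, just like the iostat logs.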