Big-win I/O performance increase coming to KVM guests in RHEL6.4

I finally got the pony I’ve been asking for.

There’s a very interesting (and impactful) performance optimization coming in RHEL6.4.  For years we’ve had to do this sort of tuning manually, but thanks to the power of open source, this magical feature has now been implemented and is headed your way (try it in the beta!).

What is this magical feature…is it a double-rainbow ?  Yes.  All the way.

It’s vhost thread affinity via virsh emulatorpin.

If you’re familiar with the vhost_net network infrastructure added to Linux, you know it moves the network I/O out of the main qemu userspace thread to a kthread called vhost-$PID (where $PID is the PID of the main KVM process for the particular guest).  So if your KVM guest is PID 12345, you would also see a [vhost-12345] process.
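
You can see these kthreads for yourself.  Sticking with that made-up PID of 12345 purely for illustration, something as simple as this will list them:

# ps -eo pid,comm | grep vhost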

Anyway…with the growing number of CPUs and amount of RAM in today’s machines, and the proliferation of NUMA systems (basically everything x86 these days), we have to be very careful to respect NUMA topology when tuning for maximum performance.  Lots of common optimizations center around NUMA affinity tuning, and the automatic vhost affinity support is tangentially related to that.

If you are concerned with having the best performance for your KVM guest, you may have already used either virsh or virt-manager to bind the VCPUs to physical CPUs or NUMA nodes.  virt-manager makes this very easy; just click “Generate from host NUMA configuration”:

[Screenshot: virt-manager VCPU pinning dialog]

OK that’s great.  The guest is going to stick around on those odd-numbered cores.  On my system, the NUMA topology looks like this:

# lscpu|grep NUMA
NUMA node(s): 4
NUMA node0 CPU(s): 0,2,4,6,8,10
NUMA node1 CPU(s): 12,14,16,18,20,22
NUMA node2 CPU(s): 13,15,17,19,21,23
NUMA node3 CPU(s): 1,3,5,7,9,11

So virt-manager will confine the guest’s VCPUs to node 3.  You may think you’re all set now, and you’re close; you can see the rainbow on the horizon.  You have already significantly improved guest performance by respecting the physical NUMA topology, but there is more to be done.  Inbound pony.
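
If you’d rather not click through virt-manager, virsh can set the same VCPU pinning.  Here’s a rough sketch, assuming the guest is named ‘rhel64’ (repeat the pair of commands for each of the guest’s VCPU numbers):

# virsh vcpupin rhel64 0 1,3,5,7,9,11 --live
# virsh vcpupin rhel64 0 1,3,5,7,9,11 --config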

Earlier I described the concept of the vhost thread, which handles the network processing for its associated KVM guest.  We need to make sure that the vhost thread’s affinity matches the KVM guest affinity that we implemented with virt-manager.

At the moment, this feature is not exposed in virt-manager or virt-install, but it’s still very easy to do.  If your guest is named ‘rhel64’ and you want to bind its “emulator threads” (like vhost-net) to the same CPUs, all you have to do is:

# virsh emulatorpin rhel64 1,3,5,7,9,11 --live
# virsh emulatorpin rhel64 1,3,5,7,9,11 --config
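
To double-check what libvirt thinks, run emulatorpin with just the domain name to query the current setting (the exact output format varies a bit between libvirt versions):

# virsh emulatorpin rhel64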

Now the vhost-net threads share a last-level cache (LLC) with the VCPU threads.  Verify with:

# taskset -pc <PID_OF_KVM_GUEST>
# taskset -pc <PID_OF_VHOST_THREAD>
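
Sticking with the made-up PIDs from earlier (say 12345 for the guest’s main qemu-kvm process and 12350 for its vhost-12345 kthread), the output should line up something like this:

# taskset -pc 12345
pid 12345's current affinity list: 1,3,5,7,9,11
# taskset -pc 12350
pid 12350's current affinity list: 1,3,5,7,9,11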

These should match.  Cache memory is orders of magnitude faster than main memory, and the performance benefits of this NUMA/cache sharing are obvious…using netperf:

Avg TCP_RR (latency)
  Before: 12813 trans/s
  After:  14326 trans/s
  % diff: +10.5%

Avg TCP_STREAM (throughput)
  Before: 8856 Mbps
  After:  9413 Mbps
  % diff: +5.9%
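
For reference, numbers like these come from netperf runs along these lines; the IP address below is just a placeholder for whichever endpoint you’re testing against, and your options and run counts may differ:

# netperf -H 192.168.100.10 -t TCP_RR
# netperf -H 192.168.100.10 -t TCP_STREAM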

So that’s a great performance improvement; just remember for now to run the emulatorpin stuff manually. Note that as I mentioned in previous blog posts, I always mis-tune stuff to make sure I did it right. The “before” numbers above are from the mis-tuned case 😉

Off topic…while writing this blog I was reminded of a really funny story I read on Eric Sandeen’s blog about open source ponies. Ha!