nohz_full=godmode ?

Starting with some background…What is the kernel timer tick (aka LOC interrupt), and what does it do ?

The kernel timer tick is an interrupt triggered at a periodic interval (based on the kernel compile option CONFIG_HZ). The tick is what keeps track of kernel statistics such as CPU and memory usage, and provides for scheduler fairness through its load balancer. It also does timekeeping, i.e. keeping gettimeofday updated.

When the tick fires (as often as every millisecond, based on the value of CONFIG_HZ), it gets scheduled ahead of whatever’s currently running on a CPU core. In other words, whatever was running (with all of its valuable data cache-hot) is interrupted by the tick. The CPU’s L1 instruction and data caches (the smallest yet fastest) get polluted, somewhere around 1000 times a second (if the task were 100% CPU-bound, which the majority are not).

This is not an All Is Lost scenario, but certain workloads might see a 1-3% hit that could be attributed to this interference. It can also cause noticeable jitter, especially since what happens inside the tick is not deterministic: the total time the tick runs is not a predictable/constant value.

That was a mouthful, so let me dissect it a bit by describing various kernel config options that control how often this tick fires.

Prior to the introduction of the “tickless kernel” in kernel 2.6.21, the timer tick ran on every core at the rate of CONFIG_HZ (i.e. 1000/sec). This provided a decent balance of throughput and latency. It had the side-effect of waking up every core constantly, which wasn’t necessary when nr_running=0 (a per-core attribute…see /proc/sched_debug). The scheduler says there’s nothing to run on the core, so let’s disable the tick there and save some power by not waking the CPU up from a deeper C-state. Actually it saves lots of power; Linux has become quite a responsible citizen in this regard.

In summary:

RHEL5 – CONFIG_HZ=1000
- No Tickless support
- Ticks 1000/sec on every CPU no matter what
RHEL6 – CONFIG_HZ=1000, CONFIG_NO_HZ=y
- Tickless when nr_running = 0
- Ticks 1000/sec when nr_running > 0
RHEL7 – CONFIG_HZ=1000, CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, etc.
- Opt-in support for nohz_full
- Tickless when nr_running <= 1
- Ticks 1000/s when nr_running > 1

Note: for RHEL7, you will need 3.10.0-68 or later.

Red Hat’s Frederic Weisbecker has been working with other industry leaders such as Paul McKenney from IBM (and many others) to implement a feature called Full NO HZ. During the development phase it changed names several times (e.g. adaptive tickless). These days the kernel cmdline option to toggle it is nohz_full, so that’s what I’m calling it.

This feature requires yet another slew of kernel config options, along with some userspace gymnastics (that I’ll detail later) to get everything lined up. So far the use-cases for disabling the tick have been embedded applications, HPC/scientific computing, and the financial guys who need real-time characteristics.

It makes sense then to have these features built in, but defaulted to OFF, so that these folks can opt in.  As you’ll see, it’s not really necessary for everyone, nor do most workloads expose the tick as the “top-talker” in traces.  But some do, and it was for those customers that the feature was developed.

nohz_full has the following characteristics:

  • Stop interrupting userspace when nr_running=1 (see /proc/sched_debug).
    • If runqueue depth is 1, then the scheduler should have nothing to do on that core.
  • Move all timekeeping to non-latency-sensitive cores.
  • Mark certain cores as nohz_full cores via cmdline.  In this example, the system has 2 sockets, 8 cores each, 16 cores total, logical cores disabled.  I want to dump everything I can over to core 0, leaving cores 1-15 for my performance critical application:
Kernel cmdline: nohz_full=1-15 isolcpus=1-15 selinux=0 audit=0

# dmesg|grep dyntick
dmesg: [ 0.000000] NO_HZ: Full dynticks CPUs: 1-15.
  • In addition to the nohz_full cmdline option, the user must move the RCU threads themselves:
 # for i in `pgrep rcu` ; do taskset -pc 0 $i ; done
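
To double-check that the affinity stuck, you can read it back (each rcu* thread should report an affinity list of 0):

 # for i in `pgrep rcu` ; do taskset -pc $i ; done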

Frederic has written a small harness that uses kernel tracepoints and the ftrace interface to test and debug during this feature’s development.  It’s available here:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git

That harness spits out something like this:

root@localhost: ~/dynticks-testing # cat trace.1
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 10392/10392 #P:16
 #
 # _-----=> irqs-off
 # / _----=> need-resched
 # | / _---=> hardirq/softirq
 # || / _--=> preempt-depth
 # ||| / delay
 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
 # | | | |||| | |
 -0 [001] d... 1565.585643: tick_stop: success=yes msg=
 user_loop-10409 [001] d.h. 1565.586320: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1565474000583
 user_loop-10409 [001] d... 1565.586327: tick_stop: success=yes msg=
 user_loop-10409 [001] d.h. 1566.586352: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1566474000281
 user_loop-10409 [001] d.h. 1567.586384: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1567474000282
 user_loop-10409 [001] d.h. 1568.586417: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1568474000280
 user_loop-10409 [001] d.h. 1569.586449: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1569474000280
 user_loop-10409 [001] d.h. 1570.586482: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1570474000275

What we’re looking for is the tick_stop messages, which mean the tick has been shut off on that core.   Note:  there is still one tick per-second in the current upstream code, to maintain scheduler stats for load balancing.   The above output is from a system tuned according to the specifics in this blog post.  It was also necessary to configure the system BIOS for low latency.  Individual OEMs typically publish whitepapers on this topic.

I mentioned certain statistical accounting is done inside the tick.  One of those that is user-controllable is vm.stat_interval (which defaults to 1, so once per second).  You will see that even with nohz_full, vm.stat_interval will pop at that interval.  Frederic’s test harness accounts for this by setting vm.stat_interval to 120, then running the test for 10 seconds.  If you run the test for 120+ seconds, you will see vmstat_update fire (and possibly other things like xfs).
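
For example, to push that work out beyond your measurement window the same way the harness does, raise the interval with sysctl (the value is in seconds and purely illustrative):

# sysctl -w vm.stat_interval=120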

kworker/1:0-141 [001] .... 2693.850191: workqueue_execute_start: work struct ffff881fbfa304a0: function vmstat_update

kworker/1:0-141   [001] ....  2713.458820: workqueue_execute_start: work struct ffff881f90e07c28: function xfs_log_worker [xfs]

This feature is a massive improvement in terms of cache efficiency.  To see what I mean, try running this test harness without the kernel cmdline options 🙂

To get rid of the xfs_log_worker interference, you can use the tunable workqueues feature of the kernel’s bdi-flush writeback threads.  If, as in the above example, you are using core 0 as your “housekeeping CPU”, then you could affine the bdi-flush threads to core 0 like so:

# echo 1 > /sys/bus/workqueue/devices/writeback/cpumask

It takes a hex argument, so 1 is actually core 0.

At this point whenever the kernel wants to write dirty pages, it will wake up these bdi-flush threads as normal, but now they will wake up with the affinity that you programmed in.  Keep in mind that a single core might not be enough to do the writeback and whatever else the kernel needs to do, because bdi-flush threads, like any IO thread, block.  You might need to use 2+ cores.  Keep an eye out for CPU congestion or blocking on the housekeeping core (mpstat or similar).
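
Since the cpumask is a hex bitmask, pulling a second core into the housekeeping set just means setting another bit.  For example, cores 0 and 1:

# echo 3 > /sys/bus/workqueue/devices/writeback/cpumask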

Also note that by default in RHEL7, bdi-flush threads are NUMA-affined to be PCI-local to your storage adapter (whether it’s a local SCSI/SATA card or HBA).  That’s a change from RHEL6, where bdi-flush threads had no affinity by default.  You can disable the default NUMA affinity and return to the RHEL6 behavior like so:

# echo 0 > /sys/bus/workqueue/devices/writeback/numa

The two “echo” commands above do not persist across reboots.

Now…if you run turbostat while in this configuration, you will see that the timekeeping core (core 0 in this case) is kept busy enough (because it is now ticking at the CONFIG_HZ rate) to stay in C-state 0.  That’s less than palatable; it was later addressed by Paul McKenney’s CONFIG_NO_HZ_FULL_SYSIDLE work.  When that’s set, the timekeeping core is no longer pegged.  Godmode???
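
For example, sample the per-core C-state residency over a five-second window and watch the %c0 column for the timekeeping core:

# turbostat sleep 5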

Here’s another way to examine the tick’s behavior:

# perf stat -C 1 -e irq_vectors:local_timer_entry sleep 1

9 irq_vectors:local_timer_entry

pig is a program written by my co-worker Bill Gray.  It’s used as an artificial load generator.   Below, it spins on the CPU for 1 second.  Unfortunately it’s not packaged for RHEL.  But you can use this instead, just as well.

So here is the trace without the cmdline options.  You can see that the tick fires roughly 1000 times in the 1-second run, which is the expected out-of-the-box behavior.

# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 /root/pig -s 1

1005 irq_vectors:local_timer_entry

Then reboot with nohz_full=1-15 rcu_nocbs=1-15 and isolate core 1 from userspace tasks and IRQs.  You could do this with isolcpus=1-15 too.

# tuna -c 1 -i ; tuna -q * -c 1 -i

The same pig run ends up with only a handful of interruptions! Oink!

# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 /root/pig -s 1

4 irq_vectors:local_timer_entry

Here’s yet another (less granular) way to see what’s going on:

# watch -n1 -d "cat /proc/interrupts|egrep 'LOC|CPU'"

Now that you’ve validated your configuration, it’s time to run your applications and see if this feature gives you any boost.  If you’ve got the right NICs, try out the busy polling socket option, too.

Here is some further reading on the topic, including a video of Frederic Weisbecker from LinuxCon where he covers this feature in detail.

https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
http://lwn.net/Articles/549580/
http://www.youtube.com/watch?v=G3jHP9kNjwc

Oh, did you expect the CPU ?

Sea-change alert…

For a while now, there has been a drive to lower power consumption in the datacenter. It began with virtualization density, and continues with Linux containers (fun posts coming soon on that), newer processors and their power-sipping variants, CPU frequency governors, CPU idle drivers, and new architectures like ARM and Intel’s Atom.

The sea change I’m alluding to is that, with all of this churn in the hardware and kernel space, applications may not have kept up with what’s necessary to achieve top performance. My contact with customers and co-workers has surfaced a very important detail: application developers expect the hardware and kernel to “do the right thing”, and rightfully so. But customer-driven industry trends such as reduced power consumption have a side-effect: reduced performance.

Circling back to the title of this article…for a number of years now, the assumption by developers that full-bore CPU power is available 100% of the time has been somewhat misleading. After all, when you shell out for those fancy new chips, you get what you pay for, right ? 🙂 In their default configurations, the hardware and the CPU frequency/idle drivers are biased towards power savings (I personally believe due to industry pressure). If you’ve read some of my previous posts, you understand the situation, know how to turn all of that off during runtime, and get excellent performance at the price of power consumption.

But there’s got to be some sort of middle ground…and in fact, our experiments have proven out a few options for customers. For example, if you look at the C-state exit latencies on a Sandy Bridge CPU:

# find /sys/devices/system/cpu/cpu0/cpuidle | grep latency | xargs cat
0
1
80
104
109

You can see that the latencies increase dramatically the deeper you go. What if you just cut off the last few ? That turns out to be a valid compromise! You can set /dev/cpu_dma_latency=80 on this system and that will keep you out of the deepest C-states (C6 and C7), which have the highest exit latencies. Your cores will float somewhere between C3 and C0.
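
One way to hold that value is the pmqos-static.py helper that ships with tuned (a plain echo will not stick, because a /dev/cpu_dma_latency request is only honored while the requesting process keeps the file open):

# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=80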

This method allows you to benefit from turbo boost when there is thermal headroom to do so. And we’ve seen improvements across a wide variety of workloads that are not CPU-bound: things like network- and disk-heavy loads that have small (micro/millisecond) pauses in them, which let the CPU decide to go into deeper idle states or slow its frequency. Oh, by the way, the kernel recently grew tracepoints for the PM/QoS subsystem. I could summarize this by saying that if your workload is IRQ-heavy, you will probably see a benefit here, because IRQ handling is brief enough that the processors keep falling out of C0. Generally I see 30-40% C0 residency and the rest in C1 when I have a workload that is IRQ-heavy.

So when you use something like the latency-performance tuned profile that ships in RHEL, amongst other things you lock the processors into the C1 state. That has the side-effect of disabling turbo (see the TDP article above), which is generally fine, since all the BIOS low latency tuning guides I’ve seen tell you to disable turbo anyway (to reduce jitter). But. And there’s always a but. If you have a low thread count and you want to capture turbo speeds, there is a new socket option, brought to you by Eliezer Tamir from Intel, based on Jesse Brandeburg’s Low Latency Sockets paper from Linux Plumbers Conference 2012. It has since been renamed busy-polling, something I’m having a hard time getting used to myself…but whatever.

The busy-polling socket option is enabled either in the application code through setsockopt SO_BUSY_POLL=N, or sysctl net.core.busy_{read,poll}=N. See Documentation/sysctl/net.txt. When you enable this feature (which btw requires driver enablement…as of this writing, ixgbe, mlx4, bnx2x), the driver will busy-poll the hardware RX queue on the NIC and thus reduce latency. As mentioned in the commit logs for the patch set and the kernel docs, it has the side-effect of increased power consumption.
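
A minimal sketch of the sysctl route (values are in microseconds; 50 is a commonly suggested starting point, so tune for your workload):

# sysctl -w net.core.busy_read=50     <-- busy-poll blocking reads/recvs
# sysctl -w net.core.busy_poll=50     <-- busy-poll select()/poll()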

Starting off, I talked about looking for balance between the hardware/idle drivers’ power-savings bias and performance (while retaining as much power savings as we can). The busy-polling feature allows you to (indirectly) lock only those cores active for your application into more performant C-states and operating frequencies. When your socket starts receiving data, the core executing the application owning the socket goes almost immediately to 100% in C0, while all the other cores remain in C6. As I said, without the socket option, only 30-40% of the time is spent in C0. It’s important to note that when the socket is NOT receiving data, the core transitions back into a deep C-state. This is an excellent balance of power and performance when you need it.

This allows the cores being used by the application to benefit from turbo speeds, which explains why busy-polling outperforms the low-latency tuned profile (which effectively disables turbo by locking all cores into C0). Not only does this option outperform the c-state lock (because of turbo boost), it also helps achieve a more favorable balance of low latency performance vs power consumption by allowing other cores in the system to go into deep c-states. Nirvana ???

Back to macro: the busy-polling knob is only one way that developers should ask for the CPU these days. The second (and, as I’m told on good authority, preferred) way to tell the CPU what your application’s performance tolerances are is through the /dev/cpu_dma_latency interface. I’ve covered the latter in a previous article, please have a look.

And here’s what I mean:

[Chart: busy-poll-blog2]

Performance Analysis and Tuning Videos from Red Hat Summit 2013

The Performance Engineering group, under the direction of John Shakshober (aka Shak), had a very busy spring working with our excellent customer and partner ecosystem, generating high-value content for Summit attendees. A great example of collaboration with customers was a super interesting talk from NASA, along with Red Hat’s Mark Wagner and Shak. Hopefully they post a video of it!

On Red Hat’s website, you can find videos of the keynotes as well as many other excellent presentations. Be sure to check them out here. All in all, a great week…very happy to re-connect with customers, partners and fellow Red Hat associates.

One of the recurring (and popular) presentations at Red Hat Summit is the Performance Analysis and Tuning “Shak and Larry Woodman Show”. This year, along with Bill Gray, I was honored to be a small piece of this very well attended talk.

Red Hat’s event A/V staff continues to raise the bar, and has posted videos here: Part 1 and Part 2. I hope they’re helpful!

“Baby, we were born to run…” — Every userspace process ever.

I tweeted recently that %usr is what we wanna do; %sys is what we gotta do…What I meant to point out is that the kernel’s main goals in life are to bring up hardware, manage access to it on behalf of applications, and get out of the way.  This includes objectives like allocating memory when an application asks for it, taking network packets from an application and giving them to the network card, and deciding what application runs on what core, when it runs (ordering), and for how long.

Since at least the days of the Apollo Guidance Computer, there has been the concept of priorities in job scheduling.  Should you have the time, I highly recommend the Wikipedia article, this book, and the AGC Emulator.

Anyway, in more recent operating systems like Linux, the user interface to the job scheduler is quite similar — a system of policies and priorities.  There’s a great write-up in the Red Hat MRG Realtime docs here.

The system of policies and priorities represents a multi-tiered approach to ordering jobs on a multitasking operating system.  The user herself, or an application, may request a certain scheduling policy and priority from the kernel.  By themselves, those values don’t mean much.  But when there’s a contended resource (such as a single CPU core), they quickly come into play by informing the scheduler what the various task priorities are in relation to each other.  For example, in the case of the AGC, an engine control application would be prioritized higher than, say, a cabin heater.

The kernel can’t read minds, so we occasionally must provide it with guidance as to which application is the highest priority.  If you have a server whose purpose is to run an application that predicts the weather, you don’t need log cleanup scripts, data archival or backups etc. running when the weather app has “real work” to do.  Without any guidance, the kernel will assume these tasks are of equal weight, when in fact the operator knows better.

The tools to manipulate scheduler policy and priority are things like nice and chrt (there are also syscalls that apps can use directly).  In the previous example, you might use nice to inform the scheduler that the weather application is the most important task on the system, and that it should run whenever possible.  Something like ‘nice -n -20 ./weather’ or ‘renice -n -20 -p `pidof weather`’.
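
chrt covers the realtime scheduling policies in the same spirit; a quick sketch (the policy and priority values here are purely illustrative):

# chrt -f 10 ./weather                 <-- start under SCHED_FIFO at priority 10
# chrt -f -p 10 `pidof weather`        <-- or change an already-running process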

Back to the kernel’s main point in life:  mediating access to hardware.  In order to do this, the kernel may spawn a special type of process called a kthread.  Kthreads can’t be controlled quite like regular processes; for example, you can’t kill them, and CPU/memory affinity tuning doesn’t always apply.  At some point, if these kthreads have work to do, the scheduler will let them run.  I wrote about some of this previously…They have important functions, like writing out dirty memory pages to disk (bdi-flush), shuffling network packets around (ksoftirqd) or servicing various kernel modules like InfiniBand.

When the kthreads run, they might run on the same core where the weather app is running.  This interruption of userspace execution can cause a few symptoms: jittery latency, increased CPU cache misses, and poor overall performance.

If you’re staring at one of these symptoms, you might be curious what’s the easiest way to find out what’s bumping you off-core and dumping your precious cache lines.

There are a few ways to determine this.  I wrote about how to use perf sched record to do it in a low latency whitepaper, but wanted to write about a 2nd method I’ve been using a bit lately as well.
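
For reference, the perf sched approach looks roughly like this: record a window of scheduler activity system-wide, then report per-task scheduling delays sorted by worst-case delay:

# perf sched record -- sleep 10
# perf sched latency --sort max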

You can use a Systemtap script included in RHEL6 called ‘cycle_thief.stp’ (written by Red Hat’s Will Cohen) to find out what’s jumping ahead of you.  Here’s an example; PID 3391 is a KVM guest.  I added the [Section X] markers to make explaining the output a bit easier.  I also removed the histogram buckets with zero values to shorten the output.  Finally, I let it run for 30 seconds before hitting Ctrl+C.

# stap cycle_thief.stp -x 3391
^C
[Section 1]  task 3391 migrated: 1
[Section 2]  task 3391 on processor (us):
value |-------------------------------------------------- count
 16   |@@@@@@@@@@@@ 12
 32   |@@@@@@@@@@@ 11
 64   |@ 1
[Section 3] task 3391 off processor (us)
value   |-------------------------------------------------- count
 128    |@@@@@@@@@@@@ 12
 8192   |@@@@ 4
 131072 |@@@@ 4
 524288 |@@@ 3
[Section 4]
other pids taking processor from task 3391
 0    55
 3393 17
 2689 13
 115  4
 69   2
 431  1
[Section 5]
irq taking processor from task 3391
irq count min(us) avg(us) max(us)

Section 1 represents the number of times PID 3391 was migrated between CPU cores.

Section 2 is a histogram of the number of microseconds PID 3391 was on-core (actively executing on a CPU).

Section 3 is a histogram of the number of microseconds PID 3391 was off-core (something else was running).

Section 4 identifies which PIDs executed on the same core PID 3391 wanted to use during those 30 seconds (and thus bumped PID 3391 off-core).  You can grep the process table to see what these are.  Sometimes you’ll find other userspace processes, sometimes you’ll find kthreads.  You can see this KVM guest was off-core more than on.  It’s just an idle guest I created for this example, so that makes sense.

Section 5 is blank; had there been any IRQs serviced by this core during the 30 second script runtime, they’d be counted here.

With an understanding of the various policies and priorities (see the MRG docs or man 2 setpriority), cycle_thief.stp is a super easy way of figuring out how to set your process policies and priorities to maximize the amount of time your app is on-core doing useful work.

Battle Plan for RDMA over Converged Ethernet (RoCE)

What is all that %sys time ?  “I never know what she’s _doing_ in there…” Ha!

12:01:35 PM CPU %usr %nice %sys %iowait %irq %soft %idle
12:01:36 PM all 0.08 0.00  3.33 0.00    0.00 5.00  91.59
12:01:36 PM 0   0.00 0.00 40.59 0.00    0.00 59.41  0.00

...

You can instantly find out with ‘perf top’.  In this case (netperf), the kernel is spending time copying skb’s around, mediating between kernel and userspace.  I wrote a bit about this in a previous blog post; the traditional protection ring.
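
For example, to watch just the core that is burning %sys in the mpstat output above:

# perf top -C 0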

All that copying takes time…precious, precious time.  And CPU cycles; also precious.  And memory bandwidth…etc.

HPC customers have, for decades, been leveraging Remote Direct Memory Access (RDMA) technology to reduce latency and associated CPU time.  They use InfiniBand fabrics and associated InfiniBand verbs programming to extract every last bit of performance out of their hardware.

As always, those last few percent of performance end up being the most expensive, both in terms of hardware and software, as well as the people-talent and their effort.  But they’re also sometimes the most lucrative.

Over the last few years, some inroads have been made in lowering the bar to entry for RDMA, with one of those being RoCE (RDMA over Converged Ethernet).  My employer Red Hat ships RoCE libraries (for Mellanox cards) in the “High Performance Networking” channel.

I’ve recently been working on characterizing RoCE in the context of its usefulness in various benchmarks and customer workloads, so to that end I went into the lab and wired up a pair of Mellanox ConnectX-3 VPI cards back-to-back with a 56Gbit IB cable.  The cards are inside Sandy Bridge generation servers.

Provided some basic understanding of the hideous vernacular in this area, it turns out to be shockingly easy to set up RoCE.  Here’s some recommended reading to get you started:

First thing, make sure your server is subscribed to the HPN channel on RHN.  Then let’s get all the packages installed.

# yum install libibverbs-rocee libibverbs-rocee-devel libibverbs-rocee-devel-static libibverbs-rocee-utils libmlx4-rocee libmlx4-rocee-static rdma mstflint libibverbs-utils infiniband-diags

The Mellanox VPI cards are multi-mode, in that they support either InfiniBand or Ethernet.  The cards I’ve got came in InfiniBand mode, so I need to switch them over.  Mellanox ships a script called connectx_port_config to change the mode, but we can do it with driver options too.

Get the PCI address of the NIC:

# lspci | grep Mellanox
 21:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]

Check what ethernet devices exist currently:

# ls -al /sys/class/net

I see ib0/1 devices now, since my cards are in IB mode.  Now let’s change to Ethernet mode.  Note you need to substitute your PCI address, as it will likely differ from mine (21:00.0).  I need “eth” twice since this is a dual-port card.

 # echo "0000:21:00.0 eth eth" >> /etc/rdma/mlx4.conf
 # modprobe -r mlx4_ib
 # modprobe -r mlx4_en
 # modprobe -r mlx4_core
 # service rdma restart ; chkconfig rdma on
 # modprobe mlx4_core
 # ls -al /sys/class/net

Now I see eth* devices (you may see pXpY names depending on the BIOS), since the cards are now in eth mode. If you look in dmesg you will see the mlx4 driver automatically sucked in the mlx4_en module accordingly.  Cool!

Let’s verify that there is now an InfiniBand device ready for use:

# ibstat
CA 'mlx4_0'
	CA type: MT4099
	Number of ports: 2
	Firmware version: 2.11.500 <-- flashed the latest fw using mstflint.
	Hardware version: 0
	Node GUID: 0x0002c90300a0e970
	System image GUID: 0x0002c90300a0e973
	Port 1:
		State: Active  <-------------------- Sweet.
		Physical state: LinkUp
		Rate: 40
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0202c9fffea0e970
		Link layer: Ethernet
	Port 2:
		State: Down
		Physical state: Disabled
		Rate: 10
		Base lid: 0
		LMC: 0
		SM lid: 0
		Capability mask: 0x00010000
		Port GUID: 0x0202c9fffea0e971
		Link layer: Ethernet

Cool, so we’ve got our RoCE device up from a hardware-init standpoint; now give it an IP like any old NIC.
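
For example, assuming the port came up as eth2 (the device name, address and prefix below are placeholders; use your own):

# ip addr add 172.17.2.41/24 dev eth2
# ip link set eth2 up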

Special note for IB users:  most IB switches have a subnet manager built in (RHEL ships one too, called opensm).  But there’s no need for opensm when using RoCE: the subnet manager is specific to InfiniBand fabrics and plays no part in Ethernet fabrics, and RoCE runs over Ethernet.  The InfiniBandTA article I linked above goes into some detail about what benefits the SM provides on IB fabrics.

Now we get to the hard and confusing part.  Just kidding, we’re done.  Was it that intimidating ?  Let’s test it out using an RDMA application that ships with Red Hat MRG Messaging, called qpid-latency-test.  I chose it because it supports RDMA as a transport.

# yum install qpid-cpp-server qpid-cpp-server-rdma qpid-cpp-client qpid-cpp-client-devel -y
# qpidd --auth no -m no
 2013-03-15 11:45:00 [Broker] notice SASL disabled: No Authentication Performed
 2013-03-15 11:45:00 [Network] notice Listening on TCP/TCP6 port 5672
 2013-03-15 11:45:00 [Security] notice ACL: Read file "/etc/qpid/qpidd.acl"
 2013-03-15 11:45:00 [System] notice Rdma: Listening on RDMA port 5672  <-- Sweet.
 2013-03-15 11:45:00 [Broker] notice Broker running

Defaults: around 100us.

# numactl -N0 -m0 nice -20 qpid-latency-test -b 172.17.2.41 --size 1024 --rate 10000 --prefetch=2000 --csv
 10000,0.104247,2.09671,0.197184
 10000,0.11297,2.12936,0.198664
 10000,0.099194,2.11989,0.197529
 ^C

With tcp-nodelay: around 95us

# numactl -N0 -m0 nice -20 qpid-latency-test -b 172.17.2.41 --size 1024 --rate 10000 --tcp-nodelay --prefetch=2000 --csv
 10000,0.094664,3.00963,0.163806
 10000,0.093109,2.14069,0.16246
 10000,0.094269,2.18473,0.163521

With RDMA/RoCE/HPN:  around 65us.

# numactl -N0 -m0 nice -20 qpid-latency-test -b 172.17.2.41 --size 1024 --rate 10000 --prefetch=2000 --csv -P rdma
 10000,0.065334,1.88211,0.0858769
 10000,0.06503,1.93329,0.0879431
 10000,0.062449,1.94836,0.0872795
 ^C

Percentage-wise, that’s a really substantial improvement.  Plus don’t forget all the %sys time (which also includes memory subsystem bandwidth usage) you’re saving.  You get all those CPU cycles back to spend on your application!

Disclaimer:  I didn’t do any heroic tuning on these systems.  The above performance numbers are only meant to illustrate proportional improvements.  Don’t pay much attention to the raw numbers other than order-of-magnitude.  You can do much better by starting with this guide.

So!  Maybe kick the tires on RoCE, and get closer to wire speed with lower latencies.  Have fun!

Big-win I/O performance increase coming to KVM guests in RHEL6.4

I finally got the pony I’ve been asking for.

There’s a very interesting (and impactful) performance optimization coming to RHEL6.4.  For years we’ve had to do this sort of tuning manually, but thanks to the power of open source, this magical feature has been implemented and is headed your way in RHEL6.4 (try it in the beta!)

What is this magical feature…is it a double-rainbow ?  Yes.  All the way.

It’s vhost thread affinity via virsh emulatorpin.

If you’re familiar with the vhost_net network infrastructure added to Linux, it moves the network I/O out of the main qemu userspace thread to a kthread called vhost-$PID (where $PID is the PID of the main KVM process for that particular guest).  So if your KVM guest is PID 12345, you would also see a [vhost-12345] process.
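
They are easy to spot, since the thread name embeds the qemu PID:

# pgrep -l vhost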

Anyway…with the growing number of CPUs and amount of RAM available, and the proliferation of NUMA systems (basically everything x86 these days), we have to be very careful to respect NUMA topology when tuning for maximum performance.  Lots of common optimizations these days center around NUMA affinity tuning, and the vhost affinity support is tangentially related to that.

If you are concerned with having the best performance for your KVM guest, you may have already used either virsh or virt-manager to bind the VCPUs to physical CPUs or NUMA nodes.  virt-manager makes this very easy by clicking “Generate from host NUMA configuration”:

[Screenshot: vcpupin (virt-manager VCPU pinning dialog)]

OK that’s great.  The guest is going to stick around on those odd-numbered cores.  On my system, the NUMA topology looks like this:

# lscpu|grep NUMA
NUMA node(s): 4
NUMA node0 CPU(s): 0,2,4,6,8,10
NUMA node1 CPU(s): 12,14,16,18,20,22
NUMA node2 CPU(s): 13,15,17,19,21,23
NUMA node3 CPU(s): 1,3,5,7,9,11

So virt-manager will confine the guest’s VCPUs to node 3.  You may think you’re all set now.  And you’re close; you can see the rainbow on the horizon.  You have already significantly improved guest performance by respecting the physical NUMA topology, but there is more to be done.  Inbound pony.

Earlier I described the concept of the vhost thread, which handles the network processing for its associated KVM guest.  We need to make sure that the vhost thread’s affinity matches the KVM guest affinity that we implemented with virt-manager.

At the moment, this feature is not exposed in virt-manager or virt-install, but it’s still very easy to do.  If your guest is named ‘rhel64’ and you want to bind its “emulator threads” (like vhost-net), all you have to do is:

# virsh emulatorpin rhel64 1,3,5,7,9,11 --live
# virsh emulatorpin rhel64 1,3,5,7,9,11 --config

Now the vhost-net threads share a last-level-cache (LLC) with the VCPU threads.  Verify with:

# taskset -pc <PID_OF_KVM_GUEST>
# taskset -pc <PID_OF_VHOST_THREAD>

These should match.  Cache memory is many orders of magnitude faster than main memory, and the performance benefits of this NUMA/cache sharing are obvious…using netperf:

Avg TCP_RR (latency)
Before: 12813 trans/s
After: 14326 trans/s
% diff: +10.5%
Avg TCP_STREAM (throughput)
Before: 8856Mbps
After: 9413Mbps
% diff: +5.9%

So that’s a great performance improvement; just remember for now to run the emulatorpin stuff manually. Note that as I mentioned in previous blog posts, I always mis-tune stuff to make sure I did it right. The “before” numbers above are from the mis-tuned case 😉

Off topic…while writing this blog I was reminded of a really funny story I read on Eric Sandeen’s blog about open source ponies. Ha!

Generating arbitrary network packets using the pktgen kernel module

I am staring at a workload that is zillions upon zillions of very tiny packets, and each one is important.  They've got to get there fast.  As fast as possible.  Nagle:  you are not welcome here.  I am seeing some seemingly random jitter, and it's only on this one system.  <confused>

I need to take apart this stack piece by piece, and test each piece in isolation.   Let's start at the lowest level possible.  RHEL6 includes a kernel module called pktgen (modprobe pktgen).  This module allows you to create network packets, specify their attributes, and send them at the fastest possible rate with the least overhead.

Using pktgen, I was able to achieve over 3.3M packets per second on a 10Gb Solarflare NIC.  These packets don't incur any TCP/UDP protocol processing overhead.  You can watch the receiver's netstat/IP counters, though.

Since these are synthetic packets, you have to give pktgen some basic information in order for the packets to be constructed with enough info to get where they're going.  Things like destination IP/MAC, the number of packets and their size.  I tested tiny packets, 64 bytes (because that's what this workload needs).  I also tested jumbo frames just to be sure I was doing it right.

This brings up a habit of mine worth mentioning; purposely mis-tuning your environment to validate your settings.  A sound practice!
To get to 3.3Mpps, I only had to make one key change: use a 10x factor for clone_skb.  Anything less than 10 led to fewer packets (a value of zero halved the pps throughput as compared to 10).  Anything more than 10 had no performance benefit, so I'm sticking with 10 for now.
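
For the curious, here is roughly the raw /proc/net/pktgen sequence being driven under the hood, using the same values as the run shown below (device p1p1, pktgen worker thread bound to CPU 4):

 # modprobe pktgen
 # echo "rem_device_all" > /proc/net/pktgen/kpktgend_4
 # echo "add_device p1p1" > /proc/net/pktgen/kpktgend_4
 # echo "count 100000000" > /proc/net/pktgen/p1p1
 # echo "clone_skb 10" > /proc/net/pktgen/p1p1
 # echo "pkt_size 60" > /proc/net/pktgen/p1p1
 # echo "delay 0" > /proc/net/pktgen/p1p1
 # echo "dst 172.17.1.53" > /proc/net/pktgen/p1p1
 # echo "dst_mac 00:0f:53:0c:58:98" > /proc/net/pktgen/p1p1
 # echo "start" > /proc/net/pktgen/pgctrl       <-- blocks while sending; Ctrl+C to stop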

I wrote a little helper script (actually modified something I found online):

./pktgen.sh <NIC_NAME> <CPU_CORE_NUMBER>

# ./pktgen.sh p1p1 4
Running pktgen with config:
---------------------------
NIC=p1p1
CPU=4
COUNT=count 100000000
CLONE_SKB=clone_skb 10
PKT_SIZE=pkt_size 60
DELAY=delay 0
MAX_BEFORE_SOFTIRQ=10000

Running...CTRL+C to stop

^C
Params: count 100000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 10  ifname: p1p1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 172.17.1.53  dst_max:
        src_min:   src_max:
     src_mac: 00:0f:53:0c:4b:ac dst_mac: 00:0f:53:0c:58:98
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 17662390  errors: 0
     started: 2222764017us  stopped: 2228095026us idle: 40us
     seq_num: 17662391  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x330111ac  cur_daddr: 0x350111ac
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 5331009(c5330968+d40) nsec, 17662390 (60byte,0frags)
  3313141pps 1590Mb/sec (1590307680bps) errors: 0

^^ ~3.3 million packets per second.

Without the protocol and higher-layer processing, the 3.3M number has somewhat limited value.  What it's testing is the kernel's TX path, the driver and NIC firmware, and it validates the physical infrastructure.  This is useful for, e.g., regression testing of drivers, validating NIC firmware, or tuning the TX path for whatever particular packet profile your application will drive.

I want to be clear -- micro-benchmarks like this have their place.  But take care when designing benchmarks to ultimately include as much of your stack as possible in order to draw usable conclusions.  I stumbled on a quote from Linus Torvalds on this topic that I really liked:

"please don't ever benchmark things that don't make sense, and then use the numbers as any kind of reason to do anything. It's worse than worthless. It actually adds negative value to show "look ma, no hands" for things that nobody does. It makes people think it's a good idea, and optimizes the wrong thing entirely.
Are there actual real loads that get improved? I don't care if it means that the improvement goes from three orders of magnitude to just a couple of percent. The "couple of percent on actual loads" is a lot more important than "many orders of magnitude on a made-up benchmark".

Truth.

The Open-Source Advantage…through the eyes of Red Hat, DreamWorks and OpenStack

Let’s say you’re a big company in a competitive industry.  One who innovates and succeeds by creating software.  Not extending COTS, not adapting existing code.  Generating fresh, new code, at your full expense.  The value the company receives by investing in the creation of that software is competitive advantage, sometimes known as the profit-motive.

You’re an executive at this company.  Creating the software was your idea.  You are responsible for the ROI calculations that got the whole thing off the ground.

Your career may rest on the advantage the invention provides, and you are boldly considering opening up the code you’ve created.  But first you need to understand why so many of your peers (at company after company) have staked their careers behind open-sourcing what was once their company’s secret sauce.

Let’s look at some examples.  Since I work for Red Hat, I’ll mention a few of our own first.  But we’re a different animal, so I’ve also looked into other verticals to find powerful examples.

Over the last 5 years, Red Hat has open sourced software we’ve both built and acquired, such as RHEV-M (oVirt), CloudForms (Katello) and OpenShift (Origin).  Just as other companies derive value from our software, we also derive some value from individuals/companies extending software, generating powerful ecosystems of contributors both public and private.  For example, http://studiogrizzly.com/ is extending the open-source upstream for Red Hat’s Platform-as-a-Service called OpenShift Origin.

Both Rackspace and NASA also identified open-source as a way to build better software faster.  Their OpenStack project is a living, breathing example of how a project can be incubated within a closed ecosystem (closed in governance rather than code), grow beyond anyone’s imagination to scratch an itch that no one else can, and blossom into an incredible example of community driven innovation.  As a proof-point, this year the governance of OpenStack transitioned to a Foundation model.  I should mention that Red Hat maintains a seat on that Foundation board, contributes significant resources/code to the upstream project and productized an OpenStack enterprise distribution over the summer.

More recently, DreamWorks has decided to open-source an in-house tool they’ve developed called OpenVDB.

It’s well known that DreamWorks relies heavily on open-source software (Linux in particular).  But to have them contribute directly, using The Open Source Way, truly deserves a closer look.  What could have led them to possibly eliminate any competitive advantage derived from OpenVDB ?

While I have no specific knowledge of DreamWorks’ particular situation, here are some ideas:

  • The industry moved on, and they’ve extracted most of the value already.
  • They are moving on to greener pastures, driving profit through new tools or techniques.
  • Maybe OpenVDB has taken on a life of its own, and although critical to business processes, it would benefit from additional developers.  But they’d rather pay artists and authors.

If I had to guess, it would be closest to:

  • The costs/maintenance burden for OpenVDB exceeds the value derived.  Set it free.

Now it’s the competitor’s move.  Will they simply study OpenVDB and take whatever pieces they were missing or did poorly ?  Will they jump into a standardization effort ?

The answer may lie in the competitor’s view on the first bullet above, and whether they have a source of differentiation (aka revenue) outside of the purpose of OpenVDB.  If they do, standardizing tools will benefit both parties by eliminating duplicate effort/code, or possibly reducing/eliminating the maintenance burden of internal tools.  And that will generate better software faster.

This speaks to a personal pet peeve of mine: duplicated effort.  Again and again I see extremely similar open-source tools popping up.  For argument’s sake, let’s say this means that the ecosystem of developers spent (# of projects * # man-hours) creating the software.  If they’d collaborated, it may have resulted in a single more powerful tool with added feature velocity, and potentially multiplied any corporate-backed funding.  That’s a powerful reason to attempt to build consensus before software.  But I digress…

processor.max_cstate, intel_idle.max_cstate and /dev/cpu_dma_latency

Customers often want the lowest possible latency for their application, whether it’s VOIP, weather modeling, financial trading, etc.  For a while now, CPUs have had the ability to transition between frequencies (P-states) based on load.  This is called a CPU frequency governor, and the user interface to it is the cpuspeed service.

For slightly less time, CPUs have been able to switch certain sections of themselves on or off, and scale voltages up or down, to save power.  This capability is known as C-states.  The downside to the power savings that C-states provide is a decrease in performance, as well as non-deterministic performance outside of the application’s or operating system’s control.

Anyway, for years I have been seeing things like processor.max_cstate on the kernel cmdline.  This got the customer much better performance at higher power draw, a business decision they were fine with.  As time went on, people began looking at their datacenters and thinking about how power was getting so expensive that they’d like to find a way to consolidate.  That’s code for virtualization.  But what about workloads so far unsuitable for virtualization, like those I’ve mentioned…those that should continue to run on bare metal, and further, maybe even on specialized hardware ?

A clear need for flexibility:  sysadmins know that changes to the kernel cmdline require reboots.  But the desire to enable the paradigm of absolute performance only-when-I-need-it demands that this be a run-time tunable.  Enter the /dev/cpu_dma_latency “pmqos” interface.  This interface lets you specify a target latency for the CPU, meaning you can use it to indirectly control (with precision) the C-state residency of your processors.  For now, it’s an all-or-nothing affair, but stay tuned, as there is work underway to increase the granularity of C-state residency control to per-core.

Then in 2011, Red Hatter Jan Vcelak wrote a handy script called pmqos-static.py that enables this paradigm.  No more dependence on the kernel cmdline.  On demand, toggle C-states to your application’s desire.  Control C-states from your application startup/shutdown scripts.  Use cron to dial up the performance before business hours and dial it down after hours.  Significant power savings can come from this simple script, when compared to using the cmdline.

A few notes, before the technical detail.

1) When you set processor.max_cstate=0, the kernel actually silently sets it to 1.

drivers/acpi/processor_idle.c:1086:
 1086 if (max_cstate == 0)
 1087 max_cstate = 1;

2) RHEL6 has had this interface forever, but only recently do we have the pmqos-static.py script.

3) This script provides “equivalent performance” to kernel cmdline options, with added flexibility.

So here’s what I mean…this is turbostat output on RHEL6.3, on a Westmere X5650 (note: same behavior on a SNB E5-2690):

Test #1: processor.max_cstate=0
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 2.37 3.06 2.67 0.03 0.13 97.47 0.00 67.31
Test #2: processor.max_cstate=1
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.04 2.16 2.67 0.04 0.55 99.37 4.76 88.00
Test #3: processor.max_cstate=0 intel_idle.max_cstate=0
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.02 2.20 2.67 99.98 0.00 0.00 0.00 0.00
Test #4: processor.max_cstate=1 intel_idle.max_cstate=0
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.02 2.29 2.67 99.98 0.00 0.00 0.00 0.00
Test #5: intel_idle.max_cstate=0
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.02 2.19 2.67 99.98 0.00 0.00 0.00 0.00
# rpm -q tuned
tuned-0.2.19-9.el6.noarch
Test #6: now with /dev/cpu_dma_latency set to 0 (via latency-performance  profile) and intel_idle.max_cstate=0.
The cmdline overrides /dev/cpu_dma_latency.
# tuned-adm profile latency-performance
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.01 2.32 2.67 99.99 0.00 0.00 0.00 0.00
Test #7: no cmdline options + /dev/cpu_dma_latency via latency-performance profile.
# tuned-adm profile latency-performance
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 100.00 2.93 2.67 0.00 0.00 0.00 0.00 0.00

There is additional flexibility, too…let me illustrate:

# find /sys/devices/system/cpu/cpu0/cpuidle -name latency -o -name name | xargs cat
C0
0
NHM-C1
3
NHM-C3
20
NHM-C6
200

This shows you the exit latency (in microseconds) for the various C-states on this particular Westmere (the cpuidle driver labels the states NHM, since Westmere is a Nehalem-class part). Each time the CPU transitions between C-states, you take a latency hit of almost exactly that number of microseconds (which I can see in benchmarks). By default, an idle Westmere core sits in C6 (SNB sits in C7). To get that core up to C0, it takes 200us.

Here’s what I meant about flexibility. You can control exactly what C-state you want your CPUs in via /dev/cpu_dma_latency and the pmqos-static.py script, all dynamically at runtime. cmdline options do not allow for this level of control; as I showed, they override /dev/cpu_dma_latency.  Exhaustive detail about what to expect from each C-state can be found in Intel’s architecture documentation.  Around page 35-52 or so…

Using the information I pulled out of /sys above…set it to 200 and you’re in the deepest C-state:

# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=200
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.04 2.19 2.67 0.04 0.26 99.66 0.91 91.91

Set it to anything in between 20 and 199, and you get into C3:

# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=199
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.03 2.28 2.67 0.03 99.94 0.00 89.65 0.00

Set it to anything in between 1 and 19, and you get into C1:

# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=19
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.02 2.18 2.67 99.98 0.00 0.00 0.00 0.00

Set it to 0 and you get into C0. This is what the latency-performance profile does.

# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=0
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 100.00 2.93 2.67 0.00 0.00 0.00 0.00 0.00

Sandy Bridge chips also have C7, but the same rules apply.

You have to decide whether this flexibility buys you anything in order to justify rolling any changes across your environment.

Maybe just understanding how and why this works might be enough! 🙂

Moving away from useless kernel cmdline options is one less thing for you to maintain, although I realize you still have to enable the tuned profile.  So, dial up the performance when you need it, and save power/money when you don’t!  Pretty cool!

How to deal with latency introduced by disk I/O activity…

The techniques used by Linux to get dirty pages onto persistent media have changed over the years.  Most recently the change was from a gang of threads (called pdflush) to a per-backing-device thread model: basically one thread per LUN/mount.  While neither is perfect, the fact of the matter is that you shouldn’t really have to care about disk interference with your latency-sensitive app.  Right now, we can’t cleanly apply the normal affinity tuning to the flush-* bdi kthreads, and thus cannot effectively shield the latency-sensitive app entirely.

I’m going to stop short of handing out specific tuning advice, because I have no idea what your I/O pattern looks like, and that matters.  A lot.  Suffice it to say that (just like your latency-sensitive application) you’d prefer more frequent, smaller transfers/writes over less frequent, larger transfers (which are optimized for throughput).

Going a step further, you often hear about using tools like numactl or taskset to affinitize your application to certain cores, and chrt/nice to control the task’s scheduler policy and priority.  The flusher threads are not so easy to deal with.  We can’t apply the normal tuning using any of the above tools, because the flush-* threads are kernel threads, created using the kthread infrastructure.  The bdi flush threads also take a fundamentally different approach than other kthreads, which are instantiated at boot (like migration) or at module insertion time (like nfs).  There’s no way to set a “default affinity mask” on kthreads, and kthreads are not subject to isolcpus.

Even in the current upstream kernel, the flush-* threads are started on demand (like when you mount a new filesystem), and then they go away after some idle time.  When they come back, they have a new pid.  That behavior doesn’t mesh well with affinity tuning.

By contrast, nfsd kthreads do not come and go after they are first instantiated, so you can apply typical affinity tuning to them and get measurable performance gains.

For now:

  • Write as little as possible.  And/or write to (shared)memory, then flush it later.
    • Take good care of your needed data, though!  Memory contents go bye-bye in a “power event”.
  • Get faster storage like a PCI-based RAM/SSD disk
  • Reduce the amount of dirty pages kept in the page cache.
  • Increase the frequency at which dirty pages are flushed, so that less is written each time (see the sketch after this list).
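
A minimal sketch of those last two bullets using the kernel’s writeback sysctls; the values are illustrative only, and the right numbers depend entirely on your I/O pattern:

# sysctl -w vm.dirty_background_ratio=1        <-- start background writeback sooner
# sysctl -w vm.dirty_ratio=10                  <-- cap how much dirty data may accumulate
# sysctl -w vm.dirty_expire_centisecs=500      <-- consider pages old enough to write sooner
# sysctl -w vm.dirty_writeback_centisecs=100   <-- wake the flusher threads more often
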
Further reading:
http://lwn.net/Articles/352144/ (subscription required)
http://lwn.net/Articles/324833/ (subscription required)