How to deal with latency introduced by disk I/O activity…

The techniques used by Linux to get dirty pages onto persistent media have changed over the years.  Most recently the change was from a gang of threads called pdflush to a per-backing-device thread model, basically one thread per LUN/mount.  While neither is perfect, the fact of the matter is that you shouldn’t really have to care about disk interference with your latency-sensitive app.  Right now, though, we can’t cleanly apply the normal affinity tuning to the flush-* bdi kthreads and thus cannot effectively shield the latency-sensitive app entirely.

I’m going to stop short of handing out specific tuning advice, because I have no idea what your I/O pattern looks like, and that matters.  A lot.  Suffice it to say that (just like your latency-sensitive application), you’d prefer more frequent smaller transfers/writes over less frequent, larger transfers (which are optimized for throughput).

Going a step further, you often hear about using tools like numactl or taskset to affinitize your application to certain cores, and chrt/nice to control the task’s priority and policy related to the scheduler.  These flusher threads are not easy to deal with.  We can’t apply the normal tuning using any of the above tools, because the flush-* threads are kernel threads, created using the kthread infrastructure.  bdi flush threads take a fundamentally different approach than other kthreads, which are instantiated on boot (like migration), or module insertion time (like nfs). There’s no way to set a “default affinity mask” on kthreads, and kthreads are not subject to isolcpus.
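
For contrast, here is roughly what the “normal” affinity tuning looks like for an ordinary user-space process.  This is only a minimal sketch: it uses Python’s os.sched_setaffinity (a Linux-only wrapper around the same system call taskset uses), and the helper name pin_to_cpu is mine, not something from any tool mentioned above.

```python
import os


def pin_to_cpu(cpu, pid=0):
    """Restrict `pid` (0 means the calling process) to a single CPU.

    Hypothetical helper for illustration; Linux only.
    Returns the affinity set the kernel reports afterwards.
    """
    os.sched_setaffinity(pid, {cpu})
    return os.sched_getaffinity(pid)


if __name__ == "__main__":
    allowed = os.sched_getaffinity(0)
    target = min(allowed)  # pick any CPU we are already allowed to run on
    print("before:", sorted(allowed))
    print("after: ", sorted(pin_to_cpu(target)))
```

The call takes a pid, which is exactly why it fits long-lived processes well and short-lived, re-spawned threads poorly.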

Even up to the current upstream kernel version, the flush-* threads are started on demand (like when you mount a new filesystem), and then they go away after some idle time.  When they come back, they have a new pid.  That behavior doesn’t mesh well with affinity tuning, which is applied per pid.

By contrast, nfsd kthreads do not come and go after they are first instantiated, so you can apply typical affinity tuning to them and get measurable performance gains.
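
If you want to watch this pid churn for yourself, something like the following sketch can help.  It assumes a kernel that exposes /proc/<pid>/comm and that the flusher threads show up with a "flush-" name prefix, as they do on the kernels discussed here; the function name find_threads is my own.

```python
import os


def find_threads(prefix, proc_root="/proc"):
    """Return {pid: name} for tasks whose comm starts with `prefix`.

    Hypothetical helper; assumes /proc/<pid>/comm is available.
    """
    found = {}
    for entry in os.listdir(proc_root):
        if not entry.isdigit():
            continue  # skip non-pid entries such as "self" or "sys"
        try:
            with open(os.path.join(proc_root, entry, "comm")) as f:
                name = f.read().strip()
        except OSError:
            continue  # the task exited while we were scanning
        if name.startswith(prefix):
            found[int(entry)] = name
    return found


if __name__ == "__main__":
    for pid, name in sorted(find_threads("flush-").items()):
        try:
            cpus = sorted(os.sched_getaffinity(pid))
        except OSError:
            continue  # thread went away between the scan and this call
        print(pid, name, cpus)
```

Running it before and after a period of idle I/O, and comparing the pids, should show whether the threads were re-created.  Passing "nfsd" as the prefix gives the stable case for comparison.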

For now:

  • Write as little as possible.  And/or write to (shared) memory, then flush it later.
    • Take good care of your needed data, though!  Memory contents go bye-bye in a “power event”.
  • Get faster storage, like a PCI-based RAM/SSD disk.
  • Reduce the amount of dirty pages kept in cache.
  • Increase the frequency at which dirty pages are flushed, so that there is less written each time.
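
The last two bullets are usually controlled through the vm.dirty_* sysctls.  Which knobs matter for you is my assumption, not something spelled out above, and since the right values depend on your I/O pattern, the sketch below only reads the current settings rather than changing them.  The helper name read_dirty_settings is hypothetical.

```python
import os

# Writeback-related sysctls under /proc/sys/vm (my pick of the relevant ones).
DIRTY_KNOBS = (
    "dirty_background_ratio",     # % of memory dirty before background flush starts
    "dirty_ratio",                # % of memory dirty before writers are throttled
    "dirty_expire_centisecs",     # age at which dirty data becomes eligible for flush
    "dirty_writeback_centisecs",  # how often the flusher threads wake up
)


def read_dirty_settings(base="/proc/sys/vm"):
    """Return {knob: int value, or None if the knob could not be read}."""
    settings = {}
    for knob in DIRTY_KNOBS:
        try:
            with open(os.path.join(base, knob)) as f:
                settings[knob] = int(f.read().strip())
        except (OSError, ValueError):
            settings[knob] = None
    return settings


if __name__ == "__main__":
    for knob, value in read_dirty_settings().items():
        print(f"vm.{knob} = {value}")
```

Recording these values before and after any change makes it much easier to tie a latency improvement (or regression) back to a specific knob.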

Tracking userspace memory allocation with glibc-utils memusage

Will Cohen turned me on to a little helper tool called memusage, which is distributed with glibc.  The purpose of that tool is to trace memory allocation behavior of a process.

In RHEL, the memusage binary is part of the glibc-utils package.  There’s actually also a shared library called /usr/lib64/ that’s part of the base glibc package, which can be used via LD_PRELOAD.

memusage writes a summary of its output to your terminal.

It is also capable of writing memory allocation over time to a png file.
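
As a rough sketch of how an invocation might be put together, the snippet below builds a memusage command line using its --data and --png options.  The wrapper function memusage_cmd and the file names are placeholders of mine, and `sleep 1` simply stands in for whatever program you actually want to trace.

```python
import shutil
import subprocess


def memusage_cmd(program, png=None, data=None):
    """Build a memusage command line for `program` (a list of argv strings).

    Hypothetical helper; png/data are output file names of your choosing.
    """
    cmd = ["memusage"]
    if data:
        cmd.append("--data=" + data)  # raw allocation trace
    if png:
        cmd.append("--png=" + png)    # graph rendered from the trace
    return cmd + list(program)


if __name__ == "__main__":
    cmd = memusage_cmd(["sleep", "1"], png="sleep.png", data="sleep.dat")
    print(" ".join(cmd))
    # Only run it if memusage is actually installed (glibc-utils on RHEL).
    if shutil.which("memusage"):
        subprocess.run(cmd, check=False)
```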

Netperf is not a particularly memory-intensive benchmark for illustrating its usage; I just wanted to describe the utility.  I’ll upload more interesting graphs when I run more loads with the library.

Thoughts on Open vSwitch, kernel bypass, and 400gbps Ethernet…

For the Red Hat Summit this year, I wrote a paper on the kernel-bypass technology from Solarflare, called OpenOnload.  From a performance standpoint it’s hard to argue with the results.

I was looking at code from Open vSwitch recently, and it dawned on me that there is an important similarity between Open vSwitch and OpenOnload; a similar 2-phase approach…let me explain.

Both have a “connection setup” operation where many of the well-known user-space utilities come into play (and some purpose-built ones like ovs-vsctl) for things like adjusting routing, MTU, interface statistics, etc.  And then there is what you could call an accelerated path, which is used after the initial connection setup for passing bits to/from user-space, whether that be a KVM process or your matching engine.

In OpenOnload’s case, the accelerated path bypasses the Linux kernel, avoiding kernel-space/user-space data copies and context switches, and thus lowering latency.  This family of techniques, which includes RDMA, has been around for decades, and there are quite a few vendors out there with analogous solutions.  Often there are optimized drivers, things like OFED and a whole bunch of other tricks, but that’s beside my point…

The price paid for achieving this lower latency is having to completely give up, or entirely re-implement, lots of kernel goodies like what you’d expect out of netstat, ethtool and tcpdump.

In the case of Open vSwitch, there is a software “controller” (which decides what to do with a packet) and a data-path implemented in a kernel module that provides the best performance possible once the user-defined policy has been applied via the controller.  If you’re interested in Open vSwitch internals, here’s a nice presentation from Simon Horman.  I think the video is definitely worth a half hour!
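
To make the two-phase idea concrete, here is a deliberately tiny toy model of a slow path that makes a policy decision once and a fast path that reuses it.  This is not Open vSwitch code and says nothing about its real data structures; the class ToySwitch, the policy function and the addresses are all invented for illustration.

```python
class ToySwitch:
    """Toy two-phase forwarder: ask the controller once, then use a cache."""

    def __init__(self, controller):
        self.controller = controller  # slow path: decides what to do with a flow
        self.flow_table = {}          # fast path: previously installed decisions
        self.slow_path_hits = 0

    def forward(self, src, dst):
        key = (src, dst)
        action = self.flow_table.get(key)
        if action is None:
            # First packet of this flow: take the slow path and cache the result.
            self.slow_path_hits += 1
            action = self.controller(src, dst)
            self.flow_table[key] = action
        return action


def example_policy(src, dst):
    """Made-up policy: drop traffic to one address, forward everything else."""
    return "drop" if dst == "10.0.0.66" else "output:1"


if __name__ == "__main__":
    switch = ToySwitch(example_policy)
    for _ in range(3):
        switch.forward("10.0.0.1", "10.0.0.2")
    print("slow path consulted", switch.slow_path_hits, "time(s)")
```

Only the first packet of a flow pays for the policy lookup; every later packet is a dictionary hit, which is the whole point of having an accelerated path.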

Anyway, what do accelerated paths and kernel-bypass boil down to?  Things like swap-over-NFS, NFS-root, proliferation of iSCSI/NFS filers and FUSE-based projects like Gluster put network subsystem performance directly in the cross-hairs.  Most importantly, demands on the networking subsystem on all operating systems are pushing the performance boundaries of what the traditional protection ring concept can provide.

Developers go to great lengths to take advantage of the ring model; however, it seems that faster network throughput (btw, is 400gbps Ethernet the next step?) and lower latency requirements are more at odds than ever with the ring paradigm.

Linux and BSD’s decades-old niche of being excellent routing platforms will be tested (as it always is) by these future technologies and customer demand for them.  Looking forward to seeing how projects like OpenStack wire all of this stuff together!

Low Latency Performance Tuning Guide for Red Hat Enterprise Linux 6

Last month I wrote a paper for Red Hat customers called Low Latency Performance Tuning Guide for Red Hat Enterprise Linux 6 or LLPTGFRHEL6 for short 😉

It’s the product of significant research and my hands-on experiments into what configurations provide tangible benefit for latency-sensitive environments.  Although the traditional audience for this paper is the financial services industry, I have found that there are all sorts of latency-sensitive workloads out there.  From oil and gas to healthcare to the public sector and cloud, everyone wants the best performance out of their shiny new kit.

This paper started out as a formal response to many similar questions I was receiving from the field.  Initially a 1-2 page effort, within a day it had blown up to 14 pages of stuff from my mountain of notes.  Talk about boiling the ocean…although I was happy with the content, the formatting left a little to be desired so I pared it back to about 7 pages and linked out to other in-depth guides where it made sense…

I’m mostly happy with how it turned out…I know that customers were looking for this type of data (because they asked me over and over) and so I set out to conduct numerous experiments filling out each bullet point with hard data and zero hand-waving.  I wanted to explicitly corroborate or dispel certain myths that are floating around out there about performance impact of various knobs, so I tested each in isolation and reported my recommendations.

I do hope that this paper helps to guide administrators in their quest to realize ROI from both their hardware and software investments.  Please have a look and let me know what you think!

P.S.  Are there any other performance domains, workloads, use-cases or environments that you’d like us to look at?  Someone mentioned high-bandwidth-high-latency (long-fat-pipe) experiments…would that be of interest?