Generating arbitrary network packets using the pktgen kernel module

I am staring at a workload that is zillions upon zillions of very tiny packets, and each one is important.  They've got to get there fast.  As fast as possible.  Nagle:  you are not welcome here.  I am seeing some seemingly random jitter, and it's only on this one system.  <confused>

I need to take apart this stack piece by piece, and test each in isolation.   Let's start at the lowest level possible.  RHEL6 includes a kernel module called pktgen (modprobe pktgen).  This module allows you to create network packets, specify it's attributes and send them at the fastest possible rate with the least overhead.

Using pktgen, I was able to achieve over 3.3M packets per second on a 10GB Solarflare NIC.  These packets do not have any protocol TCP/UDP packet processing overhead.  You can watch the receivers netstat/IP counters, though.

Since these are synthetic packets, you have to give pktgen some basic information in order for the packets to be constructed with enough info to get there they're going.  Things like destination IP/MAC, the number of packets  and their size.  I tested tiny packets, 64bytes (because that's what this workload needs).  I also tested jumbo frames just to be sure I was doing it right.

This brings up a habit of mine worth mentioning; purposely mis-tuning your environment to validate your settings.  A sound practice!
To get to 3.3Mpps, I only had to make one key change.  Use a 10x factor for  clone_skb.  Anything less than 10 lead to fewer packets (a value of zero halved the pps throughput as compared to 10).  Anything more than 10 had no performance benefit, so I'm sticking with 10 for now.

I wrote a little helper script (actually modified something I found online)

./pktgen.sh <NIC_NAME> <CPU_CORE_NUMBER>

# ./pktgen.sh p1p1 4
Running pktgen with config:
---------------------------
NIC=p1p1
CPU=4
COUNT=count 100000000
CLONE_SKB=clone_skb 10
PKT_SIZE=pkt_size 60
DELAY=delay 0
MAX_BEFORE_SOFTIRQ=10000

Running...CTRL+C to stop

^C
Params: count 100000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 10  ifname: p1p1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 172.17.1.53  dst_max:
        src_min:   src_max:
     src_mac: 00:0f:53:0c:4b:ac dst_mac: 00:0f:53:0c:58:98
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 17662390  errors: 0
     started: 2222764017us  stopped: 2228095026us idle: 40us
     seq_num: 17662391  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x330111ac  cur_daddr: 0x350111ac
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 5331009(c5330968+d40) nsec, 17662390 (60byte,0frags)
  3313141pps 1590Mb/sec (1590307680bps) errors: 0

^^ ~3.3 million packets per second.

Without the protocol and higher layer processing, the number 3.3M has somewhat limited value.  What it's testing is the kernel's TX path, the driver, NIC firmware and validating physical infrastructure.  This is useful for i.e. regression testing of drivers, validating NIC firmware or tuning the TX path for whatever particular packet-profile your application will drive.

I want to be clear -- that micro-benchmarks like this have their place.  But take care when designing benchmarks to ultimately include as much/all of your stack as possible in order to draw usable conclusions.  I stumbled on a quote from Linus Torvalds on this topic that I really liked:

"please don't ever benchmark things that don't make sense, and then use the numbers as any kind of reason to do anything. It's worse than worthless. It actually adds negative value to show "look ma, no hands" for things that nobody does. It makes people think it's a good idea, and optimizes the wrong thing entirely.
Are there actual real loads that get improved? I don't care if it means that the improvement goes from three orders of magnitude to just a couple of percent. The "couple of percent on actual loads" is a lot more important than "many orders of magnitude on a made-up benchmark".

Truth.

The Open-Source Advantage…through the eyes of Red Hat, DreamWorks and OpenStack

Let’s say you’re a big company in a competitive industry.  One who innovates and succeeds by creating software.  Not extending COTS, not adapting existing code.  Generating fresh, new code, at your full expense.  The value the company receives by investing in the creation of that software is competitive advantage, sometimes known as the profit-motive.

You’re an executive at this company.  Creating the software was your idea.  You are responsible for the ROI calculations that got the whole thing off the ground.

Your career may rest on the advantage the invention provides, and you are boldly considering opening up the code you’ve created.  But first you need to understand why so many of your peers (at company after company) have staked their careers behind open-sourcing what was once their company’s secret sauce.

Let’s look at some examples.  Since I work for Red Hat, I’ll mention a few of our own first.  But we’re a different animal, I’ve looked further into different  verticals to find powerful examples.

Over the last 5 years, Red Hat has open sourced software we’ve both built and acquired, such as RHEV-M (oVirt), CloudForms (Katello) and OpenShift (Origin).  Just as other companies derive value from our software, we also derive some value from individuals/companies extending software, generating powerful ecosystems of contributors both public and private.  For example, http://studiogrizzly.com/ is extending the open-source upstream for Red Hat’s Platform-as-a-Service called OpenShift Origin.

Both Rackspace and NASA also identified open-source as a way to build better software faster.  Their OpenStack project is a living, breathing example of how a project can be incubated within a closed ecosystem (closed in governance rather than code), grow beyond anyone’s imagination to scratch an itch that no one else can, and blossom into an incredible example of community driven innovation.  As a proof-point, this year the governance of OpenStack transitioned to a Foundation model.  I should mention that Red Hat maintains a seat on that Foundation board, contributes significant resources/code to the upstream project and productized an OpenStack enterprise distribution over the summer.

More recently, DreamWorks has decided to open-source an in-house tool they’ve developed called OpenVBD.

It’s well known that DreamWorks relies heavily on open-source software (Linux in particular).  But to have them contribute directly using The Open Source Way, truly deserves a closer look.  What could have led them to possibly eliminate any competitive advantage derived from OpenVBD ?

While I have no specific knowledge of DreamWorks’ particular situation, here are some ideas:

  • The industry moved on, and they’ve extracted most of the value already.
  • They are moving on to greener pastures, driving profit through new tools or techniques.
  • Maybe OpenVBD has taken on a life of it’s own, and although critical to business processes, it would benefit from additional developers.  But they’d rather pay artists and authors.

If I had to guess, it would be closest to:

  • The costs/maintenance burden for OpenVBD exceeds the value derived.  Set it free.
Now it’s the competitor’s move.  Will they simply study OpenVBD and take whatever pieces they were missing or did poorly ?  Will they jump into a standardization effort ?
The answer may lie in the competitor’s view on the first bullet above, and if they have a source of differentiation (aka revenue), outside of the purpose of OpenVBD.  If they do, standardizing tools will benefit both parties by eliminating duplicate effort/code, or possibly reducing/eliminating maintenance burden on internal tools.  And that will generate better software faster.
This speaks to a personal pet peeve of mine; duplicated effort.  Again and again I see extremely similar open-source tools popping up.  For argument’s sake, let’s say this means that the ecosystem of developers spent (# of projects * # man-hours) creating the software.  If they’d collaborated, it may have resulted in a single more powerful tool with added feature velocity, and potentially multiplied any corporate-backed funding.  That’s a powerful reason to attempt to build consensus before software.  But I digress…

 

 

processor.max_cstate, intel_idle.max_cstate and /dev/cpu_dma_latency

Customers often want the lowest possible latency for their application.  Whether it’s VOIP, weather modeling, financial trading etc.  For a while now, CPUs have had the ability to transition frequencies (P-states) based on load.  This is the called a CPU governor, and the user interface is the cpuspeed service.

For slightly less time, CPUs have been able to scale certain sections of themselves on or off…voltages up or down, to save power.  This capability is known as C-states.  The downside to the power savings that C-states provide is a decrease in performance, as well as non-deterministic performance outside of the application or operating system control.

Anyway, for years I have been seeing things like processor.max_cstate on the kernel cmdline.  This got the customer much better performance at higher power draw, a business decision they were fine with. After some time went on, people began looking at their datacenters and thinking how power was getting so expensive, they’d like to find a way to consolidate.  That’s code for virtualization.  But what about workloads so far unsuitable for virtualization, like those I’ve mentioned…those that should continue to run on bare metal, and further, maybe even specialized hardware ?

A clear need for flexibility:  sysadmins know that changes to the kernel cmdline require reboots.  But the desire to enable the paradigm of absolute performance only-when-I-need-it demands that this be a run-time tunable.  Enter the /dev/cpu_dma_latency “pmqos” interface.  This interface lets you specific a target latency for the CPU, meaning you can use it to indirectly control (with precision), the C-state residency of your processors.  For now, it’s an all-or-nothing affair, but stay tuned as there is work to increase the granularity of C-state residency control to per-core.

Now in 2011, Red Hatter Jan Vcelak wrote a handy script called pmqos-static.py that can enable this paradigm.  No more dependence on kernel cmdline.  On-demand, toggle C-states to your application’s desire.  Control C-states from your application startup/shutdown scripts.  Use cron to dial up the performance before business hours and dial it down after hours.  Significant power savings can come from this simple script, when compared to using the cmdline.

A few notes, before the technical detail.

1) When you set processor.max_cstate=0, the kernel actually silently sets it to 1.

drivers/acpi/processor_idle.c:1086:
 1086 if (max_cstate == 0)
 1087 max_cstate = 1;

2) RHEL6 has had this interface forever, but only recently do we have the pmqos-static.py script.

3) This script provides “equivalent performance” to kernel cmdline options, with added flexibility.

So here’s what I mean…this turbostat output on RHEL6.3, Westmere X5650. (note same behavior on SNB E5-2690):

Test #1: processor.max_cstate=0
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 2.37 3.06 2.67 0.03 0.13 97.47 0.00 67.31
Test #2: processor.max_cstate=1
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.04 2.16 2.67 0.04 0.55 99.37 4.76 88.00
Test #3: processor.max_cstate=0 intel_idle.max_cstate=0
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.02 2.20 2.67 99.98 0.00 0.00 0.00 0.00
Test #4: processor.max_cstate=1 intel_idle.max_cstate=0
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.02 2.29 2.67 99.98 0.00 0.00 0.00 0.00
Test #5: intel_idle.max_cstate=0
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.02 2.19 2.67 99.98 0.00 0.00 0.00 0.00
# rpm -q tuned
tuned-0.2.19-9.el6.noarch
Test #6: now with /dev/cpu_dma_latency set to 0 (via latency-performance  profile) and intel_idle.max_cstate=0.
The cmdline overrides /dev/cpu_dma_latency.
# tuned-adm profile latency-performance
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.01 2.32 2.67 99.99 0.00 0.00 0.00 0.00
Test #7: no cmdline options + /dev/cpu_dma_latency via
latency-performance profile.
# tuned-adm profile latency-performance
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 100.00 2.93 2.67 0.00 0.00 0.00 0.00 0.00

There is additional flexibility, too…let me illustrate:

# find /sys/devices/system/cpu/cpu0/cpuidle -name latency -o -name name | xargs cat
C0
0
NHM-C1
3
NHM-C3
20
NHM-C6
200

This shows you the exit latency (in microseconds) for various C-states on this particular Westmere (aka Nehalem/NHM). Each time the CPU transitions in between C-states, you get a latency hit of almost exactly those number of microseconds (which I can see in benchmarks). By default, an idle Westmere core sits in C6 (SNB sits in C7). To get that core up to C0, it takes 200us.

Here’s what I meant about flexibility. You can control exactly what C-state you want your CPUs in via /dev/cpu_dma_latency and via the pmqos-static.py script.  And all dynamically during runtime. cmdline options do not allow for this level of control, as I showed they override /dev/cpu_dma_latency.  Exhaustive detail about what to expect from each C-state can be found in Intel’s Architecture documentation.  Around page 35-52 or so…

Using the information I fixed out of /sys above…Set it to 200 and you’re in the deepest c-state:

# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=200
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.04 2.19 2.67 0.04 0.26 99.66 0.91 91.91

Set it to anything in between 20 and 199, and you get into C3:

# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=199
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.03 2.28 2.67 0.03 99.94 0.00 89.65 0.00

Set it to anything in between 1 and 19, and you get into C1:

# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=19
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 0.02 2.18 2.67 99.98 0.00 0.00 0.00 0.00

Set it to 0 and you get into C0. This is what latency-performance
profile does.

# /usr/libexec/tuned/pmqos-static.py cpu_dma_latency=0
pk cr CPU %c0 GHz TSC %c1 %c3 %c6 %pc3 %pc6
 100.00 2.93 2.67 0.00 0.00 0.00 0.00 0.00

Sandy Bridge chips also have C7, but the same rules apply.

You have to decide whether this flexibility buys you anything in order to justify rolling any changes across your environment.

Maybe just the understanding of this is how and why it works might be enough! 🙂

Moving away from useless kernel cmdline options is one less thing for you to maintain, although I realize you still have to enable tuned profile.  So, dial up the performance when you need it, and save power/money when you don’t!  Pretty cool!

How to deal with latency introduced by disk I/O activity…

The techniques used by Linux to get dirty pages onto persistent media have changed over the years.  Most recently the change was from a gang of threads called (pdflush) to a per-backing-device thread model.  Basically one-thread-per-LUN/mount.  While neither is perfect, the fact of the matter is that you shouldn’t really have to care about disk interference with your latency-sensitive app.  Right now, we can’t cleanly apply the normal affinity tuning to the flush-* bdi kthreads and thus cannot effectively shield the latency-sensitive app entirely.

I’m going to stop short of handing out specific tuning advice, because I have no idea what your I/O pattern looks like, and that matters.  A lot.  Suffice it to say that (just like your latency-sensitive application), you’d prefer more frequent smaller transfers/writes over less frequent, larger transfers (which are optimized for throughput).
Going a step further, you often hear about using tools like numactl or taskset to affinitize your application to certain cores, and chrt/nice to control the task’s priority and policy related to the scheduler.  These flusher threads are not easy to deal with.  We can’t apply the normal tuning using any of the above tools, because the flush-* threads are kernel threads, created using the kthread infrastructure.  bdi flush threads take a fundamentally different approach than other
kthreads, which are instantiated on boot (like migration), or module insertion time (like nfs). There’s no way to set a “default affinity mask” on kthreads, and kthreads are not subject to isolcpus.

Even up to the current upstream kernel version, the flush-* threads are started on-demand, (like when you mount a new filesystem), and then they go away after some idle time. When they come back, they have a new pid.  That behavior doesn’t mesh well with affinity tuning.

For example, in the case of nfsd kthreads, since they do not come and go after they are first instantiated, you can apply typical affinity tuning and get measurable performance gains.

For now:

  • Write as little as possible.  And/or write to (shared)memory, then flush it later.
    • Take good care of your needed data, though!  Memory contents go bye-bye in a “power event”.
  • Get faster storage like a PCI-based RAM/SSD disk
  • Reduce the amount of dirty pages kept in cache.
  • Increase the frequency at which dirty pages are flushed, so that there is less written each time.
Further reading:
http://lwn.net/Articles/352144/ (subscription required)
http://lwn.net/Articles/324833/ (subscription required)

Tracking userspace memory allocation with glibc-utils memusage

Will Cohen turned me on to a little helper tool called memusage, which is distributed with glibc.  The purpose of that tool is to trace memory allocation behavior of a process.

In RHEL, the memusage binary is part of the glibc-utils package.  There’s actually also a shared library called /usr/lib64/libmemusage.so that’s part of the base glibc package, which can be used via LD_PRELOAD.

memusage writes output to your terminal, as below:

It is also capable of writing memory allocation over time to a png file, for example:

Netperf is not a particularly memory-intensive benchmark for illustrating it’s usage, just wanted to describe the utility.  I’ll upload more interesting graphs when I run more loads with the library.

Thoughts on Open vSwitch, kernel bypass, and 400gbps Ethernet…

For the Red Hat Summit this year, I wrote a paper on the kernel-bypass technology from Solarflare, called OpenOnload.  From a performance standpoint it’s hard to argue with the results.

I was looking at code from Open vSwitch recently, and it dawned on me that there is an important similarity between Open vSwitch and OpenOnload; a similar 2-phase approach…let me explain.

Both have a “connection setup” operation where many of the well-known user-space utilities come into play (and some purpose-build like ovs-vsctl)…things like adjusting routing, MTU, interface statistics etc…And then what you could call an accelerated path, that’s used after the initial connection setup for passing bits to/from user-space, whether that be a KVM process or your matching engine.

In OpenOnload’s case, the accelerated path bypasses the linux kernel, avoiding kernel-space-user-space data copies (aka context switches) and thus lowering latency.  This technique is also called RDMA, has been around for decades, and there are quite a few vendors out there with analogous solutions.  Often there are optimized drivers, things like OFED and a whole bunch of other tricks, but that’s beside my point…

The price paid for achieving this lower latency is having to completely give up, or entirely re-implement lots of kernel goodies like what you’d expect out of netstat, ethtool and tcpdump.

In the case of Open vSwitch, there is a software “controller” (which decides what to do with a packet) and a data-path implemented in a kernel module that provides the best performance possible once the user-defined policy has been applied via the controller.  If you’re interested in Open vSwitch internals, here’s a nice presentation from Simon Horms.  I think the video is definitely worth a half hour!

Anyway, what do accelerated paths and kernel-bypass boil down to ?  Things like swap-over-NFS, NFS-root, proliferation of iSCSI/NFS filers and FUSE-based projects like Gluster, put network subsystem performance directly in the cross-hairs.  Most importantly, demands on the networking subsystem on all operating systems are pushing the performance boundaries of what the traditional protection ring concept can provide.

Developers go to great lengths to take advantage of the ring model, however it seems faster network throughput (btw is 400gbps ethernet the next step?) and lower latency requirements are recently more at odds than ever with the ring paradigm.

Linux and BSD’s decades-old niche of being excellent routing platforms will be tested (as it always is) by these future technologies and customer demand for them.  Looking forward to seeing how projects like OpenStack wire all of this stuff together!

Low Latency Performance Tuning Guide for Red Hat Enterprise Linux 6

Last month I wrote a paper for Red Hat customers called Low Latency Performance Tuning Guide for Red Hat Enterprise Linux 6 or LLPTGFRHEL6 for short 😉

It’s the product of significant research and my hands-on experiments into what configurations provide tangible benefit for latency-sensitive environments.  Although the traditional audience for this paper is the financial services industry, I have found that there are all sorts of latency-sensitive workloads out there.  From oil and gas to healthcare to the public sector and cloud, everyone wants the best performance out of their shiny new kit.

This paper started out as a formal response to many similar questions I was receiving from the field.  Initially a 1-2 page effort, within a day it had blown up to 14 pages of stuff from my mountain of notes.  Talk about boiling the ocean…although I was happy with the content, the formatting left a little to be desired so I pared it back to about 7 pages and linked out to other in-depth guides where it made sense…

I’m mostly happy with how it turned out…I know that customers were looking for this type of data (because they asked me over and over) and so I set out to conduct numerous experiments filling out each bullet point with hard data and zero hand-waving.  I wanted to explicitly corroborate or dispel certain myths that are floating around out there about performance impact of various knobs, so I tested each in isolation and reported my recommendations.

I do hope that this paper helps to guide administrators in their quest to realize ROI from both their hardware and software investments, please have a look and let me know what you think!

P.S.  are there any other performance domains, workloads, use-cases or environments that you’d like us to look at?  Someone mentioned high-bandwidth-high-latency (long-fat-pipe) experiments…would that be of interest?