Generating arbitrary network packets using the pktgen kernel module

I am staring at a workload of zillions upon zillions of very tiny packets, and each one is important.  They've got to get there fast.  As fast as possible.  Nagle:  you are not welcome here.  I am seeing some seemingly random jitter, and it's only on this one system.  <confused>

I need to take apart this stack piece by piece and test each layer in isolation.  Let's start at the lowest level possible.  RHEL6 includes a kernel module called pktgen (modprobe pktgen).  This module lets you construct network packets, specify their attributes, and send them at the fastest possible rate with the least overhead.
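If you want to poke at it by hand, pktgen is driven entirely through files under /proc/net/pktgen.  A quick sketch:

# load the module; this creates the /proc/net/pktgen/ control files
modprobe pktgen
ls /proc/net/pktgen/
# pgctrl          <- global start/stop switch
# kpktgend_0..N   <- one control file per pktgen kernel thread (one per CPU)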

Using pktgen, I was able to achieve over 3.3M packets per second on a 10Gb Solarflare NIC.  These packets incur no TCP/UDP protocol processing overhead.  You can watch the receiver's netstat/IP counters, though.
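There's no application on the far end to measure, so on the receiver you just watch the interface and protocol counters tick over, e.g.:

# on the receiver: watch per-interface packet counters climb
watch -d -n1 'cat /proc/net/dev'
# or the IP/UDP protocol counters
netstat -s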

Since these are synthetic packets, you have to give pktgen some basic information so the packets can be constructed with enough info to get where they're going: things like destination IP/MAC, the number of packets, and their size.  I tested tiny packets, 64 bytes (because that's what this workload needs).  I also tested jumbo frames just to be sure I was doing it right.
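Under the hood, each of those settings is just a key/value write into the NIC's pktgen proc file (which appears once the device has been added to a pktgen thread; the helper script below takes care of that).  A rough sketch, using the addresses from the example run further down:

PGDEV=/proc/net/pktgen/p1p1                   # per-device config file
echo "count 100000000"           > $PGDEV     # how many packets to send
echo "pkt_size 60"               > $PGDEV     # 60 bytes on the wire (+ 4-byte CRC = 64)
echo "dst 172.17.1.53"           > $PGDEV     # destination IP
echo "dst_mac 00:0f:53:0c:58:98" > $PGDEV     # destination MAC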

This brings up a habit of mine worth mentioning: purposely mis-tuning your environment to validate your settings.  A sound practice!
To get to 3.3Mpps, I only had to make one key change: set clone_skb to 10.  Anything less than 10 led to fewer packets (a value of zero halved the pps throughput as compared to 10).  Anything more than 10 had no performance benefit, so I'm sticking with 10 for now.
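clone_skb controls how many times pktgen re-sends the same sk_buff before allocating a fresh one, so a higher value means less per-packet allocation overhead.  It's one more write into the same proc file:

echo "clone_skb 10" > /proc/net/pktgen/p1p1   # reuse each skb 10 times before re-allocating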

I wrote a little helper script (actually modified something I found online):

./pktgen.sh <NIC_NAME> <CPU_CORE_NUMBER>
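
Stripped of error handling, the script boils down to binding the NIC to the chosen CPU's pktgen thread, writing the parameters, and kicking off the run.  A simplified sketch (not the exact script):

#!/bin/bash
# usage: ./pktgen.sh <NIC_NAME> <CPU_CORE_NUMBER>
NIC=$1
CPU=$2
PGTHREAD=/proc/net/pktgen/kpktgend_$CPU       # control file for that CPU's pktgen thread
PGDEV=/proc/net/pktgen/$NIC                   # per-device config file (exists after add_device)

modprobe pktgen

# bind the NIC to the chosen CPU's pktgen thread
echo "rem_device_all" > $PGTHREAD
echo "add_device $NIC" > $PGTHREAD
echo "max_before_softirq 10000" > $PGTHREAD   # thread knob the original script sets; not present in all pktgen versions

# packet parameters -- the same writes shown earlier
for opt in "count 100000000" "clone_skb 10" "pkt_size 60" "delay 0" \
           "dst 172.17.1.53" "dst_mac 00:0f:53:0c:58:98"; do
    echo "$opt" > $PGDEV
done

echo "Running...CTRL+C to stop"
trap 'echo stop > /proc/net/pktgen/pgctrl' INT
echo "start" > /proc/net/pktgen/pgctrl        # blocks until count is reached or stopped

cat $PGDEV                                    # dump Params/Current/Result, including pps

Here's an actual run: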

# ./pktgen.sh p1p1 4
Running pktgen with config:
---------------------------
NIC=p1p1
CPU=4
COUNT=count 100000000
CLONE_SKB=clone_skb 10
PKT_SIZE=pkt_size 60
DELAY=delay 0
MAX_BEFORE_SOFTIRQ=10000

Running...CTRL+C to stop

^C
Params: count 100000000  min_pkt_size: 60  max_pkt_size: 60
     frags: 0  delay: 0  clone_skb: 10  ifname: p1p1
     flows: 0 flowlen: 0
     queue_map_min: 0  queue_map_max: 0
     dst_min: 172.17.1.53  dst_max:
        src_min:   src_max:
     src_mac: 00:0f:53:0c:4b:ac dst_mac: 00:0f:53:0c:58:98
     udp_src_min: 9  udp_src_max: 9  udp_dst_min: 9  udp_dst_max: 9
     src_mac_count: 0  dst_mac_count: 0
     Flags:
Current:
     pkts-sofar: 17662390  errors: 0
     started: 2222764017us  stopped: 2228095026us idle: 40us
     seq_num: 17662391  cur_dst_mac_offset: 0  cur_src_mac_offset: 0
     cur_saddr: 0x330111ac  cur_daddr: 0x350111ac
     cur_udp_dst: 9  cur_udp_src: 9
     cur_queue_map: 0
     flows: 0
Result: OK: 5331009(c5330968+d40) nsec, 17662390 (60byte,0frags)
  3313141pps 1590Mb/sec (1590307680bps) errors: 0

^^ ~3.3 million packets per second.

Without protocol and higher-layer processing, the 3.3M number has somewhat limited value.  What it's testing is the kernel's TX path, the driver, the NIC firmware, and the physical infrastructure.  This is useful for things like regression testing of drivers, validating NIC firmware, or tuning the TX path for whatever particular packet profile your application will drive.
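If you fold this into regression runs, the pps figure is easy to pull back out of the device's proc file once a run finishes:

# the Result line holds the packets-per-second figure for the last run
grep pps /proc/net/pktgen/p1p1
#   3313141pps 1590Mb/sec (1590307680bps) errors: 0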

I want to be clear that micro-benchmarks like this have their place.  But take care when designing benchmarks to ultimately include as much of your stack as possible in order to draw usable conclusions.  I stumbled on a quote from Linus Torvalds on this topic that I really liked:

"please don't ever benchmark things that don't make sense, and then use the numbers as any kind of reason to do anything. It's worse than worthless. It actually adds negative value to show "look ma, no hands" for things that nobody does. It makes people think it's a good idea, and optimizes the wrong thing entirely.
Are there actual real loads that get improved? I don't care if it means that the improvement goes from three orders of magnitude to just a couple of percent. The "couple of percent on actual loads" is a lot more important than "many orders of magnitude on a made-up benchmark"."

Truth.