What is all that %sys time? “I never know what she’s _doing_ in there…” Ha!
12:01:35 PM CPU %usr %nice %sys %iowait %irq %soft %idle
12:01:36 PM all 0.08 0.00 3.33 0.00 0.00 5.00 91.59
12:01:36 PM 0 0.00 0.00 40.59 0.00 0.00 59.41 0.00
...
You can find out instantly with ‘perf top’. In this case (netperf), the kernel is spending its time copying skbs around, mediating between kernel and userspace. I wrote a bit about this in a previous blog post on the traditional protection rings.
All that copying takes time…precious, precious time. And CPU cycles; also precious. And memory bandwidth…etc.
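If you want to see it for yourself, perf will show exactly where those kernel cycles are going while the benchmark runs (the symbol names you see will of course vary with your kernel and workload):

# perf top -g                        <-- live view of the hottest kernel/user symbols
# perf record -a -g -- sleep 30      <-- or capture 30 seconds system-wide...
# perf report                        <-- ...and browse the report afterwards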
HPC customers have, for decades, been leveraging Remote Direct Memory Access (RDMA) technology to reduce latency and associated CPU time. They use InfiniBand fabrics and associated InfiniBand verbs programming to extract every last bit of performance out of their hardware.
As always, that last few percent of performance ends up being the most expensive, both in terms of hardware and software and in the people-talent and effort required. But it’s also sometimes the most lucrative.
Over the last few years, some inroads have been made in lowering the barrier to entry for RDMA, one of them being RoCE (RDMA over Converged Ethernet). My employer Red Hat ships RoCE libraries (for Mellanox cards) in the “High Performance Networking” channel.
I’ve recently been working on characterizing RoCE in the context of its usefulness in various benchmarks and customer workloads, so to that end I went into the lab and wired up a pair of Mellanox ConnectX-3 VPI cards back-to-back with a 56Gbit IB cable. The cards are inside Sandy Bridge generation servers.
Provided you have some basic understanding of the hideous vernacular in this area, it turns out to be shockingly easy to set up RoCE. Here’s some recommended reading to get you started:
- http://blog.infinibandta.org/2012/02/
- http://people.redhat.com/dledford/infiniband_get_started.html
- http://www.redbooks.ibm.com/abstracts/tips0897.html?Open#contents
First thing, make sure your server is subscribed to the HPN channel on RHN. Then let’s get all the packages installed.
# yum install libibverbs-rocee libibverbs-rocee-devel libibverbs-rocee-devel-static libibverbs-rocee-utils libmlx4-rocee libmlx4-rocee-static rdma mstflint libibverbs-utils infiniband-diags
The Mellanox VPI cards are multi-mode, in that they support either InfiniBand or Ethernet. The cards I’ve got came in InfiniBand mode, so I need to switch them over to Ethernet. Mellanox ships a script called connectx_port_config to change the mode, but we can do it with driver options too.
Get the PCI address of the NIC:
# lspci | grep Mellanox
21:00.0 Network controller: Mellanox Technologies MT27500 Family [ConnectX-3]
Check what ethernet devices exist currently:
# ls -al /sys/class/net
I see ib0/ib1 devices now since my cards are in IB mode. Now let’s change them to Ethernet mode. Note that you need to substitute your own PCI address, as it will likely differ from mine (21:00.0). “eth” appears twice because this is a dual-port card.
# echo "0000:21:00.0 eth eth" >> /etc/rdma/mlx4.conf # modprobe -r mlx4_ib # modprobe -r mlx4_en # modprobe -r mlx4_core # service rdma restart ; chkconfig rdma on # modprobe mlx4_core # ls -al /sys/class/net
Now I see eth* devices (you may see pXpY names depending on the BIOS), since the cards are now in eth mode. If you look in dmesg you will see the mlx4 driver automatically sucked in the mlx4_en module accordingly. Cool!
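If you want a belt-and-suspenders check that the port really flipped over (assuming the HCA enumerates as mlx4_0, as it does in the ibstat output below), sysfs will tell you the link layer directly:

# cat /sys/class/infiniband/mlx4_0/ports/1/link_layer   <-- should now say Ethernet, not InfiniBand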
Let’s verify that there is now an InfiniBand device ready for use:
# ibstat
CA 'mlx4_0'
    CA type: MT4099
    Number of ports: 2
    Firmware version: 2.11.500   <-- flashed the latest fw using mstflint.
    Hardware version: 0
    Node GUID: 0x0002c90300a0e970
    System image GUID: 0x0002c90300a0e973
    Port 1:
        State: Active   <-------------------- Sweet.
        Physical state: LinkUp
        Rate: 40
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x0202c9fffea0e970
        Link layer: Ethernet
    Port 2:
        State: Down
        Physical state: Disabled
        Rate: 10
        Base lid: 0
        LMC: 0
        SM lid: 0
        Capability mask: 0x00010000
        Port GUID: 0x0202c9fffea0e971
        Link layer: Ethernet
Cool, so we’ve got our RoCE device up from a hardware-init standpoint; now give it an IP address like any old NIC.
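Something like this does the trick; the eth2 name and the /24 mask are just what my lab happens to use, while 172.17.2.41 is the address the broker listens on later:

# ip addr add 172.17.2.41/24 dev eth2
# ip link set eth2 up

(Or drop the equivalent IPADDR/NETMASK into an ifcfg file under /etc/sysconfig/network-scripts if you want it to persist across reboots.)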
Special note for IB users: most IB switches have a subnet manager built in (RHEL ships one too, called opensm). But since RoCE runs over Ethernet, there’s no need for a subnet manager at all; the SM is specific to InfiniBand fabrics and plays no part in Ethernet fabrics. The InfiniBandTA article I linked above goes into some detail about what benefits the SM provides on IB fabrics.
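Before moving on to an application-level test, you can optionally sanity-check the raw RDMA path with the perftest tools (a separate package, not in the yum line above). Run the latency test as a server on one host and point the other host at it; exact flags vary a bit between perftest versions, so treat this as a sketch:

# yum install perftest
# ib_send_lat -d mlx4_0 -i 1                 <-- on the server side
# ib_send_lat -d mlx4_0 -i 1 172.17.2.41     <-- on the client side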
Now we get to the hard and confusing part. Just kidding, we’re done. Was it that intimidating? Let’s test it out using an RDMA application that ships with Red Hat MRG Messaging, called qpid-latency-test. I chose it because it supports RDMA as a transport.
# yum install qpid-cpp-server qpid-cpp-server-rdma qpid-cpp-client qpid-cpp-client-devel -y
# qpidd --auth no -m no
2013-03-15 11:45:00 [Broker] notice SASL disabled: No Authentication Performed
2013-03-15 11:45:00 [Network] notice Listening on TCP/TCP6 port 5672
2013-03-15 11:45:00 [Security] notice ACL: Read file "/etc/qpid/qpidd.acl"
2013-03-15 11:45:00 [System] notice Rdma: Listening on RDMA port 5672   <-- Sweet.
2013-03-15 11:45:00 [Broker] notice Broker running
Defaults: around 100us.
# numactl -N0 -m0 nice -20 qpid-latency-test -b 172.17.2.41 --size 1024 --rate 10000 --prefetch=2000 --csv
10000,0.104247,2.09671,0.197184
10000,0.11297,2.12936,0.198664
10000,0.099194,2.11989,0.197529
^C
With tcp-nodelay: around 95us.
# numactl -N0 -m0 nice -20 qpid-latency-test -b 172.17.2.41 --size 1024 --rate 10000 --tcp-nodelay --prefetch=2000 --csv
10000,0.094664,3.00963,0.163806
10000,0.093109,2.14069,0.16246
10000,0.094269,2.18473,0.163521
With RDMA/RoCE/HPN: around 65us.
# numactl -N0 -m0 nice -20 qpid-latency-test -b 172.17.2.41 --size 1024 --rate 10000 --prefetch=2000 --csv -P rdma
10000,0.065334,1.88211,0.0858769
10000,0.06503,1.93329,0.0879431
10000,0.062449,1.94836,0.0872795
^C
Percentage-wise, that’s a really substantial improvement: roughly a third off the plain-TCP latency. Plus don’t forget all the %sys time (which also includes memory subsystem bandwidth usage) you’re saving. You get all those CPU cycles back to spend on your application!
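To watch that saving happen, re-run the mpstat from the top of this post on the broker host while each variant of the test is going; the %sys and %soft columns should drop noticeably with the rdma transport:

# mpstat -P ALL 1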
Disclaimer: I didn’t do any heroic tuning on these systems. The above performance test numbers are only to illustrate “proportional improvements”. Don’t pay much attention to the raw numbers other than order-of-magnitude. You can do much better starting with this guide.
So! Maybe kick the tires on RoCE, and get closer to wire speed with lower latencies. Have fun!