nohz_full=godmode ?

Starting with some background… What is the kernel timer tick (aka the LOC interrupt), and what does it do?

The kernel timer tick is an interrupt triggered at a periodic interval (based on the kernel compile option CONFIG_HZ). The tick is what keeps track of kernel statistics such as CPU and memory usage, and it provides for scheduler fairness through its load balancer. It also does timekeeping, i.e. it keeps gettimeofday updated.

When the tick fires (as often as every millisecond, based on the value of CONFIG_HZ), it will get scheduled ahead of whatever’s currently running on a CPU core. In other words, whatever was running (with all of its valuable data cache-hot) will be interrupted by the tick. The CPU’s L1 instruction and data caches (the smallest yet fastest) are disturbed as the tick’s own code and data displace some of the task’s cache lines, somewhere around 1000 times a second (if the task is 100% CPU-bound, which the majority are not).
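To put a number on "how often", here is a minimal sketch that reads CONFIG_HZ from a kernel config file and converts it to the time between ticks. The function names are mine, and the default path is an assumption: RHEL and many other distributions install the build config as /boot/config-<kernel release>, but yours may not.

```shell
# Print the CONFIG_HZ value found in a kernel config file.
# Assumption: the config lives at /boot/config-<kernel release> unless a
# different path is passed as the first argument.
config_hz() {
    file=${1:-/boot/config-$(uname -r)}
    # Match only the exact CONFIG_HZ= line, not CONFIG_HZ_1000=y and friends.
    sed -n 's/^CONFIG_HZ=\([0-9][0-9]*\)$/\1/p' "$file"
}

# Convert a tick rate in Hz to the interval between ticks in microseconds.
tick_period_us() {
    echo $(( 1000000 / $1 ))
}
```

With CONFIG_HZ=1000, tick_period_us "$(config_hz)" should print 1000, i.e. one tick every millisecond.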

This is not an All Is Lost scenario, but certain workloads might see a 1-3% hit that could be attributed to this interference. It also causes some noticeable jitter, especially since what happens inside the tick is not deterministic. The total time the tick runs is not a predictable/constant value.

That was a mouthful, so let me dissect it a bit by describing various kernel config options that control how often this tick fires.

Prior to the introduction of the “tickless kernel” in kernel 2.6.21, the timer tick ran on every core at the rate of CONFIG_HZ (e.g. 1000/sec). This provided for a decent balance of throughput and latency. It had the side-effect of waking up every core constantly, which wasn’t necessary when nr_running=0 (a per-core attribute…see /proc/sched_debug). With the tickless kernel, when the scheduler says there’s nothing to run on a core, the tick is disabled there, which saves power by not waking the CPU up from a deeper C-state. Actually it saves lots of power; Linux has become quite a responsible citizen in this regard.

In summary:

RHEL5 – CONFIG_HZ=1000
- No Tickless support
- Ticks 1000/sec on every CPU no matter what
RHEL6 – CONFIG_HZ=1000, CONFIG_NO_HZ=y
- Tickless when nr_running = 0
- Ticks 1000/sec when nr_running > 0
RHEL7 – CONFIG_HZ=1000, CONFIG_NO_HZ=y, CONFIG_NO_HZ_FULL=y, etc.
- Opt-in support for nohz_full
- Tickless when nr_running <= 1
- Ticks 1000/s when nr_running > 1

Note: for RHEL7, you will need 3.10.0-68 or later.

Red Hat’s Frederic Weisbecker has been working with other industry leaders such as Paul McKenney from IBM (and many others) to implement a feature called Full NO HZ. During the development phase, it has changed names several times (e.g. adaptive tickless). These days the kernel cmdline option to toggle it is nohz_full, so that’s what I’m calling it.

This feature requires yet another slew of kernel config options, along with some userspace gymnastics (that I’ll detail later) to get everything lined up. So far the use-cases for disabling the tick have been embedded applications, HPC/scientific, and the financial guys who need real-time characteristics.

It makes sense then to have these features enabled, but defaulted to OFF such that these folks can opt-in.  As you’ll see it’s not really necessary for everyone, nor do most workloads expose the tick as the “top-talker” in traces. But several can, and it was for those customers that the feature was developed.

nohz_full has the following characteristics:

  • Stop interrupting userspace when nr_running=1 (see /proc/sched_debug).
    • If runqueue depth is 1, then the scheduler should have nothing to do on that core.
  • Move all timekeeping to non-latency-sensitive cores.
  • Mark certain cores as nohz_full cores via cmdline.  In this example, the system has 2 sockets, 8 cores each, 16 cores total, logical cores disabled.  I want to dump everything I can over to core 0, leaving cores 1-15 for my performance critical application:
Kernel cmdline: nohz_full=1-15 isolcpus=1-15 selinux=0 audit=0

# dmesg|grep dyntick
dmesg: [ 0.000000] NO_HZ: Full dynticks CPUs: 1-15.
  • In addition to setting nohz_full on the cmdline, you must move the RCU threads off the nohz_full cores yourself (here, onto core 0):
 # for i in `pgrep rcu` ; do taskset -pc 0 $i ; done
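Before trusting the setup, it is worth confirming that the kernel actually booted with the options you think it did and that the RCU threads landed where you put them. A minimal sketch follows; both function names are hypothetical helpers, not part of any tool mentioned here.

```shell
# Print the value of one "name=value" parameter from a kernel command line
# read on standard input; prints nothing if the parameter is absent.
cmdline_param() {
    tr ' ' '\n' | sed -n "s/^$1=//p"
}

# Show the current CPU affinity of every process whose name matches "rcu".
# With a PID and no CPU list, taskset -cp only reports the affinity.
show_rcu_affinity() {
    for pid in $(pgrep rcu); do
        taskset -cp "$pid"
    done
}
```

For the example above, cmdline_param nohz_full < /proc/cmdline should print 1-15, and show_rcu_affinity should list core 0 for each RCU thread that could be moved.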

Frederic has written a small harness that uses kernel tracepoints and the ftrace interface to test and debug during this feature’s development.  It’s available here:

git://git.kernel.org/pub/scm/linux/kernel/git/frederic/dynticks-testing.git

That harness spits out something like this:

root@localhost: ~/dynticks-testing # cat trace.1
 # tracer: nop
 #
 # entries-in-buffer/entries-written: 10392/10392 #P:16
 #
 # _-----=> irqs-off
 # / _----=> need-resched
 # | / _---=> hardirq/softirq
 # || / _--=> preempt-depth
 # ||| / delay
 # TASK-PID CPU# |||| TIMESTAMP FUNCTION
 # | | | |||| | |
 -0 [001] d... 1565.585643: tick_stop: success=yes msg=
 user_loop-10409 [001] d.h. 1565.586320: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1565474000583
 user_loop-10409 [001] d... 1565.586327: tick_stop: success=yes msg=
 user_loop-10409 [001] d.h. 1566.586352: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1566474000281
 user_loop-10409 [001] d.h. 1567.586384: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1567474000282
 user_loop-10409 [001] d.h. 1568.586417: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1568474000280
 user_loop-10409 [001] d.h. 1569.586449: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1569474000280
 user_loop-10409 [001] d.h. 1570.586482: hrtimer_expire_entry: hrtimer=ffff881fbfa2ec80 function=tick_sched_timer now=1570474000275

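What we’re looking for is the tick_stop messages, which mean that the tick was successfully stopped on that core.  The hrtimer_expire_entry lines with function=tick_sched_timer are the tick actually firing; in this trace that happens only once per second on core 1 (compare the timestamps).   Note:  There is still one tick per second in the current upstream code to maintain scheduler stats for load balancing.   The above output is from a system tuned according to the specifics in this blog post.  It was also necessary to configure the system BIOS for low latency.  Individual OEMs typically publish whitepapers on this topic.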
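Eyeballing a 10,000-line trace gets old quickly. Here is a rough sketch that counts tick expirations per CPU in a trace file like the one above. It assumes the text layout shown in that sample (CPU in a bracketed column, the tick identified by function=tick_sched_timer); count_ticks is a name I made up and is not part of Frederic's harness.

```shell
# Count timer-tick expirations per CPU in an ftrace text dump.
# A tick is identified here by hrtimer_expire_entry events whose callback
# is tick_sched_timer. Reads the named files, or standard input if none.
count_ticks() {
    awk '/hrtimer_expire_entry/ && /function=tick_sched_timer/ {
             # The CPU column is the first field of the form [NNN].
             for (i = 1; i <= NF; i++)
                 if ($i ~ /^\[[0-9]+\]$/) { n[$i]++; break }
         }
         END { for (cpu in n) print cpu, n[cpu] }' "$@"
}
```

Run it as count_ticks trace.1. On a well-tuned nohz_full core, the count should be close to the number of seconds the test ran rather than 1000 times that.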

I mentioned certain statistical accounting is done inside the tick.  One of those that is user-controllable is vm.stat_interval (which defaults to 1, so once per second).  You will see that even with nohz_full, vm.stat_interval will pop at that interval.  Frederic’s test harness accounts for this by setting vm.stat_interval to 120, then running the test for 10 seconds.  If you run the test for 120+ seconds, you will see vmstat_update fire (and possibly other things like xfs).

kworker/1:0-141 [001] .... 2693.850191: workqueue_execute_start: work struct ffff881fbfa304a0: function vmstat_update

kworker/1:0-141   [001] ....  2713.458820: workqueue_execute_start: work struct ffff881f90e07c28: function xfs_log_worker [xfs]

This feature is a massive improvement in terms of cache efficiency.  To see what I mean, try running this test harness without the kernel cmdline options 🙂

To get rid of the xfs_log_worker interference, you can use the tunable workqueues feature of the kernel’s bdi-flush writeback threads.  If, as in the above example, you are using core 0 as your “housekeeping CPU”, then you could affine the bdi-flush threads to core 0 like so:

# echo 1 > /sys/bus/workqueue/devices/writeback/cpumask

It takes a hex bitmask as its argument, where bit N stands for core N, so 1 (only bit 0 set) is actually core 0.
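Working out these masks by hand is error-prone once you go past a couple of cores, so here is a small sketch that converts a CPU list into the matching hex value. cpulist_to_mask is a hypothetical helper of mine, and because it relies on shell arithmetic it only handles CPU numbers below 63.

```shell
# Convert a CPU list such as "0", "0-1" or "0,2-3" into a hexadecimal
# bitmask in which bit N set means CPU N is included.
# Sketch only: shell arithmetic limits this to CPU numbers below 63.
cpulist_to_mask() {
    mask=0
    for range in $(echo "$1" | tr ',' ' '); do
        first=${range%-*}   # "2-3" -> 2, "0" -> 0
        last=${range#*-}    # "2-3" -> 3, "0" -> 0
        cpu=$first
        while [ "$cpu" -le "$last" ]; do
            mask=$(( mask | (1 << cpu) ))
            cpu=$(( cpu + 1 ))
        done
    done
    printf '%x\n' "$mask"
}
```

For example, cpulist_to_mask 0 prints 1 (the value used above), and cpulist_to_mask 0-1 prints 3, which is what you would write if one housekeeping core turns out not to be enough.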

At this point whenever the kernel wants to write dirty pages, it will wake up these bdi-flush threads as normal, but now they will wake up with the affinity that you programmed in.  Keep in mind that a single core might not be enough to do the writeback and whatever else the kernel needs to do, because bdi-flush threads, like any IO thread, block.  You might need to use 2+ cores.  Keep an eye out for CPU congestion or blocking on the housekeeping core (mpstat or similar).

Also note that by default in RHEL7, bdi-flush threads are NUMA-affined to be PCI-local to your storage adapter (whether it’s a local SCSI/SATA card or HBA).  That’s a change from RHEL6 where bdi-flush threads had no affinity by default.  You can disable the default NUMA affinity and return to the RHEL6 behavior like so:

# echo 0 > /sys/bus/workqueue/devices/writeback/numa

The 2 “echo” commands above do not persist across reboots.

Now… If you run turbostat while in this configuration, you will see that the timekeeping core (core 0 in this case) is kept busy enough (because it is now ticking at the CONFIG_HZ rate) to stay in C-state 0.  That’s less than palatable, and Paul McKenney later addressed it with a kernel config option called CONFIG_NO_HZ_FULL_SYSIDLE.  When that’s set, the timekeeping core is no longer pegged.  Godmode???

Here’s another way to examine the tick’s behavior:

# perf stat -C 1 -e irq_vectors:local_timer_entry sleep 1

9 irq_vectors:local_timer_entry

pig is a program written by my co-worker Bill Gray.  It’s used as an artificial load generator.   Below, it spins on the CPU for 1 second.  Unfortunately it’s not packaged for RHEL.  But you can use this instead, just as well.

So here is the trace without the cmdline options.  You can see that the tick fires roughly 1000 times in the 1 second run, which is the expected out-of-the-box behavior.

# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 /root/pig -s 1

1005 irq_vectors:local_timer_entry

Then reboot with nohz_full=1-15 rcu_nocbs=1-15 and isolate core 1 from userspace tasks and IRQs.  You could do this with isolcpus=1-15 too.

# tuna -c 1 -i ; tuna -q '*' -c 1 -i

The same pig run ends up with only a handful of interruptions! Oink!

# perf stat -C 1 -e irq_vectors:local_timer_entry taskset -c 1 /root/pig -s 1

4 irq_vectors:local_timer_entry

Here’s yet another (less granular) way to see what’s going on:

# watch -n1 -d "cat /proc/interrupts|egrep 'LOC|CPU'"
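If you would rather have a number than watch digits flicker, here is a sketch that pulls the LOC count for a single core out of /proc/interrupts so you can diff it before and after a run. loc_count is a hypothetical helper, and it assumes every CPU is online so that column N+2 of the LOC line belongs to CPU N.

```shell
# Print the local timer interrupt (LOC) count for one CPU.
# Usage: loc_count CPU [FILE]; FILE defaults to /proc/interrupts.
# Assumption: all CPUs are online, so the columns after "LOC:" map
# one-to-one onto CPU0, CPU1, ...
loc_count() {
    awk -v cpu="$1" '$1 == "LOC:" { print $(cpu + 2) }' "${2:-/proc/interrupts}"
}

# Example (your_workload is a placeholder for whatever you pin to core 1):
#   before=$(loc_count 1)
#   taskset -c 1 your_workload
#   after=$(loc_count 1)
#   echo "ticks on core 1: $(( after - before ))"
```

The difference between the two samples is the number of ticks core 1 took during the run, which should line up with what perf stat reports.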

Now that you’ve validated your configuration, it’s time to run your applications and see if this feature gives you any boost.  If you’ve got the right NICs, try out the busy polling socket option, too.

Here is some further reading on the topic, including a video of Frederic Weisbecker from LinuxCon where he covers this feature in detail.

https://www.kernel.org/doc/Documentation/timers/NO_HZ.txt
http://lwn.net/Articles/549580/
http://www.youtube.com/watch?v=G3jHP9kNjwc