I tweeted recently that "%usr is what we wanna do; %sys is what we gotta do." What I meant was that the kernel's main goals in life are to bring up hardware, mediate access to it on behalf of applications, and otherwise get out of the way. That includes jobs like allocating memory when an application asks for it, taking network packets from an application and handing them to the network card, and deciding which application runs on which core, when it runs (ordering), and for how long.
Since at least the days of the Apollo Guidance Computer, there has been the concept of priorities in job scheduling. Should you have the time, I highly recommend the Wikipedia article, this book, and the AGC Emulator.
Anyway, in more recent operating systems like Linux, the user interface to the job scheduler is quite similar — a system of policies and priorities. There’s a great write-up in the Red Hat MRG Realtime docs here.
The system of policies and priorities represents a multi-tiered approach to ordering jobs on a multitasking operating system. A user or an application may request a certain scheduling policy and priority from the kernel. By themselves, those values don't mean much. But when a resource is contended (such as a single CPU core), they quickly come into play by telling the scheduler how the various tasks' priorities relate to each other. In the case of the AGC, for example, an engine control application would be prioritized higher than, say, a cabin heater.
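On Linux you can poke at this system directly with chrt (part of util-linux). A quick sketch, listing the policies the kernel supports and querying the policy/priority of the current shell:

```shell
# List each scheduling policy the kernel supports and its valid
# priority range (SCHED_OTHER is the default time-sharing policy).
chrt -m

# Show the scheduling policy and priority of the current shell.
chrt -p $$
```

Typical output shows SCHED_OTHER with a priority of 0 and the realtime policies SCHED_FIFO/SCHED_RR with a 1-99 range.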
The kernel can’t read minds, so we occasionally must provide it with guidance as to which application is the highest priority. If you have a server whose purpose is to run an application that predicts the weather, you don't want log cleanup scripts, data archival, or backups running while the weather app has "real work" to do. Without any guidance, the kernel will assume these tasks are of equal weight, when in fact the operator knows better.
The tools to manipulate scheduler policy and priority are things like nice and chrt (there are also syscalls, such as setpriority and sched_setscheduler, that apps can use directly). In the previous example, you might use nice to inform the scheduler that the weather application is the most important task on the system, and it should run whenever possible. Something like ‘nice -n -20 ./weather’ or ‘renice -n -20 -p $(pidof weather)’.
Back to the kernel’s main point in life: mediating access to hardware. In order to do this, the kernel may spawn a special type of process called a kthread (kernel thread). Kthreads cannot be controlled like regular processes; for example, you generally can't change their CPU/memory affinity or kill them. When these kthreads have work to do, the scheduler will eventually let them run. I wrote about some of this previously. They have important functions, like writing dirty memory pages out to disk (bdi-flush), shuffling network packets around (ksoftirqd), or servicing various kernel modules such as InfiniBand.
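An easy way to tell a kthread from a regular process: kernel threads have an empty /proc/PID/cmdline (which is also why ps shows their names in square brackets). A small sketch, with a hypothetical helper name:

```shell
# Classify a PID as "kthread" or "user" based on whether its
# /proc/PID/cmdline has any content (kthreads have none).
is_kthread() {
    if [ -n "$(tr -d '\0' < "/proc/$1/cmdline")" ]; then
        echo "user"
    else
        echo "kthread"
    fi
}

is_kthread $$   # the current shell is a user process
```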
When the kthreads run, they might run on the same core where the weather app is running. This interruption in userspace execution can cause a few symptoms, e.g. jittery latency, increased CPU cache misses, and poor overall performance.
If you’re staring at one of these symptoms, you might be curious about the easiest way to find out what’s bumping you off-core and dumping your precious cache lines.
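Before reaching for heavier tools, a quick first check is the pair of context-switch counters the kernel keeps per task: nonvoluntary_ctxt_switches counts how often the scheduler preempted the task (bumped it off-core), while voluntary_ctxt_switches counts the times it gave up the CPU on its own (e.g., blocking on I/O). A sketch against the current shell:

```shell
# Show how often this task left the CPU voluntarily (blocked)
# versus being preempted by something else the scheduler ran.
grep ctxt_switches /proc/self/status
```

A rapidly climbing nonvoluntary count on your hot thread is a strong hint that something is competing for its core.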
There are a few ways to determine this. I wrote about how to use perf sched record to do it in a low latency whitepaper, but wanted to write about a 2nd method I’ve been using a bit lately as well.
You can use a Systemtap script included in RHEL6 called ‘cycle_thief.stp’ (written by Red Hat’s Will Cohen) to find out what’s jumping ahead of you. Here’s an example; PID 3391 is a KVM guest. I added the [Section X] markers to make explaining the output a bit easier. I also removed the histogram buckets with zero values to shorten the output. Finally, I let it run for 30 seconds before hitting Ctrl+C.
# stap cycle_thief.stp -x 3391
^C
[Section 1]
task 3391 migrated: 1

[Section 2]
task 3391 on processor (us):
 value |-------------------------------------------------- count
    16 |@@@@@@@@@@@@                                          12
    32 |@@@@@@@@@@@                                           11
    64 |@                                                      1

[Section 3]
task 3391 off processor (us)
 value |-------------------------------------------------- count
   128 |@@@@@@@@@@@@                                          12
  8192 |@@@@                                                   4
131072 |@@@@                                                   4
524288 |@@@                                                    3

[Section 4]
other pids taking processor from task 3391
     0    55
  3393    17
  2689    13
   115     4
    69     2
   431     1

[Section 5]
irq taking processor from task 3391
irq  count  min(us)  avg(us)  max(us)
Section 1 represents the number of times PID 3391 was migrated between CPU cores.
Section 2 is a histogram of the number of microseconds PID 3391 was on-core (actively executing on a CPU).
Section 3 is a histogram of the number of microseconds PID 3391 was off-core (something else was running).
Section 4 identifies which PIDs executed on the core PID 3391 wanted to use during those 30 seconds (and thus bumped PID 3391 off-core). PID 0 is the kernel's idle task. You can grep the process table to see what the others are; sometimes you'll find userspace processes, sometimes kthreads. You can see this KVM guest was off-core more than on. It’s just an idle guest I created for this example, so that makes sense.
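To put names to those PIDs you can grep ps output, or just read /proc directly (note that the PIDs above are from my run and will differ on yours; the pid variable below is a stand-in):

```shell
# Resolve a PID to its short command name; /proc/PID/comm works
# for both userspace processes and kthreads.
pid=$$            # substitute a PID from the Section 4 list
cat /proc/$pid/comm
```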
Section 5 is blank; had any IRQs been serviced on this core during the 30-second script runtime, they’d be counted here.
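Had Section 5 been populated, /proc/interrupts is the place to cross-check: it shows per-CPU counts for each IRQ, so you can see which interrupts are landing on the core your app runs on. For example:

```shell
# First line is the CPU column headers; each following line is a
# per-CPU count of that IRQ since boot.
head -5 /proc/interrupts
```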
With an understanding of the various policies and priorities (see the MRG docs or man 2 setpriority), cycle_thief.stp is a super easy way of figuring out how to set your process policies and priorities to maximize the amount of time your app is on-core doing useful work.
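For instance, once cycle_thief.stp has fingered a housekeeping process as the thief, you can deprioritize it on the spot (raising a process's niceness is unprivileged; lowering it requires root). A sketch against a throwaway process standing in for the offender:

```shell
# Pretend this sleep is a backup script we want out of the way.
sleep 30 &
pid=$!

# Raise its niceness to 15 so the scheduler favors other work.
renice -n 15 -p $pid

# Confirm via the nice value, field 19 of /proc/PID/stat.
awk '{print $19}' /proc/$pid/stat

kill $pid
```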