The Linux Completely Fair Scheduler (CFS)
This lesson gives an introduction to the Linux Completely Fair Scheduler (or CFS) and explains how it works.
Despite these earlier works in fair-share scheduling, the current Linux approach achieves similar goals in an alternate manner. The scheduler, entitled the Completely Fair Scheduler (CFS), implements fair-share scheduling in a highly efficient and scalable way.
To achieve its efficiency goals, CFS aims to spend very little time making scheduling decisions, through both its inherent design and its clever use of data structures well-suited to the task. Recent studies have shown that scheduler efficiency is surprisingly important; specifically, in a study of Google data centers, Kanev et al. show that even after aggressive optimization, scheduling uses about 5% of overall data center CPU time. Reducing that overhead as much as possible is thus a key goal in modern scheduler architecture.
Basic operation
Whereas most schedulers are based around the concept of a fixed time slice, CFS operates a bit differently. Its goal is simple: to divide the CPU evenly among all competing processes. It does so through a simple counting-based technique known as virtual runtime (vruntime).
As each process runs, it accumulates vruntime. In the most basic case, each process's vruntime increases at the same rate, in proportion with physical (real) time. When a scheduling decision occurs, CFS will pick the process with the lowest vruntime to run next.
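This pick-the-minimum rule can be sketched in a few lines. Note this is a conceptual sketch, not kernel code: the real CFS keeps runnable processes in a red-black tree so the lowest-vruntime task can be found quickly, which a simple `min()` only approximates here.

```python
# Sketch: CFS picks the runnable process with the lowest vruntime.
# (The kernel uses a red-black tree; min() stands in for it here.)

processes = {"A": 30, "B": 10, "C": 20}  # name -> accumulated vruntime (ms)

def pick_next(procs):
    # Choose the process whose vruntime is smallest.
    return min(procs, key=procs.get)

print(pick_next(processes))  # → "B", the process with the least vruntime
```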
This raises a question: how does the scheduler know when to stop the currently running process, and run the next one? The tension here is clear: if CFS switches too often, fairness is increased, as CFS will ensure that each process receives its share of CPU even over minuscule time windows, but at the cost of performance (too much context switching); if CFS switches less often, performance is increased (reduced context switching), but at the cost of near-term fairness.
The sched_latency parameter
CFS manages this tension through various control parameters. The first is sched_latency. CFS uses this value to determine how long one process should run before considering a switch (effectively determining its time slice, but in a dynamic fashion). A typical sched_latency value is 48 (milliseconds); CFS divides this value by the number (n) of processes running on the CPU to determine the time slice for a process, thus ensuring that over this period of time, CFS will be completely fair.
For example, if there are n = 4 processes running, CFS divides the value of sched_latency by n to arrive at a per-process time slice of 12 ms. CFS then schedules the first job and runs it until it has used 12 ms of (virtual) runtime, and then checks to see if there is a job with lower vruntime to run instead. In this case, there is, and CFS would switch to one of the three other jobs, and so forth.
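The time-slice calculation itself is just a division; here is a minimal sketch using the 48 ms value from the text:

```python
# Sketch: per-process time slice = sched_latency / number of
# runnable processes (no minimum applied yet).

sched_latency = 48  # ms, the typical value mentioned above

def time_slice(n):
    # n is the number of processes currently running on the CPU.
    return sched_latency / n

print(time_slice(4))  # → 12.0 ms, matching the four-process example
```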
Example
The figure below shows an example where the four jobs (A, B, C, D) each run for two time slices in this fashion; two of them (C, D) then complete, leaving just two remaining, which then each run for 24 ms in round-robin fashion.
But what if there are “too many” processes running? Wouldn’t that lead to too small of a time slice, and thus too many context switches? Good question! And the answer is yes.
The min_granularity parameter
To address this issue, CFS adds another parameter, min_granularity, which is usually set to a value like 6 ms. CFS will never set the time slice of a process to less than this value, ensuring that not too much time is spent in scheduling overhead.
For example, if there are ten processes running, our original calculation would divide sched_latency by ten to determine the time slice (result: 4.8 ms). However, because of min_granularity, CFS will set the time slice of each process to 6 ms instead. Although CFS won't (quite) be perfectly fair over the target scheduling latency (sched_latency) of 48 ms, it will be close, while still achieving high CPU efficiency.
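With min_granularity added, the time-slice sketch gains a floor; again, this is a simplified model of the idea, not the kernel's actual code:

```python
# Sketch: the time slice is sched_latency / n, but never below
# min_granularity (values from the text: 48 ms and 6 ms).

sched_latency = 48   # ms
min_granularity = 6  # ms

def time_slice(n):
    # Clamp so a crowded CPU doesn't cause excessive context switching.
    return max(sched_latency / n, min_granularity)

print(time_slice(10))  # → 6 ms (floor applies; raw value would be 4.8 ms)
print(time_slice(4))   # → 12.0 ms (floor has no effect)
```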
Note that CFS utilizes a periodic timer interrupt, which means it can only make decisions at fixed time intervals. This interrupt goes off frequently (e.g., every 1 ms), giving CFS a chance to wake up and determine if the current job has reached the end of its run. If a job has a time slice that is not a perfect multiple of the timer interrupt interval, that is OK; CFS tracks vruntime precisely, which means that over the long haul, it will eventually approximate ideal sharing of the CPU.
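A toy simulation (a deliberate simplification, not kernel code) ties these pieces together: a 1 ms tick charges vruntime to the running process, and when a slice is used up, the lowest-vruntime process runs next. Over 96 ms with four processes and 12 ms slices, each process accumulates an equal share.

```python
# Toy simulation of tick-driven CFS-style scheduling.
TICK = 1    # ms, timer interrupt interval
SLICE = 12  # ms, e.g. 48 ms sched_latency / 4 processes

vruntime = {"A": 0, "B": 0, "C": 0, "D": 0}
current, used = "A", 0
for _ in range(96):  # simulate 96 ms of wall-clock time
    vruntime[current] += TICK  # charge the running process
    used += TICK
    if used >= SLICE:
        # Slice exhausted: switch to the lowest-vruntime process.
        current = min(vruntime, key=vruntime.get)
        used = 0

print(vruntime)  # each process ends with 24 ms of vruntime
```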