The Heart Of The Problem: Uncontrolled Scheduling
Let's try to find the root cause of the problem the threads ran into in the last lesson.
You saw in the last lesson an example of two threads trying to update a shared variable, and you noticed that the program computed an incorrect value. The code snippet is reproduced below:
#include <stdio.h>
#include <stdlib.h>
#include <pthread.h>
#include "common.h"
#include "common_threads.h"

int max;
volatile int counter = 0; // shared global variable

void *mythread(void *arg) {
    char *letter = arg;
    int i; // stack (private per thread)
    printf("%s: begin [addr of i: %p]\n", letter, &i);
    for (i = 0; i < max; i++) {
        counter = counter + 1; // shared: only one
    }
    printf("%s: done\n", letter);
    return NULL;
}

int main(int argc, char *argv[]) {
    if (argc != 2) {
        fprintf(stderr, "usage: main-first <loopcount>\n");
        exit(1);
    }
    max = atoi(argv[1]);

    pthread_t p1, p2;
    printf("main: begin [counter = %d] [%x]\n", counter, (unsigned int) &counter);
    Pthread_create(&p1, NULL, mythread, "A");
    Pthread_create(&p2, NULL, mythread, "B");

    // join waits for the threads to finish
    Pthread_join(p1, NULL);
    Pthread_join(p2, NULL);
    printf("main: done\n [counter: %d]\n [should: %d]\n", counter, max*2);
    return 0;
}
Let's discuss why this happens.
TIP: KNOW AND USE YOUR TOOLS
You should always learn new tools that help you write, debug, and understand computer systems. Here, you will use a neat tool called a disassembler. When you run a disassembler on an executable, it shows you what assembly instructions make up the program. For example, if you wish to understand the low-level code that updates a counter (as in the example), you may run objdump in the terminal to see the assembly code:
prompt> objdump -d t1
Doing so produces a long listing of all the instructions in the program, neatly labeled (particularly if you compiled with the -g flag), which includes symbol information in the program. The objdump program is just one of many tools you should learn how to use; a debugger like gdb, memory profilers like valgrind or purify, and of course the compiler itself are others that you should spend time learning more about. The better you are at using your tools, the better systems you'll be able to build.
The heart of the problem
To understand why this problem occurs, you must understand the code sequence that the compiler generates for the update to counter. In this case, you wish to simply add a number (1) to counter. Thus, the code sequence for doing so might look something like this (in x86):
mov 0x8049a1c, %eax
add $0x1, %eax
mov %eax, 0x8049a1c
This example assumes that the variable counter is located at address 0x8049a1c. In this three-instruction sequence, the x86 mov instruction is used first to get the memory value at the address and put it into register eax. Then, the add is performed, adding 1 (0x1) to the contents of the eax register, and finally, the contents of eax are stored back into memory at the same address.
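In C terms, the single statement counter = counter + 1 therefore behaves like three separate steps. In the sketch below, the temporary variable tmp only illustrates the role the eax register plays; it does not appear in the original program:

int tmp;
tmp = counter;    // load:  mov 0x8049a1c, %eax
tmp = tmp + 1;    // add:   add $0x1, %eax
counter = tmp;    // store: mov %eax, 0x8049a1c

A thread can be interrupted after any one of these steps, which is exactly what goes wrong next.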
Let us imagine one of the two threads (Thread 1) enters this region of code and is thus about to increment counter by one. It loads the value of counter (let's say it's 50, to begin with) into its register eax. Thus, eax=50 for Thread 1. Then it adds one to the register; thus eax=51. Now, something unfortunate happens: a timer interrupt goes off; thus, the OS saves the state of the currently running thread (its PC, its registers including eax, etc.) to the thread's TCB.
Now, something worse happens: Thread 2 is chosen to run, and it enters this same piece of code. It also executes the first instruction, getting the value of counter and putting it into its eax (remember: each thread when running has its own private registers; the registers are virtualized by the context-switch code that saves and restores them). The value of counter is still 50 at this point, and thus Thread 2 has eax=50. Let's then assume that Thread 2 executes the next two instructions, incrementing eax by 1 (thus eax=51), and then saving the contents of eax into counter (address 0x8049a1c). Thus, the global variable counter now has the value 51.
Finally, another context switch occurs, and Thread 1 resumes running. Recall that it had just executed the mov and add, and is now about to perform the final mov instruction. Recall also that eax=51. Thus, the final mov instruction executes, and saves the value to memory; the counter is set to 51 again.
Put simply, what has happened is this: the code to increment counter has been run twice, but counter, which started at 50, is now only equal to 51. A "correct" version of this program should have resulted in the variable counter equal to 52.
Let's look at a detailed execution trace to understand the problem better. Assume, for this example, that the above code is loaded at address 100 in memory, like the following sequence (note for those of you used to nice, RISC-like instruction sets: x86 has variable-length instructions; this mov instruction takes up 5 bytes of memory, and the add only 3):
100 mov 0x8049a1c, %eax
105 add $0x1, %eax
108 mov %eax, 0x8049a1c
With these assumptions, what happens is shown in the figure below. Assume the counter starts at value 50, and trace through this example to make sure you understand what is going on.
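In plain text, the same trace looks roughly like the sketch below (the values follow the discussion above, with counter starting at 50):

Thread 1                       Thread 2                       counter
mov 0x8049a1c, %eax  (eax=50)                                 50
add $0x1, %eax       (eax=51)                                 50
--- timer interrupt: OS saves Thread 1's state (eax=51) ---
                               mov 0x8049a1c, %eax  (eax=50)  50
                               add $0x1, %eax       (eax=51)  50
                               mov %eax, 0x8049a1c            51
--- timer interrupt: OS restores Thread 1's state (eax=51) ---
mov %eax, 0x8049a1c                                           51

Two increments ran, yet counter only moved from 50 to 51: Thread 1's final store simply overwrites the value Thread 2 already wrote.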
What you have seen here is called a race condition (or, more specifically, a data race): the results depend on the timing of the code's execution. With some bad luck (i.e., context switches that occur at untimely points in the execution), you get the wrong result. In fact, you may get a different result each time; thus, instead of a nice deterministic computation (which you are used to from computers), you call this result indeterminate, where it is not known what the output will be and it is indeed likely to be different across runs. Because multiple threads executing this code can result in a race condition, this code is called a critical section. A critical section is a piece of code that accesses a shared variable (or more generally, a shared resource) and must not be concurrently executed by more than one thread.
What you really want for this code is what is called mutual exclusion. This property guarantees that if one thread is executing within the critical section, the others will be prevented from doing so.
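One standard way to obtain mutual exclusion is a lock. As a preview, here is a minimal sketch using a POSIX pthread mutex; the lock variable and the increment_counter helper are illustrative additions, not part of the program above:

#include <pthread.h>

volatile int counter = 0;                          // shared variable
pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;  // protects counter

void increment_counter(void) {
    pthread_mutex_lock(&lock);    // only one thread can hold the lock at a time
    counter = counter + 1;        // the critical section
    pthread_mutex_unlock(&lock);  // allow the next thread in
}

With every update wrapped in the lock, one thread's load-add-store sequence can no longer be interleaved with the other's, and two threads each looping max times would end with counter equal to max*2.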
Virtually all of these terms, by the way, were coined by Edsger Dijkstra, who was a pioneer in the field and indeed won the Turing Award because of this and other work; see his 1968 paper "Cooperating Sequential Processes" for an early and remarkably clear description of the problem.