Non-Deadlock Bugs

This lesson includes two frequent types of non-deadlock concurrency bugs: atomicity and ordering violation.

Non-deadlock bugs make up a majority of concurrency bugs, according to Lu’s study“Learning from Mistakes — A Comprehensive Study on Real World Concurrency Bug Characteristics” by Shan Lu, Soyeon Park, Eunsoo Seo, Yuanyuan Zhou. ASPLOS ’08, March 2008, Seattle, Washington. The first in-depth study of concurrency bugs in real software, and the basis for this chapter. Look at Y.Y. Zhou’s or Shan Lu’s web pages for many more interesting papers on bugs.. But what types of bugs are these? How do they arise? How can we fix them? We now discuss the two major types of non-deadlock bugs found by Lu et al.:

  • Atomicity violation bugs.
  • Order violation bugs.

Atomicity-violation bugs

The first type of problem encountered is referred to as an atomicity violation. Here is a simple example, found in MySQL. Before reading the explanation, try figuring out what the bug is. Do it!

Press + to interact
Thread 1::
if (thd->proc_info) {
fputs(thd->proc_info, ...);
}
Thread 2::
thd->proc_info = NULL;

In the example, two different threads access the field proc_info in the structure thd. The first thread checks if the value is non-NULL and then prints its value; the second thread sets it to NULL. Clearly, if the first thread performs the check but then is interrupted before the call to fputs, the second thread could run in-between, thus setting the pointer to NULL. When the first thread resumes, it will crash, as a NULL pointer will be dereferenced by fputs.

The more formal definition of an atomicity violation, according to Lu et al, is this: “The desired serializability among multiple memory accesses is violated (i.e. a code region is intended to be atomic, but the atomicity is not enforced during execution).” In our example above, the code has an atomicity assumption (in Lu’s words) about the check for non-NULL of proc_info and the usage of proc_info in the fputs() call; when the assumption is incorrect, the code will not work as desired.

Finding a fix for this type of problem is often (but not always) straightforward. Can you think of how to fix the code above?

In the solution presented below, you can simply add locks around the shared-variable references, ensuring that when either thread accesses the proc_info field, it has a lock held (proc_info_lock). Of course, any other code that accesses the structure should also acquire this lock before doing so.

Press + to interact
pthread_mutex_t proc_info_lock = PTHREAD_MUTEX_INITIALIZER;
Thread 1::
pthread_mutex_lock(&proc_info_lock);
if (thd->proc_info) {
fputs(thd->proc_info, ...);
}
pthread_mutex_unlock(&proc_info_lock);
Thread 2::
pthread_mutex_lock(&proc_info_lock);
thd->proc_info = NULL;
pthread_mutex_unlock(&proc_info_lock);

Order-violation bugs

Another common type of non-deadlock bug found by Lu et al. is known as an order violation. Here is another simple example; once again, see if you can figure out why the code below has a bug in it.

Press + to interact
Thread 1::
void init() {
mThread = PR_CreateThread(mMain, ...);
}
Thread 2::
void mMain(...) {
mState = mThread->State;
}

As you probably figured out, the code in Thread 2 seems to assume that the variable mThread has already been initialized and is not NULL. However, if Thread 2 runs immediately once created, the value of mThread will not be set when it is accessed within mMain() in Thread 2, and will likely crash with a NULL-pointer dereference. Note that we assume the value of mThread is initially NULL; if not, even stranger things could happen as arbitrary memory locations are accessed through the dereference in Thread 2.

The more formal definition of an order violation is the following: “The desired order between two (groups of) memory accesses is flipped (i.e., AA should always be executed before BB, but the order is not enforced during execution)”“Learning from Mistakes — A Comprehensive Study on Real World Concurrency Bug Characteristics” by Shan Lu, Soyeon Park, Eunsoo Seo, Yuanyuan Zhou. ASPLOS ’08, March 2008, Seattle, Washington. The first in-depth study of concurrency bugs in real software, and the basis for this chapter. Look at Y.Y. Zhou’s or Shan Lu’s web pages for many more interesting papers on bugs..

The fix to this type of bug is generally to enforce the order. As discussed previously, using condition variables is an easy and robust way to add this style of synchronization into modern codebases. In the example above, we could thus rewrite the code as shown below.

Press + to interact
pthread_mutex_t mtLock = PTHREAD_MUTEX_INITIALIZER;
pthread_cond_t mtCond = PTHREAD_COND_INITIALIZER;
int mtInit = 0;
Thread 1::
void init() {
...
mThread = PR_CreateThread(mMain, ...);
// signal that the thread has been created...
pthread_mutex_lock(&mtLock);
mtInit = 1;
pthread_cond_signal(&mtCond);
pthread_mutex_unlock(&mtLock);
...
}
Thread 2::
void mMain(...) {
...
// wait for the thread to be initialized...
pthread_mutex_lock(&mtLock);
while (mtInit == 0)
pthread_cond_wait(&mtCond, &mtLock);
pthread_mutex_unlock(&mtLock);
mState = mThread->State;
...
}

In this fixed-up code sequence, we have added a condition variable (mtCond) and corresponding lock (mtLock), as well as a state variable (mtInit). When the initialization code runs, it sets the state of mtInit to 1 and signals that it has done so. If Thread 2 had run before this point, it will be waiting for this signal and corresponding state change; if it runs later, it will check the state and see that the initialization has already occurred (i.e., mtInit is set to 1), and thus continue as is proper. Note that we could likely use mThread as the state variable itself, but do not do so for the sake of simplicity here. When ordering matters between threads, condition variables (or semaphores) can come to the rescue.

Summary

A large fraction (97%) of non-deadlock bugs studied by Lu et al. are either atomicity or order violations. Thus, by carefully thinking about these types of bug patterns, programmers can likely do a better job of avoiding them. Moreover, as more automated code-checking tools develop, they should likely focus on these two types of bugs as they constitute such a large fraction of non-deadlock bugs found in deployment.

Unfortunately, not all bugs are as easily fixed as the examples we looked at above. Some require a deeper understanding of what the program is doing, or a larger amount of code or data structure reorganization to fix. Read Lu et al.’s excellent (and readable) paper for more details.

Get hands-on with 1400+ tech skills courses.