A Few Things You Should Know About Futexes

This article aims to provide a clear and concise discussion about the basics of futexes, and how they could be used to implement user-space synchronisation mechanisms on Linux.

So what are futexes? They are a kernel mechanism (in the form of a system call), using which fast user-space synchronisation can be implemented. The kernel interface for futexes looks like what is shown below:

int futex(int *uaddr, int op, int val, const struct timespec *timeout,
int *uaddr2, int val3);

The second param, op, can have several values, of which the following two are relevant for this discussion:
FUTEX_WAIT: The kernel checks if the value at uaddr is the same as val; if so, it then blocks the calling thread/process. If not, it then returns with EWOULDBLOCK. The last two parameters are ignored for this op, and the timeout parameter is not relevant, at least for now.
FUTEX_WAKE: The kernel wakes up at most val processes waiting on this futex. The last three parameters are ignored for this operation.

So why do we need futexes? Take a look at the traditional user-space synchronisation mechanisms:
Spin lock involves spinning on a variable in user space if the lock is not available. No context switch to kernel space is necessary. Spin locks are great if the size of the critical section is known to be small, or if there is no alternate work to be executed after blocking the thread. A lot of research has highlighted the limitations of spin locks not scalable with multi-cores, fairness issues, etc.
Semaphores involve making a system call each time, be it contended or uncontended. Hence, they are slower.
Thin locks initially try user-level locking (using atomics), similar to spin locks. If it fails, they inflate the lock and only try OS-level locking (similar to semaphores). But inflating happens only once, the first time a contention is seen. Then on, they stay inflated. Again, there’s system call overhead, once the lock is inflated.
In order to address some of these issues, futexes were added to the Linux kernel to enable library developers to implement a fast user-space synchronisation mechanism. How is this done?
Note: This discussion is based on the paper ‘Futexes are Tricky’ by Ulrich Drepper.

Using the above futex system call interface, fast user-space synchronisation mechanisms can be implemented.
1)     The condition wait-signal mechanism: This is typically used in producer-consumer problems, similar to what pthread_cond_wait and pthread_cond_signal provide. The following code snippet shows how this may be implemented:

class event {
public:
event () : val (0) { }
void ev_signal () {
++val;
futex_wake (&val, INT_MAX);
}
void ev_wait () {
futex_wait (&val, val);
}
private:
int val;
};

Here, the consumer threads call ev_wait() and pass in the current value of val (which is the same as what is stored at &val); hence, it is blocked. Note the semantics of the futex_wait() call above-although the interface shown here is slightly different. The producer thread calls ev_signal(), which increments the value of val and calls futex_wake() to wake up all the waiting threads (all since INT_MAX is passed as val; again, a different interface is shown here, for the sake of discussion).
Note here that incrementing val before calling futex_wake() is not really necessary, but here it is done to avoid a race condition. Let us suppose a new consumer thread had read the old value of val and was context-switched before calling futex_wait(). Now, after the producer called ev_signal(), if the futex_wait() is invoked in the new consumer, it would not block, since the value of val is not the same as what it had originally read.
2)     Implementing mutual exclusion (mutexes): The following code illustrates one simple, trivial way to implement mutual exclusion using the above futex interface:

class mutex {
public:
mutex () : val (0) { }
void lock () {
int c;
while ((c = atomic_inc (val)) != 0)
futex_wait (&val, c + 1);
}
void unlock () {
val = 0;
futex_wake (&val, 1);
}
private:
int val;
};

Here, a zero value for val represents an unlocked mutex. Any other value represents a locked one. It is easy to see how this provides mutual exclusion. When two different threads call lock() at the same time, only one of them finds that the old value was non-zero, and only that thread is put to sleep. Hence, only one thread enters the critical section. In unlock(), only one thread is woken up at a time. This is kind of an optimisation, which works just fine with respect to correctness.
Although this seems to work, it has several problems.
The correctness problem (1): Consider the following scenario. The lock is taken by thread T1, and T2 and T3 execute lock() almost at the same time. Let us suppose the lock() code gets executed in the following manner:
Initially, assume val is 1;

T2                                                                T3
lock()
atomic_inc (val) //now val=2, c=1              lock()
atomic_inc (val) //val=3, c=2
futex_wait (&val, c + 1); //value(val)=3
atomic_inc (val) //now val=4, c=3
futex_wait (&val, c + 1); //value(val)=4

In the above sequence, both futex_wait calls fail, since the actual value of val (3 and 4 as shown in the code) is not the same as what is being passed as second arg (c+1 = 2 and 3 respectively). This sequence may repeat forever, making the program incorrect.
The correctness problem (2): In case of interrupts, a single thread may keep getting interrupted exactly when it calls futex_wait(). This system call returns whenever an interrupt occurs. Hence, a single thread may keep incrementing val until it overflows and becomes 0.
A performance-related problem: The futex_wake() call can be avoided in unlock() in the uncontended case.

Better mutex implementation
Consider the following revised implementation, which fixes these issues:

class mutex2 {
public:
mutex () : val (0) { }
void lock () {
int c;
if ((c = cmpxchg (val, 0, 1)) != 0)
do {
if (c == 2 || cmpxchg (val, 1, 2) != 0)
futex_wait (&val, 2);
} while ((c = cmpxchg (val, 0, 2)) != 0);   ///Interesting
}
void unlock () {
if (atomic_dec (val) != 1) {
val = 0;
futex_wake (&val, 1);
}
}
private:
int val;
};

Here, we use 3 states instead of 2: 0 unlocked, 1 locked + no contention, and 2 locked + contention. Also, this code uses cmpxchg(), which also returns the old value. If the lock was free, the thread simply gets the lock, sets it to 1 and returns. If the lock was already taken, there are two cases:
a) In the uncontended case, nobody else is contesting for the lock; hence, its value is 1. We indicate that there’s a waiting thread by setting the value to 2.
b) In the contended case, the value is already 2. The check (c==2) succeeds and the thread simply blocks. But if the futex_wait() fails (because of unlock() code waking it up, or an interrupt), then it would have to try to set the value of lock again. But here (in the while() statement), we try to set it from 0 to 2 directly, since we don’t know at this point if we are the only ones who are locking. There may be many waiting.
Now, in the lock code there’s no question of overflow, since the highest value it reaches is 2. This fixes both correctness problems 1) and 2) discussed above. In the unlock code, check that the state is contended, and only then call futex_wake(). This fixes the performance problem discussed earlier.

Kernel support for futexes
At the kernel level, each futex has a unique ID, which is the struct page pointer and the offset within the page. This way, even though the futex variable may have different virtual addresses across processes, the physical address is the same, hence providing synchronisation. The reference count of the page is incremented, so that it is not swapped out of memory even when the process is sleeping. The futex structure on which a process is sleeping is placed on the kernel stack of the process.

A hash table maps the futex ID (uaddr param, above) to per-thread/per-process futex_q structs. When futex_wait() is called by a thread, a new futex_q object is created for it and added to the appropriate hash table entry. When futex_wake() is called, the same is used to wake up the sleeping process.
Do refer to the links below about futex implementation at the kernel level, for more details.

References
[0]     http://www.cis.temple.edu/~giorgio/cis307/readings/futex0.pdf
[1]     http://www.cis.temple.edu/~giorgio/cis307/readings/futex.pdf
[2]     http://locklessinc.com/articles/mutex_cv_futex/
[3]     http://lwn.net/Articles/360699/
[4]     http://lwn.net/Articles/267968/

All published articles are released under Creative Commons Attribution-NonCommercial 3.0 Unported License, unless otherwise noted.
Open Source For You is powered by WordPress, which gladly sits on top of a CentOS-based LEMP stack.

Creative Commons License.