Sunday, September 13, 2009

Python Thread Synchronization Primitives : Not Entirely What You Think

If you have done any kind of programming with Python threads, you are probably familiar with the basic synchronization primitives provided by the threading module. Specifically, you get the following kinds of synchronization objects to work with: Lock, RLock, Semaphore, BoundedSemaphore, Event, and Condition.

Knowing how and when to use the various synchronization primitives is often a non-trivial exercise. However, the point of this post is not about that--so if you're here looking for a gentle tutorial, you're in the wrong place.

Instead, I'd like to look at the inner workings of Python's thread synchronization primitives. In part, this is motivated by a general interest in knowing how Python works on multicore machines. However, it's also related to something that I noticed when putting my GIL talk together. So, we'll take a little tour under the covers, do a few experiments, and think about how this might fit into the "big picture."

A Curious Fact

If you write threaded programs, you should know that Python uses real system-level threads to carry out its work. That is, threads are implemented using pthreads or some other native threading mechanism provided by the operating system. However, the same cannot be said of Python's basic synchronization primitives such as Lock, Condition, Semaphore, and so forth. That is, even though low-level libraries such as pthreads provide various kinds of basic locks and synchronization objects, the threading library doesn't make direct use of them (so, when you're using something like a Lock object in your program, you're not manipulating a pthreads mutex).

This fact may surprise experienced programmers. Many of Python's core library modules provide a direct interface to low-level functionality written in C (e.g., think about the os or socket modules). However, thread synchronization objects are an exception to that rule.

Some History

Python has included support for threads for most of its history. In fact, if Guido ever gets around to updating his History of Python blog, he will eventually tell you that threads were first added to Python in 1992 after a contribution by one of his coworkers, Sjoerd Mullender (disclaimer: I don't have a time machine, but I have seen the entire "History of Python" article that Guido is using as the basis for his history blog). This early work is where you find the introduction of the global interpreter lock (GIL) as well as the low-level thread library module.

Part of the problem faced by early versions of Python was the fact that thread programming interfaces weren't always available or standardized across systems. Thus, threads were only supported on certain machines such as SGI Irix and Sun Solaris. The pthreads interface wasn't standardized until a little later (~1995). The modern threading library that virtually all Python programmers now use first appeared in Python-1.5.1 (1998).

A consequence of this chaos was that Python's support for threads was intentionally designed to have a minimal set of basic requirements. The thread library module simply provided a function for launching a Python callable in its own execution thread. A single function, allocate_lock(), could be used to allocate a "lock" object. This object provided the usual acquire() and release() operations, but not much else.
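
To get a sense of how minimal this interface is, here is a small sketch of the thread module in action (the worker function is just for illustration):

import thread
import time

lock = thread.allocate_lock()

def worker():
    lock.acquire()          # blocks until the lock is available
    print "worker running"
    lock.release()

thread.start_new_thread(worker, ())
time.sleep(1)               # crude way to let the thread finish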

If you dig into the C implementation of the interpreter, you'll find that all support for locking is reduced to just four C functions:

PyThread_allocate_lock()
PyThread_free_lock()
PyThread_acquire_lock()
PyThread_release_lock()

You can find these functions in a series of files such as thread_pthread.h, thread_nt.h, thread_solaris.h, and so forth in the Python/ directory of the Python interpreter source. Each file simply contains a platform-specific implementation of a basic lock. This lock then becomes the basis for all other synchronization primitives, as we'll see in a minute. These same functions are also used to implement the infamous global interpreter lock (GIL).

What is a lock exactly?

If you have worked with thread locking in C, you might think that the above C functions are simply a wrapper around something like a pthreads mutex lock. However, this is not the case. Instead, the lock is minimally implemented as a binary semaphore. Here is a simplified example of the lock implementation that's used on many Unix systems:

#include <stdlib.h>
#include <pthread.h>
#include <string.h>

typedef struct {
  char            locked;         /* 0 = unlocked, 1 = locked */
  pthread_cond_t  lock_released;  /* signaled whenever the lock is released */
  pthread_mutex_t mut;            /* protects the locked flag */
} lock_t;

lock_t *
allocate_lock(void) {
  lock_t *lock;
  lock = (lock_t *) malloc(sizeof(lock_t));
  memset((void *)lock, '\0', sizeof(lock_t));
  pthread_mutex_init(&lock->mut,NULL);
  pthread_cond_init(&lock->lock_released, NULL);
  return lock;
}

void 
free_lock(lock_t *lock) {
  pthread_mutex_destroy( &lock->mut );
  pthread_cond_destroy( &lock->lock_released );
  free((void *)lock);
}

int
acquire_lock(lock_t *lock, int waitflag) {
  int success;
  pthread_mutex_lock( &lock->mut );
  success = lock->locked == 0;      /* can we take the lock right away? */

  if ( !success && waitflag ) {
    /* Lock is busy: sleep until some thread releases it */
    while ( lock->locked ) {
      pthread_cond_wait(&lock->lock_released,&lock->mut);
    }
    success = 1;
  }
  if (success) lock->locked = 1;    /* mark the lock as taken */
  pthread_mutex_unlock( &lock->mut );
  return success;
}

void 
release_lock(lock_t *lock) {
  pthread_mutex_lock( &lock->mut );
  lock->locked = 0;
  pthread_mutex_unlock( &lock->mut );
  pthread_cond_signal( &lock->lock_released );
}

Understanding this code requires some careful study. However, the key part of it is that Python lock objects manually keep track of their internal state (locked or unlocked). This is the locked attribute of the lock structure. The pthreads mutex lock is simply being used to synchronize access to the locked attribute in the acquire() and release() operations (note: this mutex is not itself the lock). Finally, the condition variable is used for thread signaling--waking up any sleeping threads when the lock gets released.

What about Native Semaphores?

As just mentioned, the Python lock is minimally implemented as a binary semaphore. If you've done thread programming in C, you probably know that many systems optionally include a native semaphore object. On such systems, Python may be built in a way so that it simply uses the native semaphore object for the lock. For example, this is what Python uses for synchronization on Windows.

I don't intend to say any more about this here except to emphasize that using some kind of semaphore is actually a requirement for other parts of Python's threading to work correctly. For instance, the high-level threading library won't work if the lock isn't implemented in this manner--as you'll see shortly, a Condition object blocks a thread by having it acquire a lock that a different thread later releases.

Semaphores vs. Mutex Locks

The differences between a semaphore and mutex lock are subtle. However, the most obvious one pertains to the issue of ownership. When you use a mutex lock, there is almost always a strong sense of ownership. Specifically, if a thread acquires a mutex, it is the only thread that is allowed to release it. Semaphores don't have this restriction. In fact, once a semaphore has been acquired, any thread can later release it. This allows for more varied forms of thread signaling and synchronization. Here is one such experiment you can try in Python:

>>> import threading, time
>>> done = threading.Lock()
>>> def foo():
...      print "I'm foo and I'm running"
...      time.sleep(30)
...      done.release()       # Signal completion by releasing the lock
...
>>> done.acquire()
>>> threading.Thread(target=foo).start()
I'm foo and I'm running
>>> done.acquire(); print "Foo done"
Foo done                        (note: prints after 30 seconds)
>>>

In this example, a lock is being used to signal the completion of some task. The main thread acquires the lock to clear it and then launches a thread to carry out some work. Immediately after launching this thread, the main thread attempts to acquire the lock again. Since the lock was already in use, this operation blocks. However, when the worker thread finishes, it releases the lock--notifying the main thread that it has finished. It is critical to emphasize that the lock is being acquired and released by two different threads. This is the essential property provided by using a semaphore. If a traditional mutex lock were used, the program would deadlock or crash with an error.

Just as an aside, I would not recommend writing Python code that uses Lock objects in this way. Most programmers are going to associate Lock with a mutex lock, and you definitely don't use mutex locks in the manner shown.
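
If you want this style of completion signaling, the threading library already provides an Event object that says what you mean. Here is a sketch of the earlier experiment rewritten to use one:

import threading, time

done = threading.Event()

def foo():
    print "I'm foo and I'm running"
    time.sleep(30)
    done.set()              # signal completion

threading.Thread(target=foo).start()
done.wait()                 # blocks until foo() calls set()
print "Foo done"

Of course, as described below, Event is itself implemented in Python on top of a Condition object, so you're not escaping any locking overhead by doing this--it just reads better.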

Other differences between mutex locks and semaphores tend to be more subtle. There are a number of well-known problems concerning mutex locks that typically get addressed by thread libraries and the operating system. For example, the system may implement policies to prevent thread starvation or provide some sense of fairness when many threads are competing for the same lock. If threads have different scheduling priorities, the system may also try to work around problems related to priority inversion (a problem where a low-priority thread holds a lock needed by a high-priority thread). Semaphores aren't necessarily treated in the same manner, which means that a multithreaded program using semaphores may execute in a manner that is slightly different than one that uses mutex locks. For now, however, let's skip over those details.

The threading Library

Now that we've talked about the low-level locking mechanism used by the interpreter, let's talk about the synchronization primitives defined in the threading library. With the exception of Lock objects, which are identical to the lock described in the above section, all of the other synchronization primitives are written entirely in Python. For example, consider the RLock implementation. Here is a cleaned-up version of how it is implemented:

# Within threading.py, _allocate_lock refers to thread.allocate_lock
class RLock:
    def __init__(self):
        self._block = _allocate_lock()
        self._owner = None
        self._count = 0

    def acquire(self, blocking=1):
        me = current_thread()
        if self._owner is me:
            self._count = self._count + 1
            return 1
        rc = self._block.acquire(blocking)
        if rc:
            self._owner = me
            self._count = 1
        return rc

    def release(self):
        if self._owner is not current_thread():
            raise RuntimeError("cannot release un-acquired lock")
        self._count = count = self._count - 1
        if not count:
            self._owner = None
            self._block.release()
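
The _owner and _count bookkeeping is what makes the lock reentrant--the same thread can acquire it more than once without deadlocking. A quick sketch:

import threading

rlock = threading.RLock()

def inner():
    rlock.acquire()         # a plain Lock would deadlock here
    rlock.release()

def outer():
    rlock.acquire()
    inner()                 # reacquiring in the same thread is fine
    rlock.release()

outer()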

The fact that an RLock is implemented entirely as a Python layer over a regular lock object significantly impacts its performance. For example:

>>> from timeit import timeit
>>> timeit("lock.acquire();lock.release()","from threading import Lock; lock = Lock()")
0.50123405456542969
>>> timeit("lock.acquire();lock.release()","from threading import RLock; lock = RLock()")
5.2153160572052002
>>>

Here, you see that acquiring and releasing an RLock object is about ten times slower than using a Lock. The performance impact is worse for more advanced synchronization primitives. For example, here is the result of using a Semaphore object (which is also implemented entirely in Python):

>>> timeit("lock.acquire();lock.release()","from threading import Semaphore; lock = Semaphore(1)")
6.5345189571380615
>>> 

Condition and Event objects are also implemented entirely in Python. However, their implementation is also rather interesting. Keep in mind that the primary purpose of a Condition object is to perform signaling between threads. Here is a very common scenario that you see with producer-consumer problems such as in the implementation of a queue.

from threading import Condition
from collections import deque

items      = deque()
items_cv   = Condition()

def producer():
    while True:
         item = make_item()      # stand-in for producing a real item
         items_cv.acquire()
         items.append(item)
         items_cv.notify()
         items_cv.release()

def consumer():
    while True:
         items_cv.acquire()
         while not items:
               items_cv.wait()
         item = items.popleft()
         items_cv.release()
         # Do something with item

Of particular interest here are the wait() and notify() operations that perform the thread signaling. This signaling is actually carried out using a Lock object. When you wait on a condition variable, a new Lock object is created and acquired. The lock is then acquired again to force the thread to block. If you look at the implementation of Condition you find code like this:

class Condition:
    ...
    def wait(self, timeout=None):
        ...
        waiter = _allocate_lock()
        waiter.acquire()
        self._waiters.append(waiter)
        ...
        waiter.acquire()       # Block
    ...

(In the elided parts, wait() also releases the lock associated with the condition variable before blocking and reacquires it afterwards--that's what lets another thread get in to call notify().) The notify() operation that wakes up a thread is carried out by simply releasing the waiter lock created above:

class Condition:
    ...
    def notify(self, n=1):
        waiters = self._waiters[:n]
        for waiter in waiters:
            waiter.release()
    ...

Needless to say, a lot of processing is going on underneath the covers when you use something like a Condition object in Python. Every wait() operation involves creating an entirely new lock object. Signaling is carried out with acquire() and release() operations on that lock. Moreover, there are additional locking operations carried out on the lock object associated with the condition variable itself. Plus, consider that all of this high-level locking actually involves more locks and condition variables in C.
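
You can get a feel for this overhead even on the fast path. For instance, waiting on an Event that has already been set never blocks, yet it still runs through several Python-level lock operations (no numbers claimed here--try it on your own machine):

from timeit import timeit
print timeit("e.wait()", "from threading import Event; e = Event(); e.set()")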

Who Cares?

At this point, you might be asking yourself "who cares? This is all a bunch of low-level esoteric details." However, I think that anyone who is serious about using threads in Python should take an interest in how the synchronization primitives are actually put together.

For one, a common rule of thumb with thread programming is to try and avoid the use of locks and synchronization primitives as much as possible. This is certainly true in C, but even more so in Python. The fact that almost all of the synchronization primitives are implemented in Python means that they are substantially slower than any comparable operations in a C/C++ threading library. So, if you care about performance, using a lot of locks is something you'll definitely want to avoid.

The other reason to care about this concerns the Queue module. It is commonly advised that the Queue module be used as a means for exchanging data between threads because it already deals with all of the underlying synchronization. This is all well and good except for the fact that Queue objects add even more layers to all of the synchronization primitives that we've talked about. In particular, the locking performed by a queue is done using a combination of locks and condition variables from the threading module.
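
If you look inside Queue.py, the constructor alone gives a sense of this layering. Roughly (paraphrasing the Python 2.x source):

class Queue:
    def __init__(self, maxsize=0):
        self.maxsize = maxsize
        self.queue = deque()
        self.mutex = threading.Lock()
        # Three condition variables, all sharing the same mutex
        self.not_empty = threading.Condition(self.mutex)
        self.not_full = threading.Condition(self.mutex)
        self.all_tasks_done = threading.Condition(self.mutex)

Every put() and get() works through these Python-level objects on top of everything described above.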

This means that if you're using queues, you're not really avoiding all of the overhead of locking. Instead, you're just moving it to a different location where it's out of view.

One might wonder just how much overhead gets added by all of this. For instance, a Queue object is really just a wrapper around a collections.deque with the added locking. You can try a few performance experiments. For instance, inserting items:

>>> timeit("q.append(1)","from collections import deque; q = deque()")
0.17505884170532227
>>> timeit("q.put(1)","from Queue import Queue; q = Queue()")
4.4164938926696777
>>> 

Here, you find that inserting into a Queue is about 25 times slower than inserting into a deque. You get similar figures for removing items. Keep in mind that these simple benchmarks don't even cover the case of working with multiple threads where even more overhead would be added.
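
If you want to check the removal side yourself, an analogous measurement works (each container is pre-filled so that nothing blocks; actual numbers will vary by machine):

timeit("q.popleft()", "from collections import deque; q = deque([1]*1000000)", number=1000000)
timeit("q.get()", "from Queue import Queue; q = Queue(); map(q.put, [1]*1000000)", number=1000000)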

Some Final Thoughts

There surely seems to be an opportunity for some experimentation with better implementations of Python's thread synchronization primitives. For example, condition variables are a core component of Python's Semaphore, Event, and Queue objects, yet Python makes no effort to use any kind of native implementation (e.g., pthreads condition variables). Moreover, why is Python using custom implementations of synchronization objects already provided by the operating system and thread libraries (e.g., semaphores)? Given that much of Python's thread implementation was worked out more than ten years ago, it would be interesting to perform some experiments and revisit the threading implementation on modern systems--especially in light of the increased interest in concurrency, multiple CPU cores, and other matters.

Anyways, that's it for now. I'd love to hear your comments. Also, if you are aware of prior work related to optimizing the threading library, benchmarks, or anything else that might be related, I'd be interested in links so that I can post them here.

