Dabeaz

Dave Beazley's mondo computer blog. [ homepage | archive ]

Thursday, August 27, 2009

Inside the "Inside the Python GIL" Presentation

On June 11, 2009 I gave a presentation about the inner workings of the Python GIL at the Chicago Python user group meeting. To be honest, I always expected the event to be a pretty low-key affair involving some local Python hackers and some beers. However, the presentation went a little viral and I've received a number of requests to get the code modifications I made to investigate thread behavior--especially the traces that show thread switching and other details.

In this post, I'll briefly outline the code changes I made to generate the traces. Before going any further, you should probably first view the original presentation. Also, as a disclaimer, none of these changes are easily packaged into a neat "patch" that one can simply download and install into any Python distribution. So, to start, you should first go download a Python source distribution for the version of Python you want to experiment with. For my talk, I was using Python 2.6.

First, let's talk about a major issue--any investigation of threads at a low-level (especially thread scheduling) tends to be a rather tricky affair involving some kind of computer science variant of the uncertainty principle. That is, once you start trying to observe thread behavior, you run the risk of changing the very thing you're trying to observe. The problem gets worse if you add a lot of extra complexity--especially if there are extra system calls or I/O. So, a major underlying concern was to try and devise a technique for recording thread behavior in a minimally invasive manner (as an aside, I considered the idea of trying to use dtrace for this, but decided that it would take longer for me to learn dtrace than it would to simply make a few minor modifications to the interpreter).

Step 1: Defining time

Everything that happens inside the Python interpreter is focused around the concept of "ticks." Each tick loosely corresponds to a single instruction in the virtual machine. Locate the file Python/ceval.c in the Python source code. In this file, you will find a global variable _Py_Ticker holding the tick counter. Here's what the code looks like:

/* ceval.c */
...
int _Py_CheckInterval = 100;
volatile int _Py_Ticker = 0; /* so that we hit a "tick" first thing */
...

Add a new variable declaration _Py_Ticker_Counter to this code so that it looks like this:

/* ceval.c */
...
int _Py_CheckInterval = 100;
volatile int _Py_Ticker = 0; /* so that we hit a "tick" first thing */
volatile int _Py_Ticker_Count = 0;
...

Later in the same file, you will find code that decrements the value of _Py_Ticker. Modify this code so that each time _Py_Ticker reaches 0, the value of _Py_Ticker_Count is incremented. Here's what it looks like:

/* ceval.c */
...
  if (--_Py_Ticker < 0) {
   if (*next_instr == SETUP_FINALLY) {
    /* Make the last opcode before
       a try: finally: block uninterruptable. */
    goto fast_next_opcode;
   }
   _Py_Ticker = _Py_CheckInterval;
   _Py_Ticker_Count++; 
   tstate->tick_counter++;
...

The _Py_Ticker_Count and _Py_Ticker variables together define a kind of internal clock. _Py_Ticker is a countdown to the next time the interpreter might thread-switch. The _Py_Ticker_Count keeps track of how many times the interpreter has actually signaled the operating system to schedule waiting threads (if any). In the traces that follow, these two values are used together to record the sequence of events that occur in terms of interpreter ticks.

Step 2 : Recording Trace Data

Python defines a general purpose lock object that is used for both the GIL and locking primitives in the threading modules. On Unix systems using pthreads, the implementation of the lock can be found in the file Python/thread_pthread.h. In that file, there are two functions that we are going to modify: PyThread_acquire_lock() and PyThread_release_lock().

Here's the general idea : The lock/unlock functions are instrumented to record a large in-memory trace of lock-related events. These include lock entry (when a thread first tries to acquire a lock), busy (when the lock is busy), retry (a repeated failed attempt to acquire a lock), acquire (lock successfully acquired), and release (lock released). In addition to events, the trace records current values of the _Py_Ticker and _Py_Ticker_Count variables as well as the pointer to the currently executing thread.

All trace data is stored entirely in memory as programs execute. The size of the history can be controlled with a macro in the code. To dump the trace, a function print_history() is registered to execute on interpreter exit using the atexit() call. It is important to emphasize that no I/O occurs as programs are executing--traces are only dumped on program exit.

Here a copy of the modified code. Be aware that thread_pthread.h is a bit of a mess and that there are a few different implementations of locks. This code is meant to go in the non-semaphore implemention of locks. Further discussion appears afterwards

/* thread_pthread.h */
...
/* Thread lock monitoring modifications (beazley) */

#include <sys/resource.h>
#include <sched.h>

#define MAXHISTORY 5000000
static int           thread_history[MAXHISTORY];
static unsigned char tick_history[MAXHISTORY];
static int           tick_count_history[MAXHISTORY];
static unsigned char tick_acquire[MAXHISTORY];
static double        time_history[MAXHISTORY];
static unsigned int  history_count = 0;

#define EVENT_ENTRY   0
#define EVENT_BUSY    1
#define EVENT_RETRY   2
#define EVENT_ACQUIRE 3
#define EVENT_RELEASE 4

static char *_codes[] = {"ENTRY","BUSY","RETRY","ACQUIRE","RELEASE" };

static void print_history(void) {
 int i;
 FILE *f;

 f = fopen("tickhistory.txt","w");
 for (i = 0; i < history_count; i++) {
   fprintf(f,"%x %d %d %s %0.6f\n",thread_history[i],tick_history[i],tick_count_history[i],_codes[tick_acquire[i]],time_history[i]);
 }
 fclose(f);
}

/* External variables recorded in the history */
extern volatile int _Py_Ticker;
extern volatile int _Py_Ticker_Count;


int
PyThread_acquire_lock(PyThread_type_lock lock, int waitflag)
{
 int success;
 pthread_lock *thelock = (pthread_lock *)lock;
 int status, error = 0;
 int start_thread = 0;

 if (history_count == 0) {
   atexit(print_history);
 }

 dprintf(("PyThread_acquire_lock(%p, %d) called\n", lock, waitflag));

 status = pthread_mutex_lock( &thelock->mut );

 /* Record information in the log */
 start_thread = (int) pthread_self(); 
 if (history_count < MAXHISTORY) {
   thread_history[history_count] = start_thread;
   tick_history[history_count] = _Py_Ticker;
   tick_count_history[history_count] = _Py_Ticker_Count;
   time_history[history_count] = 0.0;
   tick_acquire[history_count++] = EVENT_ENTRY;
 }

 CHECK_STATUS("pthread_mutex_lock[1]");
 success = thelock->locked == 0;

 if ( !success && waitflag ) {

   int ntries = 0;
  /* continue trying until we get the lock */

  /* mut must be locked by me -- part of the condition
   * protocol */

  while ( thelock->locked ) {
    if (ntries == 0) {
      if (history_count < MAXHISTORY) {
        thread_history[history_count] = start_thread;
        tick_history[history_count] = _Py_Ticker;
        tick_count_history[history_count] = _Py_Ticker_Count;
        time_history[history_count] = 0.0;
        tick_acquire[history_count++] = EVENT_BUSY;
      }
    }

   status = pthread_cond_wait(&thelock->lock_released,
         &thelock->mut);
   CHECK_STATUS("pthread_cond_wait");
   if (thelock->locked) {
     if (history_count < MAXHISTORY) {
       thread_history[history_count] = start_thread;
       tick_history[history_count] = _Py_Ticker;
       tick_count_history[history_count] = _Py_Ticker_Count;
       time_history[history_count] = 0.0;
       tick_acquire[history_count++] = EVENT_RETRY;
       ntries += 1;
     }
   } else {
     if (history_count < MAXHISTORY) {
       thread_history[history_count] = start_thread;
       tick_history[history_count] = _Py_Ticker;
       tick_count_history[history_count] = _Py_Ticker_Count;
       {
         struct timeval t;
#ifdef GETTIMEOFDAY_NO_TZ
         if (gettimeofday(&t) == 0)
    time_history[history_count] = (double)t.tv_sec + t.tv_usec*0.000001;
#else /* !GETTIMEOFDAY_NO_TZ */
         if (gettimeofday(&t, (struct timezone *)NULL) == 0)
    time_history[history_count] = (double)t.tv_sec + t.tv_usec*0.000001;
#endif /* !GETTIMEOFDAY_NO_TZ */
       }
       tick_acquire[history_count++] = EVENT_ACQUIRE;
     }
   }

  }
  success = 1;
 } else {
   if (history_count < MAXHISTORY) {
     thread_history[history_count] = start_thread;
     tick_history[history_count] = _Py_Ticker;
     tick_count_history[history_count] = _Py_Ticker_Count;
     time_history[history_count] = 0.0;
     tick_acquire[history_count++] = EVENT_ACQUIRE;
   }
 }
 if (success) thelock->locked = 1;
 status = pthread_mutex_unlock( &thelock->mut );
 CHECK_STATUS("pthread_mutex_unlock[1]");

 if (error) success = 0;
 dprintf(("PyThread_acquire_lock(%p, %d) -> %d\n", lock, waitflag, success));
 return success;
}

void
PyThread_release_lock(PyThread_type_lock lock)
{
 pthread_lock *thelock = (pthread_lock *)lock;
 int status, error = 0;

 dprintf(("PyThread_release_lock(%p) called\n", lock));

 status = pthread_mutex_lock( &thelock->mut );
 CHECK_STATUS("pthread_mutex_lock[3]");
 
 if (history_count < MAXHISTORY) {
   thread_history[history_count] = (int) pthread_self();
   tick_history[history_count] = _Py_Ticker;
   tick_count_history[history_count] = _Py_Ticker_Count;
   tick_acquire[history_count++] = EVENT_RELEASE;
 }

 thelock->locked = 0;

 status = pthread_mutex_unlock( &thelock->mut );
 CHECK_STATUS("pthread_mutex_unlock[3]");

 /* wake up someone (anyone, if any) waiting on the lock */
 status = pthread_cond_signal( &thelock->lock_released );
 CHECK_STATUS("pthread_cond_signal");
}

Step 3 : Rebuilding and Running Python

Once you have made the above changes, rebuild the Python interpreter and run it on some sample code. The code should run the same as before, but on program exit, you will get get a huge data file tickhistory.txt dumped into the current working directory. The contents of this file are going to look something like this:

a0811720 8 1299 RELEASE 0.000000
a0811720 15 1302 ENTRY 0.000000
a0811720 15 1302 ACQUIRE 0.000000
a0811720 10 1302 ENTRY 0.000000
a0811720 10 1302 ACQUIRE 0.000000
a0811720 10 1302 RELEASE 0.000000
a0811720 7 1302 ENTRY 0.000000
a0811720 7 1302 ACQUIRE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 ACQUIRE 0.000000
b0081000 7 1302 RELEASE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 ACQUIRE 0.000000
b0081000 7 1302 RELEASE 0.000000
b0081000 7 1302 ENTRY 0.000000
b0081000 7 1302 BUSY 0.000000
a0811720 1 1302 RELEASE 0.000000
a0811720 1 1302 ENTRY 0.000000
a0811720 1 1302 ACQUIRE 0.000000
a0811720 1 1302 ENTRY 0.000000
a0811720 1 1302 ACQUIRE 0.000000
a0811720 100 1303 RELEASE 0.000000
a0811720 100 1303 ENTRY 0.000000
a0811720 100 1303 ACQUIRE 0.000000
a0811720 92 1303 RELEASE 0.000000
a0811720 92 1303 ENTRY 0.000000
a0811720 92 1303 ACQUIRE 0.000000
a0811720 92 1303 ENTRY 0.000000
a0811720 92 1303 ACQUIRE 0.000000
...

Be forewarned--the size of this file can be substantial. Running a threaded program for even 10-20 seconds might generate a trace file that contains 3-4 million events. To do any kind of analysis on it, you'll probably want to do what everyone normally does and write a Python script.

Discussion

Interpreting the contents of the trace file are left as an exercise for the reader. However, here are few tips. First, the normal sequence of lock acquisition and release on the GIL with a CPU-bound thread looks something like this (notice that the _Py_Ticker value in the 2nd column is always 100 and that the lock goes through a repeated ENTRY->ACQUIRE->RELEASE cycle):

a000d000 100 3570 ENTRY 0.000000
a000d000 100 3570 ACQUIRE 0.000000
a000d000 100 3571 RELEASE 0.000000
a000d000 100 3571 ENTRY 0.000000
a000d000 100 3571 ACQUIRE 0.000000
a000d000 100 3572 RELEASE 0.000000
a000d000 100 3572 ENTRY 0.000000
a000d000 100 3572 ACQUIRE 0.000000
a000d000 100 3573 RELEASE 0.000000
...

If you're looking at thread contention, you're going to see a trace that has an event series of ENTRY->BUSY->RETRY->...->RETRY->ACQUIRE->RELEASE like this:

a000d000 48 4794 ENTRY 0.000000
a000d000 48 4794 BUSY 0.000000
7091800 32 4794 RELEASE 0.000000
7069a00 32 4794 ACQUIRE 1251397338.473370
7091800 32 4794 ENTRY 0.000000
7091800 32 4794 BUSY 0.000000
a000d000 32 4794 RETRY 0.000000
7069a00 100 4795 RELEASE 0.000000
7069a00 100 4795 ENTRY 0.000000
7069a00 100 4795 ACQUIRE 0.000000
a000d000 66 4795 RETRY 0.000000
7069a00 100 4796 RELEASE 0.000000
7069a00 100 4796 ENTRY 0.000000
7069a00 100 4796 ACQUIRE 0.000000
a000d000 95 4796 RETRY 0.000000
7069a00 100 4797 RELEASE 0.000000
7069a00 100 4797 ENTRY 0.000000
7069a00 100 4797 ACQUIRE 0.000000
...
a000d000 100 5083 ACQUIRE 1251397338.478188
...

Here are some other notes concerning its analysis:

The first column is the hex memory address of a lock object. If you run the program on a threaded program that is using many different locks, you will be tracing not only the GIL, but every lock in the program. You might be able to use this to investigate lock contention.
The GIL is not specifically identified in the trace file. However, it will be one of the first locks used.
The last column of the trace file is a system timer that is only recorded when locks are acquired after repeated failed acquisition attempts. At some point, I was using this to investigate some issues related to response times, but to be honest, I didn't spend much time exploring that angle. It might be useful if you want to get an idea for how long each thread runs before giving up control. Of course, you may just want to comment that code out.

Other Comments

Since giving the presentation, I've received a few comments through email offering suggestions for a GIL fix. I stand by my earlier assertion that there is no easy fix for the problem described in the presentation. Here are some specific suggestions followed by my response:

"Perhaps the GIL could be fixed by adding some kind of scheduling queue." If you were to add a scheduling queue to the GIL, you would effectively turn it into a kind of poorly implemented mutex lock. Mutex locks are already implemented (by pthreads and the OS) using queues into order to avoid thread starvation. More details can be found in an operating system textbook. You might also look at the Bakery Algorithm
"Perhaps the GIL could be fixed by simply using a mutex lock." As just mentioned, mutex locks are generally implemented using a queuing mechanism. If you do this, runnable threads will always context switch every 100 interpreter ticks (you'll see the threads cycling in a round-robin manner). This will definitely eliminate the multicore contention problem, but now your programs will perform a tremendous amount of context switching. Also, you might lose the high scheduling priority of I/O bound threads. Needless to say, there are some downsides that need to be considered (just for the record, I think the use of a condition variable in the current implementation is probably the best overall solution for running on a single CPU).
"Could you fix the problem by telling the operating system to schedule all threads on the same core?" Short answer: No. C extensions to Python (and even significant parts of Python itself) often release the GIL by design so that they can run concurrently while carrying out work that doesn't directly involve the Python interpreter. If you force everything to one core, you will most likely make these programs run worse, not better.

Final Words

As mentioned in the presentation, deep exploration of the Python GIL is not a project I'm actively working on. In fact, all of this was really just an exploration to find out how the GIL works and to see if I could track down pathological performance for a certain test case on my Mac. Feel free to take this code and hack it in any way that you wish. If it proves to be useful, just give me an acknowledgment when you give your PyCon presentation. Have fun!

Labels: gil, python, threads

posted by Dave Beazley # 5:39 AM

<< Home

Dabeaz

Thursday, August 27, 2009

Inside the "Inside the Python GIL" Presentation

Archives