On June 11, 2009 I gave a presentation about the inner workings of the Python GIL at the Chicago Python user group meeting. To be honest, I always expected the event to be a pretty low-key affair involving some local Python hackers and some beers. However, the presentation went a little viral and I've received a number of requests to get the code modifications I made to investigate thread behavior--especially the traces that show thread switching and other details.
In this post, I'll briefly outline the code changes I made to generate the traces. Before going any further, you should probably first view the original presentation. Also, as a disclaimer, none of these changes are easily packaged into a neat "patch" that one can simply download and install into any Python distribution. So, to start, you should first go download a Python source distribution for the version of Python you want to experiment with. For my talk, I was using Python 2.6.
First, let's talk about a major issue--any investigation of threads at a low-level (especially thread scheduling) tends to be a rather tricky affair involving some kind of computer science variant of the uncertainty principle. That is, once you start trying to observe thread behavior, you run the risk of changing the very thing you're trying to observe. The problem gets worse if you add a lot of extra complexity--especially if there are extra system calls or I/O. So, a major underlying concern was to try and devise a technique for recording thread behavior in a minimally invasive manner (as an aside, I considered the idea of trying to use dtrace for this, but decided that it would take longer for me to learn dtrace than it would to simply make a few minor modifications to the interpreter).
Step 1: Defining time
Everything that happens inside the Python interpreter is focused around the concept of "ticks." Each tick loosely corresponds to a single instruction in the virtual machine. Locate the file Python/ceval.c in the Python source code. In this file, you will find a global variable _Py_Ticker holding the tick counter. Here's what the code looks like:
/* ceval.c */ ... int _Py_CheckInterval = 100; volatile int _Py_Ticker = 0; /* so that we hit a "tick" first thing */ ...
Add a new variable declaration _Py_Ticker_Counter to this code so that it looks like this:
/* ceval.c */
...
int _Py_CheckInterval = 100;
volatile int _Py_Ticker = 0; /* so that we hit a "tick" first thing */
volatile int _Py_Ticker_Count = 0;
...
Later in the same file, you will find code that decrements the value of _Py_Ticker. Modify this code so that each time _Py_Ticker reaches 0, the value of _Py_Ticker_Count is incremented. Here's what it looks like:
/* ceval.c */ ... if (--_Py_Ticker < 0) { if (*next_instr == SETUP_FINALLY) { /* Make the last opcode before a try: finally: block uninterruptable. */ goto fast_next_opcode; } _Py_Ticker = _Py_CheckInterval; _Py_Ticker_Count++; tstate->tick_counter++; ...
The _Py_Ticker_Count and _Py_Ticker variables together define a kind of internal clock. _Py_Ticker is a countdown to the next time the interpreter might thread-switch. The _Py_Ticker_Count keeps track of how many times the interpreter has actually signaled the operating system to schedule waiting threads (if any). In the traces that follow, these two values are used together to record the sequence of events that occur in terms of interpreter ticks.
Step 2 : Recording Trace Data
Python defines a general purpose lock object that is used for both the GIL and locking primitives in the threading modules. On Unix systems using pthreads, the implementation of the lock can be found in the file Python/thread_pthread.h. In that file, there are two functions that we are going to modify: PyThread_acquire_lock() and PyThread_release_lock().
Here's the general idea : The lock/unlock functions are instrumented to record a large in-memory trace of lock-related events. These include lock entry (when a thread first tries to acquire a lock), busy (when the lock is busy), retry (a repeated failed attempt to acquire a lock), acquire (lock successfully acquired), and release (lock released). In addition to events, the trace records current values of the _Py_Ticker and _Py_Ticker_Count variables as well as the pointer to the currently executing thread.
All trace data is stored entirely in memory as programs execute. The size of the history can be controlled with a macro in the code. To dump the trace, a function print_history() is registered to execute on interpreter exit using the atexit() call. It is important to emphasize that no I/O occurs as programs are executing--traces are only dumped on program exit.
Here a copy of the modified code. Be aware that thread_pthread.h is a bit of a mess and that there are a few different implementations of locks. This code is meant to go in the non-semaphore implemention of locks. Further discussion appears afterwards
/* thread_pthread.h */ ... /* Thread lock monitoring modifications (beazley) */ #include <sys/resource.h> #include <sched.h> #define MAXHISTORY 5000000 static int thread_history[MAXHISTORY]; static unsigned char tick_history[MAXHISTORY]; static int tick_count_history[MAXHISTORY]; static unsigned char tick_acquire[MAXHISTORY]; static double time_history[MAXHISTORY]; static unsigned int history_count = 0; #define EVENT_ENTRY 0 #define EVENT_BUSY 1 #define EVENT_RETRY 2 #define EVENT_ACQUIRE 3 #define EVENT_RELEASE 4 static char *_codes[] = {"ENTRY","BUSY","RETRY","ACQUIRE","RELEASE" }; static void print_history(void) { int i; FILE *f; f = fopen("tickhistory.txt","w"); for (i = 0; i < history_count; i++) { fprintf(f,"%x %d %d %s %0.6f\n",thread_history[i],tick_history[i],tick_count_history[i],_codes[tick_acquire[i]],time_history[i]); } fclose(f); } /* External variables recorded in the history */ extern volatile int _Py_Ticker; extern volatile int _Py_Ticker_Count; int PyThread_acquire_lock(PyThread_type_lock lock, int waitflag) { int success; pthread_lock *thelock = (pthread_lock *)lock; int status, error = 0; int start_thread = 0; if (history_count == 0) { atexit(print_history); } dprintf(("PyThread_acquire_lock(%p, %d) called\n", lock, waitflag)); status = pthread_mutex_lock( &thelock->mut ); /* Record information in the log */ start_thread = (int) pthread_self(); if (history_count < MAXHISTORY) { thread_history[history_count] = start_thread; tick_history[history_count] = _Py_Ticker; tick_count_history[history_count] = _Py_Ticker_Count; time_history[history_count] = 0.0; tick_acquire[history_count++] = EVENT_ENTRY; } CHECK_STATUS("pthread_mutex_lock[1]"); success = thelock->locked == 0; if ( !success && waitflag ) { int ntries = 0; /* continue trying until we get the lock */ /* mut must be locked by me -- part of the condition * protocol */ while ( thelock->locked ) { if (ntries == 0) { if (history_count < MAXHISTORY) { thread_history[history_count] = start_thread; tick_history[history_count] = _Py_Ticker; tick_count_history[history_count] = _Py_Ticker_Count; time_history[history_count] = 0.0; tick_acquire[history_count++] = EVENT_BUSY; } } status = pthread_cond_wait(&thelock->lock_released, &thelock->mut); CHECK_STATUS("pthread_cond_wait"); if (thelock->locked) { if (history_count < MAXHISTORY) { thread_history[history_count] = start_thread; tick_history[history_count] = _Py_Ticker; tick_count_history[history_count] = _Py_Ticker_Count; time_history[history_count] = 0.0; tick_acquire[history_count++] = EVENT_RETRY; ntries += 1; } } else { if (history_count < MAXHISTORY) { thread_history[history_count] = start_thread; tick_history[history_count] = _Py_Ticker; tick_count_history[history_count] = _Py_Ticker_Count; { struct timeval t; #ifdef GETTIMEOFDAY_NO_TZ if (gettimeofday(&t) == 0) time_history[history_count] = (double)t.tv_sec + t.tv_usec*0.000001; #else /* !GETTIMEOFDAY_NO_TZ */ if (gettimeofday(&t, (struct timezone *)NULL) == 0) time_history[history_count] = (double)t.tv_sec + t.tv_usec*0.000001; #endif /* !GETTIMEOFDAY_NO_TZ */ } tick_acquire[history_count++] = EVENT_ACQUIRE; } } } success = 1; } else { if (history_count < MAXHISTORY) { thread_history[history_count] = start_thread; tick_history[history_count] = _Py_Ticker; tick_count_history[history_count] = _Py_Ticker_Count; time_history[history_count] = 0.0; tick_acquire[history_count++] = EVENT_ACQUIRE; } } if (success) thelock->locked = 1; status = pthread_mutex_unlock( &thelock->mut ); CHECK_STATUS("pthread_mutex_unlock[1]"); if (error) success = 0; dprintf(("PyThread_acquire_lock(%p, %d) -> %d\n", lock, waitflag, success)); return success; } void PyThread_release_lock(PyThread_type_lock lock) { pthread_lock *thelock = (pthread_lock *)lock; int status, error = 0; dprintf(("PyThread_release_lock(%p) called\n", lock)); status = pthread_mutex_lock( &thelock->mut ); CHECK_STATUS("pthread_mutex_lock[3]"); if (history_count < MAXHISTORY) { thread_history[history_count] = (int) pthread_self(); tick_history[history_count] = _Py_Ticker; tick_count_history[history_count] = _Py_Ticker_Count; tick_acquire[history_count++] = EVENT_RELEASE; } thelock->locked = 0; status = pthread_mutex_unlock( &thelock->mut ); CHECK_STATUS("pthread_mutex_unlock[3]"); /* wake up someone (anyone, if any) waiting on the lock */ status = pthread_cond_signal( &thelock->lock_released ); CHECK_STATUS("pthread_cond_signal"); }
Step 3 : Rebuilding and Running Python
Once you have made the above changes, rebuild the Python interpreter and run it on some sample code. The code should run the same as before, but on program exit, you will get get a huge data file tickhistory.txt dumped into the current working directory. The contents of this file are going to look something like this:
a0811720 8 1299 RELEASE 0.000000 a0811720 15 1302 ENTRY 0.000000 a0811720 15 1302 ACQUIRE 0.000000 a0811720 10 1302 ENTRY 0.000000 a0811720 10 1302 ACQUIRE 0.000000 a0811720 10 1302 RELEASE 0.000000 a0811720 7 1302 ENTRY 0.000000 a0811720 7 1302 ACQUIRE 0.000000 b0081000 7 1302 ENTRY 0.000000 b0081000 7 1302 ACQUIRE 0.000000 b0081000 7 1302 RELEASE 0.000000 b0081000 7 1302 ENTRY 0.000000 b0081000 7 1302 ACQUIRE 0.000000 b0081000 7 1302 RELEASE 0.000000 b0081000 7 1302 ENTRY 0.000000 b0081000 7 1302 BUSY 0.000000 a0811720 1 1302 RELEASE 0.000000 a0811720 1 1302 ENTRY 0.000000 a0811720 1 1302 ACQUIRE 0.000000 a0811720 1 1302 ENTRY 0.000000 a0811720 1 1302 ACQUIRE 0.000000 a0811720 100 1303 RELEASE 0.000000 a0811720 100 1303 ENTRY 0.000000 a0811720 100 1303 ACQUIRE 0.000000 a0811720 92 1303 RELEASE 0.000000 a0811720 92 1303 ENTRY 0.000000 a0811720 92 1303 ACQUIRE 0.000000 a0811720 92 1303 ENTRY 0.000000 a0811720 92 1303 ACQUIRE 0.000000 ...
Be forewarned--the size of this file can be substantial. Running a threaded program for even 10-20 seconds might generate a trace file that contains 3-4 million events. To do any kind of analysis on it, you'll probably want to do what everyone normally does and write a Python script.
Discussion
Interpreting the contents of the trace file are left as an exercise for the reader. However, here are few tips. First, the normal sequence of lock acquisition and release on the GIL with a CPU-bound thread looks something like this (notice that the _Py_Ticker value in the 2nd column is always 100 and that the lock goes through a repeated ENTRY->ACQUIRE->RELEASE cycle):
a000d000 100 3570 ENTRY 0.000000 a000d000 100 3570 ACQUIRE 0.000000 a000d000 100 3571 RELEASE 0.000000 a000d000 100 3571 ENTRY 0.000000 a000d000 100 3571 ACQUIRE 0.000000 a000d000 100 3572 RELEASE 0.000000 a000d000 100 3572 ENTRY 0.000000 a000d000 100 3572 ACQUIRE 0.000000 a000d000 100 3573 RELEASE 0.000000 ...
If you're looking at thread contention, you're going to see a trace that has an event series of ENTRY->BUSY->RETRY->...->RETRY->ACQUIRE->RELEASE like this:
a000d000 48 4794 ENTRY 0.000000 a000d000 48 4794 BUSY 0.000000 7091800 32 4794 RELEASE 0.000000 7069a00 32 4794 ACQUIRE 1251397338.473370 7091800 32 4794 ENTRY 0.000000 7091800 32 4794 BUSY 0.000000 a000d000 32 4794 RETRY 0.000000 7069a00 100 4795 RELEASE 0.000000 7069a00 100 4795 ENTRY 0.000000 7069a00 100 4795 ACQUIRE 0.000000 a000d000 66 4795 RETRY 0.000000 7069a00 100 4796 RELEASE 0.000000 7069a00 100 4796 ENTRY 0.000000 7069a00 100 4796 ACQUIRE 0.000000 a000d000 95 4796 RETRY 0.000000 7069a00 100 4797 RELEASE 0.000000 7069a00 100 4797 ENTRY 0.000000 7069a00 100 4797 ACQUIRE 0.000000 ... a000d000 100 5083 ACQUIRE 1251397338.478188 ...
Here are some other notes concerning its analysis:
Other Comments
Since giving the presentation, I've received a few comments through email offering suggestions for a GIL fix. I stand by my earlier assertion that there is no easy fix for the problem described in the presentation. Here are some specific suggestions followed by my response:
Final Words
As mentioned in the presentation, deep exploration of the Python GIL is not a project I'm actively working on. In fact, all of this was really just an exploration to find out how the GIL works and to see if I could track down pathological performance for a certain test case on my Mac. Feel free to take this code and hack it in any way that you wish. If it proves to be useful, just give me an acknowledgment when you give your PyCon presentation. Have fun!
08/01/2009 - 09/01/2009 09/01/2009 - 10/01/2009 10/01/2009 - 11/01/2009 11/01/2009 - 12/01/2009 12/01/2009 - 01/01/2010 01/01/2010 - 02/01/2010 02/01/2010 - 03/01/2010 04/01/2010 - 05/01/2010 05/01/2010 - 06/01/2010 07/01/2010 - 08/01/2010 08/01/2010 - 09/01/2010 09/01/2010 - 10/01/2010 12/01/2010 - 01/01/2011 01/01/2011 - 02/01/2011 02/01/2011 - 03/01/2011 03/01/2011 - 04/01/2011 04/01/2011 - 05/01/2011 05/01/2011 - 06/01/2011 08/01/2011 - 09/01/2011 09/01/2011 - 10/01/2011 12/01/2011 - 01/01/2012 01/01/2012 - 02/01/2012 02/01/2012 - 03/01/2012 03/01/2012 - 04/01/2012 07/01/2012 - 08/01/2012 01/01/2013 - 02/01/2013 03/01/2013 - 04/01/2013 06/01/2014 - 07/01/2014 09/01/2014 - 10/01/2014
Subscribe to Posts [Atom]