Dabeaz

Dave Beazley's mondo computer blog. [ homepage | archive ]

Friday, February 26, 2010

Upcoming Python Training Classes

Please forgive the brief commercial interruption. I'd just like to plug a few of my upcoming Python training classes--yes, if you must know, this is how I pay the bills so that I can spend the rest of my time thinking about the GIL and other diabolical Python-related topics.

New! Python Mastery Bootcamp, April 12-16, 2010 (Atlanta)

First, I'm pleased to announce a brand-new Python course that I'm offering for the first time at Big Nerd Ranch in Atlanta. The Python Mastery Bootcamp might be the ultimate Python tutorial for programmers who already know the basics of Python, but who want to take their understanding of the language to a whole new level. Over the past few years, I have given a number of well-reviewed PyCON tutorials on advanced topics such as Generator Tricks for Systems Programmers, A Curious Course on Coroutines and Concurrency, or most recently Mastering Python 3 I/O. Well, the Mastery Bootcamp is sort of similar except that it lasts 5 days, it covers far more material (network programming, threads, multiprocessing, asynchronous I/O, functional programming, metaprogramming, distributed computing, C extensions, etc.), and it has more hands-on projects that allow the material to be explored in greater depth than at a conference.

The experience at Big Nerd Ranch is quite unique--for 5 days, you will be completely immersed in Python programming without the annoyance of outside distractions. This makes it the perfect environment to interact with other class participants and to really focus on the course material. There's really nothing quite like it in the training world--you won't be disappointed.

March 12,2010 Update! The Mastery Bootcamp is confirmed to run and there are still a few slots available. It's going to be great experience for anyone who wants to learn enough about Python to be dangerous.

Introduction to Python Programming, March 16-18, 2010 (Chicago)

If you're relatively new to Python and want to master the fundamentals, consider coming to my Introduction to Python Programming class in Chicago. This course is aimed at programmers, system administrators, scientists, and engineers who want to apply Python to everyday tasks such as analyzing data files, automating system tasks, scraping web pages, using databases, and more. Through practical examples, you will learn all of the major features of Python including data handling, functions, modules, classes, generators, testing, and more. This is a highly refined class that has been taught for numerous corporate and government clients over the past three years. The class features a 300 page fully indexed course guide and more than 50 hands-on exercises.

My Chicago classes are also taught in a rather unique format. Unlike a typical corporate training course, I conduct the course in a round-table format that is strictly limited to 6 attendees--a size that encourages interaction and allows course topics to be easily customized to your interests. The course is located in Chicago's distinctive Andersonville neighborhood where just steps away, you will find dozens of unique restaurants, bakeries, coffee houses, pubs, and more. You're definitely going to like it!

March 12, 2010 update! The Chicago class is now sold out. However, be on the lookout for its return in a few months.

posted by Dave Beazley # 12:56 PM

Monday, February 22, 2010

Revisiting thread priorities and the new GIL

Well, PyCon is over and it's time to get back to work. First, I'd just like to thank everyone who came to my GIL Talk and participated in all of the discussion that followed. It was almost as if part of PyCon had turned into a mini operating systems conference!

This post is a followup to the GIL open space at PyCon where we looked at the new GIL and explored the possibility of introducing thread priorities. For those of you not at PyCon, the open space was attended by about 30-40 people and included Guido, Antoine Pitrou, and a large number of systems hackers, some of which had previously worked on thread library implementations and operating system kernels.

First, a little background. As might know, Antoine Pitrou implemented a new Python GIL that is currently only available in the Python 3.2 development branch (you can obtain it via subversion). This new GIL is described in his original mailing list post as well as the slides for my PyCon talk. You should read those first if you haven't already.

Right before PyCON, I discovered an I/O performance problem with the new GIL that is related to CPU-bound threads stalling the progress of I/O bound threads which it turn leads to a severe performance degradation of I/O bandwidth and response time. This is described in Issue 7946 : Convoy effect with I/O bound threads and New GIL.

In the bug report, I submitted a very simple test case that illustrated the problem. However, here is a more refined experiment that you can try. The following program, iotest.py contains both CPU-bound threads and an I/O server thread that echos UDP packets. It is meant to study the case in which CPU-processing and I/O processing are overlapped.

# iotest.py

import time
import threading
from socket import *
import itertools

def task_pidigits():
    """Pi calculation (Python)"""
    _map = map
    _count = itertools.count
    _islice = itertools.islice

    def calc_ndigits(n):
        # From http://shootout.alioth.debian.org/
        def gen_x():
            return _map(lambda k: (k, 4*k + 2, 0, 2*k + 1), _count(1))

        def compose(a, b):
            aq, ar, as_, at = a
            bq, br, bs, bt = b
            return (aq * bq,
                    aq * br + ar * bt,
                    as_ * bq + at * bs,
                    as_ * br + at * bt)

        def extract(z, j):
            q, r, s, t = z
            return (q*j + r) // (s*j + t)

        def pi_digits():
            z = (1, 0, 0, 1)
            x = gen_x()
            while 1:
                y = extract(z, 3)
                while y != extract(z, 4):
                    z = compose(z, next(x))
                    y = extract(z, 3)
                z = compose((10, -10*y, 0, 1), z)
                yield y

        return list(_islice(pi_digits(), n))

    return calc_ndigits, (50, )

def spin():
    task,args = task_pidigits()
    while True:
       r= task(*args)

def echo_server():
    s = socket(AF_INET, SOCK_DGRAM)
    s.setsockopt(SOL_SOCKET, SO_REUSEADDR,1)
    s.bind(("",16000))
    while True:
        msg, addr = s.recvfrom(16384)
        s.sendto(msg,addr)  

# Launch threads (adjust the number to see different results)
NUMTHREADS = 1
for n in range(NUMTHREADS):
    t = threading.Thread(target=spin)
    t.daemon = True
    t.start()

# Launch a background echo server
echo_server()

Next, here is a client program ioclient.py that simply measures the time it takes to echo 10MB of data to the server in the iotest.py program.

# echoclient.py
from socket import *
import time

CHUNKSIZE = 8192
NUMMESSAGES = 1280     # Total of 10MB

# Dummy message
msg = b"x"*CHUNKSIZE

# Connect and send messages
s = socket(AF_INET,SOCK_DGRAM)
start = time.time()
for n in range(NUMMESSAGES):
    s.sendto(msg,("",16000))
    msg, addr = s.recvfrom(65536)
end = time.time()
print("%0.3f seconds (%0.3f bytes/sec)" % (end-start, (CHUNKSIZE*NUMMESSAGES)/(end-start)))

If you run iotest.py on a dual-core Macbook with only 1 spin() thread. You get the following result if you run ioclient.py:

Python 3.2 (New GIL) : 9.166 seconds (1143998.140 bytes/sec)

It works, but it's hardly impressive (just barely over 1MB/sec transfer rate between two processes?). However, if you make the server have two spin() threads, the performance gets much worse:

Python 3.2 (New GIL) : 28.064 seconds (373642.858 bytes/sec)

Now to further complicate matters, if you disable all but one of the CPU cores, you get this inexplicable result:

Python 3.2 (New GIL, 1 CPU) : 0.297 seconds (35326299.028 bytes/sec)

Needless to say, there are many bizarre things going on here. The most major effect is that on multiple cores, it is very easy for CPU-bound threads to reacquire the GIL whenever an I/O bound thread performs I/O. This means that CPU-threads have a greater tendency to hog the GIL.

At PyCON, I did some experiments with thread priorities and a modified GIL that adjusted priorities in a manner similar to what you find with multilevel feedback queues in operating systems. Namely:

If a thread is forced to give up the GIL due to a timeout, it is penalized with lower priority.
If a thread voluntarily gives up the GIL because it performed I/O, it is reward with higher priority.
High priority threads always preempty low-priority threads.

The results of this approach were impressive. If you run the same tests with priorities on 2 CPU cores, you get this result:

Python 3.2 (New GIL with priorities), 0.298 seconds (35156921.564 bytes/sec)

The prioritized GIL also gives good performance for Antoine's own ccbench.py benchmark.

New GIL	New GIL with priorities
== CPython 3.2a0.0 (py3k:78250) == == i386 Darwin on 'i386' == --- Throughput --- Pi calculation (Python) threads=1: 873 iterations/s. threads=2: 845 ( 96 %) threads=3: 837 ( 95 %) threads=4: 820 ( 93 %) regular expression (C) threads=1: 348 iterations/s. threads=2: 339 ( 97 %) threads=3: 328 ( 94 %) threads=4: 317 ( 91 %) bz2 compression (C) threads=1: 367 iterations/s. threads=2: 655 ( 178 %) threads=3: 642 ( 174 %) threads=4: 646 ( 175 %) --- Latency --- Background CPU task: Pi calculation (Python) CPU threads=0: 0 ms. (std dev: 0 ms.) CPU threads=1: 5 ms. (std dev: 0 ms.) CPU threads=2: 2 ms. (std dev: 2 ms.) CPU threads=3: 138 ms. (std dev: 100 ms.) CPU threads=4: 132 ms. (std dev: 99 ms.) Background CPU task: regular expression (C) CPU threads=0: 0 ms. (std dev: 0 ms.) CPU threads=1: 6 ms. (std dev: 1 ms.) CPU threads=2: 6 ms. (std dev: 6 ms.) CPU threads=3: 6 ms. (std dev: 4 ms.) CPU threads=4: 10 ms. (std dev: 8 ms.) Background CPU task: bz2 compression (C) CPU threads=0: 0 ms. (std dev: 0 ms.) CPU threads=1: 0 ms. (std dev: 1 ms.) CPU threads=2: 0 ms. (std dev: 0 ms.) CPU threads=3: 0 ms. (std dev: 0 ms.) CPU threads=4: 0 ms. (std dev: 0 ms.)	== CPython 3.2a0.0 (py3k:78215M) == == i386 Darwin on 'i386' == --- Throughput --- Pi calculation (Python) threads=1: 885 iterations/s. threads=2: 860 ( 97 %) threads=3: 869 ( 98 %) threads=4: 859 ( 97 %) regular expression (C) threads=1: 362 iterations/s. threads=2: 358 ( 98 %) threads=3: 349 ( 96 %) threads=4: 354 ( 97 %) bz2 compression (C) threads=1: 373 iterations/s. threads=2: 654 ( 175 %) threads=3: 649 ( 173 %) threads=4: 638 ( 170 %) --- Latency --- Background CPU task: Pi calculation (Python) CPU threads=0: 0 ms. (std dev: 0 ms.) CPU threads=1: 0 ms. (std dev: 0 ms.) CPU threads=2: 0 ms. (std dev: 2 ms.) CPU threads=3: 0 ms. (std dev: 1 ms.) CPU threads=4: 0 ms. (std dev: 1 ms.) Background CPU task: regular expression (C) CPU threads=0: 0 ms. (std dev: 0 ms.) CPU threads=1: 2 ms. (std dev: 1 ms.) CPU threads=2: 3 ms. (std dev: 3 ms.) CPU threads=3: 2 ms. (std dev: 1 ms.) CPU threads=4: 2 ms. (std dev: 2 ms.) Background CPU task: bz2 compression (C) CPU threads=0: 0 ms. (std dev: 0 ms.) CPU threads=1: 0 ms. (std dev: 1 ms.) CPU threads=2: 0 ms. (std dev: 1 ms.) CPU threads=3: 0 ms. (std dev: 1 ms.) CPU threads=4: 0 ms. (std dev: 1 ms.)

New GIL

New GIL with priorities

== CPython 3.2a0.0 (py3k:78250) ==
== i386 Darwin on 'i386' ==

--- Throughput ---

Pi calculation (Python)

threads=1: 873 iterations/s.
threads=2: 845 ( 96 %)
threads=3: 837 ( 95 %)
threads=4: 820 ( 93 %)

regular expression (C)

threads=1: 348 iterations/s.
threads=2: 339 ( 97 %)
threads=3: 328 ( 94 %)
threads=4: 317 ( 91 %)

bz2 compression (C)

threads=1: 367 iterations/s.
threads=2: 655 ( 178 %)
threads=3: 642 ( 174 %)
threads=4: 646 ( 175 %)

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 5 ms. (std dev: 0 ms.)
CPU threads=2: 2 ms. (std dev: 2 ms.)
CPU threads=3: 138 ms. (std dev: 100 ms.)
CPU threads=4: 132 ms. (std dev: 99 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 6 ms. (std dev: 1 ms.)
CPU threads=2: 6 ms. (std dev: 6 ms.)
CPU threads=3: 6 ms. (std dev: 4 ms.)
CPU threads=4: 10 ms. (std dev: 8 ms.)

Background CPU task: bz2 compression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 1 ms.)
CPU threads=2: 0 ms. (std dev: 0 ms.)
CPU threads=3: 0 ms. (std dev: 0 ms.)
CPU threads=4: 0 ms. (std dev: 0 ms.)

== CPython 3.2a0.0 (py3k:78215M) ==
== i386 Darwin on 'i386' ==

--- Throughput ---

Pi calculation (Python)

threads=1: 885 iterations/s.
threads=2: 860 ( 97 %)
threads=3: 869 ( 98 %)
threads=4: 859 ( 97 %)

regular expression (C)

threads=1: 362 iterations/s.
threads=2: 358 ( 98 %)
threads=3: 349 ( 96 %)
threads=4: 354 ( 97 %)

bz2 compression (C)

threads=1: 373 iterations/s.
threads=2: 654 ( 175 %)
threads=3: 649 ( 173 %)
threads=4: 638 ( 170 %)

--- Latency ---

Background CPU task: Pi calculation (Python)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 0 ms.)
CPU threads=2: 0 ms. (std dev: 2 ms.)
CPU threads=3: 0 ms. (std dev: 1 ms.)
CPU threads=4: 0 ms. (std dev: 1 ms.)

Background CPU task: regular expression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 2 ms. (std dev: 1 ms.)
CPU threads=2: 3 ms. (std dev: 3 ms.)
CPU threads=3: 2 ms. (std dev: 1 ms.)
CPU threads=4: 2 ms. (std dev: 2 ms.)

Background CPU task: bz2 compression (C)

CPU threads=0: 0 ms. (std dev: 0 ms.)
CPU threads=1: 0 ms. (std dev: 1 ms.)
CPU threads=2: 0 ms. (std dev: 1 ms.)
CPU threads=3: 0 ms. (std dev: 1 ms.)
CPU threads=4: 0 ms. (std dev: 1 ms.)

The overall outcome of the GIL open space was that having a priority mechanism was probably a good idea. However, a lot of people wanted to study the problem in more detail and to think about different possible implementations. I am posting the following tar file that has my own modifications to the GIL used for the above benchmarks:

prioritygil.tar.gz

Note: This tar file has all of the modified files in the Python 3.2 source (pystate.h, pystate.c, and ceval_gil.h) along with the io testing benchmark. Be advised that this patch is only intended for further study by others---it's kind of hacked together and really only a proof of concept implementation of one possible priority scheme. A real implementation would still need to address some issues not covered in my patch (e.g., starvation effects).

Due to other time commitments, I'm not going to be able to do much followup with this patch at this moment. However, I do want to encourage others to at least consider the benefit of introducing thread priorities and to explore different possible implementations. Initial results seem to indicate that this can fix the GIL for both CPU-bound threads and for
I/O performance.

posted by Dave Beazley # 1:47 PM

Tuesday, February 02, 2010

A function that works as a context manager and a decorator

As a followup to my last blog post on timings, I present the following function which works as both a decorator and a context manager.

# timethis.py
import time
from contextlib import contextmanager

def timethis(what):
    @contextmanager
    def benchmark():
        start = time.time()
        yield
        end = time.time()
        print("%s : %0.3f seconds" % (what, end-start))
    if hasattr(what,"__call__"):
        def timed(*args,**kwargs):
            with benchmark():
                return what(*args,**kwargs)
        return timed
    else:
        return benchmark()

Here is a short demonstration of how it works:

# Usage as a context manager
with timethis("iterate by lines (UTF-8)"):
     for line in open("biglog.txt",encoding='utf-8'):
          pass

# Usage as a decorator
@timethis
def iterate_by_lines_latin_1():
    for line in open("biglog.txt",encoding='latin-1'):
        pass

iterate_by_lines_latin_1()

If you run it, you'll get output like this:

bash % python3 timethis.py
iterate by lines (UTF-8) : 3.762 seconds
<function iterate_by_lines_latin_1 at 0x100537958> : 3.513 seconds

Naturally, this bit of code would be a good thing to bring into your next code review just to make sure people are actually paying attention.

posted by Dave Beazley # 6:12 PM

A Context Manager for Timing Benchmarks

I spend a lot of time studying different aspects of Python, different implementation techniques, and so forth. As part of that, I often carry out little performance benchmarks. For small things, I'll often use the timeit module. For example:

>>> from timeit import timeit
>>> timeit("math.sin(2)","import math")
0.29826998710632324
>>> timeit("sin(2)","from math import sin")
0.21983098983764648
>>>

However, for larger blocks of code, I tend to just use the time module directly like this:

import time
start = time.time()
...
... some big calculation
...
end = time.time()
print("Whatever : %0.3f seconds" % (end-start))

Having typed that particular code template more often than I care to admit, it occurred to me that I really ought to just make a context manager for doing it. Something like this:

# benchmark.py
import time
class benchmark(object):
    def __init__(self,name):
        self.name = name
    def __enter__(self):
        self.start = time.time()
    def __exit__(self,ty,val,tb):
        end = time.time()
        print("%s : %0.3f seconds" % (self.name, end-self.start))
        return False

Now, I can just use that context manager whenever I want to do that kind of timing benchmark. For example:

# fileitertest.py
from benchmark import benchmark

with benchmark("iterate by lines (UTF-8)"):
     for line in open("biglog.txt",encoding='utf-8'):
          pass

with benchmark("iterate by lines (Latin-1)"):
     for line in open("biglog.txt",encoding='latin-1'):
         pass

with benchmark("iterate by lines (Binary)"):
     for line in open("biglog.txt","rb"):
         pass

If you run it, you might get output like this:

bash % python3 fileitertest.py
iterate by lines (UTF-8) : 3.903 seconds
iterate by lines (Latin-1) : 3.615 seconds
iterate by lines (Binary) : 1.886 seconds

Nice. I like it already!

posted by Dave Beazley # 5:01 AM