Dabeaz

Dave Beazley's mondo computer blog. [ homepage | archive ]

Thursday, January 28, 2010

A few useful bytearray tricks

When I first saw the new Python 3 bytearray object (also back-ported to Python 2.6), I wasn't exactly sure what to make of it. On the surface, it seemed like a kind of mutable 8-bit string (a feature sometimes requested by users of Python 2). For example:

>>> s = bytearray(b"Hello World")
>>> s[:5] = b"Cruel"
>>> s
bytearray(b'Cruel World')
>>>

On the other hand, there are aspects of bytearray objects that are completely unlike a string. For example, if you iterate over a bytearray, you get integer byte values:

>>> s = bytearray(b"Hello World")
>>> for c in s: print(c)
...
72
101
108
108
111
32
87
111
114
108
100
>>>

Similarly, indexing operations are tied to integers:

>>> s[1]
101
>>> s[1] = 97
>>> s[1] = b'a'
Traceback (most recent call last):
  File "", line 1, in 
TypeError: an integer is required
>>>

Finally, there's the fact bytearray instances have most of the methods associated with strings as well as methods associated with lists. For example:

>>> s.split()
[bytearray(b'Hello'), bytearray(b'World')]
>>> s.append(33)
>>> s
bytearray(b'Hello World!')
>>>

Although documentation on bytearrays describes these features, it is a little light on meaningful use cases. Needless to say, if you have too much spare time (sic) on your hands, this is the kind of thing that you start to think about. So, I thought I'd share three practical uses of bytearrays.

Example 1: Assembling a message from fragments

Suppose you're writing some network code that is receiving a large message on a socket connection. If you know about sockets, you know that the recv() operation doesn't wait for all of the data to arrive. Instead, it merely returns what's currently available in the system buffers. Therefore, to get all of the data, you might write code that looks like this:

# remaining = number of bytes being received (determined already)
msg = b""
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msg += chunk                 # Add it to the message
    remaining -= len(chunk)

The only problem with this code is that concatenation (+=) has horrible performance. Therefore, a common performance optimization in Python 2 is to collect all of the chunks in a list and perform a join when you're done. Like this:

# remaining = number of bytes being received (determined already)
msgparts = []
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msgparts.append(chunk)       # Add it to list of chunks
    remaining -= len(chunk)  
msg = b"".join(msgparts)          # Make the final message

Now, here's a third solution using a bytearray:

# remaining = number of bytes being received (determined already)
msg = bytearray()
while remaining > 0:
    chunk = s.recv(remaining)    # Get available data
    msg.extend(chunk)            # Add to message
    remaining -= len(chunk)

Notice how the bytearray version is really clean. You don't collect parts in a list and you don't perform that cryptic join at the end. Nice.

Of course, the big question is whether or not it performs. To test this out, I first made a list of small byte fragments like this:

chunks = [b"x"*16]*512

I then used the timeit module to compare the following two code fragments:

# Version 1
msgparts = []
for chunk in chunks:
    msgparts.append(chunk)
msg = b"".join(msgparts)

# Version 2
msg = bytearray()
for chunk in chunks:
    msg.extend(chunk)

When tested, version 1 of the code ran in 99.8s whereas version 2 ran in 116.6s (a version using += concatenation takes 230.3s by comparison). So while performing a join operation is still faster, it's only faster by about 16%. Personally, I think the cleaner programming of the bytearray version might make up for it.

Example 2: Binary record packing

This example is an slight twist on the last example. Support you had a large Python list of integer (x,y) coordinates. Something like this:

points = [(1,2),(3,4),(9,10),(23,14),(50,90),...]

Now, suppose you need to write that data out as a binary encoded file consisting of a 32-bit integer length followed by each point packed into a pair of 32-bit integers. One way to do it would be to use the struct module like this:

import struct
f = open("points.bin","wb")
f.write(struct.pack("I",len(points)))
for x,y in points:
    f.write(struct.pack("II",x,y))
f.close()

The only problem with this code is that it performs a large number of small write() operations. An alternative approach is to pack everything into a bytearray and only perform one write at the end. For example:

import struct
f = open("points.bin","wb")
msg = bytearray()
msg.extend(struct.pack("I",len(points))
for x,y in points:
    msg.extend(struct.pack("II",x,y))
f.write(msg)
f.close()

Sure enough, the version that uses bytearray runs much faster. In a simple timing test involving a list of 100000 points, it runs in about half the time as the version that makes a lot of small writes.

Example 3: Mathematical processing of byte values

The fact that bytearrays present themselves as arrays of integers makes it easier to perform certain kinds of calculations. In a recent embedded systems project, I was using Python to communicate with a device over a serial port. As part of the communications protocol, all messages had to be signed with a Longitudinal Redundancy Check (LRC) byte. An LRC is computed by taking an XOR across all of the byte values.

Bytearrays make such calculations easy. Here's one version:

message = bytearray(...)     # Message already created
lrc = 0
for b in message:
    lrc ^= b
message.append(lrc)          # Add to the end of the message

Here's a version that increases your job security:

message.append(functools.reduce(lambda x,y:x^y,message))

And here's the same calculation in Python 2 without bytearrays:

message = "..."       # Message already created
lrc = 0
for b in message:
    lrc ^= ord(b)
message += chr(lrc)        # Add the LRC byte

Personally, I like the bytearray version. There's no need to use ord() and you can just append the result at the end of the message instead of using concatenation.

Here's another cute example. Suppose you wanted to run a bytearray through a simple XOR-cipher. Here's a one-liner to do it:

>>> key = 37
>>> message = bytearray(b"Hello World")
>>> s = bytearray(x ^ key for x in message)
>>> s
bytearray(b'm@IIJ\x05rJWIA')
>>> bytearray(x ^ key for x in s)
bytearray(b"Hello World")
>>>

Final Comments

Although some programmers might focus on bytearrays as a kind of mutable string, I find their use as an efficient means for assembling messages from fragments to be much more interesting. That's because this kind of problem comes up a lot in the context of interprocess communication, networking, distributed computing, and other related areas. Thus, if you know about bytearrays, it might lead to code that has good performance and is easy to understand.

That's it for this installment. In case you're wondering, this topic is also related to my upcoming PyCON'2010 tutorial "Mastering Python 3 I/O."

posted by Dave Beazley # 7:02 PM

Tuesday, January 26, 2010

Reexamining Python 3 Text I/O

Note: Since I first posted this, I added a performance test using the Python 2.6.4 codecs module. This addition is highlighted in red.

When Python 3.0 was first released, I tried it out on a few things and walked away unimpressed. By far, the big negative was the horrible I/O performance. For instance, scripts to perform simple data analysis tasks like processing a web server log file were running more than 30 times slower than Python 2. Even though there were many new features of Python 3 to be excited about, the I/O performance alone was enough to make me not want to use it---or recommend it to anyone else for that matter.

Some time has passed since then. For example, Python-3.1.1 is out and many improvements have been made. To force myself to better understand the new Python 3 I/O system, I've been working on a tutorial Mastering Python 3 I/O for the upcoming PyCON'2010 conference in Atlanta. Overall, I have to say that I'm pretty impressed with what I've found--and not just in terms of improved performance.

Due to space constraints, I can't talk about everything in my tutorial here. However, I thought I would share some thoughts about text-based I/O in Python 3.1 and discuss a few examples. Just as a disclaimer, I show a few benchmarks, but my intent is not to do a full study of every possible aspect of text I/O handling. I would strongly advise you to download Python 3.1.1 and perform your own tests to get a better feel for it.

Like many people, one of my main uses of Python is data processing and parsing. For example, consider the contents of a typical Apache web server log:

75.54.118.139 - - [24/Feb/2008:00:15:42 -0600] "GET /favicon.ico HTTP/1.1" 404 133
75.54.118.139 - - [24/Feb/2008:00:15:49 -0600] "GET /software.html HTTP/1.1" 200 3163
75.54.118.139 - - [24/Feb/2008:00:16:10 -0600] "GET /ply/index.html HTTP/1.1" 200 8018
213.145.165.82 - - [24/Feb/2008:00:16:19 -0600] "GET /ply/ HTTP/1.1" 200 8018
...

Let's look at a simple script that processes this file. For example, suppose you wanted to produce a list of all URLs that have generated a 404 error. Here's a really simple (albeit hacky) script that does that:

error_404_urls = set()
for line in open("access-log"):
    fields = line.split()
    if fields[-2] == '404':
        error_404_urls.add(fields[-4])

for name in error_404_urls:
    print(name)

On my machine, I have a 325MB log file consisting of 3649000 lines--a perfect candidate for performing a few benchmarks. Here are the numbers that you get running the above script with different Python versions. UCS-2 refers to Python compiled with 16-bit Unicode characters. UCS-4 refers to Python compiled with 32-bit Unicode characters (the --with-wide-unicode configuration option). Also, in the interest of full disclosure, these tests were performed with a warm disk cache on a 2 GHZ Intel Core 2 Duo Apple Macbook with 4GB of memory under OS-X 10.6.2 (Snow Leopard).

Python Version Time (seconds)

2.6.4 7.91s

3.0 125.42s

3.1.1 (UCS-2) 14.11s

3.1.1 (UCs-4) 17.32s

Python Version	Time (seconds)
2.6.4	7.91s
3.0	125.42s
3.1.1 (UCS-2)	14.11s
3.1.1 (UCs-4)	17.32s

As you can see, Python 3.0 performance was an anomaly--the performance of Python 3.1.1 is substantially better. To better understand the I/O component of this script, I ran a modified test with the following code

for line in open("access-log"):
    pass

Here are the performance results for iterating over the file by lines:

Python Version Time (seconds)

2.6.4 1.50s

2.6.4 (codecs, UTF-8) 52.22s

3.0 105.87s

3.1.1 (UCS-2) 4.35s

3.1.1 (UCs-4) 6.11s

Python Version	Time (seconds)
2.6.4	1.50s
2.6.4 (codecs, UTF-8)	52.22s
3.0	105.87s
3.1.1 (UCS-2)	4.35s
3.1.1 (UCs-4)	6.11s

If you look at these numbers, you will see that the I/O performance of Python 3.1 has improved substantially. It is also substantially faster than using the codecs module in Python 2.6. However, you'll also observe that the performance is still quite a bit worse than the native Python 2.6 file object. For example, in the table, iterating over lines is about 3x slower in Python 3.1.1 (UCS-2). How can that be good? That's 300% slower!

Let's talk about the numbers in more detail. The decreased performance in Python 3 is almost solely due to the overhead of the underlying Unicode conversion applied to text input. That conversion process involves two distinct steps:

Input data (bytes) has to be scanned and characters decoded according to some encoding (UTF-8 by default).
The decoded character data has to be stored as an array of multibyte integers that represent the associated string result.

The overhead of decoding is a direct function of how complicated the underlying codec is. Although UTF-8 is relatively simple, it's still more complex than an encoding such as Latin-1. Let's see what happens if we try reading the file with "latin-1" encoding instead. Here's the modified test code:

for line in open("access-log",encoding='latin-1'):
    pass

Here are the modified performance results that show an improvement:

Python Version Time (seconds)

3.1.1 (UCS-2) 3.64s (was 4.35s)

3.1.1 (UCs-4) 5.33s (was 6.11s)

Python Version	Time (seconds)
3.1.1 (UCS-2)	3.64s (was 4.35s)
3.1.1 (UCs-4)	5.33s (was 6.11s)

Lesson learned : The encoding matters. So, if you're working purely with ASCII text, specifying an encoding such as 'latin-1' will speed everything up. Just so you know, if you specify 'ascii' encoding, you get no improvement over UTF-8. This is because 'ascii' requires more work to decode than 'latin-1' (due to an extra check for bytes outside the range 0-127 in the decoding process).

At this point, you're still saying that it's slower. Yes, even with a faster encoding, Python 3.1.1 is still about 2.5x slower than Python 2.6.4 on this simple I/O test. Is there anything that can be done about that?

The short answer is "not really." Since Python 3 strings are Unicode, the process of reading a simple 8-bit text file is always going to involve an extra process of converting and copying the byte-oriented data into the multibyte Unicode representation. Just to give you an idea, let's drop into C programming and consider the following program:

#include <stdio.h>

int main() {
  FILE *f;
  char  bytes[256];

  f = fopen("access-log","r");
  while (fgets(bytes,256,f)) {  // Yes, hacky 
  }
}

This program does nothing more than iterate over lines of a file--think of it as the ultimate stripped down version of our Python-2.6.4 test. If you run it, takes 1.13s to run on the same log file used for our earlier Python tests.

When you go to Python 3, there is always extra conversion. It's like modifying the C program as follows:

#include <stdio.h>

int main() {
  FILE *f;
  char  bytes[256], *c;
  short  unicode[256], *u;

  f = fopen("biglog.txt","r");
  while (fgets(bytes,256,f)) {
    c = bytes;
    u = unicode;
    while (*c) {    /* Convert to Unicode */
      *(u++) = (short) *(c++);
    }
  }
}

Sure enough, if you run this modified C program, it takes about 1.7 seconds--a nearly 50% performance hit just from that extra copying and conversion step. Minimally, Python 3 has to do the same conversion. However, it's also performing dynamic memory allocation, reference counting, and other low-level operations. So, if you factor all of that in, the performance numbers start to make a little more sense. You also start to understand why it might be really hard to do much better.

Now, should you care about all of this? Truthfully, most programs are probably not going to be affected by degraded text I/O performance as much as you think. That's because most interesting programs do far more than just I/O. Go back and consider the original script that I presented. On Python-2.6.4, it took 7.91s to execute. If I go back and tune the script to use the more efficient 'latin-1' encoding, it takes 13.8s with Python-3.1.1. Yes, that's about 1.75x slower than before. However, the key point is that it's not 2.5x slower as our earlier I/O tests would suggest. The performance impact will become less and less as the script performs more non-IO related work.

Finally, let's say that you still can't live with the performance degradation. If you're just working with simple ASCII data files, you might solve this problem by turning to binary I/O instead. For example, the following script variant uses binary I/O and bytes for most of its processing--only converting text to Unicode when absolutely necessary for printing.

error_404_urls = set()
for line in open("access-log","rb"):
    fields = line.split()
    if fields[-2] == b'404':
        error_404_urls.add(fields[-4])

for name in error_404_urls:
    print(name.decode('latin-1'))

If you run this final script, you find that it takes 8.22s in Python 3.1.1--which is only about 4% slower than the Python-2.6.4. How about that!

The bottom line is that Python-3.1 is definitely worth a second look--especially if you tried the earlier Python 3.0 release and were disappointed with its performance. Although text-based I/O is always going to be slower in Python 3 due to extra Unicode processing, it might not matter as much in practice. Plus, binary I/O in Python 3 is still quite fast which means that you can turn to it as a last resort.

If you want to know more, attend my Mastering Python 3 I/O at PyCON'2010 or sign up for the Special Preview in Chicago.

Final Notes:

All versions of Python were compiled from source using the exact same configuration, compiler, and environment settings.
Python timing tests were performed using the time module and enclosing code with these statements:
```
import time
start = time.time()
... statements ...
end = time.time()
print(end-start)
```

posted by Dave Beazley # 8:36 PM

Thursday, January 21, 2010

Slashdot, Pronouns, and the Python Essential Reference

Yesterday, I was ecstatic to see a positive review of my Python Essential Reference book on Slashdot. I've never had a book reviewed on Slashdot before. However, I also know that with Slashdot, one never really knows what direction the subsequent discussion is going to take. For instance, will someone jump in and say something like "in Soviet Russia, Python indents you" or will the conversation devolve into something about how Python programmers will never have a girlfriend? That's not true by the way. I once had a girlfriend who went to hear me talk for 90 minutes about LALR(1) parser generators at a Chipy meeting despite the fact that she didn't know the first thing about programming. That's surely a sign of true love or insanity if there ever was one. Needless to say, I married her. However, I digress.

No, this time around, the Slashdot discussion decided it was going to focus on the use of pronouns--namely in response to a comment that included the sentence "... there is a lot of what a developer needs and very little of what she doesn't need." Now, I am by no means any fan of political correctness, but I had to chuckle at the irony. Of all of the things to discuss about the Python Essential Reference, pronouns would have to rank at about the bottom of the list. This is because the entire book is virtually devoid of personal pronouns. With the exception of the word "you" (e.g., "you type this..."), you won't find "he", "she", "him", "her", "we", or anything like that used anywhere in the text. This was an intentional choice, but it wasn't related to any kind of political influence (in fact, editors of the Essential Reference have often tried to add pronouns like "he" and "she" to the text only to have me take them out again).

First published in 1999, the Essential Reference was actually my second major writing project--the first being my Ph.D. dissertation which had been completed the year before. As you know, writing a dissertation is a pretty major affair. Not only do you have to do original research and defend it, you also have to write a major document describing the results. For a typical graduate student, the dissertation is the most technically demanding document you will ever write. It might even be the first document that you will ever submit to a real-world copy editor--an editor who will very likely tear your precious document to shreds in front of your eyes.

In my case, the final stage of my dissertation involved a somewhat prolonged battle with the dissertation editor at the University of Utah. Upon submitting the document, she would immediately put it under the microscope to see if it met the required "technical specifications." This meant measuring margins, line spacing, tables, figures, and other details with a ruler. Any deviation whatsoever meant instant rejection of the entire document--please play again.

Assuming one could pass the basic technical requirements, the next stage involved a review to see if you were strictly adhering to the required writing "style guide." When submitting a dissertation, you actually had to indicate a specific writing style guide. For example, I said that I was writing the document according to the "Chicago Manual of Style." What this meant in practical terms is that upon submitting the dissertation to the editor, she would read it and return it to you a few days later dripping in a sea of red ink. Every sentence of the document that did not precisely adhere to that style guide would be torn apart. I have to say that in my entire academic and professional career (grade school, high school, college, etc.), I have never had any paper reviewed like that.

Just to give you an example of the agony, if I wrote something like "the data is plotted" (something that sounded perfectly reasonable to me as a programmer) the editor would reject it because "data" is a plural (of datum) and you can't use "is" with a plural (e.g., you would never say "the points is plotted."). The other major source of agony was in the use of pronouns. The editor would instantly punish you for any use of a personal pronoun. So, a sentence like "we took the points and processed them with a script" would be rejected.

Essentially the editor wanted the entire document to be written in what I would roughly describe as "academic passive voice." It's a style of writing where you never identify who is actually carrying out various actions. So, instead of saying "we took the points and processed them with a script" you had to write "the points were processed with a script." As you can see, A major feature of this writing style is that it is very direct and precise. Not only is the second sentence more compact, it doesn't muddle the discussion with unimportant details about who is actually carrying out the action. Obviously, you also avoid the whole issue of "he" versus "she" with such a writing style.

Anyways, work on the Python Essential Reference started just 6 months after finishing my dissertation. Having fought all of those editor battles, I wrote it in the exact same style. So far as I can remember, I don't think any pronoun other than "it" or "its" appeared in the text. It must have blown the copy editor's mind. What kind of deranged lunatic would write a 300-page impersonal document like that? Especially since writing in the passive voice is something so actively discouraged.

Over the last ten years, various copy-editors have worked on the Essential Reference, but much of that original academic writing style remains. At some point, use of the word "you" was introduced in the book. I was somewhat lukewarm about it at the time, but as an author you also learn to pick and choose your battles--and that wasn't one that seemed worth fighting (unlike the battle to convince my publisher that putting out a Python 2.6 book hot on the heels of Python 3.0 was going to make any sense).

So there you have it. A review of a book virtually devoid of personal pronouns spawns a big discussion on the use of he/she on Slashdot. Who would have thought?

Naturally, I disavow any grammatical mistakes in this blog post---after all, I don't have a editor.

posted by Dave Beazley # 10:00 AM

Sunday, January 17, 2010

Presentation on the new Python GIL

For anyone who missed it, I gave a presentation on the new Python GIL, implemented by Antoine Pitrou, at the January 14, 2010 meeting of Chipy. The presentation slides can be found at http://www.dabeaz.com/python/NewGIL.pdf. I don't have any followup comments to put here at this time. However, I think this is an exciting new development for Python 3.

posted by Dave Beazley # 10:09 AM

Tuesday, January 05, 2010

The Python GIL Visualized

In preparation for my upcoming PyCON'2010 talk on "Understanding the Python GIL", I've been working on a variety of new material--including some graphical visualization of the GIL behavior described in my earlier talk. I'm still experimenting, but check it out.

In these graphs, Python interpreter ticks are shown along the X-axis. The two bars indicate two different threads that are executing. White regions indicate times at which a thread is completely idle. Green regions indicate when a thread holds the GIL and is running. Red regions indicate when a thread has been scheduled by the operating system only to awake and find that the GIL is not available (e.g., the infamous "GIL Battle"). For those who don't want to read, here is the legend again in pictures:

Okay, now let's look at some threads. First, here is the behavior of running two CPU-bound threads on a single CPU system. As you will observe, the threads nicely alternate with each other after long periods of computation.

Now, let's go fire up the code on your fancy new dual-core laptop. Yow! Look at all of that GIL contention. Again, all of those red regions indicate times where the operating system has scheduled a Python thread on one of the cores, but it can't run because the thread on the other core is holding it.

Here's an interesting case that involves an I/O bound thread competing with a CPU-bound thread. In this example, the I/O thread merely echoes UDP packets. Here is the code for that thread.

def thread_1(port):
    s = socket(AF_INET,SOCK_DGRAM)
    s.bind(("",port))
    while True:
        msg, addr = s.recvfrom(1024)
        s.sendto(msg,addr)

The other thread (thread 2) is just mindlessly spinning. This graph shows what happens when you send a UDP message to thread 1.

As you would expect, most of the time is spent running the CPU-bound thread. However, when I/O is received, there is a flurry of activity that takes place in the I/O thread. Let's zoom in on that region and see what's happening.

In this graph, you're seeing how difficult it is for the I/O bound to get the GIL in order to perform its small amount of processing. For instance, approximately 17000 interpreter ticks pass between the arrival of the UDP message and successful return of the s.recvfrom() call (and notice all of the GIL contention). More that 34000 ticks pass between the execution of s.sendto() and looping back to the next s.recvfrom() call. Needless to say, this is not the behavior you usually want for I/O bound processing.

Anyways, that is all for now. Come to my PyCON talk to see much more. Also check out Antoine Pitrou's work on a new GIL.

Note: It is not too late to sign up for my Concurrency Workshop next week (Jan 14-15).

posted by Dave Beazley # 6:18 AM