Dabeaz

Dave Beazley's mondo computer blog.

Sunday, August 09, 2009

Python Binary I/O Handling

As a follow-up to my last post about the Essential Reference, I thought I'd talk about the one topic that I wish I had addressed in more detail in my book: binary data and I/O handling. Let me elaborate.

One of the things that interests me a lot right now is the subject of concurrent programming. In the early 1990s, I spent a lot of time writing big physics simulation codes for Connection Machines and Crays. All of those programs had massive parallelism (e.g., 1000s of processors) and were based largely on message passing. In fact, my first use of Python was to control a large massively parallel C program that used MPI. Now, we're starting to see message-passing concepts incorporated into the Python standard library. For example, I think the inclusion of the multiprocessing library is probably one of the most significant additions to the Python core in the past 10 years.

A major aspect of message passing concerns the problem of quickly getting data from point A to point B. Obviously, you want to do it as fast as possible. A high speed connection helps. However, it also helps to eliminate as much processing overhead as possible. Such overhead can come from many places--decoding data, copying memory buffers, and so forth.

Python makes it pretty easy to pass data around between processes. For example, you can use the pickle module, json, XML-RPC, or some other similar mechanism. However, all of these approaches involve a significant amount of overhead to encode and decode data. You probably wouldn't want to use them for any kind of bulk data transfer (e.g., if you wanted to send a large array of floats between processes). Nor would you really want to use them for some kind of high-performance networking on a big cluster.
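To make the overhead a little more concrete, here is a minimal sketch that compares the pickled form of a list of floats with the same values stored as raw 8-byte doubles. The payload size and variable names are assumptions made up for illustration, and byte count is only a rough proxy for the encode/decode work involved.

```python
import pickle
from array import array

# 100,000 doubles, standing in for a bulk payload (size chosen arbitrarily)
values = [float(i) for i in range(100000)]

# Generic serialization: each float is encoded as its own pickle item
pickled = pickle.dumps(values)

# Raw binary form: exactly 8 bytes per double, with no per-item framing
raw = array('d', values).tobytes()

print(len(pickled), len(raw))
```

The raw form is always 8 bytes per value. The pickled form carries extra bytes per item, and both ends also have to build and tear down Python float objects.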

Lurking within the Python standard library, though, is another way to deal with data in messaging and interprocess communication. It's all spread out in a way that's not entirely obvious unless you're looking for it (and even then it's still pretty subtle). Let's start with the ctypes library. I always assumed that ctypes was all about accessing C libraries from Python (an alternative approach to Swig). That's only part of the story. For instance, using ctypes, you can define binary data structures:

from ctypes import *
class Point(Structure):
     _fields_ = [ ('x',c_double), ('y',c_double), ('z',c_double) ]

This defines an object representing a C data structure. You can even create and manipulate such objects just like an ordinary Python class:

>>> p = Point(2,3.5,6)
>>> p.x
2.0
>>> p.y
3.5
>>> p.z = 7
>>>

However, keep in mind that under the covers, this is manipulating a C structure represented in a contiguous block of memory.
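A short sketch of what "contiguous block of memory" means in practice, assuming Python 3 (where bytes() accepts any object that exposes its memory). It checks the size of the structure, copies out its raw bytes, and rebuilds a second Point from them.

```python
from ctypes import Structure, c_double, sizeof

class Point(Structure):
    _fields_ = [('x', c_double), ('y', c_double), ('z', c_double)]

p = Point(2, 3.5, 6)

# Three 8-byte doubles stored back to back
size = sizeof(Point)

# bytes() copies the raw memory that backs the structure
raw = bytes(p)

# from_buffer_copy() builds a new, independent Point from those bytes
q = Point.from_buffer_copy(raw)

print(size, len(raw), q.x, q.y, q.z)
```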

Now this is where things start to get interesting. I wonder how many Python programmers know that they can directly write a ctypes data structure onto a file opened in binary mode. For example, you can take the point above and do this:

>>> f = open("foo","wb")
>>> f.write(p)       
>>> f.close()

Not only that, you can read the file directly back into a ctypes structure if you use the poorly documented readinto() method of files.

>>> g = open("foo","rb")
>>> q = Point()
>>> g.readinto(q)
24
>>> q.x
2.0
>>>

The mechanism that makes all of this work is Python's so-called "buffer protocol." Since ctypes structures are contiguous in memory, I/O operations can be performed directly with that memory without making copies or first converting such structures into strings as you might do with something like the struct module. The buffer protocol simply exposes the underlying memory buffers for use in I/O.
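The difference between converting and sharing can be sketched as follows, assuming Python 3. The struct module produces a new byte string, whereas a ctypes structure can be laid directly over an existing writable buffer with from_buffer(), so both names refer to the same memory. The bytearray here is just a stand-in for any writable buffer.

```python
import struct
from ctypes import Structure, c_double

class Point(Structure):
    _fields_ = [('x', c_double), ('y', c_double), ('z', c_double)]

# struct module route: values are converted into a new byte string
packed = struct.pack('ddd', 2.0, 3.5, 6.0)

# Buffer protocol route: lay a Point over an existing writable buffer
buf = bytearray(packed)
p = Point.from_buffer(buf)      # no copy; p and buf share memory

# A change made through the structure shows up in the underlying bytes
p.x = 10.0
x_in_buf = struct.unpack_from('d', buf, 0)[0]

# memoryview exposes the structure's memory without copying it
nbytes = memoryview(p).nbytes

print(p.y, x_in_buf, nbytes)
```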

Direct binary I/O like this is not limited to files. If s is a socket, you can perform similar operations like this:

p = Point(2,3,4)           #  Create a point
s.send(p)                  #  Send across a socket

q = Point()
s.recv_into(q)             #  Receive directly into q
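Here is a self-contained version of the same idea, assuming Python 3. It uses socket.socketpair() so that both ends live in one process, which is an assumption made only to keep the sketch runnable on its own. A real stream socket may deliver fewer bytes than requested, so production code would need to loop until the whole structure has arrived.

```python
import socket
from ctypes import Structure, c_double

class Point(Structure):
    _fields_ = [('x', c_double), ('y', c_double), ('z', c_double)]

# A connected pair of sockets stands in for a real network connection
left, right = socket.socketpair()

p = Point(2, 3, 4)
left.sendall(p)                  # send the structure's raw bytes

q = Point()
nreceived = right.recv_into(q)   # fill q's memory directly from the socket

left.close()
right.close()

print(nreceived, q.x, q.y, q.z)
```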

If that wasn't enough to make your brain explode, similar functionality is provided by the multiprocessing library as well. For example, Connection objects (as created by the multiprocessing.Pipe() function) have send_bytes() and recv_bytes_into() methods that also work directly with ctypes objects. Here's an experiment to try. Start two different Python interpreters and define the Point structure above. Now, try sending a point through a multiprocessing connection object:

>>> p = Point(2,3,4)
>>> from multiprocessing.connection import Listener
>>> serv = Listener(("",25000),authkey=b"12345")
>>> c = serv.accept()
>>> c.send_bytes(p)
>>>

In the other Python process, do this:

>>> q = Point()
>>> from multiprocessing.connection import Client
>>> c = Client(("",25000),authkey=b"12345")
>>> c.recv_bytes_into(q)
24
>>> q.x
2.0
>>> q.y
3.0
>>>

As you can see, the point defined in one process has been directly transferred to the other.
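If you don't want to juggle two interpreters, a single-process sketch along the same lines is possible with multiprocessing.Pipe(), assuming Python 3. As a conservative choice for portability across Python versions (not something the session above requires), this version receives into a plain bytearray and then lays a Point over that memory with from_buffer() rather than receiving straight into the structure.

```python
from ctypes import Structure, c_double, sizeof
from multiprocessing import Pipe

class Point(Structure):
    _fields_ = [('x', c_double), ('y', c_double), ('z', c_double)]

# Pipe() returns two connected Connection objects; both ends stay in this
# process purely to keep the sketch self-contained
sender, receiver = Pipe()

p = Point(2, 3, 4)
sender.send_bytes(p)             # send the structure's raw bytes

# Receive into a byte buffer, then map a Point onto that same memory
buf = bytearray(sizeof(Point))
nreceived = receiver.recv_bytes_into(buf)
q = Point.from_buffer(buf)       # shares memory with buf, no extra copy

print(nreceived, q.x, q.y, q.z)
```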

If you put all of the pieces of this together, you find that there is a whole binary handling layer lurking under the covers of Python. If you combine it with something like ctypes, you'll find that you can directly pass binary data structures such as C structures and arrays around between different interpreters. Moreover, if you combine this with C extensions, it seems to be possible to pass data around without a lot of extra overhead. Finally, if that wasn't enough, it turns out that some popular extensions such as numpy also play in this arena. For instance, in certain cases you can perform similar direct I/O operations with numpy arrays (e.g., directly passing arrays through multiprocessing connections).
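The array case can be sketched without any third-party packages by using the standard library's array module as a stand-in for a numpy array (an assumption made to keep the example dependency-free; it is not the numpy API). Both ends of the Pipe again live in one process.

```python
from array import array
from multiprocessing import Pipe

sender, receiver = Pipe()

# A small array of C doubles to send as raw bytes
data = array('d', [1.0, 2.5, 4.0, 8.0])
sender.send_bytes(data)

# Pre-sized destination array; the message is written into its memory
incoming = array('d', [0.0] * 4)
nreceived = receiver.recv_bytes_into(incoming)

print(nreceived, list(incoming))
```

The destination has to be at least as large as the incoming message, so the receiver needs to know (or negotiate) the array size in advance.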

I think that this functionality is pretty interesting--and highly relevant to anyone who is thinking about parallel processing and messaging. However, all of this is also somewhat unsettling. For one, much of this functionality is very poorly documented in the Python documentation (and in my book for that matter). If you look at the documentation for methods such as the readinto() method of files, it simply says "undocumented, don't use it." The buffer interface, which makes much of this work, has always been rather obscure and poorly understood--although it got a redesign in Python 3.0 (see Travis Oliphant's talk from PyCon). And if it wasn't complicated enough already, much of this functionality gets tied into the bytes/Unicode handling part of Python--a hairy subject on its own.

To wrap up, I think much of what I've described here represents a part of Python that probably deserves more investigation (and at the very least, more documentation). Unfortunately, I only started playing around with this recently--too late for inclusion in the Essential Reference (which was already typeset and out the door). However, I'm thinking it might be a good topic for a PyCon tutorial. Stay tuned.

Note: If anyone has links to articles or presentations about this, let me know and I'll add them here.

posted by Dave Beazley # 3:32 PM
