Turbo Charge Python Apps with Speed, Part 1

Python’s ease of use, its friendliness, its huge “batteries-included” standard library and its wealth of add-on libraries make it one of the most popular languages in the FOSS world. However, as a byte-code-compiled language, it loses out to C and C++ on the performance front. This article presents several possible solutions to this problem, so you can keep Python’s familiar syntax, yet achieve better, or even native-code, performance for programs that need it.

Python is undoubtedly one of the most popular programming languages ever created. It has a marvellous syntax, and great libraries supporting almost every area of computing one can think of. Python provides great programming productivity and flexibility, but it loses out on the performance front. It is a byte-code-compiled dynamic programming language that executes through its virtual machine, so the performance of Python programs is much slower than that of native-code-compiled programming languages like C, C++, etc. Such languages are wonderful for program performance, but greatly hamper programming productivity and flexibility.

Another problem with Python is that its current standard implementation is unable to exploit the multiple cores of now-standard multi-core processors. These speed-dampening factors are often frustrating to developers, who have to switch to writing complicated compiled C or C++ code chunks to mitigate the performance drawbacks of Python.

Thanks to the FLOSS community, however, there are many ways to turbo-charge your Python programs without resorting to writing complicated and time-consuming C or C++ compiled code chunks. These solutions are in the form of Python JIT compilation modules like Psyco, alternative implementations of standard Python like Unladen Swallow, Python to C++ compilers like Shedskin, a Python-like compiled language, Wirbel, etc.

To enable Python to take full advantage of multi-core processors, there are solutions like the built-in multiprocessing module, PProcess, Parallel Python, etc. We will go on to show you how to employ these tools to pump steroids into Python, so it yields greater productivity, flexibility and performance.

In the first half of this two-part article, we explore Psyco, Unladen Swallow, Shedskin and Wirbel. In the next part, we will look at multiprocessing, PProcess and Parallel Python.
I used Puppy Linux 4.2.1 and the Ubuntu 9.10 64-bit desktop edition to explore these tools and to test the source code presented in this article.

From now on, I’ll use the terms “C-like” or “C-type” to identify C, C++ and other languages that compile to native code and offer excellent performance — it gets irksome to type “C, C++, etc.,” multiple times, and you’ll probably tire of reading that, too.

Why is Python slow?

When you compile and link a program in C-like languages, the entire program is converted into instructions that are natively executable by your hardware — the CPU. In contrast, running a Python program is mainly a two-step process. When you run a Python program, it is automatically converted into a platform-independent intermediate representation known as byte-code. These byte-code instructions are then interpreted through a native platform application known as the Python Virtual Machine.
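You can see the byte-code stage for yourself with the standard dis module, which disassembles a function into the instructions the virtual machine interprets. A quick sketch (the exact opcode names vary between Python versions):

```python
import dis

def add(a, b):
    return a + b

# Print the byte-code instructions the Python VM would interpret for add();
# each output line is one instruction, e.g. LOAD_FAST, BINARY_ADD/BINARY_OP.
dis.dis(add)
```

Every one of those instructions is dispatched by the virtual machine at runtime, which is exactly where the interpretation overhead comes from.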

The official, reference, and most popular implementation of the Python virtual machine is CPython. It’s obvious from the name itself that it is programmed in C. Every single byte-code instruction invokes a C routine, comprising multiple native instructions, to accomplish the task encoded in the byte-code. It’s this process that introduces the execution delays.

Python is a dynamic programming language, where every value is an object — even if it looks like a native data type to you. Type i = 7; f = 3.14 at a Python interactive prompt, and then type dir(i); dir(f) to see a listing of the various methods and attributes associated with i and f. In Python, you directly assign values to variables without declaring their types — only the values carry a type, not the names they are bound to. This is the opposite of statically typed C-like languages, where you have to declare the types of the native or abstracted objects before manipulating them.

Python also lets you bind different types of values to objects at runtime. Therefore, the Python runtime (VM) also has to deduce the types of these objects, and convert them into native data types, to finally build the native instructions that will run on the CPU. This type deduction and translation is another overhead that further slows down Python.
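A minimal sketch of this dynamism: a name can be rebound to values of different types at runtime, and each value (not the name) carries its type:

```python
x = 7
print(type(x))               # <class 'int'>

x = "seven"                  # the same name now refers to a str object
print(type(x))               # <class 'str'>

# even "primitive-looking" values are full objects with methods
print((3.14).is_integer())   # False
```

It is precisely this freedom that forces the runtime to carry type information alongside every value and check it on every operation.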

The various objects in Python are allocated and deallocated through a subsystem for automatic memory management known as the gc (Garbage Collector). This gc activity runs after a certain number of object allocations, to manage and free memory from unreachable objects. During the gc cycles, other activities stop — so if the gc cycles are long, then those again introduce delays in execution of your Python code.

The Python virtual machine deals with a lot of global objects, and the external non-thread-safe native-code modules. The Python interpreter enforces a rule that only a single running thread modifies those global objects, and ensures the safety of other threads by granting a mutual exclusion lock to the scheduled thread. The lock is known as the GIL (Global Interpreter Lock).

Though the design of the Python interpreter is simplified and it’s easier to maintain it this way, it introduces yet another speed-reducing side-effect to Python programs. Due to the GIL, Python programs are unable to take advantage of modern multi-core processors that can run multiple threads of instructions in parallel. This means that even if you write your Python programs with multiple threads, the Python interpreter will only execute one thread at a time on a CPU core; the remaining cores are not used. That’s another big performance loss. (This was just a simple picture of the GIL; it’s one of the most-discussed topics in the Python community, and you could learn more about it in this excellent video presentation.)
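The effect is easy to demonstrate: a CPU-bound function run in two threads takes about as long as running it twice serially, because the GIL lets only one thread execute byte-code at a time. A rough sketch (absolute timings will vary by machine):

```python
import threading
import time

def count_down(n):
    # pure CPU-bound work: no I/O, so the GIL is the bottleneck
    while n > 0:
        n -= 1

N = 5_000_000

start = time.perf_counter()
count_down(N)
count_down(N)
serial = time.perf_counter() - start

start = time.perf_counter()
threads = [threading.Thread(target=count_down, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
threaded = time.perf_counter() - start

# On standard CPython, 'threaded' is usually no faster than 'serial'
print("serial: %.2fs  threaded: %.2fs" % (serial, threaded))
```

For I/O-bound work the picture is different, since the interpreter releases the GIL while a thread blocks on I/O.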

Psyco, a Python specialising compiler

The easiest option to speed up your Python apps, without significantly changing the code, and while still using the standard Python interpreter, is to use Psyco. This is a loadable Python module; you just import the psyco module, and add a line or two to speed up your existing Python code.

Psyco is a mature piece of work that’s been around for many years, but currently it is in maintenance-only mode, and is only available for 32-bit Python distributions on the x86 processor architecture. According to the author of Psyco, the current and future work related to it is being targeted towards a new Python implementation known as PyPy.

Psyco claims to provide speed-ups in the range of 2 to 100 times, typically a 4x speed-up, with unmodified Python sources and the standard interpreter. It is available for multiple OS platforms: GNU/Linux, BSD, Mac OS X and Windows.

Installing Psyco

Psyco is written in C, so you need standard build tools to build it. You also need the standard Python headers to install Psyco from its source tarball. Type the command sudo apt-get install build-essential python-dev in a text console on your Ubuntu 32-bit machine, to fulfil the dependencies.

Now go to the Psyco project page, download its latest source tarball, and issue the commands tar zxvf psyco-version-src.tar.gz && cd psyco-version. Finally, issue the command python setup.py install, with sudo or su if you want to do a system-wide install.

If everything goes well, check by running python -c 'import psyco'. If there’s no error, you are ready to play with Psyco.

Quickly mod your Python app with Psyco

If you’re impatient to put Psyco into your Python application without reading up about it, then add the lines of code below into your app, after the lines that import all the modules, create constant global objects, etc. The recommended place for these lines is at the start of the well-known if '__main__' == __name__ block in your Python application. Try your app after this, and see if it runs faster.

    try:
        import psyco
        psyco.full()
    except ImportError:
        pass

Under Psyco’s hood

Let’s explore how Psyco works, what performance boosts we can practically expect, and how to achieve the best speed-up with it. Psyco works somewhat like JIT (Just in Time) compilers in other virtual machine-based programming languages like Java. In fact, Psyco is like a JIT compiler; it compiles chunks of Python byte-code into native machine instructions at runtime. But it does more than a JIT compiler.

From a single chunk of Python byte-code, it creates multiple versions of compiled native code, for different kinds of objects. Psyco profiles the Python program at runtime, and accordingly marks some parts of the program as “hot zones”, for which it emits the native machine instructions. It keeps on using the different versions of generated native-code chunks for the various types of objects used in the hot zones. Thus, Psyco tries to guess the performance-oriented parts of your Python program, and compiles those into native-code chunks to provide performance comparable to C-like native-compiled languages.

It also performs various runtime optimisations, to emit the most optimised native-code instructions. The psyco.full() function shown above instructs Psyco to tackle the whole Python code that is running, and native-compile as much as possible.

How much, exactly, performance gets boosted by using Psyco is difficult to guess, due to the complicated mechanism it uses to generate native-compiled versions of your Python programs. However, there are some rules of thumb that help you extract the highest performance boosts:

  • First, Psyco is most useful for Python programs that are CPU-bound; that is, they do a lot of looping, mathematical calculations, walking lists, string and number operations, etc. Psyco may not be useful to Python code that is I/O-bound (waiting for network events, disk reads, writes or blocking operations), as the performance in those cases is hampered by the GIL, over which Psyco has no control. We will look at ways to overcome the GIL issue, in later sections of the article.
  • As mentioned earlier, Psyco generates multiple versions of native-code chunks, optimised for different types of objects — that means Psyco needs a lot of additional memory. Forcing Psyco to work on an entire large Python program that is not a good candidate for Psyco optimisation will most likely degrade the performance of your program. Therefore, avoid using psyco.full(), except in cases where your program is short, and/or you are sure that the whole program is CPU-bound, by the nature of the problem you are solving. You can instruct Psyco not to try to compile the whole Python program, but to first profile the program, and then native-compile only the critical parts based on this profiling. Do this by using psyco.profile() instead of psyco.full(). Profiling is recommended for large programs. You can also pass a watermark value to psyco.profile(), which should be between 0.0 and 1.0. A smaller watermark indicates that more functions are to be native-compiled, and a larger value that fewer functions should be compiled. The default watermark value is 0.09, if you don’t pass the parameter.
  • The Psyco functions psyco.full() and psyco.profile() take another argument, memory, which specifies the kilobytes limit below which the functions are compiled. Psyco also provides a log() function to enable logging (the log file is named program.log-psyco, by default). The following Psyco code enables logging and the profiled compilation of those functions that take at least 7 per cent of execution time, and occupy less than 200 kilobytes per function:

      psyco.log()
      psyco.profile(0.07, memory=200)
  • You can also manually choose which functions you want Psyco to compile, using the psyco.bind() and psyco.proxy() functions. Both these functions take the function to be compiled (in Python, that is a reference to a function object). The difference in their behaviour is as follows:
    • The proxy() function native-compiles the passed function object, and returns a reference to the native-code-compiled chunk as its return value. The returned reference is still callable; it is still a function object. However, the original function object remains unchanged—the same byte-code that was created when you ran the program. To run the optimised code, you must invoke the native-code-compiled function object via the variable containing the proxy() function’s return value (g in the example below). If you invoke the function by its original name (function2 as from a def function2() statement) then you are calling the byte-code/unoptimised version.
    • In contrast, psyco.bind() compiles the function object passed to it, and replaces the original byte-code function object with the native-code-compiled version. It also native-code-compiles functions called from the passed function. Thus, you can continue to invoke the function with the same name (that is, function1 as from a def function1() statement) and you will be running the optimised native code. In the sample code snippet shown below, function1 and the inner functions it calls are native-compiled in-place, but function2 remains as byte-code; only the returned object stored in the variable g is native-compiled.
      psyco.bind(function1)
      g = psyco.proxy(function2)
      g(args)            # optimised
      function2(args)    # unoptimised
      function1(args)    # optimised
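To judge which mode helps, a simple timing harness is useful. The sketch below binds one hot, CPU-bound function if Psyco is present, and falls back to plain byte-code otherwise (remember that Psyco itself is only available on 32-bit x86 Python builds):

```python
import time

def busy_loop(n):
    # CPU-bound hot zone: a good candidate for psyco.bind()
    total = 0
    for i in range(n):
        total += i * i
    return total

try:
    import psyco
    psyco.bind(busy_loop)   # native-compile just this function
except ImportError:
    pass                    # no Psyco: run the plain byte-code version

start = time.perf_counter()
busy_loop(1_000_000)
print("elapsed: %.3fs" % (time.perf_counter() - start))
```

Timing the same function with and without the psyco.bind() call gives you a concrete number for your own workload, rather than relying on the claimed 2x-100x range.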

Psyco lets you boost your app’s runtime performance with the addition of only a few lines of code, and a little understanding about the nature of your Python program and its various functions and operations. You can choose the speed-up mode that’s most suitable for your Python program based on the broad rules mentioned above, and with a little trial and error.

Psyco comes with many Python test programs. You can go through the documentation provided on the Psyco project page to explore it further.

Unladen Swallow, Google’s steroids for CPython

The creator of Python, Guido van Rossum, works for Google. Python is one of the three official programming languages used there, along with C++ and Java — so who could be more concerned about the speed of Python programs? The result is Unladen Swallow, an optimisation branch of the standard Python code base. The main goal of Unladen Swallow is to maintain full compatibility with CPython and its libraries, while boosting performance around 5 times.

It is based on Python 2.6.1, and changes only the Python runtime, nothing else. Currently, YouTube.com makes heavy use of Unladen Swallow. The project is addressing improvements in all the areas that make a Python program slow, as we discussed earlier. It has implemented JIT compilation to boost the runtime performance of Python programs. The current implementation first converts the Python byte-code of all the hot functions to LLVM (Low-Level Virtual Machine) IR (Intermediate Representation), which is then compiled to native machine code.

The long-term plans of the project are to change the stack-based CPython VM to a register-based VM, based on LLVM code, to extract more acceleration through the LLVM JIT on this VM. It also has put out many improvements to the CPython garbage collector, which lower the delays caused by the GC.

Unladen Swallow is an ongoing project. To use it, you have to fetch its code from the project SVN trunk, and build it. The prerequisites for building it are LLVM and Clang; you also need the standard build tools that we installed in the Psyco section, to build LLVM and Clang. The project page provides detailed information on installing it, so please follow that to set it up on your machine. The project page also provides the Python test suite to benchmark Swallow’s performance against CPython’s.

You can also try to run your own Python programs with the Unladen Swallow Python interpreter, to judge the performance gains. Explore more about Unladen Swallow through the various links and documents provided on its project page.

Shedskin, a Python-to-C++ compiler

Shedskin pumps up the performance of existing Python code through a different approach. It takes Python programs written in a static subset of the standard Python language, and generates corresponding optimised C++ code. This generated C++ code can be compiled as a native-code application, or as an extension module to be imported in a larger Python program. Thus, Shedskin can boost your CPU-intensive program’s speed.

Currently, Shedskin is in an experimental stage, and supports only around 20 modules from the standard library. Still, it’s a very promising project to use to turbo-charge your Python programs.

You can find .deb and .rpm packages of Shedskin on its project page, in addition to the source tarball. To install Shedskin from source, the prerequisites are g++, the Boehm garbage collector, and the PCRE library. You can install these on an Ubuntu machine by running sudo apt-get install g++ libgc-dev libpcre3-dev. You could build the GC and libpcre on platforms where no pre-built packages are available: download their tarballs from their individual project pages, and follow the provided build instructions.

Once the prerequisites are installed, download the latest tarball of Shedskin from its project page, and run tar zxvf shedskin-version.tgz && cd shedskin-version. Finally, run python setup.py install (with sudo or su privileges if you want to do a system-wide install). If everything goes well, check that running shedskin in the terminal shows some info.

Shedskin translates implicit Python types to explicit C++ types. It uses type-inference techniques to determine the types used in Python programs: it iteratively analyses the Python code, and sees what values are assigned to different objects, to deduce their static types. Therefore, it doesn’t allow you to assign values of different types to the same variable (reusing a variable name with different data types is forbidden). It also forces you to use elements of a single type in the various Python collections (tuple, list, set, etc.).

Here are a few examples of valid and invalid Python expressions in Shedskin:

i = 56                         # valid
i = 'a'                        # invalid
t1 = (1, 7, 9, 16)             # valid
t2 = (1, 's', 3.14, [1, 2, 3]) # invalid
t3 = ([1], [1, 6], [4, 7, 5])  # valid
a = ['abc', 'a', 'efgh']       # valid
b = [(0.01,), (0.1, 3.178)]    # valid
c = [1, 6.78, 5, 0]            # invalid
d1= {'1':'shed', '2':'skin'}   # valid
d2= {'a':1, 2:5, 'c':7}        # invalid

The basic usage of Shedskin is shedskin file.py. It creates file.hpp, file.cpp and Makefile in the current working directory. It also provides a few switches to control the way output files are generated. For example, -a causes an additional annotated Python file to be generated; the name of the generated Makefile is customisable with the -m switch, etc. Finally, you invoke the generated Makefile to build the compiled executable from the generated C++ source.
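As an illustration of the static subset Shedskin accepts, here is a small program in which every name keeps a single type throughout (the file name fib.py is just an example; you would build it with shedskin fib.py && make as described above):

```python
# fib.py -- every variable keeps one static type, so Shedskin can infer
# int for n, a and b, and emit the corresponding C++ code
def fib(n):
    a, b = 0, 1
    for _ in range(n):
        a, b = b, a + b
    return a

if __name__ == '__main__':
    print(fib(30))   # 832040
```

Because the same file is also valid standard Python, you can develop and test it with CPython first, and compile it with Shedskin only when you need the speed.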

The Shedskin project page provides many test programs in a tar archive. The page mentions a speed-up of 2 to 200 times over CPython, for these test programs. Let’s test some of these Python programs ourselves, to check the speed-up. Download the shedskin-examples-version.tgz from its page, and run tar zxvf shedskin-examples-version.tgz && cd shedskin-examples-version.

First, we test a JPEG-to-BMP conversion program; run shedskin -m jpg2bmp.mk TonyJpegDecoder.py && make -f jpg2bmp.mk to build the native executable for this Python program. Now compare the performance of the Python version and the compiled version: run time python TonyJpegDecoder.py and time ./TonyJpegDecoder and compare the timings.

In my test, the compiled version of the TonyJpegDecoder sample provided a 1.5x speed-up over the Python version on my Ubuntu machine. I also tried a few other test programs; the compiled version of chess.py showed a 12x speed-up, and that of sudoku1.py provided a 31x speed-up over their Python versions.

Try to measure Shedskin speed-up by experimenting with various kinds of Python programs yourself. You could further explore Shedskin by going through the documentation provided on its project page, and studying the test programs.

Wirbel, a Python-like compiled language

Wirbel is another interesting experimental project for people looking to boost Python program performance to near-native speed; its home page claims it to be among the fastest of modern programming languages. Wirbel is like Shedskin regarding the static data types it uses, and it also uses type inference to deduce information about the data types, to produce native executable code.

Wirbel is a new project with limited library support, so it currently doesn’t support the wide functionality that Python does — but it could be useful for small programs in Python-like syntax, to reap the benefits of native-code speed of execution.

To install Wirbel on your machine, download the latest source tarball from its home page, and run tar zxvf wirbel-version.tar.gz && cd wirbel-version. Next, run ./configure && make to build Wirbel. If the compilation completes without errors, then issue make install with sudo or su rights to install it. On my Ubuntu 64-bit machine, I got some compilation failures due to the removal of some headers in the latest versions of g++. To fix these, add #include <cstdio> to both src/Type.cc and src/Location.cc, and #include <stdint.h> to baustones/httpd/HTTPRequest.h; these files are under the top-level Wirbel source directory. I’ve mailed this fix to the author of Wirbel, and hope that it’ll be included in the next release of the project.

Now you should have two new commands available: the Wirbel compiler, wic, and the wirbel utility, which compiles and links the .w code and runs the created native executable.

The wic compiler takes many switches (explore man wic regarding this). You could add the absolute path to the wirbel binary with the hash bang (#!) magic as the first line of the .w files, and make them executable with chmod +x, to directly run the scripts.

In addition to the static nature of data types, Wirbel has a few more differences from Python that you should keep in mind while converting existing Python programs:

  • The extension of Wirbel source files is .w, not .py; if you try to compile .py files with the Wirbel compiler, it throws an error.
  • You don’t use import statements in Wirbel files — it doesn’t understand the import statement. Wirbel itself finds everything it requires.
  • Wirbel supports function overloading, which Python does not.
  • The print function in Wirbel uses parentheses, like in Python 3.0.

Wirbel comes with many examples and tests that you could go through to see it in action. Measure the performance boost to compare it to similar programs in Python. You can also go through the information on the Wirbel home page to explore it further.

Psyco boosts the performance of Python programs very significantly, and you only need to add a few lines of code. Unladen Swallow is a new turbo-charged Python runtime that optimises CPython without breaking backward compatibility. Shedskin and Wirbel are new experimental projects that target a static subset of Python code to provide native-compiled performance.

In the next part, we will dirty our hands with more options that specifically address the GIL issue in CPython, to extract performance gains for Python programs running on mighty multi-core processors.

All published articles are released under Creative Commons Attribution-NonCommercial 3.0 Unported License, unless otherwise noted.