Python Threading and its Caveats

As a rapid application development language, Python is highly valued for being easy to use, feature-rich and robust. With multi-core processors now ubiquitous, one would expect Python programs to exploit the additional cores and run faster… but they don’t! This article, targeted at Python developers, systems administrators, and students, explores why this is so by analysing Python 2.6.5 on Ubuntu 10.04 on a dual-core Intel x86 system (with a T6500 CPU).

Python supports multiple threads in a program; a multi-threaded program can execute multiple sub-tasks (I/O-bound and CPU-bound) independently. Apart from intelligently mixing CPU-bound and I/O-bound threads, developers try to exploit the availability of multiple cores, which allows the parallel execution of tasks.
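
The benefit of mixing in I/O-bound threads can be seen in a small sketch (using time.sleep() as a stand-in for a blocking I/O call): two waiting threads overlap, so the total wall-clock time is roughly that of one wait, not two.

```python
import threading
import time

def wait():
    time.sleep(1)   # blocking, I/O-style wait

t0 = time.time()
threads = [threading.Thread(target=wait) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.time() - t0

print("%.1fs" % elapsed)   # roughly 1s, not 2s: the waits overlap
```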

What is special about Python threads is that they are regular system threads! The Python interpreter maps Python thread requests to either POSIX/pthreads, or Windows threads. Hence, similar to ordinary threads, Python threads are handled by the host operating system.
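
On modern Python 3 (3.8 and later), this is easy to verify: threading.get_native_id() returns the kernel-assigned ID of the underlying OS thread, and each Python thread reports a different one.

```python
import threading

native_ids = []

def report():
    # Record the OS-level (kernel-assigned) ID of this thread
    native_ids.append(threading.get_native_id())

threads = [threading.Thread(target=report) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(native_ids)   # two distinct OS thread IDs
```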

There is no thread-scheduling logic in the Python interpreter itself: thread priorities, scheduling policies, and thread pre-emption do not exist at the interpreter level. The scheduling and context switching of Python threads is left entirely to the host scheduler.

Python threads are expensive

Python threads, especially CPU-bound ones, are very expensive in terms of system-call usage. We created a small Python program with two threads, each of which performs a basic arithmetic task:

# A test snippet: thread.py

from threading import Thread

class mythread(Thread):
    def __init__(self):
        Thread.__init__(self)

    def count(self, n):
        while n > 0:
            n -= 1

    def run(self):
        self.count(10000000)

th1 = mythread()
th1.start()

th2 = mythread()
th2.start()

th1.join()
th2.join()

Let’s check the number of system calls used in this program:

vishal@tangerine:~$ strace -f -c python ./thread.py 

% time     seconds  usecs/call     calls    errors syscall 
------ ----------- ----------- --------- --------- ---------------- 
99.97    0.361636           7     54446     26494 futex 
 0.03    0.000124           1       238       163 open 
 0.00    0.000000           0       116           read 
 0.00    0.000000           0        75           close 
 0.00    0.000000           0         1           execve 

 [...] 
------ ----------- ----------- --------- --------- ---------------- 
100.00   0.361760                 55336     26738 total

The (outrageous number of) futex() calls are used to synchronise access to a global data structure, the GIL, which is explained below. Next, we tweaked the program logic to run sequentially, eliminating threads, and thus ran it on a single core.

# thread.py: the same work done sequentially

class my():
    def count(self, n):
        while n > 0:
            n -= 1

th1 = my()
th1.count(10000000)
th1.count(10000000)

vishal@tangerine:~$ strace -f -c python ./thread.py 

% time     seconds  usecs/call     calls    errors syscall 
------ ----------- ----------- --------- --------- ---------------- 
100.00    0.000040           0       238       163  open 
 0.00     0.000000           0       116            read 
 0.00     0.000000           0        75            close 
[...] 
------ ----------- ----------- --------- --------- ---------------- 
100.00    0.000040                   882       245 total

Now, let’s take a look at the time consumed by both versions of the program:

Two threads running on a dual-core Intel x86 machine:

vishal@tangerine:~$ time python ./thread.py 

real    0m1.988s 
user    0m1.856s 
sys     0m0.384s

Sequential version of the same program:

vishal@tangerine:~$ time python ./thread.py 

real    0m1.443s 
user    0m1.436s 
sys     0m0.004s

As is apparent, Python threads are very expensive in terms of:

  1. The number of system calls used
  2. The higher turnaround time of the application
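
The comparison is easy to reproduce with a short, self-contained timing sketch (a smaller count is used here to keep the run quick; on a stock CPython interpreter, the threaded version is typically no faster, and often slower, than the sequential one):

```python
import threading
import time

def count(n):
    while n > 0:
        n -= 1

N = 2000000

# Sequential: two counts, back to back
t0 = time.time()
count(N)
count(N)
seq = time.time() - t0

# Threaded: the same two counts in parallel threads
t0 = time.time()
threads = [threading.Thread(target=count, args=(N,)) for _ in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()
par = time.time() - t0

print("sequential: %.2fs  threaded: %.2fs" % (seq, par))
```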

The GIL

Multi-threading is used to exploit the available hardware, get better performance, and reduce turnaround time. Python threads do not serve this purpose. What is the reason for such behaviour?

The cause of this inefficiency is the way Python grants a thread access to the interpreter. Only one thread can be active in the Python interpreter at a time. Every thread in a Python program contends for a single global lock, the Global Interpreter Lock (GIL).

The GIL is implemented with a mutex and a condition variable. It ensures that the running thread has exclusive access to the interpreter’s internals. A Python thread must acquire the GIL to become eligible to run.
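
The mechanics can be illustrated with a toy model (an illustration only, not CPython’s actual implementation): a single lock that every thread must hold while it does “interpreter work”.

```python
import threading

# Toy model of the GIL: one lock that every "interpreter thread"
# must hold while it executes bytecode.
toy_gil = threading.Lock()
executed = []

def run_bytecode(name):
    with toy_gil:               # acquire the "GIL" before running
        executed.append(name)   # the "interpreter work" happens here
    # lock released: another thread may now run

threads = [threading.Thread(target=run_bytecode, args=("t%d" % i,))
           for i in range(2)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(sorted(executed))   # ['t0', 't1']
```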

A thread can be in any of the following states:

  • Ready: Ready to be run by the system scheduler.
  • Blocked: Waiting for a resource.
  • Running: Currently being executed by the system scheduler.
  • Terminated: The thread has exited normally.

A thread that’s in the “ready” state but does not hold the GIL may get scheduled by the host scheduler, but will not proceed to the “running” state until it acquires the GIL. Since the GIL is a critical resource, the Python interpreter ensures that each running thread periodically releases (and later reacquires) the GIL:

  • After a pre-specified interval (measured in “ticks”).
  • When the thread performs a blocking I/O operation (read, write, send, etc.).
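
In the Python 2.x interpreter analysed here, the tick count could be inspected and tuned with sys.getcheckinterval()/sys.setcheckinterval(). Python 3.2 and later replaced ticks with a time-based switch interval:

```python
import sys

# Python 3.2+: the GIL is released based on elapsed time, not ticks.
print(sys.getswitchinterval())   # default is 0.005 (5 milliseconds)

# Ask the interpreter to consider a thread switch every 10 ms instead
sys.setswitchinterval(0.010)
print(sys.getswitchinterval())
```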

Ever wondered why you cannot interrupt a multi-threaded Python program with Ctrl+C? Ctrl+C sends the SIGINT signal to the Linux process, and in Python only the “main” thread can handle this signal. As mentioned above, only a single thread can be active in the Python interpreter at a time, and the interpreter periodically asks the running thread to release the GIL, giving another thread a chance to acquire it.

The ownership of the GIL is not moderated, and is effectively random. By switching threads, the Python interpreter makes sure that the “main” thread will eventually get a chance to acquire the GIL; the host scheduler will then schedule the “main” thread and the signal-handler code can run, so we have to wait till then. And then, what’s next?

In practice, the main thread is often blocked on an uninterruptible thread join or lock. Hence, it never gets a chance to process the signal, and your Ctrl+C never reaches the main thread. The only solution is to kill the Python process.
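
A common mitigation (a sketch, not part of the original analysis) is to avoid blocking the main thread indefinitely: join with a timeout in a loop, so the main thread regularly returns to the interpreter and can run a pending SIGINT handler between join attempts.

```python
import threading

def count(n):
    # CPU-bound busy loop, as in the earlier example
    while n > 0:
        n -= 1

t = threading.Thread(target=count, args=(2000000,))
t.daemon = True   # let the process exit even if the thread is still running
t.start()

# Polling join: the main thread wakes up every 100 ms, so it can
# process a pending SIGINT (Ctrl+C) between join attempts.
while t.is_alive():
    t.join(timeout=0.1)
```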

All published articles are released under Creative Commons Attribution-NonCommercial 3.0 Unported License, unless otherwise noted.