Python for Research: An Initiation

1
4963

The purpose of research is to try and explain something to the world based on certain observations. The challenge is to come up with an explanation before anyone else does. You, as a researcher, are confronted with a volume of data. You need to model it and demonstrate that your model is correct.

How likely are you to hit upon a valid model on your first try? Not very, unless the problem is trivial. So, the challenge is not just to prove that the correct model works. The even greater challenge is to find out as quickly as possible, whether your model is the correct solution, and if not, to eliminate it and move on. This is precisely where Python is superb.

Well, there are some issues.

Python works with dynamic types. A list can contain any mixture of data types. The research problem deals with data of one type. The mixture of data types, obviously, allows for the data to be of the same type as well. However, there is a price. The type of data has to be checked at run time. As a consequence, for massive computation, Python can be virtually unusable.

So, Jim Hugunin, a graduate student at MIT, created Numeric in 1995. He then went on to create Jython and IronPython. Numeric continued to grow in the scientific and engineering community. Many students and scientists built their modules around the Numeric module. The project’s name on SourceForge.net was NumPy. This work has now become a part of the much broader SciPy project. You can find out more about it at www.scipy.org/History_of_SciPy.

Python does not come with the NumPy module included by default. It is, however, available in various distributions. You will need to make sure that it is installed. Familiarity with core Python is assumed.

NumPy one-dimensional arrays or vectors

The best way to learn is to try a little code. Suppose you need to multiply corresponding elements of two arrays. How would the code and performance be different? Try the following small benchmark:

N=1000000
a=[i for i in range(N)]
b=[0.5*i for i in range(N)]
for t in range(10):
     ab = [a[i]*b[i] for i in range(N)]

You have defined two arrays of a million elements each. Your code creates a new array, which is the product of the corresponding elements. You repeat it 10 times so that the execution times can be compared reasonably.

Now, you can write the same code using NumPy arrays.

import numpy as np
N=1000000
a=np.array([i for i in range(N)])
b=np.array([0.5*i for i in range(N)])
for t in range(10):
     ab = a*b

The method array in the NumPy module converts a Python list into an array object of uniform type elements. There are a number of operations defined on these objects—for example, multiplication results in a new array, which is the product of the corresponding elements.

On my system, the core Python program took 8.9 seconds while the NumPy version took 1.7 seconds.

Now, try the following:

>>> a = [1,2,3]
>>> b = [3,2,1]
>>> a + b

Before you give your answer, try the following as well:

>>> s = ['a','b','c']
>>> a + s

The only reasonable, consistent answer is concatenation of the two lists. Now, try the same thing with NumPy arrays:

>>> a=np.array([1,2,3])
>>> b=np.array([3,2,1])
>>> s=np.array(['a', 'b', 'c'])
>>> a + b
array([4, 4, 4])
>>> a + s

Since the elements in the array are of the same type, summation of a and b is very reasonable. Now, what should happen in the case of adding s to a? Understandably, it should raise an exception, as the addition of a number to a string is not meaningful.

You need to compute the sine of a list of numbers. This can also be a good test to benchmark. So, try the following code:

import math
N=1000000
a=[i*2*math.pi/N for i in range(N)]
for t in range(10):
     sina = [math.sin(v) for v in a]

You have defined an array of one million equally spaced values between 0 and 2π. As before, create a new array containing the sine of the corresponding value in the first array. Repeat it 10 times so that the run time is large enough.

The corresponding program using NumPy will be:

import math
import numpy as np
N=1000000
a=np.array([i*2*math.pi/N for i in range(N)])
for t in range(10):
     sina = np.sin(a)

As before, you have converted a Python list into a NumPy array. You can pass an array to a function and its meaning is very clear. It creates a new array whose values are the function of each of the corresponding elements of the array, passed as a parameter.

In this case, on my system, the core Python version takes 7.3 seconds while the NumPy version takes 1.8 seconds.

NumPy multi-dimensional arrays and matrices

A list of lists would be a two dimensional array. Try another little test:

r = 1280*[1]
m = 1024*[r]
for t in range(10):
     m2 = []
     for r in m:
          m2.append([.5*e for e in r])

Define a list of 1,024 rows. Each row has 1,280 elements, with a value of 1. The task is to scale all elements with a constant.

The corresponding program using NumPy would be:

import numpy as np
r = 1280*[1]
m = np.array(1024*[r])
print m.shape
for t in range(10):
     m2 = .5*m

As before, the array method converts a Python list into an array. Printing the shape of the array confirms that it is a two dimensional array of size 1024×1280. The core Python program takes 5.9 seconds while the NumPy application takes 0.7 seconds.

But notice how easy it is to multiply all elements in an array with a constant. You can use NumPy to define the matrix with all elements being 1, as follows:

import numpy as np
m = np.ones((1024,1280))
print m.shape
for t in range(10):
     mp = .5*m

This version runs in 0.4 seconds.

It is reasonable to assume that the addition of two arrays of the same shape is possible and the corresponding elements will be added. Explore the following:

>>> m = np.array([[1,2,3], [ 4,5,6]])
>>> n = np.array([[1,2,3], [ 4,5,6]])
>>> m.shape
(2, 3)
>>> n.shape
(2, 3)
>>> m + n
array([[ 2,  4,  6],
     [ 8, 10, 12]])
>>> n.shape = (6,1)
>>> m + n

You are creating two arrays of size 2×3 and adding them. The result is as expected. NumPy allows an array to be reinterpreted by changing the shape of the array. In the example above, the array n is reinterpreted as a one-dimensional array. Now, if you try to add m and n, it should fail and it does.

Multiplication of two-dimensional arrays could be interpreted as a matrix multiplication. However, that would not be consistent across various dimensions. Hence, in NumPy, multiplication of two arrays is allowed if their shapes are identical and correspond to the creation of a new array whose elements are a product of the corresponding elements of the two arrays. You can try the following:

>>> m = np.array([[1,2],[3,4]])
>>> n = np.array([[1,2],[3,4]])
>>> m*n
array([[ 1,  4],
     [ 9, 16]])

If you want to work with matrices, you need to define a two-dimensional array as a matrix. Try the following example—compare and explore:

>>> m = np.matrix([[1,2],[3,4]])
>>> n = np.matrix([[1,2],[3,4]])
>>> m*n
matrix([[ 7, 10],
     [15, 22]])
>>> m.shape=4,1
>>> n.shape = 4,1
>>> m*n
>>> n*m

Use the built-in help to know more about the possibilities available for arrays, matrices and a lot more.

It should be clear that if an operation can be applied to an array, the performance gains over core Python could be substantial. NumPy provides a lot of functionality, which can be very useful when analysing large volumes of data, giving researchers an edge to beat the competition!

1 COMMENT

LEAVE A REPLY

Please enter your comment!
Please enter your name here