Analyse Your Data with Pandas

0
7670

Panda-Analysing-figure

Here’s an introduction to Pandas, an open source software library that’s written in Python for data manipulation and analysis. Pandas facilitates the manipulation of numerical tables and the time series.

In recent times, it has been proven again and again that data has become an increasingly important resource. Now, with the Internet boom, large volumes of data are being generated every second. To stay ahead of the competition, companies need efficient ways of analysing this data, which can be represented as a matrix, using Python’s mathematical package, NumPy.

The problem with NumPy is that it doesn’t have sufficient data analysis tools built into it. This is where Pandas comes in. It is a data analysis package, which is built to integrate with NumPy arrays. Pandas has a lot of functionality, but we will cover only a small portion of it in this article.

Getting started  
Installing Pandas is a one-step process if you use Pip. Run the following command to install Pandas.

sudo pip install pandas

If you face any difficulties, visit http://pandas.pydata.org/pandas-docs/stable/install.html.  You can now try importing Pandas into your Python environment by issuing the following command:

import pandas

In this tutorial, we will be using data from Weather Underground. The dataset for this article can be downloaded from http://www.synesthesiam.com/assets/weather_year.csv and can be imported into Pandas using:

 data = pandas.read_csv(weather_year.csv)

The read_csv function creates a dataframe. A dataframe is a tabular representation of the data read. You can get a summary of the dataset by printing the object. The output of the print is as follows:

data

<class pandas.core.frame.DataFrame>
Int64Index: 366 entries, 0 to 365
Data columns:
EDT 366 non-null values
Max TemperatureF 366 non-null values
Mean TemperatureF 366 non-null values
Min TemperatureF 366 non-null values
Max Dew PointF 366 non-null values
MeanDew PointF 366 non-null values
Min DewpointF 366 non-null values
Max Humidity 366 non-null values
Mean Humidity 366 non-null values
Min Humidity 366 non-null values
Max Sea Level PressureIn 366 non-null values
Mean Sea Level PressureIn 366 non-null values
Min Sea Level PressureIn 366 non-null values
Max VisibilityMiles 366 non-null values
Mean VisibilityMiles 366 non-null values
Min VisibilityMiles 366 non-null values
Max Wind SpeedMPH 366 non-null values
Mean Wind SpeedMPH 366 non-null values
Max Gust SpeedMPH 365 non-null values
PrecipitationIn 366 non-null values
CloudCover 366 non-null values
Events 162 non-null values
WindDirDegrees 366 non-null values
dtypes: float64(4), int64(16), object(3)

As you can see, there are 366 entries in the given dataframe. You can get the column names using data.columns.
The output of the command is given below:

data.columns
Index([EDT, Max TemperatureF, Mean TemperatureF, Min TemperatureF, Max Dew PointF, MeanDew PointF, Min DewpointF, Max Humidity,  Mean Humidity,  Min Humidity,  Max Sea Level PressureIn,  Mean Sea Level PressureIn,  Min Sea Level PressureIn,  Max VisibilityMiles,  Mean VisibilityMiles,  Min VisibilityMiles,  Max Wind SpeedMPH,  Mean Wind SpeedMPH,  Max Gust SpeedMPH, PrecipitationIn,  CloudCover,  Events,  WindDirDegrees], dtype=object)

To print a particular column of the dataframe, you can simply index it as data[‘EDT’] for a single column or data[[‘EDT’,’Max Humidity’]] for multiple columns.  The output for data[‘EDT’] is:

data[EDT]

0     2012-3-10
1     2012-3-11
2     2012-3-12
3     2012-3-13
4     2012-3-14
5     2012-3-15
6     2012-3-16
...
...
...
361     2013-3-6
362     2013-3-7
363     2013-3-8
364     2013-3-9
365    2013-3-10
Name: EDT, Length: 366

And the output for data[[EDT,Max Humidity]] is:

data[[EDT,Max Humidity]]
<class pandas.core.frame.DataFrame>
Int64Index: 366 entries, 0 to 365
Data columns:
EDT 366 non-null values
Max Humidity 366 non-null values
dtypes: int64(1), object(1)

Sometimes, it may be useful to only view a part of the data, just so that you can get a sense of what kind of data you are dealing with. Here you can use the head and tail functions to view the start and end of your dataframe:

data[Max Humidity].head()
0 74
1 78
2 90
3 93
4 93
Name: Max Humidity

Note: The head and tail functions take a parameter which sets the number of rows to be displayed. And can be used as data[Max Humidity].head(n), where ‘n’ is the number of rows. The default is 5.

Working with columns
Now that we have a basis on which to work with our dataframe, we can explore various useful functions provided by Pandas like std to compute the standard deviation, mean to compute the average value, sum to compute the sum of all elements in a column, etc. So if you want to compute the mean of the Max Humidity column, for instance, you can use the following commands:

data['Max Humidity'].mean()
90.027322404371589
data['Max Humidity'].sum()
32950
data['Max Humidity'].std()
9.10843757197798

 Note: Most of the Pandas functions ignore NaNs, by default. These regularly occur in data and a convenient way of handling them must be established. This topic is covered more in detail later in this article.

The std and sum function can be used in a similar manner. Also, rather than running these functions on individual columns, you can run them on the entire dataframe, as follows:

data.mean()
Max TemperatureF 66.803279
Mean TemperatureF 55.683060
Min TemperatureF 44.101093
Max Dew PointF 49.549180
MeanDew PointF 44.057377
Min DewpointF 37.980874
Max Humidity 90.027322
Mean Humidity 67.860656
Min Humidity 45.193989
Max Sea Level PressureIn 30.108907
Mean Sea Level PressureIn 30.022705
Min Sea Level PressureIn 29.936831
Max VisibilityMiles 9.994536
Mean VisibilityMiles 8.732240
Min VisibilityMiles 5.797814
Max Wind SpeedMPH 16.418033
Mean Wind SpeedMPH 6.057377
Max Gust SpeedMPH 22.764384
CloudCover 2.885246
WindDirDegrees 189.704918

Using apply for bulk operations
As we have already seen, functions like mean, std and sum work on entire columns, but sometimes it may be useful to apply our own functions to entire columns of the dataframe. For this purpose, Pandas provides the apply function, which takes an anonymous function as a parameter and applies to every element in the column. In this example, let us try to get the square of every element in a column. We can do this with the following code:

data[Max Humidity].apply(lambda d: d**2)

0      5476
1      6084
2      8100
3      8649
4      8649
5      8100
...
...
...
361     8464
362     7225
363     7744
364     5625
365     2916
Name: Max Humidity, Length: 366

Note: In the Lambda function, the parameter d is implicitly passed to it by Pandas, and contains each element of the a column.

Now you may wonder why you can’t just do this with a loop. Well, the answer is that this operation was written in one single line, which saves code writing time and is much easier to read.

Dealing with NaN values
Pandas provides a function called isnull, which returns a ‘True’ or ‘False’ value depending on whether the value of an element in the column is NaN or None. These values are treated as missing values from the dataset, and so it is always convenient to deal with them separately. We can use the apply function to test every element in a column to see if any NaNs are present. You can use the following command:

e = data[Events].apply(lambda d: pandas.isnull(d))
e
0      True
1     False
2     False
3      True
4      True
5     False
...
361    False
362     True
363     True
364     True
365     True
Name:  Events, Length: 366

As you can see, a list of Booleans was returned, representing values that are NaN. Now there are two options of how to deal with the NaN values. First, you can choose to drop all rows with NaN values using the dropna function, in the following manner:

data.dropna(subset=[Events])

<class pandas.core.frame.DataFrame>
Int64Index: 162 entries, 1 to 361
Data columns:
EDT                                   162  non-null values
Max TemperatureF                  162  non-null values
Mean TemperatureF                 162  non-null values
Min TemperatureF                 162  non-null values
Max Dew PointF                    162  non-null values
MeanDew PointF                    162  non-null values
Min DewpointF                     162  non-null values
Max Humidity                      162  non-null values
 Mean Humidity                    162  non-null values
 Min Humidity                     162  non-null values
 Max Sea Level PressureIn        162  non-null values
 Mean Sea Level PressureIn   162  non-null values
 Min Sea Level PressureIn     162  non-null values
 Max VisibilityMiles              162  non-null values
 Mean VisibilityMiles             162  non-null values
 Min VisibilityMiles              162  non-null values
 Max Wind SpeedMPH           162  non-null values
 Mean Wind SpeedMPH         162  non-null values
 Max Gust SpeedMPH            162  non-null values
PrecipitationIn                   162  non-null values
 CloudCover                       162  non-null values
 Events                               162  non-null values
 WindDirDegrees                   162  non-null values
dtypes: float64(4), int64(16), object(3)

As you can see, there are only 162 rows, which don’t contain NaNs in the column Events. The other option you have is to replace the NaN values with something easier to deal with using the fillna function. You can do this in the following manner:

data[Events].fillna()

0
1                  Rain
2                  Rain
3
4
5     Rain-Thunderstorm
6
7      Fog-Thunderstorm
8                  Rain
362
363
364
365
Name:  Events, Length: 366

Accessing individual rows
So far we have discussed methods dealing with indexing entire columns, but what if you want to access a specific row in your dataframe? Well, Pandas provides a function called irow, which lets you get the value of a specific row. You can use it as follows:

data.irow(0)

EDT                           2012-3-10
Max TemperatureF                         56
Mean TemperatureF                        40
Min TemperatureF                         24
Max Dew PointF                           24
MeanDew PointF                           20
Min DewpointF                            16
Max Humidity                             74
 Mean Humidity                           50
 Min Humidity                            26
 Max Sea Level PressureIn             30.53
 Mean Sea Level PressureIn            30.45
 Min Sea Level PressureIn             30.34
 Max VisibilityMiles                     10
 Mean VisibilityMiles                    10
 Min VisibilityMiles                     10
 Max Wind SpeedMPH                       13
 Mean Wind SpeedMPH                       6
 Max Gust SpeedMPH                       17
PrecipitationIn                            0.00
 CloudCover                               0
 Events                                    NaN
 WindDirDegrees                         138
Name: 0

Note: Indices start from 0 for indexing the rows.

Filtering
Sometimes you may need to find rows of special interest to you. Let’s suppose we want to find out data points in our data frame, which have a mean temperature greater than 40 and less than 50.You can filter out values from your dataframe using the following syntax:

data[(data['Mean TemperatureF']>40) & (data['Mean TemperatureF']<50)]

<class 'pandas.core.frame.DataFrame'>
Int64Index: 51 entries, 1 to 364
Data columns:
EDT                                51  non-null values
Max TemperatureF                  51  non-null values
Mean TemperatureF                 51  non-null values
Min TemperatureF                  51  non-null values
Max Dew PointF                    51  non-null values
MeanDew PointF                    51  non-null values
Min DewpointF                     51  non-null values
Max Humidity                      51  non-null values
 Mean Humidity                    51  non-null values
 Min Humidity                     51  non-null values
 Max Sea Level PressureIn    51  non-null values
 Mean Sea Level PressureIn  51  non-null values
 Min Sea Level PressureIn     51  non-null values
 Max VisibilityMiles              51  non-null values
 Mean VisibilityMiles             51  non-null values
 Min VisibilityMiles              51  non-null values
 Max Wind SpeedMPH           51  non-null values
 Mean Wind SpeedMPH         51  non-null values
 Max Gust SpeedMPH            51  non-null values
PrecipitationIn                   51  non-null values
 CloudCover                       51  non-null values
 Events                               23  non-null values
 WindDirDegrees                   51  non-null values
dtypes: float64(4), int64(16), object(3)

Note: The output of the condition data[Mean TemperatureF]>40 and  data[Mean TemperatureF]<50 return a NumPy array, and we must use the brackets to separate them before using the & operator, or else you will get an error message saying that the expression is ambiguous.

Now you can easily get meaningful data from your dataframe by simply filtering out the data that you aren’t interested in. This provides you with a very powerful technique that you can use in conjunction with higher Pandas functions to understand your data.

Getting data out
You can easily write data out by using the to_csv function to write your data out as a csv file.

data.to_csv(weather-mod.csv)

Want to make a separate tab? No problem.  

data.to_csv(data/weather-mod.tsv, sep=\t)

 Note: Generally, the dataframe can be indexed by any Boolean NumPy array. In a sense, only values that are true will be retained. For example, if we use the variable e, (e = data[Events].apply(lambda d: pandas.isnull(d))) which contains the list of all rows that have NaN values for data[Events], as data[e], we will get a dataframe which has rows that only have NaN values for data[Events]

You can also write your data out in other formats that can be found on the Pandas doc.

This article only covers the basics of what can be done with Pandas. It also supports a lot of higher level functions like plotting data to give a better feel of the data being dealt with. If you want to learn more about Pandas, check the online documentation. It is very readable, user-friendly and is a great place to get a better understanding of how Pandas works. Frameworks like Pandas let a Python application take advantage of such data analysis tools easily.  There are also other languages which support data analysis and you may want to check them out. These include R, MATLAB, Julia and Octave. To wrap up, these languages and packages greatly increase your  understanding of data. In a world where data is becoming increasingly important, it is critical that we deal with our data smartly.

LEAVE A REPLY

Please enter your comment!
Please enter your name here