Developers

Everything you need to know about regression models in Python

December 19, 2016

844

Python is a very versatile piece of software. It offers many choices for Web applications and can be used in thousands of third party modules. Python is also an effective tool for regression analysis.

I think that machine learning, data mining, chemometrics and statistics are not the same thing but there are aspects that are common to all four topics. For example, a chapter about regression analysis (the techniques for estimating the numeric relationship among data elements) can be found in books on each of them. Some examples include Reference 1 chapter 6, Reference 2 chapter 13, Reference 3 chapter 2, and Reference 4 chapter 7. In the real world, sometimes there are problems with the fundamentals of linear algebra. There is also the idea that a higher explained variance or a higher correlation means a better model. Or that the only way is the use of a commercial spreadsheet. For these reasons, I think that it’s necessary to clarify some things with an introductory article.

From a more technical point of view, for example, if I need to measure the concentration of a certain substance in an underground water sample, I must be capable of measuring a concentration less than 10 times the minimum concentration admitted. For example, if the contamination limit is 1.1 micrograms per litre, I must, at least, be capable of measuring a concentration equal to 0.11 micrograms per litre (these values depend on national laws). With a limit settled by the law, the choice of a regression model becomes even more important. That choice is important also in ion chromatography, because it’s better to speak of calibration rather than linearity. The calibration can’t always be represented by a regression of the first order, but can be represented sometimes, by a second degree curve. So the concept of linearity should be carefully considered, particularly with regard to the measurement for suppressed conductivity (see Reference 5). This problem is not considered in some methods in which the calibration is valid only if it’s linear, discarding a priori any other type of model.

Figure 1 A linear weighted model — Figure 1: A linear weighted model

The toolbox

This article is based on Mint 18 Xfce, Emacs 24.5, Geany and Anaconda. The last software can be freely downloaded from https://www.continuum.io/downloads. Then, from the terminal window, type bash Anaconda3-4.1.1-Linux-x86.sh and just follow the instructions on the screen. Because the packages PrettyTable and Seaborn are not available in the Anaconda distribution, I have easily installed them by typing the following code in a terminal window:

cd /home/<your-username>/anaconda_4.1.1/bin
easy_install prettytable
easy_install seaborn

My Emacs configuration for Python on Linux is almost the same as on Windows (see the September 2016 issue of OSFY). There is only one difference, because I have replaced the line given below:

(add-to-list ‘exec-path “C:\\WinPython-32bit-3.4.4.2\\python-3.4.4”)

… with the following ones:

(setq python-shell-interpreter “/home/<your-username>/anaconda_4.1.1/bin/python”
org-babel-python-command “/home/<your-username>/anaconda_4.1.1/bin/python”)

With respect to Geany, I have settled only the ‘Execute’ command /home/<your username>/anaconda_4.1.1/bin/python “%f”

Figure 2 Python and LaTeX — Figure 2: Pytho and LaTex

Figure 3 Pandas and Seaborn — Figure 3: Pandas and Seaborn

Regression models

The first way to build a polynomial model is by using matrix algebra. Let’s consider the following code, which is used instead of equations:

w=xlrd.open_workbook(“filename.xls”).sheet_by_name(“Sheet1”)
x,y=w.col_values(0),w.col_values(1)
degree=1
weight=1/array(x)**2
f=zeros((w.nrows,degree+1))
for i in range(0,w.nrows):
for j in range(0,degree+1):
f[i,j]=x[i]**j
q=diag(weight)
fT=dot(transpose(f),q)
fTf=dot(fT,f)
# fTf=fTf+diag([0.001]*(degree+1))
fTy=dot(fT,y)
coef=dot(inv(fTf),fTy)
xx=linspace(min(x),max(x),1000)
yp=polyval(coef[::-1],xx)

In the example, the degree is equal to 1 (a linear model) and the weight is equal to 1/x2 (this is a weighted model). For a simple linear model, degree=1 and weight=1 (or weight=1/array(x)**0). For a simple quadratic model, degree=2 and weight=1. With some simple matrix calculations, it’s possible to know the coefficients coef of the model and then plot the curve xx vs yp. For fitting through zero, set weight=1/array(x)**0 and for j in range(1,degree+1); add a small constant, for example 0.001, to the diagonal of fTf. For a linear model, the predicted values (x values back calculated) are calculated with predicted=(y-intercept)/slope and, for a quadratic model, with predicted=(-b+sqrt(b**2-(4*a*delta)))/(2*a) in which delta=c-y. The a, b and c values are about the coefficient of the second degree term, the coefficient of the first degree term and the constant value, respectively. The accuracy is then calculated as accuracy=predicted*100/x. The following tables present a linear simple and a linear weighted model for the same experimental data. I’m not a big fan of the squared correlation coefficient, but, considering only that, I should choose the linear model simple, because it’s the one with the highest value of R2. I think it would be better to consider the ‘Accuracy’ column in both tables and choose the weighted model. In this particular example, we must also consider what is already mentioned in the ‘Introduction’ section about the values established by law. Carry out these calculations using a spreadsheet (and without using macros), As I think that’s unnecessarily complicated.

Linear simple
Intercept -9343.97333
Slope 16984.40793
R2 0.99860
+------+---------+-----------+-------------+
| X    |    Y    | Predicted |   Accuracy  |
+------+---------+-----------+-------------+
| 0.1  |   1633  |   0.65    |    646.30   |
| 0.5  |   7610  |   1.00    |    199.64   |
| 1.0  |  14930  |   1.43    |    142.92   |
| 2.5  |  35347  |   2.63    |    105.25   |
| 5.0  |  69403  |   4.64    |    92.73    |
| 10.0 |  141643 |   8.89    |    88.90    |
| 25.0 |  402754 |   24.26   |    97.05    |
| 50.0 |  850161 |   50.61   |    101.21   |
+-----+----------+-----------+-------------+

Linear weighted 1/x2
Intercept 123.19563
Slope 15010.23851
R2 0.99422
+--------+---------+--------------+-----------+
|   X    |    Y    |   Predicted  | Accuracy  |
+--------+---------+--------------+-----------+
|  0.1   |   1633  |     0.10     |  100.58   |
|  0.5   |   7610  |     0.50     |   99.76   |
|  1.0   |   14930 |     0.99     |   98.64   |
|  2.5   |   35347 |     2.35     |   93.87   |
|  5.0   |   69403 |     4.62     |   92.31   |
|  10.0  |  141643 |     9.43     |   94.28   |
|  25.0  | 402754  |     26.82    |   107.29  |
|  50.0  | 850161  |     56.63    |   113.26  |
+--------+---------+--------------+-----------+

Figure 1 shows the plot for another data set, a simulation more or less typical for a pharmacokinetics study.
Another way is to build a certain matrix via vstack and then apply lstsq on it. Each A matrix created with vstack has the structure shown in the following examples. There are two important things: if the model is built to fit through zero, there is a column of zeros and, if the model is quadratic, there is a column of squared x. The coefficients of each model are a, b and c (if quadratic).

Linear fit simple
A=vstack([x,[1]*w.nrows]).T
a,b=linalg.lstsq(A,y)[0]
        +------+-----+
A =     | 0.1  | 1.0 |
        | 0.5  | 1.0 |
        | 1.0  | 1.0 |
        | 2.5  | 1.0 |
        | 5.0  | 1.0 |
        | 10.0 | 1.0 |
        | 25.0 | 1.0 |
        | 50.0 | 1.0 |
        +------+-----+

Linear fit through zero
A=vstack([x,[0]*w.nrows]).T
a,b=linalg.lstsq(A,y)[0]
         +------+-----+
A =      |  0.1 | 0.0 |
         |  0.5 | 0.0 |
         |  1.0 | 0.0 |
         |  2.5 | 0.0 |
         |  5.0 | 0.0 |
         | 10.0 | 0.0 |
         | 25.0 | 0.0 |
         | 50.0 | 0.0 |
         +------+-----+

Quadratic fit simple
A=vstack([x,[1]*w.nrows,array(x)**2]).T
a,b,c=linalg.lstsq(A,y)[0]
         +------+-----+--------+
A =      |  0.1 | 1.0 |  0.01  |
         |  0.5 | 1.0 |  0.25  |
         |  1.0 | 1.0 |  1.0   |
         |  2.5 | 1.0 |  6.25  |
         |  5.0 | 1.0 |  25.0  |
         | 10.0 | 1.0 |  100.0 |
         | 25.0 | 1.0 |  625.0 |
         | 50.0 | 1.0 | 2500.0 |
         +------+-----+--------+

Quadratic fit through zero
A=vstack([x,array(x)**2,[0]*w.nrows]).T
a,b,c=linalg.lstsq(A,y)[0]
         +------+--------+-----+
A =      |  0.1 |  0.01  | 0.0 |
         |  0.5 |  0.25  | 0.0 |
         |  1.0 |  1.0   | 0.0 |
         |  2.5 |  6.25  | 0.0 |
         |  5.0 |  25.0  | 0.0 |
         | 10.0 |  100.0 | 0.0 |
         | 25.0 |  625.0 | 0.0 |
         | 50.0 | 2500.0 | 0.0 |
         +------+--------+-----+

Probably, the simplest way is the use of polyfit with the syntax coef=polyfit(x,y,degree). For a better graphical presentation, I would like to say something about the use of LaTeX and about the couple Pandas + Seaborn, specifying also that the Pandas and Seaborn packages have more complex applications than the one presented here. LaTeX must be previously installed on your system; then just add rc(“text”,usetex=True) in your script. The result is shown in Figure 2. About Pandas, in the following example, a DataFrame with three columns is created. Then, x vs y data are plotted using Seaborn with regplot (lw=0 and marker=”o”) and, last, the linear model is plotted always with regplot but with some different options (lw=1 and marker=””). The result is shown in Figure 3, which is practically the same as with ggplot for R. Note that the ggplot plotting system exists also for Python and it is available at http://ggplot.yhathq.com. Another way to obtain a ggplot like plot is with the use of a style sheet. For example, put style.use(“ggplot”) before the plot command. To know all the styles available just use print(style.available).

import pandas as p
import seaborn as s
coef=polyfit(x,y,1)
yp=polyval(coef,x)
df=p.DataFrame({“Conc”:x,”Abs”:y,”Abs pred”:yp})
s.regplot(x=”Conc”,
          y=”Abs”,
          data=df,
          ci=False,
          scatter_kws={“color”:”g”,”alpha”:0.5,”s”:90},
          line_kws={“color”:”w”,”alpha”:0,”lw”:0},
          marker=”o”)
s.regplot(x=”Conc”,
          y=”Abs pred”,
          data=df,
          ci=False,
          scatter_kws={“color”:”w”,”alpha”:0,”s”:0},
          line_kws={“color”:”k”,”alpha”:1,”lw”:1},
          marker=””)

The data frame here is printed via prettytable:

+------+-------+-----------+
| Conc |   Abs |  Abs pred |
+------+-------+-----------+
| 0.05 | 0.046 |   0.053   |
| 0.10 | 0.093 |   0.092   |
| 0.20 | 0.171 |   0.172   |
| 0.25 | 0.217 |   0.212   |
| 0.50 | 0.418 |   0.412   |
| 1.00 | 0.807 |   0.811   |
+------+-------+-----------+

I have never used the confidence band in practice, but there are several ways to calculate and plot it. Here, a calculation is proposed based on the t-distribution from scipy.stats, where ip is the part below (inferior) the model and sp the part above (superior) it. The data set is taken from Reference 7. A nice explanation about the confidence intervals is reported, for example, in Reference 2 pages 86-91.

from scipy import stats
n=w.nrows
coef=polyfit(x,y,1)
yp=polyval(coef,x)
xx=linspace(min(x),max(x),1000)
dx=sum((x-mean(x))**2)
dy=sqrt(sum((y-yp)**2)/(n-2))
dd=((xx-mean(x))**2/dx)
t=stats.t.ppf(0.95,n-2)
ip=coef[1]+coef[0]*xx-(t*dy*sqrt(1/n+dd))
sp=coef[1]+coef[0]*xx+(t*dy*sqrt(1/n+dd))
plot(x,y,”o”,markerfacecolor=”lightgreen”,zorder=3)
plot(x,yp,”k-”,zorder=3)
fill_between(xx,ip,sp,where=sp>=ip,edgecolor=”b”,facecolor=”b”,alpha=0.2,interpolate=True)

Figure 4 Confidence band — Figure 4: Confidence band

Figure 5 StatsModels OLS — Figure 5: StatsModels OLS

The StatsModels package

Another way to build a regression model is by using the StatsModels package in combination with the Pandas package. An example is shown in the following code. The values for x and y are read from an xls file, then the weight is defined as 1/x2. Using the Pandas package, a DataFrame is defined for the couple x, y and a Series for the weights. Two types of regression are then calculated: OLS (Ordinary Least Squares, the one previously called the linear fit simple) and WLS (Weighted Least Squares). Last, both models are plotted with a simple plot. More information can be printed for slope ols_fit.params[1], intercept ols_fit.params[0], r-squared ols_fit.rsquared, a little report ols_fit.summary() or the equivalent for the weighted model using ‘wls’ instead of ‘ols’. Two examples are shown in Figures 7 and 8. Further information, such as residuals and Cook’s distance, can be printed or plotted using resid, resid_pearson and get_influence() respectively. A nice and large collection of examples is presented in Reference 8.

import pandas as p
import xlrd
from statsmodels.formula.api import ols,wls
w=xlrd.open_workbook(“pk.xls”).sheet_by_name(“Sheet1”)
x,y=w.col_values(0),w.col_values(1)
weight=1/array(x)**2
df=p.DataFrame({“Conc”:x,”Resp”:y})
w=p.Series(weight)
ols_fit=ols(“Resp~Conc”,data=df).fit()
wls_fit=wls(“Resp~Conc”,data=df,weights=w).fit()
figure(0)
plot(x,y,”o”,markerfacecolor=”lightgreen”,zorder=3)
plot(df[“Conc”],ols_fit.predict(),color=”k”,zorder=3)
figure(1)
plot(x,y,”o”,markerfacecolor=”lightgreen”,zorder=3)
plot(df[“Conc”],wls_fit.predict(),color=”k”,zorder=3)

Figure 6 StatsModels WLS — Figure 6: StatsModels WLS

Figure 7 OLS summary — Figure 7: OLS summary

Figure 8 WLS summary — Figure 8: WLS summary

Even for the regression models, the Python language offers different ways to calculate the same things. Some of them have been presented here. I would like to say something about the piecewise, or segmented, regression models. In Python there are, for example, some techniques based on numpy.piecewise and scipy.optimize, but I’m not completely satisfied by them. For a segmented regression, as previously discussed in the article, ‘Get analytical with R’ published in September 2015 issue of OSFY, I prefer to use R with the package segmented. An interesting Python port called pysegmented is available from https://github.com/e-koch/pysegmented but, at present, it’s only a partial port. More complex applications can be done with the Scikit-learn package (Reference 9), but this is another story, perhaps for another article in the future.

LEAVE A REPLY Cancel reply

Thought Leaders

HOW TOs

MOST POPULAR

Open Journey

EDITOR PICKS

POPULAR POSTS

POPULAR CATEGORY