Regression Analysis with Python

Introduction

  • In God we trust, all others must bring data. (W. Edwards Deming)
  • In Data we trust, all others must bring data.

 

Statistics

  • Statistics is the science of uncertainty & variability
  • Statistics turns data into information
  • Data -> Information -> Knowledge -> Wisdom
  • Data Driven Decisions (3Ds)
  • Statistics is the interpretation of Science
  • Statistics is the Art & Science of learning from data

 

Variable

  • Characteristic that may vary from individual to individual
  • Height, Weight, CGPA etc

 

Measurement

  • Process of assigning numbers or labels to objects or states in accordance with logically accepted rules

 

Measurement Scales

  • Nominal Scale: Obersvations may be classified into mutually exclusive & exhaustive classes or categories
  • Ordinal Scale: Obersvations may be ranked
  • Interval Scale: Difference between obersvations is meaningful
  • Ratio Scale: Ratio between obersvations is meaningful & true zero point

 

Regression Analysis

  • Quantifying dependency of a normal response on quantitative explanatory variable(s)

 

Population Regression Function

Figure 1: Population Regression Function

 

Simple Linear Regression

  • Quantifying dependency of a normal response on a quantitative explanatory variable

 

Example

Weekly Income ($) Weekly Expenditures ($)
80 70
100 65
120 90
140 95
160 110
180 115
200 120
220 140
240 155
260 150
Income = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
Expend = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]
import pandas as pd
df1 = pd.DataFrame(
 {
   "Income": Income
 , "Expend": Expend
 }
 )

print(df1)
   Expend  Income
0      70      80
1      65     100
2      90     120
3      95     140
4     110     160
5     115     180
6     120     200
7     140     220
8     155     240
9     150     260
from matplotlib import pyplot as plt
fig = plt.figure()
plt.scatter(
  df1["Income"]
, df1["Expend"]
, color = "green"
, marker = "o"
)
plt.title("Scatter plot of Weekly Income (\$) and Weekly Expenditures (\$)")
plt.xlabel("Weekly Income (\$)")
plt.ylabel("Weekly Expenditures (\$)")
plt.show()

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
Reg1 = ols(formula = "Expend ~ Income", data = df1)
Fit1 = Reg1.fit()
print(Fit1.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                 Expend   R-squared:                       0.962
Model:                            OLS   Adj. R-squared:                  0.957
Method:                 Least Squares   F-statistic:                     202.9
Date:                Tue, 02 Oct 2018   Prob (F-statistic):           5.75e-07
Time:                        15:25:26   Log-Likelihood:                -31.781
No. Observations:                  10   AIC:                             67.56
Df Residuals:                       8   BIC:                             68.17
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     24.4545      6.414      3.813      0.005       9.664      39.245
Income         0.5091      0.036     14.243      0.000       0.427       0.592
==============================================================================
Omnibus:                        1.060   Durbin-Watson:                   2.680
Prob(Omnibus):                  0.589   Jarque-Bera (JB):                0.777
Skew:                          -0.398   Prob(JB):                        0.678
Kurtosis:                       1.891   Cond. No.                         561.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print(Fit1.params)
Intercept    24.454545
Income        0.509091
dtype: float64
print(Fit1.fittedvalues)
0     65.181818
1     75.363636
2     85.545455
3     95.727273
4    105.909091
5    116.090909
6    126.272727
7    136.454545
8    146.636364
9    156.818182
dtype: float64
print(Fit1.resid)
0     4.818182
1   -10.363636
2     4.454545
3    -0.727273
4     4.090909
5    -1.090909
6    -6.272727
7     3.545455
8     8.363636
9    -6.818182
dtype: float64
print(Fit1.bse)
Intercept    6.413817
Income       0.035743
dtype: float64
print(Fit1.centered_tss)
8890.0
print(anova_lm(Fit1))
           df       sum_sq      mean_sq           F        PR(>F)
Income    1.0  8552.727273  8552.727273  202.867925  5.752746e-07
Residual  8.0   337.272727    42.159091         NaN           NaN
fig = plt.figure()
plt.scatter(
  df1["Income"]
, df1["Expend"]
, color = "green"
, marker = "o"
)
plt.plot(df1["Income"], Fit1.fittedvalues)
plt.title("Regression plot of Weekly Income (\$) and Weekly Expenditures (\$)")
plt.xlabel("Weekly Income (\$)")
plt.ylabel("Weekly Expenditures (\$)")
plt.show()

Multiple Linear Regression

  • Quantifying dependency of a normal response on two or more quantitative explanatory variables

 

Example

Fertilizer (Kg) Rainfall (mm) Yield (Kg)
100 10 40
200 20 50
300 10 50
400 30 70
500 20 65
600 20 65
700 30 80
import numpy as np
Fertilizer = np.arange(100, 800, 100)
Rainfall   = [10, 20, 10, 30, 20, 20, 30]
Yield      = [40, 50, 50, 70, 65, 65, 80]
import pandas as pd
df2 = pd.DataFrame(
 {
   "Fertilizer": Fertilizer
 , "Rainfall": Rainfall
 , "Yield": Yield
 }
 )

print(df2)
   Fertilizer  Rainfall  Yield
0         100        10     40
1         200        20     50
2         300        10     50
3         400        30     70
4         500        20     65
5         600        20     65
6         700        30     80
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111, projection = "3d")
ax.scatter(
  df2["Fertilizer"]
, df2["Rainfall"]
, df2["Yield"]
, color = "green"
, marker = "o"
, alpha  = 1
)
ax.set_xlabel("Fertilizer")
ax.set_ylabel("Rainfall")
ax.set_zlabel("Yield")
plt.show()

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
Reg2 = ols(formula = "Yield ~ Fertilizer + Rainfall", data = df2)
Fit2 = Reg2.fit()
print(Fit2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Yield   R-squared:                       0.981
Model:                            OLS   Adj. R-squared:                  0.972
Method:                 Least Squares   F-statistic:                     105.3
Date:                Tue, 02 Oct 2018   Prob (F-statistic):           0.000347
Time:                        15:25:27   Log-Likelihood:                -13.848
No. Observations:                   7   AIC:                             33.70
Df Residuals:                       4   BIC:                             33.53
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept     28.0952      2.491     11.277      0.000      21.178      35.013
Fertilizer     0.0381      0.006      6.532      0.003       0.022       0.054
Rainfall       0.8333      0.154      5.401      0.006       0.405       1.262
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   2.249
Prob(Omnibus):                    nan   Jarque-Bera (JB):                0.705
Skew:                          -0.408   Prob(JB):                        0.703
Kurtosis:                       1.677   Cond. No.                     1.28e+03
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.28e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
print(Fit2.params)
Intercept     28.095238
Fertilizer     0.038095
Rainfall       0.833333
dtype: float64
print(Fit2.fittedvalues)
0    40.238095
1    52.380952
2    47.857143
3    68.333333
4    63.809524
5    67.619048
6    79.761905
dtype: float64
print(Fit2.resid)
0   -0.238095
1   -2.380952
2    2.142857
3    1.666667
4    1.190476
5   -2.619048
6    0.238095
dtype: float64
print(Fit2.bse)
Intercept     2.491482
Fertilizer    0.005832
Rainfall      0.154303
dtype: float64
print(Fit2.centered_tss)
1150.0
print(anova_lm(Fit2))
             df      sum_sq     mean_sq           F    PR(>F)
Fertilizer  1.0  972.321429  972.321429  181.500000  0.000176
Rainfall    1.0  156.250000  156.250000   29.166667  0.005690
Residual    4.0   21.428571    5.357143         NaN       NaN
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import cm
fig = plt.figure()
ax = fig.add_subplot(111, projection = "3d")
ax.scatter(
  df2["Fertilizer"]
, df2["Rainfall"]
, df2["Yield"]
, color = "green"
, marker = "o"
, alpha  = 1
)
ax.set_xlabel("Fertilizer")
ax.set_ylabel("Rainfall")
ax.set_zlabel("Yield")
x_surf = np.arange(100, 720, 20)                
y_surf = np.arange(10, 32, 2)
x_surf, y_surf = np.meshgrid(x_surf, y_surf)

exog = pd.core.frame.DataFrame({
   "Fertilizer": x_surf.ravel()
 , "Rainfall": y_surf.ravel()
 })
out = Fit2.predict(exog = exog)
ax.plot_surface(
           x_surf
         , y_surf
         , out.reshape(x_surf.shape)
         , rstride=1
         , cstride=1
         , color="None"
         , alpha = 0.4
         )
plt.show()

Polynomial Regression Analysis

  • Quantifying non-linear dependency of a normal response on quantitative explanatory variable(s)

Example

An experiment was conducted to evaluate the effects of different levels of nitrogen. Three levels of nitrogen: 0, 10 and 20 grams per plot were used in the experiment. Each treatment was replicated twice and data is given below:

Nitrogen Yield
0 5
0 7
10 15
10 17
20 9
20 11
Nitrogen = [0, 0, 10, 10, 20, 20]
Yield    = [5, 7, 15, 17,  9, 11]
import pandas as pd
df3 = pd.DataFrame(
 {
   "Nitrogen": Nitrogen
 , "Yield": Yield
 }
 )

print(df3)
   Nitrogen  Yield
0         0      5
1         0      7
2        10     15
3        10     17
4        20      9
5        20     11
from matplotlib import pyplot as plt
fig = plt.figure()
plt.scatter(
  df3["Nitrogen"]
, df3["Yield"]
, color = "green"
, marker = "o"
)
plt.title("Scatter plot of Nitrogen and Yield")
plt.xlabel("Nitrogen")
plt.ylabel("Yield")
plt.show()

from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
Reg3 = ols(formula = "Yield ~ Nitrogen + I(Nitrogen**2)", data = df3)
Fit3 = Reg3.fit()
print(Fit3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  Yield   R-squared:                       0.944
Model:                            OLS   Adj. R-squared:                  0.907
Method:                 Least Squares   F-statistic:                     25.33
Date:                Tue, 02 Oct 2018   Prob (F-statistic):             0.0132
Time:                        15:25:27   Log-Likelihood:                -8.5136
No. Observations:                   6   AIC:                             23.03
Df Residuals:                       3   BIC:                             22.40
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
====================================================================================
                       coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------------
Intercept            6.0000      1.000      6.000      0.009       2.818       9.182
Nitrogen             1.8000      0.255      7.060      0.006       0.989       2.611
I(Nitrogen ** 2)    -0.0800      0.012     -6.532      0.007      -0.119      -0.041
==============================================================================
Omnibus:                          nan   Durbin-Watson:                   3.333
Prob(Omnibus):                    nan   Jarque-Bera (JB):                1.000
Skew:                           0.000   Prob(JB):                        0.607
Kurtosis:                       1.000   Cond. No.                         418.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
print(Fit3.params)
Intercept           6.00
Nitrogen            1.80
I(Nitrogen ** 2)   -0.08
dtype: float64
print(Fit3.fittedvalues)
0     6.0
1     6.0
2    16.0
3    16.0
4    10.0
5    10.0
dtype: float64
print(Fit3.resid)
0   -1.0
1    1.0
2   -1.0
3    1.0
4   -1.0
5    1.0
dtype: float64
print(Fit3.bse)
Intercept           1.000000
Nitrogen            0.254951
I(Nitrogen ** 2)    0.012247
dtype: float64
print(Fit3.centered_tss)
107.333333333
print(anova_lm(Fit3))
                   df     sum_sq    mean_sq          F    PR(>F)
Nitrogen          1.0  16.000000  16.000000   8.000000  0.066276
I(Nitrogen ** 2)  1.0  85.333333  85.333333  42.666667  0.007292
Residual          3.0   6.000000   2.000000        NaN       NaN
fig = plt.figure()
plt.scatter(
  df3["Nitrogen"]
, df3["Yield"]
, color = "green"
, marker = "o"
)
plt.plot(df3["Nitrogen"], Fit3.fittedvalues)
plt.title("Regression plot of Nitrogen and Yield")
plt.xlabel("Nitrogen")
plt.ylabel("Yield")
plt.show()

comments powered by Disqus