Linear Model using Python
Python Basics
Variables and Data Types
Variable Assignment
x = 5
x
# dir(x)
5
Calculations With Variables
x + 2 # Addition
7
x - 2 # Subtraction
3
x*2 # Multiplication
10
x**2 # Exponentiation
25
x%2 # Modulo (remainder after division)
1
x/2 # Division (true division in Python 3)
2.5
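Two related operators worth knowing alongside `%` and `/` are floor division and `divmod`; a quick sketch:

```python
x = 5
print(x // 2)        # floor division: 2
print(divmod(x, 2))  # quotient and remainder together: (2, 1)
```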
Types and Type Conversion
type(5) # Integer
<class 'int'>
type(5.) # Float
<class 'float'>
type('5') # Strings
<class 'str'>
type(5 + 3j) # Complex Number
<class 'complex'>
type(True) # Boolean
<class 'bool'>
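The heading mentions type conversion; the built-in constructors int(), float(), str(), and bool() perform the conversions between these types. A minimal sketch:

```python
print(int('5'))    # string -> integer: 5
print(float(5))    # integer -> float: 5.0
print(str(5.0))    # float -> string: '5.0'
print(int(5.9))    # truncates toward zero: 5
print(bool(0))     # zero is falsy: False
```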
Logical Operators
a1 = True
print(a1)
True
type(a1)
<class 'bool'>
b1 = False
print(b1)
False
type(b1)
<class 'bool'>
True and True
True
True and False
False
False and True
False
False and False
False
True or False
True
not True
False
not False
True
True and not False
True
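Note that `and` and `or` short-circuit: they stop evaluating as soon as the result is determined, and return one of the operands rather than a fresh True/False. A small illustration (the helper `noisy` is only for demonstration):

```python
def noisy(value):
    """Announce evaluation so short-circuiting is visible."""
    print("evaluated:", value)
    return value

print(False and noisy(True))  # right operand is skipped -> False
print(True or noisy(False))   # right operand is skipped -> True
print(0 or "fallback")        # `or` returns the first truthy operand
```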
Comparison
x1 = 30
x1 > 30
False
x1 == 30
True
x1 >= 30
True
x1 > 15
True
not x1 == 42
True
x1 != 42
True
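Comparisons can also be chained, which reads like the mathematical notation:

```python
x1 = 30
print(15 < x1 <= 30)  # same as (15 < x1) and (x1 <= 30) -> True
print(0 <= x1 < 30)   # fails the second comparison -> False
```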
If-Else
x2 = 30 # assign 30 to x2
if x2 > 30:              # predicate: is x2 > 30?
    print("Yes")         # if True, do this
else:
    print("No")          # if False, do this
No
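The same decision can be written in one line as a conditional expression:

```python
x2 = 30
answer = "Yes" if x2 > 30 else "No"  # conditional (ternary) expression
print(answer)  # -> No
```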
Function
def slength1(s):
    """Returns a string describing the
    length of the sequences"""
    if len(s) > 10:
        ans = 'very long'
    else:
        ans = 'normal'
    return ans
help(slength1)
Help on function slength1 in module __main__:
slength1(s)
Returns a string describing the
length of the sequences
slength1("Hello")
'normal'
slength1("HelloHello")
'normal'
slength1("Hello again")
'very long'
help(dir)
Help on built-in function dir in module builtins:
dir(...)
dir([object]) -> list of strings
If called without an argument, return the names in the current scope.
Else, return an alphabetized list of names comprising (some of) the attributes
of the given object, and of attributes reachable from it.
If the object supplies a method named __dir__, it will be used; otherwise
the default dir() logic is used and returns:
for a module object: the module's attributes.
for a class object: its attributes, and recursively the attributes
of its bases.
for any other object: its attributes, its class's attributes, and
recursively the attributes of its class's base classes.
dir(slength1)
['__annotations__', '__call__', '__class__', '__closure__', '__code__', '__defaults__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__get__', '__getattribute__', '__globals__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__kwdefaults__', '__le__', '__lt__', '__module__', '__name__', '__ne__', '__new__', '__qualname__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__']
def slength2(s):
    """Returns a string describing the
    length of the sequences into
    empty, very long, normal and short"""
    if len(s) == 0:
        ans = 'empty'
    elif len(s) > 10:
        ans = 'very long'
    elif len(s) > 7:
        ans = 'normal'
    else:
        ans = 'short'
    return ans
help(slength2)
Help on function slength2 in module __main__:
slength2(s)
Returns a string describing the
length of the sequences into
empty, very long, normal and short
slength2("")
'empty'
slength2("Good Morning")
'very long'
slength2("Greetings")
'normal'
slength2("Hi")
'short'
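Functions can also take default arguments, so the cutoffs in slength2 need not be hard-coded. A sketch (the name slength3 and the parameter names long_cutoff and normal_cutoff are illustrative, not from the original):

```python
def slength3(s, long_cutoff=10, normal_cutoff=7):
    """Classify the length of s; the cutoffs are configurable."""
    if len(s) == 0:
        return 'empty'
    elif len(s) > long_cutoff:
        return 'very long'
    elif len(s) > normal_cutoff:
        return 'normal'
    return 'short'

print(slength3("Hi"))                 # default cutoffs -> 'short'
print(slength3("Hi", long_cutoff=1))  # same string, tighter cutoff -> 'very long'
```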
Help
# help(str)
# dir(str)
# help(str.__add__)
# help(str.capitalize)
# dir(str.capitalize)
Simple Linear Regression
Income = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
Expend = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]
import pandas as pd
# dir(pd)
# help(pd)
# help(pd.DataFrame)
# dir(pd.DataFrame)
df1 = pd.DataFrame(
{
"Income": Income
, "Expend": Expend
}
)
print(df1)
# dir(df1)
# df1.to_html()
# df1.to_latex()
# help(df1.kurt)
# df1.kurt()
# help(df1.kurtosis)
# df1.kurtosis()
# help(df1.mean)
# df1.mean()
Income Expend
0 80 70
1 100 65
2 120 90
3 140 95
4 160 110
5 180 115
6 200 120
7 220 140
8 240 155
9 260 150
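The commented lines above hint at individual summary methods (mean, kurtosis); df1.describe() collects count, mean, standard deviation, and quartiles for all numeric columns in one call:

```python
import pandas as pd

Income = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
Expend = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]
df1 = pd.DataFrame({"Income": Income, "Expend": Expend})

print(df1.describe())  # count, mean, std, min, quartiles, max per column
print(df1.mean())      # Income 170.0, Expend 111.0
```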
from matplotlib import pyplot as plt
fig = plt.figure()
plt.scatter(
x = "Income"
, y = "Expend"
, color = "green"
, marker = "o"
, data = df1
)
plt.title(r"Scatter plot of Weekly Income (\$) and Weekly Expenditures (\$)")
plt.xlabel(r"Weekly Income (\$)")
plt.ylabel(r"Weekly Expenditures (\$)")
plt.show()
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
Reg1 = ols(formula = "Expend ~ Income", data = df1)
Fit1 = Reg1.fit()
print(Fit1.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Expend R-squared: 0.962
Model: OLS Adj. R-squared: 0.957
Method: Least Squares F-statistic: 202.9
Date: Tue, 16 Apr 2019 Prob (F-statistic): 5.75e-07
Time: 10:31:38 Log-Likelihood: -31.781
No. Observations: 10 AIC: 67.56
Df Residuals: 8 BIC: 68.17
Df Model: 1
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 24.4545 6.414 3.813 0.005 9.664 39.245
Income 0.5091 0.036 14.243 0.000 0.427 0.592
==============================================================================
Omnibus: 1.060 Durbin-Watson: 2.680
Prob(Omnibus): 0.589 Jarque-Bera (JB): 0.777
Skew: -0.398 Prob(JB): 0.678
Kurtosis: 1.891 Cond. No. 561.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/home/myaseen/.local/lib/python3.6/site-packages/scipy/stats/stats.py:1416: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=10
"anyway, n=%i" % int(n))
print(Fit1.params)
Intercept 24.454545
Income 0.509091
dtype: float64
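The fitted coefficients can be verified against the closed-form least-squares formulas, slope = Sxy/Sxx and intercept = ybar - slope * xbar:

```python
import numpy as np

Income = np.array([80, 100, 120, 140, 160, 180, 200, 220, 240, 260])
Expend = np.array([70, 65, 90, 95, 110, 115, 120, 140, 155, 150])

x_bar, y_bar = Income.mean(), Expend.mean()        # 170.0 and 111.0
Sxy = np.sum((Income - x_bar) * (Expend - y_bar))  # 16800.0
Sxx = np.sum((Income - x_bar) ** 2)                # 33000.0

slope = Sxy / Sxx                  # 16800 / 33000 = 0.509091
intercept = y_bar - slope * x_bar  # 111 - 0.509091 * 170 = 24.454545
print(round(slope, 6), round(intercept, 6))
```

These match Fit1.params above.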
print(Fit1.fittedvalues)
0 65.181818
1 75.363636
2 85.545455
3 95.727273
4 105.909091
5 116.090909
6 126.272727
7 136.454545
8 146.636364
9 156.818182
dtype: float64
print(Fit1.resid)
0 4.818182
1 -10.363636
2 4.454545
3 -0.727273
4 4.090909
5 -1.090909
6 -6.272727
7 3.545455
8 8.363636
9 -6.818182
dtype: float64
print(Fit1.bse)
Intercept 6.413817
Income 0.035743
dtype: float64
print(Fit1.centered_tss)
8890.0
print(anova_lm(Fit1))
df sum_sq mean_sq F PR(>F)
Income 1.0 8552.727273 8552.727273 202.867925 5.752746e-07
Residual 8.0 337.272727 42.159091 NaN NaN
fig = plt.figure()
plt.scatter(
x = "Income"
, y = "Expend"
, color = "green"
, marker = "o"
, data = df1
)
plt.plot(df1["Income"], Fit1.fittedvalues)
plt.title(r"Regression plot of Weekly Income (\$) and Weekly Expenditures (\$)")
plt.xlabel(r"Weekly Income (\$)")
plt.ylabel(r"Weekly Expenditures (\$)")
plt.show()
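Predictions for income values not in the sample use the same predict method seen later for the surface plot; a sketch (the new income values 150 and 300 are illustrative):

```python
import pandas as pd
from statsmodels.formula.api import ols

Income = [80, 100, 120, 140, 160, 180, 200, 220, 240, 260]
Expend = [70, 65, 90, 95, 110, 115, 120, 140, 155, 150]
df1 = pd.DataFrame({"Income": Income, "Expend": Expend})
Fit1 = ols(formula="Expend ~ Income", data=df1).fit()

new_data = pd.DataFrame({"Income": [150, 300]})
pred = Fit1.predict(new_data)
print(pred)  # 24.4545 + 0.5091*150 = 100.82; 24.4545 + 0.5091*300 = 177.18
```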
Multiple Linear Regression
import numpy as np
Fertilizer = np.arange(100, 800, 100)
Rainfall = [10, 20, 10, 30, 20, 20, 30]
Yield = [40, 50, 50, 70, 65, 65, 80]
import pandas as pd
df2 = pd.DataFrame(
{
"Fertilizer": Fertilizer
, "Rainfall": Rainfall
, "Yield": Yield
}
)
print(df2)
Fertilizer Rainfall Yield
0 100 10 40
1 200 20 50
2 300 10 50
3 400 30 70
4 500 20 65
5 600 20 65
6 700 30 80
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import pyplot as plt
fig = plt.figure()
ax = fig.add_subplot(111, projection = "3d")
ax.scatter(
df2["Fertilizer"]
, df2["Rainfall"]
, df2["Yield"]
, color = "green"
, marker = "o"
, alpha = 1
)
ax.set_xlabel("Fertilizer")
ax.set_ylabel("Rainfall")
ax.set_zlabel("Yield")
plt.show()
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
Reg2 = ols(formula = "Yield ~ Fertilizer + Rainfall", data = df2)
Fit2 = Reg2.fit()
print(Fit2.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Yield R-squared: 0.981
Model: OLS Adj. R-squared: 0.972
Method: Least Squares F-statistic: 105.3
Date: Tue, 16 Apr 2019 Prob (F-statistic): 0.000347
Time: 10:31:39 Log-Likelihood: -13.848
No. Observations: 7 AIC: 33.70
Df Residuals: 4 BIC: 33.53
Df Model: 2
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
Intercept 28.0952 2.491 11.277 0.000 21.178 35.013
Fertilizer 0.0381 0.006 6.532 0.003 0.022 0.054
Rainfall 0.8333 0.154 5.401 0.006 0.405 1.262
==============================================================================
Omnibus: nan Durbin-Watson: 2.249
Prob(Omnibus): nan Jarque-Bera (JB): 0.705
Skew: -0.408 Prob(JB): 0.703
Kurtosis: 1.677 Cond. No. 1.28e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1.28e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
/home/myaseen/.local/lib/python3.6/site-packages/statsmodels/stats/stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 7 samples were given.
"samples were given." % int(n), ValueWarning)
print(Fit2.params)
Intercept 28.095238
Fertilizer 0.038095
Rainfall 0.833333
dtype: float64
print(Fit2.fittedvalues)
0 40.238095
1 52.380952
2 47.857143
3 68.333333
4 63.809524
5 67.619048
6 79.761905
dtype: float64
print(Fit2.resid)
0 -0.238095
1 -2.380952
2 2.142857
3 1.666667
4 1.190476
5 -2.619048
6 0.238095
dtype: float64
print(Fit2.bse)
Intercept 2.491482
Fertilizer 0.005832
Rainfall 0.154303
dtype: float64
print(Fit2.centered_tss)
1150.0
print(anova_lm(Fit2))
df sum_sq mean_sq F PR(>F)
Fertilizer 1.0 972.321429 972.321429 181.500000 0.000176
Rainfall 1.0 156.250000 156.250000 29.166667 0.005690
Residual 4.0 21.428571 5.357143 NaN NaN
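The R-squared reported in the summary can be recovered from the sums of squares in the ANOVA table, since R^2 = 1 - SS_resid / SS_total:

```python
ss_resid = 21.428571        # Residual sum_sq from the ANOVA table
ss_total = 1150.0           # Fit2.centered_tss
r_squared = 1 - ss_resid / ss_total
print(round(r_squared, 3))  # -> 0.981, matching the summary
```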
from mpl_toolkits.mplot3d import Axes3D
from matplotlib import pyplot as plt
import numpy as np
import pandas as pd
from matplotlib import cm
fig = plt.figure()
ax = fig.add_subplot(111, projection = "3d")
ax.scatter(
df2["Fertilizer"]
, df2["Rainfall"]
, df2["Yield"]
, color = "green"
, marker = "o"
, alpha = 1
)
ax.set_xlabel("Fertilizer")
ax.set_ylabel("Rainfall")
ax.set_zlabel("Yield")
x_surf = np.arange(100, 720, 20)
y_surf = np.arange(10, 32, 2)
x_surf, y_surf = np.meshgrid(x_surf, y_surf)
exog = pd.DataFrame({
    "Fertilizer": x_surf.ravel()
    , "Rainfall": y_surf.ravel()
})
out = Fit2.predict(exog = exog)
ax.plot_surface(
x_surf
, y_surf
, out.values.reshape(x_surf.shape)
, rstride=1
, cstride=1
, color="None"
, alpha = 0.4
)
plt.show()
Polynomial Regression
Nitrogen = [0, 0, 10, 10, 20, 20]
Yield = [5, 7, 15, 17, 9, 11]
import pandas as pd
df3 = pd.DataFrame(
{
"Nitrogen": Nitrogen
, "Yield": Yield
}
)
print(df3)
Nitrogen Yield
0 0 5
1 0 7
2 10 15
3 10 17
4 20 9
5 20 11
from matplotlib import pyplot as plt
fig = plt.figure()
plt.scatter(
df3["Nitrogen"]
, df3["Yield"]
, color = "green"
, marker = "o"
)
plt.title("Scatter plot of Nitrogen and Yield")
plt.xlabel("Nitrogen")
plt.ylabel("Yield")
plt.show()
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
Reg3 = ols(formula = "Yield ~ Nitrogen + I(Nitrogen**2)", data = df3)
Fit3 = Reg3.fit()
print(Fit3.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Yield R-squared: 0.944
Model: OLS Adj. R-squared: 0.907
Method: Least Squares F-statistic: 25.33
Date: Tue, 16 Apr 2019 Prob (F-statistic): 0.0132
Time: 10:31:40 Log-Likelihood: -8.5136
No. Observations: 6 AIC: 23.03
Df Residuals: 3 BIC: 22.40
Df Model: 2
Covariance Type: nonrobust
====================================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------------
Intercept 6.0000 1.000 6.000 0.009 2.818 9.182
Nitrogen 1.8000 0.255 7.060 0.006 0.989 2.611
I(Nitrogen ** 2) -0.0800 0.012 -6.532 0.007 -0.119 -0.041
==============================================================================
Omnibus: nan Durbin-Watson: 3.333
Prob(Omnibus): nan Jarque-Bera (JB): 1.000
Skew: 0.000 Prob(JB): 0.607
Kurtosis: 1.000 Cond. No. 418.
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/home/myaseen/.local/lib/python3.6/site-packages/statsmodels/stats/stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
"samples were given." % int(n), ValueWarning)
print(Fit3.params)
Intercept 6.00
Nitrogen 1.80
I(Nitrogen ** 2) -0.08
dtype: float64
print(Fit3.fittedvalues)
0 6.0
1 6.0
2 16.0
3 16.0
4 10.0
5 10.0
dtype: float64
print(Fit3.resid)
0 -1.0
1 1.0
2 -1.0
3 1.0
4 -1.0
5 1.0
dtype: float64
print(Fit3.bse)
Intercept 1.000000
Nitrogen 0.254951
I(Nitrogen ** 2) 0.012247
dtype: float64
print(Fit3.centered_tss)
107.33333333333334
print(anova_lm(Fit3))
df sum_sq mean_sq F PR(>F)
Nitrogen 1.0 16.000000 16.000000 8.000000 0.066276
I(Nitrogen ** 2) 1.0 85.333333 85.333333 42.666667 0.007292
Residual 3.0 6.000000 2.000000 NaN NaN
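Because the fitted response is quadratic, Yield = 6 + 1.8*Nitrogen - 0.08*Nitrogen^2, the nitrogen level that maximizes predicted yield sits at the vertex of the parabola, N* = -b/(2c):

```python
b, c = 1.80, -0.08             # Fit3 coefficients: Nitrogen and I(Nitrogen**2)
n_opt = -b / (2 * c)           # vertex of the fitted parabola
y_max = 6.0 + b * n_opt + c * n_opt ** 2
print(round(n_opt, 4), round(y_max, 4))  # -> 11.25 16.125
```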
fig = plt.figure()
plt.scatter(
df3["Nitrogen"]
, df3["Yield"]
, color = "green"
, marker = "o"
)
plt.plot(df3["Nitrogen"], Fit3.fittedvalues)
plt.title("Regression plot of Nitrogen and Yield")
plt.xlabel("Nitrogen")
plt.ylabel("Yield")
plt.show()
Regression with Dummy Variables
Example 1
Consumption = [5, 6, 7, 15, 16, 17]
Gender = ["Male", "Male", "Male", "Female", "Female", "Female"]
import pandas as pd
df4 = pd.DataFrame(
{
"Consumption": Consumption
, "Gender": Gender
}
)
print(df4)
Consumption Gender
0 5 Male
1 6 Male
2 7 Male
3 15 Female
4 16 Female
5 17 Female
import pandas as pd
fig = plt.figure()
df4.boxplot(
"Consumption"
, by = "Gender"
)
plt.title("Boxplot")
plt.xlabel("Gender")
plt.ylabel("Consumption")
plt.show()
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
fm4 = ols(formula = "Consumption ~ Gender", data = df4)
Fit4 = fm4.fit()
print(dir(Fit4))
['HC0_se', 'HC1_se', 'HC2_se', 'HC3_se', '_HCCM', '__class__', '__delattr__', '__dict__', '__dir__', '__doc__', '__eq__', '__format__', '__ge__', '__getattribute__', '__gt__', '__hash__', '__init__', '__init_subclass__', '__le__', '__lt__', '__module__', '__ne__', '__new__', '__reduce__', '__reduce_ex__', '__repr__', '__setattr__', '__sizeof__', '__str__', '__subclasshook__', '__weakref__', '_cache', '_data_attr', '_get_robustcov_results', '_is_nested', '_wexog_singular_values', 'aic', 'bic', 'bse', 'centered_tss', 'compare_f_test', 'compare_lm_test', 'compare_lr_test', 'condition_number', 'conf_int', 'conf_int_el', 'cov_HC0', 'cov_HC1', 'cov_HC2', 'cov_HC3', 'cov_kwds', 'cov_params', 'cov_type', 'df_model', 'df_resid', 'eigenvals', 'el_test', 'ess', 'f_pvalue', 'f_test', 'fittedvalues', 'fvalue', 'get_influence', 'get_prediction', 'get_robustcov_results', 'initialize', 'k_constant', 'llf', 'load', 'model', 'mse_model', 'mse_resid', 'mse_total', 'nobs', 'normalized_cov_params', 'outlier_test', 'params', 'predict', 'pvalues', 'remove_data', 'resid', 'resid_pearson', 'rsquared', 'rsquared_adj', 'save', 'scale', 'ssr', 'summary', 'summary2', 't_test', 't_test_pairwise', 'tvalues', 'uncentered_tss', 'use_t', 'wald_test', 'wald_test_terms', 'wresid']
print(Fit4.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Consumption R-squared: 0.974
Model: OLS Adj. R-squared: 0.968
Method: Least Squares F-statistic: 150.0
Date: Tue, 16 Apr 2019 Prob (F-statistic): 0.000255
Time: 10:31:40 Log-Likelihood: -7.2972
No. Observations: 6 AIC: 18.59
Df Residuals: 4 BIC: 18.18
Df Model: 1
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 16.0000 0.577 27.713 0.000 14.397 17.603
Gender[T.Male] -10.0000 0.816 -12.247 0.000 -12.267 -7.733
==============================================================================
Omnibus: nan Durbin-Watson: 2.000
Prob(Omnibus): nan Jarque-Bera (JB): 0.562
Skew: 0.000 Prob(JB): 0.755
Kurtosis: 1.500 Cond. No. 2.62
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/home/myaseen/.local/lib/python3.6/site-packages/statsmodels/stats/stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
"samples were given." % int(n), ValueWarning)
print(Fit4.params)
Intercept 16.0
Gender[T.Male] -10.0
dtype: float64
print(Fit4.centered_tss)
154.0
print(anova_lm(Fit4))
df sum_sq mean_sq F PR(>F)
Gender 1.0 150.0 150.0 150.0 0.000255
Residual 4.0 4.0 1.0 NaN NaN
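Patsy creates the Gender[T.Male] dummy column automatically inside ols; the same 0/1 coding can be inspected explicitly with pd.get_dummies (here with drop_first=True so the first alphabetical level, Female, becomes the baseline):

```python
import pandas as pd

Gender = ["Male", "Male", "Male", "Female", "Female", "Female"]
dummies = pd.get_dummies(pd.Series(Gender), drop_first=True).astype(int)
print(dummies)  # one 'Male' column: 1 for Male rows, 0 for Female rows
```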
Example 2
Consumption = [5, 7, 15, 17, 17, 19]
EduGroup = ["NoEdu", "NoEdu", "SchoolEdu", "SchoolEdu", "CollegeEdu", "CollegeEdu"]
import pandas as pd
df5 = pd.DataFrame(
{
"Consumption": Consumption
, "EduGroup": EduGroup
}
)
print(df5)
Consumption EduGroup
0 5 NoEdu
1 7 NoEdu
2 15 SchoolEdu
3 17 SchoolEdu
4 17 CollegeEdu
5 19 CollegeEdu
fig = plt.figure()
df5.boxplot(
"Consumption"
, by = "EduGroup"
)
plt.title("Boxplot")
plt.xlabel("EduGroup")
plt.ylabel("Consumption")
plt.show()
fm5 = ols(formula = "Consumption ~ EduGroup", data = df5)
Fit5 = fm5.fit()
print(Fit5.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Consumption R-squared: 0.965
Model: OLS Adj. R-squared: 0.942
Method: Least Squares F-statistic: 41.33
Date: Tue, 16 Apr 2019 Prob (F-statistic): 0.00655
Time: 10:31:40 Log-Likelihood: -8.5136
No. Observations: 6 AIC: 23.03
Df Residuals: 3 BIC: 22.40
Df Model: 2
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 18.0000 1.000 18.000 0.000 14.818 21.182
EduGroup[T.NoEdu] -12.0000 1.414 -8.485 0.003 -16.501 -7.499
EduGroup[T.SchoolEdu] -2.0000 1.414 -1.414 0.252 -6.501 2.501
==============================================================================
Omnibus: nan Durbin-Watson: 3.333
Prob(Omnibus): nan Jarque-Bera (JB): 1.000
Skew: 0.000 Prob(JB): 0.607
Kurtosis: 1.000 Cond. No. 3.73
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
/home/myaseen/.local/lib/python3.6/site-packages/statsmodels/stats/stattools.py:72: ValueWarning: omni_normtest is not valid with less than 8 observations; 6 samples were given.
"samples were given." % int(n), ValueWarning)
print(Fit5.params)
Intercept 18.0
EduGroup[T.NoEdu] -12.0
EduGroup[T.SchoolEdu] -2.0
dtype: float64
print(Fit5.centered_tss)
171.33333333333334
print(anova_lm(Fit5))
df sum_sq mean_sq F PR(>F)
EduGroup 2.0 165.333333 82.666667 41.333333 0.006553
Residual 3.0 6.000000 2.000000 NaN NaN
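The coefficients are differences from the baseline group (CollegeEdu, the first level alphabetically): NoEdu mean = 18 - 12 = 6 and SchoolEdu mean = 18 - 2 = 16. This is easy to verify directly with a groupby:

```python
import pandas as pd

df5 = pd.DataFrame({
    "Consumption": [5, 7, 15, 17, 17, 19],
    "EduGroup": ["NoEdu", "NoEdu", "SchoolEdu", "SchoolEdu", "CollegeEdu", "CollegeEdu"],
})
means = df5.groupby("EduGroup")["Consumption"].mean()
print(means)  # CollegeEdu 18.0, NoEdu 6.0, SchoolEdu 16.0
```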
Example 3
Consumption = [51, 52, 53, 54, 56, 57, 55, 56, 58, 59, 62, 63]
Gender = ["Male", "Male", "Male", "Male", "Male", "Male", "Female", "Female", "Female", "Female", "Female", "Female"]
Income = [80, 80, 90, 90, 100, 100, 80, 80, 90, 90, 100, 100]
import pandas as pd
df6 = pd.DataFrame(
{
"Consumption": Consumption
, "Gender": Gender
, "Income": Income
}
)
print(df6)
Consumption Gender Income
0 51 Male 80
1 52 Male 80
2 53 Male 90
3 54 Male 90
4 56 Male 100
5 57 Male 100
6 55 Female 80
7 56 Female 80
8 58 Female 90
9 59 Female 90
10 62 Female 100
11 63 Female 100
Regression with same slopes and different intercepts
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
Reg6 = ols(formula = "Consumption ~ Gender + Income", data = df6)
Fit6 = Reg6.fit()
print(Fit6.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Consumption R-squared: 0.963
Model: OLS Adj. R-squared: 0.955
Method: Least Squares F-statistic: 116.7
Date: Tue, 16 Apr 2019 Prob (F-statistic): 3.66e-07
Time: 10:31:40 Log-Likelihood: -12.525
No. Observations: 12 AIC: 31.05
Df Residuals: 9 BIC: 32.51
Df Model: 2
Covariance Type: nonrobust
==================================================================================
coef std err t P>|t| [0.025 0.975]
----------------------------------------------------------------------------------
Intercept 31.8333 2.546 12.505 0.000 26.075 37.592
Gender[T.Male] -5.0000 0.458 -10.914 0.000 -6.036 -3.964
Income 0.3000 0.028 10.694 0.000 0.237 0.363
==============================================================================
Omnibus: 0.407 Durbin-Watson: 2.294
Prob(Omnibus): 0.816 Jarque-Bera (JB): 0.503
Skew: 0.228 Prob(JB): 0.778
Kurtosis: 2.107 Cond. No. 1.00e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 1e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
/home/myaseen/.local/lib/python3.6/site-packages/scipy/stats/stats.py:1416: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
"anyway, n=%i" % int(n))
print(Fit6.params)
Intercept 31.833333
Gender[T.Male] -5.000000
Income 0.300000
dtype: float64
print(Fit6.fittedvalues)
0 50.833333
1 50.833333
2 53.833333
3 53.833333
4 56.833333
5 56.833333
6 55.833333
7 55.833333
8 58.833333
9 58.833333
10 61.833333
11 61.833333
dtype: float64
print(Fit6.resid)
0 0.166667
1 1.166667
2 -0.833333
3 0.166667
4 -0.833333
5 0.166667
6 -0.833333
7 0.166667
8 -0.833333
9 0.166667
10 0.166667
11 1.166667
dtype: float64
print(Fit6.bse)
Intercept 2.545572
Gender[T.Male] 0.458123
Income 0.028054
dtype: float64
print(Fit6.centered_tss)
152.66666666666669
print(anova_lm(Fit6))
df sum_sq mean_sq F PR(>F)
Gender 1.0 75.000000 75.00000 119.117647 0.000002
Income 1.0 72.000000 72.00000 114.352941 0.000002
Residual 9.0 5.666667 0.62963 NaN NaN
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot
fig = plt.figure()
fig = interaction_plot(
x = Income
, trace = Gender
, response = Fit6.fittedvalues
, colors = ['red','blue']
, markers = ['D','^']
, xlabel ='Income'
, ylabel = 'Consumption'
)
plt.show()
Regression with different slopes and different intercepts
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm
Reg7 = ols(formula = "Consumption ~ Gender*Income", data = df6)
Fit7 = Reg7.fit()
print(Fit7.summary())
OLS Regression Results
==============================================================================
Dep. Variable: Consumption R-squared: 0.976
Model: OLS Adj. R-squared: 0.967
Method: Least Squares F-statistic: 108.4
Date: Tue, 16 Apr 2019 Prob (F-statistic): 8.11e-07
Time: 10:31:41 Log-Likelihood: -9.9135
No. Observations: 12 AIC: 27.83
Df Residuals: 8 BIC: 29.77
Df Model: 3
Covariance Type: nonrobust
=========================================================================================
coef std err t P>|t| [0.025 0.975]
-----------------------------------------------------------------------------------------
Intercept 27.3333 3.059 8.935 0.000 20.279 34.387
Gender[T.Male] 4.0000 4.326 0.925 0.382 -5.976 13.976
Income 0.3500 0.034 10.340 0.000 0.272 0.428
Gender[T.Male]:Income -0.1000 0.048 -2.089 0.070 -0.210 0.010
==============================================================================
Omnibus: 2.522 Durbin-Watson: 3.273
Prob(Omnibus): 0.283 Jarque-Bera (JB): 0.970
Skew: -0.055 Prob(JB): 0.616
Kurtosis: 1.612 Cond. No. 2.62e+03
==============================================================================
Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
[2] The condition number is large, 2.62e+03. This might indicate that there are
strong multicollinearity or other numerical problems.
/home/myaseen/.local/lib/python3.6/site-packages/scipy/stats/stats.py:1416: UserWarning: kurtosistest only valid for n>=20 ... continuing anyway, n=12
"anyway, n=%i" % int(n))
print(Fit7.params)
Intercept 27.333333
Gender[T.Male] 4.000000
Income 0.350000
Gender[T.Male]:Income -0.100000
dtype: float64
print(Fit7.fittedvalues)
0 51.333333
1 51.333333
2 53.833333
3 53.833333
4 56.333333
5 56.333333
6 55.333333
7 55.333333
8 58.833333
9 58.833333
10 62.333333
11 62.333333
dtype: float64
print(Fit7.resid)
0 -0.333333
1 0.666667
2 -0.833333
3 0.166667
4 -0.333333
5 0.666667
6 -0.333333
7 0.666667
8 -0.833333
9 0.166667
10 -0.333333
11 0.666667
dtype: float64
print(Fit7.bse)
Intercept 3.059026
Gender[T.Male] 4.326116
Income 0.033850
Gender[T.Male]:Income 0.047871
dtype: float64
print(Fit7.centered_tss)
152.66666666666669
print(anova_lm(Fit7))
df sum_sq mean_sq F PR(>F)
Gender 1.0 75.000000 75.000000 163.636364 0.000001
Income 1.0 72.000000 72.000000 157.090909 0.000002
Gender:Income 1.0 2.000000 2.000000 4.363636 0.070134
Residual 8.0 3.666667 0.458333 NaN NaN
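Whether the interaction term is needed can also be assessed by comparing the two nested models directly: anova_lm accepts several fitted models and reports the partial F-test for the added terms. A sketch, refitting both models for completeness:

```python
import pandas as pd
from statsmodels.formula.api import ols
from statsmodels.stats.anova import anova_lm

df6 = pd.DataFrame({
    "Consumption": [51, 52, 53, 54, 56, 57, 55, 56, 58, 59, 62, 63],
    "Gender": ["Male"] * 6 + ["Female"] * 6,
    "Income": [80, 80, 90, 90, 100, 100] * 2,
})
Fit6 = ols(formula="Consumption ~ Gender + Income", data=df6).fit()
Fit7 = ols(formula="Consumption ~ Gender*Income", data=df6).fit()
comparison = anova_lm(Fit6, Fit7)
print(comparison)  # F ~ 4.36, p ~ 0.070: same conclusion as the interaction t-test
```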
import matplotlib.pyplot as plt
from statsmodels.graphics.factorplots import interaction_plot
fig = plt.figure()
fig = interaction_plot(
x = Income
, trace = Gender
, response = Fit7.fittedvalues
, colors = ['red','blue']
, markers = ['D','^']
, xlabel ='Income'
, ylabel = 'Consumption'
)
plt.show()