Interactions and ANOVA

Note: This script is based heavily on Jonathan Taylor's class notes http://www.stanford.edu/class/stats191/interactions.html

Download and format data:

In [1]:
%matplotlib inline

from __future__ import print_function
from statsmodels.compat import urlopen
import numpy as np
np.set_printoptions(precision=4, suppress=True)
import statsmodels.api as sm
import pandas as pd
pd.set_option("display.width", 100)
import matplotlib.pyplot as plt
from statsmodels.formula.api import ols
from statsmodels.graphics.api import interaction_plot, abline_plot
from statsmodels.stats.anova import anova_lm

try:
    salary_table = pd.read_csv('salary.table')
except:  # recent pandas can read URL without urlopen
    url = 'http://stats191.stanford.edu/data/salary.table'
    fh = urlopen(url)
    salary_table = pd.read_table(fh)
    salary_table.to_csv('salary.table')

E = salary_table.E
M = salary_table.M
X = salary_table.X
S = salary_table.S

Take a look at the data:

In [2]:
plt.figure(figsize=(6,6))
symbols = ['D', '^']
colors = ['r', 'g', 'blue']
factor_groups = salary_table.groupby(['E','M'])
for values, group in factor_groups:
    i,j = values
    plt.scatter(group['X'], group['S'], marker=symbols[j], color=colors[i-1],
               s=144)
plt.xlabel('Experience');
plt.ylabel('Salary');

Fit a linear model:

In [3]:
formula = 'S ~ C(E) + C(M) + X'
lm = ols(formula, salary_table).fit()
print(lm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      S   R-squared:                       0.957
Model:                            OLS   Adj. R-squared:                  0.953
Method:                 Least Squares   F-statistic:                     226.8
Date:                Mon, 14 May 2018   Prob (F-statistic):           2.23e-27
Time:                        21:44:50   Log-Likelihood:                -381.63
No. Observations:                  46   AIC:                             773.3
Df Residuals:                      41   BIC:                             782.4
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   8035.5976    386.689     20.781      0.000    7254.663    8816.532
C(E)[T.2]   3144.0352    361.968      8.686      0.000    2413.025    3875.045
C(E)[T.3]   2996.2103    411.753      7.277      0.000    2164.659    3827.762
C(M)[T.1]   6883.5310    313.919     21.928      0.000    6249.559    7517.503
X            546.1840     30.519     17.896      0.000     484.549     607.819
==============================================================================
Omnibus:                        2.293   Durbin-Watson:                   2.237
Prob(Omnibus):                  0.318   Jarque-Bera (JB):                1.362
Skew:                          -0.077   Prob(JB):                        0.506
Kurtosis:                       2.171   Cond. No.                         33.5
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Have a look at the created design matrix:

In [4]:
lm.model.exog[:5]
Out[4]:
array([[1., 0., 0., 1., 1.],
       [1., 0., 1., 0., 1.],
       [1., 0., 1., 1., 1.],
       [1., 1., 0., 0., 1.],
       [1., 0., 1., 0., 1.]])

Or since we initially passed in a DataFrame, we have a DataFrame available in

In [5]:
lm.model.data.orig_exog[:5]
Out[5]:
Intercept C(E)[T.2] C(E)[T.3] C(M)[T.1] X
0 1.0 0.0 0.0 1.0 1.0
1 1.0 0.0 1.0 0.0 1.0
2 1.0 0.0 1.0 1.0 1.0
3 1.0 1.0 0.0 0.0 1.0
4 1.0 0.0 1.0 0.0 1.0

We keep a reference to the original untouched data in

In [6]:
lm.model.data.frame[:5]
Out[6]:
S X E M
0 13876 1 1 1
1 11608 1 3 0
2 18701 1 3 1
3 11283 1 2 0
4 11767 1 3 0

Influence statistics

In [7]:
infl = lm.get_influence()
print(infl.summary_table())
==================================================================================================
       obs      endog     fitted     Cook's   student.   hat diag    dffits   ext.stud.     dffits
                           value          d   residual              internal   residual           
--------------------------------------------------------------------------------------------------
         0  13876.000  15465.313      0.104     -1.683      0.155     -0.722     -1.723     -0.739
         1  11608.000  11577.992      0.000      0.031      0.130      0.012      0.031      0.012
         2  18701.000  18461.523      0.001      0.247      0.109      0.086      0.244      0.085
         3  11283.000  11725.817      0.005     -0.458      0.113     -0.163     -0.453     -0.162
         4  11767.000  11577.992      0.001      0.197      0.130      0.076      0.195      0.075
         5  20872.000  19155.532      0.092      1.787      0.126      0.678      1.838      0.698
         6  11772.000  12272.001      0.006     -0.513      0.101     -0.172     -0.509     -0.170
         7  10535.000   9127.966      0.056      1.457      0.116      0.529      1.478      0.537
         8  12195.000  12124.176      0.000      0.074      0.123      0.028      0.073      0.027
         9  12313.000  12818.185      0.005     -0.516      0.091     -0.163     -0.511     -0.161
        10  14975.000  16557.681      0.084     -1.655      0.134     -0.650     -1.692     -0.664
        11  21371.000  19701.716      0.078      1.728      0.116      0.624      1.772      0.640
        12  19800.000  19553.891      0.001      0.252      0.096      0.082      0.249      0.081
        13  11417.000  10220.334      0.033      1.227      0.098      0.405      1.234      0.408
        14  20263.000  20100.075      0.001      0.166      0.093      0.053      0.165      0.053
        15  13231.000  13216.544      0.000      0.015      0.114      0.005      0.015      0.005
        16  12884.000  13364.369      0.004     -0.488      0.082     -0.146     -0.483     -0.145
        17  13245.000  13910.553      0.007     -0.674      0.075     -0.192     -0.669     -0.191
        18  13677.000  13762.728      0.000     -0.089      0.113     -0.032     -0.087     -0.031
        19  15965.000  17650.049      0.082     -1.747      0.119     -0.642     -1.794     -0.659
        20  12336.000  11312.702      0.021      1.043      0.087      0.323      1.044      0.323
        21  21352.000  21192.443      0.001      0.163      0.091      0.052      0.161      0.051
        22  13839.000  14456.737      0.006     -0.624      0.070     -0.171     -0.619     -0.170
        23  22884.000  21340.268      0.052      1.579      0.095      0.511      1.610      0.521
        24  16978.000  18742.417      0.083     -1.822      0.111     -0.644     -1.877     -0.664
        25  14803.000  15549.105      0.008     -0.751      0.065     -0.199     -0.747     -0.198
        26  17404.000  19288.601      0.093     -1.944      0.110     -0.684     -2.016     -0.709
        27  22184.000  22284.811      0.000     -0.103      0.096     -0.034     -0.102     -0.033
        28  13548.000  12405.070      0.025      1.162      0.083      0.350      1.167      0.352
        29  14467.000  13497.438      0.018      0.987      0.086      0.304      0.987      0.304
        30  15942.000  16641.473      0.007     -0.705      0.068     -0.190     -0.701     -0.189
        31  23174.000  23377.179      0.001     -0.209      0.108     -0.073     -0.207     -0.072
        32  23780.000  23525.004      0.001      0.260      0.092      0.083      0.257      0.082
        33  25410.000  24071.188      0.040      1.370      0.096      0.446      1.386      0.451
        34  14861.000  14043.622      0.014      0.834      0.091      0.263      0.831      0.262
        35  16882.000  17733.841      0.012     -0.863      0.077     -0.249     -0.860     -0.249
        36  24170.000  24469.547      0.003     -0.312      0.127     -0.119     -0.309     -0.118
        37  15990.000  15135.990      0.018      0.878      0.104      0.300      0.876      0.299
        38  26330.000  25163.556      0.035      1.202      0.109      0.420      1.209      0.422
        39  17949.000  18826.209      0.017     -0.897      0.093     -0.288     -0.895     -0.287
        40  25685.000  26108.099      0.008     -0.452      0.169     -0.204     -0.447     -0.202
        41  27837.000  26802.108      0.039      1.087      0.141      0.440      1.089      0.441
        42  18838.000  19918.577      0.033     -1.119      0.117     -0.407     -1.123     -0.408
        43  17483.000  16774.542      0.018      0.743      0.138      0.297      0.739      0.295
        44  19207.000  20464.761      0.052     -1.313      0.131     -0.511     -1.325     -0.515
        45  19346.000  18959.278      0.009      0.423      0.208      0.216      0.419      0.214
==================================================================================================

or get a dataframe

In [8]:
df_infl = infl.summary_frame()
In [9]:
df_infl[:5]
Out[9]:
dfb_Intercept dfb_C(E)[T.2] dfb_C(E)[T.3] dfb_C(M)[T.1] dfb_X cooks_d dffits dffits_internal hat_diag standard_resid student_resid
0 -0.505123 0.376134 0.483977 -0.369677 0.399111 0.104186 -0.738880 -0.721753 0.155327 -1.683099 -1.723037
1 0.004663 0.000145 0.006733 -0.006220 -0.004449 0.000029 0.011972 0.012120 0.130266 0.031318 0.030934
2 0.013627 0.000367 0.036876 0.030514 -0.034970 0.001492 0.085380 0.086377 0.109021 0.246931 0.244082
3 -0.083152 -0.074411 0.009704 0.053783 0.105122 0.005338 -0.161773 -0.163364 0.113030 -0.457630 -0.453173
4 0.029382 0.000917 0.042425 -0.039198 -0.028036 0.001166 0.075439 0.076340 0.130266 0.197257 0.194929

Now plot the reiduals within the groups separately:

In [10]:
resid = lm.resid
plt.figure(figsize=(6,6));
for values, group in factor_groups:
    i,j = values
    group_num = i*2 + j - 1  # for plotting purposes
    x = [group_num] * len(group)
    plt.scatter(x, resid[group.index], marker=symbols[j], color=colors[i-1],
            s=144, edgecolors='black')
plt.xlabel('Group');
plt.ylabel('Residuals');

Now we will test some interactions using anova or f_test

In [11]:
interX_lm = ols("S ~ C(E) * X + C(M)", salary_table).fit()
print(interX_lm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      S   R-squared:                       0.961
Model:                            OLS   Adj. R-squared:                  0.955
Method:                 Least Squares   F-statistic:                     158.6
Date:                Mon, 14 May 2018   Prob (F-statistic):           8.23e-26
Time:                        21:44:50   Log-Likelihood:                -379.47
No. Observations:                  46   AIC:                             772.9
Df Residuals:                      39   BIC:                             785.7
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    7256.2800    549.494     13.205      0.000    6144.824    8367.736
C(E)[T.2]    4172.5045    674.966      6.182      0.000    2807.256    5537.753
C(E)[T.3]    3946.3649    686.693      5.747      0.000    2557.396    5335.333
C(M)[T.1]    7102.4539    333.442     21.300      0.000    6428.005    7776.903
X             632.2878     53.185     11.888      0.000     524.710     739.865
C(E)[T.2]:X  -125.5147     69.863     -1.797      0.080    -266.826      15.796
C(E)[T.3]:X  -141.2741     89.281     -1.582      0.122    -321.861      39.313
==============================================================================
Omnibus:                        0.432   Durbin-Watson:                   2.179
Prob(Omnibus):                  0.806   Jarque-Bera (JB):                0.590
Skew:                           0.144   Prob(JB):                        0.744
Kurtosis:                       2.526   Cond. No.                         69.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Do an ANOVA check

In [12]:
from statsmodels.stats.api import anova_lm

table1 = anova_lm(lm, interX_lm)
print(table1)

interM_lm = ols("S ~ X + C(E)*C(M)", data=salary_table).fit()
print(interM_lm.summary())

table2 = anova_lm(lm, interM_lm)
print(table2)
   df_resid           ssr  df_diff       ss_diff         F    Pr(>F)
0      41.0  4.328072e+07      0.0           NaN       NaN       NaN
1      39.0  3.941068e+07      2.0  3.870040e+06  1.914856  0.160964
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      S   R-squared:                       0.999
Model:                            OLS   Adj. R-squared:                  0.999
Method:                 Least Squares   F-statistic:                     5517.
Date:                Mon, 14 May 2018   Prob (F-statistic):           1.67e-55
Time:                        21:44:50   Log-Likelihood:                -298.74
No. Observations:                  46   AIC:                             611.5
Df Residuals:                      39   BIC:                             624.3
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
=======================================================================================
                          coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------------
Intercept            9472.6854     80.344    117.902      0.000    9310.175    9635.196
C(E)[T.2]            1381.6706     77.319     17.870      0.000    1225.279    1538.063
C(E)[T.3]            1730.7483    105.334     16.431      0.000    1517.690    1943.806
C(M)[T.1]            3981.3769    101.175     39.351      0.000    3776.732    4186.022
C(E)[T.2]:C(M)[T.1]  4902.5231    131.359     37.322      0.000    4636.825    5168.222
C(E)[T.3]:C(M)[T.1]  3066.0351    149.330     20.532      0.000    2763.986    3368.084
X                     496.9870      5.566     89.283      0.000     485.728     508.246
==============================================================================
Omnibus:                       74.761   Durbin-Watson:                   2.244
Prob(Omnibus):                  0.000   Jarque-Bera (JB):             1037.873
Skew:                          -4.103   Prob(JB):                    4.25e-226
Kurtosis:                      24.776   Cond. No.                         79.0
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
   df_resid           ssr  df_diff       ss_diff           F        Pr(>F)
0      41.0  4.328072e+07      0.0           NaN         NaN           NaN
1      39.0  1.178168e+06      2.0  4.210255e+07  696.844466  3.025504e-31
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)

The design matrix as a DataFrame

In [13]:
interM_lm.model.data.orig_exog[:5]
Out[13]:
Intercept C(E)[T.2] C(E)[T.3] C(M)[T.1] C(E)[T.2]:C(M)[T.1] C(E)[T.3]:C(M)[T.1] X
0 1.0 0.0 0.0 1.0 0.0 0.0 1.0
1 1.0 0.0 1.0 0.0 0.0 0.0 1.0
2 1.0 0.0 1.0 1.0 0.0 1.0 1.0
3 1.0 1.0 0.0 0.0 0.0 0.0 1.0
4 1.0 0.0 1.0 0.0 0.0 0.0 1.0

The design matrix as an ndarray

In [14]:
interM_lm.model.exog
interM_lm.model.exog_names
Out[14]:
['Intercept',
 'C(E)[T.2]',
 'C(E)[T.3]',
 'C(M)[T.1]',
 'C(E)[T.2]:C(M)[T.1]',
 'C(E)[T.3]:C(M)[T.1]',
 'X']
In [15]:
infl = interM_lm.get_influence()
resid = infl.resid_studentized_internal
plt.figure(figsize=(6,6))
for values, group in factor_groups:
    i,j = values
    idx = group.index
    plt.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i-1],
            s=144, edgecolors='black')
plt.xlabel('X');
plt.ylabel('standardized resids');

Looks like one observation is an outlier.

In [16]:
drop_idx = abs(resid).argmax()
print(drop_idx)  # zero-based index
idx = salary_table.index.drop(drop_idx)

lm32 = ols('S ~ C(E) + X + C(M)', data=salary_table, subset=idx).fit()

print(lm32.summary())
print('\n')

interX_lm32 = ols('S ~ C(E) * X + C(M)', data=salary_table, subset=idx).fit()

print(interX_lm32.summary())
print('\n')


table3 = anova_lm(lm32, interX_lm32)
print(table3)
print('\n')


interM_lm32 = ols('S ~ X + C(E) * C(M)', data=salary_table, subset=idx).fit()

table4 = anova_lm(lm32, interM_lm32)
print(table4)
print('\n')
32
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      S   R-squared:                       0.955
Model:                            OLS   Adj. R-squared:                  0.950
Method:                 Least Squares   F-statistic:                     211.7
Date:                Mon, 14 May 2018   Prob (F-statistic):           2.45e-26
Time:                        21:44:51   Log-Likelihood:                -373.79
No. Observations:                  45   AIC:                             757.6
Df Residuals:                      40   BIC:                             766.6
Df Model:                           4                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept   8044.7518    392.781     20.482      0.000    7250.911    8838.592
C(E)[T.2]   3129.5286    370.470      8.447      0.000    2380.780    3878.277
C(E)[T.3]   2999.4451    416.712      7.198      0.000    2157.238    3841.652
C(M)[T.1]   6866.9856    323.991     21.195      0.000    6212.175    7521.796
X            545.7855     30.912     17.656      0.000     483.311     608.260
==============================================================================
Omnibus:                        2.511   Durbin-Watson:                   2.265
Prob(Omnibus):                  0.285   Jarque-Bera (JB):                1.400
Skew:                          -0.044   Prob(JB):                        0.496
Kurtosis:                       2.140   Cond. No.                         33.1
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      S   R-squared:                       0.959
Model:                            OLS   Adj. R-squared:                  0.952
Method:                 Least Squares   F-statistic:                     147.7
Date:                Mon, 14 May 2018   Prob (F-statistic):           8.97e-25
Time:                        21:44:51   Log-Likelihood:                -371.70
No. Observations:                  45   AIC:                             757.4
Df Residuals:                      38   BIC:                             770.0
Df Model:                           6                                         
Covariance Type:            nonrobust                                         
===============================================================================
                  coef    std err          t      P>|t|      [0.025      0.975]
-------------------------------------------------------------------------------
Intercept    7266.0887    558.872     13.001      0.000    6134.711    8397.466
C(E)[T.2]    4162.0846    685.728      6.070      0.000    2773.900    5550.269
C(E)[T.3]    3940.4359    696.067      5.661      0.000    2531.322    5349.549
C(M)[T.1]    7088.6387    345.587     20.512      0.000    6389.035    7788.243
X             631.6892     53.950     11.709      0.000     522.473     740.905
C(E)[T.2]:X  -125.5009     70.744     -1.774      0.084    -268.714      17.712
C(E)[T.3]:X  -139.8410     90.728     -1.541      0.132    -323.511      43.829
==============================================================================
Omnibus:                        0.617   Durbin-Watson:                   2.194
Prob(Omnibus):                  0.734   Jarque-Bera (JB):                0.728
Skew:                           0.162   Prob(JB):                        0.695
Kurtosis:                       2.468   Cond. No.                         68.7
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.


   df_resid           ssr  df_diff       ss_diff         F    Pr(>F)
0      40.0  4.320910e+07      0.0           NaN       NaN       NaN
1      38.0  3.937424e+07      2.0  3.834859e+06  1.850508  0.171042


   df_resid           ssr  df_diff       ss_diff            F        Pr(>F)
0      40.0  4.320910e+07      0.0           NaN          NaN           NaN
1      38.0  1.711881e+05      2.0  4.303791e+07  4776.734853  2.291239e-46


/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)

Replot the residuals

In [17]:
try:
    resid = interM_lm32.get_influence().summary_frame()['standard_resid']
except:
    resid = interM_lm32.get_influence().summary_frame()['standard_resid']

plt.figure(figsize=(6,6))
for values, group in factor_groups:
    i,j = values
    idx = group.index
    plt.scatter(X[idx], resid[idx], marker=symbols[j], color=colors[i-1],
            s=144, edgecolors='black')
plt.xlabel('X[~[32]]');
plt.ylabel('standardized resids');
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/pandas/core/series.py:696: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]

Plot the fitted values

In [18]:
lm_final = ols('S ~ X + C(E)*C(M)', data = salary_table.drop([drop_idx])).fit()
mf = lm_final.model.data.orig_exog
lstyle = ['-','--']

plt.figure(figsize=(6,6))
for values, group in factor_groups:
    i,j = values
    idx = group.index
    plt.scatter(X[idx], S[idx], marker=symbols[j], color=colors[i-1],
                s=144, edgecolors='black')
    # drop NA because there is no idx 32 in the final model
    plt.plot(mf.X[idx].dropna(), lm_final.fittedvalues[idx].dropna(),
            ls=lstyle[j], color=colors[i-1])
plt.xlabel('Experience');
plt.ylabel('Salary');
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/pandas/core/series.py:696: FutureWarning: 
Passing list-likes to .loc or [] with any missing label will raise
KeyError in the future, you can use .reindex() as an alternative.

See the documentation here:
http://pandas.pydata.org/pandas-docs/stable/indexing.html#deprecate-loc-reindex-listlike
  return self.loc[key]

From our first look at the data, the difference between Master's and PhD in the management group is different than in the non-management group. This is an interaction between the two qualitative variables management,M and education,E. We can visualize this by first removing the effect of experience, then plotting the means within each of the 6 groups using interaction.plot.

In [19]:
U = S - X * interX_lm32.params['X']

plt.figure(figsize=(6,6))
interaction_plot(E, M, U, colors=['red','blue'], markers=['^','D'],
        markersize=10, ax=plt.gca())
Out[19]:

Minority Employment Data

In [20]:
try:
    jobtest_table = pd.read_table('jobtest.table')
except:  # don't have data already
    url = 'http://stats191.stanford.edu/data/jobtest.table'
    jobtest_table = pd.read_table(url)

factor_group = jobtest_table.groupby(['MINORITY'])

fig, ax = plt.subplots(figsize=(6,6))
colors = ['purple', 'green']
markers = ['o', 'v']
for factor, group in factor_group:
    ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
                marker=markers[factor], s=12**2)
ax.set_xlabel('TEST');
ax.set_ylabel('JPERF');
In [21]:
min_lm = ols('JPERF ~ TEST', data=jobtest_table).fit()
print(min_lm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.517
Model:                            OLS   Adj. R-squared:                  0.490
Method:                 Least Squares   F-statistic:                     19.25
Date:                Mon, 14 May 2018   Prob (F-statistic):           0.000356
Time:                        21:44:53   Log-Likelihood:                -36.614
No. Observations:                  20   AIC:                             77.23
Df Residuals:                      18   BIC:                             79.22
Df Model:                           1                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      1.0350      0.868      1.192      0.249      -0.789       2.859
TEST           2.3605      0.538      4.387      0.000       1.230       3.491
==============================================================================
Omnibus:                        0.324   Durbin-Watson:                   2.896
Prob(Omnibus):                  0.850   Jarque-Bera (JB):                0.483
Skew:                          -0.186   Prob(JB):                        0.785
Kurtosis:                       2.336   Cond. No.                         5.26
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [22]:
fig, ax = plt.subplots(figsize=(6,6));
for factor, group in factor_group:
    ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
                marker=markers[factor], s=12**2)

ax.set_xlabel('TEST')
ax.set_ylabel('JPERF')
fig = abline_plot(model_results = min_lm, ax=ax)
In [23]:
min_lm2 = ols('JPERF ~ TEST + TEST:MINORITY',
        data=jobtest_table).fit()

print(min_lm2.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.632
Model:                            OLS   Adj. R-squared:                  0.589
Method:                 Least Squares   F-statistic:                     14.59
Date:                Mon, 14 May 2018   Prob (F-statistic):           0.000204
Time:                        21:44:53   Log-Likelihood:                -33.891
No. Observations:                  20   AIC:                             73.78
Df Residuals:                      17   BIC:                             76.77
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         1.1211      0.780      1.437      0.169      -0.525       2.768
TEST              1.8276      0.536      3.412      0.003       0.698       2.958
TEST:MINORITY     0.9161      0.397      2.306      0.034       0.078       1.754
==============================================================================
Omnibus:                        0.388   Durbin-Watson:                   3.008
Prob(Omnibus):                  0.823   Jarque-Bera (JB):                0.514
Skew:                           0.050   Prob(JB):                        0.773
Kurtosis:                       2.221   Cond. No.                         5.96
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [24]:
fig, ax = plt.subplots(figsize=(6,6));
for factor, group in factor_group:
    ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
                marker=markers[factor], s=12**2)

fig = abline_plot(intercept = min_lm2.params['Intercept'],
                 slope = min_lm2.params['TEST'], ax=ax, color='purple');
fig = abline_plot(intercept = min_lm2.params['Intercept'],
        slope = min_lm2.params['TEST'] + min_lm2.params['TEST:MINORITY'],
        ax=ax, color='green');
In [25]:
min_lm3 = ols('JPERF ~ TEST + MINORITY', data = jobtest_table).fit()
print(min_lm3.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.572
Model:                            OLS   Adj. R-squared:                  0.522
Method:                 Least Squares   F-statistic:                     11.38
Date:                Mon, 14 May 2018   Prob (F-statistic):           0.000731
Time:                        21:44:54   Log-Likelihood:                -35.390
No. Observations:                  20   AIC:                             76.78
Df Residuals:                      17   BIC:                             79.77
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
Intercept      0.6120      0.887      0.690      0.500      -1.260       2.483
TEST           2.2988      0.522      4.400      0.000       1.197       3.401
MINORITY       1.0276      0.691      1.487      0.155      -0.430       2.485
==============================================================================
Omnibus:                        0.251   Durbin-Watson:                   3.028
Prob(Omnibus):                  0.882   Jarque-Bera (JB):                0.437
Skew:                          -0.059   Prob(JB):                        0.804
Kurtosis:                       2.286   Cond. No.                         5.72
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [26]:
fig, ax = plt.subplots(figsize=(6,6));
for factor, group in factor_group:
    ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
                marker=markers[factor], s=12**2)

fig = abline_plot(intercept = min_lm3.params['Intercept'],
                 slope = min_lm3.params['TEST'], ax=ax, color='purple');
fig = abline_plot(intercept = min_lm3.params['Intercept'] + min_lm3.params['MINORITY'],
        slope = min_lm3.params['TEST'], ax=ax, color='green');
In [27]:
min_lm4 = ols('JPERF ~ TEST * MINORITY', data = jobtest_table).fit()
print(min_lm4.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                  JPERF   R-squared:                       0.664
Model:                            OLS   Adj. R-squared:                  0.601
Method:                 Least Squares   F-statistic:                     10.55
Date:                Mon, 14 May 2018   Prob (F-statistic):           0.000451
Time:                        21:44:54   Log-Likelihood:                -32.971
No. Observations:                  20   AIC:                             73.94
Df Residuals:                      16   BIC:                             77.92
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
=================================================================================
                    coef    std err          t      P>|t|      [0.025      0.975]
---------------------------------------------------------------------------------
Intercept         2.0103      1.050      1.914      0.074      -0.216       4.236
TEST              1.3134      0.670      1.959      0.068      -0.108       2.735
MINORITY         -1.9132      1.540     -1.242      0.232      -5.179       1.352
TEST:MINORITY     1.9975      0.954      2.093      0.053      -0.026       4.021
==============================================================================
Omnibus:                        3.377   Durbin-Watson:                   3.015
Prob(Omnibus):                  0.185   Jarque-Bera (JB):                1.330
Skew:                           0.120   Prob(JB):                        0.514
Kurtosis:                       1.760   Cond. No.                         13.8
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In [28]:
fig, ax = plt.subplots(figsize=(8,6));
for factor, group in factor_group:
    ax.scatter(group['TEST'], group['JPERF'], color=colors[factor],
                marker=markers[factor], s=12**2)

fig = abline_plot(intercept = min_lm4.params['Intercept'],
                 slope = min_lm4.params['TEST'], ax=ax, color='purple');
fig = abline_plot(intercept = min_lm4.params['Intercept'] + min_lm4.params['MINORITY'],
        slope = min_lm4.params['TEST'] + min_lm4.params['TEST:MINORITY'],
        ax=ax, color='green');
In [29]:
# is there any effect of MINORITY on slope or intercept?
table5 = anova_lm(min_lm, min_lm4)
print(table5)
   df_resid        ssr  df_diff    ss_diff         F    Pr(>F)
0      18.0  45.568297      0.0        NaN       NaN       NaN
1      16.0  31.655473      2.0  13.912824  3.516061  0.054236
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)
In [30]:
# is there any effect of MINORITY on intercept
table6 = anova_lm(min_lm, min_lm3)
print(table6)
   df_resid        ssr  df_diff   ss_diff         F    Pr(>F)
0      18.0  45.568297      0.0       NaN       NaN       NaN
1      17.0  40.321546      1.0  5.246751  2.212087  0.155246
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)
In [31]:
# is there any effect of MINORITY on slope
table7 = anova_lm(min_lm, min_lm2)
print(table7)
   df_resid        ssr  df_diff    ss_diff         F    Pr(>F)
0      18.0  45.568297      0.0        NaN       NaN       NaN
1      17.0  34.707653      1.0  10.860644  5.319603  0.033949
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)
In [32]:
# is it just the slope or both?
table8 = anova_lm(min_lm2, min_lm4)
print(table8)
   df_resid        ssr  df_diff  ss_diff         F    Pr(>F)
0      17.0  34.707653      0.0      NaN       NaN       NaN
1      16.0  31.655473      1.0  3.05218  1.542699  0.232115
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)

One-way ANOVA

In [33]:
try:
    rehab_table = pd.read_csv('rehab.table')
except:
    url = 'http://stats191.stanford.edu/data/rehab.csv'
    rehab_table = pd.read_table(url, delimiter=",")
    rehab_table.to_csv('rehab.table')

fig, ax = plt.subplots(figsize=(8,6))
fig = rehab_table.boxplot('Time', 'Fitness', ax=ax, grid=False)
In [34]:
rehab_lm = ols('Time ~ C(Fitness)', data=rehab_table).fit()
table9 = anova_lm(rehab_lm)
print(table9)

print(rehab_lm.model.data.orig_exog)
              df  sum_sq     mean_sq          F    PR(>F)
C(Fitness)   2.0   672.0  336.000000  16.961538  0.000041
Residual    21.0   416.0   19.809524        NaN       NaN
    Intercept  C(Fitness)[T.2]  C(Fitness)[T.3]
0         1.0              0.0              0.0
1         1.0              0.0              0.0
2         1.0              0.0              0.0
3         1.0              0.0              0.0
4         1.0              0.0              0.0
5         1.0              0.0              0.0
6         1.0              0.0              0.0
7         1.0              0.0              0.0
8         1.0              1.0              0.0
9         1.0              1.0              0.0
10        1.0              1.0              0.0
11        1.0              1.0              0.0
12        1.0              1.0              0.0
13        1.0              1.0              0.0
14        1.0              1.0              0.0
15        1.0              1.0              0.0
16        1.0              1.0              0.0
17        1.0              1.0              0.0
18        1.0              0.0              1.0
19        1.0              0.0              1.0
20        1.0              0.0              1.0
21        1.0              0.0              1.0
22        1.0              0.0              1.0
23        1.0              0.0              1.0
In [35]:
print(rehab_lm.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                   Time   R-squared:                       0.618
Model:                            OLS   Adj. R-squared:                  0.581
Method:                 Least Squares   F-statistic:                     16.96
Date:                Mon, 14 May 2018   Prob (F-statistic):           4.13e-05
Time:                        21:44:56   Log-Likelihood:                -68.286
No. Observations:                  24   AIC:                             142.6
Df Residuals:                      21   BIC:                             146.1
Df Model:                           2                                         
Covariance Type:            nonrobust                                         
===================================================================================
                      coef    std err          t      P>|t|      [0.025      0.975]
-----------------------------------------------------------------------------------
Intercept          38.0000      1.574     24.149      0.000      34.728      41.272
C(Fitness)[T.2]    -6.0000      2.111     -2.842      0.010     -10.390      -1.610
C(Fitness)[T.3]   -14.0000      2.404     -5.824      0.000     -18.999      -9.001
==============================================================================
Omnibus:                        0.163   Durbin-Watson:                   2.209
Prob(Omnibus):                  0.922   Jarque-Bera (JB):                0.211
Skew:                          -0.163   Prob(JB):                        0.900
Kurtosis:                       2.675   Cond. No.                         3.80
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Two-way ANOVA

In [36]:
try:
    kidney_table = pd.read_table('./kidney.table')
except:
    url = 'http://stats191.stanford.edu/data/kidney.table'
    kidney_table = pd.read_csv(url, delim_whitespace=True)

Explore the dataset

In [37]:
kidney_table.head(10)
Out[37]:
Days Duration Weight ID
0 0.0 1 1 1
1 2.0 1 1 2
2 1.0 1 1 3
3 3.0 1 1 4
4 0.0 1 1 5
5 2.0 1 1 6
6 0.0 1 1 7
7 5.0 1 1 8
8 6.0 1 1 9
9 8.0 1 1 10

Balanced panel

In [38]:
kt = kidney_table
plt.figure(figsize=(8,6))
fig = interaction_plot(kt['Weight'], kt['Duration'], np.log(kt['Days']+1),
        colors=['red', 'blue'], markers=['D','^'], ms=10, ax=plt.gca())

You have things available in the calling namespace available in the formula evaluation namespace

In [39]:
kidney_lm = ols('np.log(Days+1) ~ C(Duration) * C(Weight)', data=kt).fit()

table10 = anova_lm(kidney_lm)

print(anova_lm(ols('np.log(Days+1) ~ C(Duration) + C(Weight)',
                data=kt).fit(), kidney_lm))
print(anova_lm(ols('np.log(Days+1) ~ C(Duration)', data=kt).fit(),
               ols('np.log(Days+1) ~ C(Duration) + C(Weight, Sum)',
                   data=kt).fit()))
print(anova_lm(ols('np.log(Days+1) ~ C(Weight)', data=kt).fit(),
               ols('np.log(Days+1) ~ C(Duration) + C(Weight, Sum)',
                   data=kt).fit()))
   df_resid        ssr  df_diff   ss_diff        F    Pr(>F)
0      56.0  29.624856      0.0       NaN      NaN       NaN
1      54.0  28.989198      2.0  0.635658  0.59204  0.556748
   df_resid        ssr  df_diff    ss_diff          F    Pr(>F)
0      58.0  46.596147      0.0        NaN        NaN       NaN
1      56.0  29.624856      2.0  16.971291  16.040454  0.000003
   df_resid        ssr  df_diff   ss_diff         F   Pr(>F)
0      57.0  31.964549      0.0       NaN       NaN      NaN
1      56.0  29.624856      1.0  2.339693  4.422732  0.03997
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in greater
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:879: RuntimeWarning: invalid value encountered in less
  return (self.a < x) & (x < self.b)
/Users/taugspurger/Envs/statsmodels-dev/lib/python3.6/site-packages/scipy/stats/_distn_infrastructure.py:1821: RuntimeWarning: invalid value encountered in less_equal
  cond2 = cond0 & (x <= self.a)

Sum of squares

Illustrates the use of different types of sums of squares (I,II,II) and how the Sum contrast can be used to produce the same output between the 3.

Types I and II are equivalent under a balanced design.

Don't use Type III with non-orthogonal contrast - ie., Treatment

In [40]:
sum_lm = ols('np.log(Days+1) ~ C(Duration, Sum) * C(Weight, Sum)',
            data=kt).fit()

print(anova_lm(sum_lm))
print(anova_lm(sum_lm, typ=2))
print(anova_lm(sum_lm, typ=3))
                                   df     sum_sq   mean_sq          F    PR(>F)
C(Duration, Sum)                  1.0   2.339693  2.339693   4.358293  0.041562
C(Weight, Sum)                    2.0  16.971291  8.485645  15.806745  0.000004
C(Duration, Sum):C(Weight, Sum)   2.0   0.635658  0.317829   0.592040  0.556748
Residual                         54.0  28.989198  0.536837        NaN       NaN
                                    sum_sq    df          F    PR(>F)
C(Duration, Sum)                  2.339693   1.0   4.358293  0.041562
C(Weight, Sum)                   16.971291   2.0  15.806745  0.000004
C(Duration, Sum):C(Weight, Sum)   0.635658   2.0   0.592040  0.556748
Residual                         28.989198  54.0        NaN       NaN
                                     sum_sq    df           F        PR(>F)
Intercept                        156.301830   1.0  291.153237  2.077589e-23
C(Duration, Sum)                   2.339693   1.0    4.358293  4.156170e-02
C(Weight, Sum)                    16.971291   2.0   15.806745  3.944502e-06
C(Duration, Sum):C(Weight, Sum)    0.635658   2.0    0.592040  5.567479e-01
Residual                          28.989198  54.0         NaN           NaN
In [41]:
nosum_lm = ols('np.log(Days+1) ~ C(Duration, Treatment) * C(Weight, Treatment)',
            data=kt).fit()
print(anova_lm(nosum_lm))
print(anova_lm(nosum_lm, typ=2))
print(anova_lm(nosum_lm, typ=3))
                                               df     sum_sq   mean_sq          F    PR(>F)
C(Duration, Treatment)                        1.0   2.339693  2.339693   4.358293  0.041562
C(Weight, Treatment)                          2.0  16.971291  8.485645  15.806745  0.000004
C(Duration, Treatment):C(Weight, Treatment)   2.0   0.635658  0.317829   0.592040  0.556748
Residual                                     54.0  28.989198  0.536837        NaN       NaN
                                                sum_sq    df          F    PR(>F)
C(Duration, Treatment)                        2.339693   1.0   4.358293  0.041562
C(Weight, Treatment)                         16.971291   2.0  15.806745  0.000004
C(Duration, Treatment):C(Weight, Treatment)   0.635658   2.0   0.592040  0.556748
Residual                                     28.989198  54.0        NaN       NaN
                                                sum_sq    df          F    PR(>F)
Intercept                                    10.427596   1.0  19.424139  0.000050
C(Duration, Treatment)                        0.054293   1.0   0.101134  0.751699
C(Weight, Treatment)                         11.703387   2.0  10.900317  0.000106
C(Duration, Treatment):C(Weight, Treatment)   0.635658   2.0   0.592040  0.556748
Residual                                     28.989198  54.0        NaN       NaN