.. _regression_diagnostics_notebook:

Regression diagnostics
======================

`Link to Notebook GitHub `_

This example file shows how to use a few of the statsmodels regression diagnostic tests in a real-life context. You can learn about more tests and find more information about them on the Regression Diagnostics page.

Note that most of the tests described here only return a tuple of numbers, without any annotation. A full description of outputs is always included in the docstring and in the online statsmodels documentation. For presentation purposes, we use the lzip(name, test) construct to pretty-print short descriptions in the examples below.
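
As a minimal illustration with made-up numbers, lzip simply pairs each label in name with the corresponding entry of test:

In [ ]:
   from statsmodels.compat import lzip
   # Hypothetical labels and values: lzip pairs them elementwise
   lzip(['statistic', 'p-value'], (1.23, 0.45))  # [('statistic', 1.23), ('p-value', 0.45)]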

Estimate a regression model

In [ ]:
   from __future__ import print_function
   from statsmodels.compat import lzip
   import numpy as np
   import pandas as pd
   import statsmodels.formula.api as smf
   import statsmodels.stats.api as sms
   import matplotlib.pyplot as plt
   
   # Load the Guerry dataset from the Rdatasets collection
   url = 'http://vincentarelbundock.github.io/Rdatasets/csv/HistData/Guerry.csv'
   dat = pd.read_csv(url)
   
   # Fit regression model (using the natural log of one of the regressors)
   results = smf.ols('Lottery ~ Literacy + np.log(Pop1831)', data=dat).fit()
   
   # Inspect the results
   print(results.summary())
   

                            OLS Regression Results                            
   ==============================================================================
   Dep. Variable:                Lottery   R-squared:                       0.348
   Model:                            OLS   Adj. R-squared:                  0.333
   Method:                 Least Squares   F-statistic:                     22.20
   Date:                Mon, 20 Jul 2015   Prob (F-statistic):           1.90e-08
   Time:                        17:44:23   Log-Likelihood:                -379.82
   No. Observations:                  86   AIC:                             765.6
   Df Residuals:                      83   BIC:                             773.0
   Df Model:                           2
   Covariance Type:            nonrobust
   ===================================================================================
                         coef    std err          t      P>|t|      [95.0% Conf. Int.]
   -----------------------------------------------------------------------------------
   Intercept         246.4341     35.233      6.995      0.000       176.358   316.510
   Literacy           -0.4889      0.128     -3.832      0.000        -0.743    -0.235
   np.log(Pop1831)   -31.3114      5.977     -5.239      0.000       -43.199   -19.424
   ==============================================================================
   Omnibus:                        3.713   Durbin-Watson:                   2.019
   Prob(Omnibus):                  0.156   Jarque-Bera (JB):                3.394
   Skew:                          -0.487   Prob(JB):                        0.183
   Kurtosis:                       3.003   Cond. No.                         702.
   ==============================================================================
   
   Warnings:
   [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

Normality of the residuals

Jarque-Bera test:

In [ ]:
   name = ['Jarque-Bera', 'Chi^2 two-tail prob.', 'Skew', 'Kurtosis']
   test = sms.jarque_bera(results.resid)
   lzip(name, test)
   

Omnibus test for normality:

In [ ]:
   name = ['Chi^2', 'Two-tail probability']
   test = sms.omni_normtest(results.resid)
   lzip(name, test)
   

Influence tests

Once created, an object of class OLSInfluence holds attributes and methods that allow users to assess the influence of each observation. For example, we can compute and extract the first few rows of DFbetas by:

In [ ]:
   from statsmodels.stats.outliers_influence import OLSInfluence
   test_class = OLSInfluence(results)
   test_class.dfbetas[:5, :]
   

Explore other options by typing dir(test_class).
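
For instance, assuming the current OLSInfluence API, summary_frame() collects several influence measures (leverage, Cook's distance, DFFITS, and the DFBETAS) into a single DataFrame:

In [ ]:
   # One row of influence diagnostics per observation
   test_class.summary_frame().head()
   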

Useful information on leverage can also be plotted:

In [ ]:
   from statsmodels.graphics.regressionplots import plot_leverage_resid2
   fig, ax = plt.subplots(figsize=(8, 6))
   fig = plot_leverage_resid2(results, ax=ax)
   

Other plotting options can be found on the Graphics page.
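
For example, an influence plot combines studentized residuals, leverage, and Cook's distance in a single figure; a sketch using statsmodels' influence_plot:

In [ ]:
   from statsmodels.graphics.regressionplots import influence_plot
   fig, ax = plt.subplots(figsize=(8, 6))
   fig = influence_plot(results, ax=ax)
   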

Multicollinearity

Condition number:

In [ ]:
   np.linalg.cond(results.model.exog)
   
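
The condition number is sensitive to the scaling of the regressors. As a quick check (a sketch, not part of the original example), the columns of the design matrix can be normalized to unit length before recomputing it:

In [ ]:
   # Normalize each column of the design matrix to unit length,
   # then recompute the condition number on the rescaled matrix
   exog = results.model.exog
   norm_exog = exog / np.linalg.norm(exog, axis=0)
   np.linalg.cond(norm_exog)
   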

Heteroskedasticity tests

Breusch-Pagan test:

In [ ]:
   name = ['Lagrange multiplier statistic', 'p-value',
           'f-value', 'f p-value']
   test = sms.het_breuschpagan(results.resid, results.model.exog)
   lzip(name, test)
   

Goldfeld-Quandt test:

In [ ]:
   name = ['F statistic', 'p-value']
   test = sms.het_goldfeldquandt(results.resid, results.model.exog)
   lzip(name, test)
   

Linearity

Harvey-Collier multiplier test for the null hypothesis that the linear specification is correct:

In [ ]:
   name = ['t value', 'p value']
   test = sms.linear_harvey_collier(results)
   lzip(name, test)