Robust Linear Models
====================

.. _robust_models_0_notebook:
In [ ]:
   from __future__ import print_function
   import numpy as np
   import statsmodels.api as sm
   import matplotlib.pyplot as plt
   from statsmodels.sandbox.regression.predstd import wls_prediction_std
   

Estimation
----------

Load data:

In [ ]:
   data = sm.datasets.stackloss.load()
   data.exog = sm.add_constant(data.exog)
   

Huber's T norm with the (default) median absolute deviation scaling

In [ ]:
   huber_t = sm.RLM(data.endog, data.exog, M=sm.robust.norms.HuberT())
   hub_results = huber_t.fit()
   print(hub_results.params)
   print(hub_results.bse)
   print(hub_results.summary(yname='y',
               xname=['var_%d' % i for i in range(len(hub_results.params))]))
   

   [-41.02649835   0.82938433   0.92606597  -0.12784672]
   [ 9.79189854  0.11100521  0.30293016  0.12864961]
                       Robust linear Model Regression Results
   ==============================================================================
   Dep. Variable:                      y   No. Observations:                   21
   Model:                            RLM   Df Residuals:                       17
   Method:                          IRLS   Df Model:                            3
   Norm:                          HuberT
   Scale Est.:                       mad
   Cov Type:                          H1
   Date:                Mon, 20 Jul 2015
   Time:                        17:44:35
   No. Iterations:                    19
   ==============================================================================
                    coef    std err          z      P>|z|      [95.0% Conf. Int.]
   ------------------------------------------------------------------------------
   var_0        -41.0265      9.792     -4.190      0.000       -60.218   -21.835
   var_1          0.8294      0.111      7.472      0.000         0.612     1.047
   var_2          0.9261      0.303      3.057      0.002         0.332     1.520
   var_3         -0.1278      0.129     -0.994      0.320        -0.380     0.124
   ==============================================================================

If the model instance has been used for another fit with different fit
parameters, the fit options reported in the summary may no longer be the
ones this fit was produced with.

Huber's T norm with 'H2' covariance matrix

In [ ]:
   hub_results2 = huber_t.fit(cov="H2")
   print(hub_results2.params)
   print(hub_results2.bse)

   [-41.02649835   0.82938433   0.92606597  -0.12784672]
   [ 9.08950419  0.11945975  0.32235497  0.11796313]
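
The reason is that the fit options are recorded on the model instance,
which both results objects share. A quick check (a sketch, not part of the
original notebook; it assumes ``RLM`` stores its most recent fit options in
a ``fit_options`` dict on the model):

In [ ]:
   # After the cov="H2" fit, the shared model instance records the newer
   # options, so a summary of the earlier hub_results may now report the
   # H2 covariance type.  (`fit_options` is an assumed attribute.)
   print(huber_t.fit_options)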
   

Andrew's Wave norm with Huber's Proposal 2 scaling and 'H3' covariance matrix

In [ ]:
   andrew_mod = sm.RLM(data.endog, data.exog, M=sm.robust.norms.AndrewWave())
   andrew_results = andrew_mod.fit(scale_est=sm.robust.scale.HuberScale(), cov="H3")
   print('Parameters: ', andrew_results.params)

   Parameters:  [-40.8817957    0.79276138   1.04857556  -0.13360865]
   

See ``help(sm.RLM.fit)`` for more fitting options and the module ``sm.robust.scale`` for the available scale estimators.
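
As a quick illustration (a sketch, not part of the original notebook), the
same data can be refit under several of the norms available in
``sm.robust.norms``:

In [ ]:
   # Compare a few M-estimator norms on the stack-loss data; each of these
   # norm classes exists in sm.robust.norms.
   for norm in [sm.robust.norms.LeastSquares(),
                sm.robust.norms.HuberT(),
                sm.robust.norms.RamsayE(),
                sm.robust.norms.TukeyBiweight()]:
       res_norm = sm.RLM(data.endog, data.exog, M=norm).fit()
       print(norm.__class__.__name__, res_norm.params)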

Comparing OLS and RLM
---------------------

Artificial data with outliers:

In [ ]:
   nsample = 50
   x1 = np.linspace(0, 20, nsample)
   X = np.column_stack((x1, (x1 - 5)**2))
   X = sm.add_constant(X)
   sig = 0.3   # smaller error variance makes the OLS<->RLM contrast bigger
   beta = [5, 0.5, -0.0]
   y_true2 = np.dot(X, beta)
   y2 = y_true2 + sig * np.random.normal(size=nsample)
   y2[[39, 41, 43, 45, 48]] -= 5   # add some outliers (10% of nsample)
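
Note that the draws above are not seeded, so the exact numbers printed
below will vary between runs; seeding NumPy before generating the data (an
addition, not in the original notebook) makes them reproducible:

In [ ]:
   # Hypothetical addition: run before the data generation above to make
   # the random draws, and hence all printed results, reproducible.
   np.random.seed(12345)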
   

Example 1: quadratic function with linear truth
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Note that the quadratic term in the OLS regression partly absorbs the effect of the outliers, even though the true quadratic coefficient is zero.

In [ ]:
   res = sm.OLS(y2, X).fit()
   print(res.params)
   print(res.bse)
   print(res.predict())
   

   [ 5.05563298  0.52746225 -0.01292805]
   [ 0.47889156  0.07393439  0.00654205]
   [  4.73243167   4.99833616   5.25993311   5.51722249   5.77020432
      6.0188786    6.26324532   6.50330448   6.73905608   6.97050013
      7.19763663   7.42046557   7.63898695   7.85320077   8.06310704
      8.26870576   8.46999692   8.66698052   8.85965656   9.04802505
      9.23208599   9.41183936   9.58728518   9.75842345   9.92525416
     10.08777731  10.24599291  10.39990095  10.54950143  10.69479436
     10.83577974  10.97245755  11.10482781  11.23289052  11.35664567
     11.47609326  11.59123329  11.70206578  11.8085907   11.91080807
     12.00871788  12.10232014  12.19161484  12.27660198  12.35728157
     12.4336536   12.50571808  12.573475    12.63692436  12.69606617]

Estimate RLM:

In [ ]:
   resrlm = sm.RLM(y2, X).fit()
   print(resrlm.params)
   print(resrlm.bse)

   [  4.97550374e+00   5.13254535e-01  -2.01895684e-03]
   [ 0.10367794  0.01600647  0.00141633]
   

Draw a plot to compare OLS estimates to the robust estimates:

In [ ]:
   fig = plt.figure(figsize=(12, 8))
   ax = fig.add_subplot(111)
   ax.plot(x1, y2, 'o', label="data")
   ax.plot(x1, y_true2, 'b-', label="True")
   prstd, iv_l, iv_u = wls_prediction_std(res)
   ax.plot(x1, res.fittedvalues, 'r-', label="OLS")
   ax.plot(x1, iv_u, 'r--')
   ax.plot(x1, iv_l, 'r--')
   ax.plot(x1, resrlm.fittedvalues, 'g.-', label="RLM")
   ax.legend(loc="best")
   
   

Example 2: linear function with linear truth
~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~

Fit a new OLS model using only the linear term and the constant:

In [ ]:
   X2 = X[:, [0, 1]]
   res2 = sm.OLS(y2, X2).fit()
   print(res2.params)
   print(res2.bse)
   

   [ 5.57671265  0.39818173]
   [ 0.41167908  0.03547193]

Estimate RLM:

In [ ]:
   resrlm2 = sm.RLM(y2, X2).fit()
   print(resrlm2.params)
   print(resrlm2.bse)

   [ 5.03066706  0.49664337]
   [ 0.08236985  0.00709732]
   

Draw a plot to compare OLS estimates to the robust estimates:

In [ ]:
   prstd, iv_l, iv_u = wls_prediction_std(res2)
   
   fig, ax = plt.subplots(figsize=(8,6))
   ax.plot(x1, y2, 'o', label="data")
   ax.plot(x1, y_true2, 'b-', label="True")
   ax.plot(x1, res2.fittedvalues, 'r-', label="OLS")
   ax.plot(x1, iv_u, 'r--')
   ax.plot(x1, iv_l, 'r--')
   ax.plot(x1, resrlm2.fittedvalues, 'g.-', label="RLM")
   legend = ax.legend(loc="best")
   
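
As a closing check (a sketch, not in the original notebook), the final IRLS
weights of the robust fit show directly which observations were
downweighted; ``RLMResults`` exposes them as ``weights``:

In [ ]:
   # The planted outliers (indices 39, 41, 43, 45, 48) should receive
   # weights well below 1, while the clean points stay near 1.
   w = resrlm2.weights
   print(np.argsort(w)[:5])   # indices of the five most downweighted points
   print(np.sort(w)[:5])      # their weights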