Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
# design matrix: linear trend, a sine term, and a shifted quadratic
X = np.column_stack((x1, np.sin(x1), (x1 - 5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
# observed response: true signal plus Gaussian noise
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.979
Model:                            OLS   Adj. R-squared:                  0.978
Method:                 Least Squares   F-statistic:                     730.4
Date:                Mon, 14 May 2018   Prob (F-statistic):           8.64e-39
Time:                        21:44:43   Log-Likelihood:                -4.2900
No. Observations:                  50   AIC:                             16.58
Df Residuals:                      46   BIC:                             24.23
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9950      0.094     53.315      0.000       4.806       5.184
x1             0.4908      0.014     33.965      0.000       0.462       0.520
x2             0.4860      0.057      8.557      0.000       0.372       0.600
x3            -0.0194      0.001    -15.303      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        0.621   Durbin-Watson:                   1.956
Prob(Omnibus):                  0.733   Jarque-Bera (JB):                0.681
Skew:                          -0.244   Prob(JB):                        0.711
Kurtosis:                       2.701   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.50966384  4.97889316  5.40995859  5.77637232  6.06120589  6.2598714
  6.38087536  6.44442107  6.47908926  6.51714238  6.58922395  6.71932393
  6.92083724  7.19436301  7.52760613  7.89739732  8.27349977  8.62357697
  8.91850548  9.13715987  9.26988354  9.32007527  9.30363083  9.24633134
  9.1796075   9.13537562  9.14079386  9.21380021  9.36016536  9.57254456
  9.83168435 10.10958587 10.37410495 10.59423335 10.74519331 10.81250845
 10.79438222 10.70199306 10.55765783 10.39116516 10.23488133 10.11843255
 10.06383744 10.08188948 10.17038361 10.31447858 10.48913528 10.66323061
 10.80467092 10.88566489]
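
For the estimation sample, predict(X) matches the fitted values stored on the results object; a minimal sketch to verify this:

# in-sample predictions and fittedvalues are the same thing for OLS
assert np.allclose(ypred, olsres.fittedvalues)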

Create a new sample of explanatory variables Xnew, predict, and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.87574726 10.73750459 10.48999684 10.17665926  9.85466789  9.58094089
  9.39820303  9.32452509  9.3488996   9.43393577]
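
Point predictions alone carry no measure of uncertainty. A minimal sketch using the results object's get_prediction method, which also returns confidence and prediction intervals for the new observations:

pred = olsres.get_prediction(Xnew)
# summary_frame holds the point prediction, its standard error,
# and both the confidence and the prediction interval
print(pred.summary_frame(alpha=0.05))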

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
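
Since the red line continues past the observed data, it can help to mark where the estimation sample ends; a small optional sketch, reusing the ax object from above:

# mark the boundary between in-sample (x1 <= 20) and out-of-sample predictions
ax.axvline(x1.max(), color="k", linestyle=":")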

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; that is, we don't want any expansion magic from using **2 in the formula.
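
To see the difference, note that in the formula language ** acts on sets of terms rather than on numbers, so a bare x1**2 collapses to just x1; wrapping the expression in I() evaluates it as ordinary Python. A hedged illustration, fitting two throwaway models only to compare the generated terms:

print(ols("y ~ x1**2", data=data).fit().params.index.tolist())
# expected: ['Intercept', 'x1'] -- no squared column without I()
print(ols("y ~ I(x1**2)", data=data).fit().params.index.tolist())
# expected: ['Intercept', 'I(x1 ** 2)']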

In [8]:
res.params
Out[8]:
Intercept           4.995017
x1                  0.490756
np.sin(x1)          0.486022
I((x1 - 5) ** 2)   -0.019414
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand side variables are generated automatically

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.875747
1    10.737505
2    10.489997
3    10.176659
4     9.854668
5     9.580941
6     9.398203
7     9.324525
8     9.348900
9     9.433936
dtype: float64
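
The get_prediction machinery works with the formula interface as well; a sketch, assuming the fitted res from above, that applies the same transforms to x1n and returns interval estimates:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))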