Prediction (out of sample)

In [1]:
%matplotlib inline

import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
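
Note that no random seed is set, so the exact numbers shown in the outputs below will differ from run to run. For reproducible draws you could seed NumPy's generator before sampling (a minimal sketch; the seed value is arbitrary):

np.random.seed(12345)
y = y_true + sig * np.random.normal(size=nsample)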

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.982
Model:                            OLS   Adj. R-squared:                  0.981
Method:                 Least Squares   F-statistic:                     857.7
Date:                Thu, 15 Nov 2018   Prob (F-statistic):           2.31e-40
Time:                        03:15:16   Log-Likelihood:               -0.60971
No. Observations:                  50   AIC:                             9.219
Df Residuals:                      46   BIC:                             16.87
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0426      0.087     57.934      0.000       4.867       5.218
x1             0.4916      0.013     36.619      0.000       0.465       0.519
x2             0.3787      0.053      7.176      0.000       0.272       0.485
x3            -0.0193      0.001    -16.343      0.000      -0.022      -0.017
==============================================================================
Omnibus:                        0.170   Durbin-Watson:                   2.361
Prob(Omnibus):                  0.918   Jarque-Bera (JB):                0.311
Skew:                          -0.121   Prob(JB):                        0.856
Kurtosis:                       2.699   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
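
The pieces of this summary are also available programmatically from the results object; for example, the estimated coefficients and their 95% confidence intervals (a small sketch):

print(olsres.params)      # estimated coefficients (const, x1, x2, x3)
print(olsres.conf_int())  # 95% confidence intervals, one row per coefficient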

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[ 4.56104383  4.98740541  5.38265308  5.72614849  6.00470156  6.21473756
  6.36288448  6.464884    6.54300524  6.62238595  6.72690249  6.87524699
  7.07785632  7.33519727  7.63768995  7.96728166  8.30041293  8.61188831
  8.87901582  9.08533528  9.2233226   9.29562599  9.31463088  9.30042502
  9.27749808  9.27071815  9.3012459   9.383058    9.52065075  9.70830139
  9.93100829 10.16695573 10.39109811 10.57927489 10.71218011 10.77853428
 10.7769376  10.71609992 10.61340973 10.49207731 10.37732132 10.29222546
 10.25394587 10.27089243 10.34134706 10.45374596 10.58857955 10.7215978
 10.82779446 10.88551552]
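
Predicting on the training design matrix X is the same as reading off the fitted values; a quick consistency check (sketch):

np.allclose(ypred, olsres.fittedvalues)  # True: in-sample predictions equal the fitted values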

Create a new sample of explanatory variables, Xnew, predict, and plot

In [5]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.86931159 10.75105353 10.54559219 10.28677079 10.01913887  9.7870451
  9.62377908  9.54342096  9.53739407  9.57656485]
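
predict returns point predictions only. If you also need standard errors and intervals for the new observations, get_prediction provides them (a sketch; summary_frame is available in recent statsmodels versions):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_se, and 95% confidence/prediction intervals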

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");
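
Outside a notebook (i.e., without %matplotlib inline), call plt.show() after the plotting commands to render the figure.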

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula-operator expansion magic from using **2; the expression inside I() is evaluated as ordinary Python.

In [8]:
res.params
Out[8]:
Intercept           5.042602
x1                  0.491561
np.sin(x1)          0.378692
I((x1 - 5) ** 2)   -0.019262
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are created automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
0    10.869312
1    10.751054
2    10.545592
3    10.286771
4    10.019139
5     9.787045
6     9.623779
7     9.543421
8     9.537394
9     9.576565
dtype: float64
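
Equivalently, the new data can be passed as a pandas DataFrame; the formula machinery then applies np.sin and the I(...) transform to the column named x1 for us (a minimal sketch):

import pandas as pd
res.predict(pd.DataFrame({"x1": x1n}))  # same predictions as above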