Prediction (out of sample)

In [1]:
%matplotlib inline
In [2]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data
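
We simulate 50 observations from y = 5 + 0.5*x1 + 0.5*sin(x1) - 0.02*(x1 - 5)**2 plus Gaussian noise with standard deviation 0.25.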

In [3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results                            
==============================================================================
Dep. Variable:                      y   R-squared:                       0.987
Model:                            OLS   Adj. R-squared:                  0.986
Method:                 Least Squares   F-statistic:                     1125.
Date:                Sun, 24 Nov 2019   Prob (F-statistic):           4.94e-43
Time:                        07:49:26   Log-Likelihood:                 6.8869
No. Observations:                  50   AIC:                            -5.774
Df Residuals:                      46   BIC:                             1.874
Df Model:                           3                                         
Covariance Type:            nonrobust                                         
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.0627      0.075     67.574      0.000       4.912       5.214
x1             0.4783      0.012     41.392      0.000       0.455       0.502
x2             0.4341      0.045      9.558      0.000       0.343       0.526
x3            -0.0182      0.001    -17.950      0.000      -0.020      -0.016
==============================================================================
Omnibus:                        1.443   Durbin-Watson:                   2.206
Prob(Omnibus):                  0.486   Jarque-Bera (JB):                1.114
Skew:                          -0.365   Prob(JB):                        0.573
Kurtosis:                       2.956   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.60748496  5.0463146   5.45076499  5.797176    6.07042632  6.26641786
  6.39274902  6.46746631  6.51609936  6.56746647  6.64893986  6.78194846
  6.97845725  7.23900137  7.55259828  7.8985521   8.24985393  8.57761929
  8.8558338   9.06562724  9.19837388  9.25710968  9.25603348  9.21817416
  9.17160706  9.1448413   9.16213582  9.23951366  9.38212945  9.58342253
  9.82619482 10.08543644 10.33243475 10.53949175 10.68447479 10.75445292
 10.74782137 10.67456525 10.55461897 10.41459115 10.28339283 10.18748751
 10.14654323 10.17020123 10.25649185 10.3921582  10.55483419 10.71671923
 10.84914579 10.92728939]
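
As a quick sanity check (a sketch, not part of the original example), in-sample prediction simply reproduces the fitted values stored on the results instance:

# Sketch: predicting on the estimation sample matches fittedvalues.
assert np.allclose(ypred, olsres.fittedvalues)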

Create a new sample of explanatory variables Xnew, predict, and plot.

In [6]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.92513612 10.80792256 10.59267396 10.31818866 10.03553896  9.79556679
  9.63643586  9.57428786  9.59929038  9.67804426]
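
If prediction uncertainty is needed as well, newer statsmodels versions provide get_prediction, which returns standard errors and confidence/prediction intervals along with the point predictions (a minimal sketch; the column names of summary_frame may vary by version):

# Sketch: point predictions plus confidence and prediction intervals
# for the new sample (assumes a statsmodels version with get_prediction).
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))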

Plot comparison

In [7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [8]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform; i.e., we don't want any formula expansion magic from using **2.
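
To see why I() is needed, recall that patsy interprets operators in a formula as formula operators rather than arithmetic: ** denotes interaction expansion, so for a single numeric variable x1**2 collapses to just x1 and no squared term is created (a minimal sketch, assuming patsy's standard formula semantics):

# Sketch: without I(), ** is formula interaction expansion, not
# exponentiation, so the squared term silently disappears.
res_no_identity = ols("y ~ x1**2", data=data).fit()
print(res_no_identity.params)  # only Intercept and x1 remain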

In [9]:
res.params
Out[9]:
Intercept           5.062746
x1                  0.478278
np.sin(x1)          0.434138
I((x1 - 5) ** 2)   -0.018210
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

In [10]:
res.predict(exog=dict(x1=x1n))
Out[10]:
0    10.925136
1    10.807923
2    10.592674
3    10.318189
4    10.035539
5     9.795567
6     9.636436
7     9.574288
8     9.599290
9     9.678044
dtype: float64
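
As a final consistency check (a sketch, not part of the original example), the formula-based fit reproduces the array-interface results, both for the coefficients and for the out-of-sample predictions:

# Sketch: the formula interface and the array interface should agree.
assert np.allclose(res.params.values, olsres.params)
assert np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)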