Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.978
Model:                            OLS   Adj. R-squared:                  0.977
Method:                 Least Squares   F-statistic:                     696.5
Date:                Tue, 03 Aug 2021   Prob (F-statistic):           2.51e-38
Time:                        09:22:16   Log-Likelihood:                -5.0719
No. Observations:                  50   AIC:                             18.14
Df Residuals:                      46   BIC:                             25.79
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1679      0.095     54.305      0.000       4.976       5.360
x1             0.4809      0.015     32.763      0.000       0.451       0.510
x2             0.4874      0.058      8.448      0.000       0.371       0.604
x3            -0.0185      0.001    -14.391      0.000      -0.021      -0.016
==============================================================================
Omnibus:                        2.429   Durbin-Watson:                   2.234
Prob(Omnibus):                  0.297   Jarque-Bera (JB):                1.771
Skew:                           0.455   Prob(JB):                        0.413
Kurtosis:                       3.142   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
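
The estimated parameters and fit statistics are also available as attributes of the results object (a small sketch; these are standard RegressionResults attributes):

[ ]:
olsres.params     # estimated coefficients (const, x1, x2, x3)
olsres.bse        # standard errors of the coefficients
olsres.rsquared   # coefficient of determination, about 0.978 in the run above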

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.70433763  5.1666748   5.59104661  5.95088898  6.22922469  6.42145263
  6.53610382  6.59343973  6.62212339  6.65451016  6.72133181  6.84664739
  7.04389039  7.31366166  7.64363069  8.01056148  8.38413014  8.73190703
  9.02468485  9.24127735  9.37200009  9.42026166  9.40200361  9.34308128
  9.2750159   9.22981569  9.23471701  9.30770952  9.45458052  9.6679641
  9.92855138 10.20826285 10.47486178 10.69725063 10.85058028 10.92033268
 10.90470609 10.81491115 10.67332898 10.509834   10.35688569 10.24419559
 10.19384571 10.21666063 10.31042893 10.46026662 10.64106249 10.82160381
 10.9697045  11.05749322]
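
As a quick consistency check (a small sketch using the objects from the cells above), predicting on the estimation sample reproduces the fitted values stored on the results object:

[ ]:
# in-sample predictions equal the fitted values
np.allclose(ypred, olsres.fittedvalues)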

Create a new sample of explanatory variables Xnew, predict out of sample, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.05600521 10.92628547 10.68744884 10.38305565 10.07044659  9.80670375
  9.63467491  9.57248291  9.60908857  9.70699343]
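
If interval estimates are needed for the new observations, get_prediction provides standard errors and confidence/prediction intervals (a small sketch; alpha=0.05 gives 95% intervals):

[ ]:
pred = olsres.get_prediction(Xnew)
pred.summary_frame(alpha=0.05)  # mean, mean_se, and confidence/prediction bounds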

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f615a88d640>
[Figure: observed data, true curve, and OLS in-/out-of-sample predictions]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we do not want any formula expansion magic from using **2; the expression inside I() is evaluated as plain Python arithmetic.

[9]:
res.params
[9]:
Intercept           5.167946
x1                  0.480852
np.sin(x1)          0.487422
I((x1 - 5) ** 2)   -0.018544
dtype: float64
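
As a sanity check (a small sketch comparing against the array-based results from earlier cells), the formula-based fit reproduces the coefficients estimated with the explicit design matrix, since I((x1 - 5) ** 2) builds the same squared column:

[ ]:
# same coefficients as the sm.OLS(y, X) fit above (same column order)
np.allclose(res.params.values, olsres.params)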

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.056005
1    10.926285
2    10.687449
3    10.383056
4    10.070447
5     9.806704
6     9.634675
7     9.572483
8     9.609089
9     9.706993
dtype: float64
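
These values agree with the array-based out-of-sample predictions computed earlier (a quick check, assuming the cells above were run in order):

[ ]:
np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)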