Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
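
The noise term is drawn freshly on every run, so the fitted numbers below will vary slightly between executions. If you want reproducible draws, seed NumPy's generator before sampling; a minimal sketch (the seed value 12345 is arbitrary):

rng = np.random.default_rng(12345)  # seeded generator for reproducible noise
y = y_true + sig * rng.normal(size=nsample)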

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     878.4
Date:                Tue, 07 Dec 2021   Prob (F-statistic):           1.35e-40
Time:                        17:49:20   Log-Likelihood:                -3.2221
No. Observations:                  50   AIC:                             14.44
Df Residuals:                      46   BIC:                             22.09
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.7449      0.092     51.739      0.000       4.560       4.930
x1             0.5292      0.014     37.413      0.000       0.501       0.558
x2             0.4923      0.056      8.855      0.000       0.380       0.604
x3            -0.0211      0.001    -17.009      0.000      -0.024      -0.019
==============================================================================
Omnibus:                        5.432   Durbin-Watson:                   1.865
Prob(Omnibus):                  0.066   Jarque-Bera (JB):                4.497
Skew:                           0.714   Prob(JB):                        0.106
Kurtosis:                       3.349   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
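
The quantities shown in the summary are also available programmatically as attributes and methods of the results object, for example:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors of the coefficients
print(olsres.rsquared)    # R-squared reported in the summary
print(olsres.conf_int())  # 95% confidence intervals (alpha=0.05 by default)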

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.21687788  4.71097404  5.1659254   5.55489992  5.8607491   6.07882544
  6.217746    6.29797664  6.34846955  6.4019065   6.48932926  6.6350396
  6.85260653  7.14263694  7.49267595  7.87925314  8.27173873  8.63737594
  8.94666281  9.17819936  9.32220347  9.38211812  9.37404566  9.32410227
  9.26412715  9.22645142  9.23858621  9.31870265  9.47264643  9.69297733
  9.96019167 10.24592669 10.51762045 10.74386171 10.89955067 10.97002282
 10.95345824 10.86118068 10.71579701 10.54748305 10.38902572 10.27043627
 10.2140195  10.23070916 10.31827155 10.46167252 10.63554752 10.80836897
 10.9476261  11.0251664 ]
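
As a quick sanity check, the spread of the in-sample residuals should be close to the noise level sig = 0.25 used to generate the data; a minimal sketch:

resid_rmse = np.sqrt(np.mean((y - ypred) ** 2))  # y - ypred equals olsres.resid
print(resid_rmse)  # should be roughly sig = 0.25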

Create a new sample of explanatory variables, Xnew, predict out of sample, and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.0089388  10.86199072 10.60362984 10.27785591  9.94258807  9.65548426
  9.4598245   9.37391446  9.38660359  9.46001514]
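
predict returns only point predictions. To also get standard errors and intervals for the new observations, use get_prediction, whose summary_frame contains both confidence intervals for the mean and prediction intervals for new observations:

pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # 95% intervals
print(frame[["mean", "mean_ci_lower", "mean_ci_upper", "obs_ci_lower", "obs_ci_upper"]])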

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f7d9b71ef40>
[Figure: scatter of the data with the true curve and the OLS in-/out-of-sample prediction]
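
Building on the fig and ax from the previous cell, the out-of-sample uncertainty can be visualized by shading the 95% prediction interval from get_prediction; a sketch (in a notebook, ending the cell with fig re-displays the updated figure):

frame = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)
ax.fill_between(x1n, frame["obs_ci_lower"], frame["obs_ci_upper"],
                color="r", alpha=0.2, label="95% prediction interval")
ax.legend(loc="best")
fig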

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any formula expansion magic from using **2; the expression is evaluated as a plain numerical square.

[9]:
res.params
[9]:
Intercept           4.744921
x1                  0.529158
np.sin(x1)          0.492339
I((x1 - 5) ** 2)   -0.021122
dtype: float64

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.008939
1    10.861991
2    10.603630
3    10.277856
4     9.942588
5     9.655484
6     9.459825
7     9.373914
8     9.386604
9     9.460015
dtype: float64
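
get_prediction works with the formula interface as well, so intervals for new data again require only the original variable; the right-hand-side transforms are applied to x1n automatically:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))  # mean, confidence and prediction intervals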