Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
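For reference, the model simulated here is y = 5 + 0.5*x1 + 0.5*sin(x1) - 0.02*(x1 - 5)**2 + e, with e drawn from N(0, 0.25**2); after add_constant, the columns of X are the intercept, x1, sin(x1), and (x1 - 5)**2, matching the entries of beta.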

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     953.4
Date:                Thu, 01 Dec 2022   Prob (F-statistic):           2.11e-41
Time:                        18:25:09   Log-Likelihood:                 2.2384
No. Observations:                  50   AIC:                             3.523
Df Residuals:                      46   BIC:                             11.17
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9745      0.082     60.502      0.000       4.809       5.140
x1             0.4894      0.013     38.595      0.000       0.464       0.515
x2             0.4563      0.050      9.153      0.000       0.356       0.557
x3            -0.0191      0.001    -17.194      0.000      -0.021      -0.017
==============================================================================
Omnibus:                        1.001   Durbin-Watson:                   2.346
Prob(Omnibus):                  0.606   Jarque-Bera (JB):                1.049
Skew:                          -0.247   Prob(JB):                        0.592
Kurtosis:                       2.491   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.49597305  4.95177161  5.37143825  5.73010774  6.01188857  6.21247391
  6.33984919  6.4129801   6.45869662  6.5072849   6.58751135  6.72189639
  6.9230144   7.19142781  7.51559468  7.87376498  8.23755413  8.57660643
  8.86358246  9.07865086  9.21274655  9.26906014  9.26251365  9.21730895
  9.16295138  9.12940218  9.14215588  9.21805167  9.36250672  9.5686262
  9.81833607 10.08535256 10.33950058 10.55167132 10.69860485 10.76671154
 10.75430492 10.67187866 10.54038254 10.38778048 10.24445608 10.13822076
 10.08974434 10.10915914 10.19439516 10.33152029 10.49702923 10.66170537
 10.79542101 10.87208725]
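For a linear model, predict simply applies the estimated coefficients to the supplied design matrix, so the in-sample values above can be reproduced by hand using only the objects already defined:

np.allclose(ypred, X @ olsres.params)  # True: predict(X) is X times the fitted parameters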

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.86313665 10.73326287 10.50035832 10.20519754  9.90145417  9.64255975
  9.46862185  9.39660433  9.41617397  9.49223043]
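If you also want uncertainty around the out-of-sample predictions, the results object exposes get_prediction, which returns standard errors and interval estimates alongside the point predictions (a short sketch using the fitted model above):

pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))  # mean, mean_se, and 95% confidence/prediction interval bounds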

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f34abff67a0>
[Figure: data points, true values, and OLS in-sample/out-of-sample predictions]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform; i.e., we do not want any expansion magic from using **2.
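In a formula, ** is the formula operator for interaction powers rather than Python's power operator, which is why the quadratic term needs to be wrapped. A function call is evaluated as plain Python, so (assuming patsy's usual handling of expressions inside a call) an equivalent way to write the same term should be, for example:

res_alt = ols("y ~ x1 + np.sin(x1) + np.power(x1 - 5, 2)", data=data).fit()  # should give the same fit as res, with a different term name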

[9]:
res.params
[9]:
Intercept           4.974540
x1                  0.489406
np.sin(x1)          0.456250
I((x1 - 5) ** 2)   -0.019143
dtype: float64

Now we only have to pass the single variable x1, and we get the transformed right-hand-side variables automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.863137
1    10.733263
2    10.500358
3    10.205198
4     9.901454
5     9.642560
6     9.468622
7     9.396604
8     9.416174
9     9.492230
dtype: float64
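As a final check, the formula-based out-of-sample predictions agree with the array-based ones computed earlier:

np.allclose(res.predict(exog=dict(x1=x1n)), ynewpred)  # True: both approaches build the same design matrix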