Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
# Regressors: x1, sin(x1), and (x1 - 5)**2, plus a constant added below
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
# True model: y = 5 + 0.5*x1 + 0.5*sin(x1) - 0.02*(x1 - 5)**2 + noise
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.986
Model:                            OLS   Adj. R-squared:                  0.985
Method:                 Least Squares   F-statistic:                     1095.
Date:                Fri, 05 May 2023   Prob (F-statistic):           9.24e-43
Time:                        13:46:30   Log-Likelihood:                 5.5153
No. Observations:                  50   AIC:                            -3.031
Df Residuals:                      46   BIC:                             4.618
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1190      0.077     66.476      0.000       4.964       5.274
x1             0.4958      0.012     41.750      0.000       0.472       0.520
x2             0.4406      0.047      9.438      0.000       0.347       0.535
x3            -0.0198      0.001    -18.970      0.000      -0.022      -0.018
==============================================================================
Omnibus:                        0.463   Durbin-Watson:                   2.419
Prob(Omnibus):                  0.793   Jarque-Bera (JB):                0.179
Skew:                          -0.144   Prob(JB):                        0.915
Kurtosis:                       3.047   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
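
The key quantities in the summary are also available directly as attributes of the results object. As a minimal sketch using standard OLSResults attributes:

print(olsres.params)      # estimated coefficients
print(olsres.bse)         # standard errors
print(olsres.conf_int())  # 95% confidence intervals for the coefficients
print(olsres.rsquared)    # R-squared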

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.62445026  5.0791561   5.49853785  5.85858285  6.14394448  6.35046358
  6.48585175  6.56842443  6.62409164  6.68210109  6.77023286  6.91023518
  7.1142513   7.38282436  7.70480816  8.05919845  8.41858387  8.75364956
  9.03799351  9.25246431  9.38830773  9.44860496  9.44776639  9.40916405
  9.36129162  9.33308304  9.34915868  9.42578035  9.5681795   9.76969769
 10.0128806  10.27234567 10.51895244 10.7245902  10.86679651 10.93244774
 10.91991521 10.8393328  10.71093198 10.56171783 10.42103219 10.31573286
 10.26578097 10.28096139 10.35927514 10.48726778 10.64223995 10.79597683
 10.91938372 10.98726625]
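
These in-sample predictions are the fitted values of the regression, so the same numbers are available without re-passing X. A minimal sketch of the equivalence:

assert np.allclose(ypred, olsres.fittedvalues)  # fitted values attribute
assert np.allclose(ypred, olsres.predict())     # predict() defaults to the estimation sample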

Create a new sample of explanatory variables Xnew, predict, and plot.

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.97049523 10.83630035 10.60196055 10.3068523  10.00280887  9.74142978
  9.5614474   9.47924392  9.48484044  9.54434013]
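
The point forecasts above carry no measure of uncertainty. A minimal sketch using the results object's get_prediction method to obtain confidence and prediction intervals for the new sample:

pred = olsres.get_prediction(Xnew)
frame = pred.summary_frame(alpha=0.05)  # mean, mean CI, and observation (prediction) intervals
print(frame[["mean", "obs_ci_lower", "obs_ci_upper"]])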

Plot comparison

[7]:
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fceb3698a60>
[Figure: scatter of the data with the true curve and the OLS in- and out-of-sample predictions]
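
To also show forecast uncertainty in the plot, a minimal sketch shading the 95% prediction interval from get_prediction (computed as in the sketch above):

pred_frame = olsres.get_prediction(Xnew).summary_frame(alpha=0.05)

fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1n, ynewpred, "r-", label="OLS prediction")
ax.fill_between(
    x1n,
    pred_frame["obs_ci_lower"],
    pred_frame["obs_ci_upper"],
    color="r",
    alpha=0.2,
    label="95% prediction interval",
)
ax.legend(loc="best")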

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2 in the formula.
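
For instance, a minimal sketch of a plain quadratic term protected by I() (a hypothetical quad_res model, purely to illustrate the syntax):

quad_res = ols("y ~ x1 + I(x1 ** 2)", data=data).fit()
print(quad_res.params)  # Intercept, x1, I(x1 ** 2)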

[9]:
res.params
[9]:
Intercept           5.118951
x1                  0.495829
np.sin(x1)          0.440606
I((x1 - 5) ** 2)   -0.019780
dtype: float64

Now we only have to pass the single variable, and we get the transformed right-hand-side variables automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.970495
1    10.836300
2    10.601961
3    10.306852
4    10.002809
5     9.741430
6     9.561447
7     9.479244
8     9.484840
9     9.544340
dtype: float64
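
The interval machinery from get_prediction works here as well; passing the same dictionary applies the formula transformations before computing the intervals. A minimal sketch, assuming a statsmodels version where get_prediction accepts an exog keyword for formula-based results:

pred = res.get_prediction(exog=dict(x1=x1n))
print(pred.summary_frame(alpha=0.05))  # mean, mean CI, and prediction intervals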

Last update: May 05, 2023