Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
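
Note that y depends on the particular random draw, so the exact numbers shown below will vary from run to run. If reproducibility is needed, a NumPy seed could be set before generating the noise (not done in this notebook), for example:

[ ]:
# Optional (not used above): seed NumPy's global RNG before drawing the noise
# so the simulated y, and all downstream results, are reproducible.
np.random.seed(12345)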

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     962.3
Date:                Fri, 19 Apr 2024   Prob (F-statistic):           1.71e-41
Time:                        16:35:43   Log-Likelihood:                 1.7606
No. Observations:                  50   AIC:                             4.479
Df Residuals:                      46   BIC:                             12.13
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9988      0.083     60.220      0.000       4.832       5.166
x1             0.5072      0.013     39.617      0.000       0.481       0.533
x2             0.5077      0.050     10.088      0.000       0.406       0.609
x3            -0.0207      0.001    -18.437      0.000      -0.023      -0.018
==============================================================================
Omnibus:                        1.782   Durbin-Watson:                   1.918
Prob(Omnibus):                  0.410   Jarque-Bera (JB):                1.190
Skew:                           0.017   Prob(JB):                        0.551
Kurtosis:                       2.245   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
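
Beyond the printed summary, the same quantities can be pulled from the results object programmatically, for example:

[ ]:
# Coefficients, standard errors, and R-squared from the fitted results object.
print(olsres.params)    # estimated coefficients (const, x1, x2, x3)
print(olsres.bse)       # standard errors of the coefficients
print(olsres.rsquared)  # R-squared of the fit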

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.48072132  4.97038253  5.42003057  5.80199702  6.09859888  6.30504376
  6.43021733  6.49522338  6.52991656  6.56799731  6.64147501  6.77540891
  6.98379097  7.26724702  7.61293392  7.99664971  8.38680984  8.74963648
  9.05470813  9.27995772  9.41529825  9.46427999  9.44350722  9.37991018
  9.30632049  9.25607706  9.25754854  9.32947246  9.47787672  9.6950892
  9.96099828 10.2463571  10.51758879 10.74230298 10.89461751 10.95941077
 10.93480614 10.83248029 10.67574478 10.49571602 10.32620291 10.19815215
 10.1345638  10.14671241 10.23229466 10.37580777 10.55109659 10.7256508
 10.86594626 10.94295293]
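
For in-sample points, predict(X) agrees with the fitted values stored on the results object, which provides a quick sanity check:

[ ]:
# In-sample predictions are simply the fitted values X @ params.
print(np.allclose(ypred, olsres.fittedvalues))  # expected: True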

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.92308446 10.76894881 10.50045546 10.16297567  9.81623394  9.51968542
  9.31795918  9.22993149  9.24410422  9.32142002]
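
predict returns point predictions only. If confidence and prediction intervals for the new observations are also of interest, get_prediction can be used instead; a minimal sketch:

[ ]:
# Point predictions plus 95% intervals for the mean (confidence) and for new
# observations (prediction), as returned by summary_frame.
pred = olsres.get_prediction(Xnew)
print(pred.summary_frame(alpha=0.05))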

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fe398203940>
[Figure: scatter of the data with the true curve and the OLS in-sample and out-of-sample predictions]

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate use of the identity transform, i.e., we do not want any expansion magic from using **2 in the formula.

[9]:
res.params
[9]:
Intercept           4.998829
x1                  0.507180
np.sin(x1)          0.507685
I((x1 - 5) ** 2)   -0.020724
dtype: float64
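
As a sanity check, the formula-based estimates match the array-based fit from above (the coefficients appear in the same order in both parameterizations):

[ ]:
# res.params is a pandas Series; olsres.params is an ndarray in the same order
# (Intercept/const, x1, np.sin(x1)/x2, I((x1 - 5) ** 2)/x3).
print(np.allclose(res.params.values, olsres.params))  # expected: True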

Now we only have to pass the single variable x1, and the transformed right-hand-side variables are constructed automatically.

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.923084
1    10.768949
2    10.500455
3    10.162976
4     9.816234
5     9.519685
6     9.317959
7     9.229931
8     9.244104
9     9.321420
dtype: float64
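
The formula interface also accepts a pandas DataFrame for the new data; the np.sin and I() transforms from the formula are reapplied automatically. A small sketch:

[ ]:
import pandas as pd

# Same out-of-sample prediction as above, passing the new x1 values as a DataFrame.
print(res.predict(pd.DataFrame({"x1": x1n})))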

Last update: Apr 19, 2024