Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.974
Model:                            OLS   Adj. R-squared:                  0.973
Method:                 Least Squares   F-statistic:                     584.3
Date:                Sun, 14 Jun 2026   Prob (F-statistic):           1.30e-36
Time:                        22:50:29   Log-Likelihood:                -9.6142
No. Observations:                  50   AIC:                             27.23
Df Residuals:                      46   BIC:                             34.88
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          5.1365      0.104     49.288      0.000       4.927       5.346
x1             0.4800      0.016     29.866      0.000       0.448       0.512
x2             0.4726      0.063      7.480      0.000       0.345       0.600
x3            -0.0183      0.001    -12.987      0.000      -0.021      -0.015
==============================================================================
Omnibus:                        1.362   Durbin-Watson:                   2.139
Prob(Omnibus):                  0.506   Jarque-Bera (JB):                0.602
Skew:                          -0.023   Prob(JB):                        0.740
Kurtosis:                       3.536   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.67834355  5.13360472  5.55193985  5.90759305  6.18410359  6.37701037
  6.49458484  6.55647205  6.59046299  6.62792867  6.69866587  6.82600168
  7.02296089  7.29012608  7.61554187  7.97667919  8.3441367   8.68647141
  8.97536461  9.1902746   9.32181158  9.37328045  9.36013774  9.3074522
  9.24578605  9.20617369  9.21502288  9.28977596  9.43604415  9.64668566
  9.90297902 10.17769903 10.43958971 10.65849962 10.81033565 10.88102143
 10.86881005 10.78457107 10.65000469 10.49407645 10.3482583  10.24135783
 10.19478525 10.21903579 10.31196535 10.45914273 10.63622071 10.81293617
 10.95808224 11.04463553]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[11.04479349 10.92044734 10.69013038 10.39607768 10.09388542  9.83889903
  9.67266271  9.61274795  9.6484514   9.74341546]

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7f58d86400a0>
../../../_images/examples_notebooks_generated_predict_12_1.png

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2

[9]:
res.params
[9]:
Intercept           5.136530
x1                  0.480016
np.sin(x1)          0.472593
I((x1 - 5) ** 2)   -0.018327
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    11.044793
1    10.920447
2    10.690130
3    10.396078
4    10.093885
5     9.838899
6     9.672663
7     9.612748
8     9.648451
9     9.743415
dtype: float64