Prediction (out of sample)

[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt

import statsmodels.api as sm

plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)

Artificial data

[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.983
Model:                            OLS   Adj. R-squared:                  0.982
Method:                 Least Squares   F-statistic:                     912.0
Date:                Fri, 05 Jun 2026   Prob (F-statistic):           5.75e-41
Time:                        11:09:14   Log-Likelihood:                0.69101
No. Observations:                  50   AIC:                             6.618
Df Residuals:                      46   BIC:                             14.27
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          4.9193      0.085     58.007      0.000       4.749       5.090
x1             0.5190      0.013     39.682      0.000       0.493       0.545
x2             0.4271      0.051      8.307      0.000       0.324       0.531
x3            -0.0224      0.001    -19.521      0.000      -0.025      -0.020
==============================================================================
Omnibus:                        0.257   Durbin-Watson:                   1.789
Prob(Omnibus):                  0.880   Jarque-Bera (JB):                0.445
Skew:                          -0.097   Prob(JB):                        0.801
Kurtosis:                       2.581   Cond. No.                         221.
==============================================================================

Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.35883304  4.82796654  5.26177672  5.63698588  5.93871711  6.16293856
  6.31712581  6.41903369  6.49377921  6.56971487  6.67377048  6.82702863
  7.04126089  7.31699378  7.64342225  8.00018493  8.36070944  8.69657819
  8.98219724  9.19900137  9.33850427  9.40369284  9.40853641  9.37569165
  9.33278014  9.30785021  9.32476868  9.39929963  9.53651438  9.7299583
  9.96271129 10.2101677  10.44407896 10.63719474 10.76774014 10.82299319
 10.80137507 10.71270949 10.57660868 10.41925118 10.26908075 10.15213314
 10.08775855 10.08544234 10.14324671 10.24812904 10.37808473 10.50576253
 10.60295825 10.64524867]

Create a new sample of explanatory variables Xnew, predict and plot

[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[10.5987134  10.43672702 10.1760396   9.8548224   9.52332222  9.23155927
  9.01708041  8.89576644  8.85794386  8.87075335]

Plot comparison

[7]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fd423a098a0>
../../../_images/examples_notebooks_generated_predict_12_1.png

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier

[8]:
from statsmodels.formula.api import ols

data = {"x1": x1, "y": y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2

[9]:
res.params
[9]:
Intercept           4.919262
x1                  0.518996
np.sin(x1)          0.427121
I((x1 - 5) ** 2)   -0.022417
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand side variables automatically

[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0    10.598713
1    10.436727
2    10.176040
3     9.854822
4     9.523322
5     9.231559
6     9.017080
7     8.895766
8     8.857944
9     8.870753
dtype: float64