Prediction (out of sample)¶
[1]:
%matplotlib inline
[2]:
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
plt.rc("figure", figsize=(16, 8))
plt.rc("font", size=14)
Artificial data¶
[3]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1 - 5) ** 2))
X = sm.add_constant(X)
beta = [5.0, 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)
Estimation¶
[4]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
OLS Regression Results
==============================================================================
Dep. Variable: y R-squared: 0.983
Model: OLS Adj. R-squared: 0.982
Method: Least Squares F-statistic: 912.0
Date: Fri, 05 Jun 2026 Prob (F-statistic): 5.75e-41
Time: 11:09:14 Log-Likelihood: 0.69101
No. Observations: 50 AIC: 6.618
Df Residuals: 46 BIC: 14.27
Df Model: 3
Covariance Type: nonrobust
==============================================================================
coef std err t P>|t| [0.025 0.975]
------------------------------------------------------------------------------
const 4.9193 0.085 58.007 0.000 4.749 5.090
x1 0.5190 0.013 39.682 0.000 0.493 0.545
x2 0.4271 0.051 8.307 0.000 0.324 0.531
x3 -0.0224 0.001 -19.521 0.000 -0.025 -0.020
==============================================================================
Omnibus: 0.257 Durbin-Watson: 1.789
Prob(Omnibus): 0.880 Jarque-Bera (JB): 0.445
Skew: -0.097 Prob(JB): 0.801
Kurtosis: 2.581 Cond. No. 221.
==============================================================================
Notes:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
In-sample prediction¶
[5]:
ypred = olsres.predict(X)
print(ypred)
[ 4.35883304 4.82796654 5.26177672 5.63698588 5.93871711 6.16293856
6.31712581 6.41903369 6.49377921 6.56971487 6.67377048 6.82702863
7.04126089 7.31699378 7.64342225 8.00018493 8.36070944 8.69657819
8.98219724 9.19900137 9.33850427 9.40369284 9.40853641 9.37569165
9.33278014 9.30785021 9.32476868 9.39929963 9.53651438 9.7299583
9.96271129 10.2101677 10.44407896 10.63719474 10.76774014 10.82299319
10.80137507 10.71270949 10.57660868 10.41925118 10.26908075 10.15213314
10.08775855 10.08544234 10.14324671 10.24812904 10.37808473 10.50576253
10.60295825 10.64524867]
Create a new sample of explanatory variables Xnew, predict and plot¶
[6]:
x1n = np.linspace(20.5, 25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n - 5) ** 2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew) # predict out of sample
print(ynewpred)
[10.5987134 10.43672702 10.1760396 9.8548224 9.52332222 9.23155927
9.01708041 8.89576644 8.85794386 8.87075335]
Plot comparison¶
[7]:
import matplotlib.pyplot as plt
fig, ax = plt.subplots()
ax.plot(x1, y, "o", label="Data")
ax.plot(x1, y_true, "b-", label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), "r", label="OLS prediction")
ax.legend(loc="best")
[7]:
<matplotlib.legend.Legend at 0x7fd423a098a0>
Predicting with Formulas¶
Using formulas can make both estimation and prediction a lot easier
[8]:
from statsmodels.formula.api import ols
data = {"x1": x1, "y": y}
res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()
We use the I to indicate use of the Identity transform. Ie., we do not want any expansion magic from using **2
[9]:
res.params
[9]:
Intercept 4.919262
x1 0.518996
np.sin(x1) 0.427121
I((x1 - 5) ** 2) -0.022417
dtype: float64
Now we only have to pass the single variable and we get the transformed right-hand side variables automatically
[10]:
res.predict(exog=dict(x1=x1n))
[10]:
0 10.598713
1 10.436727
2 10.176040
3 9.854822
4 9.523322
5 9.231559
6 9.017080
7 8.895766
8 8.857944
9 8.870753
dtype: float64