Prediction (out of sample)

In [1]:
from __future__ import print_function
import numpy as np
import statsmodels.api as sm

Artificial data

In [2]:
nsample = 50
sig = 0.25
x1 = np.linspace(0, 20, nsample)
X = np.column_stack((x1, np.sin(x1), (x1-5)**2))
X = sm.add_constant(X)
beta = [5., 0.5, 0.5, -0.02]
y_true = np.dot(X, beta)
y = y_true + sig * np.random.normal(size=nsample)

Estimation

In [3]:
olsmod = sm.OLS(y, X)
olsres = olsmod.fit()
print(olsres.summary())
                            OLS Regression Results
==============================================================================
Dep. Variable:                      y   R-squared:                       0.984
Model:                            OLS   Adj. R-squared:                  0.983
Method:                 Least Squares   F-statistic:                     938.7
Date:                Wed, 27 Apr 2016   Prob (F-statistic):           2.99e-41
Time:                        23:30:31   Log-Likelihood:               0.038368
No. Observations:                  50   AIC:                             7.923
Df Residuals:                      46   BIC:                             15.57
Df Model:                           3
Covariance Type:            nonrobust
==============================================================================
                 coef    std err          t      P>|t|      [95.0% Conf. Int.]
------------------------------------------------------------------------------
const          4.9134      0.086     57.186      0.000         4.740     5.086
x1             0.5095      0.013     38.447      0.000         0.483     0.536
x2             0.5669      0.052     10.884      0.000         0.462     0.672
x3            -0.0201      0.001    -17.285      0.000        -0.022    -0.018
==============================================================================
Omnibus:                        3.454   Durbin-Watson:                   2.483
Prob(Omnibus):                  0.178   Jarque-Bera (JB):                2.998
Skew:                           0.599   Prob(JB):                        0.223
Kurtosis:                       2.948   Cond. No.                         221.
==============================================================================

Warnings:
[1] Standard Errors assume that the covariance matrix of the errors is correctly specified.

In-sample prediction

In [4]:
ypred = olsres.predict(X)
print(ypred)
[  4.4105969    4.92230799   5.39034571   5.78381187   6.08295931
   6.28243624   6.39216554   6.43571448   6.44642265   6.4619244
   6.51796559   6.64253064   6.85124481   7.14480699   7.50887485
   7.9164213    8.33217491   8.71841494   9.04116874   9.27579335
   9.4110243    9.45082628   9.41374181   9.32984471   9.23579909
   9.16883551   9.16063412   9.23211973   9.39002385   9.6257788
   9.91692531  10.23080251  10.52991406  10.77808866  10.94642288
  11.01802996  10.99081421  10.87781535  10.7050661   10.50731511
  10.32231784  10.18463342  10.11994674  10.14084841  10.24476626
  10.41438787  10.62050495  10.82681222  10.99587226  11.09526632]

Create a new sample of explanatory variables Xnew, predict and plot

In [5]:
x1n = np.linspace(20.5,25, 10)
Xnew = np.column_stack((x1n, np.sin(x1n), (x1n-5)**2))
Xnew = sm.add_constant(Xnew)
ynewpred = olsres.predict(Xnew)  # predict out of sample
print(ynewpred)
[ 11.09086874  10.93804732  10.65903562  10.3045011    9.94113997
   9.63534761   9.4369627    9.36706496   9.41281406   9.53059344]

Plot comparison

In [6]:
import matplotlib.pyplot as plt

fig, ax = plt.subplots()
ax.plot(x1, y, 'o', label="Data")
ax.plot(x1, y_true, 'b-', label="True")
ax.plot(np.hstack((x1, x1n)), np.hstack((ypred, ynewpred)), 'r', label="OLS prediction")
ax.legend(loc="best");

Predicting with Formulas

Using formulas can make both estimation and prediction a lot easier.

In [7]:
from statsmodels.formula.api import ols

data = {"x1" : x1, "y" : y}

res = ols("y ~ x1 + np.sin(x1) + I((x1-5)**2)", data=data).fit()

We use I() to indicate the identity transform, i.e., we don't want any formula-expansion magic from using **2.

In [8]:
res.params
Out[8]:
Intercept           4.913360
x1                  0.509459
np.sin(x1)          0.566948
I((x1 - 5) ** 2)   -0.020111
dtype: float64

Now we only have to pass the single variable and we get the transformed right-hand-side variables automatically.

In [9]:
res.predict(exog=dict(x1=x1n))
Out[9]:
array([ 11.09086874,  10.93804732,  10.65903562,  10.3045011 ,
         9.94113997,   9.63534761,   9.4369627 ,   9.36706496,
         9.41281406,   9.53059344])