Formulas: Fitting models using R-style formulas ================================================= .. _formulas_notebook: `Link to Notebook GitHub `_ .. raw:: html

Since version 0.5.0, statsmodels allows users to fit statistical models using R-style formulas. Internally, statsmodels uses the patsy package to convert formulas and data to the matrices that are used in model fitting. The formula framework is quite powerful; this tutorial only scratches the surface. A full description of the formula language can be found in the patsy docs:

Loading modules and functions

In [ ]:
from __future__ import print_function
   import numpy as np
   import statsmodels.api as sm
   

Import convention

You can import explicitly from statsmodels.formula.api

In [ ]:
from statsmodels.formula.api import ols
   

Alternatively, you can just use the formula namespace of the main statsmodels.api.

In [ ]:
sm.formula.ols
   

Or you can use the following conventioin

In [ ]:
import statsmodels.formula.api as smf
   

These names are just a convenient way to get access to each model's from_formula classmethod. See, for instance

In [ ]:
sm.OLS.from_formula
   

All of the lower case models accept formula and data arguments, whereas upper case ones take endog and exog design matrices. formula accepts a string which describes the model in terms of a patsy formula. data takes a pandas data frame or any other data structure that defines a __getitem__ for variable names like a structured array or a dictionary of variables.

dir(sm.formula) will print a list of available models.

Formula-compatible models have the following generic call signature: (formula, data, subset=None, *args, **kwargs)

OLS regression using formulas

To begin, we fit the linear model described on the Getting Started page. Download the data, subset columns, and list-wise delete to remove missing observations:

In [ ]:
dta = sm.datasets.get_rdataset("Guerry", "HistData", cache=True)
   
In [ ]:
df = dta.data[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
   df.head()
   

Fit the model:

In [ ]:
mod = ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
   res = mod.fit()
   print(res.summary())
   

Categorical variables

Looking at the summary printed above, notice that patsy determined that elements of Region were text strings, so it treated Region as a categorical variable. patsy's default is also to include an intercept, so we automatically dropped one of the Region categories.

If Region had been an integer variable that we wanted to treat explicitly as categorical, we could have done so by using the C() operator:

In [ ]:
res = ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
   print(res.params)
   
                            OLS Regression Results                            
   ==============================================================================
   Dep. Variable:                Lottery   R-squared:                       0.338
   Model:                            OLS   Adj. R-squared:                  0.287
   Method:                 Least Squares   F-statistic:                     6.636
   Date:                Mon, 20 Jul 2015   Prob (F-statistic):           1.07e-05
   Time:                        17:43:27   Log-Likelihood:                -375.30
   No. Observations:                  85   AIC:                             764.6
   Df Residuals:                      78   BIC:                             781.7
   Df Model:                           6                                         
   Covariance Type:            nonrobust                                         
   ===============================================================================
                     coef    std err          t      P>|t|      [95.0% Conf. Int.]
   -------------------------------------------------------------------------------
   Intercept      38.6517      9.456      4.087      0.000        19.826    57.478
   Region[T.E]   -15.4278      9.727     -1.586      0.117       -34.793     3.938
   Region[T.N]   -10.0170      9.260     -1.082      0.283       -28.453     8.419
   Region[T.S]    -4.5483      7.279     -0.625      0.534       -19.039     9.943
   Region[T.W]   -10.0913      7.196     -1.402      0.165       -24.418     4.235
   Literacy       -0.1858      0.210     -0.886      0.378        -0.603     0.232
   Wealth          0.4515      0.103      4.390      0.000         0.247     0.656
   ==============================================================================
   Omnibus:                        3.049   Durbin-Watson:                   1.785
   Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694
   Skew:                          -0.340   Prob(JB):                        0.260
   Kurtosis:                       2.454   Cond. No.                         371.
   ==============================================================================
   
   Warnings:
   [1] Standard Errors assume that the covariance matrix of the errors is correctly specified.
   

Patsy's mode advanced features for categorical variables are discussed in: Patsy: Contrast Coding Systems for categorical variables

Operators

We have already seen that "~" separates the left-hand side of the model from the right-hand side, and that "+" adds new columns to the design matrix.

Removing variables

The "-" sign can be used to remove columns/variables. For instance, we can remove the intercept from a model by:

In [ ]:
res = ols(formula='Lottery ~ Literacy + Wealth + C(Region) -1 ', data=df).fit()
   print(res.params)
   
Intercept         38.651655
   C(Region)[T.E]   -15.427785
   C(Region)[T.N]   -10.016961
   C(Region)[T.S]    -4.548257
   C(Region)[T.W]   -10.091276
   Literacy          -0.185819
   Wealth             0.451475
   dtype: float64
   

Multiplicative interactions

":" adds a new column to the design matrix with the interaction of the other two columns. "*" will also include the individual columns that were multiplied together:

In [ ]:
res1 = ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
   res2 = ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
   print(res1.params, '\n')
   print(res2.params)
   
C(Region)[C]    38.651655
   C(Region)[E]    23.223870
   C(Region)[N]    28.634694
   C(Region)[S]    34.103399
   C(Region)[W]    28.560379
   Literacy        -0.185819
   Wealth           0.451475
   dtype: float64
   

Many other things are possible with operators. Please consult the patsy docs to learn more.

Functions

You can apply vectorized functions to the variables in your model:

In [ ]:
res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
   print(res.params)
   
Literacy:Wealth    0.018176
   dtype: float64 
   
   Literacy           0.427386
   Wealth             1.080987
   Literacy:Wealth   -0.013609
   dtype: float64
   

Define a custom function:

In [ ]:
def log_plus_1(x):
       return np.log(x) + 1.
   res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()
   print(res.params)
   
Intercept           115.609119
   np.log(Literacy)    -20.393959
   dtype: float64
   

Any function that is in the calling namespace is available to the formula.

Using formulas with models that do not (yet) support them

Even if a given statsmodels function does not support formulas, you can still use patsy's formula language to produce design matrices. Those matrices can then be fed to the fitting function as endog and exog arguments.

To generate numpy arrays:

In [ ]:
import patsy
   f = 'Lottery ~ Literacy * Wealth'
   y,X = patsy.dmatrices(f, df, return_type='dataframe')
   print(y[:5])
   print(X[:5])
   
Intercept               136.003079
   log_plus_1(Literacy)    -20.393959
   dtype: float64
   

To generate pandas data frames:

In [ ]:
f = 'Lottery ~ Literacy * Wealth'
   y,X = patsy.dmatrices(f, df, return_type='dataframe')
   print(y[:5])
   print(X[:5])
   
   Lottery
   0       41
   1       38
   2       66
   3       80
   4       79
      Intercept  Literacy  Wealth  Literacy:Wealth
   0          1        37      73             2701
   1          1        51      22             1122
   2          1        13      61              793
   3          1        46      76             3496
   4          1        69      83             5727
   
In [ ]:
print(sm.OLS(y, X).fit().summary())
   
   Lottery
   0       41
   1       38
   2       66
   3       80
   4       79
      Intercept  Literacy  Wealth  Literacy:Wealth
   0          1        37      73             2701
   1          1        51      22             1122
   2          1        13      61              793
   3          1        46      76             3496
   4          1        69      83             5727