.. _glm_formula_notebook:

Generalized Linear Models (Formula)
===================================

This notebook illustrates how you can use R-style formulas to fit Generalized Linear Models.

To begin, we load the Star98 dataset, construct a formula, and pre-process the data:

In [ ]:
   from __future__ import print_function
   import statsmodels.api as sm
   import statsmodels.formula.api as smf
   # Load the Star98 data as a pandas DataFrame
   star98 = sm.datasets.star98.load_pandas().data
   formula = 'SUCCESS ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
              PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'
   # Take a copy of the selected columns so that the assignments below
   # do not write to a slice of star98 (avoids SettingWithCopyWarning)
   dta = star98[['NABOVE', 'NBELOW', 'LOWINC', 'PERASIAN', 'PERBLACK', 'PERHISP',
                 'PCTCHRT', 'PCTYRRND', 'PERMINTE', 'AVYRSEXP', 'AVSALK',
                 'PERSPENK', 'PTRATIO', 'PCTAF']].copy()
   # Response: the proportion NABOVE / (NABOVE + NBELOW)
   endog = dta['NABOVE'] / (dta['NABOVE'] + dta.pop('NBELOW'))
   del dta['NABOVE']
   dta['SUCCESS'] = endog
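
The formula uses the ``*`` operator, which in R-style formulas is shorthand for all main effects plus all of their interactions (``a*b`` means ``a + b + a:b``). The sketch below spells out that expansion for one of the three-way terms. It is only an illustration: ``expand_star`` is a hypothetical helper written for this example, not part of statsmodels or patsy, and it does not parse formulas.

```python
from itertools import combinations


def expand_star(*names):
    """Expand an R-style ``a*b*c`` term into main effects and interactions.

    Hypothetical helper for illustration only; the formula machinery
    performs this expansion internally.
    """
    terms = []
    # order 1 gives main effects, order 2 two-way interactions, and so on
    for order in range(1, len(names) + 1):
        for combo in combinations(names, order):
            terms.append(":".join(combo))
    return terms


# Terms implied by PERMINTE*AVYRSEXP*AVSALK
for term in expand_star("PERMINTE", "AVYRSEXP", "AVSALK"):
    print(term)
```

Each three-way ``*`` term therefore contributes seven columns to the design matrix: three main effects, three two-way interactions, and one three-way interaction.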
   

Then, we fit the GLM with a Binomial family:

In [ ]:
   mod1 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
   mod1.summary()
   

Finally, we define a function that applies a custom data transformation inside the formula:

In [ ]:
   def double_it(x):
       return 2 * x
   formula = 'SUCCESS ~ double_it(LOWINC) + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
              PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF'
   mod2 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
   mod2.summary()
   

As expected, the coefficient for double_it(LOWINC) in the second model is half the size of the LOWINC coefficient from the first model:

In [ ]:
   # Position 0 is the intercept; position 1 is the LOWINC term in both models
   print(mod1.params.iloc[1])
   print(mod2.params.iloc[1] * 2)
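
The halving follows from the linear predictor: a term ``b * x`` contributes exactly the same as ``(b / 2) * (2 * x)``, so the best-fitting coefficient for the doubled variable is half the original. The sketch below checks this on a small made-up data set with a hand-written Newton-Raphson logistic fit. It is a stand-in for illustration, not the fitting code used by statsmodels; ``fit_logistic`` and the toy data are assumptions of this example.

```python
import math


def fit_logistic(xs, ys, n_iter=25):
    """Fit y ~ 1 + x by Newton-Raphson and return (intercept, slope).

    Minimal illustrative implementation for one predictor.
    """
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        # Accumulate the score vector g and the information matrix H
        g0 = g1 = h00 = h01 = h11 = 0.0
        for x, y in zip(xs, ys):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * x)))
            w = p * (1.0 - p)
            g0 += y - p
            g1 += (y - p) * x
            h00 += w
            h01 += w * x
            h11 += w * x * x
        # Newton step: solve the 2x2 system H * delta = g in closed form
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Toy data, invented for this example (not from Star98)
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [0, 0, 1, 0, 1, 1, 0, 1]

_, slope = fit_logistic(xs, ys)
_, slope_doubled = fit_logistic([2 * x for x in xs], ys)

# The two printed values should agree up to floating-point error
print(slope, 2 * slope_doubled)
```

The intercept and the fitted probabilities are unchanged by the rescaling; only the coefficient of the transformed variable changes.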