# Generalized Linear Models (Formula)

This notebook illustrates how you can use R-style formulas to fit Generalized Linear Models.

To begin, we load the ``Star98`` dataset, construct a formula, and pre-process the data:

In [1]:
import statsmodels.api as sm
import statsmodels.formula.api as smf

star98 = sm.datasets.star98.load_pandas().data
formula = "SUCCESS ~ LOWINC + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
           PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF"
dta = star98[
    [
        "NABOVE",
        "NBELOW",
        "LOWINC",
        "PERASIAN",
        "PERBLACK",
        "PERHISP",
        "PCTCHRT",
        "PCTYRRND",
        "PERMINTE",
        "AVYRSEXP",
        "AVSALK",
        "PERSPENK",
        "PTRATIO",
        "PCTAF",
    ]
].copy()
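# Response: share of students counted in NABOVE, i.e. NABOVE / (NABOVE + NBELOW).
# pop() removes NBELOW from dta as it is read; NABOVE is dropped right after.
endog = dta["NABOVE"] / (dta["NABOVE"] + dta.pop("NBELOW"))
del dta["NABOVE"]
dta["SUCCESS"] = endog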
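
In R-style formulas, ``a*b*c`` is shorthand for the main effects plus every interaction among them (``a + b + c + a:b + a:c + b:c + a:b:c``). The sketch below reproduces that expansion with plain Python so the number of model terms can be counted by hand. The helper ``expand_star`` is a hypothetical name used only for illustration and is not part of statsmodels or its formula backend. It also assumes every variable is numeric, so each term contributes exactly one column.

```python
from itertools import combinations


def expand_star(factors):
    """Expand a*b*c into main effects and all interactions (illustrative helper)."""
    terms = []
    for order in range(1, len(factors) + 1):
        for combo in combinations(factors, order):
            terms.append(":".join(combo))
    return terms


# Terms written out explicitly in the formula.
main_effects = ["LOWINC", "PERASIAN", "PERBLACK", "PERHISP", "PCTCHRT", "PCTYRRND"]

# The two three-way products in the formula.
group1 = expand_star(["PERMINTE", "AVYRSEXP", "AVSALK"])
group2 = expand_star(["PERSPENK", "PTRATIO", "PCTAF"])

all_terms = main_effects + group1 + group2
print(group1)
print(len(all_terms))  # number of non-intercept terms
```

Under these assumptions, the count is 6 + 7 + 7 = 20 non-intercept terms, which is consistent with the ``Df Model`` value reported in the model summary.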

Then, we fit the GLM:

In [2]:
mod1 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
print(mod1.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                SUCCESS   No. Observations:                  303
Model:                            GLM   Df Residuals:                      282
Model Family:                Binomial   Df Model:                           20
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -127.33
Date:                Fri, 19 Apr 2024   Deviance:                       8.5477
Time:                        16:33:37   Pearson chi2:                     8.48
No. Iterations:                     4   Pseudo R-squ. (CS):             0.1115
Covariance Type:            nonrobust                                         
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept               

Finally, we define a function that applies a custom data transformation within the formula framework:

In [3]:
def double_it(x):
    return 2 * x


formula = "SUCCESS ~ double_it(LOWINC) + PERASIAN + PERBLACK + PERHISP + PCTCHRT + \
           PCTYRRND + PERMINTE*AVYRSEXP*AVSALK + PERSPENK*PTRATIO*PCTAF"
mod2 = smf.glm(formula=formula, data=dta, family=sm.families.Binomial()).fit()
print(mod2.summary())

                 Generalized Linear Model Regression Results                  
Dep. Variable:                SUCCESS   No. Observations:                  303
Model:                            GLM   Df Residuals:                      282
Model Family:                Binomial   Df Model:                           20
Link Function:                  Logit   Scale:                          1.0000
Method:                          IRLS   Log-Likelihood:                -127.33
Date:                Fri, 19 Apr 2024   Deviance:                       8.5477
Time:                        16:33:38   Pearson chi2:                     8.48
No. Iterations:                     4   Pseudo R-squ. (CS):             0.1115
Covariance Type:            nonrobust                                         
                               coef    std err          z      P>|z|      [0.025      0.975]
--------------------------------------------------------------------------------------------
Intercept               

As expected, the coefficient for ``double_it(LOWINC)`` in the second model is half the size of the ``LOWINC`` coefficient from the first model:

In [4]:
# Use .iloc for positional access; params is a Series indexed by term name.
print(mod1.params.iloc[1])
print(mod2.params.iloc[1] * 2)

-0.02039598715475645
-0.020395987154756174
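
The halving follows from the linear predictor: if ``x`` is replaced by ``2 * x``, the same fitted values are obtained with a slope of half the size, and the intercept is unchanged. The sketch below checks this on a small made-up dataset using a hand-written Newton solver for a one-predictor logit model. The function ``fit_logit`` and the data are illustrative assumptions, not the statsmodels implementation or the ``Star98`` data.

```python
import math


def fit_logit(x, y, n_iter=25):
    """Fit y ~ b0 + b1 * x with a logit link by Newton's method (illustrative)."""
    b0, b1 = 0.0, 0.0
    for _ in range(n_iter):
        g0 = g1 = h00 = h01 = h11 = 0.0
        for xi, yi in zip(x, y):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = p * (1.0 - p)
            # Score (gradient of the log-likelihood).
            g0 += yi - p
            g1 += (yi - p) * xi
            # Entries of the 2x2 information matrix X'WX.
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        # Solve the 2x2 system explicitly and take the Newton step.
        det = h00 * h11 - h01 * h01
        b0 += (h11 * g0 - h01 * g1) / det
        b1 += (h00 * g1 - h01 * g0) / det
    return b0, b1


# Made-up proportions in (0, 1), so the maximum likelihood estimate is finite.
x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
y = [0.9, 0.8, 0.7, 0.5, 0.4, 0.2]

b0, b1 = fit_logit(x, y)
b0_doubled, b1_doubled = fit_logit([2 * xi for xi in x], y)

print(b1, 2 * b1_doubled)  # slopes agree after rescaling
print(b0, b0_doubled)      # intercepts agree
```

As in the comparison of ``mod1`` and ``mod2``, the two slopes agree only up to floating-point rounding once the second is multiplied by 2.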


