.. _formula_examples:

Fitting models using R-style formulas
=====================================

Since version 0.5.0, ``statsmodels`` allows users to fit statistical
models using R-style formulas. Internally, ``statsmodels`` uses the
`patsy <https://patsy.readthedocs.io/en/latest/>`_ package to convert
formulas and data to the matrices that are used in model fitting. The
formula framework is quite powerful; this tutorial only scratches the
surface. A full description of the formula language can be found in the
``patsy`` docs:

- `Patsy formula language description <https://patsy.readthedocs.io/en/latest/formulas.html>`_

Loading modules and functions
-----------------------------

.. code:: python

    import statsmodels.api as sm
    import statsmodels.formula.api as smf
    import numpy as np
    import pandas

Notice that we called ``statsmodels.formula.api`` in addition to the
usual ``statsmodels.api``; ``sm`` is needed below to load the dataset
and, at the end, to fit a model from explicit design matrices. The
``formula.api`` hosts many of the same functions found in ``api``
(e.g. OLS, GLM), but it also holds lower case counterparts for most of
these models. In general, lower case models accept ``formula`` and
``df`` arguments, whereas upper case ones take ``endog`` and ``exog``
design matrices. ``formula`` accepts a string which describes the model
in terms of a ``patsy`` formula. ``df`` takes a
`pandas <https://pandas.pydata.org/>`_ data frame.

``dir(smf)`` will print a list of available models.

Formula-compatible models have the following generic call signature:
``(formula, data, subset=None, *args, **kwargs)``

OLS regression using formulas
-----------------------------

To begin, we fit the linear model described on the
`Getting Started <gettingstarted.html>`_ page. Download the data,
subset columns, and list-wise delete to remove missing observations:

.. code:: python

    df = sm.datasets.get_rdataset("Guerry", "HistData").data
    df = df[['Lottery', 'Literacy', 'Wealth', 'Region']].dropna()
    df.head()

.. parsed-literal::

       Lottery  Literacy  Wealth Region
    0       41        37      73      E
    1       38        51      22      N
    2       66        13      61      C
    3       80        46      76      E
    4       79        69      83      E

Fit the model:

.. code:: python

    mod = smf.ols(formula='Lottery ~ Literacy + Wealth + Region', data=df)
    res = mod.fit()
    print(res.summary())

.. parsed-literal::

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                Lottery   R-squared:                       0.338
    Model:                            OLS   Adj. R-squared:                  0.287
    Method:                 Least Squares   F-statistic:                     6.636
    Date:                Sun, 13 Jan 2013   Prob (F-statistic):           1.07e-05
    Time:                        10:38:36   Log-Likelihood:                -375.30
    No. Observations:                  85   AIC:                             764.6
    Df Residuals:                      78   BIC:                             781.7
    Df Model:                           6
    ===============================================================================
                      coef    std err          t      P>|t|      [95.0% Conf. Int.]
    -------------------------------------------------------------------------------
    Intercept      38.6517      9.456      4.087      0.000        19.826    57.478
    Region[T.E]   -15.4278      9.727     -1.586      0.117       -34.793     3.938
    Region[T.N]   -10.0170      9.260     -1.082      0.283       -28.453     8.419
    Region[T.S]    -4.5483      7.279     -0.625      0.534       -19.039     9.943
    Region[T.W]   -10.0913      7.196     -1.402      0.165       -24.418     4.235
    Literacy       -0.1858      0.210     -0.886      0.378        -0.603     0.232
    Wealth          0.4515      0.103      4.390      0.000         0.247     0.656
    ==============================================================================
    Omnibus:                        3.049   Durbin-Watson:                   1.785
    Prob(Omnibus):                  0.218   Jarque-Bera (JB):                2.694
    Skew:                          -0.340   Prob(JB):                        0.260
    Kurtosis:                       2.454   Cond. No.                         371.
    ==============================================================================

Categorical variables
---------------------

Looking at the summary printed above, notice that ``patsy`` determined
that elements of *Region* were text strings, so it treated *Region* as a
categorical variable. ``patsy``'s default is also to include an
intercept, so we automatically dropped one of the *Region* categories.
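To see exactly which columns this coding produces, you can ask
``patsy`` for the design matrix directly. The following is a minimal
sketch, not part of the original example; it uses ``patsy.dmatrix``
(``patsy`` is installed as a ``statsmodels`` dependency):

.. code:: python

    import patsy

    # Build only the right-hand-side design matrix for Region. With the
    # default intercept present, patsy treatment-codes the categorical:
    # the first level ('C') becomes the reference and is dropped.
    X = patsy.dmatrix('Region', df, return_type='dataframe')
    print(X.columns.tolist())
    # Expected, given the summary above:
    # ['Intercept', 'Region[T.E]', 'Region[T.N]', 'Region[T.S]', 'Region[T.W]']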
If *Region* had been an integer variable that we wanted to treat
explicitly as categorical, we could have done so by using the ``C()``
operator:

.. code:: python

    res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region)', data=df).fit()
    print(res.params)

.. parsed-literal::

    Intercept         38.651655
    C(Region)[T.E]   -15.427785
    C(Region)[T.N]   -10.016961
    C(Region)[T.S]    -4.548257
    C(Region)[T.W]   -10.091276
    Literacy          -0.185819
    Wealth             0.451475

Examples of more advanced features of ``patsy``'s categorical variables
can be found here:
`Patsy: Contrast Coding Systems for categorical variables <https://patsy.readthedocs.io/en/latest/contrasts.html>`_

Operators
---------

We have already seen that "~" separates the left-hand side of the model
from the right-hand side, and that "+" adds new columns to the design
matrix.

Removing variables
~~~~~~~~~~~~~~~~~~

The "-" sign can be used to remove columns/variables. For instance, we
can remove the intercept from a model by:

.. code:: python

    res = smf.ols(formula='Lottery ~ Literacy + Wealth + C(Region) - 1', data=df).fit()
    print(res.params)

.. parsed-literal::

    C(Region)[C]    38.651655
    C(Region)[E]    23.223870
    C(Region)[N]    28.634694
    C(Region)[S]    34.103399
    C(Region)[W]    28.560379
    Literacy        -0.185819
    Wealth           0.451475

Multiplicative interactions
~~~~~~~~~~~~~~~~~~~~~~~~~~~

":" adds a new column to the design matrix with the product of the two
columns it joins. "\*" will also include the individual columns that
were multiplied together:

.. code:: python

    res1 = smf.ols(formula='Lottery ~ Literacy : Wealth - 1', data=df).fit()
    res2 = smf.ols(formula='Lottery ~ Literacy * Wealth - 1', data=df).fit()
    print(res1.params, '\n')
    print(res2.params)

.. parsed-literal::

    Literacy:Wealth    0.018176

    Literacy           0.427386
    Wealth             1.080987
    Literacy:Wealth   -0.013609

Many other things are possible with operators. Please consult the
`patsy docs <https://patsy.readthedocs.io/en/latest/formulas.html>`_ to
learn more.

Functions
---------

You can apply vectorized functions to the variables in your model:

.. code:: python

    res = smf.ols(formula='Lottery ~ np.log(Literacy)', data=df).fit()
    print(res.params)

.. parsed-literal::

    Intercept           115.609119
    np.log(Literacy)    -20.393959

Define a custom function:

.. code:: python

    def log_plus_1(x):
        return np.log(x) + 1.

    res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()
    print(res.params)

.. parsed-literal::

    Intercept               136.003079
    log_plus_1(Literacy)    -20.393959

.. _patsy-namespaces:

Namespaces
----------

Notice that all of the above examples use the calling namespace to look
up the functions to apply. The namespace used can be controlled via the
``eval_env`` keyword. For example, you may want to give a custom
namespace using the :class:`patsy:patsy.EvalEnvironment`, or you may
want to use a "clean" namespace, which we provide by passing
``eval_env=-1``. The default is to use the caller's namespace. This can
have (un)expected consequences if, for example, someone has a variable
named ``C`` in the user namespace or in their data structure passed to
``patsy``, and ``C`` is used in the formula to handle a categorical
variable. See the
`Patsy API Reference <https://patsy.readthedocs.io/en/latest/API-reference.html>`_
for more information.
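As a concrete sketch of the two behaviors (this example is not from the
original tutorial; it reuses the ``log_plus_1`` function and ``df``
defined above and assumes ``eval_env`` is passed as a keyword to the
model constructor):

.. code:: python

    # Default behavior: the caller's namespace is searched, so the
    # custom function defined above is visible to the formula.
    res = smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df).fit()

    # With eval_env=-1 the formula is evaluated in a "clean" namespace
    # that holds only patsy's builtins and the columns of df, so the
    # same formula now raises an error: log_plus_1 is not visible.
    try:
        smf.ols(formula='Lottery ~ log_plus_1(Literacy)', data=df, eval_env=-1)
    except Exception as err:  # patsy wraps the underlying NameError
        print(err)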
Using formulas with models that do not (yet) support them
----------------------------------------------------------

Even if a given ``statsmodels`` function does not support formulas, you
can still use ``patsy``'s formula language to produce design matrices.
Those matrices can then be fed to the fitting function as ``endog`` and
``exog`` arguments.

To generate ``numpy`` arrays:

.. code:: python

    import patsy
    f = 'Lottery ~ Literacy * Wealth'
    # The default return type, 'matrix', yields patsy DesignMatrix
    # objects, which are numpy ndarray subclasses.
    y, X = patsy.dmatrices(f, df, return_type='matrix')

To generate pandas data frames:

.. code:: python

    f = 'Lottery ~ Literacy * Wealth'
    y, X = patsy.dmatrices(f, df, return_type='dataframe')
    print(y[:5])
    print(X[:5])

.. parsed-literal::

       Lottery
    0       41
    1       38
    2       66
    3       80
    4       79
       Intercept  Literacy  Wealth  Literacy:Wealth
    0          1        37      73             2701
    1          1        51      22             1122
    2          1        13      61              793
    3          1        46      76             3496
    4          1        69      83             5727

.. code:: python

    print(sm.OLS(y, X).fit().summary())

.. parsed-literal::

                                OLS Regression Results
    ==============================================================================
    Dep. Variable:                Lottery   R-squared:                       0.309
    Model:                            OLS   Adj. R-squared:                  0.283
    Method:                 Least Squares   F-statistic:                     12.06
    Date:                Sun, 13 Jan 2013   Prob (F-statistic):           1.32e-06
    Time:                        10:38:36   Log-Likelihood:                -377.13
    No. Observations:                  85   AIC:                             762.3
    Df Residuals:                      81   BIC:                             772.0
    Df Model:                           3
    ===================================================================================
                          coef    std err          t      P>|t|      [95.0% Conf. Int.]
    -----------------------------------------------------------------------------------
    Intercept          38.6348     15.825      2.441      0.017         7.149    70.121
    Literacy           -0.3522      0.334     -1.056      0.294        -1.016     0.312
    Wealth              0.4364      0.283      1.544      0.126        -0.126     0.999
    Literacy:Wealth    -0.0005      0.006     -0.085      0.933        -0.013     0.012
    ==============================================================================
    Omnibus:                        4.447   Durbin-Watson:                   1.953
    Prob(Omnibus):                  0.108   Jarque-Bera (JB):                3.228
    Skew:                          -0.332   Prob(JB):                        0.199
    Kurtosis:                       2.314   Cond. No.                     1.40e+04
    ==============================================================================

The condition number is large, 1.4e+04. This might indicate that there
is strong multicollinearity or other numerical problems.
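One common remedy, sketched here rather than taken from the original
page, is to center the regressors before forming the interaction;
``patsy`` provides builtin stateful transforms such as ``center()`` and
``standardize()`` that can be used directly inside a formula:

.. code:: python

    # center() subtracts the sample mean from each regressor before the
    # product is formed. This decorrelates the main effects from the
    # interaction term and should yield a much smaller condition number
    # in the resulting summary.
    res_c = smf.ols(formula='Lottery ~ center(Literacy) * center(Wealth)',
                    data=df).fit()
    print(res_c.summary())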