statsmodels.regression.mixed_linear_model.MixedLM.from_formula

classmethod MixedLM.from_formula(formula, data, re_formula=None, vc_formula=None, subset=None, use_sparse=False, *args, **kwargs)[source]

Create a Model from a formula and dataframe.

Parameters:

formula : str or generic Formula object

The formula specifying the model

data : array-like

The data for the model. See Notes.

re_formula : string

A one-sided formula defining the variance structure of the model. The default gives a random intercept for each group.

vc_formula : dict-like

Formulas describing variance components. vc_formula[vc] is the formula for the component with variance parameter named vc. The formula is processed into a matrix, and the columns of this matrix are linearly combined with independent random coefficients having mean zero and a common variance.

subset : array-like

An array-like object of booleans, integers, or index values that indicate the subset of df to use in the model. Assumes df is a pandas.DataFrame

args : extra arguments

These are passed to the model

kwargs : extra keyword arguments

These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment set eval_env=-1.

Returns:

model : Model instance

Notes

data must define __getitem__ with the keys in the formula terms args and kwargs are passed on to the model instantiation. E.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame.

If the variance component is intended to produce random intercepts for disjoint subsets of a group, specified by string labels or a categorical data value, always use ‘0 +’ in the formula so that no overall intercept is included.

If the variance components specify random slopes and you do not also want a random group-level intercept in the model, then use ‘0 +’ in the formula to exclude the intercept.

The variance components formulas are processed separately for each group. If a variable is categorical the results will not be affected by whether the group labels are distinct or re-used over the top-level groups.

This method currently does not correctly handle missing values, so missing values should be explicitly dropped from the DataFrame before calling this method.

Examples

Suppose we have an educational data set with students nested in classrooms nested in schools. The students take a test, and we want to relate the test scores to the students’ ages, while accounting for the effects of classrooms and schools. The school will be the top-level group, and the classroom is a nested group that is specified as a variance component. Note that the schools may have different number of classrooms, and the classroom labels may (but need not be) different across the schools.

>>> vc = {'classroom': '0 + C(classroom)'}
>>> MixedLM.from_formula('test_score ~ age', vc_formula=vc,                                   re_formula='1', groups='school', data=data)

Now suppose we also have a previous test score called ‘pretest’. If we want the relationship between pretest scores and the current test to vary by classroom, we can specify a random slope for the pretest score

>>> vc = {'classroom': '0 + C(classroom)', 'pretest': '0 + pretest'}
>>> MixedLM.from_formula('test_score ~ age + pretest', vc_formula=vc,                                   re_formula='1', groups='school', data=data)

The following model is almost equivalent to the previous one, but here the classroom random intercept and pretest slope may be correlated.

>>> vc = {'classroom': '0 + C(classroom)'}
>>> MixedLM.from_formula('test_score ~ age + pretest', vc_formula=vc,                                   re_formula='1 + pretest', groups='school',                                   data=data)