statsmodels.regression.mixed_linear_model.MixedLM.from_formula

classmethod MixedLM.from_formula(formula, data, re_formula=None, vc_formula=None, subset=None, use_sparse=False, missing='none', *args, **kwargs)

Create a Model from a formula and dataframe.

Parameters:

formula (str or generic Formula object): The formula specifying the model.
data (array_like): The data for the model. See Notes.
re_formula (str): A one-sided formula defining the variance structure of the model. The default gives a random intercept for each group.
vc_formula (dict-like): Formulas describing variance components. vc_formula[vc] is the formula for the component with variance parameter named vc. The formula is processed into a matrix, and the columns of this matrix are linearly combined with independent random coefficients having mean zero and a common variance.
subset (array_like): An array-like object of booleans, integers, or index values that indicates the subset of data to use in the model. Assumes data is a pandas.DataFrame.
use_sparse (bool): If True, use sparse matrices for variance component design matrices.
missing (str): Either ‘none’ or ‘drop’.
args (extra arguments): These are passed to the model.
kwargs (extra keyword arguments): These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.

Returns:

model (Model instance): The model instance built from the formula and data.

Notes

data must define __getitem__ with the keys in the formula terms, e.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation.

If the variance component is intended to produce random intercepts for disjoint subsets of a group, specified by string labels or a categorical data value, always use ‘0 +’ in the formula so that no overall intercept is included.

If the variance components specify random slopes and you do not also want a random group-level intercept in the model, then use ‘0 +’ in the formula to exclude the intercept.

The variance components formulas are processed separately for each group. If a variable is categorical the results will not be affected by whether the group labels are distinct or re-used over the top-level groups.
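The effect of ‘0 +’ can be seen by looking at the design matrix a formula produces. The sketch below calls patsy directly on a small made-up data frame (the classroom column and its labels are illustrative assumptions). It is not a reproduction of how MixedLM processes vc_formula internally, only a way to compare the columns generated with and without ‘0 +’.

```python
import pandas as pd
import patsy

# Hypothetical data: three classroom labels, two students each.
df = pd.DataFrame({"classroom": ["a", "b", "c", "a", "b", "c"]})

# Without '0 +': an overall intercept plus contrasts against the first level.
with_intercept = patsy.dmatrix("C(classroom)", df)

# With '0 +': one indicator column per classroom and no overall intercept.
no_intercept = patsy.dmatrix("0 + C(classroom)", df)

print(with_intercept.design_info.column_names)
# ['Intercept', 'C(classroom)[T.b]', 'C(classroom)[T.c]']
print(no_intercept.design_info.column_names)
# ['C(classroom)[a]', 'C(classroom)[b]', 'C(classroom)[c]']
```

With ‘0 +’, each classroom gets its own indicator column, so each classroom can receive its own random intercept.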

Examples

Suppose we have data from an educational study with students nested in classrooms nested in schools. The students take a test, and we want to relate the test scores to the students’ ages, while accounting for the effects of classrooms and schools. The school will be the top-level group, and the classroom is a nested group that is specified as a variance component. Note that the schools may have different numbers of classrooms, and the classroom labels may (but need not) differ across the schools.

>>> from statsmodels.regression.mixed_linear_model import MixedLM
>>> vc = {'classroom': '0 + C(classroom)'}
>>> MixedLM.from_formula('test_score ~ age', vc_formula=vc,
...     re_formula='1', groups='school', data=data)
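The snippet above assumes that a data frame named data already exists. As a minimal end-to-end sketch, the following simulates such a data set and fits the model. The sample sizes, effect sizes, and random seed are arbitrary assumptions chosen for illustration, and the fit may emit convergence warnings on simulated data.

```python
import numpy as np
import pandas as pd
from statsmodels.regression.mixed_linear_model import MixedLM

rng = np.random.default_rng(0)

# Hypothetical layout: 10 schools, 3 classrooms per school, 8 students each.
n_school, n_class, n_student = 10, 3, 8
school = np.repeat(np.arange(n_school), n_class * n_student)
class_idx = np.tile(np.repeat(np.arange(n_class), n_student), n_school)
n = school.size

# Simulated random effects: one per school, one per classroom within school.
school_eff = rng.normal(scale=2.0, size=n_school)[school]
class_eff = rng.normal(scale=1.0, size=(n_school, n_class))[school, class_idx]

age = rng.uniform(10, 12, size=n)
test_score = 50 + 3 * age + school_eff + class_eff + rng.normal(scale=1.5, size=n)

data = pd.DataFrame({
    "school": school,
    # Classroom labels are deliberately re-used across schools.
    "classroom": ["c%d" % c for c in class_idx],
    "age": age,
    "test_score": test_score,
})

# Random intercept per school, plus a classroom variance component.
vc = {"classroom": "0 + C(classroom)"}
model = MixedLM.from_formula("test_score ~ age", data=data, re_formula="1",
                             vc_formula=vc, groups="school")
result = model.fit()
print(result.summary())
```

Here groups is given as a column name, as in the example above, and the classroom labels repeat from school to school, which is fine because the variance component formulas are processed separately within each school.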

Now suppose we also have a previous test score called ‘pretest’. If we want the relationship between pretest scores and the current test score to vary by school, we can specify a random slope for the pretest score as an additional variance component:

>>> vc = {'classroom': '0 + C(classroom)', 'pretest': '0 + pretest'}
>>> MixedLM.from_formula('test_score ~ age + pretest', vc_formula=vc,
...     re_formula='1', groups='school', data=data)

The following model is almost equivalent to the previous one, but here the school-level random intercept and the pretest slope, both specified in re_formula, may be correlated.

>>> vc = {'classroom': '0 + C(classroom)'}
>>> MixedLM.from_formula('test_score ~ age + pretest', vc_formula=vc,
...     re_formula='1 + pretest', groups='school', data=data)
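One way to see the structural difference between the last two models is to inspect the dimensions of the constructed model objects without fitting them. The sketch below uses simulated data (sizes, coefficients, and seed are arbitrary assumptions) and reads the k_re and k_vc attributes, which count the columns of the random effects design matrix and the number of variance components, respectively.

```python
import numpy as np
import pandas as pd
from statsmodels.regression.mixed_linear_model import MixedLM

rng = np.random.default_rng(1)

# Hypothetical layout: 6 schools, 2 classrooms per school, 5 students each.
n_school, n_class, n_student = 6, 2, 5
n = n_school * n_class * n_student
data = pd.DataFrame({
    "school": np.repeat(np.arange(n_school), n_class * n_student),
    "classroom": np.tile(np.repeat(np.arange(n_class), n_student), n_school),
    "age": rng.uniform(10, 12, size=n),
    "pretest": rng.normal(50, 10, size=n),
})
data["test_score"] = (20 + 2 * data["age"] + 0.5 * data["pretest"]
                      + rng.normal(size=n))

# pretest slope as a variance component: independent of the random intercept.
vc_a = {"classroom": "0 + C(classroom)", "pretest": "0 + pretest"}
model_a = MixedLM.from_formula("test_score ~ age + pretest", vc_formula=vc_a,
                               re_formula="1", groups="school", data=data)

# pretest slope in re_formula: shares a covariance matrix with the intercept.
vc_b = {"classroom": "0 + C(classroom)"}
model_b = MixedLM.from_formula("test_score ~ age + pretest", vc_formula=vc_b,
                               re_formula="1 + pretest", groups="school",
                               data=data)

for label, m in [("independent slope:", model_a), ("correlated slope:", model_b)]:
    print(label, "k_re =", m.k_re, "k_vc =", m.k_vc)
# independent slope: k_re = 1 k_vc = 2
# correlated slope: k_re = 2 k_vc = 1
```

Terms listed in re_formula are estimated with a full covariance matrix, so moving pretest from vc_formula to re_formula adds a covariance parameter between the intercept and the slope.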