statsmodels.regression.mixed_linear_model.MixedLM.from_formula

classmethod MixedLM.from_formula(formula, data, re_formula=None, vc_formula=None, subset=None, use_sparse=False, *args, **kwargs)
Create a Model from a formula and dataframe.
Parameters:
formula : str or generic Formula object
The formula specifying the model.
data : array-like
The data for the model. See Notes.
re_formula : str
A one-sided formula defining the variance structure of the model. The default gives a random intercept for each group.
vc_formula : dict-like
Formulas describing variance components. vc_formula[vc] is the formula for the component with variance parameter named vc. The formula is processed into a matrix, and the columns of this matrix are linearly combined with independent random coefficients having mean zero and a common variance.
subset : array-like
An array-like object of booleans, integers, or index values that indicate the subset of data to use in the model. Assumes data is a pandas.DataFrame.
args : extra arguments
These are passed to the model.
kwargs : extra keyword arguments
These are passed to the model with one exception. The eval_env keyword is passed to patsy. It can be either a patsy.EvalEnvironment object or an integer indicating the depth of the namespace to use. For example, the default eval_env=0 uses the calling namespace. If you wish to use a “clean” environment, set eval_env=-1.
Returns:
model : Model instance
Notes
data must define __getitem__ with the keys in the formula terms, e.g., a numpy structured or rec array, a dictionary, or a pandas DataFrame. args and kwargs are passed on to the model instantiation.
If the variance component is intended to produce random intercepts for disjoint subsets of a group, specified by string labels or a categorical data value, always use ‘0 +’ in the formula so that no overall intercept is included.
If the variance components specify random slopes and you do not also want a random group-level intercept in the model, then use ‘0 +’ in the formula to exclude the intercept.
The variance components formulas are processed separately for each group. If a variable is categorical, the results will not be affected by whether the group labels are distinct or reused over the top-level groups.
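The effect of ‘0 +’ on a categorical variance component formula can be seen by building the design matrix directly. The following is a minimal sketch that calls patsy on its own rather than through MixedLM; the single-school data frame and its classroom labels are hypothetical, and the matrices statsmodels builds internally for each group are assumed to follow the same patsy conventions.

```python
import pandas as pd
from patsy import dmatrix

# Hypothetical data for a single school with three classrooms.
school = pd.DataFrame({"classroom": ["a", "a", "b", "b", "c", "c"]})

# With '0 +': one indicator column per classroom and no intercept column.
no_intercept = dmatrix("0 + C(classroom)", school, return_type="dataframe")

# Without '0 +': patsy adds an Intercept column and treats one classroom
# as the reference level, so the columns no longer map one-to-one to classrooms.
with_intercept = dmatrix("C(classroom)", school, return_type="dataframe")

print(list(no_intercept.columns))
print(list(with_intercept.columns))
```

In the first matrix each column corresponds to exactly one classroom, which is what a set of independent classroom intercepts with a common variance requires.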
This method currently does not correctly handle missing values, so missing values should be explicitly dropped from the DataFrame before calling this method.
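One way to follow this advice is to drop incomplete rows with pandas before building the model. This is a sketch only: the data are simulated, and the column names (test_score, age, school) are assumptions chosen to match the examples in this page.

```python
import numpy as np
import pandas as pd
from statsmodels.regression.mixed_linear_model import MixedLM

# Hypothetical data with a few missing test scores.
rng = np.random.default_rng(0)
n = 60
data = pd.DataFrame({
    "school": np.repeat(["s1", "s2", "s3"], n // 3),
    "age": rng.uniform(8, 12, size=n),
})
data["test_score"] = 50 + 2 * data["age"] + rng.normal(size=n)
data.loc[[3, 17, 42], "test_score"] = np.nan

# Drop every row that is missing a variable used by the model,
# including the grouping variable.
used = ["test_score", "age", "school"]
clean = data.dropna(subset=used)

# Build the model from the cleaned frame only.
model = MixedLM.from_formula("test_score ~ age", groups="school", data=clean)
```

Restricting dropna to the columns the model uses avoids discarding rows that are only missing unrelated variables.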
Examples
Suppose we have an educational data set with students nested in classrooms nested in schools. The students take a test, and we want to relate the test scores to the students’ ages, while accounting for the effects of classrooms and schools. The school will be the top-level group, and the classroom is a nested group that is specified as a variance component. Note that the schools may have different numbers of classrooms, and the classroom labels may, but need not, differ across the schools.
>>> vc = {'classroom': '0 + C(classroom)'}
>>> MixedLM.from_formula('test_score ~ age', vc_formula=vc, re_formula='1', groups='school', data=data)
Now suppose we also have a previous test score called ‘pretest’. If we want the relationship between pretest scores and the current test to vary by classroom, we can specify a random slope for the pretest score:
>>> vc = {'classroom': '0 + C(classroom)', 'pretest': '0 + pretest'}
>>> MixedLM.from_formula('test_score ~ age + pretest', vc_formula=vc, re_formula='1', groups='school', data=data)
The following model is almost equivalent to the previous one, but here the classroom random intercept and pretest slope may be correlated.
>>> vc = {'classroom': '0 + C(classroom)'}
>>> MixedLM.from_formula('test_score ~ age + pretest', vc_formula=vc, re_formula='1 + pretest', groups='school', data=data)
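The examples above assume that a data frame named data already exists. The sketch below simulates one so that the first model (school as the top-level group, classroom as a variance component) can be built and fit end to end. The sample sizes, effect sizes, and random seed are arbitrary assumptions for illustration, and the classroom labels are deliberately reused across schools.

```python
import numpy as np
import pandas as pd
from statsmodels.regression.mixed_linear_model import MixedLM

rng = np.random.default_rng(12345)

# Hypothetical layout: 10 schools, 4 classrooms per school, 10 students each.
n_school, n_class, n_student = 10, 4, 10
school = np.repeat(np.arange(n_school), n_class * n_student)
# Classroom labels 0..3 are reused in every school.
classroom = np.tile(np.repeat(np.arange(n_class), n_student), n_school)

# Simulated school and classroom effects (assumed standard deviations).
school_eff = rng.normal(scale=2.0, size=n_school)[school]
class_eff = rng.normal(scale=1.0, size=(n_school, n_class))[school, classroom]

age = rng.uniform(8, 12, size=school.size)
noise = rng.normal(size=school.size)
test_score = 50 + 1.5 * age + school_eff + class_eff + noise

data = pd.DataFrame({
    "school": school,
    "classroom": classroom,
    "age": age,
    "test_score": test_score,
})

# Random intercept per school plus a classroom variance component.
vc = {"classroom": "0 + C(classroom)"}
model = MixedLM.from_formula("test_score ~ age", vc_formula=vc,
                             re_formula="1", groups="school", data=data)
result = model.fit()

print(result.fe_params)  # fixed effects: intercept and age slope
print(result.cov_re)     # variance of the school random intercept
print(result.vcomp)      # variance of the classroom component
```

Because from_formula returns an unfitted model, fit() must be called to obtain estimates; the fit may emit convergence warnings on small simulated data sets.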