class statsmodels.genmod.generalized_estimating_equations.GEE(endog, exog, groups, time=None, family=None, cov_struct=None, missing='none', offset=None, exposure=None, dep_data=None, constraint=None, update_dep=True, **kwargs)

Estimation of marginal regression models using Generalized Estimating Equations (GEE).

GEE can be used to fit Generalized Linear Models (GLMs) when the data have a grouped structure, and the observations are possibly correlated within groups but not between groups.


endog : array-like

1d array of endogenous values (i.e. responses, outcomes, dependent variables, or ‘Y’ values).

exog : array-like

2d array of exogenous values (i.e. covariates, predictors, independent variables, regressors, or ‘X’ values). A nobs x k array where nobs is the number of observations and k is the number of regressors. An intercept is not included by default and should be added by the user. See
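Because no intercept is added automatically, a constant column has to be included in the design matrix before fitting. A minimal sketch using plain NumPy follows; the array `x` and its values are made up for illustration (statsmodels also ships an `add_constant` helper for the same purpose).

```python
import numpy as np

# Hypothetical covariates: 4 observations, 2 regressors (illustrative values).
x = np.array([[0.5, 1.0],
              [1.5, 0.0],
              [2.5, 1.0],
              [3.5, 0.0]])

# Prepend a column of ones so that the first coefficient is the intercept.
exog = np.column_stack([np.ones(x.shape[0]), x])

print(exog.shape)  # (4, 3): nobs x k, with k now counting the intercept
```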

groups : array-like

A 1d array of length nobs containing the group labels.

time : array-like

A 2d array of time (or other index) values, used by some dependence structures to define similarity relationships among observations within a cluster.

family : family class instance

The default is Gaussian. To specify the binomial distribution use family = sm.families.Binomial(). Each family can take a link instance as an argument. See Families (listed under “See also”) for more information.

cov_struct : CovStruct class instance

The default is Independence. To specify an exchangeable structure use cov_struct = Exchangeable(). See statsmodels.genmod.cov_struct.CovStruct for more information.
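As an informal illustration of what these two working structures mean, the sketch below builds the within-cluster correlation matrices they imply for a cluster of size 3. The helper names and the value 0.3 are assumptions made for the example; this is not how statsmodels represents the structures internally.

```python
import numpy as np

def independence_corr(n):
    """Working correlation with no within-cluster dependence."""
    return np.eye(n)

def exchangeable_corr(n, alpha):
    """Working correlation where every pair in a cluster shares correlation alpha."""
    return alpha * np.ones((n, n)) + (1.0 - alpha) * np.eye(n)

R_ind = independence_corr(3)        # identity matrix
R_exch = exchangeable_corr(3, 0.3)  # ones on the diagonal, 0.3 elsewhere
print(R_exch)
```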

offset : array-like

An offset to be included in the fit. If provided, must be an array whose length is the number of rows in exog.

dep_data : array-like

Additional data passed to the dependence structure.

constraint : (ndarray, ndarray)

If provided, the constraint is a tuple (L, R) such that the model parameters are estimated under the constraint L * param = R, where L is a q x p matrix and R is a q-dimensional vector. If constraint is provided, a score test is performed to compare the constrained model to the unconstrained model.
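To make the shapes concrete, the sketch below builds a constraint tuple for a hypothetical model with p = 3 parameters (an intercept and two slopes) that forces the two slopes to be equal, so q = 1. Here "L * param" is read as a matrix-vector product. The parameter ordering and the candidate vectors are assumptions for illustration.

```python
import numpy as np

# One constraint (q = 1) on three parameters (p = 3): param[1] - param[2] = 0.
L = np.array([[0.0, 1.0, -1.0]])   # q x p matrix
R = np.array([0.0])                # q-dimensional vector
constraint = (L, R)                # the tuple form described above

# A candidate parameter vector that satisfies the constraint ...
params_ok = np.array([2.0, 0.7, 0.7])
# ... and one that violates it.
params_bad = np.array([2.0, 0.7, 0.1])

print(np.allclose(L @ params_ok, R))   # True
print(np.allclose(L @ params_bad, R))  # False
```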

update_dep : bool

If true, the dependence parameters are optimized, otherwise they are held fixed at their starting values.

missing : str

Available options are ‘none’, ‘drop’, and ‘raise’. If ‘none’, no nan checking is done. If ‘drop’, any observations with nans are dropped. If ‘raise’, an error is raised. Default is ‘none’.
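The three options can be pictured with the following sketch, which mimics the described behaviour on plain NumPy arrays. The function `handle_missing` is a hypothetical helper written for this illustration, not part of statsmodels, and it only inspects `endog` and `exog` for nans.

```python
import numpy as np

def handle_missing(endog, exog, groups, missing="none"):
    """Illustrative nan handling for the 'none', 'drop', and 'raise' options."""
    endog = np.asarray(endog, dtype=float)
    exog = np.asarray(exog, dtype=float)
    groups = np.asarray(groups)
    if missing == "none":
        # No checking at all: data are passed through unchanged.
        return endog, exog, groups
    # True for every observation (row) that contains at least one nan.
    bad = np.isnan(endog) | np.isnan(exog).any(axis=1)
    if missing == "raise":
        if bad.any():
            raise ValueError("NaNs found in the data")
        return endog, exog, groups
    if missing == "drop":
        keep = ~bad
        return endog[keep], exog[keep], groups[keep]
    raise ValueError("missing must be 'none', 'drop', or 'raise'")

y = [1.0, np.nan, 3.0, 4.0]
X = [[1.0, 0.2], [1.0, 0.4], [1.0, np.nan], [1.0, 0.8]]
g = [0, 0, 1, 1]
y_kept, X_kept, g_kept = handle_missing(y, X, g, missing="drop")
print(y_kept, g_kept)  # rows 1 and 2 are removed
```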

See also: Families, Link Functions


Only the following combinations make sense for family and link

             + ident log logit probit cloglog pow opow nbinom loglog logc
Gaussian     |   x    x                        x
inv Gaussian |   x    x                        x
binomial     |   x    x    x     x       x     x    x           x      x
Poisson      |   x    x                        x
neg binomial |   x    x                        x          x
gamma        |   x    x                        x

Not all of these link functions are currently available.

endog and exog are held as references: if the data they refer to are already arrays and those arrays are later changed in place, endog and exog will change as well.
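The consequence of holding references can be seen with NumPy alone. The sketch below assumes, as the note describes, that an existing array is stored without copying; `stored` stands in for the model's internal reference and is not a statsmodels attribute.

```python
import numpy as np

endog = np.array([1.0, 2.0, 3.0])

# np.asarray does not copy when the input is already an ndarray.
stored = np.asarray(endog)

endog[0] = 99.0      # in-place change made by the caller
print(stored[0])     # 99.0: the "stored" data changed too

# Passing a copy avoids this coupling.
safe = endog.copy()
endog[1] = -1.0
print(safe[1])       # 2.0
```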

The “robust” covariance type is the standard “sandwich estimator” (e.g. Liang and Zeger (1986)). It is the default here and in most other packages. The “naive” estimator gives smaller standard errors, but is only correct if the working correlation structure is correctly specified. The “bias reduced” estimator of Mancl and DeRouen (Biometrics, 2001) reduces the downward bias of the robust estimator.
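For intuition, here is a minimal NumPy sketch of the naive and robust (sandwich) covariance estimates in the simplest setting: a Gaussian model with identity link and an independence working structure, where the point estimate reduces to ordinary least squares. The simulated data, the function name `gee_covariances`, and the choice of scale estimate are assumptions for illustration. The bias-reduced variant is not shown, and the statsmodels implementation covers general families and correlation structures.

```python
import numpy as np

def gee_covariances(endog, exog, groups):
    """Naive and cluster-robust covariance for a Gaussian/identity/independence fit."""
    endog = np.asarray(endog, dtype=float)
    exog = np.asarray(exog, dtype=float)
    groups = np.asarray(groups)
    nobs, k = exog.shape

    # "Bread": inverse of X'X; the estimate is ordinary least squares here.
    bread = np.linalg.inv(exog.T @ exog)
    params = bread @ exog.T @ endog
    resid = endog - exog @ params

    # Naive covariance: trusts the working (independence) structure.
    scale = resid @ resid / (nobs - k)
    naive = scale * bread

    # "Meat": sum over clusters of outer products of per-cluster score vectors.
    meat = np.zeros((k, k))
    for g in np.unique(groups):
        idx = groups == g
        score = exog[idx].T @ resid[idx]
        meat += np.outer(score, score)
    robust = bread @ meat @ bread
    return params, naive, robust

# Simulated clustered data: 30 clusters of size 4 with a shared cluster effect.
rng = np.random.default_rng(0)
groups = np.repeat(np.arange(30), 4)
x = rng.normal(size=groups.size)
exog = np.column_stack([np.ones(groups.size), x])
cluster_effect = np.repeat(rng.normal(size=30), 4)
endog = 1.0 + 2.0 * x + cluster_effect + rng.normal(size=groups.size)

params, naive, robust = gee_covariances(endog, exog, groups)
print(np.sqrt(np.diag(naive)), np.sqrt(np.diag(robust)))
```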


Logistic regression with autoregressive working dependence:

>>> import statsmodels.api as sm
>>> family = sm.families.Binomial()
>>> va = sm.cov_struct.Autoregressive()
>>> model = sm.GEE(endog, exog, group, family=family, cov_struct=va)
>>> result = model.fit()
>>> print(result.summary())

Use formulas to fit a Poisson GLM with independent working dependence:

>>> import statsmodels.api as sm
>>> fam = sm.families.Poisson()
>>> ind = sm.cov_struct.Independence()
>>> model = sm.GEE.from_formula("y ~ age + trt + base", "subject",
...                              data, cov_struct=ind, family=fam)
>>> result = model.fit()
>>> print(result.summary())

Equivalent, using the formula API:

>>> import statsmodels.api as sm
>>> import statsmodels.formula.api as smf
>>> fam = sm.families.Poisson()
>>> ind = sm.cov_struct.Independence()
>>> model = smf.gee("y ~ age + trt + base", "subject",
...                 data, cov_struct=ind, family=fam)
>>> result = model.fit()
>>> print(result.summary())


cluster_list(array) Returns array split into subarrays corresponding to the cluster structure.
estimate_scale() Returns an estimate of the scale parameter phi at the current parameter value.
fit([maxiter, ctol, start_params, ...]) Fits a marginal regression model using generalized estimating equations (GEE).
from_formula(formula, groups, data[, ...])
predict(params[, exog, offset, exposure, linear]) Return predicted values for a marginal regression model fit using GEE.
update_cached_means(mean_params) cached_means should always contain the most recent calculation
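To illustrate the idea behind splitting an array according to the cluster structure, here is a small NumPy sketch. The function `split_by_group` is a hypothetical stand-in written for this example; it is not the implementation of `cluster_list` and takes the group labels as an explicit argument.

```python
import numpy as np

def split_by_group(array, groups):
    """Split `array` into one subarray per group label, in order of first appearance."""
    array = np.asarray(array)
    groups = np.asarray(groups)
    # np.unique sorts the labels; reorder them by where each label first occurs.
    labels, first_idx = np.unique(groups, return_index=True)
    ordered = labels[np.argsort(first_idx)]
    return [array[groups == g] for g in ordered]

values = np.array([10, 11, 20, 21, 22, 30])
labels = np.array(["a", "a", "b", "b", "b", "c"])
parts = split_by_group(values, labels)
print([p.tolist() for p in parts])  # [[10, 11], [20, 21, 22], [30]]
```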