# `endog`, `exog`, what’s that?

statsmodels uses `endog` and `exog` as names for the data, the observed variables that are used in an estimation problem. Other names that are often used in different statistical packages or textbooks are, for example:

| endog | exog |
|---|---|
| y | x |
| y variable | x variable |
| left hand side (LHS) | right hand side (RHS) |
| dependent variable | independent variable |
| regressand | regressors |
| outcome | design |
| response variable | explanatory variable |

The usage is quite often domain and model specific; however, we have chosen to use endog and exog almost exclusively. A mnemonic hint to keep the two terms apart is that exogenous has an “x”, as in x-variable, in its name.

x and y are one-letter names that are sometimes used for temporary variables and are not informative in themselves. To avoid one-letter names we decided to use descriptive names and settled on `endog` and `exog`. Since this has been criticized, this might change in the future.

## Background

Some informal definitions of the terms are

endogenous: caused by factors within the system

exogenous: caused by factors outside the system

*Endogenous variables designates variables in an economic/econometric model
that are explained, or predicted, by that model.*
http://stats.oecd.org/glossary/detail.asp?ID=794

*Exogenous variables designates variables that appear in an
economic/econometric model, but are not explained by that model (i.e. they are
taken as given by the model).* http://stats.oecd.org/glossary/detail.asp?ID=890

In econometrics and statistics the terms are defined more formally, and different definitions of exogeneity (weak, strong, strict) are used depending on the model. The usage in statsmodels as variable names cannot always be interpreted in a formal sense, but tries to follow the same principle.

In the simplest form, a model relates an observed variable, y, to another set of variables, x, in some linear or nonlinear form:

```
y = f(x, beta) + noise
y = x * beta + noise
```

However, to have a statistical model we need additional assumptions on the properties of the explanatory variables, x, and the noise. One standard assumption for many basic models is that x is not correlated with the noise. In a more general definition, x being exogenous means that we do not have to consider how the explanatory variables in x were generated, whether by design or by random draws from some underlying distribution, when we want to estimate the effect or impact that x has on y, or test a hypothesis about this effect.

In other words, y is *endogenous* to our model, x is *exogenous* to our model
for the estimation.
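The role of the "x is not correlated with the noise" assumption can be illustrated with a small simulation. This is a sketch under assumed values (true slope 2.0, standard normal draws, a hand-written `ols_slope` helper that is not part of statsmodels); it uses only the Python standard library and compares a regressor that is independent of the noise with one that is built from it.

```python
import random


def ols_slope(x, y):
    """Slope of a simple least squares fit of y on x (with intercept)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov_xy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    var_x = sum((xi - mean_x) ** 2 for xi in x)
    return cov_xy / var_x


rng = random.Random(0)
beta = 2.0   # assumed true effect of x on y
n = 20000

noise = [rng.gauss(0, 1) for _ in range(n)]
z = [rng.gauss(0, 1) for _ in range(n)]

# Case 1: x is drawn independently of the noise, so it is exogenous.
x_exog = z
y_exog = [beta * xi + ei for xi, ei in zip(x_exog, noise)]

# Case 2: x contains the noise itself, so it is correlated with the noise.
x_endog = [zi + ei for zi, ei in zip(z, noise)]
y_endog = [beta * xi + ei for xi, ei in zip(x_endog, noise)]

slope_exog = ols_slope(x_exog, y_exog)
slope_endog = ols_slope(x_endog, y_endog)

print("x uncorrelated with noise:", slope_exog)
print("x correlated with noise:  ", slope_endog)
```

In the second case the population ratio cov(x, noise) / var(x) is 1/2 by construction, so the estimated slope settles well above the true value no matter how large the sample is. In the first case it stays close to 2.0.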

As an example, suppose you run an experiment and for the second session some subjects are no longer available. Is the drop-out relevant for the conclusions you draw from the experiment? In other words, can we treat the drop-out decision as exogenous for our problem?

It is up to the user to know (or to consult a textbook to find out) what the underlying statistical assumptions for the models are. As an example, `exog` in `OLS` can have lagged dependent variables if the error or noise term is independently distributed over time (or uncorrelated over time). However, if the error terms are autocorrelated in the presence of lagged dependent variables, then OLS does not have good statistical properties (it is inconsistent) and the correct model will be ARMAX. `statsmodels` has functions for regression diagnostics to test whether some of the assumptions are justified or not.