{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Variance Component Analysis\n", "\n", "This notebook illustrates variance components analysis for two-level\n", "nested and crossed designs." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:44:39.993334Z", "iopub.status.busy": "2023-12-14T14:44:39.993025Z", "iopub.status.idle": "2023-12-14T14:44:41.068540Z", "shell.execute_reply": "2023-12-14T14:44:41.067782Z" } }, "outputs": [], "source": [ "import numpy as np\n", "import statsmodels.api as sm\n", "from statsmodels.regression.mixed_linear_model import VCSpec\n", "import pandas as pd" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Make the notebook reproducible" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:44:41.072976Z", "iopub.status.busy": "2023-12-14T14:44:41.072537Z", "iopub.status.idle": "2023-12-14T14:44:41.078190Z", "shell.execute_reply": "2023-12-14T14:44:41.077360Z" }, "lines_to_next_cell": 1 }, "outputs": [], "source": [ "np.random.seed(3123)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Nested analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In our discussion below, \"Group 2\" is nested within \"Group 1\". As a\n", "concrete example, \"Group 1\" might be school districts, with \"Group\n", "2\" being individual schools. The function below generates data from\n", "such a population. In a nested analysis, the group 2 labels that\n", "are nested within different group 1 labels are treated as\n", "independent groups, even if they have the same label. For example,\n", "two schools labeled \"school 1\" that are in two different school\n", "districts are treated as independent schools, even though they have\n", "the same label." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:44:41.082075Z", "iopub.status.busy": "2023-12-14T14:44:41.081738Z", "iopub.status.idle": "2023-12-14T14:44:41.092353Z", "shell.execute_reply": "2023-12-14T14:44:41.089749Z" }, "lines_to_end_of_cell_marker": 0, "lines_to_next_cell": 1 }, "outputs": [], "source": [ "def generate_nested(\n", " n_group1=200, n_group2=20, n_rep=10, group1_sd=2, group2_sd=3, unexplained_sd=4\n", "):\n", "\n", " # Group 1 indicators\n", " group1 = np.kron(np.arange(n_group1), np.ones(n_group2 * n_rep))\n", "\n", " # Group 1 effects\n", " u = group1_sd * np.random.normal(size=n_group1)\n", " effects1 = np.kron(u, np.ones(n_group2 * n_rep))\n", "\n", " # Group 2 indicators\n", " group2 = np.kron(np.ones(n_group1), np.kron(np.arange(n_group2), np.ones(n_rep)))\n", "\n", " # Group 2 effects\n", " u = group2_sd * np.random.normal(size=n_group1 * n_group2)\n", " effects2 = np.kron(u, np.ones(n_rep))\n", "\n", " e = unexplained_sd * np.random.normal(size=n_group1 * n_group2 * n_rep)\n", " y = effects1 + effects2 + e\n", "\n", " df = pd.DataFrame({\"y\": y, \"group1\": group1, \"group2\": group2})\n", "\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate a data set to analyze." ] }, { "cell_type": "code", "execution_count": 4, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:44:41.096192Z", "iopub.status.busy": "2023-12-14T14:44:41.095868Z", "iopub.status.idle": "2023-12-14T14:44:41.111736Z", "shell.execute_reply": "2023-12-14T14:44:41.110881Z" } }, "outputs": [], "source": [ "df = generate_nested()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Using all the default arguments for `generate_nested`, the population\n", "values of \"group 1 Var\" and \"group 2 Var\" are 2^2=4 and 3^2=9,\n", "respectively. The unexplained variance, listed as \"scale\" at the\n", "top of the summary table, has population value 4^2=16." ] }, { "cell_type": "code", "execution_count": 5, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:44:41.115308Z", "iopub.status.busy": "2023-12-14T14:44:41.115028Z", "iopub.status.idle": "2023-12-14T14:44:52.690494Z", "shell.execute_reply": "2023-12-14T14:44:52.689778Z" }, "lines_to_end_of_cell_marker": 0, "lines_to_next_cell": 1 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Mixed Linear Model Regression Results\n", "==========================================================\n", "Model: MixedLM Dependent Variable: y \n", "No. Observations: 40000 Method: REML \n", "No. Groups: 200 Scale: 15.8825 \n", "Min. group size: 200 Log-Likelihood: -116022.3805\n", "Max. group size: 200 Converged: Yes \n", "Mean group size: 200.0 \n", "-----------------------------------------------------------\n", " Coef. Std.Err. z P>|z| [0.025 0.975]\n", "-----------------------------------------------------------\n", "Intercept -0.035 0.149 -0.232 0.817 -0.326 0.257\n", "group1 Var 3.917 0.112 \n", "group2 Var 8.742 0.063 \n", "==========================================================\n", "\n" ] } ], "source": [ "model1 = sm.MixedLM.from_formula(\n", " \"y ~ 1\",\n", " re_formula=\"1\",\n", " vc_formula={\"group2\": \"0 + C(group2)\"},\n", " groups=\"group1\",\n", " data=df,\n", ")\n", "result1 = model1.fit()\n", "print(result1.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wish to avoid the formula interface, we can fit the same model\n", "by building the design matrices manually." ] }, { "cell_type": "code", "execution_count": 6, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:44:52.697455Z", "iopub.status.busy": "2023-12-14T14:44:52.695377Z", "iopub.status.idle": "2023-12-14T14:44:52.707740Z", "shell.execute_reply": "2023-12-14T14:44:52.706929Z" }, "lines_to_end_of_cell_marker": 0, "lines_to_next_cell": 1 }, "outputs": [], "source": [ "def f(x):\n", " n = x.shape[0]\n", " g2 = x.group2\n", " u = g2.unique()\n", " u.sort()\n", " uv = {v: k for k, v in enumerate(u)}\n", " mat = np.zeros((n, len(u)))\n", " for i in range(n):\n", " mat[i, uv[g2.iloc[i]]] = 1\n", " colnames = [\"%d\" % z for z in u]\n", " return mat, colnames" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Then we set up the variance components using the VCSpec class." ] }, { "cell_type": "code", "execution_count": 7, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:44:52.714632Z", "iopub.status.busy": "2023-12-14T14:44:52.712646Z", "iopub.status.idle": "2023-12-14T14:44:53.038380Z", "shell.execute_reply": "2023-12-14T14:44:53.037549Z" } }, "outputs": [], "source": [ "vcm = df.groupby(\"group1\").apply(f).to_list()\n", "mats = [x[0] for x in vcm]\n", "colnames = [x[1] for x in vcm]\n", "names = [\"group2\"]\n", "vcs = VCSpec(names, [colnames], [mats])" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Finally we fit the model. It can be seen that the results of the\n", "two fits are identical." ] }, { "cell_type": "code", "execution_count": 8, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:44:53.044194Z", "iopub.status.busy": "2023-12-14T14:44:53.042816Z", "iopub.status.idle": "2023-12-14T14:45:03.191295Z", "shell.execute_reply": "2023-12-14T14:45:03.190627Z" }, "lines_to_end_of_cell_marker": 0, "lines_to_next_cell": 1 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Mixed Linear Model Regression Results\n", "==========================================================\n", "Model: MixedLM Dependent Variable: y \n", "No. Observations: 40000 Method: REML \n", "No. Groups: 200 Scale: 15.8825 \n", "Min. group size: 200 Log-Likelihood: -116022.3805\n", "Max. group size: 200 Converged: Yes \n", "Mean group size: 200.0 \n", "-----------------------------------------------------------\n", " Coef. Std.Err. z P>|z| [0.025 0.975]\n", "-----------------------------------------------------------\n", "const -0.035 0.149 -0.232 0.817 -0.326 0.257\n", "x_re1 Var 3.917 0.112 \n", "group2 Var 8.742 0.063 \n", "==========================================================\n", "\n" ] } ], "source": [ "oo = np.ones(df.shape[0])\n", "model2 = sm.MixedLM(df.y, oo, exog_re=oo, groups=df.group1, exog_vc=vcs)\n", "result2 = model2.fit()\n", "print(result2.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "## Crossed analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "In a crossed analysis, the levels of one group can occur in any\n", "combination with the levels of the another group. The groups in\n", "Statsmodels MixedLM are always nested, but it is possible to fit a\n", "crossed model by having only one group, and specifying all random\n", "effects as variance components. Many, but not all crossed models\n", "can be fit in this way. The function below generates a crossed data\n", "set with two levels of random structure." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:45:03.195220Z", "iopub.status.busy": "2023-12-14T14:45:03.194739Z", "iopub.status.idle": "2023-12-14T14:45:03.205915Z", "shell.execute_reply": "2023-12-14T14:45:03.205284Z" }, "lines_to_end_of_cell_marker": 0, "lines_to_next_cell": 1 }, "outputs": [], "source": [ "def generate_crossed(\n", " n_group1=100, n_group2=100, n_rep=4, group1_sd=2, group2_sd=3, unexplained_sd=4\n", "):\n", "\n", " # Group 1 indicators\n", " group1 = np.kron(\n", " np.arange(n_group1, dtype=int), np.ones(n_group2 * n_rep, dtype=int)\n", " )\n", " group1 = group1[np.random.permutation(len(group1))]\n", "\n", " # Group 1 effects\n", " u = group1_sd * np.random.normal(size=n_group1)\n", " effects1 = u[group1]\n", "\n", " # Group 2 indicators\n", " group2 = np.kron(\n", " np.arange(n_group2, dtype=int), np.ones(n_group2 * n_rep, dtype=int)\n", " )\n", " group2 = group2[np.random.permutation(len(group2))]\n", "\n", " # Group 2 effects\n", " u = group2_sd * np.random.normal(size=n_group2)\n", " effects2 = u[group2]\n", "\n", " e = unexplained_sd * np.random.normal(size=n_group1 * n_group2 * n_rep)\n", " y = effects1 + effects2 + e\n", "\n", " df = pd.DataFrame({\"y\": y, \"group1\": group1, \"group2\": group2})\n", "\n", " return df" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Generate a data set to analyze." ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:45:03.209527Z", "iopub.status.busy": "2023-12-14T14:45:03.209059Z", "iopub.status.idle": "2023-12-14T14:45:03.223804Z", "shell.execute_reply": "2023-12-14T14:45:03.223080Z" } }, "outputs": [], "source": [ "df = generate_crossed()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we fit the model, note that the `groups` vector is constant.\n", "Using the default parameters for `generate_crossed`, the level 1\n", "variance should be 2^2=4, the level 2 variance should be 3^2=9, and\n", "the unexplained variance should be 4^2=16." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:45:03.227575Z", "iopub.status.busy": "2023-12-14T14:45:03.227262Z", "iopub.status.idle": "2023-12-14T14:45:52.676305Z", "shell.execute_reply": "2023-12-14T14:45:52.675134Z" }, "lines_to_end_of_cell_marker": 0, "lines_to_next_cell": 1 }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Mixed Linear Model Regression Results\n", "==========================================================\n", "Model: MixedLM Dependent Variable: y \n", "No. Observations: 40000 Method: REML \n", "No. Groups: 1 Scale: 15.9824 \n", "Min. group size: 40000 Log-Likelihood: -112684.9688\n", "Max. group size: 40000 Converged: Yes \n", "Mean group size: 40000.0 \n", "-----------------------------------------------------------\n", " Coef. Std.Err. z P>|z| [0.025 0.975]\n", "-----------------------------------------------------------\n", "Intercept -0.251 0.353 -0.710 0.478 -0.943 0.442\n", "g1 Var 4.282 0.154 \n", "g2 Var 8.150 0.291 \n", "==========================================================\n", "\n" ] } ], "source": [ "vc = {\"g1\": \"0 + C(group1)\", \"g2\": \"0 + C(group2)\"}\n", "oo = np.ones(df.shape[0])\n", "model3 = sm.MixedLM.from_formula(\"y ~ 1\", groups=oo, vc_formula=vc, data=df)\n", "result3 = model3.fit()\n", "print(result3.summary())" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "If we wish to avoid the formula interface, we can fit the same model\n", "by building the design matrices manually." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:45:52.681293Z", "iopub.status.busy": "2023-12-14T14:45:52.679667Z", "iopub.status.idle": "2023-12-14T14:45:53.456081Z", "shell.execute_reply": "2023-12-14T14:45:53.454848Z" } }, "outputs": [], "source": [ "def f(g):\n", " n = len(g)\n", " u = g.unique()\n", " u.sort()\n", " uv = {v: k for k, v in enumerate(u)}\n", " mat = np.zeros((n, len(u)))\n", " for i in range(n):\n", " mat[i, uv[g[i]]] = 1\n", " colnames = [\"%d\" % z for z in u]\n", " return [mat], [colnames]\n", "\n", "\n", "vcm = [f(df.group1), f(df.group2)]\n", "mats = [x[0] for x in vcm]\n", "colnames = [x[1] for x in vcm]\n", "names = [\"group1\", \"group2\"]\n", "vcs = VCSpec(names, colnames, mats)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we fit the model without using formulas, it is simple to check\n", "that the results for models 3 and 4 are identical." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "execution": { "iopub.execute_input": "2023-12-14T14:45:53.460415Z", "iopub.status.busy": "2023-12-14T14:45:53.459663Z", "iopub.status.idle": "2023-12-14T14:46:39.676981Z", "shell.execute_reply": "2023-12-14T14:46:39.676001Z" } }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ " Mixed Linear Model Regression Results\n", "==========================================================\n", "Model: MixedLM Dependent Variable: y \n", "No. Observations: 40000 Method: REML \n", "No. Groups: 1 Scale: 15.9824 \n", "Min. group size: 40000 Log-Likelihood: -112684.9688\n", "Max. group size: 40000 Converged: Yes \n", "Mean group size: 40000.0 \n", "-----------------------------------------------------------\n", " Coef. Std.Err. z P>|z| [0.025 0.975]\n", "-----------------------------------------------------------\n", "const -0.251 0.353 -0.710 0.478 -0.943 0.442\n", "group1 Var 4.282 0.154 \n", "group2 Var 8.150 0.291 \n", "==========================================================\n", "\n" ] } ], "source": [ "oo = np.ones(df.shape[0])\n", "model4 = sm.MixedLM(df.y, oo[:, None], exog_re=None, groups=oo, exog_vc=vcs)\n", "result4 = model4.fit()\n", "print(result4.summary())" ] } ], "metadata": { "jupytext": { "cell_metadata_filter": "-all", "main_language": "python", "notebook_metadata_filter": "-all" }, "kernelspec": { "display_name": "Python 3", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.13" } }, "nbformat": 4, "nbformat_minor": 4 }