{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# statsmodels Principal Component Analysis" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "*Key ideas:* Principal component analysis, World Bank data, fertility\n", "\n", "In this notebook, we use principal component analysis (PCA) to analyze the time series of fertility rates in 192 countries, using data obtained from the World Bank. The main goal is to understand how the trends in fertility over time differ from country to country. This is a slightly atypical illustration of PCA because the data are time series. Methods such as functional PCA have been developed for this setting, but since the fertility data are very smooth, there is no real disadvantage to using standard PCA in this case." ] }, { "cell_type": "code", "execution_count": 1, "metadata": { "collapsed": false }, "outputs": [], "source": [ "%matplotlib inline\n", "\n", "import pandas as pd\n", "import numpy as np\n", "import matplotlib.pyplot as plt\n", "import statsmodels.api as sm\n", "from statsmodels.multivariate.pca import PCA" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The data can be obtained from the [World Bank website](http://data.worldbank.org/indicator/SP.DYN.TFRT.IN), but here we work with the slightly cleaned-up version that ships with statsmodels:" ] }, { "cell_type": "code", "execution_count": 2, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>Country Name</th><th>Country Code</th><th>Indicator Name</th><th>Indicator Code</th><th>1960</th><th>1961</th><th>1962</th><th>1963</th><th>1964</th><th>1965</th><th>...</th><th>2004</th><th>2005</th><th>2006</th><th>2007</th><th>2008</th><th>2009</th><th>2010</th><th>2011</th><th>2012</th><th>2013</th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>0</th><td>Aruba</td><td>ABW</td><td>Fertility rate, total (births per woman)</td><td>SP.DYN.TFRT.IN</td><td>4.820</td><td>4.655</td><td>4.471</td><td>4.271</td><td>4.059</td><td>3.842</td><td>...</td><td>1.786</td><td>1.769</td><td>1.754</td><td>1.739</td><td>1.726</td><td>1.713</td><td>1.701</td><td>1.690</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>1</th><td>Andorra</td><td>AND</td><td>Fertility rate, total (births per woman)</td><td>SP.DYN.TFRT.IN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>NaN</td><td>...</td><td>NaN</td><td>NaN</td><td>1.240</td><td>1.180</td><td>1.250</td><td>1.190</td><td>1.220</td><td>NaN</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>2</th><td>Afghanistan</td><td>AFG</td><td>Fertility rate, total (births per woman)</td><td>SP.DYN.TFRT.IN</td><td>7.671</td><td>7.671</td><td>7.671</td><td>7.671</td><td>7.671</td><td>7.671</td><td>...</td><td>7.136</td><td>6.930</td><td>6.702</td><td>6.456</td><td>6.196</td><td>5.928</td><td>5.659</td><td>5.395</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>3</th><td>Angola</td><td>AGO</td><td>Fertility rate, total (births per woman)</td><td>SP.DYN.TFRT.IN</td><td>7.316</td><td>7.354</td><td>7.385</td><td>7.410</td><td>7.425</td><td>7.430</td><td>...</td><td>6.704</td><td>6.657</td><td>6.598</td><td>6.523</td><td>6.434</td><td>6.331</td><td>6.218</td><td>6.099</td><td>NaN</td><td>NaN</td></tr>\n",
"<tr><th>4</th><td>Albania</td><td>ALB</td><td>Fertility rate, total (births per woman)</td><td>SP.DYN.TFRT.IN</td><td>6.186</td><td>6.076</td><td>5.956</td><td>5.833</td><td>5.711</td><td>5.594</td><td>...</td><td>2.004</td><td>1.919</td><td>1.849</td><td>1.796</td><td>1.761</td><td>1.744</td><td>1.741</td><td>1.748</td><td>NaN</td><td>NaN</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 58 columns</p>\n",
"
" ], "text/plain": [ " Country Name Country Code Indicator Name \\\n", "0 Aruba ABW Fertility rate, total (births per woman) \n", "1 Andorra AND Fertility rate, total (births per woman) \n", "2 Afghanistan AFG Fertility rate, total (births per woman) \n", "3 Angola AGO Fertility rate, total (births per woman) \n", "4 Albania ALB Fertility rate, total (births per woman) \n", "\n", " Indicator Code 1960 1961 1962 1963 1964 1965 ... 2004 \\\n", "0 SP.DYN.TFRT.IN 4.820 4.655 4.471 4.271 4.059 3.842 ... 1.786 \n", "1 SP.DYN.TFRT.IN NaN NaN NaN NaN NaN NaN ... NaN \n", "2 SP.DYN.TFRT.IN 7.671 7.671 7.671 7.671 7.671 7.671 ... 7.136 \n", "3 SP.DYN.TFRT.IN 7.316 7.354 7.385 7.410 7.425 7.430 ... 6.704 \n", "4 SP.DYN.TFRT.IN 6.186 6.076 5.956 5.833 5.711 5.594 ... 2.004 \n", "\n", " 2005 2006 2007 2008 2009 2010 2011 2012 2013 \n", "0 1.769 1.754 1.739 1.726 1.713 1.701 1.690 NaN NaN \n", "1 NaN 1.240 1.180 1.250 1.190 1.220 NaN NaN NaN \n", "2 6.930 6.702 6.456 6.196 5.928 5.659 5.395 NaN NaN \n", "3 6.657 6.598 6.523 6.434 6.331 6.218 6.099 NaN NaN \n", "4 1.919 1.849 1.796 1.761 1.744 1.741 1.748 NaN NaN \n", "\n", "[5 rows x 58 columns]" ] }, "execution_count": 2, "metadata": {}, "output_type": "execute_result" } ], "source": [ "data = sm.datasets.fertility.load_pandas().data\n", "data.head()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we construct a DataFrame that contains only the numerical fertility rate data and set the index to the country names. We also drop all the countries with any missing data." ] }, { "cell_type": "code", "execution_count": 3, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "
<table border=\"1\" class=\"dataframe\">\n",
"<thead>\n",
"<tr style=\"text-align: right;\"><th></th><th>1960</th><th>1961</th><th>1962</th><th>1963</th><th>1964</th><th>1965</th><th>1966</th><th>1967</th><th>1968</th><th>1969</th><th>...</th><th>2002</th><th>2003</th><th>2004</th><th>2005</th><th>2006</th><th>2007</th><th>2008</th><th>2009</th><th>2010</th><th>2011</th></tr>\n",
"<tr><th>Country Name</th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th><th></th></tr>\n",
"</thead>\n",
"<tbody>\n",
"<tr><th>Aruba</th><td>4.820</td><td>4.655</td><td>4.471</td><td>4.271</td><td>4.059</td><td>3.842</td><td>3.625</td><td>3.417</td><td>3.226</td><td>3.054</td><td>...</td><td>1.825</td><td>1.805</td><td>1.786</td><td>1.769</td><td>1.754</td><td>1.739</td><td>1.726</td><td>1.713</td><td>1.701</td><td>1.690</td></tr>\n",
"<tr><th>Afghanistan</th><td>7.671</td><td>7.671</td><td>7.671</td><td>7.671</td><td>7.671</td><td>7.671</td><td>7.671</td><td>7.671</td><td>7.671</td><td>7.671</td><td>...</td><td>7.484</td><td>7.321</td><td>7.136</td><td>6.930</td><td>6.702</td><td>6.456</td><td>6.196</td><td>5.928</td><td>5.659</td><td>5.395</td></tr>\n",
"<tr><th>Angola</th><td>7.316</td><td>7.354</td><td>7.385</td><td>7.410</td><td>7.425</td><td>7.430</td><td>7.422</td><td>7.403</td><td>7.375</td><td>7.339</td><td>...</td><td>6.778</td><td>6.743</td><td>6.704</td><td>6.657</td><td>6.598</td><td>6.523</td><td>6.434</td><td>6.331</td><td>6.218</td><td>6.099</td></tr>\n",
"<tr><th>Albania</th><td>6.186</td><td>6.076</td><td>5.956</td><td>5.833</td><td>5.711</td><td>5.594</td><td>5.483</td><td>5.376</td><td>5.268</td><td>5.160</td><td>...</td><td>2.195</td><td>2.097</td><td>2.004</td><td>1.919</td><td>1.849</td><td>1.796</td><td>1.761</td><td>1.744</td><td>1.741</td><td>1.748</td></tr>\n",
"<tr><th>United Arab Emirates</th><td>6.928</td><td>6.910</td><td>6.893</td><td>6.877</td><td>6.861</td><td>6.841</td><td>6.816</td><td>6.783</td><td>6.738</td><td>6.679</td><td>...</td><td>2.428</td><td>2.329</td><td>2.236</td><td>2.149</td><td>2.071</td><td>2.004</td><td>1.948</td><td>1.903</td><td>1.868</td><td>1.841</td></tr>\n",
"</tbody>\n",
"</table>\n",
"<p>5 rows × 52 columns</p>\n",
"