statsmodels.tools.tools.categorical

statsmodels.tools.tools.categorical(data, col=None, dictnames=False, drop=False)

Returns a dummy matrix given an array of categorical variables.

Parameters
data : array

A structured array, recarray, array, Series or DataFrame. This can be either a 1d vector holding the categorical variable or a 2d array in which the col argument identifies the column that holds the categorical variable.

col : {str, int, None}

If data is a DataFrame, col must be a column of data. If data is a Series, col must be either the name of the Series or None. If data is a structured array or a recarray, col can be a string that is the name of the column that contains the variable. For all other arrays, col can be an int that is the (zero-based) column index number. col can only be None for a 1d array. The default is None.

dictnames : bool, optional

If True, a dictionary mapping the column number to the categorical name is returned. This is useful for plain arrays, which have no field names to carry that information.

drop : bool

Whether to drop the original categorical variable from the returned matrix. If False, it is kept alongside the dummy columns. The default is False.

Returns
dummy_matrix, [dictnames, optional]

A matrix of dummy (indicator/binary) float variables for the categorical data. If dictnames is True, then the dictionary is returned as well.
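To make the return value concrete, here is a minimal NumPy sketch of what a dummy matrix and a column-number-to-category dictionary look like for a 1d input. It is not the statsmodels implementation: the helper name `dummy_sketch` is hypothetical, and the sketch only mirrors the behavior described above.

```python
import numpy as np

def dummy_sketch(values):
    """Hypothetical helper: one float indicator column per distinct value."""
    values = np.asarray(values)
    # categories holds the sorted distinct values; codes indexes into it
    categories, codes = np.unique(values, return_inverse=True)
    codes = codes.reshape(-1)
    # row i has a 1.0 in the column of its category and 0.0 elsewhere
    dummies = (codes[:, None] == np.arange(len(categories))).astype(float)
    # maps column number -> category name, in the spirit of dictnames=True
    names = {i: cat for i, cat in enumerate(categories.tolist())}
    return dummies, names

dummies, names = dummy_sketch(["yes", "no", "yes", "no", "no"])
```

Each row of `dummies` contains exactly one 1.0, and `names` records which category each column stands for.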

Notes

This returns a dummy variable for EVERY distinct value of the categorical variable. If a structured array or recarray is provided, the name of each new variable is the old variable name, an underscore, and the category name. So if a variable ‘vote’ had answers ‘yes’ or ‘no’, the returned array would have two new variables: ‘vote_yes’ and ‘vote_no’. There is currently no name checking.
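The naming rule can be sketched in plain Python. The variable name and answers are taken from the note above; the snippet only illustrates how such names are formed, is not the library's code, and sorts the categories purely for a deterministic order.

```python
var_name = "vote"
answers = ["yes", "no", "yes"]

# one name per distinct category: old variable name, underscore, category name
new_names = [f"{var_name}_{category}" for category in sorted(set(answers))]
```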

Examples

>>> import numpy as np
>>> import statsmodels.api as sm

Univariate examples

>>> import string
>>> string_var = [string.ascii_lowercase[0:5],
...               string.ascii_lowercase[5:10],
...               string.ascii_lowercase[10:15],
...               string.ascii_lowercase[15:20],
...               string.ascii_lowercase[20:25]]
>>> string_var *= 5
>>> string_var = np.asarray(sorted(string_var))
>>> design = sm.tools.categorical(string_var, drop=True)

Or for a numerical categorical variable

>>> instr = np.floor(np.arange(10,60, step=2)/10)
>>> design = sm.tools.categorical(instr, drop=True)
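As a quick check on the input above, using NumPy only and without calling statsmodels, `instr` holds 25 values in five groups of five. Assuming one indicator column per distinct value, as described in the Notes, that would correspond to five dummy columns.

```python
import numpy as np

# same construction as in the example: 10, 12, ..., 58 mapped to 1.0 ... 5.0
instr = np.floor(np.arange(10, 60, step=2) / 10)

# distinct categories and how often each one occurs
levels, counts = np.unique(instr, return_counts=True)
```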

With a structured array

>>> num = np.random.randn(25,2)
>>> struct_ar = np.zeros((25,1), dtype=[('var1', 'f4'),('var2', 'f4'),
...                      ('instrument','f4'),('str_instr','a5')])
>>> struct_ar['var1'] = num[:,0][:,None]
>>> struct_ar['var2'] = num[:,1][:,None]
>>> struct_ar['instrument'] = instr[:,None]
>>> struct_ar['str_instr'] = string_var[:,None]
>>> design = sm.tools.categorical(struct_ar, col='instrument', drop=True)

Or

>>> design2 = sm.tools.categorical(struct_ar, col='str_instr', drop=True)