statsmodels.stats.descriptivestats.describe

statsmodels.stats.descriptivestats.describe(data: Union[numpy.ndarray, pandas.core.series.Series, pandas.core.frame.DataFrame], stats: Sequence[str] = None, *, numeric: bool = True, categorical: bool = True, alpha: float = 0.05, use_t: bool = False, percentiles: Sequence[Union[int, float]] = (1, 5, 10, 25, 50, 75, 90, 95, 99), ntop: int = 5) → pandas.core.frame.DataFrame

Extended descriptive statistics for data

Parameters
dataarray_like

Data to describe. Must be convertible to a pandas DataFrame.

statsSequence[str], optional

Statistics to include. If not provided, the full set of statistics is computed. This list may evolve across versions to reflect best practices. Supported options are: “nobs”, “missing”, “mean”, “std_err”, “ci”, “std”, “iqr”, “iqr_normal”, “mad”, “mad_normal”, “coef_var”, “range”, “max”, “min”, “skew”, “kurtosis”, “jarque_bera”, “mode”, “median”, “percentiles”, “distinct”, “top”, and “freq”. See Notes for details.

numericbool, default True

Whether to include numeric columns in the descriptive statistics.

categoricalbool, default True

Whether to include categorical columns in the descriptive statistics.

alphafloat, default 0.05

A number between 0 and 1 representing the size used to compute the confidence interval, which has coverage 1 - alpha.

use_tbool, default False

Use the Student’s t distribution to construct confidence intervals.

percentilessequence[float]

A distinct sequence of floating point values all between 0 and 100. The default percentiles are 1, 5, 10, 25, 50, 75, 90, 95, 99.

ntopint, default 5

The number of top categorical labels to report. Default is 5.

Returns
DataFrame

Descriptive statistics
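To make the layout of the returned object concrete, here is a small sketch with two made-up numeric columns. It assumes that each requested statistic becomes a row and each input column stays a column, which is how the table entries in the Notes are labeled.

```python
import pandas as pd
from statsmodels.stats.descriptivestats import describe

# Hypothetical measurements; names and values are illustrative only.
df = pd.DataFrame(
    {
        "height": [150.0, 160.0, 170.0, 180.0, 190.0],
        "weight": [50.0, 60.0, 65.0, 80.0, 95.0],
    }
)

summary = describe(df, stats=["nobs", "mean", "std", "min", "max"])

# Statistics are looked up by row label, variables by column label.
print(summary)
print(summary.loc["mean", "height"])
```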

See also

pandas.DataFrame.describe

Basic descriptive statistics

Description

Descriptive statistics class with additional output options

Notes

The selectable statistics include:

  • “nobs” - Number of observations

  • “missing” - Number of missing observations

  • “mean” - Mean

  • “std_err” - Standard Error of the mean assuming no correlation

  • “ci” - Confidence interval with coverage (1 - alpha) using the normal or Student’s t distribution, depending on use_t. This option creates two entries in any tables: lower_ci and upper_ci.

  • “std” - Standard Deviation

  • “iqr” - Interquartile range

  • “iqr_normal” - Interquartile range relative to a Normal

  • “mad” - Mean absolute deviation

  • “mad_normal” - Mean absolute deviation relative to a Normal

  • “coef_var” - Coefficient of variation

  • “range” - Range between the maximum and the minimum

  • “max” - The maximum

  • “min” - The minimum

  • “skew” - The skewness defined as the standardized 3rd central moment

  • “kurtosis” - The kurtosis defined as the standardized 4th central moment

  • “jarque_bera” - The Jarque-Bera test statistic for normality based on the skewness and kurtosis. This option creates two entries, jarque_bera and jarque_bera_pval.

  • “mode” - The mode of the data. This option creates two entries in all tables: mode and mode_freq, where mode_freq is the empirical frequency of the modal value.

  • “median” - The median of the data.

  • “percentiles” - The percentiles. Values included depend on the input value of percentiles.

  • “distinct” - The number of distinct categories in a categorical.

  • “top” - The most common categories. Labeled top_n for n in 1, 2, …, ntop.

  • “freq” - The frequency of the common categories. Labeled freq_n for n in 1, 2, …, ntop.
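The following sketch shows how the options above map onto rows of the result, using a made-up column with one missing value. It requests a handful of statistics and prints the resulting row labels, where “ci” and “mode” should each expand into the two entries described in the list.

```python
import numpy as np
import pandas as pd
from statsmodels.stats.descriptivestats import describe

# Toy data with a single missing observation; "x" is an arbitrary name.
df = pd.DataFrame({"x": [1.0, 2.0, 2.0, 3.0, np.nan]})

result = describe(df, stats=["missing", "mean", "ci", "mode"])

# "ci" contributes lower_ci and upper_ci; "mode" contributes mode and mode_freq.
print(list(result.index))
print(result)
```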