statsmodels.stats.correlation_tools.corr_nearest_factor

statsmodels.stats.correlation_tools.corr_nearest_factor(corr, rank, ctol=1e-06, lam_min=1e-30, lam_max=1e+30, maxiter=1000)[source]

Find the nearest correlation matrix with factor structure to a given square matrix.

Parameters:
corrsquare array

The target matrix (to which the nearest correlation matrix is sought). Must be square, but need not be positive semidefinite.

rankint

The rank of the factor structure of the solution, i.e., the number of linearly independent columns of X.

ctolpositive real

Convergence criterion.

lam_minfloat

Tuning parameter for spectral projected gradient optimization (smallest allowed step in the search direction).

lam_maxfloat

Tuning parameter for spectral projected gradient optimization (largest allowed step in the search direction).

maxiterint

Maximum number of iterations in spectral projected gradient optimization.

Returns:
rsltBunch

rslt.corr is a FactoredPSDMatrix defining the estimated correlation structure. Other fields of rslt contain returned values from spg_optim.

Notes

A correlation matrix has factor structure if it can be written in the form I + XX’ - diag(XX’), where X is n x k with linearly independent columns, and with each row having sum of squares at most equal to 1. The approximation is made in terms of the Frobenius norm.

This routine is useful when one has an approximate correlation matrix that is not positive semidefinite, and there is need to estimate the inverse, square root, or inverse square root of the population correlation matrix. The factor structure allows these tasks to be done without constructing any n x n matrices.

This is a non-convex problem with no known guaranteed globally convergent algorithm for computing the solution. Borsdof, Higham and Raydan (2010) compared several methods for this problem and found the spectral projected gradient (SPG) method (used here) to perform best.

The input matrix corr can be a dense numpy array or any scipy sparse matrix. The latter is useful if the input matrix is obtained by thresholding a very large sample correlation matrix. If corr is sparse, the calculations are optimized to save memory, so no working matrix with more than 10^6 elements is constructed.

References

Examples

Hard thresholding a correlation matrix may result in a matrix that is not positive semidefinite. We can approximate a hard thresholded correlation matrix with a PSD matrix as follows, where corr is the input correlation matrix.

>>> import numpy as np
>>> from statsmodels.stats.correlation_tools import corr_nearest_factor
>>> np.random.seed(1234)
>>> b = 1.5 - np.random.rand(10, 1)
>>> x = np.random.randn(100,1).dot(b.T) + np.random.randn(100,10)
>>> corr = np.corrcoef(x.T)
>>> corr = corr * (np.abs(corr) >= 0.3)
>>> rslt = corr_nearest_factor(corr, 3)