statsmodels.stats.correlation_tools.corr_nearest_factor¶
- statsmodels.stats.correlation_tools.corr_nearest_factor(corr, rank, ctol=1e-06, lam_min=1e-30, lam_max=1e+30, maxiter=1000)[source]¶
Find the nearest correlation matrix with factor structure to a given square matrix.
- Parameters:
- corr
square
array
The target matrix (to which the nearest correlation matrix is sought). Must be square, but need not be positive semidefinite.
- rank
int
The rank of the factor structure of the solution, i.e., the number of linearly independent columns of X.
- ctol
positive
real
Convergence criterion.
- lam_min
float
Tuning parameter for spectral projected gradient optimization (smallest allowed step in the search direction).
- lam_max
float
Tuning parameter for spectral projected gradient optimization (largest allowed step in the search direction).
- maxiter
int
Maximum number of iterations in spectral projected gradient optimization.
- corr
- Returns:
- rslt
Bunch
rslt.corr is a FactoredPSDMatrix defining the estimated correlation structure. Other fields of rslt contain returned values from spg_optim.
- rslt
Notes
A correlation matrix has factor structure if it can be written in the form I + XX’ - diag(XX’), where X is n x k with linearly independent columns, and with each row having sum of squares at most equal to 1. The approximation is made in terms of the Frobenius norm.
This routine is useful when one has an approximate correlation matrix that is not positive semidefinite, and there is need to estimate the inverse, square root, or inverse square root of the population correlation matrix. The factor structure allows these tasks to be done without constructing any n x n matrices.
This is a non-convex problem with no known guaranteed globally convergent algorithm for computing the solution. Borsdof, Higham and Raydan (2010) compared several methods for this problem and found the spectral projected gradient (SPG) method (used here) to perform best.
The input matrix corr can be a dense numpy array or any scipy sparse matrix. The latter is useful if the input matrix is obtained by thresholding a very large sample correlation matrix. If corr is sparse, the calculations are optimized to save memory, so no working matrix with more than 10^6 elements is constructed.
References
Examples
Hard thresholding a correlation matrix may result in a matrix that is not positive semidefinite. We can approximate a hard thresholded correlation matrix with a PSD matrix as follows, where corr is the input correlation matrix.
>>> import numpy as np >>> from statsmodels.stats.correlation_tools import corr_nearest_factor >>> np.random.seed(1234) >>> b = 1.5 - np.random.rand(10, 1) >>> x = np.random.randn(100,1).dot(b.T) + np.random.randn(100,10) >>> corr = np.corrcoef(x.T) >>> corr = corr * (np.abs(corr) >= 0.3) >>> rslt = corr_nearest_factor(corr, 3)