Important Features PCA for High Dimensional Clustering

The high dimensional clustering problem, where the number of clusters is known and the signal is sparse, can be solved with the IF-PCA method, as shown here using the ifpca function.
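To see the calling interface in isolation before the real example, here is a minimal synthetic sketch; the dimensions, sparsity level, and signal strength below are illustrative assumptions, not values from the package.

% Toy data: two clusters of size n/2, with a sparse mean shift on the
% first s of p features (all sizes chosen only for illustration).
n = 100; p = 2000; s = 30;
truth = [ones(n/2, 1); 2*ones(n/2, 1)];
mu = zeros(p, 1); mu(1:s) = 0.8;                  % sparse signal
X = randn(p, n);                                  % features by samples
X(:, truth == 2) = X(:, truth == 2) + repmat(mu, 1, n/2);
[estLabel, stats, L] = ifpca(X, 2);               % same signature as used below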

An example with Lung Cancer data

Load the data and transform it into the form we need

load('lungCancer.mat');
% Stack the test and training samples, then transpose so that rows are
% features (genes) and columns are samples.
Data = [lungCancer_test(1:149, 1:12533); lungCancer_train(:, 1:12533)];
Data = Data';
% Column 12534 holds the true class labels.
Class = [lungCancer_test(1:149, 12534); lungCancer_train(:, 12534)];
[p, n] = size(Data);

Run IF-PCA on the data, and record the estimated labels and the error rate

[IFlabel, stats, L] = ifpca(Data, 2);
% Confusion matrix between estimated and true labels; the minimum over the
% two possible label matchings handles the label-swap ambiguity.
t = crosstab(IFlabel, Class);
IFerr = min(sum(diag(t))/n, 1 - sum(diag(t))/n)
IFerr =

    0.0331
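With two clusters, taking the minimum over the two possible matchings is enough. For more than two clusters one would minimize over all label permutations; below is a hedged sketch of such a helper (clusterErr is not part of the package, and it assumes both label vectors take exactly k distinct values).

function err = clusterErr(est, truth)
% Clustering error rate for k clusters: maximize the number of matches
% over all k! pairings of estimated and true labels (small k only).
t = crosstab(est, truth);            % k-by-k confusion matrix
k = size(t, 1);
P = perms(1:k);                      % all label permutations
correct = 0;
for i = 1:size(P, 1)
    correct = max(correct, sum(t(sub2ind(size(t), 1:k, P(i, :)))));
end
err = 1 - correct/sum(t(:));
end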

Run classical PCA without feature selection, and record the estimated labels and the error rate

% Normalize each feature (gene) to have mean 0 and variance 1.
gm = mean(Data'); gsd = std(Data');
Data = (Data - repmat(gm', 1, n))./repmat(gsd', 1, n);
% Take the leading eigenvector of the n-by-n Gram matrix and cluster it.
G = Data'*Data;
[Cv, ~] = eigs(G, 1);
Clabel = kmeans(Cv, 2, 'replicates', 30);
t = crosstab(Clabel, Class);
Cerr = min(sum(diag(t)), n - sum(diag(t)))/n
Cerr =

    0.1215
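As a quick sanity check (a sketch, not part of the original script): the leading eigenvector of the Gram matrix Data'*Data is the leading right singular vector of Data, so svds recovers the same projection up to sign.

[~, ~, V] = svds(Data, 1);           % Data is p-by-n, so V is n-by-1
% V should agree with Cv up to a global sign flip.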

Comparison of the leading eigenvectors with IF-PCA and classical PCA

Find the leading eigenvector of the post-selection data matrix

% Keep only the L top-ranked features selected by IF-PCA, then take the
% leading eigenvector of the corresponding Gram matrix.
data_select = Data(stats.ranking(1:L), :);
G = data_select'*data_select;
[IFv, ~] = eigs(G, 1);
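Since k-means on this post-selection eigenvector is essentially the final step of IF-PCA, clustering IFv directly should closely reproduce the IFlabel returned by ifpca above (a sketch; the replicates value is an arbitrary choice).

IFlabel2 = kmeans(IFv, 2, 'replicates', 30);   % should closely match IFlabel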

Plot the entries of each leading eigenvector, using 0 as the clustering threshold (blue dashed line). Red circles and blue crosses mark the two true groups.

g1 = find(Class == 1); g2 = find(Class == 0);   % indices of the two true groups
subplot(121)
plot(g1, Cv(g1), 'ro', g2, Cv(g2), 'b+', 1:n, 0*(1:n), 'b--')
title('Leading Eigenvector with Classical PCA')
subplot(122)
plot(g1, IFv(g1), 'ro', g2, IFv(g2), 'b+', 1:n, 0*(1:n), 'b--')
title('Leading Eigenvector with IF-PCA')

The figure shows that the leading eigenvector from IF-PCA separates the two groups much more cleanly than the one from classical PCA, consistent with the lower error rate above.