Classification with Higher Criticism threshold

A high-dimensional two-class classification problem in which the signal is sparse and weak can be handled by a linear classifier whose features are selected with the Higher Criticism threshold (HCT). This example shows how to do so with the HCclassification function.
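To make the thresholding idea concrete, here is a minimal sketch of how a Higher Criticism threshold can be computed from per-feature z-scores. It is not the HCclassification source: the synthetic z-scores, the search fraction of 0.10, and all variable names are assumptions made for illustration only.

```matlab
% Sketch only: synthetic data and variable names are hypothetical.
rng(0);                                  % reproducible synthetic example
p = 1000;                                % number of features
z = randn(p, 1);                         % z-scores of pure-noise features
z(1:20) = z(1:20) + 4;                   % assume 20 features carry signal

pval = erfc(abs(z) / sqrt(2));           % two-sided normal p-values
[psorted, order] = sort(pval);           % ascending: most significant first
k = (1:p)' / p;                          % expected fraction under the null
imax = floor(0.10 * p);                  % search only the smallest 10% (assumed)

% HC objective: standardized gap between k and the sorted p-values
hc = sqrt(p) * (k(1:imax) - psorted(1:imax)) ./ sqrt(k(1:imax) .* (1 - k(1:imax)));
[~, istar] = max(hc);                    % index that maximizes the objective

threshold = abs(z(order(istar)));        % HC threshold on |z|
selected  = abs(z) >= threshold;         % features kept by the threshold
```

Features whose absolute z-score falls below the threshold would get zero weight, which is why a count such as sum(wts ~= 0) can be far smaller than the total number of features.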

%% An example with lung cancer data
% Load the data and transpose it into a p-by-n matrix (features in rows, samples in columns)
load('lungCancer.mat');
Data = lungCancertrain(:, 1:12533);
Data = Data';
Class = lungCancertrain(:, 12534);

Pass the data and class labels to the HCclassification function to obtain the classifier.

[wts, stats] = HCclassification(Data, Class, 'clip', 0.2);

Number of useful features

sum(wts ~= 0)
ans =

   475

Only 475 of the 12533 features receive a nonzero weight, so the classifier uses a small fraction of the available features.

Apply the classifier to new data

With the classifier (the outputs wts and stats), new observations can be classified with the HCclassification_fit function.

% Load test data for lung cancer data
Test = lungCancer_test(1:149, 1:12533);
Test = Test';
TrueLabel = lungCancer_test(1:149, 12534);

Pass the test data to the HCclassification_fit function to obtain the estimated labels, then compute the error rate against the true labels.

[label, score] = HCclassification_fit(wts, stats.xbar, stats.s, Test);
HCerr = mean(label ~= TrueLabel)
HCerr =

    0.0067
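The scoring step can be pictured as a standardized linear rule. The sketch below is an assumption about what HCclassification_fit computes, inferred only from its arguments (weights, training means, training standard deviations, test matrix) and not taken from its source. The sign convention (a positive score maps to class 1) and all values are likewise hypothetical.

```matlab
% Sketch only: hypothetical stand-ins for wts, stats.xbar, stats.s and Test.
w    = [1; 0; -1];          % feature weights (zero means the feature is unused)
xbar = [2; 5; 1];           % per-feature training means
s    = [1; 2; 0.5];         % per-feature training standard deviations
X    = [4 1; 5 5; 1 2];     % p-by-n test matrix: 3 features, 2 samples

% Standardize each feature with the training mean and standard deviation
Z = bsxfun(@rdivide, bsxfun(@minus, X, xbar), s);

score = (w' * Z)';          % one linear score per test sample (column vector)
label = double(score > 0);  % assumed convention: positive score -> class 1
```

Under these assumptions, an error rate like HCerr is simply the fraction of samples whose thresholded score disagrees with the true label.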

Plot the scores, with the threshold 0 drawn as a blue dashed line that separates the two predicted groups. Red circles and blue plus signs mark the two true classes.

g1 = find(TrueLabel == 1); g2 = find(TrueLabel == 0);
plot(g1, score(g1), 'ro', g2, score(g2), 'b+', 1:149, 0*(1:149), 'b--')
title('Classification Score for Test Data')

The figure shows that classification with HCT works well: an error rate of 0.0067 corresponds to 1 misclassified observation out of 149.