Classification with Higher Criticism threshold
High-dimensional two-class classification problems in which the signal is sparse and weak can be handled by a linear classifier whose features are selected with the Higher Criticism threshold (HCT). This example demonstrates the approach using the HCclassification function.
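To make the idea of a Higher Criticism threshold concrete, here is a minimal sketch on synthetic data. It is not the source of HCclassification: the variable names, the pooled z-score, the normal approximation for the p-values, and the search range alpha0 are all assumptions chosen for illustration. The sketch scores each feature, sorts the p-values, maximizes the HC objective, and keeps the features whose |z| reaches the resulting threshold.

```matlab
% Minimal sketch of HC threshold selection on synthetic data.
% All names are illustrative; this is NOT the HCclassification source.
rng(1);                            % reproducible toy data
p = 1000; n1 = 20; n2 = 20;        % number of features, class sizes
X1 = randn(p, n1);                 % class 1, p-by-n layout
X2 = randn(p, n2);                 % class 2, p-by-n layout
X2(1:30, :) = X2(1:30, :) + 1.5;   % 30 shifted (useful) features

% Two-sample z-scores with a pooled standard deviation per feature
m1 = mean(X1, 2);
m2 = mean(X2, 2);
sp = sqrt(((n1-1)*var(X1, 0, 2) + (n2-1)*var(X2, 0, 2)) / (n1 + n2 - 2));
z  = (m2 - m1) ./ (sp * sqrt(1/n1 + 1/n2));

% Two-sided p-values under a normal approximation (erfc is base MATLAB)
pval = erfc(abs(z) / sqrt(2));
[ps, order] = sort(pval);          % ascending p-values

% HC objective evaluated over the smallest alpha0 fraction of p-values
alpha0 = 0.1;                      % assumed search range
k  = (1:floor(alpha0 * p))';
f  = k / p;
hc = sqrt(p) * (f - ps(k)) ./ sqrt(f .* (1 - f));
[~, kStar] = max(hc);
tHC = abs(z(order(kStar)));        % HC threshold on |z|

% Clipped weights: sign of z where |z| reaches the threshold, else 0
wts = sign(z) .* (abs(z) >= tHC);
```

In this sketch the number of nonzero weights equals the index kStar at which the HC objective peaks, so the data choose how many features to keep rather than a user-supplied count.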
An example with Lung Cancer data
% Load the data and transform it into a p-by-n matrix
load('lungCancer.mat');
Data = lungCancertrain(:, 1:12533);
Data = Data';
Class = lungCancertrain(:, 12534);
Run HCclassification on the training data to obtain the classifier.
[wts, stats] = HCclassification(Data, Class, 'clip', 0.2);
Number of useful features
sum(wts ~= 0)
ans = 475
Only 475 of the 12533 features receive a nonzero weight; the remaining features do not contribute to the classifier.
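Beyond counting the selected features, it is often useful to know which ones they are. The following sketch uses a small made-up weight vector in place of the wts returned by HCclassification; the names selected, nSelected, and fracSelected are illustrative only.

```matlab
% Toy weight vector standing in for the wts returned by HCclassification
wts = [0 1 0 -1 0 0 1 0];

selected     = find(wts ~= 0);           % indices of features with nonzero weight
nSelected    = numel(selected);          % same value as sum(wts ~= 0)
fracSelected = nSelected / numel(wts);   % share of features that are kept
```

Applied to the real wts, selected would hold the row indices of Data that the classifier actually uses.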
Apply the classifier to new data
With the classifier (wts and stats), new observations can be classified using the HCclassification_fit function.
% Load test data for lung cancer data
Test = lungCancer_test(1:149, 1:12533);
Test = Test';
TrueLabel = lungCancer_test(1:149, 12534);
Run HCclassification_fit on the test data to obtain the estimated labels, then compute the error rate.
[label, score] = HCclassification_fit(wts, stats.xbar, stats.s, Test);
HCerr = mean(label ~= TrueLabel)
HCerr = 0.0067
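The arguments passed to HCclassification_fit (weights, per-feature centers, per-feature scales, and a p-by-n test matrix) suggest a standardized linear scoring rule. The sketch below shows such a rule on toy values. It is an assumption about how the score could be formed, not the function's actual source, and the convention that a positive score maps to class 1 is likewise assumed.

```matlab
% Toy stand-ins for the quantities passed to HCclassification_fit
wts  = [1; 0; -1; 0];            % feature weights (zero means the feature is unused)
xbar = [0.5; 1.0; 2.0; 0.0];     % per-feature centers from training
s    = [1.0; 2.0; 0.5; 1.0];     % per-feature scales from training
Test = [2.5 -0.5;                % p-by-n matrix: one column per observation
        3.0  1.0;
        1.0  3.0;
        9.0  9.0];

Zt    = (Test - xbar) ./ s;      % standardize each feature (implicit expansion, R2016b+)
score = (wts' * Zt)';            % one linear score per test observation
label = double(score > 0);       % assumed convention: positive score -> class 1
```

The fourth feature has a large value in both observations but a weight of zero, so it has no effect on either score.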
Plot the scores, with 0 as the threshold (blue dashed line) separating the two predicted groups. Red circles and blue plus signs mark the two true groups.
g1 = find(TrueLabel == 1);
g2 = find(TrueLabel == 0);
plot(g1, score(g1), 'ro', g2, score(g2), 'b+', 1:149, 0*(1:149), 'b--')
title('Classification Score for Test Data')
The figure shows that classification with HCT works well on this data set: the test error rate of 0.0067 corresponds to about 1 misclassified observation out of 149.