Jiashun's Page of Software and Data Sets
This page contains some of the software packages I have developed
and co-developed over the years, as well as some data sets to which these packages have been applied.
Software
Most of the software is in the area of high dimensional data analysis,
covering large-scale multiple testing, cancer classification, spectral clustering and Principal
Component Analysis, variable selection, and network data analysis. The software has been
successfully applied to real data analysis in genomics, networks, cosmology, and astronomy.
The software can be roughly divided into four main groups.
-
In the first group, we have our most recent software, on network community detection, namely the method I call SCORE.
-
In the second group, we have software developed around the method of "Graphlet Screening". This is a new method
for high dimensional variable selection, including three variants: Univariate Penalized Screening (UPS), Graphlet Screening,
and Covariance Assisted Screening and Estimation (CASE).
-
In the third group, we have software developed around the
statistic "Higher Criticism". Higher Criticism is a notion that goes back to
John Tukey in 1976. In the past decade, my collaborators and I have become aware
of the ubiquitous phenomenon of "Rare and Weak" signals in high dimensional
analysis, which can be found in many types of Big Data, with genomics as an iconic example.
In this "Rare and Weak" regime, classical methods and many contemporary empirical methods
are simply overwhelmed, and it is desirable to have principled statistical approaches
to address such a situation. Higher Criticism is specifically designed to deal with
Rare and Weak signals. This group of software can be further
divided into the following subgroups.
-
In the fourth group, we have software developed on the problems of "estimating the proportion of non-null effects" and "estimating
the null parameters". The work is closely related to Efron (JASA, 2004) on the choice of null in large-scale multiple testing.
Data Sets
Below are links to a handful of data sets in genomics and some data sets in social networks, most of which have been investigated with our software.