Jiashun's Page of Software and Data Sets

This page contains some of the software packages I have developed and co-developed over the years, as well as some data sets to which these packages have been applied.

Click here to go back to Jiashun's main page.

Software

Most of the software is in the area of high-dimensional data analysis, covering large-scale multiple testing, cancer classification, spectral clustering and Principal Component Analysis, variable selection, and network data analysis. The software has been successfully applied to real data in genomics, network analysis, cosmology, and astronomy.

The software falls roughly into four groups.

In the first group, we have our most recent software, on network community detection, namely the method I call SCORE.
- Click here for MATLAB code of SCORE.
In the second group, we have software developed around the method of "Graphlet Screening". This is a new method for high-dimensional variable selection, with three variants: Univariate Penalized Screening (UPS), Graphlet Screening, and Covariance Assisted Screening and Estimation (CASE).
In the third group, we have software developed around the "Higher Criticism" statistic, a notion that goes back to John Tukey in 1976. Over the past decade, my collaborators and I have become aware of the ubiquitous phenomenon of "Rare and Weak" signals in high-dimensional analysis, which arises in many types of Big Data, with genomics as an iconic example. In this "Rare and Weak" regime, classical methods and many contemporary empirical methods are simply overwhelmed, and principled statistical approaches are needed. Higher Criticism is specifically designed to deal with Rare and Weak signals. This group of software can be further divided into the following subgroups.
In the fourth group, we have software developed on the problems of "estimating the proportion of non-null effects" and "estimating the null parameters". The work is closely related to Efron (JASA, 2004) on the choice of null in large-scale multiple testing.
- Click here for R code on estimating the proportion and the null parameters
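
For readers who want a quick feel for the Higher Criticism statistic mentioned in the third group, here is a minimal Python sketch. It is not the code distributed on this page. It assumes the commonly used p-value form of the statistic, the maximum over the smallest p-values of sqrt(n) * (i/n - p_(i)) / sqrt(p_(i) * (1 - p_(i))). The function name `higher_criticism`, the parameter `alpha0`, and the choice to skip p-values equal to 0 or 1 are illustrative assumptions.

```python
import math


def higher_criticism(p_values, alpha0=0.5):
    """Illustrative Higher Criticism statistic computed from a list of p-values.

    Assumption: the maximum is taken over the smallest alpha0 * n sorted
    p-values; alpha0 is a hypothetical tuning parameter of this sketch.
    """
    n = len(p_values)
    if n == 0:
        raise ValueError("need at least one p-value")

    sorted_p = sorted(p_values)
    # Look only at the smallest fraction alpha0 of the p-values (at least one).
    limit = max(1, int(alpha0 * n))

    best = -math.inf
    for i, p in enumerate(sorted_p[:limit], start=1):
        # Skip degenerate p-values to avoid dividing by zero (a sketch-level choice).
        if p <= 0.0 or p >= 1.0:
            continue
        # Standardized gap between the empirical fraction i/n and the i-th smallest p-value.
        z = math.sqrt(n) * (i / n - p) / math.sqrt(p * (1.0 - p))
        best = max(best, z)
    return best


if __name__ == "__main__":
    # Evenly spread p-values mimic a pure-null setting.
    null_p = [(i - 0.5) / 100 for i in range(1, 101)]
    # Replace a few of them with very small p-values to mimic rare signals.
    signal_p = [1e-6] * 5 + null_p[5:]
    print("null:", higher_criticism(null_p))
    print("with rare signals:", higher_criticism(signal_p))
```

A large value suggests that the smallest p-values are smaller than a pure-null model would produce, which is the kind of evidence the statistic is meant to detect when signals are rare and weak.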

Data Sets

Below are links to a handful of data sets in genomics and some data sets on social networks, most of which have been investigated with our software.