Jiashun's Research
Reports
I have sorted my published papers and submitted manuscripts in two ways:
by year and by area.
Research description
My primary research area is statistical inference
for Big Data, focusing on how to improve inference
by exploiting the many kinds of sparsity
we see today (signal sparsity, graph sparsity,
sparsity in the eigenvalues of matrices, etc.).
Vision
Much of my work has revolved around the vision that in many
of the Big Data applications we see today, the signals are Rare and Weak.
To me, Rare and Weak signals are not
a mathematical curiosity but the unavoidable consequence of the
trend of "large p and small n" we frequently see with Big Data.
When we collect data with increasingly
many features (i.e., increasingly large dimensions), the signals tend
to become increasingly sparse, as the number of true features does not
grow proportionally. At the same time, in many cases we cannot
enroll enough subjects for experiments (for example, in the study of a rare
disease), so the sample size does not grow proportionally with the
number of features, and the signals end up being weak.
In this "Rare and Weak" situation, classical methods and most contemporary
empirical methods are simply overwhelmed, and principled statistical approach
are badly in need.
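To make "Rare and Weak" concrete, a stylized model often used in this literature is the sparse normal means model (a minimal sketch in standard notation; the calibration below is the one common in the rare/weak literature, not a claim about any single paper):

\[ X_i \overset{iid}{\sim} (1-\epsilon)\,N(0,1) + \epsilon\,N(\tau,1), \qquad i = 1, \ldots, p, \]
\[ \epsilon = p^{-\beta}, \qquad \tau = \sqrt{2 r \log p}, \qquad 1/2 < \beta < 1, \quad r > 0. \]

Here a larger \beta means the signals are rarer, and a smaller r means each individual signal is weaker.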
Research Topics
In the past years, I have explored the following topics in high-dimensional data analysis;
in a significant fraction of this work, the theme is "Rare and Weak" signals.
- Large-Scale Multiple Hypothesis Testing.
- Cancer Classification.
- Variable Selection.
- Spectral Clustering and Principal Component Analysis (PCA).
- Random Matrix Theory (RMT).
- Network Analysis.
- Graph Theory and Precision Matrix Estimation.
Applications
My research is motivated by many interesting problems in various application areas.
Methods
I have developed and co-developed several groups of new methods appropriate for Rare and Weak signals.
- Higher Criticism (with variants for signal detection, classification, and spectral clustering); see the sketch after this list.
- Graphlet Screening for variable selection (with two other variants: Univariate Penalized Screening (UPS) and Covariance Assisted Screening and Estimation (CASE)).
- Fourier-transformation-based procedures for estimating the null parameters and the proportion of non-null effects in large-scale multiple testing.
- SCORE for network community detection.
- A new method for estimating the precision matrix (in progress).
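To give a flavor of the first group, below is a minimal sketch of the Higher Criticism statistic for signal detection (the function name, the clipping guard, and the simulated example are my own illustrative choices, not code from any of my papers):

import numpy as np
from scipy.stats import norm

def higher_criticism(pvalues, alpha0=0.5):
    """Higher Criticism statistic: a large value suggests rare/weak non-null effects."""
    p = np.sort(np.asarray(pvalues, dtype=float))
    n = p.size
    # Guard against p-values of exactly 0 or 1, which would divide by zero below.
    p = np.clip(p, 1.0 / n**2, 1.0 - 1.0 / n**2)
    i = np.arange(1, n + 1)
    # Standardized gap between the empirical p-value CDF and the uniform CDF.
    hc = np.sqrt(n) * (i / n - p) / np.sqrt(p * (1.0 - p))
    # Scan only the smallest alpha0 fraction of p-values, where sparse signals show up.
    k = max(1, int(alpha0 * n))
    return hc[:k].max()

# Example: 1% of 10,000 tests carry a weak signal of size 2.5.
rng = np.random.default_rng(0)
n = 10_000
z = rng.standard_normal(n)
z[: n // 100] += 2.5
pv = 2 * norm.sf(np.abs(z))   # two-sided p-values
print(higher_criticism(pv))

Under the global null the statistic grows only at an iterated-logarithm rate, so a markedly large value indicates the presence of non-null effects even when no single test survives a multiplicity correction.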
Theory
I have a strong interest in statistical theory, and I am especially
fond of the so-called "Phase Diagram," a novel way to justify
optimality. The phase diagram can be viewed as a new criterion for
optimality that is especially appropriate for Rare and Weak
signals in Big Data.
Just as water has three phases
(vapor, liquid, and ice), there are three phases for
many statistical problems (variable selection, classification,
multiple testing, spectral clustering). The phase diagram is a
two-dimensional parameter space, where the x-axis calibrates the
signal rarity and the y-axis calibrates the signal strength.
For a particular statistical problem, say variable selection,
the phase space usually partitions into three sub-regions (hence the
name "phase diagram"), Phases I-III:
- In Phase I, the problem under consideration is relatively trivial, since the signals are sufficiently strong.
- In Phase II, the problem is nontrivial, but a reliable solution is still possible, as the signals are moderately strong.
- In Phase III, a reliable solution is impossible, because the signals are simply too rare and weak.
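For illustration, in the rare/weak signal detection problem under the calibration sketched earlier (\epsilon = p^{-\beta}, \tau = \sqrt{2 r \log p}), the boundary between the possible and impossible phases is the well-known detection boundary (stated here as a sketch; see the rare/weak detection literature for the precise conditions):

\[ \rho^*(\beta) = \begin{cases} \beta - 1/2, & 1/2 < \beta \le 3/4, \\ (1 - \sqrt{1 - \beta})^2, & 3/4 < \beta < 1. \end{cases} \]

When r > \rho^*(\beta), reliable detection is possible, and Higher Criticism achieves this adaptively, without knowing \beta or r; when r < \rho^*(\beta), all tests fail asymptotically, and no procedure can do much better than random guessing.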
In the past years, I have worked out the phase diagrams for the following problems.
- Detecting rare and weak signals.
- Variable selection.
- Classification.
- Clustering.
- Estimating the proportion of signals.
- Low-rank matrix recovery.