Cosma Shalizi
Areas of Interest
This is a somewhat telegraphic list of my current projects, and ideas for
projects.  It's primarily intended for my students and potential students.
Ones marked with initials have collaborators already, though if you find the
idea interesting, don't hesitate to bring it up.
NOTE: I am not taking on any new students until the 2018--2019 academic year at the earliest.
Using mis-specified models
-  Bayesian convergence under mis-specification: specific applications, predictive properties (see below under learning theory)
-  Model checking to identify, measure, and if possible fix mis-specification, as in my paper with Gelman
	-  Efficient implementation of double bootstrap, cross-validation, or other bias corrections to posterior predictive tests
	-  Tests based on de-Bayesed prediction intervals (a la Larry, following the "conformal predictors" crowd)
	-  Tests that tell us how to change the model, not just that something is wrong (as with CSSR, or causal discovery algorithms)
-  Ensemble methods (see below under growing ensemble)
-  Bootstrap mis-specification tests for parametric regression based on non-parametric smoothers: computational implementation, power (see relevant chapter
in ADAfaEPoV)
Networks: structure
-  Cross-validation: theoretical properties of the "latin CV" of Dabbs and Junker
-  Bootstrapping networks
-  Smoothing adjacency matrices and graph-sequence limits (BK, LW)
-  Relation of network sufficient statistics to projectibility and prediction (AR)
	-  Are the only projectible ERGMs dyadic-independence models?
	-  Asymptotic distribution of MLE in projectible exponential families --- can the Gärtner-Ellis result be extended to get a Gaussian distribution?
	-  Algebraic characterization of projectibility in ERGMs
	-  Ditto without exponential-family assumption (perhaps assuming completeness of sufficient statistics?)
-  Model selection (CM)
	-  For block models specifically
	-  More generally for network models, especially when graphs are sparse
-  Statistical approaches to discovering communities (more generally, blocks)
-  Detecting network change (CG, AT, DA, LW)
	-  Significance of fluctuations in network summary statistics
	-  Testing for differences in higher-level network structure
-  Effects of aggregating nodes on network inference (SM)
Networks: dynamics
-  Causal inference of contagion/influence on networks
	-  Use of community discovery (HW, MK, EM)
	-  What is identified by random-neighbor assignment?
	-  Bounds/partial identification (AT)
	-  What is the "largest" parameter identified in the usual case?
	-  Experimental design: when is it better to experiment on nodes vs. edges? (AR, VK)
	-  Analogues to "genomic control" to measure typical size of pure-homophily effects?
	-  Adaptation of "cryptic relatedness" measures from genetics?
-  Use of social networks as sensor networks (DA)
-  Distinguishing cultural from biological transmission (DA)
-  Distributed learning and problem-solving (HF)
-  Effects of network structure on institutional change (HF)
Unidentified models
Learning theory for stochastic processes
-  Consistency and convergence of Bayesian nonparametrics for stochastic processes
	-  More comprehensible ("primitive") sufficient conditions for convergence
	-  Consistency/convergence of PDFAs with Pitman-Yor priors
	-  Consistency/convergence for infinite dynamic Bayesian networks
	-  Proof of convergence in risk under Kullback-Leibler loss
	-  Extension to more complicated index sets
	-  Large deviations for location of posterior in space of distributions
	-  Gaussian process approximations to posterior distribution (from expanding existing LDP?)
-  Measuring dependence and effective sample size (DM, MS)
-  Model complexity for time series (DM, MS)
-  Bootstrap-type bounds on forecasting error (DM, RL)
-  Validity of cross-validation for mixing processes, e.g., based on Kontorovich-Ramanan concentration of measure results
-  Learning theory on mixtures of processes
	-  Construction of Rademacher bounds for predictive risk
	-  Other bounds on predictive risk
-  Learning theory for infinite-memory prediction
	-  Characterization of uniform asymptotic-equipartition-style convergence (perhaps VC dimension of the functions X* -> P(X)?)
	-  Explicit risk bounds for same
Predictive-state reconstruction
Relevant papers: arxiv:cond-mat/9907176, arxiv:math.PR/0305160, arxiv:cs.LG/0406011, arxiv:nlin.AO/0409024, arxiv:nlin.CG/0508001, arxiv:1001.0036
-  Improved algorithms for time series (KLK, SS)
-  Classification of time series (KLK)
-  Bootstrap theory for uncertainty estimates (GDM)
-  BIC for tuning control settings
-  Ensemblification by randomizing over hypothesis tests
-  Automatic identification of order parameters: Given complexity field, what function of the immediate state best matches it?
Exponential families of stochastic processes
-  How far can they be justified as asymptotic approximations from large deviations principles?
-  Projectibility and the algebraic form of sufficient statistics (AR; see under network structure)
Regression
-  Lebesgue smoothing (GMG, LW)
-  Use of fused lasso to decide how much partial pooling to do in hierarchical
models
-  Using non-parametric smoothers to test parametric specifications (see above)
-  Distribution of typical regression coefficients for random low-dimensional
projections of sparse high-dimensional systems (i.e., what is the right null
model for linear regression?)
-  Asymptotic probabilistic analysis
-  What aspects are identifiable in pre-asymptotic regime? Discrimination from factor models
-  Adaptation to discrete-choice models, e.g. ideal point models, NOMINATE (JG)
-  LDP for stochastic automata
-  LDP for adaptive-population processes
-  Exponential-family connections (see under "exponential families" above)
Individual sequence learning
-  Growing ensembles
	-  Regret and risk under stationary sources (MS)
	-  Regret bounds in terms of variation of losses (MS)
	-  Tuning of epoch length, fixed share, weight of new model (MS)
	-  Practical applications (AZJ, AC)
-  When does low regret imply a generalization error bound?
-  Working with infinite spaces of models
-  Working with limited feedback
Networks of information flow in neural (and other) systems
Relevant papers: arxiv:q-bio.NC/0506009, arxiv:q-bio.NC/0609008
-  Remapping in fMRI (CG, EM)
-  E-mail networks
Simulation-based inference
Especially indirect inference
-  Consistency conditions for indirect inference (LZ, MS)
-  Indirect inference with non-parametric auxiliary models (SH)
-  Indirect inference for network models (MF)
-  Indirect inference for agent-based models
-  Tractability of indirect inference with chaotic dynamics
-  "Approximate Bayesian computation" with non-parametric summaries; what advantages, if any, does ABC offer over indirect inference?
Power-law distributions
Relevant paper: arxiv:0706.1062
-  Consistency (and rate?) of Clauset's estimator of the tail cut-off
-  Semi-parametric estimation, with non-parametric density estimation below threshold and power-law tail; properties
-  Practical non-parametric density estimation with heavy tails (extending Markovitch and Krieger)
-  Test of Yule-Simon model for citations (AC)
-  Exact distributions for fluctuating feedback (NW)
-  Flesh out calculations about life-span distribution of findings
-  Compare to data from least-favorite field
-  Modifications to basic model, e.g., initial finding inhibits replication but excites testing of related hypotheses