Cosma Shalizi

Areas of Interest

This is a somewhat telegraphic list of my current projects, and ideas for projects. It's primarily intended for my students and potential students. Items marked with initials already have collaborators, though if you find the idea interesting, don't hesitate to bring it up.

NOTE: I am not taking on any new students until the 2018--2019 academic year at the earliest.

Using mis-specified models

Bayesian convergence under mis-specification: specific applications, predictive properties (see below under learning theory)
Model checking to identify, measure, and if possible fix mis-specification, as in my paper with Gelman
- Efficient implementation of double bootstrap, cross-validation, or other bias corrections to posterior predictive tests
- Tests based on de-Bayesed prediction intervals (a la Larry, following the "conformal predictors" crowd)
- Tests that tell us how to change the model, not just that something is wrong (as with CSSR, or causal discovery algorithms)
Ensemble methods (see below under growing ensemble)
Bootstrap mis-specification tests for parametric regression based on non-parametric smoothers: computational implementation, power (see relevant chapter in ADAfaEPoV)
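
To make the last item concrete, here is a minimal sketch of one way such a test could go, not necessarily the version in ADAfaEPoV. Assumptions (all mine, for illustration): a one-dimensional linear specification, a Gaussian-kernel Nadaraya-Watson smoother with a fixed bandwidth h, the in-sample MSE gap as the test statistic, and a residual-resampling bootstrap under the fitted linear model. All function names are hypothetical.

```python
import math
import random

def fit_line(x, y):
    """Ordinary least squares for y = a + b*x; returns (a, b)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxx = sum((xi - mx) ** 2 for xi in x)
    sxy = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y))
    b = sxy / sxx
    return my - b * mx, b

def kernel_smooth(x, y, h):
    """Nadaraya-Watson fitted values, Gaussian kernel, fixed bandwidth h."""
    fitted = []
    for x0 in x:
        w = [math.exp(-0.5 * ((xi - x0) / h) ** 2) for xi in x]
        fitted.append(sum(wi * yi for wi, yi in zip(w, y)) / sum(w))
    return fitted

def mse_gap(x, y, h):
    """Test statistic: in-sample MSE of the line minus that of the smoother."""
    a, b = fit_line(x, y)
    n = len(x)
    mse_lin = sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y)) / n
    smooth = kernel_smooth(x, y, h)
    mse_np = sum((yi - si) ** 2 for yi, si in zip(y, smooth)) / n
    return mse_lin - mse_np

def specification_test(x, y, h=0.3, n_boot=200, seed=0):
    """Bootstrap p-value for the null that the linear specification is right."""
    rng = random.Random(seed)
    a, b = fit_line(x, y)
    resid = [yi - (a + b * xi) for xi, yi in zip(x, y)]
    observed = mse_gap(x, y, h)
    exceed = 0
    for _ in range(n_boot):
        # Simulate from the fitted parametric model by resampling residuals
        # (an assumption; a fully parametric noise model would also work).
        y_star = [a + b * xi + rng.choice(resid) for xi in x]
        if mse_gap(x, y_star, h) >= observed:
            exceed += 1
    return (exceed + 1) / (n_boot + 1)
```

Each bootstrap replicate costs O(n^2) kernel evaluations here, which is exactly the sort of computational burden the item is about; the power question is how the p-value behaves as the truth drifts away from the linear form.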

Networks: structure

Cross-validation: theoretical properties of the "latin CV" of Dabbs and Junker
Bootstrapping networks
- adapting the "pigeonhole" bootstrap of Owen and Eckles
- The "empirical graphon" (AG)
- The snowball bootstrap of Eldardiry and Neville: This seems to over-sample high-degree nodes, but are there functionals it is good for? Are there corrections?
Smoothing adjacency matrices and graph-sequence limits (BK, LW)
Relation of network sufficient statistics to projectibility and prediction (AR)
- Are the only projectible ERGMs dyadic-independence models?
- Asymptotic distribution of MLE in projectible exponential families --- can the Gärtner-Ellis result be extended to get a Gaussian distribution?
- Algebraic characterization of projectibility in ERGMs
- Ditto without exponential-family assumption (perhaps assuming completeness of sufficient statistics?)
Model selection (CM)
- For block models specifically
- More generally for network models, especially when graphs are sparse
Statistical approaches to discovering communities (more generally, blocks)
Detecting network change (CG, AT, DA, LW)
- Significance of fluctuations in network summary statistics
- Testing for differences in higher-level network structure
Effects of aggregating nodes on network inference (SM)
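
For the "empirical graphon" item under bootstrapping networks, the basic resampling step can be sketched as follows. This is a sketch under my own assumptions, not a definitive implementation: the graph is a simple undirected one stored as a 0/1 adjacency matrix (list of lists) with zero diagonal, nodes are drawn with replacement, and the bootstrap graph copies the corresponding entries of the observed matrix. The function names and the choice of edge density as the functional are hypothetical.

```python
import random

def empirical_graphon_resample(adj, rng):
    """Draw n node labels with replacement and copy the matching entries.

    Two copies of the same original node are never linked to each other,
    because the observed diagonal is zero.
    """
    n = len(adj)
    picks = [rng.randrange(n) for _ in range(n)]
    return [[adj[i][j] for j in picks] for i in picks]

def edge_density(adj):
    """Fraction of ordered node pairs (i != j) joined by an edge."""
    n = len(adj)
    return sum(sum(row) for row in adj) / (n * (n - 1))

def bootstrap_density_interval(adj, n_boot=500, level=0.9, seed=0):
    """Percentile interval for edge density from resampled graphs."""
    rng = random.Random(seed)
    stats = sorted(
        edge_density(empirical_graphon_resample(adj, rng))
        for _ in range(n_boot)
    )
    lo = stats[int((1 - level) / 2 * n_boot)]
    hi = stats[min(n_boot - 1, int((1 + level) / 2 * n_boot))]
    return lo, hi
```

Swapping edge_density for another functional (triangle counts, degree moments) is the natural way to probe which functionals a given network bootstrap handles well.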

Networks: dynamics

Causal inference of contagion/influence on networks
- Use of community discovery (HW, MK, EM)
- What is identified by random-neighbor assignment?
- Bounds/partial identification (AT)
- What is the "largest" parameter identified in the usual case?
- Experimental design: when is it better to experiment on nodes vs. edges? (AR, VK)
- Analogues to "genomic control" to measure typical size of pure-homophily effects?
- Adaptation of "cryptic relatedness" measures from genetics?
Use of social networks as sensor networks (DA)
Distinguishing cultural from biological transmission (DA)
Distributed learning and problem-solving (HF)
Effects of network structure on institutional change (HF)

Unidentified models

Mathematical construction of maximal identified parameter
Application to social influence
Application to macroeconomic models
Connection to "partial identification" in Manski's sense?

Learning theory for stochastic processes

Consistency and convergence of Bayesian nonparametrics for stochastic processes
- More comprehensible ("primitive") sufficient conditions for convergence
- Consistency/convergence of PDFAs with Pitman-Yor priors
- Consistency/convergence for infinite dynamic Bayesian networks
- Proof of convergence in risk under Kullback-Leibler loss
- Extension to more complicated index sets
- Large deviations for location of posterior in space of distributions
- Gaussian process approximations to posterior distribution (from expanding existing LDP?)
Measuring dependence and effective sample size (DM, MS)
- Estimating measures of weak dependence other than beta-mixing, along the lines of our estimation of beta-mixing coefficients
- Purely finite-dimensional bounds on generalization error (as opposed to the current bounds, which invoke functionals of the infinite-dimensional distribution)
Model complexity for time series (DM, MS)
- Rademacher complexity of time-series models
  - Estimation of empirical Rademacher complexity
  - Other possible noise-correlation notions of complexity
- Implicit constraints from stationarity
- Complexity of general state-space models
- Complexity of specific restricted forms of state-space model
Bootstrap-type bounds on forecasting error (DM, RL)
Validity of cross-validation for mixing processes, e.g., based on Kontorovich-Ramanan concentration of measure results
Learning theory on mixtures of processes
- Construction of Rademacher bounds for predictive risk
- Other bounds on predictive risk
Learning theory for infinite-memory prediction
- Characterization of uniform asymptotic-equipartition-style convergence (perhaps VC dimension of the functions X^* -> P(X) ?)
- Explicit risk bounds for same

Predictive-state reconstruction

Relevant papers: arxiv:cond-mat/9907176, arxiv:math.PR/0305160, arxiv:cs.LG/0406011, arxiv:nlin.AO/0409024, arxiv:nlin.CG/0508001, arxiv:1001.0036

Improved algorithms for time series (KLK, SS)
Classification of time series (KLK)
Bootstrap theory for uncertainty estimates (GDM)
BIC for tuning control settings
Ensemblification by randomizing over hypothesis tests
Automatic identification of order parameters: Given complexity field, what function of the immediate state best matches it?

Exponential families of stochastic processes

How far can they be justified as asymptotic approximations from large deviations principles?
Projectibility and the algebraic form of sufficient statistics (AR; see under network structure)

Regression

Lebesgue smoothing (GMG, LW)
Use of fused lasso to decide how much partial pooling to do in hierarchical models
Using non-parametric smoothers to test parametric specifications (see above)
Distribution of typical regression coefficients for random low-dimensional projections of sparse high-dimensional systems (i.e., what is the right null model for linear regression?)

Thomson's "sampling" model of psychological abilities

Asymptotic probabilistic analysis
What aspects are identifiable in pre-asymptotic regime? Discrimination from factor models
Adaptation to discrete-choice models, e.g. ideal point models, NOMINATE (JG)

Large deviations

LDP for stochastic automata
LDP for adaptive-population processes
Exponential-family connections (see under "exponential families" above)

Individual sequence learning

Growing ensembles
- Regret and risk under stationary sources (MS)
- Regret bounds in terms of variation of losses (MS)
- Tuning of epoch length, fixed share, weight of new model (MS)
- Practical applications (AZJ, AC)
When does low regret imply a generalization error bound?
Working with infinite spaces of models
Working with limited feedback

Networks of information flow in neural (and other) systems

Relevant papers: arxiv:q-bio.NC/0506009, arxiv:q-bio.NC/0609008

Remapping in fMRI (CG, EM)
E-mail networks

Simulation-based inference

Especially indirect inference

Consistency conditions for indirect inference (LZ, MS)
Indirect inference with non-parametric auxiliary models (SH)
Indirect inference for network models (MF)
Indirect inference for agent-based models
Tractability of indirect inference with chaotic dynamics
"Approximate Bayesian computation" with non-parametric summaries; what advantages, if any, does ABC offer over indirect inference?

Power-law distributions

Relevant paper: arxiv:0706.1062

Consistency (and rate?) of Clauset's estimator of the tail cut-off
Semi-parametric estimation, with non-parametric density estimation below threshold and power-law tail; properties
Practical non-parametric density estimation with heavy tails (extending Markovitch and Krieger)
Test of Yule-Simon model for citations (AC)
Exact distributions for fluctuating feedback (NW)
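
For reference, the estimator of the tail cut-off in the first item works roughly like this: for each candidate threshold, fit the power-law exponent to the data above it by maximum likelihood, then keep the threshold whose fitted tail is closest to the empirical tail in Kolmogorov-Smirnov distance. A minimal sketch for the continuous case follows; the function names and the min_tail guard (refusing tails with too few points) are my own illustrative assumptions, not part of any published code.

```python
import math
import random

def fit_tail(data, xmin):
    """Exponent MLE and KS distance for a continuous power law on x >= xmin."""
    tail = sorted(x for x in data if x >= xmin)
    n = len(tail)
    log_sum = sum(math.log(x / xmin) for x in tail)
    if log_sum <= 0:
        # Degenerate tail (every point equals xmin): no usable fit.
        return float("inf"), float("inf")
    alpha = 1.0 + n / log_sum
    ks = 0.0
    for i, x in enumerate(tail):
        model_cdf = 1.0 - (x / xmin) ** (1.0 - alpha)
        # Compare the model CDF with the empirical CDF just before and at x.
        ks = max(ks, abs(model_cdf - i / n), abs(model_cdf - (i + 1) / n))
    return alpha, ks

def estimate_xmin(data, min_tail=10):
    """Return (xmin, alpha, ks) minimizing the KS distance over thresholds."""
    best = None
    for xmin in sorted(set(data)):
        tail_size = sum(1 for x in data if x >= xmin)
        if tail_size < min_tail:
            break  # larger thresholds only leave smaller tails
        alpha, ks = fit_tail(data, xmin)
        if best is None or ks < best[2]:
            best = (xmin, alpha, ks)
    return best
```

This brute-force scan is quadratic in the sample size. The consistency question is whether the selected threshold converges on the true cut-off when the data really are a mixture of an arbitrary body and a power-law tail.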

Sufficiency and process characterization

Density estimation on graphical models

Neutral model of inquiry

Flesh out calculations about life-span distribution of findings
Compare to data from least-favorite field
Modifications to basic model, e.g., initial finding inhibits replication but excites testing of related hypotheses