Department of Statistics
Dietrich College of Humanities and Social Sciences

Nonparametric Methods

Non-parametric (or distribution-free) inferential statistical methods are mathematical procedures for statistical hypothesis testing which, unlike parametric statistics, make no assumptions about the probability distributions of the variables being assessed.


A Statistical Method for Estimating Luminosity Functions Using Truncated Data

The observational limitations of astronomical surveys lead to significant statistical inference challenges. One such challenge is the estimation of luminosity functions given redshift (z) and absolute magnitude (M) measurements from an irregularly truncated sample of objects. This is a bivariate density estimation problem; we develop here a statistically rigorous method which (1) does not assume a strict parametric form for the bivariate density; (2) does not assume independence between redshift and absolute magnitude (and hence allows evolution of the luminosity function with redshift); (3) does not require dividing the data into arbitrary bins; and (4) naturally incorporates a varying selection function. We accomplish this by decomposing the bivariate density φ(z,M) via log φ(z,M) = f(z) + g(M) + h(z,M,θ), where f and g are estimated nonparametrically and h takes an assumed parametric form. There is a simple way of estimating the integrated mean squared error of the estimator; smoothing parameters are selected to minimize this quantity. Results are presented from the analysis of a sample of quasars.
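
The smoothing parameters above are chosen by minimizing an estimate of integrated mean squared error. As a hedged illustration of that general principle (not the paper's exact bivariate estimator), the following Python sketch selects a kernel density bandwidth by minimizing the standard least-squares cross-validation estimate of integrated squared error; the data and the grid of candidate bandwidths are invented for illustration.

```python
import numpy as np

def lscv_score(h, x):
    """Least-squares cross-validation estimate (up to a constant) of the
    integrated squared error of a 1-D Gaussian kernel density estimate."""
    n = len(x)
    d = (x[:, None] - x[None, :]) / h
    conv = np.exp(-0.25 * d**2) / np.sqrt(4 * np.pi)  # (K*K)(u) for Gaussian K
    kern = np.exp(-0.5 * d**2) / np.sqrt(2 * np.pi)   # K(u)
    int_fhat_sq = conv.sum() / (n**2 * h)             # integral of f-hat squared
    loo_mean = (kern.sum() - n * kern[0, 0]) / (n * (n - 1) * h)  # leave-one-out term
    return int_fhat_sq - 2 * loo_mean

x = np.random.default_rng(0).normal(size=500)
grid = np.linspace(0.05, 1.0, 40)
h_opt = grid[np.argmin([lscv_score(h, x) for h in grid])]
print(f"selected bandwidth: {h_opt:.3f}")
```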

ANOVA for diffusions

The paper examines the relationship among Ito processes from the angle of quadratic variation. The proposed methodology, "ANOVA for diffusions", allows drawing inference over a time interval rather than at single time points. One application is assessing the goodness of fit of a model for a stochastic process; another is quantifying and characterizing the trading (hedging) error in financial applications.

The reason why the ANOVA permits conclusions over a time interval is that the asymptotic errors of the residual quadratic variation converge as a process (in time). A main conceptual finding is the clear-cut separation of the two sources behind the asymptotics. The variation component (mixed Gaussian) comes only from the discretization error (discrete sampling in time). The bias, on the other hand, depends only on the choice of estimator of the quadratic variation. This two-source principle carries over to other criteria of goodness of fit, for example, the coefficient of determination.
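
To make the central quantity concrete, here is a minimal Python sketch (not the paper's estimator or asymptotic theory): it simulates two discretely sampled Ito processes, estimates a constant hedging ratio by regressing increments, and computes the residual quadratic variation. All parameter values are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000
dt = 1.0 / n
dW, dZ = rng.normal(0.0, np.sqrt(dt), size=(2, n))

dX = dW                          # increments of the driving Ito process
beta = 0.8                       # true (constant) hedging ratio
dY = beta * dX + 0.3 * dZ        # target process with an orthogonal component

beta_hat = np.sum(dX * dY) / np.sum(dX**2)    # least-squares hedge ratio
resid = dY - beta_hat * dX
print("residual quadratic variation:", np.sum(resid**2))  # ~ 0.3**2 * T = 0.09
```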

Bayesian Empirical Likelihood

Empirical likelihood has been suggested as a data-based, nonparametric alternative to the usual likelihood function. Research has shown that empirical likelihood tests have many of the same asymptotic properties as those derived from parametric likelihoods. This leads naturally to the possibility of using empirical likelihood as the basis for Bayesian inference. Different ways in which this goal might be accomplished are considered. The validity of the resultant posterior inferences is examined, as are frequentist properties of the Bayesian empirical likelihood intervals.
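
As a sketch of how empirical likelihood can be plugged into Bayes' rule, the code below profiles the empirical likelihood for a mean via its standard dual (Lagrange multiplier) form and combines it with a flat prior on a grid. The data, prior, and grid are illustrative assumptions, not the paper's examples.

```python
import numpy as np
from scipy.optimize import brentq

def log_el(mu, x):
    """Log empirical likelihood ratio for a mean, via the dual problem:
    maximize sum(log(n * w_i)) s.t. sum(w_i * (x_i - mu)) = 0, sum(w_i) = 1."""
    z = x - mu
    if z.min() >= 0 or z.max() <= 0:      # mu outside the convex hull of the data
        return -np.inf
    lo = -1.0 / z.max() + 1e-8            # keep 1 + lam * z_i > 0 for all i
    hi = -1.0 / z.min() - 1e-8
    lam = brentq(lambda l: np.sum(z / (1.0 + l * z)), lo, hi)
    return -np.sum(np.log1p(lam * z))

x = np.random.default_rng(2).normal(loc=1.0, scale=1.0, size=100)
grid = np.linspace(0.5, 1.5, 201)
log_post = np.array([log_el(m, x) for m in grid])   # flat prior on the grid
post = np.exp(log_post - log_post.max())
post /= post.sum() * (grid[1] - grid[0])            # normalize on the grid
print("posterior mean:", np.sum(grid * post) * (grid[1] - grid[0]))
```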

Bayesian Time Series Modelling with Long-Range Dependence

We present a class of models for trend plus stationary component time series, in which the spectral densities of stationary components are represented via non-parametric smoothness priors combined with long-range dependence components. We discuss model fitting and computational issues underlying Bayesian inference under such models, and provide illustration in studies of a climatological time series. These models are of interest to address the questions of existence and extent of apparent long-range effects in time series arising in specific scientific applications.
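
A minimal sketch of the long-range dependence ingredient (the smoothness priors and full Bayesian fitting of the paper are omitted): a profile Whittle likelihood for the memory parameter d of fractionally integrated noise, evaluated on a grid. The white-noise test data, for which d ≈ 0, are an assumption for illustration.

```python
import numpy as np

def whittle_d(x, d_grid=np.linspace(-0.45, 0.45, 181)):
    """Grid search of the Whittle log-likelihood for the long-memory
    parameter d of fractionally integrated noise, profiling out the scale."""
    n = len(x)
    j = np.arange(1, n // 2)
    lam = 2.0 * np.pi * j / n                          # Fourier frequencies
    I = np.abs(np.fft.fft(x - x.mean())[j])**2 / (2.0 * np.pi * n)  # periodogram
    best_ll, best_d = -np.inf, 0.0
    for d in d_grid:
        shape = np.abs(2.0 * np.sin(lam / 2.0))**(-2.0 * d)  # spectral shape
        scale = np.mean(I / shape)                           # profiled variance
        ll = -np.sum(np.log(scale * shape) + I / (scale * shape))
        if ll > best_ll:
            best_ll, best_d = ll, d
    return best_d

x = np.random.default_rng(3).normal(size=2000)   # white noise: expect d near 0
print("estimated d:", whittle_d(x))
```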

CATS: Clustering After Transformation and Smoothing

CATS - Clustering After Transformation and Smoothing - is a technique for nonparametrically estimating and clustering a large number of curves. Our motivating example is a genetic microarray experiment, but the method is very general. The method includes: transformation and smoothing of multiple curves, multiple nonparametric testing for trends, clustering of curves with similar shapes, and nonparametric inference for the misclustering rate.
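
A hedged sketch of the smoothing and clustering steps on synthetic curves (the transformation, trend-testing, and misclustering-rate stages are omitted); the two curve shapes and all tuning values are assumptions for illustration.

```python
import numpy as np
from scipy.ndimage import uniform_filter1d
from sklearn.cluster import KMeans

rng = np.random.default_rng(4)
t = np.linspace(0.0, 1.0, 50)
# two groups of noisy curves, a synthetic stand-in for microarray time courses
curves = np.vstack([
    np.sin(2 * np.pi * t) + rng.normal(0, 0.3, (30, 50)),
    2 * t + rng.normal(0, 0.3, (30, 50)),
])

smoothed = uniform_filter1d(curves, size=7, axis=1)  # moving-average smoother
smoothed -= smoothed.mean(axis=1, keepdims=True)     # center: cluster on shape
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(smoothed)
print(labels)
```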

Consistency of Posterior Distributions for Neural Networks

In this paper we show that the posterior distribution for feedforward neural networks is asymptotically consistent. This result extends earlier work on the universal approximation properties of neural networks to the Bayesian setting. The proof of consistency embeds the problem in a density estimation problem, then uses bounds on the bracketing entropy to show that the posterior is consistent over Hellinger neighborhoods. It then relates this result back to the regression setting. We show consistency both when the number of hidden nodes grows with the sample size and when the number of hidden nodes is treated as a parameter. Thus we provide a theoretical justification for using neural networks for nonparametric regression in a Bayesian framework.

Cosmic web reconstruction through density ridges: catalogue

We construct a catalogue of filaments using a novel approach called SCMS (subspace constrained mean shift). SCMS is a gradient-based method that detects filaments through density ridges (smooth curves tracing high-density regions). A great advantage of SCMS is its uncertainty measure, which allows an evaluation of the errors for the detected filaments. To detect filaments, we use data from the Sloan Digital Sky Survey, which consist of three galaxy samples: the NYU main galaxy sample (MGS), the LOWZ sample and the CMASS sample. Each of the three data sets covers a different redshift region, so the combined sample allows detection of filaments up to z = 0.7. Our filament catalogue consists of a sequence of two-dimensional filament maps at different redshifts that provide several useful statistics on the evolution of the cosmic web. To construct the maps, we select spectroscopically confirmed galaxies within 0.050 < z < 0.700 and partition them into 130 redshift bins. For each bin, we ignore the redshift, treating the galaxy observations as 2-D data, and detect filaments using SCMS. The filament catalogue consists of 130 individual 2-D filament maps, and each map comprises points on the detected filaments that describe the filamentary structures at a particular redshift. We also apply our filament catalogue to investigate galaxy luminosity and its relation to the distance to the nearest filament. Using a volume-limited sample, we find strong evidence (6.1σ-12.3σ) that galaxies close to filaments are generally brighter than those at significant distance from filaments.

Cosmic web reconstruction through density ridges: method and algorithm

The detection and characterization of filamentary structures in the cosmic web allows cosmologists to constrain parameters that dictate the evolution of the Universe. While many filament estimators have been proposed, they generally lack estimates of uncertainty, reducing their inferential power. In this paper, we demonstrate how one may apply the subspace constrained mean shift (SCMS) algorithm (Ozertem & Erdogmus 2011; Genovese et al. 2014) to uncover filamentary structure in galaxy data. The SCMS algorithm is a gradient ascent method that models filaments as density ridges, one-dimensional smooth curves that trace high-density regions within the point cloud. We also demonstrate how augmenting the SCMS algorithm with bootstrap-based methods of uncertainty estimation allows one to place uncertainty bands around putative filaments. We apply the SCMS first to a data set generated from the Voronoi model. The density ridges show strong agreement with the filaments from the Voronoi model. We then apply the SCMS method to data sets sampled from a P3M N-body simulation, with galaxy number densities consistent with SDSS and WFIRST-AFTA, and to LOWZ and CMASS data from the Baryon Oscillation Spectroscopic Survey (BOSS). To further assess the efficacy of SCMS, we compare the relative locations of BOSS filaments with galaxy clusters in the redMaPPer catalogue, and find that redMaPPer clusters are significantly closer (with p-values < 10^-9) to SCMS-detected filaments than to randomly selected galaxies.
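
A minimal two-dimensional SCMS sketch, under simplifying assumptions (Gaussian kernel, fixed bandwidth, no bootstrap uncertainty bands, toy data on a noisy circle). It follows the generic density-ridge recipe of projecting the mean-shift step onto the Hessian eigendirection with the smallest eigenvalue, not the exact implementation used in the paper.

```python
import numpy as np

def scms(points, mesh, h=0.3, iters=300, tol=1e-7):
    """Subspace constrained mean shift (2-D sketch): push each mesh point
    along the mean-shift step projected onto the eigenvector of the
    log-density Hessian with the smallest eigenvalue, so that points
    settle on one-dimensional density ridges."""
    mesh = mesh.copy()
    for _ in range(iters):
        max_move = 0.0
        for i, m in enumerate(mesh):
            d = (points - m) / h
            w = np.exp(-0.5 * np.sum(d**2, axis=1))          # Gaussian kernel
            grad = (w[:, None] * d).sum(0) / (w.sum() * h)   # grad of log f-hat
            H = (w[:, None, None] * d[:, :, None] * d[:, None, :]).sum(0)
            H = H / (w.sum() * h**2) - np.outer(grad, grad) - np.eye(2) / h**2
            vals, vecs = np.linalg.eigh(H)                   # ascending eigenvalues
            V = vecs[:, :1]                                  # ridge-normal direction
            step = (w[:, None] * points).sum(0) / w.sum() - m  # mean-shift step
            move = (V @ (V.T @ step)).ravel()                # project the step
            mesh[i] = m + move
            max_move = max(max_move, np.abs(move).max())
        if max_move < tol:
            break
    return mesh

rng = np.random.default_rng(5)
theta = rng.uniform(0, 2 * np.pi, 800)
pts = np.c_[np.cos(theta), np.sin(theta)] + rng.normal(0, 0.1, (800, 2))
ridge = scms(pts, pts[::8])   # mesh points converge onto the circular ridge
```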

Inference for the dark energy equation of state using Type Ia supernova data

The surprising discovery of an accelerating universe led cosmologists to posit the existence of "dark energy" - a mysterious energy field that permeates the universe. Understanding dark energy has become the central problem of modern cosmology. After describing the scientific background in depth, we formulate the task as a nonlinear inverse problem that expresses the comoving distance function in terms of the dark-energy equation of state. We present two classes of methods for making sharp statistical inferences about the equation of state from observations of Type Ia Supernovae (SNe). First, we derive a technique for testing hypotheses about the equation of state that requires no assumptions about its form and can distinguish among competing theories. Second, we present a framework for computing parametric and nonparametric estimators of the equation of state, with an associated assessment of uncertainty. Using our approach, we evaluate the strength of statistical evidence for various competing models of dark energy. Consistent with current studies, we find that with the available Type Ia SNe data, it is not possible to distinguish statistically among popular dark-energy models, and that, in particular, there is no support in the data for rejecting a cosmological constant. With much more supernova data likely to be available in coming years (e.g., from the DOE/NASA Joint Dark Energy Mission), we address the more interesting question of whether future data sets will have sufficient resolution to distinguish among competing theories.
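
For concreteness, here is a sketch of the forward model that the inverse problem inverts: the distance modulus implied by a constant equation of state w in a flat universe (w = -1 recovers the cosmological constant). The fiducial parameter values are illustrative assumptions.

```python
import numpy as np
from scipy.integrate import quad

C_KM_S = 299792.458   # speed of light, km/s

def distance_modulus(z, w=-1.0, omega_m=0.3, h0=70.0):
    """Distance modulus for a flat universe with constant dark-energy
    equation of state w; h0 in km/s/Mpc, distances in Mpc."""
    E = lambda zp: np.sqrt(omega_m * (1 + zp)**3
                           + (1 - omega_m) * (1 + zp)**(3 * (1 + w)))
    d_c = (C_KM_S / h0) * quad(lambda zp: 1.0 / E(zp), 0.0, z)[0]  # comoving
    d_l = (1 + z) * d_c                                            # luminosity
    return 5 * np.log10(d_l) + 25

print(distance_modulus(0.5))   # ~42.3 for a Lambda-CDM-like choice
```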

Investigating galaxy-filament alignments in hydrodynamic simulations using density ridges

In this paper, we study the filamentary structures and the galaxy alignment along filaments at redshift z = 0.06 in the MassiveBlack-II simulation, a state-of-the-art, high-resolution hydrodynamical cosmological simulation which includes stellar and AGN feedback in a volume of (100 Mpc h^-1)^3. The filaments are constructed using the subspace constrained mean shift (SCMS; Ozertem & Erdogmus; Chen et al.). First, we show that filaments reconstructed from galaxies and filaments reconstructed from dark matter particles are similar to each other; over 50 per cent of the points on the galaxy filaments have a corresponding point on the dark matter filaments within a distance of 0.13 Mpc h^-1 (and vice versa), and this distance is even smaller in high-density regions. Second, we observe the alignment of the major principal axis of a galaxy with respect to the orientation of its nearest filament and detect a 2.5 Mpc h^-1 critical radius for the filament's influence on the alignment when the subhalo mass of the galaxy is between 10^9 M_⊙ h^-1 and 10^12 M_⊙ h^-1. Moreover, we find the alignment signal to increase significantly with the subhalo mass. Third, when a galaxy is close to filaments (less than 0.25 Mpc h^-1), the galaxy alignment towards the nearest galaxy group is positively correlated with the galaxy subhalo mass. Finally, we find that galaxies close to filaments or groups tend to be rounder than those away from filaments or groups.

New image statistics for detecting disturbed galaxy morphologies at high redshift

Testing theories of hierarchical structure formation requires estimating the distribution of galaxy morphologies and its change with redshift. One aspect of this investigation involves identifying galaxies with disturbed morphologies (e.g. merging galaxies). This is often done by summarizing galaxy images using, e.g., the concentration, asymmetry and clumpiness statistics of Conselice and the Gini-M20 statistics of Lotz et al., and associating particular statistic values with disturbance. We introduce three statistics that enhance detection of disturbed morphologies at high redshift (z ~ 2): the multimode (M), intensity (I) and deviation (D) statistics. We show their effectiveness by training a machine-learning classifier, random forest, using 1639 galaxies observed in the H band by the Hubble Space Telescope WFC3, galaxies that had been previously classified by eye by the Cosmic Assembly Near-IR Deep Extragalactic Legacy Survey collaboration. We find that the MID statistics (and the A statistic of Conselice) are the most useful for identifying disturbed morphologies.

We also explore whether human annotators are useful for identifying disturbed morphologies. We demonstrate that they show limited ability to detect disturbance at high redshift, and that increasing their number beyond ≈10 does not demonstrably yield better classification performance. We propose a simulation-based model-fitting algorithm that mitigates these issues by bypassing annotation.
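
The classifier step can be sketched as follows, with randomly generated stand-ins for the per-galaxy summary statistics (the real features would be the M, I, D, and concentration/asymmetry-type values computed from images); the labels and effect sizes are invented for illustration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(6)
n = 400
disturbed = rng.integers(0, 2, n)               # stand-in eyeball labels (0/1)
# synthetic image statistics, shifted upward for "disturbed" galaxies
features = rng.normal(size=(n, 4)) + disturbed[:, None] * [0.9, 0.7, 0.6, 0.3]

clf = RandomForestClassifier(n_estimators=500, random_state=0)
print("CV accuracy:", cross_val_score(clf, features, disturbed, cv=5).mean())
print("importances:", clf.fit(features, disturbed).feature_importances_)
```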

Non-parametric 3D map of the intergalactic medium using the Lyman-alpha forest

Visualizing the high-redshift Universe is difficult due to the dearth of available data; however, the Lyman-alpha forest provides a means to map the intergalactic medium at redshifts not accessible to large galaxy surveys. Large-scale structure surveys, such as the Baryon Oscillation Spectroscopic Survey (BOSS), have collected quasar (QSO) spectra that enable the reconstruction of H I density fluctuations. The data fall on a collection of lines defined by the lines of sight (LOS) of the QSOs, and a major issue in producing a 3D reconstruction is determining how to model the regions between the LOS. We present a method that produces a 3D map of this relatively uncharted portion of the Universe by employing local polynomial smoothing, a non-parametric methodology. The performance of the method is analysed on simulated data that mimic the varying number of LOS expected in real data, and the method is then applied to a sample region selected from BOSS. The reconstruction is assessed by considering various features of the predicted 3D maps, including visual comparison of slices, probability density functions (PDFs), counts of local minima and maxima, and standardized correlation functions. This 3D reconstruction allows for an initial investigation of the topology of this portion of the Universe using persistent homology.
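
A one-dimensional sketch of the local polynomial (here local linear) smoother at the heart of the method; the real reconstruction smooths flux measurements over 3D positions along the lines of sight, whereas the data below are synthetic.

```python
import numpy as np

def local_linear(x0, x, y, h):
    """Local linear estimate at x0 with a Gaussian kernel: weighted least
    squares of y on (1, x - x0); the fitted intercept is the estimate."""
    w = np.exp(-0.5 * ((x - x0) / h)**2)
    X = np.c_[np.ones_like(x), x - x0]
    beta = np.linalg.solve(X.T @ (w[:, None] * X), X.T @ (w * y))
    return beta[0]

rng = np.random.default_rng(7)
x = rng.uniform(0, 1, 300)
y = np.sin(4 * x) + rng.normal(0, 0.2, 300)
grid = np.linspace(0, 1, 50)
fit = [local_linear(g, x, y, h=0.05) for g in grid]   # smoothed curve on a grid
```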

Nonparametric Confidence Sets for Densities

We present a method for constructing nonparametric confidence sets for density functions based on an approach due to Beran and Dümbgen (1998). We expand the density in an appropriate basis and we estimate the basis coefficients by using linear shrinkage methods. We then find the limiting distribution of an asymptotic pivot based on the quadratic loss function. Inverting this pivot yields a confidence ball for the density.

(Revised 10/04)

Nonparametric Density Estimation and Clustering in Astronomical Sky Surveys

We present a nonparametric method for galaxy clustering in astronomical sky surveys. We show that the cosmological definition of clusters of galaxies is equivalent to density contour clusters (Hartigan, 1975) \(S_c = \{ f > c \}\) where \(f\) is a probability density function. The plug-in estimator \(\hat S_c =\{ \hat f > c \}\) is used to estimate \(S_c\), where \(\hat f\) is the multivariate kernel density estimator. To choose the optimal smoothing parameter, we use cross-validation and the plug-in method, and show that the cross-validation method outperforms the plug-in method in our case. A new cluster catalogue based on the plug-in estimator is compared to existing cluster catalogues, the Abell and the EDCCI. Our result is more consistent with the EDCCI than with the Abell, which is the most widely used catalogue. We present a practical algorithm for local smoothing and use the smoothed bootstrap to assess the validity of the clustering results.

(Revised 10/04)
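
A small sketch of the plug-in level-set idea from the abstract above: estimate the density with a kernel density estimator, threshold at a level c, and read off the connected components as clusters. The two-blob data, evaluation grid, and level are assumptions for illustration.

```python
import numpy as np
from scipy.stats import gaussian_kde
from scipy.ndimage import label

rng = np.random.default_rng(8)
pts = np.vstack([rng.normal(0.0, 0.3, (200, 2)),
                 rng.normal(2.0, 0.3, (200, 2))])
kde = gaussian_kde(pts.T)                      # multivariate kernel density

# evaluate f-hat on a grid and threshold at level c
xs = ys = np.linspace(-1.5, 3.5, 120)
XX, YY = np.meshgrid(xs, ys)
dens = kde(np.vstack([XX.ravel(), YY.ravel()])).reshape(XX.shape)
c = 0.05
clusters, n_clusters = label(dens > c)         # components of {f-hat > c}
print("number of density contour clusters:", n_clusters)   # expect 2
```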

Nonparametric Inference for the Cosmic Microwave Background

The cosmic microwave background (CMB), which permeates the entire Universe, is the radiation left over from just 380,000 years after the Big Bang. On very large scales, the CMB radiation field is smooth and isotropic, but the existence of structure in the Universe (stars, galaxies, clusters of galaxies) suggests that the field should fluctuate on smaller scales. Recent observations, from the Cosmic Background Explorer to the Wilkinson Microwave Anisotropy Probe, have strikingly confirmed this prediction.

CMB fluctuations provide clues to the Universe's structure and composition shortly after the Big Bang that are critical for testing cosmological models. For example, CMB data can be used to determine what portion of the Universe is composed of ordinary matter versus the mysterious dark matter and dark energy. To this end, cosmologists usually summarize the fluctuations by the power spectrum, which gives the variance as a function of angular frequency. The spectrum's shape, and in particular the location and height of its peaks, relates directly to the parameters in the cosmological models. Thus, a critical statistical question is how accurately these peaks can be estimated.

We use recently developed techniques to construct a nonparametric confidence set for the unknown CMB spectrum. Our estimated spectrum, based on minimal assumptions, closely matches the model-based estimates used by cosmologists, but we can make a wide range of additional inferences. We apply these techniques to test various models and to extract confidence intervals on cosmological parameters of interest. Our analysis shows that, even without parametric assumptions, the first peak is resolved accurately with current data but that the second and third peaks are not.

Nonparametric Inference in Astrophysics

We discuss nonparametric density estimation and regression for astrophysics problems. In particular, we show how to compute nonparametric confidence intervals for the location and size of peaks of a function. We illustrate these ideas with recent data on the Cosmic Microwave Background. We also briefly discuss nonparametric Bayesian inference.

Rates of Convergence of Posterior Distributions

We compute the rate at which the posterior distribution concentrates around the true parameter value. The spaces we work in are quite general and include infinite-dimensional cases. The rates are driven by two quantities: the size of the space, as measured by metric entropy or bracketing entropy, and the degree to which the prior concentrates in a small ball around the true parameter. We apply the results to several examples. In some cases, natural priors give sub-optimal rates of convergence and better rates can be obtained by using sieve-based priors.

(Revised 08/98)

Rodeo: Sparse Nonparametric Regression in High Dimensions

We present a method for simultaneously performing bandwidth selection and variable selection in nonparametric regression. The method starts with a local linear estimator with large bandwidths, and incrementally decreases the bandwidth in directions where the gradient of the estimator with respect to bandwidth is large. When the unknown function satisfies a sparsity condition, the approach avoids the curse of dimensionality. The method - called rodeo (regularization of derivative expectation operator) - conducts a sequence of hypothesis tests, and is easy to implement. A modified version that replaces testing with soft thresholding may be viewed as solving a sequence of lasso problems. When applied in one dimension, the rodeo yields a method for choosing the locally optimal bandwidth.
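
A heuristic sketch of the rodeo idea, with a Nadaraya-Watson estimator standing in for the paper's local linear estimator, and a crude numerical derivative and fixed threshold standing in for the paper's exact test statistic; treat the thresholds and constants as assumptions.

```python
import numpy as np

def kernel_estimate(x0, X, y, h):
    """Nadaraya-Watson estimate at x0 with per-coordinate bandwidths h."""
    w = np.exp(-0.5 * np.sum(((X - x0) / h)**2, axis=1))
    return np.sum(w * y) / np.sum(w)

def rodeo_sketch(x0, X, y, h0=1.0, beta=0.9, n_iter=40):
    """Greedy bandwidth descent in the spirit of the rodeo: shrink h_j only
    while the estimate is sensitive to h_j (numerical-derivative heuristic)."""
    h = np.full(X.shape[1], h0)
    for _ in range(n_iter):
        active = False
        for j in range(len(h)):
            h_eps = h.copy()
            h_eps[j] *= 1.01                      # perturb one bandwidth
            dZ = abs(kernel_estimate(x0, X, y, h_eps)
                     - kernel_estimate(x0, X, y, h)) / (0.01 * h[j])
            if dZ > 0.05 and h[j] > 0.05:         # crude threshold and floor
                h[j] *= beta
                active = True
        if not active:
            break
    return h

rng = np.random.default_rng(9)
X = rng.uniform(-1, 1, (500, 5))
y = X[:, 0]**2 + rng.normal(0, 0.1, 500)   # depends on coordinate 0 only
print(rodeo_sketch(np.zeros(5), X, y))      # h[0] shrinks; the rest stay large
```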

Sparse Nonparametric Graphical Models

We present some nonparametric methods for graphical modeling. In the discrete case, where the data are binary or drawn from a finite alphabet, Markov random fields are already essentially nonparametric, since the cliques can take only a finite number of values. Continuous data are different. The Gaussian graphical model is the standard parametric model for continuous data, but it makes distributional assumptions that are often unrealistic. We discuss two approaches to building more flexible graphical models. One allows arbitrary graphs and a nonparametric extension of the Gaussian; the other uses kernel density estimation and restricts the graphs to trees and forests. Examples of both methods are presented. We also discuss possible future research directions for nonparametric graphical modeling.
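
The first approach (a nonparametric extension of the Gaussian) can be sketched with a rank-based transformation of each margin to normal scores, followed by a sparse Gaussian fit; scikit-learn's graphical lasso serves here as a stand-in estimator, and the data are synthetic.

```python
import numpy as np
from scipy.stats import norm, rankdata
from sklearn.covariance import GraphicalLassoCV

rng = np.random.default_rng(10)
Z = rng.multivariate_normal(np.zeros(4),
                            [[1, .6, 0, 0], [.6, 1, 0, 0],
                             [0, 0, 1, .4], [0, 0, .4, 1]], size=500)
X = np.exp(Z)                       # monotone-transformed, non-Gaussian margins

# rank-based step: map each margin to normal scores
U = (rankdata(X, axis=0) - 0.5) / X.shape[0]
Xn = norm.ppf(U)

model = GraphicalLassoCV().fit(Xn)
print(np.round(model.precision_, 2))   # zeros indicate absent graph edges
```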

The Consistency of Posterior Distributions in Nonparametric Problems

We give conditions that guarantee that the posterior probability of every Hellinger neighborhood of the true density tends to 1 almost surely. The conditions are (i) a smoothness condition on the prior and (ii) a requirement that the prior put positive mass in appropriate neighborhoods of the true density. The results are based on the idea of approximating the set of densities with a finite dimensional set of densities and then computing the Hellinger bracketing metric entropy of the approximating set. We apply the results to some examples.