Case Studies in Bayesian Statistics
Workshop 8 - 2005

September 17, 2005, 3:45 -- 5:25 pm, Baker Hall Giant Eagle Auditorium
Carnegie Mellon University
Pittsburgh, PA

Invited Session for Young Statisticians
Saturday, September 17
3:45 -- 4:05 pm S.T. Jensen
A Bayesian Hierarchical Model for Integrating Heterogeneous Biological Information
4:05 -- 4:25 pm Y. Liang, X. Cai and A. Kelemen
Bayesian State Space Models for Inferring Gene Interaction Networks
4:25 -- 4:45 pm J.R. Lockwood and L.T. Mariano
Bayesian Methods for Scalable Multi-Subject Value-Added Assessment
4:45 -- 5:05 pm J. Ranta, S. Lievonen and R. Maijala
Salmonella in egg production chain: a Bayesian farm to fork risk assessment
5:05 -- 5:25 pm R. Steele
Bayesian Prediction of Colorectal Cancer Treatment Efficacy: Difficulties of Classification with Small Samples
(NOTE: This schedule includes 5 minutes floor discussion for each talk.)

A Bayesian Hierarchical Model for Integrating Heterogeneous Biological Information

Shane T. Jensen

A substantial focus of research in molecular biology is the network of factors that control the activity of different genes in living cells. Genes are regulated by proteins called transcription factors (TFs) that bind directly to DNA in close proximity to certain target genes and change the activity of these genes in response to changing environmental conditions. Previous statistical approaches for identifying gene regulatory networks have used isolated information from either expression microarray data or chromatin-immunoprecipitation (ChIP) binding data. Although those approaches have proven useful, their power is inherently limited by the fact that each data resource provides only partial information: expression data provides only indirect evidence of regulation (through mRNA levels), whereas ChIP binding data provides only physical binding information. Recent efforts at integrating these data types have involved very little systematic modeling, despite the fact that the Bayesian paradigm provides a natural framework for combining heterogeneous sources of information. We present a Bayesian hierarchical model and Gibbs sampling implementation that integrates gene expression data and ChIP binding data in a principled fashion. The gene expression data is modeled as a function of the unknown gene regulatory network, which has an informed prior distribution based upon the ChIP binding data. Unlike previous methods, our Bayesian model does not require a priori clustering of the expression data and allows each gene to belong to multiple regulatory pathways. We applied our model to genome-wide ChIP binding data and 500 expression experiments in yeast. Several validation analyses show that our predicted gene-TF relationships are more likely to be biologically relevant than those obtained using either source of information alone.
In addition, our model incorporates the possibility of interactions between pairs of transcription factors, which allowed us to identify 84 TF pairs with synergistic effects on expression of target genes. Our Bayesian framework also allows for the inclusion of additional data sources, such as known or predicted transcription factor binding sites, which will be beneficial in upcoming applications to human and mouse studies where ChIP binding data is less reliable. This work was done in collaboration with Guang Chen in the Department of Bioengineering and Christian Stoeckert, Jr. in the Department of Genetics at the University of Pennsylvania. The application of our model to yeast has been submitted for publication in BMC Bioinformatics.
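The core idea of combining a ChIP-informed prior with an expression likelihood can be illustrated with a toy two-component calculation. This is a sketch only: the mixture form, `mu_reg`, `mu_bg`, and `sigma` are illustrative assumptions, not the paper's hierarchical model, which is fit by Gibbs sampling over the full regulatory network.

```python
import math

def posterior_regulation_prob(chip_prior, expr_data, mu_reg=1.0, mu_bg=0.0, sigma=1.0):
    """Posterior probability that a TF regulates a gene, combining a
    ChIP-informed prior with a toy Gaussian expression likelihood.
    (Illustrative only: not the paper's full hierarchical model.)"""
    # Log-likelihood of the expression readings under "regulated" vs "background"
    ll_reg = sum(-0.5 * ((x - mu_reg) / sigma) ** 2 for x in expr_data)
    ll_bg = sum(-0.5 * ((x - mu_bg) / sigma) ** 2 for x in expr_data)
    # Bayes' rule on the log-odds scale for numerical stability
    log_odds = math.log(chip_prior / (1 - chip_prior)) + (ll_reg - ll_bg)
    return 1 / (1 + math.exp(-log_odds))

# Strong ChIP binding evidence plus elevated expression -> high posterior
p = posterior_regulation_prob(0.7, [0.9, 1.1, 0.8])
```

The point of the sketch is the structure of the integration: binding data enters through the prior, expression data through the likelihood, so neither source alone determines the call.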

Bayesian State Space Models for Inferring Gene Interaction Networks

Yulan Liang, Xueya Cai, Arpad Kelemen
After the completion of the genome sequencing project, new computational challenges arose in functional genomics, including interaction network modeling, pathway discovery, and function prediction. Time course gene expression experiments are often used in microarray studies because knowing when and where a gene is expressed provides a strong clue to its biological role. The existing mathematical and pharmaceutical models for genomic temporal data have not incorporated stochastic processes or time delay information, and they often treat the biological parameters as fixed values, modeling them deterministically without accounting for the probability and uncertainty involved in their estimation. Dynamic Bayesian clustering approaches have been developed; however, their autoregressive models require evenly spaced time points. They also require stationarity conditions, which do not hold in microarray experiments. Curve fitting methods (e.g., polynomials in time) have also been applied to microarray data; they may produce a good fit but do not facilitate prediction, owing to overfitting. Bayesian decomposition methods have likewise been developed for microarray data, using matrix decomposition in a Bayesian setting; a difficulty there is the need to estimate the dimensionality. Dynamic state space models offer greater flexibility for modeling non-stationary temporal microarray data; however, standard Kalman filter methods rely on linear state transitions and Gaussian errors. In this paper, we develop Bayesian state space models to tackle some of these challenges and to infer and predict the transcription profiles and interaction networks derived from microarray experiments. The newly developed models simplify estimation in settings where non-standard distributions and non-linear models are more realistic for genomic data.
Various models with different prior distributions and covariance matrices are investigated, and the Deviance Information Criterion (DIC) is used for model selection and identification. Computation of the marginal posterior distributions and their respective moments is performed by Markov chain Monte Carlo (MCMC) simulation with Gibbs sampling. The computed covariance and correlation matrices were used to construct the gene-gene and gene-time interaction networks and pathways. The performance of the developed models is evaluated on multi-tissue temporal Affymetrix data sets that describe systemic temporal response cascades to therapeutic doses. Results show that our models capture the dynamics of gene expression well and have great potential for biological applications.
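The DIC computation from MCMC output can be sketched in a few lines. This is a minimal illustration using a Gaussian mean model as a stand-in likelihood; the deviance function, data, and draws are all hypothetical, not the state space models of the abstract.

```python
import math
import random

def deviance(theta, data, sigma=1.0):
    """Deviance -2 log L for a Gaussian mean model (toy stand-in
    for the state space likelihood)."""
    ll = sum(-0.5 * math.log(2 * math.pi * sigma ** 2)
             - 0.5 * ((x - theta) / sigma) ** 2 for x in data)
    return -2 * ll

def dic(posterior_draws, data):
    """DIC = Dbar + pD, with effective parameters pD = Dbar - D(theta_bar)."""
    dbar = sum(deviance(t, data) for t in posterior_draws) / len(posterior_draws)
    theta_bar = sum(posterior_draws) / len(posterior_draws)
    return dbar + (dbar - deviance(theta_bar, data))

random.seed(0)
data = [0.2, -0.1, 0.4, 0.0]
# Pretend these are MCMC draws of the mean parameter from a Gibbs sampler
draws = [random.gauss(0.125, 0.5) for _ in range(2000)]
model_dic = dic(draws, data)
```

Among candidate models fit to the same data, the one with the smallest DIC is preferred, trading off fit (Dbar) against effective model complexity (pD).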

Bayesian Methods for Scalable Multi-Subject Value-Added Assessment

J.R. Lockwood and Louis T. Mariano
In the education research and policy communities, there is increased interest in using so-called "value-added assessment" (VAA) for educator accountability and decision making. VAA is a general term referring to a class of statistical methods that use longitudinal student achievement data to make inferences about the effects of educational entities, such as schools and teachers, on student growth. The widespread appeal of VAA derives from its purported ability to purge student achievement of the effects of non-schooling inputs (such as socioeconomic status or parental education), thus allowing fair comparisons of educators teaching diverse populations of students. Although practitioners are eager to explore whether VAA methods can live up to their promise, theoretical development of advanced VAA models has outpaced practical implementation. The statistical models producing value-added estimates of educator contributions to student learning require complex accounting of the linkage of students to teachers and schools over time, as well as assumptions about how past educational experiences affect current achievement. These complexities challenge traditional likelihood estimation, particularly for the large datasets that are becoming more common with the increased emphasis on standardized testing and data-driven decision making. In this talk, which summarizes the work of Lockwood et al. (2005; conditionally accepted by the Journal of Educational and Behavioral Statistics), we introduce a general multivariate, longitudinal model for student outcomes and demonstrate how casting the model in the Bayesian framework bridges the gap between theory and practice. The model explicitly parameterizes the long-term effects of past teachers on student outcomes in future years, a topic of considerable interest that has not been examined empirically in the literature.
We use the model to estimate teacher effects and the persistence of those effects over time using multi-subject student achievement data from a large urban school district. We discuss the computational and inferential advantages of the Bayesian approach to VAA, including its scalability to very large datasets, its natural mechanism for dealing with missing test score data, its facility with handling complex latent parametric structures, and its coherent framework for communicating uncertainty about estimated effects.
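The persistence idea (past teachers continuing to affect future scores) can be sketched as a tiny simulation. The persistence rate `alpha`, the teacher effects, and the noise level are all hypothetical; the actual model is a full multivariate longitudinal Bayesian model, not this toy.

```python
import random

def simulate_scores(teacher_assignments, teacher_effects, alpha, sigma=0.1):
    """Simulate one student's yearly scores where the effect of each
    past teacher persists, damped by alpha per elapsed year.
    (A toy version of a layered persistence model.)"""
    scores = []
    for t in range(len(teacher_assignments)):
        # Sum the damped contributions of every teacher up to year t
        mean = sum(alpha ** (t - s) * teacher_effects[teacher_assignments[s]]
                   for s in range(t + 1))
        scores.append(mean + random.gauss(0, sigma))
    return scores

random.seed(1)
effects = {"A": 0.5, "B": -0.2}  # hypothetical teacher effects
scores = simulate_scores(["A", "B", "B"], effects, alpha=0.6)
```

In the Bayesian VAA setting, the teacher effects and the persistence parameters are unknowns with priors, and the linkage of students to teachers over time determines which effects enter each score.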

Salmonella in egg production chain: a Bayesian farm to fork risk assessment

Jukka Ranta, Satu Lievonen and Riitta Maijala
Foodborne infections caused by Salmonella remain a source of concern for the modern food industry as well as for authorities worldwide. To study the effect of various risk control options, attempts have been made to quantify the risks over the whole production chain from farm to fork. The use of the Bayesian approach in Quantitative Risk Assessment (QRA) in this field has largely been limited to uncertainty distributions for separate input parameters of mechanistic simulation models, in which Monte Carlo forward simulation is widely used as a tool for propagating uncertainty. The Bayesian approach makes it possible to update the prevalence estimates jointly over the production chain using several sources of data along the chain. For example, the prevalence estimates at a specific part of the chain are influenced not only by directly related data but also by other prevalence data concerning earlier and/or later parts of the production chain. The Bayesian QRA of Salmonella in the Finnish egg production chain consists of modules for breeder flocks, production flocks, produced eggs, and consumption. The first two dynamic modules provide estimates of flock-level prevalence, whereas the third module deals with total and Salmonella-positive egg production. Eventually, a predictive distribution will be used for estimating the attributable number of human cases caused by Salmonella in eggs. Each module describes a chain of events, while the amount and quality of data can differ considerably between modules. The complexity of the flock life history specific data structures becomes a challenge due to the sharp pyramid shape of the egg production chain: one breeder flock can be the origin of a thousand production flocks. Thus individual flock life histories with latent infection states were modeled at the top of the pyramid, whereas a stationary distribution model induced by a generic transition probability matrix was utilized for the summary data on production flocks.
In addition to Bayesian inference concerning the current true prevalence, the assessment of interventions calls for a causal analysis of the effects along the production chain. The combined modules were used for assessing the effect of removal of detected positive flocks and the effect of alternative testing strategies. A manuscript and a research report are in preparation. Our previous Bayesian modeling on QRA (on the broiler production chain) has been published in the Journal of Risk Analysis and the International Journal of Food Microbiology.
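The stationary distribution used for the production-flock summary data can be illustrated with a minimal two-state chain. The transition probabilities below are illustrative, not the assessment's estimated values.

```python
def stationary(P, iters=200):
    """Stationary distribution of a row-stochastic transition matrix
    by power iteration."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi

# Hypothetical monthly flock transitions: state 0 = Salmonella-free,
# state 1 = infected (rates are illustrative)
P = [[0.95, 0.05],
     [0.30, 0.70]]
pi = stationary(P)  # pi[1] = long-run flock-level prevalence
```

In the full assessment, the transition probabilities themselves carry posterior uncertainty, so the induced stationary prevalence is a distribution rather than a point value.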

Bayesian Prediction of Colorectal Cancer Treatment Efficacy: Difficulties of Classification with Small Samples

Russell Steele
Pre-operative radiotherapy for rectal cancer improves the odds of overall patient survival and reduces local recurrence rates versus post-operative radiation or surgery alone. Additionally, high-dose radiation treatment may result in significant tumor downstaging, high rates of complete tumor regression, and a reduction in surgical complexity. The ability to predict prior to treatment which tumors will be responsive to pre-operative radiotherapy would have a significant impact on patient selection. In this talk I will discuss several classification techniques (ranging from logistic regression to classification trees and neural networks) in the context of developing predictive protein biomarker models for the efficacy of colorectal cancer treatment. Although most classification and statistical learning approaches focus on problems with large numbers of features and observations, we will examine what types of inference can be drawn from applying sophisticated models to small datasets (fewer than 100 observations with only 5 features available for classification). Limited sample sizes often indicate a possible gain through the use of Bayesian methods, but the choice of prior distributions and the estimation of posterior distributions can be difficult to implement (especially for models such as classification trees and neural networks). We will compare frequentist and Bayesian results for classification trees and neural networks and the implications for model selection and prediction with small samples. In particular, we will show how frequentist approaches via cross-validation with small samples can be far less informative than fully Bayesian model selection. Frequentist analyses of subsets of this dataset have already been accepted for publication in the Journal of Clinical Cancer Research and have been submitted to Cancer.
Fully Bayesian model selection for this problem via neural networks with tempered MCMC methods and bridge sampling is the subject of a manuscript that will soon be submitted and a separate overview of the practical limitations of small sample classification will be submitted to the Journal of Classification for possible inclusion in the proceedings of the CSNA/Interface meeting held in St. Louis in June 2005.
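Why small samples leave so much uncertainty can be seen with a simple conjugate calculation: under a Beta prior, the posterior for a classifier's accuracy is Beta, and its spread shrinks only slowly with n. This sketch is illustrative and unrelated to the talk's actual neural network and tree models.

```python
import math

def accuracy_posterior(correct, n, a=1, b=1):
    """Beta(a + correct, b + n - correct) posterior for classifier
    accuracy under a Beta(a, b) prior; returns posterior mean and sd."""
    a1, b1 = a + correct, b + n - correct
    mean = a1 / (a1 + b1)
    sd = math.sqrt(a1 * b1 / ((a1 + b1) ** 2 * (a1 + b1 + 1)))
    return mean, sd

small = accuracy_posterior(16, 20)      # 80% observed accuracy, n = 20
large = accuracy_posterior(800, 1000)   # same observed rate, n = 1000
```

With the same observed 80% accuracy, the n = 20 posterior standard deviation is several times larger than at n = 1000, which is why a single cross-validation estimate from a small sample can be badly overinterpreted.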

The previous seven workshops provided extended presentations and discussion on diverse topics.


Organized by:
Emery Brown Alicia Carriquiry Elena Erosheva
Constantine Gatsonis Robert Kass Herbie Lee
Isabella Verdinelli