Invited Session for Young Statisticians
(NOTE: This schedule includes 5 minutes of floor discussion for each talk.)
A Bayesian Hierarchical Model for Integrating
Heterogeneous Biological Information
|
Shane T. Jensen
|
A substantial focus of research in molecular biology is the
network of factors that controls the activity of different
genes in living cells. Genes are regulated by proteins called
transcription factors (TFs) that bind directly to DNA in close
proximity to certain target genes and alter the activity of
these genes in response to changing environmental conditions.
Previous statistical approaches for identifying gene regulatory
networks have used isolated information either from expression
microarray data or from chromatin immunoprecipitation (ChIP)
binding data. Although these approaches have proven useful, their
power is inherently limited by the fact that each data source
provides only partial information: expression data provides only
indirect evidence of regulation (through mRNA levels), whereas
ChIP binding data provides only physical binding information.
Recent efforts at integrating these data types have involved very
little systematic modeling, despite the fact that the Bayesian
paradigm provides a natural framework for combining
heterogeneous sources of information.
We present a Bayesian hierarchical model and Gibbs sampling
implementation that integrates gene expression data and ChIP
binding data in a principled fashion. The gene expression data
is modeled as a function of the unknown gene regulatory
network, which has an informed prior distribution based upon
the ChIP binding data. Unlike previous methods, our Bayesian
model does not require a priori clustering of the expression
data and allows each gene to belong to multiple regulatory
pathways. We applied our model to genome-wide ChIP binding
data and 500 expression experiments in yeast.
Several validation analyses show that our predicted gene-TF
relationships are more likely to be biologically relevant than those
predicted using either source of information alone. In addition, our model
incorporates the possibility of interactions between pairs of
transcription factors, which allowed us to identify 84 TF pairs
with synergistic effects on expression of target genes. Our
Bayesian framework also allows for the inclusion of additional
data sources, such as known or predicted transcription factor
binding sites, which will be beneficial in upcoming applications
to human and mouse studies where ChIP binding data is less
reliable. This work was done in collaboration with Guang Chen
in the Department of Bioengineering and Christian Stoeckert, Jr.
in the Department of Genetics at the University of Pennsylvania.
The application of our model to yeast has been submitted for
publication in BMC Bioinformatics.
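As a rough illustration of the kind of Gibbs update such an integration
involves, the Python sketch below combines a ChIP-informed prior on a binary
gene-TF regulation matrix with a Gaussian likelihood for the expression data.
It is not the authors' model: the dimensions, the fixed TF activity profiles,
the implicit unit regression coefficients, and the noise level are all
invented, and a full implementation would also sample the activities and
regression effects within the Gibbs scheme.

    import numpy as np

    rng = np.random.default_rng(0)

    # Toy dimensions (illustrative only)
    G, T, E = 50, 4, 30            # genes, transcription factors, experiments

    # ChIP-informed prior probability that TF t regulates gene g
    # (in practice derived from ChIP binding data; here drawn at random)
    prior_bind = rng.uniform(0.05, 0.5, size=(G, T))

    # TF activity profiles across experiments (treated as known in this sketch)
    activity = rng.normal(size=(T, E))

    # Simulate a "true" network and expression data from it
    true_C = (rng.random((G, T)) < prior_bind).astype(float)
    sigma = 0.5
    expr = true_C @ activity + rng.normal(scale=sigma, size=(G, E))

    def gibbs_sweep(C, expr, activity, prior_bind, sigma):
        """One Gibbs sweep over the gene-by-TF regulation indicators.

        Each indicator C[g, t] is resampled from its full conditional,
        combining the ChIP-informed prior with the Gaussian likelihood
        of gene g's expression profile.
        """
        G, T = C.shape
        for g in range(G):
            for t in range(T):
                loglik = np.empty(2)
                for value in (0, 1):
                    C[g, t] = value
                    resid = expr[g] - C[g] @ activity
                    loglik[value] = -0.5 * np.sum(resid ** 2) / sigma ** 2
                log_odds = (loglik[1] + np.log(prior_bind[g, t])
                            - loglik[0] - np.log1p(-prior_bind[g, t]))
                C[g, t] = float(rng.random() < 1.0 / (1.0 + np.exp(-log_odds)))
        return C

    # Short chain; posterior edge probabilities from post-burn-in draws
    C = (rng.random((G, T)) < prior_bind).astype(float)
    draws = []
    for it in range(200):
        C = gibbs_sweep(C, expr, activity, prior_bind, sigma)
        if it >= 100:
            draws.append(C.copy())
    post_edge_prob = np.mean(draws, axis=0)
    print("edges with posterior probability > 0.5:",
          int((post_edge_prob > 0.5).sum()), " true edges:", int(true_C.sum()))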
|
|
Bayesian State Space Models for Inferring Gene Interaction Networks
|
Yulan Liang, Xueya Cai, Arpad Kelemen
|
After the completion of the genome sequencing project, new
computational challenges arose in functional genomics, including
interaction network modeling, pathway discovery, and function
prediction. Time course gene expression experiments are often used in
microarray studies because knowing when and where a gene is expressed
can provide a strong clue to its biological role. Existing
mathematical and pharmaceutical models for temporal genomic data have
not incorporated stochastic processes or time delay information, and
they often treat the biological parameters as fixed values, modeling
them deterministically without accounting for the probability and
uncertainty involved in their estimation. Dynamic Bayesian clustering
approaches have been developed; however, their autoregressive models
require evenly spaced time points. These models also require
stationarity conditions, which are not present in microarray
experiments. Curve fitting methods (e.g., polynomials in time)
have also been applied to microarray data; they may produce
a good fit but do not facilitate prediction because of overfitting
problems. Bayesian decomposition methods have also been developed for
microarray data using matrix decomposition in a Bayesian setting; a
difficulty is the need to estimate the dimensionality. Dynamic
state space models have greater flexibility for modeling non-stationary
temporal microarray data; however, standard Kalman filter methods
rely on linear state transitions and Gaussian errors. In this
paper, we develop Bayesian state space models to tackle some of
these challenges and to infer and predict the transcription profiles
and interaction networks derived from microarray experiments. The
newly developed models have advantages in simplified estimation in
settings where non-standard distributions and non-linear models are
more realistic for genomic data. Various models with different prior
distributions and covariance matrices are investigated, and the
Deviance Information Criterion (DIC) is used for model selection and
identification. Computation of the marginal posterior distributions
and their respective moments is performed by Markov chain Monte Carlo
(MCMC) simulation with Gibbs sampling. The computed covariance and
correlation matrices are used to construct gene-gene and gene-time
interaction networks and pathways. The performance of the developed
models is evaluated on multi-tissue temporal Affymetrix data sets that
describe systemic temporal response cascades to therapeutic
doses. Results show that the developed models capture the dynamics of
gene expression well and have great potential for biological
applications.
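As a minimal sketch of the Gibbs sampling machinery mentioned above (a
deliberately simplified local-level model, not the authors' state space
models), the Python fragment below alternates between single-site updates of a
latent expression trajectory and inverse-gamma updates of the state and
observation variances; the data, priors, and hyperparameters are all simulated
or invented.

    import numpy as np

    rng = np.random.default_rng(1)

    # Simulate one gene's expression trajectory from a local-level model:
    #   state:       x_t = x_{t-1} + w_t,   w_t ~ N(0, q)
    #   observation: y_t = x_t     + v_t,   v_t ~ N(0, r)
    T, q_true, r_true = 40, 0.05, 0.2
    x = np.cumsum(rng.normal(scale=np.sqrt(q_true), size=T))
    y = x + rng.normal(scale=np.sqrt(r_true), size=T)

    a0, b0 = 2.0, 0.1              # vague inverse-gamma priors on q and r

    def sample_states(states, y, q, r):
        """Single-site Gibbs update of the latent trajectory."""
        n = len(y)
        for t in range(n):
            prec = 1.0 / r
            mean_num = y[t] / r
            if t > 0:                      # link to the previous state
                prec += 1.0 / q
                mean_num += states[t - 1] / q
            if t < n - 1:                  # link to the next state
                prec += 1.0 / q
                mean_num += states[t + 1] / q
            states[t] = rng.normal(mean_num / prec, np.sqrt(1.0 / prec))
        return states

    def sample_invgamma(shape, rate):
        return 1.0 / rng.gamma(shape, 1.0 / rate)

    states, q, r, keep = y.copy(), 1.0, 1.0, []
    for it in range(2000):
        states = sample_states(states, y, q, r)
        q = sample_invgamma(a0 + 0.5 * (T - 1),
                            b0 + 0.5 * np.sum(np.diff(states) ** 2))
        r = sample_invgamma(a0 + 0.5 * T,
                            b0 + 0.5 * np.sum((y - states) ** 2))
        if it >= 1000:                     # discard burn-in
            keep.append((q, r))
    q_post, r_post = np.mean(keep, axis=0)
    print(f"posterior means: q = {q_post:.3f} (true {q_true}), "
          f"r = {r_post:.3f} (true {r_true})")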
|
Bayesian Methods for Scalable Multi-Subject Value-Added Assessment
|
J.R. Lockwood, Louis T. Mariano
|
In the education research and policy communities, there is increased
interest in using so-called "value-added assessment" (VAA) for educator
accountability and decision making. VAA is a general term referring to a
class of statistical methods that use longitudinal student achievement
data to make inferences about the effects of educational entities, such as
schools and teachers, on student growth. The widespread appeal of VAA
derives from its purported ability to purge student achievement of the
effects of non-schooling inputs (such as socioeconomic status or parental
education), thus allowing fair comparisons of educators teaching diverse
populations of students.
Although practitioners are anxious to explore whether VAA methods can live
up to their promise, theoretical development of advanced VAA models has
outpaced practical implementation. The statistical models producing
value-added estimates of educator contributions to student learning
require complex accounting of the linkage of students to teachers and
schools over time as well as assumptions about how past educational
experiences impact current achievement. These complexities challenge
traditional likelihood estimation, particularly for large datasets, which
are becoming more common with the increased emphasis on standardized
testing and data-driven decision making.
In this talk, which summarizes the work by Lockwood et al. (2005;
conditionally accepted by the Journal of Educational and Behavioral
Statistics), we introduce a general multivariate, longitudinal model
for student outcomes and demonstrate how casting the model in the Bayesian
framework bridges the gap between theory and practice. The model
explicitly parameterizes the long-term effects of past teachers on student
outcomes in future years, a topic of considerable interest that has not
been examined empirically in the literature. We use the model to estimate
teacher effects and the persistence of those effects over time using
multi-subject student achievement data from a large urban school district.
We discuss the computational and inferential advantages of the Bayesian
approach to VAA, including its scalability to very large datasets, its
natural mechanism for dealing with missing test score data, its facility
with handling complex latent parametric structures, and its coherent
framework for communicating uncertainty about estimated effects.
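A schematic version of this kind of longitudinal model with teacher-effect
persistence (the authors' exact specification may differ) can be written, for
the score y_{it} of student i in year t with teacher j(i, s) in year s, as

    y_{it} = \mu_t + \sum_{s=1}^{t} \alpha_{ts}\,\theta_{j(i,s)} + \varepsilon_{it},
             \qquad \alpha_{tt} = 1,
    \theta_{j} \sim N(0, \tau^2) \quad \text{(teacher effects)},
    (\varepsilon_{i1}, \ldots, \varepsilon_{iT})^{\top} \sim N(0, \Sigma)
             \quad \text{(correlated within-student residuals)},

where the \alpha_{ts} for s < t are persistence parameters quantifying how
strongly the effect of the year-s teacher carries into year t; priors on
\mu_t, \alpha_{ts}, \tau^2, and \Sigma complete the Bayesian specification,
and the multi-subject setting adds a subject index to the scores, teacher
effects, and residual covariance.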
|
|
Salmonella in the egg production chain: a Bayesian farm-to-fork risk assessment
|
Jukka Ranta, Satu Lievonen and Riitta Maijala
|
Foodborne infections caused by Salmonella remain a source
of concern for the modern food industry as well as for authorities
worldwide. To study the effect of various risk control options,
attempts have been made to quantify the risks over the whole
production chain from farm to fork. The use of the Bayesian approach
in Quantitative Risk Assessment (QRA) in this field has been
largely limited to uncertainty distributions for separate input
parameters of mechanistic simulation models, in which Monte Carlo
forward simulation is widely used as a tool for propagation of
uncertainty. The Bayesian approach makes it possible to update the
prevalence estimates jointly over the production chain using several
sources of data along the chain. For example, the prevalence estimates
at a specific part of the chain are influenced not only by directly
related data but also by other prevalence data concerning earlier
and/or later parts of the production chain.
The Bayesian QRA of Salmonella in the Finnish egg production chain
consists of modules for breeder flocks, production flocks,
produced eggs, and consumption. The first two dynamic modules
provide estimates of flock-level prevalence, whereas the third
module deals with total and Salmonella-positive egg production.
Eventually, a predictive distribution will be used for estimating
the attributable number of human cases caused by Salmonella in
eggs. Each module describes a chain of events, while the amount and
quality of data can differ considerably between the modules.
The complexity of flock-specific life history data structures
becomes a challenge due to the sharply pyramidal structure of the
egg production chain: one breeder flock can be the origin of a
thousand production flocks. Thus, individual flock life histories with
latent infection states were modeled at the top of the pyramid,
whereas a stationary distribution model induced by a generic
transition probability matrix was used for the summary data on
production flocks. In addition to Bayesian inference concerning
the current true prevalence, the assessment of interventions calls
for a causal analysis of the effects along the production chain.
The combined modules were used to assess the effect of removing
detected positive flocks and the effect of alternative testing
strategies. A manuscript and a research report are in
preparation. Our previous Bayesian QRA modeling (on the broiler
production chain) has been published in the Journal of Risk
Analysis and the International Journal of Food Microbiology.
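To illustrate the stationary-distribution idea used for the production-flock
module (this is not the authors' actual model, data, or prior choices), the
short Python sketch below places Beta priors on the transition probabilities
of a two-state flock infection chain and propagates the posterior uncertainty
to the stationary flock-level prevalence; the transition counts are invented.

    import numpy as np

    rng = np.random.default_rng(2)

    # Invented summary counts of monthly flock state transitions:
    # state 0 = Salmonella-free, state 1 = infected.
    trans_counts = np.array([[940, 12],   # free -> free, free -> infected
                             [ 18, 30]])  # infected -> free, infected -> infected

    def posterior_stationary_prevalence(counts, n_draws=5000):
        """Posterior draws of the stationary probability of the infected state,
        using independent Beta(1, 1) priors on each row's exit probability."""
        draws = np.empty(n_draws)
        for k in range(n_draws):
            p01 = rng.beta(1 + counts[0, 1], 1 + counts[0, 0])  # free -> infected
            p10 = rng.beta(1 + counts[1, 0], 1 + counts[1, 1])  # infected -> free
            draws[k] = p01 / (p01 + p10)   # stationary prob. of the infected state
        return draws

    draws = posterior_stationary_prevalence(trans_counts)
    lo, hi = np.percentile(draws, [2.5, 97.5])
    print(f"posterior mean flock prevalence: {draws.mean():.3f} "
          f"(95% interval {lo:.3f}-{hi:.3f})")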
|
|
Bayesian Prediction of Colorectal Cancer Treatment Efficacy: Difficulties of Classification with Small Samples
|
Russell Steele
|
Pre-operative radiotherapy for rectal cancer improves the
odds of overall patient survival and reduces local recurrence rates
versus post-operative radiation or surgery alone. Additionally,
high-dose radiation treatment may result in significant tumor
downstaging, high rates of complete tumor regression and a reduction
in surgical complexity. The ability to predict prior to treatment
which tumors will be responsive to pre-operative radiotherapy would
have a significant impact on patient selection.
In this talk I will discuss several classification techniques (ranging
from logistic regression to classification trees and neural networks)
in the context of developing predictive protein biomarker models for
the efficacy of colorectal cancer treatment. Although most
classification and statistical learning approaches focus on problems
with large numbers of features and observations, we will examine what
types of inference can be drawn from applying sophisticated models to
small datasets (fewer than 100 observations with only 5 features
available for classification).
Limited sample sizes often suggest a possible gain from the use of
Bayesian methods, but choosing prior distributions and estimating
posterior distributions can be difficult (especially for models such
as classification trees and neural networks). We'll compare
frequentist and Bayesian results for
classification trees and neural networks and the implications for
model selection and prediction with small samples. In particular,
we'll show how frequentist approaches via cross-validation with small
samples can be far less informative than fully Bayesian model
selection.
Frequentist analyses of subsets of this dataset have already been
accepted for publication in the Journal of Clinical Cancer Research
and have been submitted to Cancer. Fully Bayesian model selection for
this problem via neural networks with tempered MCMC methods and bridge
sampling is the subject of a manuscript that will soon be submitted,
and a separate overview of the practical limitations of small-sample
classification will be submitted to the Journal of Classification for
possible inclusion in the proceedings of the CSNA/Interface meeting
held in St. Louis in June 2005.
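The small-sample point can be illustrated with a simple simulation (the sample
size, the five-feature design, and the data below are hypothetical, not the
study's biomarkers): leave-one-out cross-validation accuracy of a logistic
regression fitted to about 40 observations fluctuates widely across replicate
data sets, one reason cross-validation-based comparison can be uninformative
at these sample sizes.

    import numpy as np
    from sklearn.linear_model import LogisticRegression
    from sklearn.model_selection import LeaveOneOut, cross_val_score

    rng = np.random.default_rng(3)

    # Hypothetical setting: n < 100 patients, 5 biomarker features, binary response.
    n, p, signal = 40, 5, 0.8

    def simulate(n, p, signal):
        X = rng.normal(size=(n, p))
        logits = signal * X[:, 0] - 0.5 * signal * X[:, 1]   # 2 informative features
        y = rng.random(n) < 1.0 / (1.0 + np.exp(-logits))
        return X, y.astype(int)

    # Repeat the whole experiment to show how unstable leave-one-out
    # cross-validation accuracy is at this sample size.
    scores = []
    for rep in range(50):
        X, y = simulate(n, p, signal)
        clf = LogisticRegression(max_iter=1000)
        scores.append(cross_val_score(clf, X, y, cv=LeaveOneOut()).mean())
    scores = np.array(scores)
    print(f"LOO-CV accuracy over 50 replicates: mean {scores.mean():.2f}, "
          f"min {scores.min():.2f}, max {scores.max():.2f}")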
|
|
The previous seven Workshops provided extended
presentations and discussions on diverse topics.
|
|