The following abstracts have been accepted and will be part of the
poster session at the Bayesian Workshop.
Jean-Francois Angers and Atanu Biswas
Universite de Montreal and Indian Statistical Institute
Dep. de mathematiques et de statistique
C.P. 6128, succ "Centre-ville"
Montreal, Qc H3C 3J7
Applied Statistics Unit
Indian Statistical Institute
203 B.T. Road, Calcutta - 700 035
Sudipto Banerjee and Bradley P. Carlin
Div. of Biostatistics, School of Public Health,
University of Minnesota
A460 Mayo Building
Minneapolis, Minnesota 55455
Michael Baron
Department of Mathematical Sciences,
University of Texas at Dallas
Richardson, TX 75083-0688
Sam Behseta and Robert E. Kass
Carnegie Mellon University
Department of Statistics
Pittsburgh, PA 15213
Halima Bensmail
Department of Statistics
University of Tennessee
Knoxville, TN 37996-0532
Peter Bouman, Vanja Dukic, and Xiao-Li Meng
University of Chicago, Harvard University
Keywords: AIDS, interval censoring, delay distributions, Bayesian inference, MCMC and epidemic models.
Eric T. Bradlow and David Schmittlein
The Wharton School
Can Cai
Carnegie Mellon University
Department of Statistics
Pittsburgh, PA 15232
Angela Maria de Souza Bueno
Universidade Federal de Santa Catarina
Departamento de Biologia Celular
Embriologia e Genética, CCB Florianópolis
SC, Brazil. 88040-900
Fax: 55 048 331 9672.
Carlos Alberto de Bragança Pereira
Universidade de Sao Paulo
Departamento de Estatística, IME. Cx. Postal 66281
Sao Paulo, SP, Brazil. 05389-970
M. Nazareth Rabello-Gay
Instituto Butantan, Laboratório de Genética
Avenida Vital Brasil, 1500
Sao Paulo, SP, Brazil. 05503-900
Julio Michael Stern
Universidade de Sao Paulo
Departamento de Estatística, IME. Cx. Postal 66281
Sao Paulo, SP, Brazil. 05389-970
Key words: Cell proliferative indices; Micronucleated cells; Prior and posterior probabilities; Beta-(negative)Binomial distribution; Beta-Poisson distribution; Mixture of Beta distributions
Catherine A. Calder, David Higdon, Christopher Holloman
ISDS, Duke University
Box 90251
Durham, NC 27708
Yu-mei Chang, Daniel Gianola
Department of Animal Sciences
University of Wisconsin, Madison
Bjørg Heringstad and Gunnar Klemetsdal
Department of Animal Science
Agricultural University of Norway
Meng Chen, Mario Peruggia
Department of Statistics
The Ohio State University
and Trisha Van Zandt
Department of Psychology
Columbus, OH 43210
Because of the sequential nature of the experiment and the fact that several replications of similar trials were conducted on each subject, the assumption of i.i.d. response times (often encountered in the psychology literature) is untenable. We consider Bayesian hierarchical models in which the response times are described as conditionally i.i.d. Weibull random variables given the parameters of the Weibull distribution. The sequential dependencies, as well as the effects of response accuracy, word characteristics, and subject specific learning processes are incorporated via a linear regression model for the logarithm of the scale parameter of the Weibull distribution.
We compare the inferences from our analysis with those obtained by means of instruments that are commonly used in the cognitive psychology arena. In both cases, we pay close attention to the quality of the fit, the adequacy of the assumptions, and their impact on the inferential conclusions. Finally, we discuss briefly the extent to which our approach can be generalized to other types of human response data.
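As a rough illustration of the likelihood described above, the following sketch evaluates the log-likelihood of conditionally i.i.d. Weibull response times whose log scale follows a linear regression. The function name, the argument layout, and the use of a single common shape parameter are assumptions made for illustration; the priors and the hierarchical structure of the actual analysis are not shown.

```python
import math

def weibull_loglik(times, covariates, beta, shape):
    """Log-likelihood of response times that are conditionally i.i.d. Weibull.

    Assumed linear predictor: log(scale_i) = sum_k beta[k] * covariates[i][k].
    `shape` is a common Weibull shape parameter (an illustrative simplification).
    """
    total = 0.0
    for t, x in zip(times, covariates):
        log_scale = sum(b * xk for b, xk in zip(beta, x))
        z = t / math.exp(log_scale)  # response time relative to its scale
        # Weibull log-density: log(k) - log(scale) + (k - 1) * log(t / scale) - (t / scale)^k
        total += math.log(shape) - log_scale + (shape - 1.0) * math.log(z) - z ** shape
    return total
```

Covariates such as the previous trial's response time or accuracy would enter as extra columns of `covariates`, which is one simple way to encode sequential dependence in the scale.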
Erin M. Conlon
Department of Statistics, Harvard University
One Oxford Street
Cambridge, MA 02138
Ellen M. Wijsman, Ellen L. Goode, Michael Badzioch and Gail P. Jarvik
University of Washington
Seattle, Washington
Mark Gibbs, Janet L. Stanford, Suzanne Kolb and Elaine A. Ostrander
Fred Hutchinson Cancer Research Center
Seattle Washington
Marta Janer and Leroy Hood
Institute for Systems Biology
Seattle, Washington
Samantha Cook
Harvard University, Department of Statistics
One Oxford Street
Cambridge, MA 02138
Ciprian Crainiceanu, David Ruppert, Jery Stedinger, and Christopher Behr
Cornell University
Department of Statistical Science, 301 Malott Hall, Cornell University
Ithaca NY 14853
Approach: Observed count data serve as the basis of a Generalized Linear Mixed Model (GLMM) with a hierarchical structure that includes sites, regions and an overall national average. Possible covariates include site characteristics, such as the category of the water source and the population served, and time dependent covariates including sampling date, flow rate, and water turbidity. A fully Bayesian approach is used for modeling and subsequent risk analysis. Markov Chain Monte Carlo (MCMC) simulation is employed to compute the posterior distributions of the parameters. The Bayesian computations are carried out in WinBUGS, a flexible statistical software package.
Results: Results illustrate the steps involved in parameter estimation, model selection, and risk assessment. The replicates generated by the simulation are used to describe parameter uncertainty and the predictive distribution of Cryptosporidium concentrations in the subsequent analysis of the cost-effectiveness of alternative EPA information collection strategies and treatment rules. Different distributions are used to model random effects (gamma or lognormal); some choices (gamma) allow the time-site effects to be integrated analytically (Poisson-gamma yields a negative-binomial distribution), which can affect the efficiency of the computations. Research addresses MCMC simulation performance. Examples are used to show the impact of hierarchical centering of site and regional random effects, centering and orthogonalization of covariates and the information content of data sets on MCMC mixing properties.
Keywords: Bayesian analysis, waterborne pathogens, Generalized Linear Mixed Model
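The remark that a gamma random effect can be integrated out analytically can be checked numerically. The sketch below compares the closed-form negative-binomial probability with a brute-force integration of the Poisson probability over a gamma-distributed rate. The parameterization (gamma shape and mean) and the function names are assumptions chosen for illustration, not the notation of the study.

```python
import math

def neg_binomial_logpmf(y, shape, mean):
    """Closed form for y | lam ~ Poisson(lam), lam ~ Gamma(shape, rate=shape/mean)."""
    p = shape / (shape + mean)
    return (math.lgamma(y + shape) - math.lgamma(shape) - math.lgamma(y + 1)
            + shape * math.log(p) + y * math.log(1.0 - p))

def poisson_gamma_pmf_numeric(y, shape, mean, grid=20000, upper=200.0):
    """Midpoint-rule integral of Poisson(y | lam) * Gamma(lam | shape, shape/mean) over lam."""
    rate = shape / mean
    h = upper / grid
    total = 0.0
    for i in range(grid):
        lam = (i + 0.5) * h
        log_pois = y * math.log(lam) - lam - math.lgamma(y + 1)
        log_gam = (shape * math.log(rate) + (shape - 1.0) * math.log(lam)
                   - rate * lam - math.lgamma(shape))
        total += math.exp(log_pois + log_gam) * h
    return total
```

Using the closed form removes one latent variable per observation from the sampler, which is the kind of saving the abstract alludes to when it mentions computational efficiency.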
ESCET. Universidad Rey Juan Carlos
28933 Mostoles. Madrid. España
Michele DiPietro
Department of Statistics
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
Richard Evans and Helen Stein
Iowa State University and the Menninger Clinic
Ames IA
Marco Ferreira, Zhuoxin Bi, Mike West, Herbie Lee and David Higdon
Duke University
ISDS - Old Chemistry Bldg
Durham, NC 27708-0251
Christina L. Geyer
ISDS Duke University
Box 90251
Durham, NC 27708
BOSTON, MA 02215
Cong Han
Division of Biostatistics
University of Minnesota
Minneapolis, MN 55414
Kathryn Chaloner
School of Statistics
University of Minnesota
Minneapolis, MN 55414
Alan S. Perelson
Theoretical Biology and Biophysics
Los Alamos National Laboratory
Los Alamos, NM 87545
A Bayesian analysis of an HIV dynamic model using data from Perelson
et al. (1996, Science 271, 1582-1586) is
presented. The data are repeated measurements of plasma HIV RNA
concentrations of patients receiving protease inhibitor treatment. A
nonlinear mixed-effects model is introduced that explicitly models
variability between subjects. The prior distribution is based on the
scientific literature prior to 1996. Point and interval estimates are
reported for the rates of disappearance of viruses and virus-producing
cells. Issues of outliers and sensitivity to prior distribution are
also investigated.
Haran, Murali,
Carlin, Bradley P.,
Adgate, John L.,
Ramachandran, Gurumurthy,
Waller, Lance,
Gelfand, Alan E.
School of Statistics, University of
Division of Biostatistics and Division of Environmental and
Occupational Health, School of Public Health, University of
Statistics Department, Emory University,
Statistics Department, University of Connecticut
Jennifer Hill
Columbia University
School of Social Work
622 W. 113th St.
New York, NY 10027
Christopher Holloman, Dave Higdon, and Herbert Lee
Duke University
Box 90251
Durham, NC 27708-0251
Gabriel Huerta, Bruno Sanso and Jonathan R. Stroud
CIMAT, Universidad Simon Bolivar and University of Chicago
Apartado 402, Guanajuato, Gto. 36000, México
Apartado 890000, Caracas 1080-A Venezuela
5734 S. University Ave., Chicago, IL 60637, U.S.A.
Key Words: Ozone time series; Spatio-temporal models; Bayesian modeling; Dynamic Linear models; Smoothed means; Predictive values
Telba Z. Irony, Gene Pennello, and Greg Campbell
In this presentation, we will report what has happened in the last four years, discuss the Bayesian techniques that have been successfully used, and consider the perspectives for the future. We will also discuss the advantages, difficulties, and appropriateness of the Bayesian approach in medical device clinical trials.
Lurdes Y.T. Inoue, Peter F. Thall, Donald A. Berry
Department of Biostatistics, MD Anderson Cancer Center, The University of Texas
1515 Holcombe Boulevard
Houston, TX 77030
Shane T. Jensen
Department of Statistics, Harvard University
One Oxford Street
Cambridge, MA 02138
Beatrix Jones
Department of Statistics, Penn State University
311 Thomas Building
University Park PA 16801
Marc Kennedy, David Higdon
National Institute of Statistical Sciences
PO Box 14006, Research Triangle Park, Durham, NC 27713
Two predictions are presented: the calibrated code prediction and the bias-corrected prediction of reality. Each accounts for all relevant uncertainty, including the uncertainty that remains in t after calibration. Posterior distributions of the tuning parameter and bias functions are also used to answer model validation questions.
A seedling's survival probability is the chance that it survives
from one year to the next. Seedlings that have survived through at
least one winter can be identified by the presence of bud scale
scars. We denote seedlings as either Old (having bud scale scars)
or New (not having bud scale scars). Old and New seedlings have
different survival probabilities but roughly speaking, the survival
probability of Old seedlings is not further explained by age.
Therefore we adopt a model with two parameters of interest ---
p_Old and p_New --- survival probabilities for
Old and New seedlings respectively.
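To make the role of the two parameters concrete, here is a minimal sketch of a likelihood for unflagged data. It assumes that the survivors counted in the next census are the sum of two independent binomial counts, one from last year's Old seedlings and one from last year's New seedlings, and that the two groups cannot be told apart among the survivors. This convolution form and the function names are illustrative assumptions, not a statement of the authors' full model.

```python
from math import comb

def binom_pmf(k, n, p):
    """Binomial probability of k successes in n trials (0 outside the support)."""
    if k < 0 or k > n:
        return 0.0
    return comb(n, k) * p ** k * (1.0 - p) ** (n - k)

def unflagged_survivor_pmf(s, n_old, n_new, p_old, p_new):
    """P(total survivors = s) when Old and New survivors are not distinguished.

    Sums over the unknown number k of survivors that came from the Old group.
    """
    return sum(binom_pmf(k, n_old, p_old) * binom_pmf(s - k, n_new, p_new)
               for k in range(n_old + 1))
```

For example, `unflagged_survivor_pmf(2, 3, 2, 0.8, 0.4)` gives the probability of seeing two survivors from a quadrat that held three Old and two New seedlings the year before.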
The methods are illustrated on red maple (Acer rubrum) data
from the Long Term Ecological Research (LTER) site at Coweeta, in
the Appalachian Mountains of North Carolina. The data come from a
collection of plots, each containing a collection of 1m^2 quadrats
which are the units of analysis. Quadrats within a plot differ
slightly in seedling survival rates. Survival rates are also
affected by factors such as altitude, the presence of light gaps and
the presence of rhododendron cover. Our model contains fixed
effects altitude, gaps and rhododendron and random effects for
quadrat and year.
The data are subject to errors of various sorts. For red maple, one
of the most important is that red maple seedlings sometimes emerge
from the ground late in the Fall, after the census has been taken.
Thus Old seedlings in year j can be recorded even though there
were no New or Old seedlings in year j-1. Late emergence means
that N_i,j, the true number of New seedlings in quadrat i in
year j is unknown. To accommodate the uncertainty in N_i,j we
adopt a Poisson arrival rate model for New seedlings in which the
rate is subject to fixed and random effects.
In the course of illustrating the method we evaluate sensitivity to
the prior and to some modelling choices. We also quantify the
information gained by flagging and find that not flagging is often
the more attractive alternative. In the past it was not known how
to extract useful mortality information from unflagged data. Now
that a method has been illustrated, and because unflagged data is
much easier to collect, we may begin to see data sets of greater
spatial and temporal coverage that will increase our understanding
of seedling survival and ultimately our understanding of the
survival and spread of past, present and future populations of
Much of the work has already been completed, is available on the web
as ISDS Discussion Paper #99-33 and will be published in
JABES. Other parts are in Brian Beckage's completed PhD thesis
in Botany and will be published. The major work to be completed
before September is the inclusion of flagged and unflagged data and
multiple plots all in the same analysis.
In this article we measure the transmission of shocks by
cross-market correlation coefficients following Forbes and
Rigobon's (2000) notion of shift-contagion. Our main contribution
relies upon the use of traditional factor model techniques combined with
stochastic volatility models to study the dependence among
Latin American stock price indexes and the North American index. More
specifically, we concentrate on situations where the factor variances
are modeled by a multivariate stochastic volatility structure.
From a theoretical perspective, we improve currently
available methodology by allowing the factor loadings, in the factor
model structure, to have a time-varying structure and to capture
changes in the series' weights over time. By doing this, we believe
that changes and interventions experienced by those five countries
are well accommodated by our models, which learn and adapt
reasonably fast to those economic and idiosyncratic shocks.
We empirically show that the time varying covariance
structure can be modeled by one or two common factors
and that some sort of contagion is present in most of
the series' covariances during periods of economic
instability, or crisis. Open issues on real time implementation
and natural model comparisons are thoroughly discussed.
Data is available concerning the turning proportions in the actual
neighborhood, as well as counts as to vehicular input into the
system, and internal system counts, during a day in May, 2000. Some
of the data is accurate (video recordings), but some is inaccurate
(observer counts of vehicles). The first goal is to incorporate both
types of data so as to derive the posterior distribution of turning
probabilities and of the parameters of the CORSIM input distribution.
The vehicles passing through an intersection are modeled with a
product multinomial distribution, with turning probabilities
specific to each intersection. The accurate data is introduced as
restrictions to the model, reducing the actual number of latent
variables. We perform an MCMC analysis to learn about the turning
probabilities at every intersection, latent counts at different
locations, bias parameters for the observers, interarrival rates, and so on, adding
up to about 200 parameters in the network.
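For a single intersection with accurately observed counts, the multinomial model for turning movements has a simple conjugate form. The sketch below assumes a Dirichlet prior on the turning probabilities (left, through, right) and ignores the observer bias and latent counts that the full analysis handles with MCMC. The prior values and function names are illustrative assumptions.

```python
import random

def posterior_mean(counts, prior=(1.0, 1.0, 1.0)):
    """Posterior mean of turning probabilities under a Dirichlet(prior) prior."""
    alpha = [a + c for a, c in zip(prior, counts)]
    total = sum(alpha)
    return [a / total for a in alpha]

def sample_turning_probs(counts, prior=(1.0, 1.0, 1.0), rng=random):
    """One draw from Dirichlet(prior + counts), built from normalized gamma variates."""
    draws = [rng.gammavariate(a + c, 1.0) for a, c in zip(prior, counts)]
    total = sum(draws)
    return [d / total for d in draws]
```

Repeated calls to `sample_turning_probs` give posterior draws that could be fed to a simulator as inputs, which is the kind of uncertainty propagation the abstract goes on to describe.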
The posterior distribution on model inputs will then be used to study
sensitivity of the computer model predictions.
Studying the uncertainty in model
predictions is complicated by the fact that the CORSIM model
operates close to feasibility constraints, and these constraints
must be built into the uncertainty propagation through the model.
describe the development of statistical tools for carrying out full
Bayesian meta-analyses across studies with varying degrees of
exchangeability from study to study.
We analyze data from three studies carried out by the Cancer and
Leukemia Group B (CALGB): studies 8881, 9160, & 8541.
Using the data from the two earlier phase studies (8881 & 9160),
in which relatively frequent blood count monitoring took place,
we make more precise our inference in a large phase III study
in adjuvant breast cancer.
To achieve the desired borrowing strength across studies we use
hierarchical models with sub-models for each study. For each study
we define a population PK/PD model, i.e., a hierarchical model
to allow inference about PK/PD data across patients.
As part of these models we use flexible non-parametric random effects
distributions for patient specific random effects to accommodate
heterogeneity of the patient population and outliers.
The random effects distributions in
each of the studies are different, but it would be unreasonable to
assume them a priori independent. Thus we need a model for
dependent random probability measures, i.e., we require
dependent non-parametric Bayesian models.
We use a class of models based on the dependent Dirichlet process (DDP)
proposed in MacEachern (2001).
We used hierarchical nonlinear regression techniques with the
student's grade from each rater in each category as the
outcome. Associated with each rating is a "perceived" latent variable,
which is assumed to be centered at the "true" latent ability variable for
that individual in that category. These "true" ability variables are
"perceived" with a different error variance for each rater in each
category. We assumed that the "true" ability variables have a
multivariate normal distribution whose covariance matrix has the form of a
correlation matrix, so that the "true" ability variables are
Normal(0,1) a priori.
MCMC was necessary in order to work with the posterior distributions,
which are quite unwieldy. We used a method explained by Barnard,
McCulloch and Meng (1997) to sample from the distribution of the
covariance matrix. Our main interest was in the differences in the rater
variances, and in examining the posterior distribution of the covariance
matrix mentioned above. We conclude by showing some results.
To investigate this issue, we treat the locations of infected cells as
a realization from an inhomogeneous Poisson process and parameterize
the intensity of this process as a linear combination of Gaussian
densities. We then use the Metropolis algorithm to generate samples
from the posterior distribution of the intensity. With these sampled
intensities, we can compute how far infected cells spread out from
centers of clusters of infected cells, and since the lifespan of
infected cells is well known, we can assess the validity of our model
of viral spread by comparing the distances at which we witness
clustering to what we expect under our model. Although Ripley's $K$
function is widely used as a method for examining the clustering of
point processes, we show how this function can lead one astray for
this application, so we are forced to develop novel descriptive
statistics to investigate clustering. Moreover, with our technique, we
can assess if the clustering is likely to have resulted from a
homogeneous Poisson process, and our method easily allows for model
diagnostics. We find that the simple model of viral spread is
consistent with the data; in fact, at the time of the statistical
analysis, the number of days from infection to sample collection was
withheld, but the method was able to uncover this time lag based only
on the model and the data.
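The modelling idea above can be sketched in a few lines: an intensity built as a weighted sum of two-dimensional Gaussian densities, the corresponding Poisson process log-likelihood, and one random-walk Metropolis update of the weights. Several choices here are assumptions for illustration only: the kernel centers and common bandwidth are held fixed, the prior on the weights is flat on the positive reals, and the observation window is taken to be large enough that the intensity integrates to the sum of the weights.

```python
import math
import random

def intensity(point, centers, weights, sigma):
    """Weighted sum of isotropic 2-D Gaussian densities evaluated at `point`."""
    x, y = point
    total = 0.0
    for (cx, cy), w in zip(centers, weights):
        d2 = (x - cx) ** 2 + (y - cy) ** 2
        total += w * math.exp(-d2 / (2.0 * sigma ** 2)) / (2.0 * math.pi * sigma ** 2)
    return total

def log_likelihood(points, centers, weights, sigma):
    """Inhomogeneous Poisson log-likelihood: sum of log intensities minus integrated intensity.

    Assumes the window holds essentially all kernel mass, so the integral is sum(weights).
    """
    return sum(math.log(intensity(p, centers, weights, sigma)) for p in points) - sum(weights)

def metropolis_step(points, centers, weights, sigma, step=0.5, rng=random):
    """One random-walk Metropolis update of the weights under a flat prior on weights > 0."""
    proposal = [w + rng.gauss(0.0, step) for w in weights]
    if min(proposal) <= 0.0:
        return weights  # proposal falls outside the support, so it is rejected
    log_ratio = (log_likelihood(points, centers, proposal, sigma)
                 - log_likelihood(points, centers, weights, sigma))
    if rng.random() < math.exp(min(0.0, log_ratio)):
        return proposal
    return weights
```

Iterating `metropolis_step` and storing the weights yields sampled intensities, from which summaries such as the spread of points around cluster centers can be computed.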
The study, analysis, and prediction of consumer preferences play a
large role in business today. Large parts of the field of data mining
are devoted to the analysis of questions like:
(i) What characterizes consumer preferences (i.e., how can
we profile consumers)?
(ii) What characterizes the relationship between different
consumer preferences?
(iii) Which statistical models are appropriate for analyzing
consumer preferences, and how can such 'appropriateness' be measured?
(iv) What role (if any) do consumer
reliability and the prior preferences of consumers
play in such an analysis?
In our paper we address these questions by focusing on the
characterization and prediction of brand label preferences. This is
achieved by formulating very general latent variable models
together with a methodology for choosing the right ones. These latent
variable models generalize other treatments of the subject by allowing
consideration of consumer reliability and
prior consumer preferences.
Jacob Laading, Tore Anders Husebo and Thor Aage Dragsten
Den norske Bank
Stranden 21
Oslo, Norway
After some experience with a credit risk (probability of default)
model, it was found that this model was wrongly calibrated. The model
is built in two segments (financial and non-financial), and while the
two separate models rank-ordered well, the absolute level of a model
combining the two was giving a clearly skewed estimate of the
portfolio risk. This work describes how short-term default data and
expert opinions were used in reestimating the model. Special emphasis
is put on the modeling process in the organization, where the
graphical model approach and the elicitation of expert opinions were
used for tutorial and trust-building purposes within the business
Statistical Modelling of Seedling Mortality
Michael Lavine, Brian Beckage and James Clark
Duke University
Box 90251
Durham NC 27708
Seedling mortality in tree populations limits population growth
rates and controls the diversity of forests. To learn about
seedling mortality, ecologists use repeated censuses of forest
quadrats to determine the number of tree seedlings that have
survived from the previous census and to find new ones. Typically,
newly found seedlings are marked with flags. But flagging is labor
intensive and limits the spatial and temporal coverage of such
studies. The alternative of not flagging has the advantage of ease
but suffers from two main disadvantages. It complicates the
analysis and loses information. The contributions of this paper are
(i) to introduce a method for using unflagged census data to learn
about seedling mortality and (ii) to quantify the information loss
so ecologists can make informed decisions about whether to flag.
Based on results presented here, we believe that not flagging is
often the preferred alternative. The labor saved by not flagging
can be used to better advantage in extending the coverage of the
A Flexible Convolution Approach To Modelling Spatial
Processes In Porous Media
Herbert Lee, Dave Higdon
Duke University
ISDS, Box 90251
Durham, NC 27708
In situ cleanup of contaminated soil requires knowledge of the soil
permeability, a spatial process. Here we take a Bayesian approach to
allow straightforward estimation of uncertainty, and we demonstrate our
methodology with data from an actual flow experiment. A spatial
Gaussian Process can be represented as the convolution of a continuous
white noise process and a smoothing kernel, where the choice of kernel
relates to the covariogram of the process. In practice, a coarse
discrete approximation to the white noise process gives an efficient
and accurate method for generating realizations from the Gaussian
process. We expand upon this model by allowing the underlying process
to be other than white noise. For example, a Markov random field can
be convolved with a smoothing kernel to produce a new spatial process.
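The discrete approximation mentioned above can be illustrated in one dimension: latent values on a coarse lattice of knots are smoothed by a kernel to give the process at any site. The one-dimensional setting, the Gaussian kernel, the lattice spacing, and all names below are assumptions chosen to keep the sketch short.

```python
import math
import random

def gaussian_kernel(distance, bandwidth):
    """Unnormalized Gaussian smoothing kernel."""
    return math.exp(-0.5 * (distance / bandwidth) ** 2)

def process_convolution(sites, knots, latent, bandwidth):
    """z(s) = sum_j k(s - u_j) * x_j, a discrete stand-in for kernel-smoothed white noise."""
    return [sum(gaussian_kernel(s - u, bandwidth) * x for u, x in zip(knots, latent))
            for s in sites]

rng = random.Random(0)
knots = [float(j) for j in range(11)]          # coarse lattice of knot locations u_j
latent = [rng.gauss(0.0, 1.0) for _ in knots]  # i.i.d. normal values stand in for white noise
sites = [0.1 * i for i in range(101)]          # fine grid where the process is evaluated
z = process_convolution(sites, knots, latent, bandwidth=1.0)
```

Replacing the i.i.d. `latent` values with a draw from a dependent process on the knots (for example a random walk, used here only as a simple stand-in for a Markov random field) gives the kind of extension the abstract describes, with no change to `process_convolution`.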
Evaluating the Impact of Environmental variables on Benthic Microinvertebrate Community via Bayesian Model Averaging
Ilya A. Lipkovich and Eric P. Smith
Statistics Department Virginia Polytechnic Institute and State
406-A Hutcheson Hall
Blacksburg, VA 24061-0439 USA
Variable selection is one of the most important and controversial
issues in modern data analysis. In the study of relationships between
biological communities and environmental conditions, variable
selection is especially important as it guides decisions about
environmental management. Using a case study from Eastern Corn Belt
Plains Ecoregion (Norton, 1999) we use Bayesian Model Averaging (BMA)
to select interesting subsets of environmental variables (such as
metal composition, silt level, etc), that can impact the abundance of
benthic microinvertebrate taxa. We implement BMA for a multivariate
technique called Canonical Correspondence Analysis (CCA) and use its
results to represent sites, species and selected environmental
variables on a single ordination diagram (triplot) along with error
bars representing uncertainty due to both sampling variability and
model selection. BMA output can also be used to construct prediction
areas for new observations, which allows the researcher to
evaluate the limits of impact due to possible changes in benthic
ecosystem variables. BMA provides data analysts with an efficient tool
for discovering promising models and obtaining estimates of their
posterior probabilities via Markov chain Monte Carlo (MCMC). These
probabilities are further used as weights for model averaged
predictions and estimates of the parameters of interest. As a result,
variance components due to model selection can be estimated and
accounted for, contrary to the practice of conventional data
analysis. In our study we adopt an approach to BMA called Model
Composition MCMC (MC^3, Madigan and Raftery, 1994) and we
implement BMA methodology by treating CCA within a general framework
of reduced rank regression for which we develop a Bayes Information
Criterion (BIC) approximation to posterior model probabilities in the
spirit of MC^3. In addition to applying BMA to the case study, we
developed a general purpose Visual Basic macro that allows the user
to easily perform BMA with any data set of similar structure, and
produce various useful outputs for both full and reduced rank
multivariate regression, such as individual model weights, variable
activation probabilities, estimation of model selection and biplot and
triplot diagrams with error bars representing model selection
uncertainty associated with projections of individual sites and taxa.
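The BIC approximation to posterior model probabilities can be sketched as follows for a small set of candidate models that has already been enumerated. This is not the MC^3 sampler itself, which explores the model space by MCMC; it only shows how BIC values turn into model weights and variable activation probabilities, assuming equal prior probabilities for all models. Function names are illustrative.

```python
import math

def bic_model_weights(bics):
    """Approximate posterior model probabilities proportional to exp(-BIC / 2).

    Assumes equal prior model probabilities; subtracting the best BIC avoids underflow.
    """
    best = min(bics)
    raw = [math.exp(-0.5 * (b - best)) for b in bics]
    total = sum(raw)
    return [r / total for r in raw]

def activation_probabilities(models, weights, n_vars):
    """Sum of model weights over the models that include each variable.

    `models` is a list of sets of variable indices, one set per candidate model.
    """
    return [sum(w for m, w in zip(models, weights) if j in m) for j in range(n_vars)]
```

Model-averaged predictions are then weighted sums of the per-model predictions using the same weights.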
Hierarchical Bayesian Methods for Estimating
Joint Contaminant Occurrence in Community Water Systems
J.R. Lockwood, Mark Schervish, Patrick Gurian, and Mitchell Small
The RAND Corporation (J.R. Lockwood), University of Texas at El Paso (Patrick Gurian)
and Carnegie Mellon University (Mark Schervish and Mitchell Small)
The 1996 amendments to the U.S. Safe Drinking Water Act mandate
revision of current maximum contaminant levels (MCLs) for various
harmful substances in community drinking water supplies. The choice
of an MCL for a given contaminant must balance the potential costs
and benefits of lowering exposure, which requires detailed
information about the occurrence of the contaminant and the costs
and efficiencies of the available treatment technologies. Although
community water systems must comply concurrently with the MCLs for
over 80 regulated substances, regulations generally are set one
contaminant at a time. The failure to consider the joint behaviors
of multiple contaminants during the regulatory process can lead to
mischaracterization of the actual costs and benefits. In order to
estimate more effectively the true costs and benefits of
simultaneous compliance with standards for several contaminants, the
U.S. Environmental Protection Agency is attempting to expand
existing regulatory evaluation methods to account for multiple
contaminants. Such technology requires not only the joint
consideration of treatment options, but also the joint occurrence
distributions of the contaminants. Our work focuses on the latter
topic, extending existing methods for modeling the distributions of
a single contaminant in community water system source waters to the
simultaneous consideration of multiple contaminants. We consider
alternatives for addressing the implementation difficulties inherent
in the multivariate setting, providing solutions of general
methodological interest. Through case studies involving arsenic,
sulfate, magnesium and calcium, we show how jointly modeling
contaminants provides better fit and predictive power than marginal
models, emphasizing how inferences about important regulatory
quantities can be improved through joint modeling. Our methods make
significant progress in redressing several shortcomings of existing
Hidden Markov Model Approach to Local and Global Protein or DNA Sequence Alignments (Pairwise and Multiple)
Tanya Logvinenko
Stanford University
Global and local sequence alignments are tools widely used in
biomedical research. But despite their long history, there are a number of
shortcomings in existing methods. Dynamic programming methods yield a
single optimal alignment which is highly dependent on the scoring
matrix and gap penalties used. We will describe Bayesian algorithms
for local and global pairwise sequence alignments (using Hidden Markov
Models) which will produce representative samples of alignments and
give the posterior distribution of all the alignments considering the set
of different parameters used. To show the potential of these methods,
we apply them to identify regions of sequences conserved to different
degrees. We will present an extension of the algorithms to aligning
multiple sequences.
Comovements and Contagion in Emergent Markets: Stock Indexes Volatilities
Hedibert F. Lopes and Helio S. Migon
Federal University of Rio de Janeiro
Caixa Postal 68530
21945-970, Rio de Janeiro - BRAZIL
The past decade has witnessed a series of (well accepted and defined)
financial crisis periods in the world economy. Most of these events are
country specific and eventually spread across neighboring countries,
with the concept of vicinity extrapolating the geographic maps and
entering the contagion maps. Unfortunately, what contagion
represents and how to measure it are still unanswered questions.
The Hierarchical Rater Model: Accounting for
Information Accumulation and Rater Behavior in Constructed Response
Student Assessments
Louis T. Mariano
Carnegie Mellon University
Department of Statistics
Pittsburgh, PA 15213
Open-ended (i.e. constructed response) test items have become a stock
component of standardized educational tests. Responses to open-ended
items are usually evaluated by human "raters", often with multiple
raters judging each response. In this paper we contrast the FACETS
model (Linacre, 1989), a mixed-effects multivariate logistic regression
model that has been a popular tool for modeling data from rated test
items, with a fully hierarchical Bayes model for rating data (the
hierarchical rater model, HRM, of Patz, Junker, Johnson and Mariano,
2000). The HRM makes more realistic assumptions about the dependence between
multiple ratings of the same student work, and thus provides a more
realistic view of the uncertainty of inferences on parameters and latent
variables from rated test items. A rigorous treatment of the approach
to dependence and uncertainty in each model is presented, followed by
two new applications of the HRM. The first application uses simulated
data to explore the accumulation of information under the HRM, under
various scenarios of rater performance (especially poor performance).
The second application shows how the HRM can be used to make inferences
about examinees, test items and raters, in a statewide mathematics exam
given in the State of Florida. In particular we explore the effect of
modality---the design for distributing items among raters---on the
severity and consistency of individual raters' performance.
Assessing and Propagating Uncertainty in Model Inputs in
Computer Traffic Simulators (CORSIM)
Molina, German
Institute of Statistics and Decision Sciences, Duke University
Durham, NC 27708-0251, USA
Bayarri, Susie
Dept of Statistics and Operations Research, Universitat de Valencia
Valencia, 46100, SPAIN
Berger, James
Institute of Statistics and Decision Sciences, Duke University
Durham, NC 27708-0251, USA
CORSIM is a large simulator for vehicular traffic, and is being
studied with regard to its ability to successfully model and predict behavior of
traffic in a 36 block section of Chicago. Inputs to the simulator
include information about street configuration, driver behavior,
traffic light timing, turning probabilities at each corner and
distributions of traffic ingress into the system.
Multiscale Relationships Between Coarse Woody Debris and Presence/Absence of Western Hemlock in the Oregon Coast Range
Vicente J. Monleon
PNW Research Station
USDA Forest Service
1221 SW Yamhill
Portland, OR 97205
Alix I. Gitelman
Department of Statistics
Oregon State University
44 Kidder Hall
Corvallis, OR 97331
Andrew Gray
PNW Research Station
USDA Forest Service
1221 SW Yamhill
Portland, OR 97205
This study examines the relationship between the abundance of coarse
woody debris (CWD) and the establishment of western hemlock (Tsuga
heterophylla) at two different scales: microsite-level and stand-level
within the Oregon Coast Range. Western hemlock is a key structural
component of old-growth forests in the Pacific Northwest, typically
providing a multilayered canopy and contributing to the diversity of
tree ages. Forest managers are looking for ways to promote the
establishment of hemlock in the hope of accelerating the development
of old growth characteristics. Most ecological processes operate at
several scales. The establishment and survival of hemlock depends upon
finding suitable sites at the microsite-level ('safe sites'), which we
hypothesize to be characterized by a greater amount of CWD than the
rest of the stand. However, the total amount of CWD in the stand may
in turn determine the abundance of safe sites, and the lack of CWD may
result in hemlock growing in less desirable sites. We use a
hierarchical model to determine the relationship between the amount of
CWD and hemlock establishment at the microsite-level, and whether this
relationship itself depends upon the overall amount of CWD available
in the stand.
In each of 15 mature, unmanaged forest stands in the Oregon Coast
Range, points without hemlock saplings and points with hemlock
saplings were randomly selected. Each of these points represents the
microsite-level of the study. Around each sampled point, the area
covered by CWD was measured. In addition, a measurement of CWD for the
entire stand was obtained for each of the 15 stands. To understand the
relationship between CWD and hemlock presence/absence, we fit a series
of hierarchical logistic regression models that account for CWD at the
microsite-level alone and at both the microsite- and stand-levels. The
slope term of these regression models measures the relationship
between the odds of hemlock sapling presence versus hemlock sapling
absence, and the amount of CWD at the corresponding level or levels in
the hierarchy.
There is significant association between the amount of CWD and hemlock
establishment at the microsite-level, but this relationship does not
seem to depend on the total amount of CWD available in the stand. On
average, for each $0.1 m^2$ CWD per $m^2$ area increase in the amount of CWD,
the odds of finding a hemlock sapling are estimated to increase
2.45-fold (95\% posterior interval ranges from 1.47 to 3.96). This
relationship varies across the stands, from a low of an estimated
1.46-fold increase to a high of an estimated 5.37-fold increase. These
results suggest that CWD can be used to help predict hemlock
presence/absence, and that management practices that increase the
amount of CWD in forest stands should be considered as potentially
beneficial to hemlock establishment.
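As a rough illustration of how the fold increases quoted above relate to
the slope of a logistic regression, the sketch below converts an odds
multiplier per 0.1 m^2 of CWD per m^2 of area into a slope on the
log-odds scale and back. The function names, and the assumption that CWD
enters the model linearly on the log-odds scale, are ours for
illustration and are not taken from the authors' model.

```python
import math

STEP = 0.1  # CWD increase (m^2 CWD per m^2 area) quoted in the abstract


def slope_from_fold(fold, step=STEP):
    """Log-odds slope per unit CWD implied by a fold change in odds per `step`."""
    return math.log(fold) / step


def fold_from_slope(slope, delta):
    """Multiplicative change in the odds for an increase of `delta` in CWD."""
    return math.exp(slope * delta)


# Posterior point estimate and interval endpoints reported in the abstract.
beta = slope_from_fold(2.45)
beta_low, beta_high = slope_from_fold(1.47), slope_from_fold(3.96)

# Under the linearity assumption, doubling the CWD increase squares the fold change.
fold_double_step = fold_from_slope(beta, 2 * STEP)
```

Under this assumption, fold changes multiply as CWD increases add, which
is why the slope is the natural quantity to compare across stands.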
Borrowing Strength: Incorporating Information from Early
Phase Cancer Clinical Studies into the Analysis of Large,
Phase III Cancer Clinical Trials
Peter Mueller
Gary L. Rosner
The University of Texas, M.D. Anderson Cancer Center
Maria de Iorio
Duke University
During drug development, clinical studies progress in
stages. Patients treated in early studies are necessarily monitored
more closely for obvious safety reasons. Aside from recording safety
data, clinical investigators also often collect information on the
pharmacokinetics of the agents under study. In later phase studies,
especially large randomized phase III studies, there is usually less
close monitoring of patients, either because of the difficult
logistics or cost or because enough is known about the safety of the
drug or drug combinations under study. Thus, early phase studies
typically collect more data per patient but treat relatively few
patients, compared to large randomized phase III studies.
Methods for combining the fuller data collected on patients
enrolled in earlier phase studies with sparse data collected as part
of a phase III study help us learn more about PK and PD variability in
the population.
Bayesian Analysis of Essay Grading
Stephen Ponisciak, Valen Johnson
ISDS, Duke University
Box 90251, Durham, NC 27708
An interesting problem in educational research is the rating of essays
by multiple raters, because each rater will tend to have a different
opinion regarding the characteristics of a good essay. Our dataset
consists of ratings assigned to essays written by 1200 subjects. Each
essay received six ratings from each of six raters - one global rating and
five sub-ratings. Each essay was rated by each rater in each category,
so the data are fully observed. Our analysis employs hierarchical
statistical methods with random effects, as described in Ordinal Data
Modeling (Johnson, V.E., and Albert, J.H., 1999) and ``On Bayesian
Analysis of Multirater Ordinal Data: An Application to Automated Essay
Grading,'' (Journal of the American Statistical Association,
Johnson, V.E., 1996).
Multivariate Mixture Models: A Tool For Analyzing Gene Expression Data
Surajit Ray, Bruce Lindsay
Pennsylvania State University
325 Thomas Building
University Park, Pa-16801
``Understanding the Human Genome, shifts our medical attention from treating
mere symptoms and alleviating pain, to discovering and isolating the root
cause of certain diseases''. This was what Melissa Reyes, an 11th
grader from Florida, had to say in response to the question ``How is the
sequencing of human genome relevant to you?''. DNA arrays have recently
emerged as a powerful new experimental technique for large-scale analysis of
gene expression and function, which are not yet understood at the molecular
level. The Stanford yeast cell cycle data has been analyzed by scientists
using hierarchical and model-based algorithms to estimate the
number of clusters. In the mixture modeling literature, determination of the
number of components is a classic convex optimization problem. The gradient
check in the Nonparametric Maximum Likelihood Estimate (NPMLE) routines has provided an elegant tool for determining the
number of components in the univariate case. In our recent project, we
generalize the idea of NPMLE to Multivariate NPMLE and extract the number of clusters
in the high-dimensional expression data scenario. Assessment of the fitted
model is also investigated through AIC, BIC and kernel-based quadratic
Estimation of Fetal Growth and Gestation
in Bowhead Whales
C. Shane Reese, James A. Calvin, John C. George, and Raymond J. Tarpley
Los Alamos National Laboratory, Texas A\&M University, North Slope Borough, Texas A\&M University
MS F600
Los Alamos, NM 87545
We consider the problem of estimating fetal growth and gestation for
bowhead whales, Balaena mysticetus, of the Bering, Chukchi, and Beaufort
Seas (BCBS) stock. This western Arctic population is subject to a
subsistence hunt by Eskimo whale hunters which is carefully monitored
via a quota system established by the International Whaling Commission
(IWC) and managed by the Alaska Eskimo Whaling Commission
(AEWC). Quota determination is assisted by biological information,
such as fetal growth and gestation, which is the basis of a population
dynamics model (PDM) used to estimate the annual replacement yield
(RY) of the stock. We develop a Bayesian hierarchical nonlinear model
for fetal growth with computation carried out via Markov Chain Monte
Carlo (MCMC) techniques. Our model allows for unique conception and
parturition dates, and provides predictive distributions for both
gestation length (mean of 14.0 months with 90% predictive interval of
(13.0, 15.2)) and conception dates (mean 24 March with 90% predictive
interval of (3 March, 13 April)). These results are also used to
propose estimates of geographic locations for both conception and
parturition. Finally, a sensitivity analysis indicated that caution
should be exercised in specifying some parameters related to the
growth rate, conception dates, and parturition dates.
The clustering of infected SIV cells in lymphatic tissue
Cavan Reilly, Ashley Haase, Timothy Schacker, David Krason, and Steve Wietgreft
University of Minnesota, Reilly-Division of Biostatistics,
Haase and Wietgreft-Department of Microbiology,
and Schacker and Krason-Department of Infectious Diseases
A460 Mayo Bldg, MMC 303, 420 Delaware St. S.E., Minneapolis, MN 55455
While much research on the pathogenesis of HIV has been conducted,
there has been no research aimed at uncovering the manner in which the
virus spreads from one cell to another in an infected host. While some
have postulated complicated mechanisms by which the infection spreads,
we examined whether a simple model of local spread is consistent with lymph
node samples obtained from a Rhesus macaque infected with SIV (a close
relative of HIV with a similar pathogenesis) a known number of days
prior to sample collection.
A Bayesian Analysis of Consumer Preferences
Marc Sobel and Indrajit Sinha
Dept of Statistics/Marketing; Fox School of Business and
1810 N. 13th Street; Temple University
Philadelphia, PA 19122
In a mall-intercept study of consumer preferences, each consumer fills
out a questionnaire containing 17 Likert multiple-choice questions
regarding (possibly different) product categories. The questions are
concerned with such variables as perceived risk, quality variation,
deal proneness, and store versus national brand preference. We focus
on modelling the relationship between store versus national brand
preference and the other preference variables and predicting the
former. Flexible classes of latent variable models are proposed
together with agreement measures which permit the selection of the
optimal model for the data from these classes.
A Bayesian Method for Using Administrative Records to Predict Census Day Residency
Elizabeth Stuart
Harvard University
Statistics Department
1 Oxford St.
Cambridge, MA 02138
Administrative records are a promising data source for estimating
census coverage or identifying people missed in the census. An
important unsolved problem in using records is determining which of
them correspond to people actually resident on Census Day. We propose
a Bayesian hierarchical model in which one level describes the
migration process, and the other describes the probabilities of
observation in each of the available record systems. The observation
model uses the full information in the records, including the dates
associated with the records and available covariate information, and
can accommodate a variety of record types, such as tax records,
Medicare claims, and school enrollment lists. In addition, multiple
record systems can be modeled concurrently simply by multiplying the
likelihood of observation for each type. A Gibbs sampler is used to
obtain estimates of the in- and out-migration dates, and thus an
estimate of the probability of residency in the area on Census Day.
This work extends the use of Bayesian methodology in the context of
capture recapture population estimation, and could be useful in the
context of an administrative records census, or as a way of expanding
the role of administrative records in triple system estimation. This
is joint work with Alan Zaslavsky.
Use of the Bayesian approach in the analysis of clinical trials in patients with advanced lung cancer
Franz Torres, Gisela Gonzalez G, Tania Crombet, Agustin Lage
Center of Molecular Immunology
Calle 216 esq. 15 Atabey. C. Habana. Cuba
Bayesian methods are well suited to small clinical trials. Different
kinds of information can be combined with the actual experimental
results, and inferences are drawn from the posterior distributions of
the parameters given the data.
Two clinical trials on Non Small Cell Lung Cancer with a promising
vaccine are analysed. In these trials two adjuvants are tested and the
second trial adds a pretreatment with cyclophosphamide. Immunogenicity
and survival are the endpoints of interest. The log hazard ratio was
analysed with uninformative, sceptic and enthusiastic prior
distributions. The respective posterior distributions were obtained by
combination of the data with the respective prior. The second trial
used the posteriors of the first trial as its priors. Probabilities,
over the posterior distribution, of more than 5%, 10% and 15%
improvement in survival were calculated. A frequentist comparison of
survival was also made using the logrank test.
In the first trial there was only mild evidence (probability 0.52) of
more than 5% improvement in the log hazard ratio between the two
adjuvants. Median survival times were 8.07 and 8.00 months, with no
difference by the logrank test. Comparing all treated patients with
the historical controls gave mild evidence, with probability 0.472 in
the sceptic scenario (medians of 8.00 and 5.67 months). For the
high-responder patients there was strong evidence of more than 5%
improvement relative to controls, with probability 0.799 in the
sceptic scenario.
In the second trial, analysed with the accumulated evidence of the
previous trial, the comparison of the two adjuvants showed mild
evidence, with probability 0.47, of more than 5% improvement. There
was moderate to high evidence of a 5% improvement comparing all
treated patients with the controls. For the high responders there was
very strong evidence of 5% and 10% improvement, with probabilities of
0.799 and 0.922 in the uninformative and sceptic scenarios, in
contrast to the frequentist approach, which showed no statistically
significant difference.
Useful information about the response of the patients to the vaccines,
as well as about differences between the vaccines, is captured using the
Bayesian approach with different scenarios. We propose to use the
frequentist and Bayesian approaches jointly in an analysis for a more
complete research conclusion.
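The kind of calculation described above can be sketched with a normal
approximation to the log hazard ratio and a conjugate normal prior. This
is a minimal sketch under stated assumptions, not the authors' analysis:
the estimate, standard error, prior means and standard deviations, and
the reading of "more than 5% improvement" as a hazard ratio below 0.95
are all hypothetical values chosen for illustration.

```python
import math


def normal_posterior(prior_mean, prior_sd, est, se):
    """Conjugate normal update for a log hazard ratio (precision-weighted mean)."""
    w_prior = 1.0 / prior_sd ** 2
    w_data = 1.0 / se ** 2
    post_var = 1.0 / (w_prior + w_data)
    post_mean = post_var * (w_prior * prior_mean + w_data * est)
    return post_mean, math.sqrt(post_var)


def prob_below(mean, sd, cutoff):
    """P(X < cutoff) for X ~ Normal(mean, sd)."""
    return 0.5 * (1.0 + math.erf((cutoff - mean) / (sd * math.sqrt(2.0))))


# Hypothetical trial summary: estimated log hazard ratio and its standard error.
est, se = math.log(0.80), 0.30

# Hypothetical priors on the log hazard ratio: (mean, standard deviation).
priors = {
    "uninformative": (0.0, 100.0),
    "sceptic": (0.0, 0.25),
    "enthusiastic": (math.log(0.70), 0.25),
}

# Assumed definition of "more than 5% improvement": hazard ratio below 0.95.
cutoff = math.log(0.95)

results = {}
for name, (prior_mean, prior_sd) in priors.items():
    post_mean, post_sd = normal_posterior(prior_mean, prior_sd, est, se)
    results[name] = prob_below(post_mean, post_sd, cutoff)
```

Using the posterior of a first trial as the prior for a second amounts
to calling `normal_posterior` again with the first posterior mean and
standard deviation as the prior arguments.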
Disclosure Risk and Information Loss in an Attitudinal Survey
Mario Trottini
Departamento de Estad\'istica e Investigaci\'on Operativa, Universitat de Val\`encia
Avenida Dr. Moliner, 50
46100 Burjassot, Valencia, Spain
M. Jes\'us Bayarri
Departamento de Estad\'istica e Investigaci\'on Operativa, Universitat de Val\`encia
Avenida Dr. Moliner, 50
46100 Burjassot, Valencia, Spain
Stephen E. Fienberg
Department of Statistics, Carnegie Mellon University
Pittsburgh, PA, 15213
Disclosure limitation denotes a set of techniques aimed at
protecting confidentiality in the release of statistical data. The
problem is not trivial since protection of confidentiality should be
achieved in a way which is compatible with the agency's mission of
providing data users with good quality data. Many alternative
disclosure limitation techniques have been proposed and a key problem
is how to compare them in an efficient way. In this paper we
address this problem in a small-scale simulation study.
We use an adapted data set from an actual survey conducted by the
Institute for Social Research at York University (Fienberg, Makov and
Sanil, Journal of Official Statistics, Vol. 13, 1997).
The data set consists of 662 respondents and 5 approximately
continuous variables: Age, Civil-Liberties, Canadian-U.S.
relationship, Income and Attitude (towards Jews).
Alternative forms of data release are obtained by contaminating the
original data with various amounts of bias and noise. The statistical
agency has to decide which data is best to release.
The choice requires suitable criteria to assess to what extent the
released data can create harm to the data providers (i.e. the
disclosure risk) and to what extent the released data can be
beneficial to society (i.e. the data utility).
We show that existing measures of disclosure risk and data utility,
that only model the users' behavior, usually underestimate the
agency's uncertainty and better and more general measures can be
derived if a model for the statistical agency's behavior is also
Accounting for Pile-Up in the Chandra X-ray Observatory
David van Dyk and Yaming Yu
Department of Statistics, Harvard University
Alanna Connors
Eureka Scientific
Aneta Siemiginowska and Vinay Kashyap
Smithsonian Astrophysical Observatory
Harvard University
The Chandra X-ray Observatory was launched in July 1999 and boasts the
world's most powerful X-ray telescope. Chandra records the
binned time, energy, and location of high-energy photons that
arrive at its detector. Pile-up occurs in such X-ray detectors when
two or more (X-ray) photons arrive at the same location on the
detector during the same time bin. Such coincident events are counted
as a single higher energy event or are lost altogether if the total
energy goes above the on-board discriminator. Thus, for bright
sources pile-up can seriously distort both the count rate and the
energy distribution. Accounting for pile-up is perhaps the most
important outstanding data-analytic challenge for Chandra. In this
poster, we describe how Bayesian hierarchical models can be designed
to account for pile-up in X-ray detectors and how they can be fit via
Markov chain Monte Carlo. To account for pile-up, we stochastically
separate a subset of the observed photon counts into multiple counts
of lower energy based on the current iteration of the particular
spectral/spatial model being fit. Because of the complexity of the
pile-up process this remains a challenging statistical task requiring
simulation of highly structured multi-modal
distributions. Nonetheless, the Bayesian framework is promising
because it allows the inclusion of other sources of information. For
example, event grades (i.e., a description of the likelihood of the
degree of pile-up based on the spatial distribution of the charge) can
be used to improve the fit.
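To make the pile-up mechanism concrete, the following toy simulation
shows how coincident photons distort both the count rate and the
recorded energies. It is only a sketch of the forward process described
above, not the authors' model: the Poisson arrival rate per frame, the
uniform energy distribution, the threshold value and all function names
are hypothetical.

```python
import math
import random


def sample_poisson(rate, rng):
    """Draw a Poisson variate by Knuth's multiplication method (small rates only)."""
    limit = math.exp(-rate)
    k, prod = 0, rng.random()
    while prod > limit:
        k += 1
        prod *= rng.random()
    return k


def simulate_pileup(rate, n_frames, threshold, rng):
    """Simulate one detector location over `n_frames` time bins.

    Photons arriving in the same frame are recorded as a single event whose
    energy is the sum of their energies; the event is lost if that sum
    exceeds `threshold`. Returns (true photon count, recorded event energies).
    """
    true_photons = 0
    events = []
    for _ in range(n_frames):
        n = sample_poisson(rate, rng)
        true_photons += n
        if n == 0:
            continue
        total = sum(rng.uniform(0.5, 5.0) for _ in range(n))  # hypothetical energies
        if total <= threshold:
            events.append(total)
    return true_photons, events


rng = random.Random(0)
true_photons, events = simulate_pileup(rate=1.0, n_frames=5000, threshold=10.0, rng=rng)
observed_fraction = len(events) / true_photons
```

For a bright source (a large `rate`), `observed_fraction` falls well
below one and the recorded energies shift upward, which is the
distortion the hierarchical model has to undo.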
Correction of Ocular Artifacts in the EEG using Bayesian Adaptive Regression Splines
Garrick Wallstrom and Robert Kass
Carnegie Mellon University
Department of Statistics
Pittsburgh, PA 15213
Ocular activity is a significant source of artifacts in the
electroencephalogram (EEG).
Regression upon the electrooculogram (EOG) is commonly used to correct
the EEG. It is known, however, that this approach also removes high-frequency
cerebral activity from the EEG. To counter this effect,
we used Bayesian Adaptive Regression
Splines (BARS) (DiMatteo (2001); DiMatteo, Genovese, and Kass (2001))
to adaptively filter the EOG of high-frequency activity before using
the EOG for correction. In a simulation study, this approach reduced
spectral error rates in higher frequency bands.
Who Did Nader Really Raid? A Bayesian Analysis of Exit Poll
Data from the 2000 US Presidential Elections
Lara J. Wolfson
Brigham Young University
Department of Statistics
Provo Utah 84602 USA
How strongly was the close outcome of the 2000 US Presidential
election influenced by the presence of third party candidates? In a
presidential race where the margin of difference in the popular vote
between the top two candidates was less than 1%, the votes that went
to third-party candidates could have influenced the outcome of "swing"
states in the electoral college. Media pundits have opined that the
presence of Green Party candidate Ralph Nader on the ballot took such
deciding votes away from Al Gore. These opinions derived both from
popular wisdom about who likely Green Party voters were and from
data from the 2000 exit polls conducted by the Voter News
Service (VNS). In this paper, a Bayesian model for utilizing the VNS
data to estimate the probable outcomes of the election in the absence
of third party candidates is proposed, showing that careful
examination of the exit poll data yields some startling results.
Identifying differentially expressed genes in cDNA microarray experiments: an application of Bayesian methods using noninformative priors
Xiao Yang, Keying Ye and Ina Hoeschele
Department of Statistics, Virginia Tech
VA 24060
Recent advancements in biotechnology have made it possible for
researchers to study the regulation and interactions simultaneously
for thousands of genes using DNA microarrays. Microarrays have
enormous potential for application in pharmaceutical and clinical
research. By comparing the transcriptional levels of genes in two
different tissue samples, coordinated expression patterns revealed
from microarrays provide clues about gene function and shed light on
complex biomolecular pathways and genetic circuits involved in complex
traits in many organisms. One of the core objectives of microarray
experiments is to identify those differentially expressed genes
through measured gene expression levels. However, raw measures (in
terms of intensities from two dyes) can be affected by many sources of
variations, which makes the inference about the fold change of gene
expression across samples almost infeasible based only on raw
intensity measurements. Prior normalizations (in terms of eliminating
systematic variations due to dyes, slides, etc.) have to be made before
any statistical method can be applied. This paper proposes a Bayesian
method in the context of the generalized Fieller-Creasy problem using
noninformative priors, with the parameter of primary interest being
the ratio of two population means. This paper is motivated by the
fact that measurements taken from microarray experiments are usually
not normally distributed, often heavy-tailed or skewed, and that the
Fieller-Creasy problem fits the objectives of cDNA microarray
experiments, since we are also interested in the inference of the fold
change of gene expressions across two tissue samples. We generalize
the Fieller-Creasy problem to the case of non-normal distributions,
such as the Student's t family. Results are compared with those from
other standard Bayesian methods, such as methods by Newton et
al. (2000) and by Baldi and Long (2001). Implications for future
studies are also discussed.
Probabilistic Methods for Robotic Landmine Search
Yangang Zhang, Mark Schervish, Ercan U. Acar and Howie Choset
Carnegie Mellon University
Pittsburgh, PA 15213
One way to improve the efficiency of mine search, compared with a
complete coverage algorithm, is to direct the search based on
the spatial distribution of the minefield. The key for the success of
this probabilistic approach is to efficiently extract the spatial
distribution of the minefield during the process of the search. In our
research, we assume that a minefield follows a regular pattern, which
belongs to a family of known patterns. A Bayesian
approach to pattern extraction is developed to extract
the underlying pattern of the minefield. The algorithm performs well in
its ability to catch the ``actual'' pattern in the situation where
placement and detector errors exist. The algorithm is also
efficient, so online implementation of the algorithm on a mobile
robot is possible. Compared to the likelihood approach, the advantage
of using a Bayesian approach is that this approach provides information
about the uncertainty of the extracted ``actual'' pattern.
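The idea of extracting a pattern from a known family can be sketched as
a discrete Bayes update over candidate patterns. The sketch below is an
illustration under our own assumptions rather than the authors'
algorithm: the candidate patterns (mines in every k-th column of a small
grid), the detector hit and false-alarm probabilities, and all names are
hypothetical, and placement error is not modelled.

```python
def make_pattern(width, height, period, offset):
    """Hypothetical regular pattern: mines in every `period`-th column from `offset`."""
    return {(x, y) for x in range(width) for y in range(height) if x % period == offset}


def update(posterior, patterns, cell, detected, p_hit=0.9, p_false=0.1):
    """One Bayes update of the pattern probabilities after probing `cell`.

    p_hit is P(detector fires | mine), p_false is P(detector fires | no mine).
    """
    unnormalised = {}
    for name, prob in posterior.items():
        p_fire = p_hit if cell in patterns[name] else p_false
        likelihood = p_fire if detected else 1.0 - p_fire
        unnormalised[name] = prob * likelihood
    total = sum(unnormalised.values())
    return {name: value / total for name, value in unnormalised.items()}


patterns = {
    "every_2nd": make_pattern(6, 2, 2, 0),
    "every_3rd": make_pattern(6, 2, 3, 0),
    "every_3rd_shifted": make_pattern(6, 2, 3, 1),
}

# Uniform prior over the family of known patterns.
posterior = {name: 1.0 / len(patterns) for name in patterns}

# Probe the first row; the readings here happen to agree with "every_3rd".
for x in range(6):
    cell = (x, 0)
    detected = cell in patterns["every_3rd"]
    posterior = update(posterior, patterns, cell, detected)

best = max(posterior, key=posterior.get)
```

The full `posterior` dictionary, rather than only `best`, is what
carries the uncertainty about the extracted pattern, and it could be
used to decide which cell to probe next.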