The following abstracts have been accepted and will be part of the poster session at the Bayesian Workshop.

NAME / TITLE
Pam Abbitt
Quantile Estimation for Soil Texture Profiles
Heidi W. Ashih
Giovanni Parmigiani
Conversion of Stroke Disability Measures: A Case Study in Bayesian Nonparametric Analysis of Ordinal Categorical Data
Sanjib Basu
Bayesian Analysis of Circular Data
Susie Bayarri
A.M. Mayoral
Meta-analysis of (biased) experiments on the change of VBR in schizophrenics
Sudip Bose
Bayesian Analysis Of A Stock Selection Method Based On High-Yield Dow Jones Stocks
John Carlin
Assessing the homogeneity of three odds ratios: A case study in small-sample inference
Victor De Oliveira
Mark D. Ecker
Bayesian Hot Spot Detection in the Presence of a Spatial Trend: Application to Total Nitrogen Concentration in the Chesapeake Bay
D.J. DeWaal
J. Beirlant
Modelling Multivariate Data Containing Extremes
Ilaria DiMatteo
Joseph B. Kadane
Vote Tampering in a District Justice Election in Beaver County, PA
Vanja Dukic
Joseph Hogan
A Bayesian Approach to Analyzing Embryo Viability and Uterine Receptivity Data from Studies of In Vitro Fertilization
Lynn E. Eberly
Bradley P. Carlin
A spatial application in Bayesian learning and identifiability
Richard Evans
Siu Hui
Joe Sedransk
Modeling Rates of Growth and Loss of Bone Density Using Order Restricted Inference and Stochastic Change Points
Soledad A. Fernandez
Rohan L. Fernando
Alicia L. Carriquiry
An Algorithm to Sample Marker Genotypes in a Pedigree with Loops
Robin Fisher
Jana Lynn Asher
Bayesian Hierarchical Modelling of U.S. County Poverty Rates
Daniela Golinelli
Peter Guttorp
Bayesian inference in hidden linear birth-death processes
Piet Groenewald
Carin Viljoen
A Bayesian model for the analysis of lactation curves of dairy animals
K. Hasegawa
Tran Van Hoa
R. Valenzuela
Bayesian analysis of the HOGLEX Demand Systems using Unit Records for Major ASEAN Countries: Thailand and the Philippines
Dave Higdon
Richard E. Miller
A Hierarchical Model for the Genetical Analysis of Selection
Peter Hoff
Michael Newton
Richard Halberg
Identifying Carriers of a Genetic Modifier
Lurdes Y.T. Inoue
Peter Muller
Gary Rosner
Mark Dewhirst
A population model based smoothing for individual profiles
Matthew Johnson
The Hierarchical Rater Model for Rated Test Items
Herbie Lee
Dave Higdon
Marco Ferreira
Mike West
Multi-Scale Modeling and Parallel Computing for Modeling Porous Flow
John R. Lockwood III
Mark J. Schervish
Patrick Gurian
Mitchell J. Small
Characterization of Arsenic Occurrence in US Drinking Water Treatment Facility Source Waters
Hedibert Freitas Lopes
Peter Muller
Gary Rosner
Multivariate mixture model in meta analysis for hematology data
Viridiana Lourdes
Mike West
James Burgess
Analysis of psychiatric patient return times in the VA hospital system
Steven N. MacEachern
Mario Peruggia
Bayesian Tools for EDA and Model Building: A Brainy Study
Thomas Nichols
Bill Eddy
Jay McClelland
Chris Genovese
Role of Context in Visual Perception: A Functional Magnetic Resonance Imaging Study
J. Lynn Palmer
Dennis D. Cox
Joaquin Diaz
Mark Munsell
Comparison of two phase I methods for high toxicity drugs
Cavan Reilly
Andrew Gelman
Post-Stratification without Population Level Information on the Post-Stratifying Variable in Political Polling
Fabrizio Ruggeri
Antonio Pievatolo
Bayesian Analysis of Failures in a Gas Distribution Network
Eiki Satake
Probability and law: A Bayesian approach
Paola Sebastiani
Marco Ramoni
Paul Cohen
Bayesian analysis of sensory inputs of a mobile robot
Paola Sebastiani
Marco Ramoni
Alexander Crea
Profiling customers from in-house data
Howard Seltman
Hidden Stochastic Models for Biological Rhythm Data
Scott C. Schmidler
Jun S. Liu
Douglas L. Brutlag
Bayesian Protein Structure Prediction
Sandip Sinharay
Hal Stern
Variance Component Testing in Generalized Linear Mixed Models with an Application to Natural Selection Studies
Dalene Stangl
Lurdes Inoue
Steve Ponisciak
Meta-analysis for time-to-event data: Trade-offs with Discretization
J. Alex Stark
Kay Tatsuoka
Francoise Seillier-Moiseiwitsch
Sampling Methods for Phylogenetic Inference with Application to HIV
Chin-pei Tsai
Kathryn Chaloner
Using Prior Opinions to Examine Sample Size in Two Clinical Trials
David A. van Dyk
Vinay L. Kashyap
Aneta Siemiginowska
Alanna Connors
Analysis of High-Energy Spectra Obtained with the Chandra Satellite X-ray Observatory
Zeleke Worku
An Application of Weibull and Cox Models to Child Survival Analysis





Conversion of Stroke Disability Measures: A Case Study in Bayesian Nonparametric Analysis of Ordinal Categorical Data

by
Heidi W. Ashih and Giovanni Parmigiani
ISDS
Duke University
Box 90251
Durham, NC 27708-0251
heidi@stat.duke.edu

Abstract:

Two of the most common measures of disability/handicap used in stroke are the Rankin Stroke Outcome Scale (R) and the Barthel ADL Index (B). The Rankin Scale, which was designed for applications to stroke, is based on directly assessing the global condition of a patient. The Barthel Index, which was designed for more general applications, is based on a series of questions about the patient's ability to carry out 10 basic activities of daily living. The objective of our analysis is to provide a method for translating between B and R, or, more specifically, to estimate conditional probability distributions of each given the other. Subjects were 459 individuals who had sustained a stroke and were recruited for the Kansas City Stroke Study between 1995 and 1998. Patients were assessed with the B and R measures at 0, 1, 3, and 6 months after stroke. In addition, we incorporated a published 4x4 table cross-classifying patients by aggregate Rankin and Barthel scores.

Our estimation approach is motivated by four goals: (a) overcoming the difficulty presented by the fact that the two data sources report data at different resolutions; (b) interpolating the empirical counts to provide estimates of probabilities in regions of the table that are sparsely populated; (c) avoiding estimates that would conflict with medical knowledge about the relationship between the two tests; and (d) estimating the relationship between Rankin and Barthel scores at three months after the stroke, while borrowing strength from measurements made at zero, one, and six months.

We handled (a) via data augmentation. We addressed both (b) and (c) via an approach that recognizes the natural negative dependence in the tables and captures it without making any restrictive parametric assumptions about the relationship among the cell probabilities. This relationship is described via a condition that we term local association, which means that any two-by-two sub-table will show association (in this case negative) between the scores. We addressed (d) by postulating an autoregressive stochastic process for the cell probabilities. The table of cell probabilities at each time point is modeled with a Dirichlet distribution whose mean is given by the cell probabilities at the previous time point. The Dirichlet distribution also includes a dispersion parameter which, in our context, controls the average amount of variation in the table from one time interval to the next.
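A minimal sketch of this autoregressive Dirichlet structure (assuming a 4x4 table and a hypothetical dispersion parameter alpha; an illustration, not the authors' implementation):

    import numpy as np

    rng = np.random.default_rng(0)

    def evolve_table(p_prev, alpha):
        # Draw the next table of cell probabilities from a Dirichlet
        # distribution whose mean is the previous table; larger alpha
        # means less change between consecutive time points.
        new = rng.dirichlet(alpha * p_prev.ravel())
        return new.reshape(p_prev.shape)

    p0 = np.full((4, 4), 1 / 16.0)   # uniform 4x4 table at month 0
    tables = [p0]
    for _ in range(3):               # months 1, 3, and 6
        tables.append(evolve_table(tables[-1], alpha=200.0))
    print(tables[-1].round(3))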

Work carried out in collaboration with P. Duncan, D. Matchar, and G. Samsa.







AN APPLICATION OF WEIBULL AND COX MODELS TO CHILD SURVIVAL ANALYSIS

by
Zeleke Worku
Technikon Natal
P.O. Box 953, Durban 4000, South Africa
zelekew@umfolozi.ntech.ac.za
and
Dan DeWaal
University of the Orange Free State
P.O. Box 339, Bloemfontein 9300, South Africa

Abstract:

Special regression models were constructed using the Weibull and Cox models to explain the relationship between the lifetime of under-five children and 9 explanatory variables, taking the presence of censored observations and truncation at the age of five years into account. The variables used for analysis were the literacy status of the mother, the income status of the mother, the place of delivery of the child, attendance of postnatal health care services by the mother, availability of a nearby health facility, the extent of malnutrition of the child, the presence of acute respiratory infectious diseases, and the age of the mother at first birth. A random sample of 4001 under-five children from the Maseru District of Lesotho was used for data analysis. The normal distribution could not be used to analyze the lifetime of children because the error terms were not distributed normally with mean 0 and variance sigma squared. As a result, the relationship between the lifetime of children and the variables listed above had to be analyzed using Bayesian principles and Matlab programming, taking into account the presence of censored observations and the need to truncate the age at 5 years. To facilitate computation using Matlab for a PC, several subsamples of size 400 were drawn from the sample for data analysis. Using Bayesian inference, Weibull and Cox regression models were estimated. The Weibull model fitted the data better than the Cox model.
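A minimal sketch of a right-censored Weibull fit with follow-up cut off at five years, using simulated data in place of the Lesotho sample (parameter values and sample size are hypothetical, and the covariates are omitted):

    import numpy as np
    from scipy.optimize import minimize
    from scipy.stats import weibull_min

    rng = np.random.default_rng(1)
    t_true = weibull_min.rvs(c=1.4, scale=3.0, size=400, random_state=rng)
    obs = np.minimum(t_true, 5.0)            # follow-up ends at age five
    event = (t_true <= 5.0).astype(float)    # 1 = death observed, 0 = censored

    def neg_loglik(theta):
        shape, scale = np.exp(theta)         # keep parameters positive
        logpdf = weibull_min.logpdf(obs, c=shape, scale=scale)
        logsf = weibull_min.logsf(obs, c=shape, scale=scale)
        return -np.sum(event * logpdf + (1 - event) * logsf)

    fit = minimize(neg_loglik, x0=np.zeros(2))
    print("shape, scale:", np.exp(fit.x))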







Assessing the homogeneity of three odds ratios: A case study in small-sample inference

by
John B. Carlin
RCH Research Institute (University of Melbourne)
and Epidemiology and Biostatistics, MDC-56
University of South Florida
Tampa, FL 33612
j.carlin@medicine.unimelb.edu.au

Abstract:

In an experiment on the effects of varying ventilation regimes on lung damage in rabbits, six groups of between 6 and 8 animals each were compared, using a factorial treatment structure of 3 frequency levels crossed with 2 amplitudes. The resulting data were reduced to binary outcomes for each animal, producing a $3\times 2\times 2$ contingency table. Although the numbers were small, there appeared to be a large effect of amplitude at the two extreme frequency levels, but there were no failures at either amplitude in the middle frequency group. The question of interest was whether the data provided evidence that the effect of amplitude differed between the 3 frequencies, and in particular whether the effect in the middle group was lower than in the two extreme groups. Various models were considered for the 3 odds ratios in question, all seeking to incorporate minimally informative prior assumptions. Because of the small numbers, sensitivity to prior distribution specifications was considerable; in particular, we compared the effect of assuming independent prior distributions on each cell in the $3\times 2$ factorial with that of using a more structured prior distribution incorporating exchangeable row, column and interaction effects. The analysis provides a case study of the sensitivity of inferences in small samples, in a problem where the popular ``exact'' frequentist approach, based on a null hypothesis of equality of the odds ratios, breaks down.
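One concrete version of the "independent prior distributions on each cell" analysis (a sketch with invented counts and Jeffreys Beta(0.5, 0.5) priors, not the paper's data): draw posterior samples of the three amplitude odds ratios and ask how often the middle-frequency odds ratio is the smallest.

    import numpy as np

    rng = np.random.default_rng(2)
    # failures / group sizes for (low, high) amplitude at each of 3 frequencies
    fail = np.array([[1, 6], [0, 0], [1, 5]])    # hypothetical counts
    n = np.array([[7, 7], [8, 8], [6, 6]])

    S = 10_000
    p = rng.beta(fail + 0.5, n - fail + 0.5, size=(S, 3, 2))
    odds = p / (1 - p)
    OR = odds[:, :, 1] / odds[:, :, 0]     # amplitude odds ratio per frequency
    print("P(middle OR smallest):",
          np.mean((OR[:, 1] < OR[:, 0]) & (OR[:, 1] < OR[:, 2])))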







Probability and law: A Bayesian approach

by
Eiki Satake
Emerson College, School of Communication Sciences and Disorders
100 Beacon Street
Boston, MA 02116
esatake@emerson.edu

Abstract:

In his seminal work on formal reasoning, Polya (1968) attempts to establish a mathematical foundation for "plausible reasoning" based on syllogistic structures that serve as heuristics for analyzing patterns of reasoning, particularly inductive reasoning, which he sees as "conspicuous" in everyday life. Polya argues that the increment in the credibility of a conjecture always depends on the strength of circumstantial evidence. Kadesch's (1986) "new version" of Bayes' equation verifies the conclusions of Polya's syllogisms and heuristics, but without quantifying the probabilities involved. The process described in this paper demonstrates how Polya's syllogistic/heuristic approach to plausible reasoning and Kadesch's "mathematical expressions" of Polya's ideas may be quantified using empirical Bayes estimation, through the example of a criminal trial. The use of Bayesian statistics appears both valid and preferable in quantifying inductive reasoning patterns, such as those identified by Polya, because it allows probability statements to be made about the plausibility of an event. Bayes' rule provides a methodology for revising prior probabilities in light of additional evidence and is eminently suitable in cases of circumstantial evidence. In a criminal trial with a jury, the prosecutor and defense attorney hold conflicting views and conjectures on the guilt or innocence of the defendant. In each case, the views and conjectures are subjective. The same is true of the members of the jury. In a trial by jury, the role of the jury is to determine whether the evidence submitted is of sufficient strength to convict or acquit. In his discussion of judicial proof, Polya suggests that the reasoning by which a jury arrives at its decision follows an inductive pattern analogous to scientific inquiry, where several consequences of a conjecture are successively tested and evaluated. In terms of patterns of plausible inference, continued verification of a conjecture renders the conclusion more credible. Both Polya and Kadesch believe that the patterns of reasoning used in these processes may be analyzed in logical and mathematical ways to assess the probability of a conjecture. In general, Polya uses hypothetical syllogisms ("modus tollens") and heuristics, while Kadesch employs a Bayesian analysis. However, in a discussion of circumstantial evidence and reasoning in judicial matters, Polya employs the "calculus of probability" in an attempt to learn the credibility of evidence.
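As a purely illustrative example of the updating scheme described above (all numbers hypothetical), Bayes' rule applied to successive pieces of circumstantial evidence amounts to multiplying prior odds by likelihood ratios:

    prior_guilt = 0.10                    # juror's prior probability of guilt
    # P(evidence | guilty) / P(evidence | innocent) for three items of evidence
    likelihood_ratios = [4.0, 2.5, 3.0]

    odds = prior_guilt / (1 - prior_guilt)
    for lr in likelihood_ratios:
        odds *= lr                        # each verification raises credibility
    posterior_guilt = odds / (1 + odds)
    print(f"posterior probability of guilt: {posterior_guilt:.3f}")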







Analysis of psychiatric patient return times in the VA hospital system

by
Viridiana Lourdes, Mike West
ISDS
Duke University
Durham, NC
vl@stat.duke.edu
and
James Burgess
VA Management Sciences Group
Bedford, MA

Abstract:

As part of a long-term concern with measures of "quality-of-care" in the VA hospital system, the VA Management Sciences Group is involved in large-scale data collection and analysis of patient-specific data in many care areas. Among the many variables of interest are observed times of "return to follow-up care" of individuals discharged following initial treatment. Follow-up protocols are meant to encourage regular and timely returns, and observed patterns of variability in return time distributions are of interest in connection with questions about influences on return times that are specific to individual patients, care areas, hospitals, and system-wide policy changes. The study reported here takes a look at such issues in the area of psychiatric and substance abuse patients across the nationwide system of VA hospitals. The study is relatively new and ongoing, and this paper presents the story of initial modelling and data analysis efforts. We report on our studies of discrete duration models that are designed to help us understand and estimate the effects on return times that are specific to individual hospitals -- the primary question for the VA -- in the context of a large collection of potential additional covariates. We adopt logistic regression models to describe discretised representations of the underlying continuous return time distributions. These models take into account categorical covariates related to several socio-demographic characteristics and aspects of the medical history of individual patients, and treat the primary covariate of interest -- the VA hospital factor -- using a random effects/hierarchical prior. Our model is analysed in parallel across a range of chosen "return time cut-offs", providing a convenient way of exploring and understanding how posterior distributions for covariate effects and hyperparameters vary with the chosen cut-off. This perspective allows us to identify important aspects of the non-proportional odds structure exhibited by this very large and rich data set, by isolating important and interesting interactions between cut-offs and specific covariates. Summarisation of the sets of high-dimensional posterior distributions arising in such an analysis is challenging, and is most effectively done through sets of linked graphs of posterior intervals for covariate effects and other derived parameters. We explore and exemplify this work with a full year's data, from 1997. The paper also discusses additional questions of covariate selection via Bayesian model selection methods, and other practical issues.
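A minimal sketch of the parallel-cut-off idea (synthetic data and hypothetical covariates; the hospital factor is entered crudely as a numeric covariate rather than a random effect; not the study's code): for each cut-off c, define y = 1 if the return time is at most c and fit a logistic regression.

    import numpy as np
    import statsmodels.api as sm

    rng = np.random.default_rng(3)
    n = 2000
    hospital = rng.integers(0, 5, n)                 # 5 hypothetical hospitals
    age = rng.normal(size=n)
    lin = -0.5 + 0.3 * age + 0.2 * (hospital - 2)
    days = rng.exponential(scale=30 * np.exp(-lin))  # synthetic return times

    for cutoff in (30, 90, 180):
        y = (days <= cutoff).astype(int)
        X = sm.add_constant(np.column_stack([age, hospital]))
        fit = sm.Logit(y, X).fit(disp=0)
        print(cutoff, fit.params.round(2))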







Modelling multivariate data containing extremes

by
D.J. De Waal
University of the Orange Free State
Department of Mathematical Statistics
P.O. Box 339, Bloemfontein 9300, RSA
wwdw@wwg3.uovs.ac.za
and
J. Beirlant
Katholieke Universiteit Leuven
Department of Applied Maths
Celestijnenlaan 200B
Heverlee 3001, Belgium
jan.beirlant@wis.kuleuven.ac.be

Abstract:

The data consist of the sizes (X) and the values (Y) of 1914 diamonds sampled from the Damaya and Bougban deposits in Guinea, West Africa. The purpose is to model such multivariate data, which contain extreme values such as some large diamonds with high values, in order to construct similar data sets through simulation and to estimate small tail area (volume) probabilities. Several papers have appeared in the literature on modelling bivariate extreme value data, with emphasis on peaks-over-threshold methods. The model that we describe here can be considered a general multivariate parametric model with four types of parameters, of which one type contains the extreme value index. An MCMC algorithm is used to estimate these parameters under the maximal data information (mdi) prior. A simulation study on three variables is also given, together with model checking tools and an illustration of extending the model to include covariates.
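For intuition, a univariate peaks-over-threshold sketch (simulated stand-in data, not the diamond sample) showing the role of the extreme value index in estimating a small tail probability:

    import numpy as np
    from scipy.stats import genpareto

    rng = np.random.default_rng(4)
    values = rng.lognormal(mean=0.0, sigma=1.0, size=5000)  # stand-in "values"
    u = np.quantile(values, 0.95)                           # threshold
    exc = values[values > u] - u
    xi, _, beta = genpareto.fit(exc, floc=0.0)      # xi = extreme value index
    x = u + 5.0
    p_tail = (exc.size / values.size) * genpareto.sf(x - u, xi, scale=beta)
    print("extreme value index: %.2f, P(X > %.1f): %.5f" % (xi, x, p_tail))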







Bayesian analysis of HOGLEX demand systems using unit records for major ASEAN economies: Thailand and the Philippines

by
Hikaru Hasegawa
Faculty of Economics and Business Administration
Hokkaido University
Kita 9 Nishi 7
Kita-ku, Sapporo 060-0809
Japan
hasegawa@econ.hokudai.ac.jp
Tran Van Hoa
Department of Economics
University of Wollongong
and
Rebecca Valenzuela
Melbourne Institute of Applied Economic and Social Research
University of Melbourne
Australia

Abstract:

The HOGLEX demand system (Tran Van Hoa~\cite{TVH83,TVH85}) is integrable and completely general in the sense that it encompasses all other well-known demand systems in the literature on consumer behavior (Laitinen {\it et al.}~\cite{LTR83}). The HOGLEX studies to date have been based on conventional OLS or MLE methods and aggregate panel data. This paper elaborates on important subsets of the HOGLEX demand system and, using household expenditure unit records from two major ASEAN countries ({\it i.e.}, Thailand and the Philippines), estimates these subsets by Bayesian methods for a number of socio-demographic cohorts. We also estimate the models with measurement error in total expenditure and compare the results with those obtained without measurement error.







Multivariate mixture model in meta analysis for hematology data

by
Hedibert Freitas Lopes
DME, Universidade Federal do Rio de Janeiro and
ISDS, Duke University
Rio de Janeiro, Brazil
hedibert@stat.duke.edu
Peter Muller
ISDS
Duke University
Durham, NC 27708
pm@stat.duke.edu
and
Gary Rosner
Duke University Medical Center
Durham, NC 27708
gary.rosner@duke.edu

Abstract:

We consider Bayesian meta-analysis to combine data from two studies carried out by the Cancer and Leukemia Group B (CALGB). Our analysis is based on a pharmacodynamic study involving longitudinal data consisting of hematologic profiles, such as blood counts measured over time, of cancer patients undergoing chemotherapy. In both studies, we analyze the natural logarithm of each patient's white blood cell count (WBC) over time to characterize the toxic effects of treatment. The WBC counts are modeled through a nonlinear hierarchical model that gathers the information from both studies. Basically, this is done by allowing the parameters defining the nonlinear structure for each patient to depend on two mixtures of multivariate normals. The first mixture is common to both studies, while the second mixture is study specific and captures the variability intrinsic to patients from the same study. The proposed methodology is broad enough to embrace current hierarchical models, and it allows for {\em borrowing strength} between studies in a new and simple way. The MCMC techniques developed are flexible enough to allow posterior predictive inference for new patients and to account for model uncertainty with respect to the number of mixture components, for instance through a reversible jump algorithm.
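A minimal sketch of this random-effects structure (assumed dimensions, weights, and component locations; not the authors' model): each patient's parameters come either from a component shared by both studies or from a study-specific component.

    import numpy as np

    rng = np.random.default_rng(5)

    def draw_patient_params(study, n, w=0.6):
        # with probability w use the shared component, else the study-specific one
        shared = rng.normal([0.0, 1.0], 0.3, size=(n, 2))
        specific_mean = {0: [0.5, 1.2], 1: [-0.4, 0.8]}[study]
        specific = rng.normal(specific_mean, 0.3, size=(n, 2))
        pick = rng.random(n) < w
        return np.where(pick[:, None], shared, specific)

    theta0 = draw_patient_params(study=0, n=50)
    theta1 = draw_patient_params(study=1, n=50)
    print(theta0.mean(axis=0).round(2), theta1.mean(axis=0).round(2))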







Bayesian Inference in Hidden Linear Birth-Death Processes

by
Daniela Golinelli and Peter Guttorp
Department of Statistics
University of Washington
Seattle, WA 98195
golinell@stat.washington.edu

Abstract:

Many processes in biology, ecology and physics are modeled as continuous time stochastic processes. In the literature, these models are rarely fit to data. This is partly because such processes can rarely be observed completely, so inference on the parameters of interest becomes very involved. The population may be only partially observed for reasons of efficiency or because of physical constraints, e.g. when the population of interest resides in a living body. In such a context only samples or subsets of the population may be observed at discrete times. This is the situation that arises when studying hematopoiesis, the process of blood cell production. The interest here is in a better understanding of hematopoietic stem cell (HSC) kinetics. HSCs are primitive blood cells that support the entire blood and immune system. We focus on the problem of making inference in hidden birth-death processes, since a similar, although somewhat more complicated, model has been used to describe HSC behavior. The process is hidden because we observe only a probabilistic function of the birth-death process states at given observation times. We consider two cases. In the first case, the hidden process is a one-dimensional birth-death process and the observations are Poisson with rate given by a constant proportion of the hidden population size. In the second, the hidden process is a two-dimensional birth-death process, since, as in the HSC example, we assume that half of the cells can be genetically marked with a neutral marker. Hence, the observations are binomial, where the probability of success is given by the proportion of marked cells in the hidden population at the observation times. In both cases, the goal is to provide reasonable estimates of the birth and death rates. A more classical approach to this inferential problem, which makes use of the Forward-Backward algorithm, does not provide a satisfactory solution. The two main reasons for its failure are the infinite number of possible hidden states and the fact that the hidden process is continuous in time while the observations are taken only at discrete times. Using this approach involves the computation of transition probability matrices of large dimension. Since we do not know how many states were visited by the hidden process during the observation period, we are forced to put a bound on the size of the state space. These transition probabilities are computationally expensive and very unstable when the observation times are far apart. However, a Bayesian approach, together with MCMC methods, seems to overcome the problems stressed above and provides reasonable estimates for the parameters of interest. Simulation results show that the true parameter values fall in regions of high posterior probability. This method shows great promise for solving an exciting problem in hematology.
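A minimal simulation sketch of the second, marked-cell setting (hypothetical rates and observation times; not the authors' code): a linear birth-death process simulated with the Gillespie algorithm and observed only through binomial counts of marked cells.

    import numpy as np

    rng = np.random.default_rng(6)

    def simulate_bd(n0, birth, death, t_end):
        # Gillespie simulation of a linear birth-death process
        t, n, path = 0.0, n0, [(0.0, n0)]
        while t < t_end and n > 0:
            t += rng.exponential(1.0 / (n * (birth + death)))
            n += 1 if rng.random() < birth / (birth + death) else -1
            path.append((t, n))
        return path

    path = simulate_bd(n0=50, birth=0.11, death=0.10, t_end=20.0)
    for t_obs in (5.0, 10.0, 15.0, 20.0):
        n_at = [n for (t, n) in path if t <= t_obs][-1]
        marked = rng.binomial(n_at, 0.5)   # half of the cells carry the marker
        print("t=%.0f: hidden size %d, observed marked count %d"
              % (t_obs, n_at, marked))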







Bayesian analysis of sensory inputs of a mobile robot

by
Paola Sebastiani
Statistics Department
The Open University
p.sebastiani@open.ac.uk
Marco Ramoni
Knowledge Media Institute
The Open University
m.ramoni@open.ac.uk
and
Paul Cohen
Department of Computer Science
University of Massachusetts
Amherst, MA
cohen@cs.umass.edu

Abstract:

The goal of this work is to enable mobile robots to learn the dynamics of their activities. Our robot --- a Pioneer 1 --- is a small platform with two drive wheels and a trailing caster, and a two degree of freedom paddle gripper. For sensors, the Pioneer 1 has shaft encoders, stall sensors, five forward pointing and two side pointing sonars, bump sensors, a pair of infrared sensors at the front and back of its gripper, and a simple vision system that reports the location and size of color-coded objects. Our configuration of the Pioneer 1 has roughly forty sensors, though the values returned by some are derived from others. During its interaction with the world, the robot records the values of about 40 sensors every 1/10 of a second. In an extended period of wandering around the laboratory, the robot will engage in several different activities --- moving toward an object, losing sight of an object, bumping into something --- and these activities will have different sensory signatures. It is important to the goals of our project that the robot's learning should be {\em unsupervised}, which means we do not tell the robot when it has switched from one activity to another. Instead we define a simple event marker --- simultaneous change in three sensors --- and we define an {\em episode} as the period between event markers. The available data is then a set $S=\{ S_i \}$ of $m$ episodes for each sensor, and some episodes represent the same dynamics. The statistical problem is to model the episode dynamics and then cluster the episodes that represent the same dynamics to learn prototype experiences. The solution we have developed is a Bayesian algorithm for clustering by dynamics. We model the dynamics of each episode as a discrete Markov chain (MC), and our algorithm learns MC representations of the episode dynamics and then clusters the episodes that give rise to similar dynamics. The task of the clustering algorithm is two-fold: find the set of clusters that gives the best partition according to some measure, and assign each MC to one cluster. A partition is an assignment of MCs to clusters such that each episode belongs to one and only one cluster. The novelty of our approach is to regard the task of clustering MCs as a Bayesian model selection problem. In this framework, the model we are looking for is the most probable way of partitioning MCs according to their similarity given the data. We use the posterior probability of a partition as the scoring metric to assess its goodness of fit. As the number of possible partitions grows exponentially with the number of MCs to be considered, we use a heuristic method to restrict the search space. The algorithm performs a bottom-up search by recursively merging the closest MCs (representing either a cluster or a single episode) and evaluating whether the resulting model is more probable than the model where these MCs are separated. When this is the case, the procedure replaces the two MCs with the cluster resulting from their merging and tries to cluster two other MCs. Otherwise, the algorithm tries to merge the second best, the third best, and so on, until the set of pairs is empty and, in this case, returns the most probable partition found so far. This clustering method has allowed the robot to learn and discriminate significant dynamics, such as passing an object on the left or bumping into an object.
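The Bayesian score driving such a merge step can be written in closed form for Markov chains with Dirichlet priors on the transition-probability rows. A minimal sketch (uniform Dirichlet hyperparameters assumed; not the authors' implementation):

    import numpy as np
    from scipy.special import gammaln

    def log_marginal(counts, alpha=1.0):
        # log marginal likelihood of transition counts under independent
        # Dirichlet(alpha, ..., alpha) priors on each row
        k = counts.shape[1]
        a = np.full(k, alpha)
        rows = (gammaln(a.sum()) - gammaln(counts.sum(axis=1) + a.sum())
                + np.sum(gammaln(counts + a) - gammaln(a), axis=1))
        return rows.sum()

    c1 = np.array([[8, 2], [3, 7]])      # transition counts, episode 1
    c2 = np.array([[9, 1], [2, 8]])      # episode 2, similar dynamics
    gain = log_marginal(c1 + c2) - (log_marginal(c1) + log_marginal(c2))
    print("merge favored:", gain > 0)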







A Bayesian model for the analysis of lactation curves of dairy animals

by
Piet Groenewald and Carin Viljoen
University of the Orange Free State
Department of Mathematical Statistics
P.O. Box 339, Bloemfontein 9300
RSA
wwpg@wwg3.uovs.ac.za

Abstract:

The data consist of the milk records of a number of Saanen dairy goat herds, recorded over a period of two seasons. Farmers recorded, on certain test days during the lactation period, the milk production as well as the milk composition as to fat and protein content of each animal in the herd. The data contain the records of 493 animals, 262 of which were recorded for both seasons. The purpose of the study is to determine the effect of certain covariates on the characteristics of the lactation curve. The covariates used are the season, the lactation number of the animal and the time during the season at which lactation starts. Some of the pertinent characteristics are the peak milk yield, the time of peak milk yield, the total milk production, and the relationship between milk production and milk composition as well as between the lactation curves of the same animal in successive seasons. A hierarchical Bayes model is proposed, with Wood's model, a three-parameter Gamma curve, as the observation model for milk production as well as for milk composition, normal/Wishart priors for the observation model parameters, and noninformative second-stage priors. By means of the Gibbs sampler, the posterior distributions of the quantities of interest are obtained, and they clearly illustrate the significant effect of some of the covariates on the characteristics of the lactation curve. The analysis also enables us to estimate the lactation characteristics of untested animals, predict future characteristics and identify exceptional animals. This is an ongoing project, and a number of issues important to dairy goat breeders still need to be examined.
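For reference, Wood's curve is y(t) = a t^b exp(-c t), with peak yield at t = b/c; a short worked example with hypothetical parameter values:

    import numpy as np

    a, b, c = 2.5, 0.25, 0.004     # hypothetical Wood parameters
    t = np.arange(1, 301)          # days in milk
    y = a * t**b * np.exp(-c * t)  # daily milk yield

    t_peak = b / c                 # time of peak yield, here 62.5 days
    peak = a * t_peak**b * np.exp(-c * t_peak)
    print("peak %.2f at day %.0f; total %.0f" % (peak, t_peak, y.sum()))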







A population model based smoothing for individual profiles

by
Lurdes Y.T. Inoue, Peter Muller, Gary Rosner and Mark Dewhirst
Duke University
P.O. Box 90251
Durham, NC
lurdes@stat.duke.edu, pm@stat.duke.edu, grosner@cstatz.mc.duke.edu, dewhirst@radonc.duke.edu

Abstract:

This research is motivated by experiments evaluating the hemodynamic effects of various agents in tumor-bearing rats. In one set of experiments, the animals breathed room air, followed by carbogen (a mixture of pure oxygen and carbon dioxide). Interest focuses on answering the following questions: Do individual profiles change once the breathing mixture changes? How does the location of the tumor alter the effect of carbogen on hemodynamics? Do tumors respond to carbogen differently than normal muscle tissue? We propose a model for longitudinal data with random effects which includes model-based smoothing of repeated measurements over time, implemented with a flexible state space model. Submodels for repeated measurements on different individuals are hierarchically linked to allow borrowing strength across the population and formal inference about the effects of individual-specific covariates. The model is appropriate for the analysis of repeated measurement data when no convenient parametrization of the longitudinal data is suggested by the underlying application or exploratory data analysis, and the only available information is related to smoothness of the measurements over time. The experimental responses are longitudinal measurements of oxygen pressure measured in tissue, heart rate, tumor blood flow, and mean arterial pressure. The nature of the recorded responses does not suggest any meaningful parametric form for a regression of these profiles on time. Additionally, response patterns differ widely across individuals. Therefore, we propose a non-parametric regression to model the profile data over time.
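A minimal sketch of state-space smoothing for a single profile (a local-level model with hypothetical variances and a plain Kalman filter; the paper's model is richer):

    import numpy as np

    rng = np.random.default_rng(7)
    T = 100
    truth = 10.0 + np.cumsum(rng.normal(0, 0.1, T))   # smooth latent profile
    y = truth + rng.normal(0, 0.5, T)                 # noisy measurements

    q, r = 0.1**2, 0.5**2          # state and observation variances
    m, P = y[0], 1.0               # filter initialization
    est = np.empty(T)
    for t in range(T):
        P = P + q                  # predict
        K = P / (P + r)            # Kalman gain
        m = m + K * (y[t] - m)     # update with the observation
        P = (1 - K) * P
        est[t] = m
    print("RMSE of filtered estimate:",
          np.sqrt(np.mean((est - truth) ** 2)).round(3))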







Meta-analysis for time-to-event data: Trade-offs with discretization

by
Dalene Stangl, Lurdes Inoue and Steve Ponisciak
ISDS, Duke University
Box 90251
Durham, NC 27708-0251
dalene@stat.cmu.edu

Abstract:

Many meta-analyses in clinical trials and health policy examine time-to-event outcomes. However, these studies often rely on published summary data that present the outcome as a discrete variable observed at a single point in time. In other studies the continuous time-to-event outcome is modeled, but controversy ensues over the use of fixed versus random-effects parametric and semiparametric models. This paper examines the trade-offs between these modeling choices.







Profiling customers from in-house data

by
Paola Sebastiani
The Open University
p.sebastiani@open.ac.uk
Marco Ramoni and Alexander Crea
Knowledge Media Institute
The Open University
m.ramoni@open.ac.uk, a.crea@open.ac.uk

Abstract:

A typical problem of mailing campaigns is the low response rate. Recent studies have shown that adding incentives or gifts to the mailing can increase the response rate. This is the strategy implemented by the Paralyzed Veterans of America (PVA) in the June '97 renewal campaign. The mailing included a gift of personalized name and address labels plus an assortment of 10 note cards and envelopes. Each mail cost the charity 0.68 dollars and resulted in a response rate of about 5\%. Since the donations received from the respondents ranged between 2 and 200 dollars, and the median donation was 13 dollars, it is important to decide when and why it is worth pursuing the campaign, on the basis of the information available from in-house data. Last year, PVA made available a database of about 100,000 cases and 470 variables for the so-called {\sf KDD Cup}: a contest in which both commercial and research prototypes for Knowledge Discovery in Databases ({\sf KDD}) were invited to build a model maximizing the profit from the renewal mailing. The database consists of variables directly measuring donor features (such as donation history, age, and income) and variables collected from the 1990 US Census to characterize the donor's neighborhood, as well as socio-economic, urbanicity and ethnicity indicators. The winner was the company GainSmarts, which independently modeled the probability of response via logistic regression and the donation amount via multiple linear regression. The two models can be used jointly to decide when it is worth pursuing the mailing, by evaluating the expected profit. However, the two models do not provide much insight into the relationships between the variables that appear to most affect both the response rate and the donation amount. For example, one may be interested in profiling the donors to be targeted in the next campaigns, or in trying to understand donor behavior. This is the objective of our work. Extending the winner's approach, we build three dependency models. The first ({\sf Response-net}) models the dependence of the probability of response to the mailing campaign on the variables in the database, with the variables collected from the 1990 US Census removed. The second ({\sf Donation-net}) models the dependence of the donation amount and is built using only the 5\% of respondents to the mailing campaign. The third model is a Naive Bayes classifier that models one global indicator of socio-economic status, urbanicity, ethnicity and a variety of other demographic characteristics as a summary of the variables collected from the 1990 US Census. The three models are Bayesian networks (Pearl, 1988) induced from data using {\sf Bayesware Discoverer}, a commercial product for the induction of Bayesian networks from possibly incomplete data, produced by {\sf Bayesware Ltd}. {\sf Bayesware Discoverer} induces Bayesian networks from data using Bayesian methods, as described for example in Ramoni and Sebastiani (1999), and implements the {\em Bound} and {\em Collapse} method of Ramoni and Sebastiani (1998) to compute a first order approximation of the scoring metric when data are incomplete (Ramoni and Sebastiani, 1997). Bayesian networks provide a compact and easy-to-use representation of the probabilistic information conveyed by the data. The network structure helps one to understand the dependencies among the variables.
However, the network structure is only one aspect of the knowledge represented. By querying the network, one can investigate different relationships between the variables, as well as make predictions and explanations. For example, the network {\sf Response-net} shows that the probability of a donation is directly affected only by the wealth rating and the number of lifetime gifts to card promotions prior to the mailing campaign. Whether a donor responds is independent of all the other variables in the net given the wealth rating and the number of lifetime gifts, so that these variables are sufficient to predict the donor's response. By querying the network further, we can profile respondents, who appear most likely to be older females with an average household income who live in an area where the percentage of Vietnam veterans is between 25 and 50. The network {\sf Donation-net} shows that the donation amount is directly affected by the last gift prior to the mailing campaign and the number of times the donor responded to mail order offers from news and financial publications. Apparently, donors tend to keep the gift amount constant, and their constancy is directly proportional to the number of times they responded to similar mail offers. The last model --- the Naive Bayes classifier --- is an ancillary model showing the predictive capability of one global indicator as a summary of the variables collected from the 1990 US Census. The strong predictive capability of this indicator supports the decision to remove the US Census variables from the database.
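The economics quoted above make the decision rule concrete; with a 0.68 dollar cost, a roughly 5\% response rate, and a 13 dollar median donation, mailing everyone is marginal at best (a worked example, treating the median as a rough stand-in for the expected donation):

    cost = 0.68
    p_response = 0.05
    donation = 13.0                 # median donation as a rough stand-in

    expected_profit = p_response * donation - cost
    print("expected profit per mail: $%.2f" % expected_profit)
    print("break-even response rate: %.1f%%" % (100 * cost / donation))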







Modeling Rates of Growth and Loss of Bone Density Using Order Restricted Inference and Stochastic Change Points

by
Richard Evans
University of Arkansas
4301 W. Markham Street
Slot 781
Little Rock AR 72205
evansrichadb@exchange.uams.edu
Siu Hui
Indiana University
and
Joe Sedransk
Case Western Reserve University

Abstract:

Bone mineral density (BMD) at the spine and at the hip is central to the management of osteoporosis, because fractures at the spine are the most prevalent while fractures at the hip are the most debilitating. It has been shown that BMD predicts fractures and that the rates of bone growth and loss with age are important determinants of osteoporosis in old age. In order to develop therapies for osteoporosis we need to understand the pattern of bone growth and loss as people age. The problem is to estimate the ages corresponding to the changepoints demarking the stages in the pattern of age-specific mean rates of change of BMD, $\mu(t)$, $t=8, \dots , 80$, where $t$ is age in years. In this paper we assume that the $\mu(t)$ behave according to $\mu(8)<\dots<\mu(t_1)>\dots>\mu(t_2)=0>\dots>\mu(t_3)<\dots<\mu(80)$, and provide inference for the changepoints $t_1$, $t_2$, and $t_3$. The constrained parameter Gibbs sampler suggested by Gelfand, Smith, and Lee (1992) will not work for this problem because $t_1$, $t_2$, and $t_3$ are uniquely determined conditional on $\mu(t)$, and the order restriction of $\mu(t)$ is determined conditional on $t_1$, $t_2$, and $t_3$. Instead, we use the Metropolis algorithm. Sampling the order-restricted $\mu(t)$ is facilitated by transforming the $\mu(t)$ to $z(t)$, where the $z(t)$ do not have an order restriction; for example, $\mu(t)=\sum_{j=1}^{t}e^{z(j)}$ for $t\le t_1$.
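A minimal sketch of that reparametrization (stand-in data and a simple Gaussian likelihood; not the paper's model): unconstrained z map to an increasing sequence via cumulative sums of exponentials, so a random-walk Metropolis step never violates the order restriction.

    import numpy as np

    rng = np.random.default_rng(8)
    T = 10
    data = np.cumsum(rng.uniform(0.2, 0.8, T)) + rng.normal(0, 0.1, T)

    def mu_from_z(z):
        return np.cumsum(np.exp(z))        # increasing by construction

    def log_post(z, y):
        return -0.5 * np.sum((y - mu_from_z(z)) ** 2)   # flat prior on z

    z = np.zeros(T)
    for _ in range(5000):
        prop = z + rng.normal(0, 0.1, T)   # random-walk proposal
        if np.log(rng.random()) < log_post(prop, data) - log_post(z, data):
            z = prop
    print("fitted increasing means:", mu_from_z(z).round(2))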






Using Prior Opinions to Examine Sample Size in Two Clinical Trials

by
Chin-pei Tsai and Kathryn Chaloner
School of Statistics
University of Minnesota
352 Classroom Office Building
199 Buford Ave.
St. Paul MN 55108
cptsai@stat.umn.edu, kathryn@stat.umn.edu

Abstract:

Two examples of large clinical trials for the treatment of advanced HIV disease are described. For the two trials, Chaloner and Rhame (in ``Ethical and Statistical Reasons for Quantifying and Documenting Prior Opinions,'' manuscript, 1999) elicited prior opinions from over 50 HIV clinicians. Their prior opinions are used here for design: the sample size for reaching consensus with high probability is calculated. Consensus is said to occur when all clinicians have posterior opinions which would lead to prescribing the same treatment. Posterior beliefs are calculated using a simple linear Bayes approximation. In addition, plots are given for determining parameter values for which a particular sample size is sufficient for consensus to be reached with high probability. These calculations are useful tools at the design stage and are easy to implement.
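A minimal sketch of this consensus calculation under assumed numbers (normal-normal linear Bayes updating, hypothetical priors and effect size; not the authors' elicited data):

    import numpy as np

    rng = np.random.default_rng(9)
    prior_mean = rng.normal(0.0, 0.15, 50)  # 50 clinicians' prior effect beliefs
    prior_var, data_var = 0.05, 1.0
    true_effect = 0.10                      # assumed treatment benefit

    for n in (100, 500, 2000):
        w = prior_var / (prior_var + data_var / n)   # linear Bayes weight
        post = (1 - w) * prior_mean + w * true_effect
        print(n, "consensus reached:", bool(np.all(post > 0)))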







Sampling Methods for Phylogenetic Inference with Application to HIV

by
J. Alex Stark, Kay Tatsuoka
National Institute of Statistical Sciences
Research Triangle Park, NC
stark@niss.org
and
Francoise Seillier-Moiseiwitsch
Department of Biostatistics
UNC Chapel Hill
Chapel Hill, NC

Abstract:

Reconstructing phylogenies (evolutionary histories) of proteins from a set of observed sequences is an important statistical modelling task in molecular evolution. Maximum likelihood and Bayesian methods have been used for phylogenetic inference. We review Markov chain Monte Carlo schemes that have been developed, and discuss the application of these to HIV data. We consider the problem of analysing the results from one such sampler and describe in detail a method for determining a consensus labelled history. This is illustrated with an example of HIV protease.







Comparison of two phase I methods for high toxicity drugs

by
J. Lynn Palmer and Mark Munsell
UT M.D. Anderson Cancer Center
Department of Biostatistics
Houston, TX 77030
jlp@odin.mdacc.tmc.edu

Abstract:

In cancer clinical trials, the usual Phase I methodology used to determine appropriate dosages for Phase II studies is some variant of the usual 3+3 design. In this method, 3 patients are entered at a specific dose level, then 3 more at the next higher dose level, and so on until 1 or more patients experience toxicity. This method usually selects as the Maximum Tolerated Dose (MTD) a dose level that results in 25% to 33% of patients experiencing toxicity. However, in some situations a higher 'optimal' level of toxicity must be found, as when the toxicity is considered mild or fully reversible, or when a patient who experiences no toxicity may be at higher risk. The latter situation occurred in a bone marrow transplant application where the major toxicity was graft-versus-host disease and a higher than standard toxicity rate was necessary. Two methodologies are considered and compared through the use of simulated data: the Continual Reassessment Method (CRM) and a variation of the standard 3+3 methodology which also includes 12 patients treated at the MTD for somewhat higher precision. Which of the two methods was 'best' alternated, depending on the expected toxicity levels at the given doses.
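A minimal simulation of the standard 3+3 escalation logic (slightly simplified, with a hypothetical dose-toxicity curve; the paper's variant with 12 extra patients at the MTD is omitted):

    import numpy as np

    rng = np.random.default_rng(10)
    tox_prob = [0.05, 0.10, 0.20, 0.35, 0.50]  # hypothetical dose-toxicity curve

    def three_plus_three(p):
        for level, pt in enumerate(p):
            dlt = rng.binomial(3, pt)          # toxicities in the first cohort
            if dlt == 0:
                continue                       # escalate
            if dlt == 1 and rng.binomial(3, pt) == 0:
                continue                       # expand by 3, then escalate
            return max(level - 1, 0)           # MTD is the previous level
        return len(p) - 1

    mtds = [three_plus_three(tox_prob) for _ in range(2000)]
    print("MTD selection frequencies:", np.bincount(mtds, minlength=5) / 2000)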







Multi-Scale Modeling and Parallel Computing for Modeling Porous Flow

by
Herbert Lee, Dave Higdon, Marco Ferreira and Mike West
Duke University
ISDS, Box 90251
Durham, NC
herbie@stat.duke.edu, higdon@stat.duke.edu, marco@stat.duke.edu, mw@stat.duke.edu

Abstract:

Conventional deterministic simulation models of fluid flow in porous media require very high-dimensional parameters as inputs -- these parameters are permeability and porosity tensor fields. Variations in such parameters act across multiple scales, since fine-scale variations in their values can have key large-scale effects on fluid flow predictions. Critical interests lie in modeling and accounting for uncertainty about such high-dimensional parameters, and in studying the effects of such uncertainties on deterministic simulations of fluid flow, as an aid to practical problems such as contaminant clean-up and oil reservoir exploration. Unfortunately, the determination of these parameters is radically ill-posed: relevant "hard" data is often limited, and use must be made of various sources of indirect data and expert opinion. In connection with a large collaborative project involving statisticians, mathematicians, computer scientists and engineers, we are developing high-dimensional, multi-scale Markov random fields as prior models for permeability and other parameters. Novel multi-scale modeling ideas attempt to account for relationships between permeabilities on discrete grids at different levels of physical resolution. This structure also allows computations to be handled by a parallel computing machine, increasing the size of problems that can feasibly be tackled and also, at least in prospect, introducing novel dimensions to the statistical thinking about just what can and cannot be done in challenging, large-scale problems.







Identifying Carriers of a Genetic Modifier

by
Peter Hoff, Michael Newton and Richard Halberg
University of Wisconsin, Madison
Department of Statistics
1210 W. Dayton
Madison, WI 53706-1685
hoff@stat.wisc.edu

Abstract:

People with familial adenomatous polyposis (FAP) develop hundreds of tumors of the colon which, if left untreated, eventually progress to become carcinomas. The disease is caused by the inheritance of a single mutant allele of the \emph{APC} gene. The \emph{Min} mutation in the mouse homologue of \emph{APC} causes a phenotype very similar to human FAP. Mice with the \emph{Min} mutation thus provide a model for studying this type of inherited colon cancer. In a mutagenesis experiment, a mouse is obtained which shows signs of carrying a mutation reducing the tumor-causing effects of \emph{Min}. In order to map the location of this modifier gene, it is necessary to breed and identify a group of animals carrying the modifier. Although inheritance of the modifier is not directly observable, animals resulting from a breeding experiment carry the modifier with known prior probabilities. Conditional upon the unobserved pattern of inheritance, the animals are modeled as having tumor counts distributed according to either a carrier or a non-carrier distribution. Our goal is to identify likely carriers and non-carriers of the modifier, assuming only that the tumor count distributions are stochastically ordered. We take a nonparametric Bayesian approach by putting a prior on the space of pairs of stochastically ordered distributions, and develop a Markov Chain for estimating posterior quantities of interest.
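For intuition only (the paper's approach is nonparametric; here two known Poisson tumor-count distributions play the carrier and non-carrier roles), the classification combines the known prior carrier probability with the count likelihoods:

    from scipy.stats import poisson

    prior_carrier = 0.5                       # known from the breeding design
    lam_carrier, lam_noncarrier = 8.0, 30.0   # the modifier reduces tumor counts

    for count in (3, 12, 25):
        num = prior_carrier * poisson.pmf(count, lam_carrier)
        den = num + (1 - prior_carrier) * poisson.pmf(count, lam_noncarrier)
        print("count %d: P(carrier) = %.3f" % (count, num / den))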







Bayesian Hot Spot Detection in the Presence of a Spatial Trend: Application to Total Nitrogen Concentration in the Chesapeake Bay

by
Victor De Oliveira
Departamento de Computo Cientifico y Estadistica
Universidad Simon Bolivar
Caracas, Venezuela
vdo@cesma.usb.ve
and
Mark D. Ecker
Department of Mathematics University of Northern Iowa
Cedar Falls, IA
ecker@math.uni.edu

Abstract:

In the Chesapeake Bay, a decreasing gradient of total nitrogen concentration extends from the highest values in the north at the mouth of the Susquehanna River to the lowest values in the south near the Atlantic Ocean. We propose an attractive model for these data, whose sampling distributions are right skewed, by coupling the Box-Cox family of power transformations with a spatial trend in a random field model. This is done using a Bayesian Transformed Gaussian random field, as proposed by De Oliveira, Kedem and Short (1997); we extend this model to the case when the data contain measurement error and propose a new Monte Carlo algorithm to perform the necessary inference and prediction. Also, we propose a new definition of `hot spot' that generalizes previous definitions and is appealing for processes with a spatial trend.
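A minimal sketch of the Box-Cox step (simulated right-skewed data standing in for the nitrogen concentrations):

    import numpy as np
    from scipy.stats import boxcox

    rng = np.random.default_rng(11)
    conc = rng.lognormal(mean=0.5, sigma=0.8, size=500)  # skewed stand-in data
    transformed, lam = boxcox(conc)                      # ML estimate of lambda
    print("estimated Box-Cox lambda: %.2f" % lam)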







An Algorithm to Sample Marker Genotypes in a Pedigree with Loops

by
Soledad A. Fernandez, Rohan L. Fernando and Alicia L. Carriquiry
Statistics Department
Iowa State University
202C Snedecor Hall
Ames, IA 50014
soledad@iastate.edu

Abstract:

Markov chain Monte Carlo (MCMC) methods have recently been proposed to overcome computational problems in linkage analysis. This approach consists of sampling genotypes at the marker and trait loci. It has been shown that the Markov chain corresponding to the scalar Gibbs sampler may not be irreducible when the marker locus has more than two alleles. This problem does not arise if the marker genotypes are sampled jointly from the entire pedigree. When the pedigree does not have loops, a joint sample of the marker genotypes can be obtained efficiently by using a modification of the Elston-Stewart algorithm. When the pedigree has many loops, obtaining a joint sample may be very time consuming. We propose a method for this situation, in which we sample genotypes from a pedigree modified so as to make joint sampling efficient. These samples, obtained from the modified pedigree, are used as candidate draws to be accepted or rejected in the Metropolis-Hastings algorithm. The efficiency of this strategy is compared to the efficiency of other approaches in the literature.
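The accept/reject step has the usual Metropolis-Hastings form; a generic independence-sampler sketch (a one-dimensional toy target, not the genetics model), in which the tractable proposal q plays the role of the modified pedigree:

    import numpy as np

    rng = np.random.default_rng(12)

    def log_p(x):        # target (stand-in for the loopy-pedigree model)
        return -0.5 * (x - 2.0) ** 2

    def log_q(x):        # tractable proposal (the "modified" model)
        return -0.5 * (x / 2.0) ** 2

    x, draws = 0.0, []
    for _ in range(5000):
        cand = rng.normal(0.0, 2.0)        # draw from q
        log_acc = (log_p(cand) + log_q(x)) - (log_p(x) + log_q(cand))
        if np.log(rng.random()) < log_acc:
            x = cand
        draws.append(x)
    print("posterior mean:", np.round(np.mean(draws[1000:]), 2))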







A Bayesian Approach to Analyzing Embryo Viability and Uterine Receptivity Data from Studies of In Vitro Fertilization

by
Vanja Dukic
Division of Applied Mathematics
vanja@stat.brown.edu
and
Joseph Hogan
Center for Statistical Sciences
Brown University
Providence, RI 02912

Abstract:

In Vitro Fertilization and Embryo Transfer (IVF-ET) is considered a method of last resort for treating infertility. Oocytes taken from a woman are fertilized in vitro, and one or more resulting embryos are transferred into the uterus. An important outcome of interest in IVF studies is embryo implantation. A widely used model, known as the EU model, postulates that implantation probability is explained by a combination of embryo viability (E) and uterine receptivity (U). Specifically, the model assumes that uterine receptivity is characterized by a latent binary variable U, and that the number of viable embryos among those selected for transfer, E, is binomial. The observed number of implantations among the transferred embryos is the product of E and U. The case study concerns estimating the effect of hydrosalpinx on embryo implantation in patients undergoing IVF. Hydrosalpinx is a build-up of an embryotoxic fluid in the fallopian tubes, which sometimes leaks to the uterus and may reduce the likelihood of implantation. It is generally understood that hydrosalpinx does not affect embryo viability, which is determined prior to transfer; rather, it affects the implantation rate by compromising the uterine environment. Affected tubes can be treated surgically, but the procedure may result in permanent damage to the tubes and loss of their functionality. Among IVF practitioners and researchers, there exists considerable disagreement about whether hydrosalpinx reduces implantation rates enough to warrant surgery as a general treatment. Perhaps owing to the high risk associated with surgery, very little data from clinical trials is available; many of the arguments for and against the use of surgery rest on findings from observational studies. Few of these studies are analyzed using methods for correlated data, and many do not even employ covariate adjustment. Our case study is based on data from an observational study of 288 women undergoing IVF because of tubal disease. We use our hierarchical version of the EU model to assess the effect of hydrosalpinx on implantation by estimating its effect on uterine receptivity. When some subjects have zero implantations, the EU model is not fully identified and informative prior distributions are required for key parameters. Our analysis uses informative priors constructed from previous studies, and examines sensitivity to the choice of prior distribution. The analysis also indicates substantial subject-level heterogeneity with respect to embryo viability, suggesting the importance of using a multi-level model.
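The observation model itself is compact; a minimal sketch with hypothetical parameter values (U a latent binary receptivity indicator, E a binomial count of viable embryos, and observed implantations = E * U):

    import numpy as np
    from scipy.stats import binom

    def implant_pmf(k, n, p, q):
        # P(k implantations | n transferred, viability p, receptivity prob q)
        if k == 0:
            # non-receptive uterus, or receptive with no viable embryos
            return (1 - q) + q * binom.pmf(0, n, p)
        return q * binom.pmf(k, n, p)

    n, p, q = 3, 0.3, 0.6
    print([round(implant_pmf(k, n, p, q), 3) for k in range(n + 1)])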







Bayesian Hierarchical Modelling of U.S. County Poverty Rates

by
Robin Fisher and Jana Asher
HHES Division, 1065-3
U.S. Census Bureau
Washington, DC 20233-8500
jana.l.asher@ccmail.census.gov

Abstract:

The U.S. Census Bureau Small Area Income and Poverty Estimates program produces biennial intercensal estimates of the poverty rates and counts of poor within counties for use in determining the allocation of federal funds to local jurisdictions. Our main consumer is the Department of Education, which indirectly uses these estimates to distribute approximately \$7 billion of Title I funds annually. Numbers of poor are currently modelled through an empirical Bayes estimation method centered on a linear regression; the dependent variable is a log transformation of the three-year average of the March Current Population Survey (CPS) estimate of the number of poor for each county, and the independent variables are log transformations of administrative data such as the number of poor from the previous decennial census, the number of poor as aggregated from tax return data, and the food stamp participation rate for each county. We assume the variability of the CPS estimates is the sum of a model error term with constant variance and a sampling error term whose variance is proportional to the inverse of a power of the CPS sample size. Maximum likelihood estimation is used to jointly determine the values of the regression coefficients and the sampling variance components. Problems with the current estimation technique include a loss of data points, due to the log transformation, for counties whose CPS sample of poor is zero, and the requirement of using decennial census data to estimate the model error variance term. To eliminate these problems and improve the overall quality of our estimates, we have developed a hierarchical Bayesian model which assumes the observed number of poor is a scaled binomial random variable given the underlying poverty rate. This poverty rate, in turn, has a beta prior which relies on a set of parameters that includes the regression coefficients for the administrative data. Posterior probability distributions for regression parameters, variance parameters, and true proportion poor are generated using Markov chain Monte Carlo techniques. We will discuss the Bayesian model and compare the results of the original and new estimation methods.
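A minimal sketch of the beta-binomial layer (hypothetical prior strength, with the regression-based prior mean taken as given): a county whose CPS sample contains zero poor no longer drops out.

    def county_posterior(y, n, prior_mean, prior_strength=50.0):
        # beta prior centered at the regression prediction, conjugate update
        a0 = prior_mean * prior_strength
        b0 = (1 - prior_mean) * prior_strength
        a, b = a0 + y, b0 + n - y
        return a / (a + b)          # posterior mean poverty rate

    # a county with a tiny CPS sample and zero sampled poor
    print(round(county_posterior(y=0, n=12, prior_mean=0.18), 3))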







Analysis of High-Energy Spectra Obtained with the Chandra Satellite X-ray Observatory

by
David A. van Dyk
Department of Statistics
Harvard University
vandyk@stat.harvard.edu
Vinay L. Kashyap and Aneta Siemiginowska
Harvard-Smithsonian Center for Astrophysics and
Alanna Connors
Department of Astronomy
Wellesley College

Abstract:

In this paper, we employ modern Bayesian computational techniques (e.g., EM-type algorithms, the Gibbs sampler, and Metropolis-Hastings) to fit new models for low-count, high-resolution astrophysical spectral data. Our methods will be useful not only for the Chandra X-ray Observatory (launched by the Space Shuttle {\it Columbia} in July 1999), but also for such new generation telescopes as XMM, Constellation-X and GLAST. This application demonstrates the flexibility and power of modern Bayesian methodology and algorithms to handle highly hierarchical models that account for the complex structure in the collection of high-quality spectra. Current popular statistical analyses for such data typically involve Gaussian approximations (e.g., chi squared fitting), which are not justifiable for the high-resolution, low-count data that will soon be available. In contrast, we explicitly model photon arrivals as a Poisson process and, thus, have no difficulty with high-resolution data. Our models also incorporate instrument response (e.g., via a response matrix and effective area vector) and background contamination of the data. In particular, we model the background as the realization of a second Poisson process, thereby eliminating the need to directly subtract off the background counts and avoiding the rather embarrassing problem of negative photon counts. The source energy spectrum is modeled as a mixture of a generalized linear model, which accounts for the spectral continuum plus absorption (i.e., stochastic partial censoring), and several (Gaussian) line profiles. Using several examples, we illustrate how Bayesian posterior methods can be used to compute point estimates of the various model parameters as well as error bars on these estimates.
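A minimal sketch of the source-plus-background idea (one spectral bin, a known background rate, and maximum likelihood in place of the paper's full posterior analysis): the background is modeled rather than subtracted, so no negative counts can arise.

    import numpy as np
    from scipy.optimize import minimize_scalar

    rng = np.random.default_rng(13)
    bkg_rate = 2.0                                       # from an off-source region
    counts = rng.poisson(lam=5.0 + bkg_rate, size=200)   # source + background

    def neg_loglik(src):
        lam = src + bkg_rate
        return -(np.sum(counts) * np.log(lam) - counts.size * lam)

    fit = minimize_scalar(neg_loglik, bounds=(1e-6, 50.0), method="bounded")
    print("estimated source rate: %.2f" % fit.x)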







Meta-analysis of (biased) experiments on the change of VBR in schizophrenics

by
M.J. Bayarri
University of Valencia
Av. Dr. Moliner 50
Burjassot, Spain
susie.bayarri@uv.es
and
A.M. Mayoral
Miguel Hernandez University
Av. Ferrocarril s/n
03202 Elche
Spain
asun.mayoral@umh.es

Abstract:

The degenerating effect of schizophrenia on the brain is an issue attracting considerable interest in the psychiatric literature. In particular, several experiments studying possible changes in the brain morphology of schizophrenics have been reported. The quality, measurement techniques, diagnostic tools and measure of brain 'size' considered differed widely among those experiments. Seven of them, investigating the change in the Ventricular Brain Ratio (VBR) of schizophrenics and for which complete data existed, were retrieved from the literature. While some of them defended the theory that the VBR tends to shrink in schizophrenics, others defended precisely the opposite. In this paper, a Bayesian meta-analysis of these seven experiments is performed and inferences about the VBR change per year are addressed. Possible biases due to the diagnosis procedure and diagnosis criteria are taken into account.







Bayesian Protein Structure Prediction

by
Scott C. Schmidler, Jun S. Liu and Douglas L. Brutlag
Section on Medical Informatics
and Departments of Statistics and Biochemistry, Stanford University
Medical School Office Bldg, X215
Stanford, CA 94305
schmidler@smi.stanford.edu

Abstract:

The Human Genome Project estimates that sequencing of the entire complement of human DNA will be completed in the year 2003. At the same time, a number of complete genomes for pathogenic organisms are already available, with many more under way. Widespread access to these data promises to revolutionize areas of biology and medicine by providing fundamental insights into the molecular mechanisms of disease and pointing the way to the development of novel therapeutic agents. Before this promise can be fulfilled, however, a number of significant hurdles remain. Each individual gene must be located within the 3 billion bases of the human genome, and the functional role of its associated protein product identified. This process of characterizing function, as well as the later development of pharmaceutical agents to affect that function, is aided greatly by knowledge of the 3-dimensional structure into which the protein folds. While the sequence of the protein can be determined directly from the DNA of the gene which encodes it, prediction of the 3-dimensional structure of the protein from that sequence remains one of the great open problems of science. Moreover, the scale of the problem (the human genome is projected to contain approximately 100,000 genes) necessitates the development of {\em computational} solutions which capitalize on the laboriously acquired experimental structure data. We describe our work on Bayesian models for prediction of protein structure from sequence, based on analysis of a database of experimentally determined protein structures. We focus on a well-known simplification of the problem which attempts to predict the "secondary structure" of the protein by identifying regions in a protein sequence which take on regular local conformations in the (unobserved) 3-dimensional structure. We define joint probability models for sequence and structure which capture fundamental aspects of the folding process such as hydrophobicity patterns and side chain interactions. A simple model assuming conditional independence of local "segments" is developed which allows efficient calculation of predictive quantities using dynamic programming algorithms based on graphical Markov models. This approach is shown to perform at the level of the best available methods in the field via extensive cross-validation experiments. We then show how the model is naturally extended to include non-local sequence interactions arising from the 3-dimensional folding of the protein. With the use of MCMC simulation techniques, such extended models may allow us to go beyond the realm of secondary structure prediction to the location of contacts in the folded 3-d structure, hence narrowing the gap between sequence and completely folded structure.
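
As a toy illustration of the dynamic-programming machinery such models rely on, the sketch below runs the forward-backward algorithm for a first-order Markov model over three secondary-structure states (helix H, strand E, coil C); all transition and emission numbers, and the tiny four-letter residue alphabet, are invented, and the authors' segment-based model is considerably richer.

    import numpy as np

    states = ["H", "E", "C"]
    trans = np.array([[0.90, 0.02, 0.08],     # P(next state | current state)
                      [0.03, 0.85, 0.12],
                      [0.10, 0.10, 0.80]])
    init = np.array([0.3, 0.2, 0.5])

    # Hypothetical emission probabilities over a 4-letter alphabet.
    alphabet = {"A": 0, "L": 1, "G": 2, "V": 3}
    emit = np.array([[0.40, 0.35, 0.05, 0.20],   # helix formers favored in H
                     [0.15, 0.20, 0.10, 0.55],   # V favored in strands
                     [0.20, 0.15, 0.45, 0.20]])  # G favored in coil

    def posterior_states(seq):
        """Forward-backward: P(state at position i | whole sequence)."""
        obs = [alphabet[c] for c in seq]
        n, k = len(obs), len(states)
        fwd = np.zeros((n, k))
        bwd = np.zeros((n, k))
        fwd[0] = init * emit[:, obs[0]]
        for i in range(1, n):
            fwd[i] = (fwd[i - 1] @ trans) * emit[:, obs[i]]
        bwd[-1] = 1.0
        for i in range(n - 2, -1, -1):
            bwd[i] = trans @ (emit[:, obs[i + 1]] * bwd[i + 1])
        post = fwd * bwd
        return post / post.sum(axis=1, keepdims=True)

    print(posterior_states("ALLAVVGGLA").round(2))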







A Bayesian Analysis Of A Stock Selection Method Based On High-Yield Dow Jones Stocks

by
Sudip Bose
Dept. of Statistics
The George Washington University
2201 G. Street, NW
Washington, DC 20052
sudip@gw.edu

Abstract:

We carry out a Bayesian analysis of the performance of a stock selection method based on selecting high-yielding stocks from the Dow Jones Industrial Index (the Dow 30). In particular, we compare its return to that of the Dow Jones Index itself. Ever since John O'Higgins's book "Beating the Dow," which described three simple, related strategies based on Dow Jones stocks that combined high yield with low price, there has been much interest in such strategies, and mutual-fund families have even formed investment trusts for the general public based on such methods. We examine whether one particular such strategy does indeed outperform the Dow, and check the robustness of our conclusions.
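
One simple form such a comparison can take is sketched below with hypothetical annual returns (not our data): with paired yearly differences and a noninformative prior, the posterior of the mean excess return is a shifted, scaled Student-t, from which the probability that the strategy beats the Dow can be read off directly.

    import numpy as np

    rng = np.random.default_rng(3)

    # Hypothetical annual returns for the strategy and for the Dow.
    strategy = np.array([0.18, 0.07, 0.22, -0.04, 0.15, 0.11, 0.25, 0.02])
    dow      = np.array([0.12, 0.09, 0.28, -0.02, 0.10, 0.09, 0.20, 0.01])
    d = strategy - dow                     # paired yearly excess returns

    # Under a flat prior on (mean, log sd), the posterior of the mean excess
    # return is a shifted, scaled Student-t; sample from it directly.
    n = len(d)
    post_mean = d.mean() + d.std(ddof=1) / np.sqrt(n) * rng.standard_t(n - 1, size=10000)
    print(f"P(strategy beats the Dow on average) = {(post_mean > 0).mean():.3f}")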







A spatial application in Bayesian learning and identifiability

by
Lynn E. Eberly and Bradley P. Carlin
Division of Biostatistics
School of Public Health
University of Minnesota
420 Delaware St. SE, Box 303
Minneapolis, MN 55455
lynn@biostat.umn.edu

Abstract:

In this paper we analyze a dataset of county-level lung cancer rates in the state of Ohio during the period 1968--1988, and their possible relation to a nuclear fuel reprocessing plant near the Cincinnati metro area. In order to answer important related questions in environmental justice, complex spatio-temporal models are required. Here, separate random effects for capturing unstructured heterogeneity and spatial clustering are of substantive interest, even though only their sum is well-identified by the data. Often the quantity of interest is the posterior empirical proportion of variability due to each effect. Because of its empirical nature, it is not immediately clear from what sort of prior this quantity derives. We investigate whether our data can even inform about this quantity, i.e., whether Bayesian learning occurs. We conclude that Bayesian learning is possible in these settings, and discuss the impact of this result on our particular example, as well as on the practice of Bayesian spatial data analysis more generally.
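
The quantity in question can be computed directly from MCMC output, as in the following sketch (the draws here are simulated stand-ins, not our posterior): for each iteration one forms the empirical standard deviation of the clustering effects and of the heterogeneity effects, and the proportion psi is the share due to clustering.

    import numpy as np

    rng = np.random.default_rng(4)

    n_iter, n_counties = 2000, 88    # e.g., Ohio's 88 counties
    theta = rng.normal(0.0, 0.3, size=(n_iter, n_counties))   # heterogeneity draws
    phi = rng.normal(0.0, 0.6, size=(n_iter, n_counties))     # clustering draws

    # Per-iteration empirical proportion of variability due to clustering.
    sd_theta = theta.std(axis=1)
    sd_phi = phi.std(axis=1)
    psi = sd_phi / (sd_theta + sd_phi)
    print(f"posterior mean of psi: {psi.mean():.3f} "
          f"(95% interval {np.percentile(psi, 2.5):.3f}-{np.percentile(psi, 97.5):.3f})")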







A Hierarchical Model for the Genetical Analysis of Selection

by
Dave Higdon and Richard E. Miller
Institute of Statistics and Decision Sciences
and Department of Zoology, Duke University
Box 90251
Durham, NC 27708-0251
higdon@stat.duke.edu

Abstract:

Selection can be defined as the covariance between relative fitness and traits within a population of individuals. However, this covariance is often greatly influenced by the dependence of the focal traits and fitness on the local environment. An alternative approach recently developed by Rausher has shown that a genetical analysis corrects for this potential confounding influence of the local environment. The traditional application of this method has been first to estimate breeding values for both the traits and fitness using family means, and then to use these values in the selection analysis. Here we present an alternative statistical approach, developing a Bayesian hierarchical model for the genetical analysis of selection. We consider a case study that investigates the pattern of selection across four environments in a population of the common morning glory, Ipomoea purpurea. The focal trait here is plant size, measured as leaf area late in the growing season. Fitness is estimated both as seed number (female component of fitness) and flower number (male component of fitness). The four environments are competition treatments implemented to create a gradient from high to low competition. The plants used in this experiment come from a half-sib breeding design producing progeny of known parentage. The large number of individuals used in this experiment ($N = 3,240$) requires a large planting area; therefore, sources of unmeasured environmental variation need to be accounted for when estimating the breeding values for the traits and fitness. Our Bayesian hierarchical model comes in two stages: (i) a model for the data to estimate breeding values for plant size and fitness; and (ii) a model for the relationship between plant size and fitness estimated as breeding values. Such models have previously been tackled in two separate estimation steps: obtaining family estimates for fitness and the traits, and then examining the relationship between size and fitness using standard regression techniques involving these estimates. This two-step approach may not properly account for the structure and uncertainty in the problem. Our Bayesian hierarchical modeling approach handles both stages of this model simultaneously, allowing information, structure, and uncertainty to propagate between its various stages. In addition to its hierarchical structure, this model features a bivariate spatial model to more realistically account for environmental covariation in plant size and fitness.
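
For readers unfamiliar with the opening definition, the toy sketch below computes selection as the covariance between relative fitness and a trait, using hypothetical family-level breeding values; the point of the hierarchical model above is precisely to replace this two-step plug-in computation with a single joint analysis.

    import numpy as np

    rng = np.random.default_rng(5)

    n_fam = 40
    size_bv = rng.normal(100.0, 15.0, n_fam)     # leaf-area breeding values (made up)
    fitness = 50 + 0.8 * (size_bv - 100) + rng.normal(0, 10, n_fam)   # seed number

    rel_fitness = fitness / fitness.mean()       # relative fitness
    selection_differential = np.cov(rel_fitness, size_bv)[0, 1]
    print(f"selection differential on plant size: {selection_differential:.2f}")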







Bayesian Analysis of Failures in a Gas Distribution Network

by
Fabrizio Ruggeri and Antonio Pievatolo
CNR-IAMI
Via Ampere 56
20131 Milano, Italy
fabrizio@iami.mi.cnr.it

Abstract:

We present some findings from a consulting job on gas escapes in an urban gas distribution network. Poisson processes are used to describe the escapes in both cast-iron (homogeneous process) and steel (nonhomogeneous process) pipelines. Particular attention is devoted to the elicitation of the experts' opinions. Both design and maintenance of the network are considered, focusing on the effects of quantities such as the different sources of corrosion and the location of the pipes.
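
A minimal sketch of the two ingredients, with hypothetical counts, exposures, and elicited prior values: a conjugate gamma update for the homogeneous cast-iron escape rate, and a power-law intensity of the kind often used for a nonhomogeneous process such as aging steel.

    import numpy as np

    rng = np.random.default_rng(6)

    # Cast iron: n escapes over T pipe-km-years; expert opinion as Gamma(a, b).
    n_escapes, exposure = 42, 350.0
    a, b = 2.0, 20.0                   # elicited prior (shape, rate), hypothetical
    rate_draws = rng.gamma(a + n_escapes, 1.0 / (b + exposure), size=10000)
    print(f"cast-iron escape rate per km-yr: {rate_draws.mean():.3f}")

    # Steel: nonhomogeneous Poisson process with power-law intensity, so the
    # expected number of escapes by pipe age t is (t / theta)**beta.
    def expected_escapes(t, beta, theta):
        return (t / theta) ** beta

    print(f"expected escapes by age 30 (beta=1.5, theta=12): "
          f"{expected_escapes(30.0, 1.5, 12.0):.1f}")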







Bayesian Tools for EDA and Model Building: A Brainy Study

by
Steven N. MacEachern and Mario Peruggia
The Ohio State University
Department of Statistics
Columbus, OH 43210
peruggia@stat.ohio-state.edu

Abstract:

We consider a strategy for Bayesian model building that begins by fitting a simple, default model to the data. Numerical and graphical exploratory tools, based on summary quantities from the default fit, are used to assess the adequacy of the initial model and to identify directions in which the fit can be refined. We apply this strategy to build a Bayesian regression model for a classic set of data on brain and body weights of mammalian species. We discover inadequacies in the traditional regression model through use of our exploratory tools. More sophisticated models point the way toward judging the adequacy of a theory on the relationship between body weight and brain weight, and also bear on the timeless question ``Do we have big brains?''
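
The flavor of the default fit can be conveyed in a few lines (the three species' weights below are standard textbook values used only for illustration): an ordinary log-log regression of brain weight on body weight, whose residuals are exactly the kind of summary quantity the exploratory tools examine.

    import numpy as np

    # Body (kg) and brain (g) weights for three well-known species, used only
    # to illustrate the default log-log fit: human, elephant, mouse.
    body = np.array([62.0, 2547.0, 0.023])
    brain = np.array([1320.0, 4603.0, 0.4])

    x, y = np.log(body), np.log(brain)
    slope, intercept = np.polyfit(x, y, 1)      # ordinary least squares line
    resid = y - (intercept + slope * x)
    print(f"log-log slope: {slope:.2f}; residuals: {resid.round(2)}")
    # A large positive residual (here, the human) is one route into the
    # question "Do we have big brains?"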







Post-Stratification without Population Level Information on the Post-Stratifying Variable in Political Polling

by
Cavan Reilly and Andrew Gelman
Department of Statistics, Columbia University
618 Mathematics Building, New York, NY 10027
cavan@stat.columbia.edu

Abstract:

We develop a new method for constructing more precise estimates of a collection of population means using information about a related variable in the context of repeated sample surveys, and we illustrate this method using presidential approval data (our related variable is political party identification). We use post-stratification to construct these improved estimates, but since we do not have population-level information on the post-stratifying variable, we construct a model for the manner in which the post-stratifier develops over time. In this way, we obtain more precise estimates without making possibly untenable assumptions about the dynamics of our variable of interest, the presidential approval rating.
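
Post-stratification itself is a one-line computation once the stratum shares are available, as the sketch below shows with hypothetical poll numbers; the difficulty addressed in this paper is that the party-identification shares are not known at the population level and must therefore be modeled over time.

    import numpy as np

    # One survey's approval proportions cross-classified by party ID
    # (Democrat, Independent, Republican); all numbers hypothetical.
    approve_mean = np.array([0.88, 0.44, 0.12])
    party_share = np.array([0.36, 0.30, 0.34])   # modeled, not known from a census

    post_stratified = float(np.sum(party_share * approve_mean))
    print(f"post-stratified approval estimate: {post_stratified:.3f}")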







Variance Component Testing in Generalized Linear Mixed Models
with an Application to Natural Selection Studies

by
Sandip Sinharay and Hal Stern
Iowa State University
Department of Statistics
Ames IA 50011
ssray@iastate.edu

Abstract:

A generalized linear mixed model is applied to data from a natural selection study in which the probability of an animal's survival is modeled as a function of family (random effect) and physical characteristics (fixed effects). In this application, the magnitude of the variance component corresponding to the random effect is of scientific interest, as are the fixed effects. A number of approaches for approximating the Bayes factor comparing the models with and without random effects are reviewed. We also use simulated data to assess the performance of the different approaches.
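
The brute-force benchmark among such approaches is plain Monte Carlo integration of each model's marginal likelihood over its priors, sketched below for a toy random-intercept logistic survival model; all data and prior choices are hypothetical, and this naive estimator is exactly the kind of baseline that the more refined approaches reviewed in the paper improve upon.

    import numpy as np

    rng = np.random.default_rng(7)

    # Survival (0/1) for 10 families of 5 animals each (hypothetical).
    fam = np.repeat(np.arange(10), 5)
    y = rng.binomial(1, 0.5, size=50)

    def loglik(beta0, u):
        p = 1.0 / (1.0 + np.exp(-(beta0 + u[fam])))
        return np.sum(y * np.log(p) + (1 - y) * np.log(1 - p))

    def log_marginal_lik(random_effect, n_draws=20000):
        vals = np.empty(n_draws)
        for i in range(n_draws):
            beta0 = rng.normal(0, 2)                 # prior on the intercept
            if random_effect:
                sigma = abs(rng.normal(0, 1))        # half-normal prior on sd
                u = rng.normal(0, sigma, size=10)
            else:
                u = np.zeros(10)
            vals[i] = loglik(beta0, u)
        m = vals.max()
        return m + np.log(np.mean(np.exp(vals - m)))  # stable log-mean-exp

    bf = np.exp(log_marginal_lik(True) - log_marginal_lik(False))
    print(f"Bayes factor (random effects vs none): {bf:.2f}")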







Role of Context in Visual Perception: A Functional Magnetic Resonance Imaging Study

by
Thomas Nichols, Bill Eddy, Jay McClelland and Chris Genovese
Department of Statistics and
Center for Neural Basis of Cognition
Carnegie Mellon University
5000 Forbes Avenue
Pittsburgh, PA 15213
nicholst@stat.cmu.edu, bill@stat.cmu.edu, jlm@cnbc.cmu.edu, genovese@stat.cmu.edu

Abstract:

We used Functional Magnetic Resonance Imaging (fMRI) to investigate competing models of visual perception. One model, the ``bottom up'' model, predicts that context plays a post-perceptual role, outside the primary visual cortex (V1). The other model, the interactive activation model, predicts that context actually alters perception, a process that occurs in V1. We use an intriguing perceptual effect to look for context-dependent changes in V1. Working closely with a cognitive psychologist, we developed ambiguous visual stimuli (words flashed for 14 ms) that could be selectively disambiguated by context (sentences missing an obvious last word). Working with MR physicists, we created custom stimulus presentation software and hardware and determined MR acquisition parameters appropriate for high temporal resolution imaging of V1. We used comprehensive modeling for both preprocessing and inference: for preprocessing we used the methods implemented in FIASCO (Functional Imaging Analysis Software---Computational Oleo) to remove systematic variation due to both the imaging hardware and the subject; for inference we used both basic classical inference tools and the Bayesian models available in BRAIN (Bayesian Response Analysis and Inference for Neuroimaging). The modeling framework in BRAIN, presented at the 4th Case Studies in Bayesian Statistics Workshop, was originally intended for block-design fMRI (where stimuli are presented over 30-second to 2-minute intervals), but we show that it can be applied to the new event-related fMRI (where transient stimuli are used). We also demonstrate how the multiple comparison problem can be ameliorated by using functionally defined masks, which constrain the analysis to a small anatomical region.
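
The block-design versus event-related distinction mentioned above comes down to the shape of the expected BOLD response, sketched here with hypothetical timings: the stimulus train, whether a sustained block or a few transient events, is convolved with a hemodynamic response function (a simple gamma shape below, not the BRAIN model itself).

    import numpy as np
    from scipy.stats import gamma

    tr = 1.0                                  # seconds per scan
    t = np.arange(0.0, 20.0, tr)
    hrf = gamma.pdf(t, a=6)                   # gamma-shaped HRF, peak near 5 s

    scans = 120
    block = np.zeros(scans)
    block[20:50] = 1.0                        # one sustained 30 s block
    event = np.zeros(scans)
    event[[15, 40, 70, 95]] = 1.0             # brief, transient stimuli

    expected_block = np.convolve(block, hrf)[:scans]   # sustained plateau
    expected_event = np.convolve(event, hrf)[:scans]   # isolated HRF-shaped bumps
    print(expected_event[:30].round(2))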







Quantile Estimation for Soil Texture Profiles

by
Pam Abbitt
North Carolina State University
Campus Box 8203
Raleigh, NC 27695
abbitt@stat.ncsu.edu

Abstract:

The MLRA (Major Land Resource Area) 107 pilot project involved implementation of a multi-phase probability sampling design for updating the soil surveys for two counties in western Iowa. Many of the data collection items in the survey are recorded for each horizon (or layer) of soil. We consider estimation of quantiles for soil texture profiles using a hierarchical model and data from the pilot project. Soil horizon profiles are modeled as realizations of Markov chains. Conditional on the horizon profile, transformed field and laboratory determinations of soil texture are modeled as multivariate normal. The posterior distribution of unknown model parameters is numerically approximated using a Gibbs sampler. The hierarchical model provides a comprehensive framework which may be useful for analyzing many other variables of interest in the pilot project.
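
The first modeling layer, horizon profiles as Markov chain realizations, can be sketched in a few lines; the states and transition probabilities below are hypothetical, not those estimated in the pilot project.

    import numpy as np

    rng = np.random.default_rng(8)

    horizons = ["A", "B", "C"]
    trans = np.array([[0.0, 0.8, 0.2],   # P(next horizon | current = A)
                      [0.0, 0.3, 0.7],   # P(next horizon | current = B)
                      [0.0, 0.0, 1.0]])  # C is terminal in this toy chain

    def simulate_profile(max_len=6):
        seq = [0]                        # profiles start at the A horizon
        while len(seq) < max_len and seq[-1] != 2:
            seq.append(rng.choice(3, p=trans[seq[-1]]))
        return [horizons[s] for s in seq]

    print([simulate_profile() for _ in range(3)])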







The Hierarchical Rater Model for Rated Test Items

by
Matthew Johnson
Carnegie Mellon University
Division of Statistics
Pittsburgh, PA 15213
masjohns@stat.cmu.edu

Abstract:

Multiple ratings of open-ended test items requiring subjective scoring [e.g., essays, student artwork] have become increasingly popular for many standardized tests [e.g., the Advanced Placement exams of the Educational Testing Service]. This increasing prevalence has posed a challenge in settings where item response theory (IRT) is the primary method of analysis and/or scoring (Bock, Brennan, and Muraki, 1998). Patz (1996) introduced and Junker and Patz (1998) developed the Hierarchical Rater Model (HRM) in part to address the difficulties created by IRT's strong conditional independence assumptions. The HRM treats examinee responses to open-ended items as {\em unobserved} discrete variables, and it explicitly models the ``proficiency'' of raters in assigning accurate scores as well as the proficiency of examinees in providing correct responses. As a result, the HRM overcomes the problems of double counting information in multiple ratings in much the same way that traditional IRT overcomes the problem of double counting information in multiple student responses---by introducing an unobserved (latent) variable (the ``true'' item score) that explains the dependence present in multiple ratings of the same student response. We will describe the HRM in detail, compare it to alternative approaches, and present several applications in which the HRM is fitted using Markov Chain Monte Carlo techniques (e.g., Patz and Junker, 1997a,b).
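
The rating layer of such a model can be sketched very simply (the scale, true score, and rater spreads below are invented for illustration, not the HRM's fitted parameterization): conditional on the unobserved true item score, each rater's observed rating is drawn from a distribution centered at that score, with a spread reflecting the rater's proficiency.

    import numpy as np

    K = 5                                    # score categories 0..4
    def rating_probs(xi, rater_sd):
        """P(observed rating | true item score xi) for one rater."""
        scores = np.arange(K)
        w = np.exp(-0.5 * ((scores - xi) / rater_sd) ** 2)
        return w / w.sum()

    xi = 3                                   # latent "true" score of one response
    print("proficient rater:", rating_probs(xi, 0.4).round(2))
    print("unreliable rater:", rating_probs(xi, 1.5).round(2))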







Characterization of Arsenic Occurrence in US Drinking Water Treatment Facility Source Waters

by
John R. Lockwood III, Mark J. Schervish, Patrick Gurian, Mitchell J. Small
Carnegie Mellon University
Department of Statistics
Pittsburgh, PA 15213
jlock@stat.cmu.edu

Abstract:

The 1996 amendments to the US Safe Drinking Water Act (SDWA) mandate revision of current maximum contaminant levels (MCLs) for various harmful substances in public drinking water supplies. The determination of a revised MCL for any contaminant must reflect a judicious compromise between the potential benefits of lowered exposure and the feasibility of obtaining such levels. This evaluation is made as part of a regulatory impact assessment (RIA) requiring detailed information about the occurrence of the contaminant and the costs and efficiencies of the available treatment technologies. Our work focuses on the first step of this process, using a collection of data sources to model arsenic occurrence in treatment facility source waters as a function of system characteristics such as source water type, location and size. We fit Bayesian hierarchical models to account for the spatial aspects of arsenic occurrence as well as to characterize uncertainty in our estimates. After model selection based on cross-validation predictive densities, we use a national census of treatment systems and their associated covariates to predict the national distribution of raw water arsenic concentrations. We then examine the relationship between proposed MCLs and the number of systems requiring treatment and identify classes of systems which are most likely to be problematic. The posterior distribution of the model parameters, obtained via Markov Chain Monte Carlo, allows us to quantify the uncertainty in our predictions.
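
The final prediction step has the following flavor (a drastically simplified, hypothetical lognormal occurrence model, not our fitted hierarchical model): posterior draws for the occurrence distribution translate directly into the expected fraction of systems exceeding any proposed MCL.

    import numpy as np
    from scipy.stats import norm

    rng = np.random.default_rng(10)

    # Stand-in posterior draws for the mean and sd of log arsenic (ug/L).
    mu_draws = rng.normal(0.3, 0.05, size=4000)
    sigma_draws = np.abs(rng.normal(1.1, 0.05, size=4000))

    for mcl in [2.0, 5.0, 10.0]:
        # P(concentration > MCL), averaged over posterior uncertainty.
        exceed = 1.0 - norm.cdf((np.log(mcl) - mu_draws) / sigma_draws)
        print(f"MCL {mcl:>4} ug/L: expected fraction of systems above = {exceed.mean():.3f}")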







Hidden Stochastic Models for Biological Rhythm Data

by
Howard Seltman
Carnegie Mellon University
Division of Statistics
Pittsburgh, PA 15213
hseltman@stat.cmu.edu

Abstract:

Parameters of the biological rhythm of the hormone cortisol are estimated using a compartmental model with a hidden stochastic process. The physiological processes of secretion and elimination are separately modeled, and the net concentration is obtained by the convolution of secretion and elimination. Basal and active rates of secretion are represented as a two state hidden Markov chain. The transition probability from basal to active states is modeled as a logit cosinor curve. The development of a Markov chain Monte Carlo procedure to sample the posterior distribution is presented. The use of a compartmental model with a periodic hidden stochastic process offers a new approach that directly allows testing of hypotheses relating to alterations in the underlying physiological components of a biological rhythm.
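
The logit-cosinor transition curve at the heart of the model looks like this (all parameter values are hypothetical): the probability of switching from basal to active secretion rises and falls over the 24-hour day.

    import numpy as np

    def p_basal_to_active(t_hours, a=-2.0, b=1.5, acrophase=6.0):
        """Logit-cosinor: logit P = a + b * cos(2*pi*(t - acrophase) / 24)."""
        logit = a + b * np.cos(2.0 * np.pi * (t_hours - acrophase) / 24.0)
        return 1.0 / (1.0 + np.exp(-logit))

    for hour, p in zip(range(0, 24, 3), p_basal_to_active(np.arange(0, 24, 3))):
        print(f"hour {hour:2d}: P(basal -> active) = {p:.3f}")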







Vote Tampering in a District Justice Election in Beaver County, PA

by
Illaria DiMatteo and Joseph B. Kadane
Carnegie Mellon University
Division of Statistics
Pittsburgh, PA 15213
dimatteo@stat.cmu.edu

Abstract:

In an election on November 2, 1993, candidate Joseph Zupsic defeated Dolores Laughlin, on the first count, by a 36-vote margin for the office of District Justice in Beaver County, PA. After a second count, requested by the apparently defeated candidate, the result of the election was reversed: Mrs. Laughlin beat Mr. Zupsic by a 46-vote margin. How can it be determined whether the change in votes between the two counts was random or the result of vote tampering? The data can be thought of as a realization of an aggregate Markov chain in which only the aggregate behavior of the ballots can be observed in each precinct (namely, the total number of ballots for each candidate after each count). To estimate the transition probabilities for a single ballot, we augment the data. We also use the other races in the election to determine the "normal" errors in the counting mechanism. A hierarchical model is used to describe these data, and Markov Chain Monte Carlo is used to estimate the posterior distribution of the parameters in the model. The conclusions we draw lead us to believe that vote tampering occurred between the two counts. Furthermore, the conclusions of our model agree with the legal decision made in the case.
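
A back-of-the-envelope version of the null benchmark, with a hypothetical per-ballot recount error rate (not the rates estimated from the other races), shows why such an analysis has power: under independent small counting errors, the plausible swing in a candidate's margin is far smaller than the 82-vote swing observed here (36 one way, then 46 the other).

    import numpy as np

    rng = np.random.default_rng(11)

    n_ballots = 10000          # ballots in the race (hypothetical)
    p_flip = 0.001             # per-ballot recount error rate (hypothetical)

    swings = np.empty(20000)
    for i in range(20000):
        to_a = rng.binomial(n_ballots // 2, p_flip)   # B ballots misread as A
        to_b = rng.binomial(n_ballots // 2, p_flip)   # A ballots misread as B
        swings[i] = 2 * (to_b - to_a)                 # change in A's margin
    print(f"99.9th percentile of |swing| under random error: "
          f"{np.percentile(np.abs(swings), 99.9):.0f} votes")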







Bayesian Analysis of Circular Data

by
Sanjib Basu
Northern Illinois University
Division of Statistics
DeKalb, IL 60115
basu@niu.edu

Abstract:

Circular data represent directions in two dimensions. Such data arise, for example, as the vanishing directions of pigeons released some distance away from their ``home''. The underlying scientific question relates to how these birds orient themselves: are they flying towards their ``home direction''? Unimodality here would imply that pigeons have a preferred vanishing direction, a hypothesis of considerable scientific interest. As another example, several stations measure the mean wave direction every hour, corresponding to the dominant wave energy over that period. The wave directions depend on weather conditions, ocean currents, and many other natural factors. The daily variation of the wave directions is an example of circular data on a 24-hour cycle. The hypothesis of unimodality here would imply that there is an overall preferred direction around which the daily variations of the wave directions are distributed. We propose a Bayesian test for unimodality of circular data using mixtures of von Mises distributions as the alternative. The proposed test is based on Markov Chain Monte Carlo methodology. We illustrate its application in several examples.
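
The alternative model named above, a mixture of von Mises distributions, is easy to write down; the sketch below evaluates a two-component mixture density on the circle (weights, means, and concentrations are hypothetical).

    import numpy as np
    from scipy.stats import vonmises

    def mixture_pdf(theta, w=0.6, mu1=0.0, kappa1=4.0, mu2=np.pi, kappa2=2.0):
        """Density of a two-component von Mises mixture at angle theta (radians)."""
        return (w * vonmises.pdf(theta, kappa1, loc=mu1)
                + (1.0 - w) * vonmises.pdf(theta, kappa2, loc=mu2))

    grid = np.linspace(-np.pi, np.pi, 9)
    print(mixture_pdf(grid).round(3))

A Bayesian test of unimodality then asks whether the posterior favors a genuinely bimodal mixture over configurations whose components merge into a single mode.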

