This report surveys several statistical models and computational methods that might underlie an assessment system designed to yield inferences relevant to a rich cognitive theory of instruction, learning, or achievement. That is to say, we are interested in a deeper account of the features of student performance than which items or tasks a student got right, and in a richer description of those features than a number-right score.
All of the models discussed in this report, from factor analysis to item response theory (IRT) models, latent class models, Bayesian networks, and beyond, should be thought of as special cases of the same hierarchical model-building framework, well illustrated recently by the textbook of Gelman, Carlin, Stern and Rubin (1995). They are models in which latent variables of persistent interest---which may be continuous, ordered, dichotomous, etc., and whose relationships to one another depend on the cognitive theory driving the model---are posited to drive the probabilities of observations, which may likewise vary in nature and interrelationship. The functions linking observations to latent variables vary as appropriate to the nature and interrelationships of the observed and latent variables, and the models are made tractable by many assumptions of conditional independence, especially among observations given the latent variables.
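As a minimal sketch of this shared framework (the notation here is generic and not tied to any particular model in the report), let $X_{ij}$ denote the response of student $i$ to task $j$, $\theta_i$ the latent variable or variables for student $i$, $\beta_j$ the parameters of task $j$, and $\lambda$ population-level hyperparameters:
\begin{align*}
  \lambda &\sim p(\lambda)
    && \text{(hyperprior on population parameters)}\\
  \theta_i \mid \lambda &\sim p(\theta_i \mid \lambda)
    && \text{(latent student variables)}\\
  X_{ij} \mid \theta_i, \beta_j &\sim p(x_{ij} \mid \theta_i, \beta_j)
    && \text{(link from latent variables to observations)}\\
  p(x_{i1}, \ldots, x_{iJ} \mid \theta_i) &= \prod_{j=1}^{J} p(x_{ij} \mid \theta_i, \beta_j)
    && \text{(conditional independence of observations)}
\end{align*}
Whether $\theta_i$ is a single continuous proficiency, a vector of discrete skill indicators, or something in between, and what form the link $p(x_{ij} \mid \theta_i, \beta_j)$ takes, are precisely the choices that distinguish the models surveyed here.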
In the past these various models have been treated and discussed as if they were quite distinct. This is due in part to the fact that, until recently, computational methods lagged far behind model building, so that for years the sometimes idiosyncratic estimation methods first seen as making each model tractable in applications ``stuck'' to that model, enhancing the appearance of distinction among the models; and in part to the historical accident that these various models were originally proposed to solve rather different-sounding problems. Rapid progress over the past two decades in computational methods, fueled by faster computing machinery, and a better sense of the wide applicability of a core methodology to problems from human demography to theoretical physics, fueled by a revolution in communication within and between the disciplines, have encouraged and confirmed the view that most statistical models arise from a single framework in which the model is built up from a hierarchy of conditional probability statements.
Within the hierarchical model-building framework, this report tries to illustrate the continuum from IRT-like statistical models, which focus on a simple theory of student performance involving one or a few continuous latent variables coding for ``general propensity to do well'' on the one hand, to statistical models embodying a more complex theory of performance, involving many discrete latent variables coding for different skills, pieces of knowledge, and other features of cognition underlying observable student performance, on the other. Some of the most interesting work on these latter types of models has been done in the context of intelligent tutoring systems (ITSs), and related diagnostic systems for human teachers, where a finer-grained model of student proficiency is often needed to guide the tutor's next steps. However, this extra detail in modeling can mean that no one inference is made very reliably. This may be an acceptable price to pay in an ITS, where the cost of underestimating a student's skills may be low (perhaps costing only the time it takes the student to successfully complete one or two more tasks tapping a particular skill, to raise the ITS's confidence that the skill has been mastered). Low reliability is not acceptable, however, in high-stakes, limited-testing-time assessments, in which, for example, attendance in summer school, job or grade promotions, and entrance into college or other educational programs may be affected by inferences about the presence or absence of particular skills and knowledge.
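To make the two ends of this continuum concrete, consider two standard examples (illustrative parameterizations, not necessarily the specific forms adopted later in the report): a unidimensional two-parameter logistic IRT model, with a single continuous proficiency $\theta_i$ per student, and a conjunctive discrete-skills model of the DINA type, with a vector of skill indicators $\alpha_i = (\alpha_{i1},\ldots,\alpha_{iK})$ and a matrix $Q = (q_{jk})$ recording which skills each task requires:
\begin{align*}
  \text{(IRT, 2PL):}\quad
    & P(X_{ij} = 1 \mid \theta_i)
      = \frac{\exp\{a_j(\theta_i - b_j)\}}{1 + \exp\{a_j(\theta_i - b_j)\}},\\
  \text{(conjunctive skills, DINA):}\quad
    & P(X_{ij} = 1 \mid \alpha_i)
      = (1 - s_j)^{\eta_{ij}}\, g_j^{\,1-\eta_{ij}},
      \qquad \eta_{ij} = \prod_{k=1}^{K} \alpha_{ik}^{\,q_{jk}}.
\end{align*}
Here $a_j$ and $b_j$ are task discrimination and difficulty parameters, $\eta_{ij}$ indicates whether student $i$ possesses every skill that task $j$ requires, and $s_j$ and $g_j$ are slip and guessing probabilities. The first model supports reliable inference about one overall propensity; the second supports finer-grained but, per skill, typically less reliable inferences about the individual $\alpha_{ik}$.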
It is critically important to attend to the specific purposes of assessment in thinking about these various modeling approaches. Mislevy (e.g., Mislevy, 1994) has proposed an approach to the problem of managing the complex decisions one must make in developing an assessment system. One begins with the purpose of the assessment within the context of a particular application or task domain, and identifies a set of desired inferences to be made, as well as a framework for thinking about the domain of the assessment: what aspects of the domain are important, the scope and breadth of the assessment within the larger domain, the grain size of both tasks and underlying cognitive features, and so forth. These desired inferences and this initial framework lead to a model for student proficiency. Finally, with an understanding of the purpose, of the domain in light of that purpose, and of the student model as constructed to allow the desired inferences to be made, the task and evidence models can be constructed: from the desired inferences, one determines the kinds of observable evidence needed to support them, and then designs tasks that afford opportunities for gathering that evidence. This report illustrates the interplay between model detail and reliability of inference. Balancing the detail of modeling suggested by the purposes of assessment and the theory of student performance, on the one hand, with the computational tractability and inferential reliability of the resulting model, on the other, is where the art of assessment modeling remains.
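The following is a purely illustrative sketch, in Python, of the design chain just described; all class and field names are hypothetical and are not taken from Mislevy's framework or from the body of this report. Its only point is to show where the grain-size decision (one overall proficiency versus many discrete skills) enters the design, since that decision drives the detail-versus-reliability tradeoff discussed above.

from dataclasses import dataclass
from typing import List

@dataclass
class StudentModel:
    # Latent variables the assessment is meant to support inferences about.
    # Grain size is the key design choice: one overall proficiency vs. many discrete skills.
    latent_variables: List[str]

@dataclass
class EvidenceModel:
    # Observable features of performance, and the latent variables they bear on.
    observables: List[str]
    linked_latent_variables: List[str]

@dataclass
class TaskModel:
    # Tasks designed to afford opportunities to produce the needed observables.
    description: str
    observables_afforded: List[str]

@dataclass
class AssessmentDesign:
    # Purpose and desired inferences come first; the other models follow from them.
    purpose: str
    desired_inferences: List[str]
    student_model: StudentModel
    evidence_models: List[EvidenceModel]
    task_models: List[TaskModel]

# Hypothetical coarse-grained design: a single proficiency supports a few reliable
# inferences. Replacing it with many discrete skills buys diagnostic detail at the
# cost of reliability per inference, given fixed testing time.
design = AssessmentDesign(
    purpose="placement into an algebra course",
    desired_inferences=["is the student ready for Algebra II?"],
    student_model=StudentModel(latent_variables=["overall algebra proficiency"]),
    evidence_models=[EvidenceModel(observables=["item correct/incorrect"],
                                   linked_latent_variables=["overall algebra proficiency"])],
    task_models=[TaskModel(description="solve a linear equation",
                           observables_afforded=["item correct/incorrect"])],
)
print(design.purpose, "->", design.desired_inferences)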
Brian Junker
Department of Statistics
232 Baker Hall
Carnegie Mellon University
Pittsburgh PA 15213 USA

Phone: (412) 268-2718
FAX: (412) CMU-STAT or (412) 268-7828
Email: brian@stat.cmu.edu
WWW: http://www.stat.cmu.edu/~brian/