A Case Study in Reproducibility: Detecting Data Analysis Patterns in Text and Graphs to Characterize Student Workflows


As part of a revamp of the general education introductory statistics course at Carnegie Mellon, an interactive data explorer platform was built that allows students to engage in the entire data analysis workflow without relying on a particular programming language. Its functionality includes tracking student actions and storing their answers, including responses to open-ended questions in which students describe graphs and interpret results. Under the assumption that text gives a richer picture of student comprehension than a right/wrong multiple-choice question, we use clustering procedures to compare the topics, semantics, and complexity structure of student answers in lab sessions over the course of the semester, as well as in their final data analysis reports. Rather than relying on topic modeling or natural language processing alone, we identify the relationship between students' descriptive text and their analysis decisions. This allows us to flag students who answered ‘differently’ and use their actions, such as which graph they created, to help us understand ‘why’. We discuss the implications of our results for gaining insight into how students from different backgrounds approach introductory statistics and data analysis, potentially establishing a first-step autograder, and improving our overall understanding of the science of data science.

Stony Brook, NY
Ron Yurko
PhD Student in Statistics & Data Science

My research interests include developing selective inference methodology for applications in statistical genetics and genomics, as well as clustering problems and research on statistics in sports.