This project will begin on Monday, June 13th, and conclude with a 10-15 minute presentation on Friday, June 24th (either during the morning session from 10:30 AM to 12:00 PM or in the afternoon session from 1:30 to 3:00 PM). The goal of this project is to practice understanding the structure of a dataset and generating and evaluating hypotheses using fundamental EDA and data visualization techniques.
Your team is expected to produce R Markdown slides (an example template will be provided shortly) to accompany your 10-15 minute presentation with the following information:
Explanation of the data structure of the dataset,
Three hypotheses you are interested in exploring,
Three data visualizations exploring the hypotheses, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization.
One clustering example,
Conclusions reached for the hypotheses based on your EDA and data visualizations.
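As one possible starting point for the clustering example, the sketch below runs hierarchical clustering in base R. The four rows of values are invented for illustration and are not taken from the assigned data; only the two column names are borrowed from the dataset. Complete linkage and two clusters are arbitrary choices here, not requirements of the project.

```r
# Toy values (invented) standing in for two continuous variables.
toy <- data.frame(
  AverageMotherAge   = c(24.1, 24.5, 31.2, 30.8),
  AverageBirthWeight = c(3100, 3150, 3350, 3300)
)

# Standardize so that grams do not dominate years in the distance calculation.
scaled <- scale(toy)

# Hierarchical clustering on Euclidean distances with complete linkage.
tree <- hclust(dist(scaled), method = "complete")

# Cut the dendrogram into two clusters (k = 2 is an arbitrary choice).
clusters <- cutree(tree, k = 2)
print(clusters)
```

Standardizing first matters because the variables are on very different scales; without it, birth weight would drive nearly all of the distance between rows.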
There will be two submission deadlines:
Friday, June 17th @ 5:00 PM EST - Each student will push their individual code for the project thus far to their GitHub accounts for review. We will then provide feedback on the code submitted.
Thursday, June 23rd @ 11:59 PM EST - Slides and full code must be completed and ready for presentation. Send your slides to Prof Yurko’s email (ryurko@andrew.cmu.edu). All code, visualizations, and presentations must be made in R. Take advantage of examples from lecture and the presentation template, but also feel free to explore material online that may be relevant!
Your team is assigned the Maternal Health Care Disparities data. The Centers for Disease Control and Prevention WONDER program helps track information related to birth records, parent demographics and risk factors, pregnancy history and pre-natal care characteristics. This data source could help identify combinations of risk factors more commonly associated with adverse outcomes which could then be utilized to develop better pre-natal care programs or targeted interventions to reduce disparities and improve patient outcomes across all ethnicities.
The dataset is a sample of available 2019 birth records from the CDC WONDER database, aggregated by state and by four conditions: the number of prior births now deceased, and whether or not the mother smoked, had pre-pregnancy diabetes, or had pre-pregnancy hypertension. For example, the first row corresponds to the births to women in Alabama who had no prior births now deceased, smoked, and had both diabetes and hypertension before pregnancy. There were 12 such births, and the remaining variables (e.g. mother’s age) describe the mothers and infants in that set of 12.
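To make the aggregated structure concrete, here is a small sketch in base R. It builds a toy data frame rather than reading the real file: the first row mirrors the Alabama example above (12 births), while the other rows, the AverageMotherAge values, and the way conditions are coded ("Yes"/"No", 0) are assumptions made for illustration only.

```r
# Toy stand-in for the real data. Only the first row's State, conditions,
# and Births count come from the description; everything else is invented.
births <- data.frame(
  State                    = c("Alabama", "Alabama", "Alaska"),
  PriorBirthsNowDeceased   = c(0, 0, 0),
  TobaccoUse               = c("Yes", "No", "No"),
  PrePregnancyDiabetes     = c("Yes", "No", "No"),
  PrePregnancyHypertension = c("Yes", "No", "No"),
  Births                   = c(12, 40000, 8000),
  AverageMotherAge         = c(30.5, 27.8, 28.9)
)

# Inspect column types: one row per state-by-condition group, not per birth.
str(births)

# Because rows are groups, the number of births is the sum of Births,
# not the number of rows.
total_births <- sum(births$Births)

# Number of condition groups contributed by each state.
groups_per_state <- table(births$State)
print(total_births)
print(groups_per_state)
```

The main thing to take away is that `nrow(births)` counts groups, so any count of births has to go through the `Births` column.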
State: State
PriorBirthsNowDeceased: The number of prior births now deceased
TobaccoUse: Whether or not the mother uses tobacco products
PrePregnancyDiabetes: Whether or not the mother had diabetes prior to becoming pregnant
PrePregnancyHypertension: Whether or not the mother had hypertension prior to becoming pregnant
Births: The number of births in that state with a defined combination of the previous four conditions (PriorBirthsNowDeceased, TobaccoUse, PrePregnancyDiabetes, PrePregnancyHypertension)
AverageMotherAge: The average mother’s age for the corresponding group of births
AverageBirthWeight: The average birth weight in grams for the corresponding group of births
AveragePrePregnancyBMI: The average pre-pregnancy BMI of the mother for the corresponding group of births
AverageNumberPrenatalVisits: The average number of prenatal visits of the mother for the corresponding group of births
AverageIntervalSinceLastBirth: The average length of time since the last birth for the corresponding group of births
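Since each "Average..." variable summarizes a group of a different size, combining groups (for example, into a state-level average) arguably calls for weighting by Births rather than taking a plain mean of the rows. The sketch below shows one way to do that in base R; the numbers are invented for illustration, and only the column names come from the dataset.

```r
# Toy values (invented): two condition groups per state.
births <- data.frame(
  State              = c("Alabama", "Alabama", "Alaska", "Alaska"),
  Births             = c(12, 988, 100, 300),
  AverageBirthWeight = c(3000, 3300, 3200, 3400)
)

# Weighted mean of the group averages within each state,
# using the number of births in each group as the weight.
state_avg <- sapply(
  split(births, births$State),
  function(d) weighted.mean(d$AverageBirthWeight, w = d$Births)
)

# Unweighted mean of the rows, shown only for comparison.
naive_avg <- tapply(births$AverageBirthWeight, births$State, mean)

print(state_avg)
print(naive_avg)
```

In this toy example the small 12-birth group pulls the unweighted Alabama mean down to 3150 grams, while the weighted value stays close to the much larger group at 3296.4 grams.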