This project will begin on Monday, June 13th, and conclude with a 10-15 minute presentation on Friday, June 24th (either during the morning session from 10:30 AM to 12:00 PM or during the afternoon session from 1:30 to 3:00 PM). The goal of this project is to practice understanding the structure of a dataset, and to practice generating and evaluating hypotheses using fundamental EDA and data visualization techniques.
Your team is expected to produce R Markdown slides (an example template will be provided shortly) to accompany your 10-15 minute presentation with the following information:
Explanation of the data structure of the dataset,
Three hypotheses you are interested in exploring,
Three data visualizations exploring the hypotheses, at least two of which must be multivariate. Each visualization must be in a different format from the other two, and you must have at least one categorical and one continuous visualization.
One clustering example,
Conclusions reached for the hypotheses based on your EDA and data visualizations.
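As a starting point for the clustering requirement, here is a minimal sketch in base R. It is one possible approach, not the expected solution: the toy data frame is made up for illustration, only the column names `NumberClaims` and `TotalDrugCost` come from the dataset, and both the choice of k-means and the value `k = 3` are arbitrary assumptions.

```r
# Toy stand-in for the project data; values are simulated for illustration only.
set.seed(2022)
toy <- data.frame(
  NumberClaims  = c(rpois(20, 15), rpois(20, 60), rpois(20, 200)) + 11,
  TotalDrugCost = c(rlnorm(20, 6, 0.5), rlnorm(20, 8, 0.5), rlnorm(20, 10, 0.5))
)

# Counts and costs tend to be right-skewed, so log-transform them, then
# standardize so that neither variable dominates the distance calculation.
features <- scale(log(toy))

# k = 3 is an arbitrary choice for this sketch, not a recommendation.
fit <- kmeans(features, centers = 3, nstart = 25)
toy$cluster <- factor(fit$cluster)

# Cluster sizes, and a quick look at the clusters on the log scale.
table(toy$cluster)
plot(log(toy$NumberClaims), log(toy$TotalDrugCost),
     col = fit$cluster, pch = 19,
     xlab = "log(NumberClaims)", ylab = "log(TotalDrugCost)")
```

On the real data you would want to justify which variables you cluster on, whether a transformation is appropriate, and how you chose the number of clusters.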
There will be two submission deadlines:
Friday, June 17th @ 5:00 PM EST - Each student will push their individual code for the project thus far to their GitHub accounts for review. We will then provide feedback on the code submitted.
Thursday, June 23rd @ 11:59 PM EST - Slides and full code must be completed and ready for presentation. Send your slides to Prof. Yurko’s email (ryurko@andrew.cmu.edu).
All code, visualizations, and presentations must be made in R. Take advantage of examples from lecture and the presentation template, but also feel free to explore material online that may be relevant!
Your team is assigned the Medicare Part D Prescription Claims data. Under the Medicare Part D Prescription Drug program, information is tracked for opioids and other drugs prescribed by physicians and other health care providers including the number of prescriptions dispensed (original prescriptions and refills), the total drug cost, beneficiary demographics (65+), related claims information, as well as information about the physician/provider such as their specialization and location. Your sample of data is proportionally sampled across the states (e.g. 5% from each state), and includes the following columns:
NPI: National Provider Identifier for the performing provider on the claim
LastName: Provider last name
FirstName: Provider first name
City: The city where the provider is located
State: The state where the provider is located
Specialty: The specialty of the provider derived from the Medicare code reported on the claims
BrandName: Brand name of the drug filled
GenericName: Generic name/chemical ingredient of the drug filled
NumberClaims: Number of Medicare Part D claims filled (includes original prescriptions and refills)
Number30DayFills: Aggregate number of Medicare Part D standardized 30-day fills (number of days supplied divided by 30; if < 1.0, bottom-coded as 1.0; if > 12.0, top-coded as 12.0)
NumberDaysSupply: Aggregate number of day’s supply for which the drug was dispensed
TotalDrugCost: Aggregate drug cost paid for all associated claims
NumberMedicareBeneficiaries: Total number of unique Medicare Part D beneficiaries with at least one claim for the drug
NumberClaims65Older: Number of Medicare Part D claims for beneficiaries age 65 and older
Number30DayFills65Older: Number of Medicare Part D standardized 30-day fills for beneficiaries age 65 and older (see Number30DayFills for the standardized definition)
TotalDrugCost65Older: Aggregate total drug cost paid for all associated claims for beneficiaries age 65 and older
NumberDaysSupply65Older: Aggregate number of day’s supply for which this drug was dispensed, for beneficiaries age 65 and older
NumberMedicareBeneficiaries65Older: Number of unique Medicare Part D beneficiaries age 65 and older with at least one claim for the drug
Type: Type of drug, either Brand or Generic
OpioidFlag: Whether or not the drug is an opioid
SpecialtyCateg: Provider specialty in broader categories (see the Specialty variable)
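The standardized 30-day fill definition used by Number30DayFills can be easier to follow with a small worked example. The sketch below is one reading of that definition, assuming the coding is applied to each claim before the fills are aggregated; the `days_supply` values are hypothetical and do not come from the dataset.

```r
# Hypothetical days supplied for four individual claims (not real data).
days_supply <- c(7, 30, 90, 400)

# Divide by 30, bottom-code anything below 1.0 to 1.0,
# and top-code anything above 12.0 to 12.0.
fills_30day <- pmin(pmax(days_supply / 30, 1), 12)
fills_30day
# 1 1 3 12

# Aggregate across claims, as the column description suggests.
sum(fills_30day)
# 17
```

Because of the bottom- and top-coding, Number30DayFills will not always equal NumberDaysSupply divided by 30, which is worth keeping in mind when comparing the two columns.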