
36-315: Statistical Graphics and Visualizations - Final Report
By Aanika Schueler, Kina Paguyo, Fritz Sanger
Introduction:
This dataset, which we found on Kaggle, contains information about songs in the
Spotify Top 200 since 2016. It contains 19 variables and 6513 rows. The variables are id
(Spotify ID for the track), artist_names (names of the artists), track_name (name of the song),
source (record label), key (key the track is in), mode (modality: major or minor),
time_signature (number of beats per measure), danceability (how suitable a track is for
dancing, from 0-1), energy (measure of intensity and activity, from 0-1), speechiness
(presence of spoken words in a track, from 0-1), acousticness (confidence that the track is
acoustic, from 0-1), instrumentalness (likelihood that a track contains no vocals, from 0-1),
liveness (presence of an audience in the recording, from 0-1), valence (musical positiveness,
from 0-1), loudness (loudness in decibels), tempo (beats per minute), duration_ms (length of
the song in milliseconds), weeks_on_chart (number of weeks the track was in the Top 200), and
streams (number of streams the song had while on the charts). Each row corresponds to one song.
Our three research questions are:
1) What variables in this dataset are good predictors of the number of streams, and how do
they relate to one another?
2) What themes/words are commonly used in song titles, and are songs with negative
themes more popular than songs with positive themes?
3) What variables in this dataset are good predictors of a song’s longevity in the Top 200?
How do these variables relate to one another?
Analysis of Indicators of Highly Streamed Songs
To address our first research question and identify the important variables in our dataset,
we fit a regression model of streams on the quantitative variables. We then ran a Best GLM in R
to determine which predictors to keep in our final model. The selection is based on the Akaike
Information Criterion (AIC), an estimator of a model's prediction error that penalizes each
additional parameter, and the procedure selects the predictors that yield the lowest AIC value.
We find that energy, speechiness, liveness, valence, loudness, and duration_ms are the best
predictors, as shown in the R code and output below.
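To illustrate how AIC-based selection works in general, the following is a minimal sketch that is separate from our own analysis. It uses simulated data rather than the Spotify dataset, and a base-R exhaustive search over predictor subsets rather than the Best GLM routine. The data frame sim, the helper best_subset_aic, and the simulated coefficients are all hypothetical and chosen only for illustration.

```r
# Simulated stand-in data (NOT the Spotify dataset): by construction, streams
# depends on energy and loudness only, and valence is pure noise.
set.seed(1)
n <- 200
sim <- data.frame(
  energy   = runif(n),
  loudness = rnorm(n, mean = -6, sd = 2),
  valence  = runif(n)
)
sim$streams <- 5 + 3 * sim$energy + 0.8 * sim$loudness + rnorm(n)

# Hypothetical helper: fit lm() on every non-empty subset of the predictors
# and return the candidate models ranked by AIC (lowest first).
best_subset_aic <- function(data, response) {
  predictors <- setdiff(names(data), response)
  results <- list()
  for (k in seq_along(predictors)) {
    for (vars in combn(predictors, k, simplify = FALSE)) {
      fit <- lm(reformulate(vars, response = response), data = data)
      results[[length(results) + 1]] <- data.frame(
        model = paste(vars, collapse = " + "),
        aic   = AIC(fit),
        stringsAsFactors = FALSE
      )
    }
  }
  out <- do.call(rbind, results)
  out[order(out$aic), ]
}

ranking <- best_subset_aic(sim, "streams")
print(ranking)                 # all 7 candidate models with their AIC values
best_model <- ranking$model[1] # predictor set with the lowest AIC
```

An exhaustive search like this fits 2^p - 1 models for p predictors, which is feasible for a handful of variables but grows quickly as more are added.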