36-315: Statistical Graphics and Visualizations - Final Report
By Aanika Schueler, Kina Paguyo, Fritz Sanger
Introduction:
This dataset, which we found on Kaggle, contains information about songs in the
Spotify Top 200 since the year 2016. It contains 19 variables and 6513 rows. The variables are id
(Spotify ID for the track), artist_names (name of artists), track_name (name of song), source
(record label), key (key the track is in), mode (modality: major or minor), time_signature (how
many beats per measure), danceability (how suitable a track is for dancing from 0-1), energy
(measure of intensity and activity from 0-1), speechiness (how much of a track consists of
spoken words, on a scale of 0-1), acousticness (a confidence level of whether or not the track
is acoustic from 0-1), instrumentalness (how likely a track is to contain no vocals, on a scale of 0-1),
liveness (presence of audience in the recording from 0-1), valence (musical positiveness from
0-1), loudness (loudness in decibels), tempo (beats per minute), duration_ms (length of song in
milliseconds), weeks_on_chart (number of weeks track was in the top 200), and streams
(number of streams a song had while in the charts). Each row corresponds to its own song.
Our three research questions are:
1) What variables in this dataset are good predictors of the number of streams, and how do
they relate to one another?
2) What themes/words are commonly used in song titles, and are songs with negative
themes more popular than songs with positive themes?
3) What variables in this dataset are good predictors of a song’s longevity in the Top 200?
How do these variables relate to one another?
Analysis of Indicators of Highly Streaming Songs
To address our first research question and examine the important variables in our dataset,
we created a regression model with the quantitative variables. Then, we ran Best GLM in R to
determine which predictors to keep in our final model. The selection is based on the Akaike
Information Criterion (AIC), an estimator of prediction error in a model, and the stepwise
regression chooses the predictors that result in the lowest AIC value. We found that energy,
speechiness, liveness, valence, loudness, and duration_ms are the best predictors. This can be
seen in the R code and output below.
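To illustrate how AIC-based selection works in general, here is a minimal Python sketch (it is not the R code or output from this report). It fits ordinary least squares models on every subset of a few made-up predictors and keeps the subset with the lowest AIC, using the Gaussian form n*ln(RSS/n) + 2k. The data values, the function names, and the exhaustive search are assumptions made for illustration; a stepwise search visits fewer subsets and may pick a different model.

```python
import math
from itertools import combinations

def solve(a, b):
    """Solve the linear system a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        partial = sum(m[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (m[i][n] - partial) / m[i][i]
    return x

def ols_rss(columns, y):
    """Fit y ~ intercept + columns by least squares; return the residual sum of squares."""
    rows = [[1.0] + [col[i] for col in columns] for i in range(len(y))]
    p = len(rows[0])
    xtx = [[sum(r[j] * r[k] for r in rows) for k in range(p)] for j in range(p)]
    xty = [sum(r[j] * yi for r, yi in zip(rows, y)) for j in range(p)]
    beta = solve(xtx, xty)
    return sum((yi - sum(b * v for b, v in zip(beta, r))) ** 2
               for r, yi in zip(rows, y))

def aic(rss, n, k):
    """Gaussian AIC up to an additive constant: n*ln(RSS/n) + 2k."""
    return n * math.log(rss / n) + 2 * k

def best_subset(predictors, y):
    """Return (lowest AIC, tuple of predictor names) over all subsets."""
    n = len(y)
    best = None
    names = sorted(predictors)
    for size in range(len(names) + 1):
        for subset in combinations(names, size):
            rss = ols_rss([predictors[name] for name in subset], y)
            # k counts the slopes plus the intercept and the error variance
            score = aic(rss, n, len(subset) + 2)
            if best is None or score < best[0]:
                best = (score, subset)
    return best

# Made-up toy data (NOT from the Spotify dataset): streams track energy closely,
# while tempo is nearly unrelated to streams.
predictors = {
    "energy": [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "tempo": [100, 130, 90, 120, 95, 125, 105, 115],
}
streams = [1.05, 1.95, 3.05, 3.95, 4.95, 6.05, 6.95, 8.05]

score, subset = best_subset(predictors, streams)
print("lowest-AIC subset:", subset)
```

Because every extra predictor adds 2 to the AIC, a variable is kept only when it reduces the residual sum of squares enough to offset that penalty.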
These results were somewhat supported by a PCA biplot of the 200 most-streamed
songs in the dataset, which displays a high concentration of most-streamed songs where
values of the musical features energy, liveness, valence, and loudness are high.
There is also a slight concentration of most-streamed songs where
speechiness is high. This biplot also indicates that acousticness may be a good indicator of a song’s
streaming success, which was not reflected in the results of our Best GLM. The close
alignment of the loudness, energy, and valence vectors indicates a strong correlation between these
variables for high-streaming songs.
Biplot of Musical Features for 200 Most-Streamed Songs
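As a rough illustration of what the arrows in a PCA biplot summarize, the Python sketch below (not the R code behind the figure) standardizes three made-up features, builds their correlation matrix, and finds the first principal component by power iteration. The feature values and function names are assumptions for illustration only. Features whose loadings on a component are similar point in similar directions in the biplot, which is why strongly correlated features appear as nearly parallel arrows.

```python
import math

def standardize(column):
    """Center a column and scale it to unit sample standard deviation."""
    n = len(column)
    mean = sum(column) / n
    sd = math.sqrt(sum((v - mean) ** 2 for v in column) / (n - 1))
    return [(v - mean) / sd for v in column]

def correlation_matrix(columns):
    """Correlation matrix computed from standardized columns."""
    z = [standardize(c) for c in columns]
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]

def first_component(matrix, iterations=200):
    """Power iteration: leading eigenvector (loadings) and eigenvalue of a symmetric matrix."""
    v = [1.0] + [0.5] * (len(matrix) - 1)  # arbitrary non-zero starting vector
    for _ in range(iterations):
        w = [sum(m * x for m, x in zip(row, v)) for row in matrix]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    eigenvalue = sum(v[i] * sum(m * x for m, x in zip(matrix[i], v))
                     for i in range(len(v)))
    return v, eigenvalue

# Made-up values (NOT from the Spotify dataset): energy and loudness move together,
# speechiness is nearly unrelated to both.
energy = [0.2, 0.4, 0.5, 0.7, 0.9]
loudness = [-11.0, -9.0, -8.2, -6.0, -4.0]
speechiness = [0.05, 0.30, 0.04, 0.28, 0.06]

corr = correlation_matrix([energy, loudness, speechiness])
loadings, eigenvalue = first_component(corr)
share = eigenvalue / len(corr)  # fraction of total variance on the first component
print("PC1 loadings:", [round(x, 2) for x in loadings])
print("variance explained by PC1:", round(share, 2))
```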
We also decided to create a correlation plot to examine how these predictors interact with
one another.
Based on this graph, we notice that energy and loudness have a very strong positive
correlation, indicating that an increase in the presence of activity and intensity is associated with
an increase in the number of decibels. We also see that valence and energy have a notable
positive correlation, as well as loudness and valence. Energy and liveness have a positive
correlation to a much lesser degree. We also notice that valence and the length of the song have a
negative correlation, indicating that longer songs tend to have lower valence.
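Each cell of a correlation plot is a Pearson correlation coefficient between two columns. The short Python sketch below shows that calculation on made-up values; the numbers and function names are assumptions for illustration and are not taken from the dataset.

```python
import math

def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def correlation_table(columns):
    """Map every ordered pair of column names to its correlation."""
    return {(a, b): pearson(columns[a], columns[b])
            for a in columns for b in columns}

# Made-up values: loudness is exactly linear in energy, duration falls as energy rises.
features = {
    "energy": [0.2, 0.4, 0.6, 0.8],
    "loudness": [-12.0, -9.0, -6.0, -3.0],
    "duration_ms": [260000, 230000, 210000, 180000],
}
corr = correlation_table(features)
for (a, b), r in corr.items():
    if a < b:  # print each unordered pair once
        print(f"{a} vs {b}: {r:+.2f}")
```

A coefficient near +1 or -1 means one variable is close to a linear function of the other, which is why two such predictors add little separate information to a regression.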
To further address our first research question in hopes of better understanding the
relationship between variables in the dataset and streams, we created a scatterplot of liveness
versus streams.
We notice that lower values of liveness appear to be associated with higher values of
streams. This suggests that the presence of an audience in the recording is associated with
fewer streams, which makes sense because, in our experience, people tend to
listen to studio recordings rather than live versions when using Spotify.
Analysis of Song Titles and Themes
To address which themes/words are most commonly used in song titles, we started by
creating a word cloud of the most frequent words used in the titles of the songs in the dataset. In
doing this, we decided to reduce words to their roots so that the same word with, for example,
different tenses would still be counted as one word. Additionally, common words such as
‘the’, ‘in’, ‘as’, etc. were removed from the list.
We see that the most common word used in song titles is love, followed by remix,
Christmas, girl, bad, and la (the Spanish word). This word cloud suggests that positive themes
(such as love, Christmas, heart, light, etc.) might be more common in song titles than negative
ones, as the majority of words included in it have positive associations.
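The word counts behind a word cloud can be sketched as follows in Python (the report's own processing was done in R and is not reproduced here). The stop-word list, the crude suffix-stripping stemmer, and the example titles are all hypothetical; a real analysis would typically use an established stemmer and stop-word list.

```python
import re
from collections import Counter

# Hypothetical stop-word list, far shorter than a real one
STOP_WORDS = {"the", "in", "as", "a", "of", "and", "to", "on", "for"}

def crude_stem(word):
    """Strip a few common suffixes; a deliberately naive stand-in for a real stemmer."""
    for suffix in ("ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[:-len(suffix)]
    return word

def title_word_counts(titles):
    """Count stemmed, non-stop-word tokens across all titles."""
    counts = Counter()
    for title in titles:
        for token in re.findall(r"[a-z']+", title.lower()):
            if token not in STOP_WORDS:
                counts[crude_stem(token)] += 1
    return counts

# Made-up titles for illustration
titles = [
    "Love in the Dark",
    "Love Story",
    "Bad Girls",
    "Girl on Fire",
    "Falling",
    "Fall for You",
]
counts = title_word_counts(titles)
print(counts.most_common(3))
```

Stemming is what lets "Girls" and "Girl", or "Falling" and "Fall", count toward the same entry.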
However, we wanted to further investigate whether songs with positive or negative
themes are more popular. To do this we took the top 200 most popular songs and analyzed
whether their titles contained positive or negative themes. We then created a bar chart displaying
the percentage of the top 200 most popular songs that are positive and negative. We also created
a bar chart displaying the number of positive and negative songs that have stayed on the
Spotify Top 200 chart for a year or more (at least 52 weeks on the chart).
The above chart shows that around 55% of the 200 most popular songs are negatively
themed, and around 45% are positively themed. We see a very similar distribution in the
percentage of songs charting for more than one year, with negatively themed songs making up
just shy of 60%. The similarity of these two charts suggests that a somewhat higher proportion of
hyper-popular songs have negative themes than positive ones.
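The percentages in such a bar chart come from labeling each title and counting the labels. The Python sketch below shows one simple way to do that with a small word lexicon. The lexicon, the scoring rule, and the titles are assumptions made for illustration; they are not necessarily how titles were classified in this report.

```python
# Hypothetical mini-lexicons; a real analysis would use a published sentiment lexicon
POSITIVE = {"love", "heart", "light", "happy"}
NEGATIVE = {"bad", "cry", "lonely", "hate"}

def title_theme(title):
    """Label a title by the balance of positive and negative words it contains."""
    words = title.lower().split()
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def theme_shares(titles):
    """Percentage of positive and negative titles among those with a clear theme."""
    labels = [title_theme(t) for t in titles]
    labeled = [label for label in labels if label != "neutral"]
    if not labeled:
        return {"positive": 0.0, "negative": 0.0}
    return {theme: 100 * labeled.count(theme) / len(labeled)
            for theme in ("positive", "negative")}

# Made-up titles for illustration
titles = ["Bad Habits", "Love Story", "Lonely", "Happy", "Cry Baby", "Yellow"]
shares = theme_shares(titles)
print(shares)  # {'positive': 40.0, 'negative': 60.0}
```

Titles with no lexicon words are left out of the percentages here, so the two shares sum to 100.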
Analysis of Indicators of Chart Longevity
To address our third research question, which variables in the data set are good predictors
of a song's longevity in the top 200, we performed another regression analysis, this time setting
the response variable as the number of weeks on the chart.
The summary output of the regression analysis shows that the key, energy, speechiness,
liveness, valence, loudness, and duration (in milliseconds) are all significant predictors of a
song's longevity in the top 200. To investigate the relationships between these predictor variables
we created a pairs plot (omitting energy since we’ve seen that it is fairly redundant with
loudness).
The pairs plot shows that loudness has a slightly positive relationship with each of the
other predictor variables. We also see that duration doesn’t appear to have a strong relationship
with any of the other predictor variables. To further explore how song duration relates to the
longevity of a song on the Spotify Top 200 chart, we then created a scatter plot between weeks
on the chart (x-axis) and song duration (y-axis).
The chart shows that there seems to be an ideal song length of around 3 and 1/3 minutes
(2*10^5 milliseconds). For songs of (and around) this length, the number of weeks on the chart is
far higher than for songs that are significantly shorter or longer.
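Since duration_ms is recorded in milliseconds, it helps to convert the value read off the plot into minutes and seconds, and to summarize weeks on the chart by duration bucket. The Python sketch below does both; the song values, the 30-second bucket width, and the function names are made up for illustration.

```python
from collections import defaultdict

def ms_to_min_sec(duration_ms):
    """Convert milliseconds to (minutes, seconds), rounded to the nearest second."""
    total_seconds = round(duration_ms / 1000)
    return divmod(total_seconds, 60)

def mean_weeks_by_bucket(songs, bucket_ms=30_000):
    """Average weeks on chart for songs grouped into fixed-width duration buckets."""
    buckets = defaultdict(list)
    for duration_ms, weeks in songs:
        bucket_start = duration_ms // bucket_ms * bucket_ms
        buckets[bucket_start].append(weeks)
    return {start: sum(w) / len(w) for start, w in sorted(buckets.items())}

minutes, seconds = ms_to_min_sec(2 * 10**5)
print(f"{minutes} min {seconds:02d} s")  # 3 min 20 s

# Made-up (duration_ms, weeks_on_chart) pairs, NOT taken from the dataset
songs = [(150000, 4), (195000, 40), (205000, 60), (300000, 6)]
for start, mean_weeks in mean_weeks_by_bucket(songs).items():
    m, s = ms_to_min_sec(start)
    print(f"from {m}:{s:02d}  mean weeks on chart = {mean_weeks:.1f}")
```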
Additionally, we wanted to analyze the distribution of musical features among the
longest-charting songs in the dataset. We visualized these distributions in a PCA biplot of the
200 longest-charting songs in the dataset. Our regression analysis revealed that the musical
features energy, speechiness, liveness, valence, and loudness are all significant predictors of a
song's longevity in the top 200. These results are reflected in our PCA biplot, where we can see
concentrations of longest-charting songs where these musical features take higher
values. Compared to our earlier PCA biplot of the 200 most-streamed songs, this
biplot displays less pronounced clustering, indicating that we might trust indicators for a song’s
streaming success more than indicators for a song’s chart longevity.
Biplot of Musical Features for 200 Longest-Charting Songs
Conclusion
Through our graphical analysis of the Spotify dataset, we were able to
answer many aspects of our initial research questions. We found that the best predictors
of the number of streams a song gets were its energy, speechiness, liveness, valence, loudness,
and duration. Further, we found that loudness and energy have a correlation close to 1,
suggesting that they carry nearly the same information. Beyond that, we found that energy/loudness and
valence are positively correlated, and that duration and valence are negatively correlated.
In our investigation into which words/themes are commonly used in song titles, we found
that the word love is by far the most used. Additionally, the words remix,
Christmas, girl, bad, and la (in Spanish) were all quite frequently used. We also found that,
among both the 200 most popular songs and all songs that spent a year or more on the Spotify
Top 200 chart, the majority of titles concerned negative themes.
We also found that the best predictors of a song’s longevity on the Spotify Top 200 chart
were its key, energy, loudness, liveness, valence, speechiness, and duration, with loudness and
speechiness being the most significant of these. From our exploration of the relationships
between these variables, we found that duration was largely independent of the other quantitative
predictors, and that loudness was weakly positively correlated with each of the other quantitative
predictors. In our comparison of song duration and number of weeks on the Spotify
Top 200 chart, we found that there appears to be a sweet spot in song length of around 3 and 1/3 minutes,
where songs appear to chart for far longer.
While our analysis answered many of our questions, it also illuminated many other
potential areas of research. It would be very interesting to investigate whether songs differ in
sonic qualities depending on whether they are positively or negatively themed. Additionally, comparing
the sonic qualities or themes of hyper-popular songs (such as songs with over 1 billion streams,
or songs with over 2 years on the Spotify Top 200 chart) to songs with low popularity to see if there
are any notable differences is another area worth exploring. While this dataset did not contain
information on which genre of music each song came from, comparing sonic qualities, song
themes, song popularity, etc. of songs by genre is another very interesting area for further
research.