36-315: Statistical Graphics and Visualizations - Final Report

By Aanika Schueler, Kina Paguyo, Fritz Sanger

Introduction:

This dataset, which we found on Kaggle, contains information about songs in the

Spotify Top 200 since the year 2016. It contains 19 variables and 6513 rows. The variables are id

(Spotify ID for the track), artist_names (name of artists), track_name (name of song), source

(record label), key (key the track is in), mode (modality: major or minor), time_signature (how

many beats per measure), danceability (how suitable a track is for dancing from 0-1), energy

(measure of intensity and activity from 0-1), speechiness (the presence of spoken words on a

track on a scale of 0-1), acousticness (a confidence level of whether or not the track

is acoustic from 0-1), instrumentalness (whether a track contains no vocals on a scale of 0-1),

liveness (presence of audience in the recording from 0-1), valence (musical positiveness from

0-1), loudness (loudness in decibels), tempo (beats per minute), duration_ms (length of song in

milliseconds), weeks_on_chart (number of weeks track was in the top 200), and streams

(number of streams a song had while in the charts). Each row corresponds to its own song.

Our three research questions are:

1) What variables in this dataset are good predictors of the number of streams, and how do

they relate to one another?

2) What themes/words are commonly used in song titles, and are songs with negative

themes more popular than songs with positive themes?

3) What variables in this dataset are good predictors of a song’s longevity in the Top 200?

How do these variables relate to one another?

Analysis of Indicators of Highly Streaming Songs

To address our first research question and examine the important variables in our dataset,

we created a regression model with the quantitative variables. Then, we ran a Best GLM in R to

determine which predictors are best to keep in our final model. The selection is based on the Akaike

Information Criterion (AIC), an estimator of prediction error in a model: the procedure

chooses the set of predictors that results in the lowest AIC value. We found that energy,

speechiness, liveness, valence, loudness, and duration_ms are the best predictors. This can be

seen in the R code and output below.
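
Separately from the R output, the idea of AIC-based selection can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the code used for this report: it fits ordinary least squares on every subset of predictors and keeps the subset with the lowest AIC, using the Gaussian form n*ln(RSS/n) + 2k. The function names and the toy numbers are hypothetical and are not taken from the dataset.

```python
import itertools
import math

def ols_rss(rows, y):
    """Residual sum of squares of a least-squares fit with an intercept."""
    X = [[1.0] + list(r) for r in rows]  # prepend the intercept column
    p = len(X[0])
    # Augmented normal equations [X'X | X'y]
    M = [[sum(x[i] * x[j] for x in X) for j in range(p)]
         + [sum(x[i] * yi for x, yi in zip(X, y))] for i in range(p)]
    # Gaussian elimination with partial pivoting
    for col in range(p):
        piv = max(range(col, p), key=lambda r: abs(M[r][col]))
        M[col], M[piv] = M[piv], M[col]
        for r in range(col + 1, p):
            f = M[r][col] / M[col][col]
            for k in range(col, p + 1):
                M[r][k] -= f * M[col][k]
    # Back substitution for the coefficients
    beta = [0.0] * p
    for i in reversed(range(p)):
        tail = sum(M[i][k] * beta[k] for k in range(i + 1, p))
        beta[i] = (M[i][p] - tail) / M[i][i]
    return sum((yi - sum(b * v for b, v in zip(beta, x))) ** 2
               for x, yi in zip(X, y))

def gaussian_aic(rss, n, k):
    """AIC of a Gaussian linear model, up to an additive constant."""
    # The max() guard avoids log(0) if a model happens to fit exactly.
    return n * math.log(max(rss, 1e-12) / n) + 2 * k

def best_subset_by_aic(columns, y):
    """Try every non-empty subset of predictors; return (subset, aic)."""
    names = list(columns)
    n = len(y)
    best = None
    for size in range(1, len(names) + 1):
        for subset in itertools.combinations(names, size):
            rows = list(zip(*(columns[name] for name in subset)))
            aic = gaussian_aic(ols_rss(rows, y), n, k=size + 1)  # +1: intercept
            if best is None or aic < best[1]:
                best = (subset, aic)
    return best

# Toy data (made up for illustration, NOT from the Spotify dataset).
toy_columns = {
    "energy":   [0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 0.7, 0.8],
    "liveness": [0.3, 0.1, 0.4, 0.1, 0.5, 0.9, 0.2, 0.6],
    "tempo":    [120, 95, 130, 100, 110, 90, 125, 105],
}
toy_streams = [3.3, 5.8, 9.1, 12.4, 14.7, 18.2, 20.9, 23.6]

best_names, best_aic = best_subset_by_aic(toy_columns, toy_streams)
print(best_names, round(best_aic, 2))
```

Exhaustive search is only practical for a small number of predictors, since the number of subsets doubles with each added variable.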

These results were somewhat supported by a PCA biplot of the 200 most-streamed

songs in the dataset, which displays a high concentration of most-streamed songs where

values of the musical features energy, liveness, valence, and loudness are high.

There is also a slight concentration of the 200 most-streamed songs where values of

speechiness are high. The biplot also indicates that acousticness may be a good indicator of a song’s

streaming success, which was not reflected in the results of our Best GLM. The close

alignment of the loudness, energy, and valence vectors indicates a strong correlation between these

variables for high-streaming songs.

Biplot of Musical Features for 200 Most-Streamed Songs

We also decided to create a correlation plot to examine how these predictors interact with

one another.

Based on this graph, we notice that energy and loudness have a very strong positive

correlation, indicating that an increase in the presence of activity and intensity is associated with

an increase in the number of decibels. We also see that valence and energy have a notable

positive correlation, as well as loudness and valence. Energy and liveness have a positive

correlation to a much lesser degree. We also notice that valence and the length of the song have a

negative correlation, indicating that those variables are inversely related to one another.
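
For reference, the quantity shown in each cell of a correlation plot is the Pearson correlation coefficient. The following Python sketch computes it for every pair of columns; the helper names and the toy values are hypothetical and are not taken from the dataset.

```python
import math

def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)

def correlation_matrix(columns):
    """Nested dict of pairwise correlations: result[a][b] = r(a, b)."""
    return {a: {b: pearson_r(columns[a], columns[b]) for b in columns}
            for a in columns}

# Toy values (made up for illustration, NOT from the Spotify dataset).
toy = {
    "energy":      [0.2, 0.4, 0.6, 0.8],
    "loudness":    [-12.0, -9.0, -6.0, -4.0],
    "duration_ms": [240000, 210000, 200000, 180000],
}

corr = correlation_matrix(toy)
for name, row in corr.items():
    print(name, {other: round(r, 2) for other, r in row.items()})
```

Values near 1 or -1 indicate a strong linear relationship, while values near 0 indicate little linear association.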

To further address our first research question in hopes of better understanding the

relationship between variables in the dataset and streams, we created a scatterplot of liveness

versus streams.

We notice that lower values of liveness appear to be associated with higher values of

streams. This suggests that songs recorded with an audience present tend to receive fewer

streams, which makes sense because, in our experience, people tend to

listen to the studio versions of songs rather than live versions when using Spotify.

Analysis of Song Titles and Themes

To address which themes/words are most commonly used in song titles, we started by

creating a word cloud of the most frequent words used in the titles of the songs in the dataset. In

doing this we decided to view words by their roots so that the same word with, for example,

different tenses would still be considered the same word. Additionally, common words such as

‘the’, ‘in’, ‘as’, etc. were removed from the list.
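
The counting step behind a word cloud like this can be sketched in Python as follows. This is an assumption-laden illustration rather than the code used for this report: the stop-word list is a small hand-picked sample, and crude_stem is a deliberately simple, hypothetical stand-in for a proper stemmer. The titles are toy examples.

```python
import re
from collections import Counter

# Small hand-picked stop-word list (an assumption for illustration).
STOP_WORDS = {"the", "in", "as", "a", "an", "of", "and", "to", "me", "you", "my", "i"}

def crude_stem(word):
    """Very rough suffix stripping; a real analysis would use a proper stemmer."""
    for suffix in ("ing", "ed", "es", "s"):
        stem = word[: -len(suffix)]
        if word.endswith(suffix) and len(stem) >= 3:
            if suffix == "s" and stem[-1] in "aiousy":
                return word  # leave words like "christmas" or "kiss" alone
            return stem[:-1] if stem.endswith("e") else stem
    # Drop a trailing "e" so that "love" and "loving" share one root.
    return word[:-1] if len(word) > 3 and word.endswith("e") else word

def title_word_counts(titles):
    """Count stemmed, non-stop words across all titles."""
    counts = Counter()
    for title in titles:
        for word in re.findall(r"[^\W\d_]+", title.lower()):
            if len(word) > 1 and word not in STOP_WORDS:
                counts[crude_stem(word)] += 1
    return counts

# Toy titles (illustration only, not a sample of the dataset).
toy_titles = ["Love Me", "Loving You", "Bad Girls",
              "Girl in the Mirror", "Last Christmas"]

counts = title_word_counts(toy_titles)
print(counts.most_common(3))
```

The resulting counts are what a word cloud scales its word sizes by.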

We see that the most common word used in song titles is love, followed by remix,

Christmas, girl, bad, and la (the Spanish word). This word cloud suggests that positive themes

(such as love, Christmas, heart, light, etc.) might be more common in song titles than negative

ones, as the majority of words included in it have positive associations.

However, we wanted to further investigate whether songs with positive or negative

themes are more popular. To do this, we took the 200 most popular songs and analyzed

whether their titles contained positive or negative themes. We then created a bar chart displaying

the percentage of the top 200 most popular songs that are positive and negative. We also created

a bar chart displaying the number of positive and negative songs that have stayed on the

Spotify Top 200 chart for a year or more (52 or more weeks on the chart).

The above chart shows that around 55% of the 200 most popular songs are negatively

themed, and around 45% are positively themed. We see a very similar distribution among

songs charting for a year or more, with negatively themed songs making up

just shy of 60%. The similarity of these two charts suggests that a higher proportion of

hyper-popular songs have negative themes.
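
The percentages in such a bar chart can be computed as in the Python sketch below. How titles were actually labeled for this report is not shown here, so the word-list approach, the two tiny lexicons, and the "neutral" category are all assumptions made purely for illustration. The titles are toy examples.

```python
import re

# Tiny hypothetical lexicons (assumptions, not the labels used in the report).
POSITIVE_WORDS = {"love", "happy", "heart", "light", "christmas"}
NEGATIVE_WORDS = {"bad", "lonely", "cry", "hate", "alone"}

def title_theme(title):
    """Label a title by comparing its positive and negative word counts."""
    words = re.findall(r"[a-z]+", title.lower())
    score = (sum(w in POSITIVE_WORDS for w in words)
             - sum(w in NEGATIVE_WORDS for w in words))
    if score > 0:
        return "positive"
    if score < 0:
        return "negative"
    return "neutral"

def theme_shares(titles):
    """Percentage of positive and negative titles among the themed ones."""
    themes = [title_theme(t) for t in titles]
    themed = [t for t in themes if t != "neutral"]
    if not themed:
        return {"positive": 0.0, "negative": 0.0}
    return {label: 100 * themed.count(label) / len(themed)
            for label in ("positive", "negative")}

# Toy titles (illustration only).
toy_titles = ["bad days", "love again", "so lonely",
              "happy now", "cry alone", "monday"]

shares = theme_shares(toy_titles)
print(shares)
```

Titles with no matching words are left out of the percentages in this sketch, which is one of several reasonable choices.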

Analysis of Indicators of Chart Longevity

To address our third research question, which variables in the data set are good predictors

of a song's longevity in the top 200, we performed another regression analysis, this time setting

the response variable as the number of weeks on the chart.

The summary output of the regression analysis shows that the key, energy, speechiness,

liveness, valence, loudness, and duration (in milliseconds) are all significant predictors of a

song's longevity in the top 200. To investigate the relationships between these predictor variables

we created a pairs plot (omitting energy since we’ve seen that it is fairly redundant with

loudness).

The pairs plot shows that loudness has a slightly positive relationship with each of the

other predictor variables. We also see that duration doesn’t appear to have a strong relationship

with any of the other predictor variables. To further explore how song duration impacts the

longevity of a song on the Spotify Top 200 chart, we then created a scatter plot of weeks

on the chart (x-axis) against song duration (y-axis).

The chart shows that there seems to be an ideal song length of around 3 and ⅓ minutes

(2*10^5 milliseconds). For songs of (and around) this length, the number of weeks on the chart is

far higher than for songs that are significantly shorter or longer.
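
As a quick check of the unit conversion, 2*10^5 milliseconds is 200 seconds, or 3 minutes 20 seconds. A small Python sketch with hypothetical helper names:

```python
def ms_to_minutes(duration_ms):
    """Convert a duration in milliseconds to (fractional) minutes."""
    return duration_ms / 60_000

def format_duration(duration_ms):
    """Format a duration in milliseconds as minutes:seconds."""
    total_seconds = round(duration_ms / 1000)
    minutes, seconds = divmod(total_seconds, 60)
    return f"{minutes}:{seconds:02d}"

print(ms_to_minutes(200_000))    # about 3.33 minutes
print(format_duration(200_000))  # 3:20
```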

Additionally, we wanted to analyze the distribution of musical features among the

longest-charting songs in the dataset. We visualized these distributions in a PCA biplot of the

200 longest-charting songs in the dataset. Our regression analysis revealed that the musical

features energy, speechiness, liveness, valence, and loudness are all significant predictors of a

song's longevity in the top 200. These results are reflected in our PCA biplot, where we can see

concentrations of longest-charting songs where these musical features take higher

values. Compared to our earlier PCA biplot of the 200 most-streamed songs, this

biplot displays less pronounced clustering, indicating that we might trust indicators for a song’s

streaming success more than indicators for a song’s chart longevity.

Biplot of Musical Features for 200 Longest-Charting Songs

Conclusion

Through our graphical analysis of the Spotify dataset, we were able to successfully

answer many aspects of our initial research questions. We found that the best predictors

of the number of streams a song gets were its energy, speechiness, liveness, valence, loudness,

and duration. Further, we found that loudness and energy have a correlation of close to 1,

suggesting that they carry nearly the same information. Beyond that, we found that energy/loudness and

valence are positively correlated, and that duration and valence are negatively correlated.

In our investigation into which words/themes are commonly used in song titles, we found

that the word love is by far the most used. Additionally, the words remix,

Christmas, girl, bad, and la (in Spanish) were all also quite frequently used. We also found that, among

both the 200 most popular songs and all songs that spent a year or more on the Spotify

Top 200 chart, the majority of titles concerned negative themes.

We also found that the best predictors of a song’s longevity on the Spotify Top 200 chart

were its key, energy, loudness, liveness, valence, speechiness, and duration, with loudness and

speechiness being the most significant of these. From our exploration of the relationships

between these variables, we found that duration was largely independent of the other quantitative

predictors, and that loudness was weakly positively correlated with each of the other quantitative

predictors. In our comparison of song duration and that song's number of weeks on the Spotify

Top 200 chart, we found that there appears to be a sweet spot in song length of around 3 and ⅓ minutes

where songs appear to chart for far longer.

While our analysis answered many of our questions, it also illuminated many other

potential areas of research. It would be very interesting to investigate whether songs differ in

sonic qualities depending on whether they are positively or negatively themed. Additionally, comparing

the sonic qualities or themes of hyper-popular songs (such as songs with over 1 billion streams,

or songs with over 2 years on the Spotify Top 200 chart) to songs with low popularity to see if there

are any notable differences is another area worth exploring. While this dataset did not contain

information on which genre of music each song came from, comparing sonic qualities, song

themes, song popularity, etc. of songs by genre is another very interesting area for further

research.