Introduction

Music and humanity have long been intertwined. The earliest evidence of musical instruments, found in caves, dates back 35,000 years, and the oldest recorded melody is inscribed on a gravestone from the first century AD. Today, rather than being played in caves or inscribed on gravestones, music is a multibillion-dollar worldwide industry. With online streaming platforms like Spotify, access to music is at an all-time high.

Spotify has recently become the world’s leading audio streaming service with over 456 million users across the globe. Each year, millions of songs are streamed on Spotify, and many of them become popular among the platform’s users. In this project, we will take a look at the dataset “Top Spotify Songs from 2010-2019 by Year”. We will explore the most streamed songs from this decade, as well as the top songs from each year. We aim to get insights into the factors that may have contributed to the success of these songs and their lasting appeal.

This report aims to answer the following question: “What makes a song more popular than the rest?”

Description Of The Dataset

This dataset was obtained by a Kaggle user from the following URL: http://organizeyourmusic.playlistmachinery.com/. It contains information about the top Spotify songs around the world for each year from 2010 to 2019. We cleaned this dataset (described below) to obtain a dataframe with 595 observations (songs) and 15 columns. The dataset contains several variables about each song, as outlined below:

  1. Id: A unique identifier
  2. Title: A song’s title
  3. Artist: The song’s artist
  4. Top Genre: The genre of the track
  5. Year: The song’s year in the billboard
  6. BPM: Beats per minute; the tempo of the song; ranges in value 0 - 206
  7. nrgy: Energy; the energy of a song - the higher the value, the more energetic; ranges in value 0 - 98
  8. dnce: Danceability; the higher the value, the easier it is to dance to this song; ranges in value 0 - 97
  9. dB: Loudness; the higher the value, the louder the song; ranges in value -60 to -2
  10. live: Liveness; the higher the value, the more likely the song is a live recording; ranges in value 0 - 74
  11. val: Valence; the higher the value, the more positive mood for the song; ranges in value 0 - 98
  12. dur: Length; the duration of the song; ranges in value 134 - 424
  13. Acous: Acousticness; the higher the value, the more acoustic the song is; ranges in value 0 - 99
  14. spch: Speechiness; the higher the value, the more spoken words the song contains; ranges in value 0 - 48
  15. pop: Popularity; the higher the value, the more popular the song is; ranges in value 0 - 99

There are 5 categorical variables - ID, Title, Artist, Top Genre and Year. The rest of the variables are continuous quantitative variables.

Given that the dataset describes the most popular songs, we are interested in answering questions about how these songs relate to popularity. To do this, we filtered our dataset to remove songs with a popularity score below 25, which removed 8 outlying songs.

EDA

Before we dive into our research questions, we will take a look at the Popularity variable:

Since this dataset is composed of songs that made it into the Spotify top 10 list between 2010 and 2019 (inclusive), all of the songs in the data are popular compared to other songs on the platform. The popularity score ranks these well-known songs against one another and covers a range from 25 to 100. We can see that the distribution is left-skewed, meaning that more songs had a higher popularity score than a lower one. Further, the distribution is unimodal and asymmetrical, with a peak at around 70. The line in the graph is at around 67.3, which is the average popularity. For later analysis, we classified songs above this mean as “High Popularity” and songs below this mean as “Low Popularity”.
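The filtering and labeling steps described above can be sketched in a few lines of Python. This is only an illustration, not the code used for this report: the scores are made up, the names `MIN_POPULARITY` and `label_popularity` are hypothetical, and treating a score exactly equal to the mean as “Low Popularity” is an arbitrary choice.

```python
from statistics import mean

MIN_POPULARITY = 25  # cutoff described in the dataset section


def label_popularity(scores):
    """Drop scores below the cutoff, then label each remaining score
    as High or Low relative to the mean of the remaining scores."""
    kept = [s for s in scores if s >= MIN_POPULARITY]
    avg = mean(kept)
    # A score exactly equal to the mean is labeled Low (arbitrary choice).
    labels = ["High Popularity" if s > avg else "Low Popularity" for s in kept]
    return kept, avg, labels


# Illustrative scores only (not taken from the dataset).
popularity = [0, 12, 45, 60, 70, 72, 80, 95]
kept, avg, labels = label_popularity(popularity)
print(avg, labels)
```

Computing the mean after filtering matters here, because the removed low scores would otherwise pull the threshold down.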

Research Questions

After exploring the popularity variable, we are now interested in observing how certain features of a song affect its popularity score. Specifically, we will look at how song title, general song attributes, and danceability affect the popularity of a song.

How does song title affect popularity?

Word Cloud of Song Titles

We first ask the question: which words appear most frequently in the titles of popular songs? Plotting a word cloud of the 50 most frequently occurring words in song titles, we see that “love” and “like” are the most common words. This suggests that the most popular songs revolve around themes of romance and friendship. The next most frequently occurring words include “heart”, “girl”, “beauti”, “kiss”, “young”, and “feel”, which further supports our initial impression that the titles revolve around romance and friendship. We could even infer from the words that there are themes of growing up, heartbreak, desire and escapism. To create the word cloud, we removed stop words, performed stemming, and removed parts of the titles that included text such as “featuring artists”, “movie soundtrack”, etc., so that only words from the main part of the title were included.
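The title cleaning described above can be illustrated with a minimal Python sketch. It is not the code used for this report: the titles and the stop-word list are made up for illustration, the function name `clean_title` is hypothetical, and stemming is omitted for brevity (a stemming step would be applied to each word after the stop words are removed).

```python
import re
from collections import Counter

# Tiny illustrative stop-word list; a real analysis would use a fuller one.
STOP_WORDS = {"a", "the", "you", "me", "i", "of", "in", "to", "and", "it", "my"}


def clean_title(title):
    """Lowercase a title, drop credits and soundtrack notes, remove stop words."""
    # Remove bracketed parts such as "(feat. Someone)".
    title = re.sub(r"[\(\[].*?[\)\]]", " ", title.lower())
    # Remove trailing ' - From "A Movie"' style suffixes.
    title = re.sub(r"\s-\s.*$", " ", title)
    words = re.findall(r"[a-z']+", title)
    return [w for w in words if w not in STOP_WORDS]


# Made-up titles, for illustration only.
titles = [
    "Love Like Fire (feat. Someone)",
    'Young Love - From "A Movie"',
    "The Heart of It",
]

counts = Counter(word for t in titles for word in clean_title(t))
print(counts.most_common(3))
```

The resulting word counts are what a word cloud would size its words by.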

Sentiment Analysis of Song Titles

Next, we ask the following questions: are the popular song titles usually positive or negative in nature, and what are the most frequently occurring positive and negative words in song titles?

After performing sentiment analysis on the data, we make word clouds of “positive” words and “negative” words. The negative word cloud is much bigger than the positive word cloud, implying that the more popular song titles have a negative connotation to them. The common positive words are “love”, “like”, “good”, “lover” and “perfect”, and the common negative words are “broken”, “kill”, “lose”, “bad”, “stranger”, “hard”, “hurt” and “lie”. Thus, the popular songs with positive titles appear to center around romance, friendship and inspiration, whereas the popular songs with negative titles have themes surrounding pain, loss and breakups.
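A lexicon-based approach is one common way to carry out this kind of sentiment tagging. The sketch below assumes such an approach and is not necessarily the method used for this report: the lexicon is a tiny hand-made one built from a few of the words quoted above, and the function name `unique_sentiment_words` is hypothetical.

```python
# Tiny hand-made lexicon for illustration; a real analysis would rely on
# a published sentiment lexicon instead.
LEXICON = {
    "love": "positive", "like": "positive", "good": "positive",
    "perfect": "positive",
    "broken": "negative", "kill": "negative", "lose": "negative",
    "bad": "negative", "hurt": "negative", "lie": "negative",
}


def unique_sentiment_words(words):
    """Collect the distinct positive and negative words found in `words`."""
    found = {"positive": set(), "negative": set()}
    for w in words:
        label = LEXICON.get(w)
        if label is not None:  # words missing from the lexicon are ignored
            found[label].add(w)
    return found


# Made-up list of cleaned title words.
words = ["love", "love", "bad", "hurt", "girl", "lie", "good"]
result = unique_sentiment_words(words)
print(len(result["positive"]), len(result["negative"]))
```

Counting distinct words, as done here, differs from counting occurrences: “love” can dominate the titles while the set of negative words is still larger.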

Overall, we observed that while the most common word in the song titles is “love”, there are more songs that have titles with negative connotations, as evidenced by the negative word cloud being larger. This indicates that popular songs in general tend to have negative connotations in their titles.

Proportion Test of Sentiments

To find whether this difference in positive and negative sentiments was statistically significant, we ran a proportion test comparing the proportion of positive words across the years in the dataset (2010-2019). Testing at a significance level of α = 0.05, we found that the difference between positive and negative words in song titles was not statistically significant, as the p-value was 0.1012. We can say, however, that there are more unique negative words than positive words.
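One common way to compare a proportion across several groups is Pearson's chi-square test of equal proportions. The sketch below assumes that form of test, which may differ from the exact procedure used for this report. The counts are made up and cover only three hypothetical years, and the function name `chi_square_equal_proportions` is hypothetical.

```python
def chi_square_equal_proportions(successes, totals):
    """Pearson chi-square statistic for H0: all groups share one proportion.

    Returns the statistic and its degrees of freedom (groups - 1).
    """
    pooled = sum(successes) / sum(totals)
    stat = 0.0
    for x, n in zip(successes, totals):
        expected_pos = n * pooled
        expected_neg = n * (1 - pooled)
        stat += (x - expected_pos) ** 2 / expected_pos
        stat += ((n - x) - expected_neg) ** 2 / expected_neg
    return stat, len(totals) - 1


# Made-up counts: positive words and total sentiment words per year.
positive = [10, 12, 8]
total = [20, 20, 20]

stat, df = chi_square_equal_proportions(positive, total)

# 5.991 is the standard 5% critical value of the chi-square
# distribution with 2 degrees of freedom.
CRITICAL_5PCT_DF2 = 5.991
print(stat, df, stat > CRITICAL_5PCT_DF2)
```

With these illustrative counts the statistic falls below the critical value, so the null hypothesis of equal proportions would not be rejected at the 0.05 level.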

Title Length versus Popularity

Next, we look at title length in relation to popularity score. The plot below shows the density of the number of words in a song title, conditional on year (left). We can see that most years have a similar right-skewed density curve, but that 2018 and 2019 seem to be approaching a more symmetrical distribution. This means that for all other years a shorter title is more common, but for 2018 and 2019 the titles tended to be slightly longer. On the right side of this plot, we compare the number of words in a song title with the popularity score of that song.

We do see that songs with longer titles tend to have higher popularity scores, but this can be attributed to the fact that there are vastly fewer songs with 6-8 words in their title than songs with 3 or fewer. Thus, we cannot say with any certainty that there is a relationship between title length and popularity.
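The caveat about group sizes is easier to see when each title length is reported together with the number of songs behind it. The sketch below does this for made-up data; the function name `popularity_by_title_length` is hypothetical and this is not the code used for this report.

```python
from collections import defaultdict
from statistics import mean


def popularity_by_title_length(songs):
    """Map title word count -> (number of songs, mean popularity)."""
    groups = defaultdict(list)
    for title, popularity in songs:
        groups[len(title.split())].append(popularity)
    return {n: (len(p), mean(p)) for n, p in sorted(groups.items())}


# Made-up (title, popularity) pairs, for illustration only.
songs = [
    ("One", 60),
    ("Two Words", 70),
    ("Another Pair", 50),
    ("A Much Longer Song Title Here", 90),
]

summary = popularity_by_title_length(songs)
for n_words, (count, avg) in summary.items():
    print(n_words, count, avg)
```

In this toy data the six-word group has the highest mean but contains a single song, which is the same pattern that makes the trend in the plot hard to trust.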

How do song attributes affect popularity?

Spotify tracks and quantifies the attributes of music on its platform with a range of scores from 0-100 in separate categories. These continuous song attributes are present in our data: BPM, energy, danceability, loudness, liveness, valence, duration, acousticness, speechiness, and popularity. For the sake of this analysis, we used all continuous attributes except popularity, which we transformed into a categorical variable of High and Low popularity.

MDS of High and Low Popularity Songs - split by above or below the average popularity

To understand the effect of the different song attributes on the popularity of the songs in our dataset, we first split the songs into a “High” and a “Low” popularity group, as described in our EDA section. We then used multidimensional scaling to identify any clusters of songs with similarities across their dimensions, distinguished by our new popularity groups.
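Multidimensional scaling takes as input a matrix of pairwise distances between songs and places the songs in a low-dimensional plot that preserves those distances as well as it can. The sketch below only builds that distance matrix. It assumes the attributes are standardized first and that Euclidean distance is used, neither of which is stated in this report, and the attribute values are made up.

```python
from math import dist
from statistics import mean, pstdev


def standardize(rows):
    """Scale each column to mean 0 and standard deviation 1."""
    cols = list(zip(*rows))
    mus = [mean(c) for c in cols]
    sds = [pstdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(row, mus, sds)] for row in rows]


def distance_matrix(rows):
    """Pairwise Euclidean distances between rows."""
    return [[dist(a, b) for b in rows] for a in rows]


# Made-up (BPM, energy, danceability) values for three songs.
songs = [[120, 80, 70], [100, 40, 50], [140, 90, 75]]
D = distance_matrix(standardize(songs))
for row in D:
    print([round(d, 2) for d in row])
```

Standardizing keeps an attribute with a wide numeric range, such as BPM, from dominating the distances.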

In our multidimensional scaling, we incorporated all continuous song attributes except popularity: BPM, energy, danceability, loudness, liveness, valence, duration, acousticness, and speechiness. The plot showed one large cluster of songs, with slight clusterings outside of it. This shows that even across our multidimensional scaling, the songs with more and less popularity seem to cluster in the same area.

PCA Plot of Song Attributes

To further see the effects of each attribute, we performed a Principal Component Analysis on the same attributes and displayed the results in this biplot, again coloring by our popularity grouping.

## Importance of components:
##                           PC1    PC2    PC3    PC4     PC5     PC6     PC7
## Standard deviation     1.5938 1.2000 1.0586 0.9956 0.91497 0.88459 0.80727
## Proportion of Variance 0.2823 0.1600 0.1245 0.1101 0.09302 0.08694 0.07241
## Cumulative Proportion  0.2823 0.4422 0.5668 0.6769 0.76992 0.85687 0.92928
##                            PC8     PC9
## Standard deviation     0.62122 0.50058
## Proportion of Variance 0.04288 0.02784
## Cumulative Proportion  0.97216 1.00000
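The rows of the table above are linked by simple arithmetic: each component's proportion of variance is its squared standard deviation divided by the sum of all squared standard deviations. The short Python check below recomputes the proportions from the standard deviations printed in the table; it is an illustration and not the code that produced the output.

```python
# Standard deviations of the nine principal components, copied from the table.
sdev = [1.5938, 1.2000, 1.0586, 0.9956, 0.91497,
        0.88459, 0.80727, 0.62122, 0.50058]

variances = [s ** 2 for s in sdev]
# The total is close to 9, which is what one would expect if the nine
# attributes were standardized before the PCA.
total = sum(variances)
proportion = [v / total for v in variances]

# Running sum of the proportions gives the cumulative row.
cumulative = []
running = 0.0
for p in proportion:
    running += p
    cumulative.append(running)

print([round(p, 3) for p in proportion])
print([round(c, 3) for c in cumulative])
```

The recomputed values agree with the table to about three decimal places; small differences come from the rounding of the printed standard deviations. The first two components together account for roughly 44% of the variance, so a two-dimensional biplot leaves more than half of the variation unseen.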

The PCA shows us the relationships between the song attributes. First, acousticness and energy appear to have a strong negative relationship, which makes sense from a musical perspective. Duration also seems to have a negative correlation with danceability and valence, which means that shorter songs tend to be happier and more danceable. Speechiness and BPM seem highly related, meaning that lower-BPM songs tended to contain fewer spoken words, while higher-BPM songs tended to contain more.

Beyond this analysis of song attribute relationships, we don’t see any specific groupings of high or low popularity music. Songs towards the extremes of acousticness and BPM tended to be less popular, but there doesn’t seem to be any visible relationship beyond that. With the data we have and analysis we have conducted, we are unable to conclude any clear relationship between song popularity and the song attributes given.

How does danceability affect popularity?

A song with a high danceability rating is likely to be more popular at parties and clubs where people are looking to dance.

Under this research question, we will aim to see if there exists a relationship between danceability and popularity. However, before doing so, we will first explore the danceability variable on its own.

Danceability By Genre

Plotting a side-by-side boxplot of danceability by the top 10 genres, we see that pop songs have the highest median danceability score (with Barbadian pop having the second highest median) and neo-mellow songs have the lowest median danceability score. This is surprising, as we would expect dance pop songs to have the highest median danceability score because, by definition, dance pop songs are generally uptempo music intended for nightclubs and meant to be danceable. The range of danceability scores is highest for Canadian contemporary R&B and lowest for British soul.
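The two quantities compared in this paragraph, the median and the range of danceability within each genre, can be computed directly. The sketch below uses made-up scores for three genres and a hypothetical function name `danceability_summary`; it is not the code used for this report.

```python
from collections import defaultdict
from statistics import median


def danceability_summary(songs):
    """Map genre -> median and range (max - min) of danceability."""
    by_genre = defaultdict(list)
    for genre, dnce in songs:
        by_genre[genre].append(dnce)
    return {
        g: {"median": median(v), "range": max(v) - min(v)}
        for g, v in by_genre.items()
    }


# Made-up (genre, danceability) pairs, for illustration only.
songs = [
    ("pop", 70), ("pop", 80), ("pop", 75),
    ("dance pop", 60), ("dance pop", 72),
    ("neo mellow", 50), ("neo mellow", 58),
]

summary = danceability_summary(songs)
top_genre = max(summary, key=lambda g: summary[g]["median"])
print(top_genre, summary[top_genre])
```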

Danceability vs Popularity

Finally, plotting a scatterplot of danceability against popularity and coloring the points by the top 10 genres, we see that there is no obvious linear correlation between danceability and popularity. Further, we note that dance pop songs (dark blue colored points) are spread out across all values of the y-axis, indicating that the genre of dance pop doesn’t necessarily mean that the song is more danceable. We see that there are more pop songs (pink colored points) with a popularity greater than 65 than below, which could mean that pop songs are generally more popular.
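The scatterplot was judged by eye. One way to quantify the strength of a linear relationship would be Pearson's correlation coefficient, sketched below. The function name `pearson_r` and the scores are illustrative and are not taken from the dataset, and no coefficient is reported for the actual data here.

```python
from math import sqrt


def pearson_r(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sqrt(sum((x - mx) ** 2 for x in xs))
    sy = sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


# Made-up danceability and popularity scores for four songs.
dnce = [50, 60, 70, 80]
pop = [70, 60, 75, 65]

r = pearson_r(dnce, pop)
print(r)
```

A coefficient near 0 indicates no linear relationship, while values near 1 or -1 indicate a strong one. A coefficient near 0 does not rule out a non-linear pattern.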

Overall, we see that a high danceability score does not necessarily mean that the song will be more popular. A song with a low danceability rating can still be popular if it has other appealing qualities, such as catchy melodies, engaging lyrics, or a unique sound. Additionally, the popularity of a song is often influenced by factors outside of its musical qualities, such as the artist’s popularity, the marketing and promotion of the song, and cultural and social trends. So, even if a song has a low danceability rating, it is evident that it can still become popular if it resonates with people for other reasons.

Conclusions

Through our research, we were able to learn more about the relationships between music, popularity, and the attributes of songs. Looking at the titles of popular songs, while we saw more unique negative words than positive ones, our statistical analysis was unable to conclude any significant difference in the proportions of these words. Our next analysis looked at Spotify’s song attributes and their relationship with popularity. Through multidimensional scaling and principal component analysis, we saw that most songs fell into a single cluster, regardless of popularity. Future research on this topic could include looking at songs outside of Spotify’s Top Ten, the source for our dataset, for more strongly “unpopular” songs. Finally, we dove deep into the danceability attribute and noted the differences between the danceability of genres. While pop and dance pop had predictably high danceability, neo mellow and British soul music had low danceability. Ultimately, this report serves as a good introductory analysis of the songs in Spotify’s Top Ten from 2010 to 2019, and has opened up the path for more in-depth analysis in the future.