library(factoextra)  # also loads ggplot2
library(tidyverse)   # dplyr::filter() and dplyr::lag() mask stats::filter() and stats::lag()
library(ggpubr)
library(ggseas)
library(trackdown)
library(stopwords)
library(wordcloud)   # also loads RColorBrewer
library(wordcloud2)
library(stringr)
library(tm)          # also loads NLP; NLP::annotate() masks ggplot2::annotate(), tm::stopwords() masks stopwords::stopwords()
library(ggridges)


spotify.df = read.csv('/Users/bobachubs/Documents/36315/HW/Spotify.csv')

Introduction

Abstract

Music is a source of entertainment and a staple of our everyday lives. Not only has the music industry shaped pop culture since the early 20th century, but the nearly unlimited number of options also allows us to personalize our playlists to suit particular tastes. However, imagine entering the industry as an aspiring artist and having to appeal to popular interests. That would require knowing the factors that make songs popular and how those factors shift over time. It turns out that there are many measurable aspects of a song that we can inspect to find such associations. For our project, we took the “Spotify: All Time Top 2000 Mega Dataset” from Kaggle and performed statistical and graphical analyses.

Description of dataset

This dataset was obtained by a Kaggle user via an app that uses the Spotify API to extract the 2000 most popular songs on Spotify from 1956-2019. There are 1994 observations, 3 categorical variables, and 12 quantitative variables, as described below:

Index: numbers the tracks of the original dataset from 1-1994
Title: name of the track
Artist: name of the artist
Top.Genre: genre of the track
Year: release year of the track, ranging from 1956 to 2019
Beats.Per.Minute..BPM: the overall estimated tempo of a track in BPM
Energy: the measure of the energy of a song from 0.0 (least energetic) to 1.0 (most energetic). This represents a measure of intensity and activity via dynamic range, loudness, timbre, onset rate, and entropy.
Danceability: the measure of how suitable a track is for dancing, from 0.0 (least danceable) to 1.0 (most danceable). This is based on a combination of elements including tempo, rhythm stability, beat strength, and overall regularity.
Loudness..dB: the loudness of a track in decibels, averaged across the entire track, ranging from -60 to 0 dB.
Liveness: the probability that the track was performed live, from 0.0 (low probability) to 1.0 (high probability). A value above 0.8 indicates a strong likelihood that the track is live.
Valence: the measure of the musical positiveness conveyed by a track, from 0.0 (lowest valence, sounds more negative) to 1.0 (highest valence, sounds more positive).
Length: the duration of the song in seconds
Acousticness: a confidence measure of whether the track is acoustic, from 0.0 (low confidence) to 1.0 (high confidence)
Speechiness: the measure of the presence of spoken words in a track. The more exclusively speech-like the recording (podcast, audio book, poetry), the closer to 1.0 the attribute value. Values above 0.66 describe tracks that are probably made entirely of spoken words. Values between 0.33 and 0.66 describe tracks that may contain both music and speech, either in sections or layered, including such cases as rap music. Values below 0.33 most likely represent music and other non-speech-like tracks.
Popularity: a value between 0 (least popular) and 100 (most popular). The popularity is calculated by an algorithm and is based, for the most part, on the total number of plays the track has had and how recent those plays are.
Popularity.Index:

Given that this dataset includes the most popular songs on Spotify, we were interested in looking at how these variables, many of which are audio features of the songs, affect (or don’t affect) popularity. Basically: what makes a song on Spotify popular?

Before we begin diving into the other variables, we first wanted to take a look at Popularity itself.

## Warning: NAs introduced by coercion

## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

The distribution of Popularity is roughly unimodal and skewed to the left, with a median of 62 and a mean of 59.53. The high median indicates that at least half of the songs in this dataset score 62 or higher on the 0-100 popularity scale, so the majority of the songs are more popular than not.

Research Questions

One of the most interesting features of a song is its genre, which leads us to our first research question.

How does genre affect popularity?

Before answering this, we need to check whether popularity differs across genres at all. We used a one-way ANOVA test to assess whether the mean popularity is the same across genres.

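This dataset has far too many specific genres for us to make any meaningful conclusions about the association between genre and popularity, so we grouped them into 10 general genres. Eight of them are listed below; the remaining two, misc and other rock, are left out of spotify.subset: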

unique(spotify.subset$Popular.Genre)

## [1] "adult standards" "album rock"      "alternative"     "pop"            
## [5] "hip hop"         "indie"           "country"         "metal"
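The chunk that builds Popular.Genre is not shown in this report. As a rough illustration only, one way such a grouping could be done is keyword matching on the specific genre label. The toy labels, the rule order, and the helper name `group_genre` below are all hypothetical and are not the mapping used for the analysis.

```r
# Hypothetical sketch: collapse specific genre labels into broad groups by keyword.
# The toy labels and the matching rules are assumptions, not the report's mapping.
top_genre <- c("dutch pop", "album rock", "alternative metal", "dance pop",
               "east coast hip hop", "dutch indie", "adult standards", "celtic")

group_genre <- function(g) {
  # Rules are checked in order; the first keyword found wins.
  rules <- c("adult standards", "album rock", "hip hop", "metal",
             "indie", "country", "alternative", "pop")
  for (r in rules) {
    if (grepl(r, g, fixed = TRUE)) return(r)
  }
  "misc"  # anything unmatched falls into a catch-all group
}

popular_genre <- vapply(top_genre, group_genre, character(1), USE.NAMES = FALSE)
table(popular_genre)
```

Because the rules are checked in order, a label such as "alternative metal" lands in metal rather than alternative; a different order would give a different grouping, which is why the mapping deserves to be stated explicitly.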

Then we perform our ANOVA test. Our null hypothesis is that the popularity means across genre groups are equivalent, and our alternative hypothesis is that the means are not equivalent (i.e. at least one is different).

summary(aov(Popularity ~ Popular.Genre, data=spotify.df))

##                 Df Sum Sq Mean Sq F value Pr(>F)    
## Popular.Genre    9  42928    4770   25.73 <2e-16 ***
## Residuals     1980 367063     185                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Since the p-value (< 2e-16) is much smaller than the significance level of 0.05, we reject the null hypothesis that the popularity means across genre groups are equivalent. We therefore have sufficient evidence that mean popularity differs between genre groups, i.e. that genre is associated with popularity.
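To make the F value in the ANOVA table concrete: it is the between-group mean square divided by the within-group mean square, here (42928 / 9) / (367063 / 1980) ≈ 25.73. The sketch below reproduces that calculation by hand on simulated data and compares it with aov(). The data frame `toy` and its three groups are made up for illustration.

```r
# Toy data (hypothetical): popularity scores for three genre groups.
set.seed(1)
toy <- data.frame(
  genre = rep(c("a", "b", "c"), each = 20),
  popularity = c(rnorm(20, 55, 10), rnorm(20, 60, 10), rnorm(20, 70, 10))
)

# F statistic by hand: between-group mean square over within-group mean square.
grand_mean  <- mean(toy$popularity)
group_means <- tapply(toy$popularity, toy$genre, mean)
group_sizes <- tapply(toy$popularity, toy$genre, length)
ss_between  <- sum(group_sizes * (group_means - grand_mean)^2)
ss_within   <- sum((toy$popularity - group_means[toy$genre])^2)
df_between  <- length(group_means) - 1
df_within   <- nrow(toy) - length(group_means)
f_manual    <- (ss_between / df_between) / (ss_within / df_within)

# The same quantity as reported by aov().
fit   <- aov(popularity ~ genre, data = toy)
f_aov <- summary(fit)[[1]][["F value"]][1]
all.equal(f_manual, f_aov)
```

A large F value says that the group means are spread out relative to the variation within groups; it does not say which groups differ or why.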

First, we’d like to look directly at the distribution of popularity within each genre with a density ridgeline plot. We omit misc and other rock, since these genres are too varied for us to draw any insightful conclusions.

# Density ridgeline plot of Popularity for each grouped genre.
# The fill is a constant, so no fill scale is needed.
ggplot(spotify.subset, aes(x = Popularity, y = Popular.Genre)) +
  geom_density_ridges(alpha = .75, color = "#1DB954", fill = "black") +
  labs(title = "Density of Popularity by Genre", y = "Genre") +
  scale_x_continuous(breaks = seq(from = 0, to = 100, by = 10), limits = c(0, 100))

## Picking joint bandwidth of 5.12

While most genres seem to be unimodal, hip hop seems more bimodal, and country seems trimodal. Indie has a lower center than the other genres (which makes sense because independent music tends to be more niche and less popular). The spread of alternative, album rock, and adult standards is small, whereas the spread of pop and country is much larger. From the density curves, it seems that genres like metal, alternative, album rock, and adult standards enjoy somewhat stable popularity, but we’d like to investigate how their popularity changes over time. Were these genres only popular during certain years, or are they consistently popular?

To look at the general trends of popularity of each genre across time, we used a time series plot.

# Time series popularity plot of popular genres over time

test = spotify.subset %>% group_by(Year, Popular.Genre) %>% summarize(avg = mean(Popularity))

## `summarise()` has grouped output by 'Year'. You can override using the `.groups`
## argument.

test %>%
  ggplot(aes(Year, avg, color = Popular.Genre)) + 
  stat_rollapplyr(width = 5, align = "center") +
  labs(x = "Year", y = "Popularity",
       title = "Avg Popularity of Popular Genres over Time")

## Warning: Removed 32 row(s) containing missing values (geom_path).

From the time series plot, we can read off how the average popularity of the top genres varies over time, both from the range of the individual lines and from the direction of their trends. To create the graph, we computed the mean popularity of each genre in each year and then smoothed each series with a 5-year rolling mean. We can see that indie started off unpopular in the 1970s but has grown more popular over time. Album rock has become less popular overall; country music was consistently mid-popular from the 1970s to the 1990s but dropped suddenly in the 2000s; pop has stayed relatively popular since the 1980s. Both metal and hip hop peaked in the 1990s-2000s and then declined quickly.
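As a small illustration of what a centered 5-point rolling mean does, here is a base R sketch on made-up yearly averages. The vector `avg` is hypothetical, and stats::filter() is used here in place of the ggseas helper purely to keep the example self-contained.

```r
# Hypothetical yearly average popularity for one genre.
avg <- c(50, 54, 58, 62, 66, 70, 74)

# Centered 5-point rolling mean. stats:: is spelled out because dplyr masks filter().
rolling <- as.numeric(stats::filter(avg, rep(1 / 5, 5), sides = 2))
rolling
```

The first two and last two positions are NA because a centered window of width 5 does not fit there. If each of the 8 genre lines loses 4 points this way, that would account for the 32 rows removed in the warning above, assuming every genre has at least five yearly values.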

Now that we have seen the popularity time trends of each genre, we would like to see in more detail how the genres compare across the years. For ease of analysis, we grouped years by decade, assuming that music trends are broadly similar within a decade. We counted the songs in each genre and decade, calculated the Pearson residuals of those counts, and displayed them with a mosaic plot.

# Mosaic plot of genre by decade, shaded by Pearson residuals
popular.table = table(spotify.subset$Popular.Genre, spotify.subset$decade)
mosaicplot(popular.table, shade = TRUE, las = 2, main = "Pearson Residual Plot of Popular Genres and Decades")

The Pearson residual mosaic plot above is shaded red and blue to indicate whether a genre appeared less or more often than expected during a particular decade. The bluer a section is, the larger the positive standardized residual, meaning that the genre appeared more often than expected in that decade; the redder a section is, the larger the negative residual, meaning that the genre appeared less often than expected. Country has no shading in any decade, meaning that its counts are close to what we would expect if genre and decade were independent, but every other genre is shaded in at least one decade. This suggests that there is a relationship between the genre of a song and its decade; in other words, the frequency of songs from a particular genre in this dataset changes over time. For instance, pop songs appeared less often than expected in the 1960s and 1970s but more often than expected in the 2000s and 2010s, meaning that the number of popular songs within the pop genre increased with time.
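For reference, a Pearson residual for one cell is (observed - expected) / sqrt(expected), where the expected count assumes genre and decade are independent. The sketch below works this out on a made-up 2 x 2 table and checks it against chisq.test(). The counts are hypothetical.

```r
# Hypothetical counts of songs by genre (rows) and decade (columns).
counts <- matrix(c(30, 10,
                   10, 30),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(genre = c("album rock", "pop"),
                                 decade = c("1970s", "2010s")))

# Expected counts under independence: row total * column total / grand total.
expected <- outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson residual: (observed - expected) / sqrt(expected).
pearson <- (counts - expected) / sqrt(expected)
pearson

# chisq.test() reports the same residuals.
all.equal(pearson, chisq.test(counts, correct = FALSE)$residuals,
          check.attributes = FALSE)
```

Positive residuals correspond to the blue cells of the mosaic plot (more songs than expected) and negative residuals to the red cells (fewer than expected).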

Now we’d like to see how (and whether) year and popularity are clustered together by genre, using a 2D contour plot.

# 2D contour plot: year, popularity, genre
spotify.subset %>%
  ggplot(aes(x=Year,y=Popularity)) +
  geom_point(alpha=.5, aes(color=Popular.Genre)) +
  geom_density2d() +
  labs(title="Popularity vs Year with 2D Contour")

We note two clusters. The first is around the 1970s-80s and consists mostly of album rock. This makes sense because the genre emerged in the 1970s and includes rock artists like Bon Jovi and AC/DC. The second is around the late 2000s and early 2010s, with mostly hip hop and pop grouped together; these genres have similar popularity levels and became a trend around the same time. We do not see other obvious clusters in the contour plot.

How do audio features affect popularity?

After a thorough analysis of genre and popularity, including some analysis of trends and interactions over time, we’d also like to include other variables, most of which are calculated from audio qualities like tempo, rhythm, volume, and key. Since there are multiple variables and they are all quantitative and similarly scaled, we decided to use a PCA biplot.

# PCA on the quantitative audio features and Popularity (columns 6 to 15)
spotify.quant = spotify.subset[, 6:15]
spotify.pca = prcomp(spotify.quant,
                      center = TRUE, scale. = TRUE)
# Biplot: points are songs colored by genre, arrows are variables
fviz_pca_biplot(spotify.pca, title = "PCA Biplot of Spotify Songs grouped by Popular Genres",
                label = "var", repel = TRUE,
                alpha.ind = .25, alpha.var = .75,
                habillage = spotify.subset$Popular.Genre, pointshape = 19)

The dataset was filtered and mutated to group the many specific genres under the popular main genres, and only those main genres are included in the PCA biplot. The biplot lets us see which genres lean toward which qualities of a song (Energy, Acousticness, Danceability, etc.). For example, country and adult standards tend to lean toward acousticness and away from energetic/loud qualities, which is plausible since both genres tend to be calmer and more melodic. Hip hop seems to lean toward higher popularity and valence/danceability, which is intuitive if many Spotify users are young. Metal corresponds with high BPM, in line with its reputation for intense, fast instrumentals. Alternative music seems to lean toward liveness (live recordings), which may reflect a less polished sound than that of songs recorded in the studio. Looking at the popularity arrow, we can see that it points in a similar direction to valence and danceability and toward the hip hop songs. The biplot therefore shows not only which main genres are associated with which qualities of music, but also which qualities point in the same direction and are associated with each other.
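To show how arrow directions in a biplot relate to correlations, here is a minimal prcomp() sketch on simulated features. The data frame `toy` and its four columns are invented so that two variables move together, one moves against them, and one is unrelated.

```r
# Simulated features (hypothetical), built with known relationships.
set.seed(42)
n <- 200
energy <- rnorm(n)
toy <- data.frame(
  energy       = energy,
  loudness     = energy + rnorm(n, sd = 0.5),   # built to move with energy
  acousticness = -energy + rnorm(n, sd = 0.5),  # built to move against energy
  valence      = rnorm(n)                       # unrelated to the others
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Loadings: the directions a biplot draws as arrows for each variable.
round(pca$rotation[, 1:2], 2)

# Share of total variance captured by each component.
var_share <- pca$sdev^2 / sum(pca$sdev^2)
round(var_share, 3)
```

On the first component, energy and loudness get loadings of the same sign and acousticness the opposite sign, which is what appears in a biplot as arrows pointing together or apart. A biplot only shows the first two components, so the variance shares are worth checking before reading much into arrow angles.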

After seeing these quantitative trends, we were curious about the categorical variables, namely Title.

How does a song’s title affect its popularity?

What kind of words are frequent in popular song titles? And is there any association between frequent words and song popularity? To answer this, we created a word cloud colored by popularity.

head(spotify.subset[order(spotify.subset$Popularity, decreasing = T), c(2, 3, 16, 5,15)])

##                                Title        Artist Popular.Genre Year
## 794                     Dance Monkey   Tones and I           pop 2019
## 788                         Memories      Maroon 5           pop 2019
## 787                          bad guy Billie Eilish           pop 2019
## 1642 All I Want for Christmas Is You  Mariah Carey           pop 1994
## 727                          Shallow     Lady Gaga           pop 2018
## 684                          Perfect    Ed Sheeran           pop 2017
##      Popularity
## 794         100
## 788          98
## 787          95
## 1642         95
## 727          88
## 684          87

To get a sense of the most popular songs in the dataset, we’ve displayed the head of the data frame with the song title, artist, genre category, release year, and popularity, ordered by decreasing Popularity. The six most popular songs are all pop. Even though popularity is partly based on how recent the plays are, the classic “All I Want for Christmas Is You” ranks in the top 5 although it was released in 1994.

words = unlist(strsplit(spotify.df$Title, " "))
words = words[!(tolower(words) %in% data_stopwords_smart$en)]
words = words[!(words %in% c("Remaster", "Remastered", "Single", "Version"))]
words = words[str_detect(words, "^[:alpha:]+$")]
top_words = data.frame(rev(sort(table(words)))[1:100])
top_words$words = as.character(top_words$words)

# Mean popularity of the songs whose title contains `word`.
# Note: fixed = TRUE matches substrings, so "love" also matches "lover".
get_popular_word = function(word) {
  hits = grepl(tolower(word), tolower(spotify.subset$Title), fixed = TRUE)
  mean(spotify.subset$Popularity[hits])
}

top_words$pop = vapply(top_words$words, get_popular_word, numeric(1), USE.NAMES = FALSE)

# Color each word by its mean popularity:
# dark (below 58.16), gray (58.16 up to 64.5), green (64.5 and above)
color = ifelse(top_words$pop < 58.16, "#192f1f",
               ifelse(top_words$pop < 64.5, "#b3b3b3", "#1ed760"))

wordcloud(words = top_words$words, freq = top_words$Freq,
           random.order = F, colors = color,rot.per=0.1, 
          max.words = Inf,ordered.colors=T)
title(main = "Wordcloud of Top Song Title Words by Popularity")

To get a sense of what kind of songs are popular based on titles alone, we generated a word cloud of the 100 most frequent words in song titles, with the size of each word corresponding to its relative frequency. From the visual, we see that “love” outranks any other title word by a large margin, which makes sense since a large portion of music is centered around romance. In addition to showing the frequency of words, we wanted to color them by their corresponding popularity. Since a word can be used in more than one title, we take the average popularity of the songs whose titles contain the word; green represents a high popularity average, gray represents a mid-range popularity average, and black represents a low popularity average. It is interesting that the most frequent word in titles, “love,” is not green but gray, indicating that it has a mid-range average popularity even though it is the most common word frequency-wise. One note to make here is that we removed stop words such as “the” and “it,” but only for English, which is why we see stop words from other languages such as “Je” and “De.” Though these are stop words, their occurrence shows that many top songs come from other countries and are internationally popular. In addition to stop words, we removed specific words such as “Remastered” and “Single,” since they carry no meaning for our analysis.
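One detail worth illustrating is how a word is matched against titles when averaging popularity. Plain substring matching also counts longer words that contain the target (for example “Lover” for “love”). The sketch below uses whole-word matching instead, as an alternative and not as the method used for the word cloud above. The data frame `songs`, its scores, and the helper `word_popularity` are hypothetical.

```r
# Hypothetical titles and popularity scores.
songs <- data.frame(
  Title = c("Love Me Do", "Lover", "Endless Love", "Dance Monkey"),
  Popularity = c(60, 90, 50, 100),
  stringsAsFactors = FALSE
)

# Mean popularity of songs whose title contains `word` as a whole word.
# Assumes `word` is purely alphabetic, so it needs no regex escaping.
word_popularity <- function(word, data) {
  pattern <- paste0("\\b", tolower(word), "\\b")
  hits <- grepl(pattern, tolower(data$Title), perl = TRUE)
  mean(data$Popularity[hits])
}

# Averages "Love Me Do" and "Endless Love" only; "Lover" is not counted.
word_popularity("love", songs)
```

With substring matching the same call would also include “Lover” and give a different average, so the choice of matching rule can shift which color a word receives.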

Conclusion

Main takeaways

Overall, it seems that there are multiple factors that come into play when analyzing the popularity of songs on Spotify. However, among all the variables we examined, genre does seem to play a heavy role in the popularity of a song. In fact, when subsetting our songs into generalized genres, we were able to see more trends in our density graph, contour plot, and Pearson’s residual mosaic plot.

Future research

One downside to this dataset is that it is not the most recent, ending in 2019. While it let us see trends that span a long period of time, it would also be interesting to analyze more recent Spotify data, which would let us see shorter-lived trends. For instance, over a shorter time range we might find other factors that contribute to popularity, such as the artist or subgenre rather than an overarching genre. Beyond examining different variables, future research could also apply different methods to the dataset, such as machine learning models.


Top 2000 Spotify Songs Project

Sarah Li, Evelyn Chung, Audrey Ding, Dunmin (Victor) Zhu

Due Monday, May 2, 2022 (11:59 PM EST)