One of the most interesting features of a song is its genre, which leads us to our first research question.
How does genre affect popularity?
Before answering this, we need to check whether genre is associated with popularity at all. We ran a one-way ANOVA to test whether mean popularity differs across genres.
This dataset has far too many specific genres for us to draw any meaningful conclusions about the association between genre and popularity, so we grouped them into 10 general genres. Eight of them remain once the two catch-all groups, misc and other rock, are set aside:
unique(spotify.subset$Popular.Genre)
## [1] "adult standards" "album rock" "alternative" "pop"
## [5] "hip hop" "indie" "country" "metal"
Then we perform our ANOVA test. Our null hypothesis is that the popularity means across genre groups are equivalent, and our alternative hypothesis is that the means are not equivalent (i.e. at least one is different).
summary(aov(Popularity ~ Popular.Genre, data=spotify.df))
## Df Sum Sq Mean Sq F value Pr(>F)
## Popular.Genre 9 42928 4770 25.73 <2e-16 ***
## Residuals 1980 367063 185
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Since the p-value is far below the 0.05 significance level, we reject the null hypothesis that mean popularity is equal across genre groups. We therefore have sufficient evidence that mean popularity differs for at least one genre group; that is, genre is associated with popularity.
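As a sanity check on the table above, the F statistic and p-value can be recomputed by hand from the reported sums of squares and degrees of freedom. This is a minimal sketch in base R; the variable names are ours and are not part of the original analysis.

```r
# Values copied from the ANOVA table above
ss_genre = 42928;  df_genre = 9
ss_resid = 367063; df_resid = 1980

# Mean squares are sums of squares divided by their degrees of freedom
ms_genre = ss_genre / df_genre
ms_resid = ss_resid / df_resid

# F is the ratio of the between-group to the within-group mean square
f_stat = ms_genre / ms_resid

# The upper tail of the F distribution gives the p-value
p_value = pf(f_stat, df_genre, df_resid, lower.tail = FALSE)

round(c(ms_genre = ms_genre, ms_resid = ms_resid, f_stat = f_stat), 2)
p_value < 2e-16
```

The nine degrees of freedom for Popular.Genre correspond to the 10 genre groups minus one.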
First, we’d like to look directly at the distribution of popularity within each genre with a density ridgeline plot. We decided to omit misc and other rock, since these genres are too varied for us to draw any insightful conclusions.
# Density ridgeline plot of popularity within each genre
library(ggplot2)
library(ggridges)  # provides geom_density_ridges()
ggplot(spotify.subset, aes(x = Popularity, y = Popular.Genre)) +
  # fill and outline are fixed rather than mapped to genre, so no fill scale is needed
  geom_density_ridges(alpha = .75, color = "#1DB954", fill = "black") +
  labs(title = "Density of Popularity by Genre", y = "Genre") +
  scale_x_continuous(breaks = seq(from = 0, to = 100, by = 10), limits = c(0, 100))
## Picking joint bandwidth of 5.12
While most genres seem to be unimodal, hip hop seems more bimodal, and country seems trimodal. Indie has a lower center than the other genres (which makes sense because independent music tends to be more niche and less popular). The spread of alternative, album rock, and adult standards is small, whereas the spread of pop and country is much larger. From the density curves, it seems that genres like metal, alternative, album rock, and adult standards enjoy somewhat stable popularity, but we’d like to investigate how their popularity changes over time. Were these genres only popular during certain years, or are they consistently popular?
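The visual impressions of center and spread can be cross-checked numerically. The sketch below uses a small made-up data frame (toy, with hypothetical values) in place of spotify.subset; with the real data, the same tapply() calls would presumably be applied to Popularity and Popular.Genre.

```r
# Hypothetical toy data standing in for spotify.subset
toy = data.frame(
  Popular.Genre = rep(c("indie", "pop", "metal"), each = 5),
  Popularity    = c(30, 35, 40, 45, 50,   # indie: low center
                    20, 45, 60, 75, 95,   # pop: wide spread
                    58, 60, 61, 62, 64)   # metal: narrow spread
)

# Median (center) and interquartile range (spread) of popularity per genre
medians = tapply(toy$Popularity, toy$Popular.Genre, median)
iqrs    = tapply(toy$Popularity, toy$Popular.Genre, IQR)

genre_summary = data.frame(median = medians, iqr = iqrs)
genre_summary
```

A low median matches a ridge that sits further left, and a large interquartile range matches a wider ridge.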
To look at the general trends of popularity of each genre across time, we used a time series plot.
# Time series popularity plot of popular genres over time
test = spotify.subset %>% group_by(Year, Popular.Genre) %>% summarize(avg = mean(Popularity))
## `summarise()` has grouped output by 'Year'. You can override using the `.groups`
## argument.
test %>%
ggplot(aes(Year, avg, color = Popular.Genre)) +
stat_rollapplyr(width = 5, align = "center") +
labs(x = "Year", y = "Popularity",
title = "Avg Popularity of Popular Genres over Time")
## Warning: Removed 32 row(s) containing missing values (geom_path).
From the time series plot, we can read off how the average popularity of the top genres varies over time, both from the range of each line and from the direction of its trend. The plot uses a summarized dataset in which mean popularity is computed for each genre in each year and then smoothed over a rolling window of 5 years. We can see that indie started off unpopular in the 1970s but has grown more popular over time. Album rock has become less popular overall; country music was consistently mid-popular from the 1970s to the 1990s but dropped suddenly in the 2000s; pop has stayed relatively popular since the 1980s. Both metal and hip hop peaked in the 1990s-2000s and then declined quickly, which suggests that their eras of peak popularity were short.
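To make the 5-year smoothing concrete, here is a minimal base-R sketch of a centered moving average on one hypothetical series (the values in avg are invented). It uses stats::filter() rather than the plotting layer used above, so it illustrates the idea and does not reproduce the exact computation.

```r
# Hypothetical yearly averages for one genre (not real data)
avg = c(50, 52, 55, 53, 58, 60, 59, 63, 65, 64)

# Centered 5-point moving average: each value is the mean of itself,
# the two points before it, and the two points after it
rolling = as.numeric(stats::filter(avg, rep(1 / 5, 5), sides = 2))
rolling

# The first two and last two positions have no complete window
n_missing = sum(is.na(rolling))
n_missing
```

Four missing values per series, across eight genres, would be consistent with the warning about 32 removed rows, assuming the window is not padded at the ends and every genre contributes one series.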
Now that we have seen the popularity time trends of each genre, we would like to see in more detail how the genres compare across the years. For ease of analysis, we decided to group years by decade, on the assumption that music popularity trends are similar within a decade. We cross-tabulated the number of songs by genre and decade and displayed the Pearson residuals of those counts with a mosaic plot.
# mosaic plot
popular.table = table(spotify.subset$Popular.Genre, spotify.subset$decade)
mosaicplot(popular.table, shade=T, las = 2, main = "Pearsons Residual Plot of Popular Genres and Decades")
The mosaic plot above is shaded red and blue by Pearson residual to indicate whether a genre was less prevalent or more prevalent than expected during a particular decade. The bluer a section is, the larger the positive standardized residual, meaning that the genre appeared more often than expected in that decade; the redder a section is, the larger the negative residual, meaning that it appeared less often than expected. While the country genre has no shading for any decade, meaning that its counts are close to what we would expect if genre and decade were independent, every other genre has some shading. This suggests that there is a relationship between the genre of a song and its decade; in other words, the frequency of songs from a particular genre in this dataset changes over time. For instance, pop songs were less frequent than expected in the 1960s and 1970s but more frequent than expected in the 2000s and 2010s, meaning that the number of popular songs within the pop genre increased with time.
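The residuals behind the shading can be computed directly from a contingency table. The sketch below uses a hypothetical 3-by-2 table of counts (not the real data) to show the arithmetic, and checks it against the residuals that chisq.test() reports.

```r
# Hypothetical genre-by-decade counts (not the real data)
counts = matrix(c(30, 10,
                  20, 20,
                  10, 30),
                nrow = 3, byrow = TRUE,
                dimnames = list(genre  = c("album rock", "country", "pop"),
                                decade = c("1970s", "2010s")))

# Expected counts under independence: row total * column total / grand total
expected = outer(rowSums(counts), colSums(counts)) / sum(counts)

# Pearson residual: (observed - expected) / sqrt(expected)
pearson = (counts - expected) / sqrt(expected)
round(pearson, 2)

# chisq.test() reports the same residuals
all.equal(unname(pearson), unname(chisq.test(counts)$residuals))
```

In this toy table the country row has residuals of zero, which is the situation an unshaded row in the mosaic plot approximates, while album rock is over-represented in the earlier decade and pop in the later one.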
Now we’d like to see how (and whether) year and popularity are clustered together by genre, using a 2D contour plot.
# 2D contour plot: year, popularity, genre
spotify.subset %>%
ggplot(aes(x=Year,y=Popularity)) +
geom_point(alpha=.5, aes(color=Popular.Genre)) +
geom_density2d() +
labs(title="Popularity vs Year with 2D Contour")
We note two clusters. The first is around the 1970s-80s and consists mostly of album rock. This makes sense because album-oriented rock took shape in the 1970s and is associated with rock artists like AC/DC and Bon Jovi. The second is around the late 2000s and early 2010s, with mostly hip hop and pop grouped together; these two genres have similar popularity levels and became a trend around the same time. We do not see any other obvious clusters in the contour plot.
How do audio features affect popularity?
After a thorough analysis of genre and popularity, including some analysis of trends and interactions over time, we’d also like to include other variables, most of which are calculated from audio qualities like tempo, rhythm, volume, and key. Since there are multiple variables and they are all quantitative, we decided to use a PCA biplot, centering and scaling each variable so that differences in units do not dominate the components.
#pca
spotify.quant = spotify.subset[, 6:15]
spotify.pca = prcomp(spotify.quant,
center = TRUE, scale. = TRUE)
# spotify.subset = spotify.subset %>% mutate (decades = cut(Year, c(0, 1980, 2000, 2020),
# labels = c("Old", "Recent", "New")))
#biplot to get relations
fviz_pca_biplot(spotify.pca, title = "PCA Biplot of Spotify Songs grouped by Popular Genres ",
label = "var", repel = TRUE,
alpha.ind = .25, alpha.var = .75,
habillage = spotify.subset$Popular.Genre, pointshape = 19)
The dataset was filtered and mutated so that the many specific genres are grouped under the popular main genres, and only those main genres appear in the PCA biplot. The biplot lets us see which genres lean toward which qualities of a song (Energy, Acousticness, Danceability, etc.). For example, country and adult standards tend to lean toward acousticness and away from energetic or loud qualities, which fits the calmer, more melodic character we would expect of these genres. Hip hop seems to lean toward higher popularity and toward valence and danceability, which is plausible if many Spotify users are young. Another intuitive observation is that metal corresponds with high BPM, in keeping with its intense, fast instrumentals. Alternative music seems to lean toward liveness (the likelihood that a track was recorded live), perhaps because the genre tends to be less polished than studio-produced modern songs. The popularity arrow points in roughly the same direction as valence and danceability, and toward the hip hop songs. In short, the biplot shows not only which main genres are associated with which qualities of music but also which qualities point in the same direction and are therefore associated with one another.
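The arrows in a biplot are drawn from the PCA loadings, so inspecting those numbers can back up what the plot suggests. The following sketch runs prcomp() with the same centering and scaling options on USArrests, a dataset that ships with R and stands in here for the Spotify audio features; nothing in it comes from the original analysis.

```r
# USArrests ships with R and stands in for the Spotify audio features
pca = prcomp(USArrests, center = TRUE, scale. = TRUE)

# Loadings: how each original variable contributes to each component.
# Variables with similar loadings on PC1 and PC2 point the same way in a biplot.
round(pca$rotation[, 1:2], 2)

# Share of total variance captured by each component
var_explained = pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 2)
```

If the first two components capture only a modest share of the variance, associations read off the biplot should be treated with some caution, since the plot shows just those two dimensions.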
After seeing these quantitative trends, we were curious about the categorical variables, namely Title.
How does a song’s title affect its popularity?
What kind of words are frequent in popular song titles? And is there any association between frequent words and song popularity? To answer this, we created a word cloud colored by popularity.
head(spotify.subset[order(spotify.subset$Popularity, decreasing = T), c(2, 3, 16, 5,15)])
## Title Artist Popular.Genre Year
## 794 Dance Monkey Tones and I pop 2019
## 788 Memories Maroon 5 pop 2019
## 787 bad guy Billie Eilish pop 2019
## 1642 All I Want for Christmas Is You Mariah Carey pop 1994
## 727 Shallow Lady Gaga pop 2018
## 684 Perfect Ed Sheeran pop 2017
## Popularity
## 794 100
## 788 98
## 787 95
## 1642 95
## 727 88
## 684 87
Just to get a sense of the most popular songs in the dataset, we’ve displayed the head of the dataframe with the song title, artist, genre category, release year, and popularity, ordered by decreasing Popularity. We can see that the top songs are all pop. Although popularity is tied to recent plays, the classic “All I Want for Christmas Is You” sits in the top 5 even though it was released in 1994.
words = unlist(strsplit(spotify.df$Title, " "))
words = words[!(tolower(words) %in% data_stopwords_smart$en)]
words = words[!(words %in% c("Remaster", "Remastered", "Single", "Version"))]
words = words[str_detect(words, "^[:alpha:]+$")]
top_words = data.frame(rev(sort(table(words)))[1:100])
top_words$words = as.character(top_words$words)
# Mean popularity of all songs whose title contains the given word
# (case-insensitive substring match)
get_popular_word = function(word) {
  hits = grepl(tolower(word), tolower(spotify.subset$Title), fixed = TRUE)
  mean(spotify.subset$Popularity[hits])
}
top_words$pop = vapply(top_words$words, get_popular_word, numeric(1), USE.NAMES = FALSE)
# Color each word by its mean popularity:
# dark (< 58.16), gray (58.16 up to 64.5), green (64.5 and above)
color = ifelse(top_words$pop < 58.16, "#192f1f",
               ifelse(top_words$pop < 64.5, "#b3b3b3", "#1ed760"))
wordcloud(words = top_words$words, freq = top_words$Freq,
random.order = F, colors = color,rot.per=0.1,
max.words = Inf,ordered.colors=T)
title(main = "Wordcloud of Top Song Title Words by Popularity")
To get a sense of what kind of songs are popular based on titles alone, we generated a word cloud of the 100 most frequent words in song titles, with the size of each word corresponding to its relative frequency. From the visual, we see that “love” outranks every other title word by a large margin, which makes sense since a large portion of music is centered around romance. In addition to showing the frequency of words, we wanted to account for their corresponding popularities. Since a word can appear in more than one title, we take the average popularity of the songs whose titles contain it; green represents a high average popularity, gray a mid-range average, and black a low average. It is interesting that the most frequent title word, “love,” is not green but gray, indicating a mid-range average popularity even though the word itself is the most common. One note: we removed stop words such as “the” and “it,” but only for English, which is why stop words from other languages, including “Je” and “De,” still appear. Though these are stop words, their occurrences show that a number of top songs come from other countries and are internationally popular. In addition to stop words, we removed specific words such as “Remastered” and “Single” since they add no meaning to our analysis.
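One caveat about the averages: the code above matches words with grepl(..., fixed = TRUE), which is a substring match, so a word like “love” would also be counted in titles containing “lovely” or “glove.” The sketch below, on four hypothetical titles with invented popularity scores, shows how a whole-word match can change the average; it is an illustration, not a correction applied to the analysis.

```r
# Hypothetical titles and popularity scores (not from the dataset)
titles     = c("Love Story", "Lovely Day", "Glove Box", "Endless Love")
popularity = c(80, 60, 40, 70)

# Substring match: "love" also hits "Lovely" and "Glove"
substring_hit = grepl("love", tolower(titles), fixed = TRUE)

# Whole-word match: \\b marks a word boundary on each side
word_hit = grepl("\\blove\\b", tolower(titles), perl = TRUE)

mean(popularity[substring_hit])  # averages all four titles
mean(popularity[word_hit])       # averages only "Love Story" and "Endless Love"
```

Whether the difference matters for the real data depends on how many of the top 100 words occur inside longer words.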