Data Description

The “Spotify: All Time Top 2000 Mega Dataset” is a dataset from Kaggle that contains various audio statistics and ratings for 1,994 top songs on Spotify. For each song, it includes information such as the Title, Artist, Top Genre, Year of release, BPM (beats per minute), and Duration. The first three are nominal categorical variables, Year is an ordinal categorical variable, and the last two are quantitative variables. In addition, for each song, this dataset includes various quantitative ratings, such as those measuring its level of Energy, Danceability, Loudness, Liveness, Valence, Acousticness, Speechiness, and Popularity. We manually added Genre, Decade, and Decade Range columns so that we could group songs into fewer categories, which makes our visualizations clearer.
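As an illustration of how the Decade and Decade Range columns could be derived from Year, here is a minimal sketch. The rows are hypothetical placeholders rather than records from the dataset, and the 1990 cutoff between the two eras is an assumption based on the "1950s-1980s" and "1990s-2010s" groups used later in this report. The function names are ours.

```python
def decade(year: int) -> str:
    """Label a release year with its decade, e.g. 1987 -> '1980s'."""
    return f"{year // 10 * 10}s"


def decade_range(year: int) -> str:
    """Collapse decades into two eras (the 1990 cutoff is an assumption)."""
    return "1950s-1980s" if year < 1990 else "1990s-2010s"


# Hypothetical rows standing in for the Kaggle data.
songs = [
    {"Title": "Song A", "Year": 1967},
    {"Title": "Song B", "Year": 1994},
    {"Title": "Song C", "Year": 2019},
]

for song in songs:
    song["Decade"] = decade(song["Year"])
    song["Decade Range"] = decade_range(song["Year"])
    print(song)
```

Deriving the columns with a function instead of by hand keeps the grouping reproducible if the cutoff ever needs to change.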



Research Questions

Using our dataset, we would like to answer three main questions:



Research Question 1: Genres

To answer the first research question, we analyzed each quantitative variable by Genre using density plots, then looked at a dendrogram of all of the variables to check whether the genres separate clearly, and lastly looked at a contour plot of the variables that stood out in the density plots.

Graph 1

Out of the 11 quantitative variables in this dataset (setting aside Popularity, which is more relevant to our second research question), only Energy and Danceability appear to show a clear difference between the overall genres. (For the above graphs, we have removed the “other” genre, since it encompasses too many miscellaneous songs and does not provide useful information.) For these two variables, country songs tend to have lower Energy than songs of other genres, while hip hop songs tend to have higher Danceability than songs of other genres. Note, however, that our sample sizes for the hip hop, indie, and country genres are relatively small, so the conclusions drawn here are not necessarily reliable.
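To make the idea of a per-genre density plot concrete, the sketch below estimates a Gaussian kernel density for one variable within each genre. It is not the code behind Graph 1: the Energy values are made up, and the bandwidth uses the normal reference rule of thumb (1.06 × standard deviation × n^(-1/5)), which is an assumption on our part.

```python
import math
from statistics import stdev


def gaussian_kde(values, bandwidth=None):
    """Return a function that estimates the density of `values` at a point."""
    n = len(values)
    if bandwidth is None:
        # Normal reference rule of thumb (an assumed default).
        bandwidth = 1.06 * stdev(values) * n ** (-1 / 5)
    scale = 1.0 / (n * bandwidth * math.sqrt(2 * math.pi))

    def density(x):
        # Average of Gaussian bumps centered on each observation.
        return scale * sum(
            math.exp(-0.5 * ((x - v) / bandwidth) ** 2) for v in values
        )

    return density


# Hypothetical Energy ratings for two genres.
energy_by_genre = {
    "country": [30, 35, 42, 38, 45],
    "rock": [60, 72, 68, 80, 75],
}
densities = {genre: gaussian_kde(v) for genre, v in energy_by_genre.items()}

# Evaluate each curve on a coarse grid; plotting these gives the density plot.
grid = range(0, 101, 10)
curves = {genre: [f(x) for x in grid] for genre, f in densities.items()}
for genre, curve in curves.items():
    print(genre, [round(y, 4) for y in curve])
```

With only a handful of songs in a genre, the estimated curve depends heavily on the bandwidth, which is one reason to treat the small-genre comparisons cautiously.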

Graph 2

The clusters of this dendrogram, built from the three variables described above, do not correspond to the five main genres. In other words, the dendrogram appears to show very little difference between the five genres. The leaves of each cluster, colored by Genre, appear to be similarly distributed, with rock songs making up the majority of each.
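The comparison between clusters and genres can be sketched as follows. This is a simplified stand-in for the dendrogram, not our actual procedure: it merges songs bottom-up with complete linkage (the linkage method is an assumption), cuts the tree at two clusters, and then counts the genres inside each cluster. The points and genre labels are hypothetical.

```python
import math
from collections import Counter


def agglomerate(points, k):
    """Merge clusters bottom-up (complete linkage) until k clusters remain.

    Returns a list of clusters, each a list of indices into `points`.
    """
    clusters = [[i] for i in range(len(points))]

    def linkage(a, b):
        # Complete linkage: distance between the two farthest members.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        pairs = (
            (x, y)
            for x in range(len(clusters))
            for y in range(x + 1, len(clusters))
        )
        a, b = min(pairs, key=lambda p: linkage(clusters[p[0]], clusters[p[1]]))
        clusters[a] += clusters.pop(b)  # b > a, so index a stays valid
    return clusters


# Hypothetical (Energy, Danceability) pairs and their genres.
points = [(30, 40), (32, 42), (31, 45), (80, 70), (82, 75), (78, 72)]
genres = ["rock", "pop", "rock", "rock", "hip hop", "pop"]

clusters = agglomerate(points, k=2)
composition = [Counter(genres[i] for i in cluster) for cluster in clusters]
for cluster, counts in zip(clusters, composition):
    print(sorted(cluster), dict(counts))
```

If the genres were well separated, each cluster would be dominated by a single genre; a similar genre mix in every cluster is the pattern described above.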

Graph 3

Here, we plotted the variables that appeared to show some variation in the first graph. Again, this graph appears to show very little difference between the genres: there appears to be just one cluster, containing all of the genres somewhat uniformly. It also confirms that we have more rock songs than anything else in this dataset.


Research Question 2: Popularity

What qualities do popular songs embody? By taking a look at the correlations between Popularity and other attributes, we can see what these popular songs have in common.

Table 1

Variable        Correlation with Popularity
BPM             -0.00318
Energy           0.103
Danceability     0.144
Loudness         0.166
Liveness        -0.122
Valence          0.0959
Duration        -0.0367
Acousticness    -0.0876
Speechiness      0.112

After isolating the correlation coefficients from the pairs plot, we can see how strongly Popularity is correlated with each of the other quantitative variables. Because the magnitudes of all of these correlation coefficients are below 0.2, Popularity does not appear to have a strong linear association with any of these quantitative variables.
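For reference, a correlation coefficient of this kind can be computed as sketched below, assuming the pairs plot reports Pearson correlations (the report does not state which coefficient was used). The data are hypothetical. The second example is a reminder that a Pearson coefficient measures linear association only, so a value near zero does not rule out a nonlinear relationship.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


# Hypothetical ratings: a perfectly linear relationship.
loudness = [1, 2, 3, 4, 5]
popularity = [10, 20, 30, 40, 50]
print(pearson(loudness, popularity))

# A perfectly quadratic relationship that Pearson's r does not detect.
x = [-2, -1, 0, 1, 2]
y = [v ** 2 for v in x]
print(pearson(x, y))
```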

Graph 4

Looking across genres, one clear trend is that indie songs are less popular than those of the other genres. Hip hop appears to have the highest median Popularity, but given some clear left skew, it is hard to identify any clear differences between the non-indie genres.


Research Question 3: Time

The third research question mainly concerns time trends. Therefore, we explored various song attributes in the context of the Year or Decade Range in which each song was released.

Graph 5

Since our dataset contains so many quantitative variables, we first conducted principal component analysis (PCA). We then made a graph plotting the first two components, and colored our datapoints by the Decade Range variable so that we could make some comparisons regarding time without clouding the graph with too many overlapping colors.

We can see that the two Decade Range groups separate slightly along the first two principal components, since there are mostly blue datapoints toward the top and mostly red datapoints toward the bottom. One observation that can be made from this graph is that BPM increases as PC1 decreases and PC2 increases; since datapoints from the 1990s-2010s lie mostly in this general direction, songs from the 1990s-2010s appear to have somewhat higher BPM than songs from the 1950s-1980s. That being said, the Normal distribution ellipses overlap quite a bit; as a result, there is not enough evidence to conclude that the two groups are significantly different with respect to their principal components.
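The sketch below shows what PCA does in this setting: standardize the variables, find the directions of greatest variance, and project each song onto the first two of them. It is a dependency-free illustration using power iteration, not the routine used for Graph 5, and the three-variable rows are hypothetical. Note that the sign of each component is arbitrary, so statements about a variable "increasing with PC2" depend on the orientation the software happens to return.

```python
import math


def standardize(rows):
    """Center each column and scale it to unit sample standard deviation."""
    cols = list(zip(*rows))
    means = [sum(c) / len(c) for c in cols]
    sds = [
        math.sqrt(sum((v - m) ** 2 for v in c) / (len(c) - 1))
        for c, m in zip(cols, means)
    ]
    return [[(v - m) / s for v, m, s in zip(row, means, sds)] for row in rows]


def covariance(rows):
    """Sample covariance matrix of already-centered rows."""
    n, p = len(rows), len(rows[0])
    return [
        [sum(r[i] * r[j] for r in rows) / (n - 1) for j in range(p)]
        for i in range(p)
    ]


def top_components(cov, k=2, iters=500):
    """First k eigenvectors of a symmetric matrix (power iteration, deflation)."""
    p = len(cov)
    cov = [row[:] for row in cov]
    components = []
    for _ in range(k):
        v = [float(i + 1) for i in range(p)]  # arbitrary starting direction
        for _ in range(iters):
            w = [sum(cov[i][j] * v[j] for j in range(p)) for i in range(p)]
            norm = math.sqrt(sum(x * x for x in w))
            v = [x / norm for x in w]
        eigenvalue = sum(
            v[i] * cov[i][j] * v[j] for i in range(p) for j in range(p)
        )
        components.append(v)
        # Remove the direction just found so the next pass finds the next one.
        cov = [
            [cov[i][j] - eigenvalue * v[i] * v[j] for j in range(p)]
            for i in range(p)
        ]
    return components


# Hypothetical rows of (BPM, Energy, Danceability).
rows = [
    (100, 40, 50), (110, 55, 45), (120, 50, 60),
    (130, 70, 55), (140, 65, 70), (150, 80, 65),
]
z = standardize(rows)
components = top_components(covariance(z), k=2)
scores = [[sum(a * b for a, b in zip(row, c)) for c in components] for row in z]

print("loadings:", [[round(x, 3) for x in c] for c in components])
print("scores:", [[round(s, 3) for s in row] for row in scores])
```

Coloring the resulting scores by Decade Range reproduces the kind of view described above; in practice a numerical library would replace the hand-written eigenvector routine.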

Graph 6

To address some of our qualitative variables, we made a comparison word cloud between the top genres of songs from the 1950s to the 1980s and the top genres of songs from the 1990s to the 2010s to provide insight on how the top genres have changed, if at all, between these two eras.

From this word cloud, there are a few song genres that almost exclusively appeared in the 1950s-1980s, such as “adult standards,” “classic rock,” “album,” and “europop.” Meanwhile, “alternative,” “modern,” and “pop” music seem to be more popular genres in the 1990s-2010s. So, even though we grouped multiple decades together, making us unable to analyze how top genres may or may not have changed decade-to-decade, it is clear that there are some genres that were/are more prominent in one time period or another.
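The comparison underlying such a word cloud can be sketched numerically. To our understanding, a comparison cloud sizes each word by how far its rate in one group deviates from its average rate across groups; the code below computes that deviation directly. The genre labels are hypothetical examples, not counts from the dataset.

```python
from collections import Counter


def word_rates(genre_labels):
    """Share of all words in `genre_labels` accounted for by each word."""
    counts = Counter(word for label in genre_labels for word in label.split())
    total = sum(counts.values())
    return {word: count / total for word, count in counts.items()}


# Hypothetical Top Genre labels for the two eras.
top_genres = {
    "1950s-1980s": ["album rock", "classic rock", "adult standards", "album rock"],
    "1990s-2010s": ["alternative rock", "modern rock", "dance pop", "pop"],
}
rates = {era: word_rates(labels) for era, labels in top_genres.items()}
vocabulary = set().union(*rates.values())

# Deviation of each era's rate from the average rate across both eras.
deviation = {}
for word in vocabulary:
    average = sum(r.get(word, 0.0) for r in rates.values()) / len(rates)
    deviation[word] = {era: r.get(word, 0.0) - average for era, r in rates.items()}

for word in sorted(vocabulary, key=lambda w: -max(deviation[w].values())):
    era = max(deviation[word], key=deviation[word].get)
    print(f"{word:12s} leans toward {era}")
```

A word such as "rock" that is common in both eras ends up with a small deviation, which is why it would be drawn small even though it appears often.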

Graph 7

Finally, to more closely monitor how a single quantitative attribute has changed over time, we constructed a time series plot with decomposition measuring median Danceability.

The global trend can be seen in the second facet, which shows that the median Danceability rating for songs started off rather low, climbed throughout the 1970s, reached a peak around 1980, decreased through the late 1980s and 1990s, and has been steadily increasing since 2000. The seasonal component can be seen in the third facet: the consistent up-and-down nature of this plot, especially since 1990, suggests that the median Danceability rating follows a cyclical pattern that lasts about five years per cycle. Therefore, the main takeaway from this graph is that Danceability not only moves in big waves over the span of decades but also in small waves over the span of a few years.
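The decomposition can be illustrated with a minimal classical additive version: a centered moving average gives the trend, the average detrended value at each position in the cycle gives the seasonal component, and whatever is left is the remainder. This is a sketch under assumptions, not the routine behind Graph 7: the period of 5 years is taken from the cycle length described above, the function only handles odd periods, and the series is synthetic.

```python
from collections import defaultdict
from statistics import mean, median


def yearly_median(records):
    """Median value per year from (year, value) pairs, ordered by year."""
    by_year = defaultdict(list)
    for year, value in records:
        by_year[year].append(value)
    return {year: median(values) for year, values in sorted(by_year.items())}


def decompose(series, period):
    """Additive decomposition for an odd period: trend, seasonal, remainder."""
    half = period // 2
    n = len(series)
    trend = [None] * n  # undefined at the edges of the series
    for i in range(half, n - half):
        trend[i] = mean(series[i - half : i + half + 1])
    detrended = [None if t is None else s - t for s, t in zip(series, trend)]
    # Average detrended value at each position within the cycle.
    raw = [
        mean(d for d in detrended[k::period] if d is not None)
        for k in range(period)
    ]
    center = mean(raw)
    seasonal = [raw[i % period] - center for i in range(n)]
    remainder = [
        None if t is None else s - t - c
        for s, t, c in zip(series, trend, seasonal)
    ]
    return trend, seasonal, remainder


# Hypothetical (Year, Danceability) records reduced to one median per year.
print(yearly_median([(1980, 50), (1980, 60), (1981, 70)]))

# Synthetic yearly medians: a linear trend plus a repeating 5-year pattern.
pattern = [2.0, -1.0, 0.0, -2.0, 1.0]
series = [50 + 0.5 * i + pattern[i % 5] for i in range(20)]
trend, seasonal, remainder = decompose(series, period=5)
print([round(s, 3) for s in seasonal[:5]])
```

One caveat with this kind of method is that it returns a repeating seasonal component for whatever period it is given, so the size of the remainder is worth checking before reading the cycle as a real pattern.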



Conclusions

Overall, there is not much of a difference between the five genres, at least in terms of the quantitative variables in this dataset. While we did observe some minor trends in Energy and Danceability where one or two genres somewhat differentiated themselves from the rest, we could not observe any clear difference between them.

We were able to conclude that Popularity is not strongly associated with any of the other quantitative variables in the dataset. Indie music is less popular than other genres, but there seem to be no clear differences among the rest.

Finally, we can conclude that a variety of qualitative and quantitative attributes, such as Top Genre and Danceability, appear to have changed over time, but not necessarily to a statistically significant degree.

In our future work, we look forward to exploring these relationships with greater granularity and would be interested in experimenting with various subsets of the data to perform subgroup analyses.