With applications like Spotify and Apple Music, streaming music has never been easier. This has changed how people listen to music and how artists release it. The goal of our project is to understand the data behind streaming and how it varies over time, by popularity, and by genre.
The dataset for this project comes from Spotify’s web API via the spotifyr package. The dataset has 32,828 rows and 22 columns. The columns correspond to the following variables:
We treat the following variables as audio features: danceability, energy, key, loudness, mode, speechiness, acousticness, instrumentalness, liveness, valence, tempo, and duration_min.
How is genre related to the audio features?
Do song name and playlist name affect the overall popularity of a song?
How is album release year related to genre and popularity?
To answer our first question, we treat the audio features as predictor variables and genre as the response variable. To visualize several quantitative variables at once, we use a PCA biplot. First, however, we look at the mean of each audio feature to get a general sense of each one. For the features measured on a 0 to 1 scale, the average track sits above the midpoint for danceability and energy, just above it for valence, and well below it for speechiness, acousticness, instrumentalness, and liveness. In other words, tracks tend to be danceable and energetic, slightly more positive than negative, and mostly not speechy, acoustic, instrumental, or live. The mean of mode is about 0.57, so roughly 57% of tracks are in a major key. Key averages around 5.4 on Spotify’s 0 to 11 pitch-class scale, which is consistent with tracks being spread across most keys. Loudness averages around -6.7 dB, a level often considered suitable for finished tracks. Tempo averages about 121 BPM, close to the 120 BPM that is widely treated as a standard tempo, a convention often attributed to the link between percussion rhythm and human rhythm. Finally, tracks average about 3.8 minutes, which makes sense since most songs run between 3 and 4 minutes.
##     danceability           energy              key         loudness
##       0.65485038       0.69860261       5.37394907      -6.71952903
##             mode      speechiness     acousticness instrumentalness
##       0.56573657       0.10705347       0.17535152       0.08475987
##         liveness          valence            tempo     duration_min
##       0.19017541       0.51055588     120.88364156       3.76328050
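The per-feature averages above can be reproduced with a column-wise mean. The following is a minimal Python sketch rather than the R code behind this report: the toy rows, the `feature_means` helper, and the `duration_ms / 60000` conversion to `duration_min` are all assumptions made for illustration.

```python
from statistics import mean

# Hypothetical toy rows; the real dataset has 32,828 tracks.
tracks = [
    {"danceability": 0.75, "energy": 0.92, "mode": 1, "duration_ms": 194754},
    {"danceability": 0.73, "energy": 0.82, "mode": 1, "duration_ms": 162600},
    {"danceability": 0.68, "energy": 0.93, "mode": 0, "duration_ms": 176616},
    {"danceability": 0.72, "energy": 0.93, "mode": 1, "duration_ms": 169093},
]


def feature_means(rows, features):
    """Return the mean of each named feature across all rows."""
    return {f: mean(row[f] for row in rows) for f in features}


# Assumed conversion: duration in minutes = duration in milliseconds / 60000.
for row in tracks:
    row["duration_min"] = row["duration_ms"] / 60000

means = feature_means(tracks, ["danceability", "energy", "mode", "duration_min"])
for name, value in means.items():
    print(f"{name:>14s} {value:.4f}")
```

Because mode is coded 0 (minor) or 1 (major), its mean is simply the share of tracks in a major key, which is how the 0.566 in the output above can be read.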
Now that we have inspected our audio features separately, we can take a look at our PCA biplot which takes into account genre:
The first principal component accounts for 17.94% of the total variation and the second accounts for 12.96%, so together they account for 30.9%. The arrows that point right in the biplot are loudness and energy. This indicates that songs with a high first principal component score tend to have higher values of these variables. We see that EDM, on average, corresponds with higher values of loudness and energy. This makes sense since EDM songs tend to be more upbeat and energetic. They also have more layers of sound, which increases loudness. On the other hand, the arrow that points left in the biplot is acousticness. This indicates that songs with a low first principal component score tend to have higher values of this variable. We see that r&b corresponds, on average, with higher values of acousticness. This makes sense since r&b songs are guitar and band heavy.
The arrows that point up in the biplot are danceability, valence, and speechiness. This indicates that songs with a high second principal component score tend to have higher values of these variables. We see that rap, r&b, and latin music, on average, correspond with higher values of danceability, valence, and speechiness. This makes sense since rap tends to be more speechy and r&b and latin music are great for dancing. There are also a lot of positive messages associated with each of these genres. On the other hand, the arrows that point down in the biplot are instrumentalness, duration_min, tempo, and liveness. This indicates that songs with a low second principal component score tend to have higher values of these variables. We see that rock and EDM correspond, on average, with higher values of instrumentalness, duration_min, tempo, and liveness. This makes sense since rock and EDM are faster paced and tend to be longer pieces. Rock also uses many band instruments and includes many recordings made in front of a live audience.
Pop’s average is quite close to 0 on both components, so it is hard to characterize it by its audio features. Also, key does not have a prominent arrow in relation to genre. This makes sense since a “higher” key just means a higher note. In conclusion, there is clear overlap between genres, which is expected; however, each genre’s points are also skewed in a different direction in the biplot.
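To make the two quantities read off the biplot concrete, the sketch below computes explained-variance shares and first-component loadings (the arrow directions) for a tiny example. It is an illustration in Python, not the report's code. It assumes the PCA was run on standardized variables, which the report does not state, and it uses made-up values for three features. The helper names are hypothetical, and power iteration is used only to keep the example free of dependencies.

```python
import math


def standardize(columns):
    """Scale each column to mean 0 and sample standard deviation 1."""
    out = []
    for col in columns:
        n = len(col)
        m = sum(col) / n
        sd = math.sqrt(sum((x - m) ** 2 for x in col) / (n - 1))
        out.append([(x - m) / sd for x in col])
    return out


def correlation_matrix(columns):
    """Correlation matrix (p x p) of the given columns."""
    z = standardize(columns)
    n = len(z[0])
    return [[sum(a * b for a, b in zip(zi, zj)) / (n - 1) for zj in z] for zi in z]


def top_eigenpair(matrix, iterations=500):
    """Power iteration: largest eigenvalue and its unit eigenvector."""
    p = len(matrix)
    v = [1.0 / (i + 1) for i in range(p)]  # arbitrary non-uniform start
    for _ in range(iterations):
        w = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    mv = [sum(matrix[i][j] * v[j] for j in range(p)) for i in range(p)]
    return sum(a * b for a, b in zip(v, mv)), v


def deflate(matrix, eigenvalue, v):
    """Subtract a found component so the next one becomes the largest."""
    p = len(matrix)
    return [[matrix[i][j] - eigenvalue * v[i] * v[j] for j in range(p)] for i in range(p)]


# Hypothetical values for six tracks (not taken from the dataset).
features = {
    "energy": [0.9, 0.8, 0.7, 0.4, 0.3, 0.2],
    "loudness": [-4.0, -5.0, -6.5, -9.0, -10.0, -12.0],
    "acousticness": [0.05, 0.10, 0.20, 0.60, 0.70, 0.65],
}

corr = correlation_matrix(list(features.values()))
eig1, pc1 = top_eigenpair(corr)
eig2, pc2 = top_eigenpair(deflate(corr, eig1, pc1))

# The trace of a correlation matrix equals the number of variables,
# so each component's share of total variation is eigenvalue / p.
p = len(corr)
ratio1, ratio2 = eig1 / p, eig2 / p
print(f"PC1 share: {ratio1:.2%}, PC2 share: {ratio2:.2%}")
for name, loading in zip(features, pc1):
    print(f"{name:>13s} PC1 loading {loading:+.3f}")
```

In this toy data, energy and loudness load on the first component with the same sign and acousticness with the opposite sign, which mirrors the right-pointing and left-pointing arrows described above. The overall sign of a component is arbitrary, so only the relative directions carry meaning.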
The two text comparison graphs below each show a pair of word clouds split by popularity. In the first graph, the blue word cloud displays words from track names with a popularity rating between 50 and 100, whereas the red word cloud displays words from track names with a popularity rating between 0 and 49. In the second graph, the blue word cloud displays words from playlist names with a popularity rating between 50 and 100, whereas the red word cloud displays words from playlist names with a popularity rating between 0 and 49.
First, we take a look at the effect of track names on popularity. Both sub-clouds share some common words such as remix, feat, and love. Within each cloud, a larger font size represents a more frequent word. For instance, it appears that remixed songs are more prevalent among the less popular tracks than among the more popular ones. A significant difference between the two clouds is that the less popular songs, shown in red, contain the word “origin,” whereas this word does not appear at all in the cloud for the more popular ones. Upon closer examination, “origin” is a shortened form of “original.” Therefore, there could be more original songs among the less popular tracks than among the more popular ones. On the other hand, the more popular songs seem to include many remastered and remixed versions. This makes sense since more popular songs are more likely to have several versions of the same song.
Next, we take a look at the effect of playlist names on popularity. The analysis shows that the most popular songs on Spotify in 2020 were frequently featured on playlists whose names contain terms such as pop and rock, while the less popular songs were associated with playlist names containing house, pop, and rock. Interestingly, while house did appear as a frequent term in the popular playlist names, it was not as common there as in the less popular ones. This could suggest that house playlists on Spotify contain a larger share of less popular tracks.
Another notable finding is that the term “wave track” appeared in the popular playlist names but not in the less popular ones. This could indicate that this term is more closely associated with mainstream music and may appeal to a wider audience. Additionally, the term “2010s” appeared frequently in the popular playlist names but was not as common in the less popular ones. This suggests that listeners may have a preference for music from this specific decade, which could influence the type of songs that become popular on Spotify. Overall, the analysis suggests that playlist names are associated with the popularity of songs on Spotify, and that certain genres and decades may resonate more strongly with listeners.
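The comparison behind these word clouds amounts to counting words separately for names above and below the popularity cutoff of 50. The sketch below shows that idea in Python on made-up track names. The stopword list and the helper names are assumptions, and unlike the clouds above (where “original” appears as “origin”) it does not stem words.

```python
import re
from collections import Counter

# Hypothetical toy rows: (track_name, track_popularity).
tracks = [
    ("Love Me Again - Remix", 35),
    ("Summer Love (feat. Someone)", 72),
    ("Night Drive - Original Mix", 20),
    ("Old Song - Remastered 2011", 65),
    ("Love Story - Remix", 41),
]

# Assumed minimal stopword list, for illustration only.
STOPWORDS = {"the", "a", "me", "of"}


def word_counts(names):
    """Count lowercase word tokens across names, skipping stopwords."""
    counts = Counter()
    for name in names:
        for word in re.findall(r"[a-z0-9']+", name.lower()):
            if word not in STOPWORDS:
                counts[word] += 1
    return counts


def split_by_popularity(rows, threshold=50):
    """Return word counts for names at or above the threshold and below it."""
    popular = [name for name, pop in rows if pop >= threshold]
    less = [name for name, pop in rows if pop < threshold]
    return word_counts(popular), word_counts(less)


popular, less_popular = split_by_popularity(tracks)
print("popular:", popular.most_common(3))
print("less popular:", less_popular.most_common(3))
```

The same function can be applied to playlist names by passing `(playlist_name, track_popularity)` pairs. Because the two groups can differ in size, comparing relative frequencies rather than raw counts would be a fairer basis for statements about which words are more prevalent.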
Next, we are interested in learning more about how album release year relates to genre and popularity. To identify patterns and trends within these variables, we first look at autocorrelation plots.
From these autocorrelation plots, we can see that the autocorrelations of edm, pop, and latin music seem to be particularly high, though every genre does appear to have some significant autocorrelation. This indicates that the number of albums released in a particular genre at a given time is not random, which in turn means that there is likely some pattern between the time of an album’s release and the genre of the album.
Knowing that album releases by genre are not random over time, we proceed to analyze how the popularity of songs changes over time within each genre.
The following graph depicts the overall number of song releases in each genre over time. Based on the density curves, all genres except “rock” have peaks in the number of releases around 2020. Songs in the genre “rock” seem to have a steady number of releases over time. The matching box plots provide more detailed information about the medians and the first and third quartiles. In general, it is visually observable that the majority of songs in the genres “r&b”, “rap”, and “rock” were released in recent years, from 2015 to 2020.
Having seen the pattern in the number of releases over time, we now proceed to analyze the mean popularity of songs over time. In other words, which genre is favored in different years? By grouping the songs by their year of release and genre, we obtain the average popularity of each genre in each year. We can then perform an ANOVA test to determine whether the mean popularity of songs is the same across genres. The null hypothesis of this ANOVA test is that the mean popularity is the same for all genres; the alternative hypothesis is that not all means are equal (the mean popularity of at least one genre is different).
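For readers who want to see what an autocorrelation plot computes, the sketch below evaluates the sample autocorrelation at a few lags and compares it with the approximate 95% band of ±1.96/√n that such plots usually draw. It is a Python illustration with made-up yearly album counts, not the report's data or code.

```python
import math


def autocorrelation(series, max_lag):
    """Sample autocorrelation r_k for lags 0..max_lag."""
    n = len(series)
    m = sum(series) / n
    denom = sum((x - m) ** 2 for x in series)
    acf = []
    for k in range(max_lag + 1):
        num = sum((series[t] - m) * (series[t + k] - m) for t in range(n - k))
        acf.append(num / denom)
    return acf


# Hypothetical yearly album counts for one genre (not from the dataset).
counts = [3, 4, 4, 6, 7, 9, 12, 15, 21, 30, 44, 60]

acf = autocorrelation(counts, max_lag=4)
bound = 1.96 / math.sqrt(len(counts))  # approximate 95% band under white noise
for lag, r in enumerate(acf):
    flag = "outside band" if lag > 0 and abs(r) > bound else ""
    print(f"lag {lag}: r = {r:+.3f} {flag}")
```

A value outside the band at a given lag suggests that the count in one year is related to the count that many years earlier, which is the sense in which the releases are described as "not random" above.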
##                 Df Sum Sq Mean Sq F value   Pr(>F)
## playlist_genre   5   5094  1018.7   7.066 2.94e-06 ***
## Residuals      296  42677   144.2
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
As shown by the result of the ANOVA test, the p-value of 2.94e-06 is much smaller than the significance level of 0.05. We therefore reject the null hypothesis and conclude that we have sufficient evidence that the mean popularities of songs in different genres are not all equal.
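The F value in the table is the ratio of the between-genre mean square to the residual mean square, and the degrees of freedom (5 and 296) imply 6 genres and 302 year-by-genre averages. The Python sketch below recomputes the F statistic from the printed sums of squares and then applies the same one-way ANOVA formula to a small made-up example. The `one_way_anova` helper and the toy values are assumptions for illustration, and the p-value is not computed here.

```python
def one_way_anova(groups):
    """Return (df_between, df_within, F) for a one-way ANOVA."""
    all_values = [x for g in groups.values() for x in g]
    n, k = len(all_values), len(groups)
    grand_mean = sum(all_values) / n
    # Between-group sum of squares: group size times squared distance
    # of each group mean from the grand mean.
    ss_between = sum(
        len(g) * (sum(g) / len(g) - grand_mean) ** 2 for g in groups.values()
    )
    # Within-group sum of squares: squared distance of each value
    # from its own group mean.
    ss_within = sum(
        (x - sum(g) / len(g)) ** 2 for g in groups.values() for x in g
    )
    df_between, df_within = k - 1, n - k
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_stat


# Recompute F from the sums of squares printed in the table above.
reported_f = (5094 / 5) / (42677 / 296)
print(f"F from table: {reported_f:.3f}")

# Hypothetical yearly mean popularities per genre (not from the dataset).
groups = {
    "pop": [50.0, 52.0, 54.0],
    "rock": [40.0, 42.0, 44.0],
    "edm": [30.0, 32.0, 34.0],
}
df_between, df_within, f_stat = one_way_anova(groups)
print(f"toy example: df = ({df_between}, {df_within}), F = {f_stat:.1f}")
```

A large F means the genre averages differ by more than the year-to-year scatter within each genre would explain, which is what leads to rejecting the null hypothesis.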
The following graph provides more information about how the mean popularity of songs in each genre changes over time. As we can observe from the plot, the popularities of songs in the genres “latin”, “r&b”, “rap”, and “rock” follow a similar trend, decreasing from 1960 to 2010 and then increasing from 2010 to 2020. “Pop” songs are comparatively very popular when the genre first appears in the 1970s, and their popularity likewise decreases until 2010. Songs in the “edm” (Electronic Dance Music) genre show the most distinctive change in mean popularity. This makes intuitive sense since it is comparatively the youngest genre. The popularity of “edm” songs declines after the genre first appears around 1980, rises to a peak around 2000, declines again right after 2000, and then increases gradually from 2010 to 2020.
Therefore, it seems that the overall popularity of music in all genres shows an observable decreasing trend until 2010 and then starts to recover.
Our main takeaways from our questions are as follows: (1) There are audio features associated with different genres. For example, higher levels of loudness and energy correspond more to EDM. (2) We saw that featured artists (“feat”) appear frequently in track names regardless of a song’s popularity. We also saw that original songs are more prevalent among less popular songs. Rock and pop are very common in playlist names regardless of popularity. Playlists that contain a decade such as “2010s” in their name tend to have more popular songs. (3) There is a pattern between the time of an album’s release and the genre of the album. Out of all genres, “rock” has the steadiest number of releases over time, but the average popularity of its songs has been decreasing over time. Newer genres like rap and r&b have been increasing in popularity, with pop being consistently one of the most popular genres, as its name suggests.
In the future, it would be interesting to decompose each playlist genre into its subgenres. This way we could see whether a specific subgenre leads the genre in popularity or in its audio features. It would also be interesting to apply a machine learning method such as clustering to explore the complexities within genres. This would require a comprehensive analysis of its own.