PREPROCESS DATA (CHANGE VAR NAMES)

##  [1] "track_name"           "artist_name"          "artist_count"        
##  [4] "released_year"        "released_month"       "released_day"        
##  [7] "in_spotify_playlists" "in_spotify_charts"    "streams"             
## [10] "in_apple_playlists"   "in_apple_charts"      "in_deezer_playlists" 
## [13] "in_deezer_charts"     "in_shazam_charts"     "bpm"                 
## [16] "key"                  "mode"                 "danceability"        
## [19] "valence"              "energy"               "acousticness"        
## [22] "instrumentalness"     "liveness"             "speechiness"

Motivation

This report analyzes Spotify’s top tracks dataset to uncover patterns and insights about song characteristics, popularity, and trends. The aim is to explore which features drive high stream counts, identify differences in playlist preferences, and investigate broader trends in the music industry.

Dataset Description

The dataset consists of the following key variables:

-track_name: Name of the song.
-streams: Total streams for the song on Spotify.
-bpm, energy, danceability, valence, etc.: Audio features describing the song.
-Additional columns: Information on playlists (e.g., Spotify, Apple).

## The mean of bpm is: 122.5404
## The median of bpm is: 121
## The range of bpm is: 65 206
## The mean of energy is: 64.27912
## The median of energy is: 66
## The range of energy is: 9 97
## The mean of danceability is: 66.96957
## The median of danceability is: 69
## The range of danceability is: 23 96
## The mean of valence is: 51.43127
## The median of valence is: 51
## The range of valence is: 4 97

BPM: Beats per minute measures the tempo of each song and indicates how fast or slow it feels. The mean BPM of 122.54 is a typical tempo for popular, danceable, energetic music.

Energy: Energy describes how intense and active each song is; higher values correspond to louder, more energetic songs. The mean of 64.28 and median of 66 show that the dataset leans toward energetic songs.

Danceability: This variable describes how suitable a song is for dancing, based on its rhythm, beat, and overall musical feel. Higher scores mean a song is easier to dance to. The mean of 66.97 indicates that most songs in the dataset are fairly danceable.

Valence: Valence measures the positivity of a song; higher values indicate a happier, more positive sound. The mean of 51.43 and median of 51 sit near the middle of the observed range, so the dataset is roughly balanced between happier and sadder tracks.
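The summary lines printed above can be reproduced with a short loop. The chunk that produced them is not shown in the report, so the following is only a minimal sketch: the `toy` data frame and its values are made up for illustration, and only the column names are taken from the dataset.

```r
# Toy stand-in for spotify_data (values are illustrative, not from the dataset)
toy <- data.frame(
  bpm          = c(90, 120, 125, 170),
  energy       = c(40, 66, 70, 92),
  danceability = c(55, 69, 72, 80),
  valence      = c(20, 51, 60, 88)
)

features <- c("bpm", "energy", "danceability", "valence")

# Collect mean, median, and range for each feature, ignoring missing values
feature_summary <- lapply(features, function(f) {
  x <- toy[[f]]
  list(
    mean   = mean(x, na.rm = TRUE),
    median = median(x, na.rm = TRUE),
    range  = range(x, na.rm = TRUE)
  )
})
names(feature_summary) <- features

# Print in the same style as the report output
for (f in features) {
  s <- feature_summary[[f]]
  cat("The mean of", f, "is:", s$mean, "\n")
  cat("The median of", f, "is:", s$median, "\n")
  cat("The range of", f, "is:", s$range, "\n")
}
```

Replacing `toy` with the real data frame should give the figures reported above, provided the columns are numeric.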

Research Questions

  1. What are the characteristics of the most-streamed songs on Spotify?
  2. Are there patterns in song features (e.g., danceability, energy) based on release year or playlists?
  3. How do song features compare across popular playlists (e.g. Spotify vs. Apple)?

Graphs and Analyses

Graph 1

This histogram highlights the skewed distribution of Spotify streams, where most tracks accumulate fewer than 1 billion streams, while only a few surpass this number. The sharp decline after the first few bins indicates that streaming success is highly concentrated among a small number of songs. This visualization ties directly to our research question about identifying characteristics of the most streamed songs by showcasing the dominance of a limited subset in the streaming landscape. Understanding this disparity allows us to explore what differentiates these high performing tracks, whether through musical attributes, artist recognition, or playlist inclusion. The graph provides a quantitative foundation for investigating patterns that drive massive streaming success.
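The concentration described above can also be quantified directly. The following is a minimal sketch on simulated log-normal data (an assumption; the real `streams` column is not used here). It computes the share of tracks under 1 billion streams and the share of all streams held by the top 10% of tracks, then draws a histogram with ggplot2. The names `sim`, `share_under_1b`, and `top10_share` are illustrative.

```r
library(ggplot2)

set.seed(1)
# Simulated right-skewed stream counts (log-normal); not the real data
sim <- data.frame(streams = rlnorm(950, meanlog = log(3e8), sdlog = 1))

# Share of tracks below 1 billion streams
share_under_1b <- mean(sim$streams < 1e9)

# Share of all streams captured by the top 10% of tracks
cutoff <- quantile(sim$streams, 0.9)
top10_share <- sum(sim$streams[sim$streams >= cutoff]) / sum(sim$streams)

cat("Share of tracks under 1B streams:", round(share_under_1b, 3), "\n")
cat("Share of streams held by top 10%:", round(top10_share, 3), "\n")

# Histogram with the x axis labeled in billions
p <- ggplot(sim, aes(x = streams)) +
  geom_histogram(bins = 30, fill = "steelblue", color = "white") +
  scale_x_continuous(labels = scales::label_number(scale = 1e-9, suffix = "B")) +
  labs(
    title = "Distribution of Streams (simulated)",
    x = "Streams",
    y = "Number of Tracks"
  ) +
  theme_minimal()
print(p)
```

A top-10% share well above 10% is one simple way to express how concentrated streaming success is.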

Graph 2

This hexbin plot shows the density of Spotify tracks based on their danceability and energy levels. The highest density of tracks lies in the mid to high range of both danceability (around 60-80) and energy (around 60-80), indicating that most popular tracks fall within these ranges. This supports the research question about identifying features of high-performing songs, as it suggests a sweet spot for these characteristics that aligns with listener preferences. Tracks with lower danceability and energy are sparse, reflecting less representation in Spotify’s top tracks. This visualization helps identify key musical traits that may contribute to a song’s streaming success.
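A hexbin plot of this kind can be sketched as follows. This is an illustration on simulated scores rather than the report's actual chunk, which is not shown. It assumes the hexbin package is installed, since `geom_hex()` depends on it; `sim` and `in_band` are hypothetical names.

```r
library(ggplot2)  # geom_hex() also needs the hexbin package installed

set.seed(2)
# Simulated feature scores clipped to a 0-100 scale; not the real data
sim <- data.frame(
  danceability = pmin(pmax(rnorm(950, 67, 14), 0), 100),
  energy       = pmin(pmax(rnorm(950, 64, 16), 0), 100)
)

# Fraction of tracks inside the 60-80 band on both axes
in_band <- mean(
  sim$danceability >= 60 & sim$danceability <= 80 &
    sim$energy >= 60 & sim$energy <= 80
)
cat("Share of tracks in the 60-80 band on both axes:", round(in_band, 3), "\n")

# Hexagonal bins; fill encodes the number of tracks per hexagon
p <- ggplot(sim, aes(x = danceability, y = energy)) +
  geom_hex(bins = 20) +
  scale_fill_viridis_c(name = "Tracks") +
  labs(
    title = "Danceability vs. Energy (simulated)",
    x = "Danceability",
    y = "Energy"
  ) +
  theme_minimal()
print(p)
```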

Graph 3

In this correlation heatmap, musical features such as bpm, danceability, and energy are moderately correlated, suggesting some interplay in how these attributes shape a track’s character. Conversely, speechiness and acousticness show weak correlations with the other features, as indicated by the neutral tones, implying that these characteristics may not heavily influence playlist or chart inclusion. Overall, the heatmap reveals distinct clusters of interrelated variables, especially among the performance metrics, while highlighting weak or non-existent relationships among others, providing an overview of the dataset’s structure.
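A correlation heatmap like the one discussed here is typically built from a correlation matrix reshaped to long form. The sketch below shows that general technique on simulated features, where `energy` is deliberately constructed to correlate with `danceability`. It is not the report's own code, and the color scale is an assumption.

```r
library(ggplot2)

set.seed(3)
n <- 500
# Simulated features; energy is built to correlate with danceability
danceability <- rnorm(n, 67, 14)
sim <- data.frame(
  danceability = danceability,
  energy       = 0.5 * danceability + rnorm(n, 30, 14),
  acousticness = rnorm(n, 27, 25)
)

# Pairwise Pearson correlations, dropping rows with missing values
cor_mat <- cor(sim, use = "complete.obs")

# Reshape the matrix to long form: one row per pair of features
cor_long <- as.data.frame(as.table(cor_mat))
names(cor_long) <- c("feature_x", "feature_y", "r")

# Diverging palette: white (neutral) marks correlations near zero
p <- ggplot(cor_long, aes(x = feature_x, y = feature_y, fill = r)) +
  geom_tile() +
  scale_fill_gradient2(low = "blue", mid = "white", high = "red", limits = c(-1, 1)) +
  labs(title = "Correlation Heatmap (simulated)", x = NULL, y = NULL) +
  theme_minimal()
print(p)
```

With a diverging palette centered at zero, neutral tiles correspond to weak correlations, which is how the speechiness and acousticness cells are read above.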

Graph 4

## 
## Call:
## lm(formula = streams ~ danceability, data = spotify_data)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -623567357 -364752125 -206037641  163047364 3120365195 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  787814670   85700604   9.193  < 2e-16 ***
## danceability  -4085696    1249974  -3.269  0.00112 ** 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 5.64e+08 on 950 degrees of freedom
##   (1 observation deleted due to missingness)
## Multiple R-squared:  0.01112,    Adjusted R-squared:  0.01008 
## F-statistic: 10.68 on 1 and 950 DF,  p-value: 0.001119

The red regression line indicates a weak negative correlation between danceability and streams, as the line slopes slightly downward. This suggests that as danceability increases, there is a slight tendency for streams to decrease, though the trend is not strong or consistent. The majority of data points are clustered around lower streaming values, regardless of danceability, indicating that most tracks do not achieve high streaming numbers. Outliers with exceptionally high streams are present but do not appear to be significantly influenced by danceability. Overall, this plot suggests that danceability alone is not a strong predictor of a track’s streaming success, and other factors likely play a more significant role.

The residuals, which measure the difference between observed and predicted values, range from -623,567,357 to 3,120,365,195, indicating substantial variability around the model’s predictions. The intercept of 787,814,670 is the predicted number of streams when danceability is zero, with a highly significant p-value (< 2e-16); since observed danceability ranges from 23 to 96, this value is an extrapolation. The coefficient for danceability is -4,085,696, meaning each one-unit increase in danceability is associated with about 4.09 million fewer predicted streams, a statistically significant relationship (p = 0.00112). However, the model’s explanatory power is very low: the R-squared of 0.01112 indicates that danceability explains only about 1.1% of the variation in streams. Despite the significant F-statistic of 10.68 (p = 0.00112), the residual standard error of 564 million and the low R-squared indicate a poor fit. Danceability has a statistically significant association with streams, but it is a weak predictor.
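To make the size of the effect concrete, the fitted line can be evaluated by hand. The sketch below copies the coefficients and R-squared from the `lm()` summary above and the danceability range and mean from the summary statistics earlier in the report. The helper `predict_streams` is a hypothetical name introduced for illustration.

```r
# Coefficients copied from the lm() summary in the report
intercept <- 787814670
slope     <- -4085696
r_squared <- 0.01112

# Predicted streams at a given danceability score (fitted line only)
predict_streams <- function(danceability) intercept + slope * danceability

pred_mean <- predict_streams(66.96957)  # at the reported mean danceability
pred_low  <- predict_streams(23)        # lowest observed danceability
pred_high <- predict_streams(96)        # highest observed danceability

# In simple linear regression, |r| is the square root of R-squared
r_abs <- sqrt(r_squared)

cat("Predicted streams at mean danceability:",
    format(round(pred_mean), big.mark = ","), "\n")
cat("Predicted change across the observed range:",
    format(round(pred_high - pred_low), big.mark = ","), "\n")
cat("Implied |r|:", round(r_abs, 3), "\n")
```

The predicted change across the whole observed danceability range is small compared with the residual standard error of 564 million, which is another way of seeing why the fit is poor.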

Graph 5

The scatter plot shows patterns in song features by release year. Newer songs, shown in lighter blue, are spread widely across the PCA axes, indicating greater diversity. Older songs, shown in dark blue, cluster near the center, which suggests that they share more similar characteristics than newer songs do. This might reflect simpler music production styles in earlier decades. As production has evolved, features such as danceability and energy have changed, and newer songs embrace a broader range of styles.
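The idea that one group of songs is "more spread out" on the PCA axes can be checked numerically. The following sketch uses simulated data in which the newer group is given a larger spread on purpose, so it only illustrates the technique. The `era` grouping and the distance-from-origin summary are assumptions, not the method used for the plot above.

```r
set.seed(4)
n <- 300
# Simulated songs: the "newer" group is drawn with a larger spread on purpose
era <- rep(c("older", "newer"), each = n / 2)
spread <- ifelse(era == "older", 8, 16)
sim <- data.frame(
  danceability = rnorm(n, 67, spread),
  energy       = rnorm(n, 64, spread),
  valence      = rnorm(n, 51, spread),
  bpm          = rnorm(n, 122, spread * 1.5)
)

# PCA on centered, standardized features
pca <- prcomp(sim, center = TRUE, scale. = TRUE)

# Keep the first two principal component scores for each song
scores <- as.data.frame(pca$x[, 1:2])
scores$era <- era

# Distance of each song from the PCA origin, averaged per era
scores$dist <- sqrt(scores$PC1^2 + scores$PC2^2)
spread_by_era <- tapply(scores$dist, scores$era, mean)
print(spread_by_era)
```

A larger average distance for one group corresponds to the wider scatter seen visually in a PCA plot.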

Graph 6

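library(dplyr)
library(ggplot2)

# Convert to numeric (non-numeric entries become NA) and keep songs released from 2000 onward.
# Note: this overwrites spotify_data, so later graphs also use the filtered rows.
spotify_data <- spotify_data %>%
  mutate(
    streams = as.numeric(streams),
    released_year = as.numeric(released_year)
  ) %>%
  filter(!is.na(streams), !is.na(released_year), released_year >= 2000)

# Average streams per release year
avg_streams_by_year <- spotify_data %>%
  group_by(released_year) %>%
  summarise(avg_streams = mean(streams, na.rm = TRUE))

# Line chart with one point per year; y-axis labels are shown in millions
ggplot(avg_streams_by_year, aes(x = released_year, y = avg_streams)) +
  geom_line(color = "blue", linewidth = 1) +  # linewidth replaces size for lines in ggplot2 >= 3.4.0
  geom_point(color = "red", size = 2) +
  scale_y_continuous(labels = scales::label_number(scale = 1e-6, suffix = "M")) +
  labs(
    title = "Average Streams by Release Year",
    x = "Release Year",
    y = "Average Streams (Millions)"
  ) +
  theme_minimal()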

The graph shows a fluctuating trend in average streams by release year, with a notable spike in 2003 followed by a sharp drop in 2005, suggesting the influence of particularly popular tracks or inconsistencies in the dataset. From 2006 to 2015, the trend stabilizes, with moderate peaks in years like 2009 and 2014, reflecting the impact of widely streamed songs. However, from 2018 onward, there is a steep decline in streams, likely due to newer songs having less time to accumulate streams. Overall, the graph highlights the importance of longevity and specific standout years in shaping average streaming success.

Graph 7

This heat map of danceability versus BPM, shaded by streams, helps answer our first research question about the characteristics of the most-streamed songs. The highest-streaming songs fall in the 60 to 80 range for danceability and have bpm values between 100 and 150. The squares are darkest at moderate danceability and mid-range tempo, suggesting that the most popular songs belong here. Because the most popular songs also sit toward the higher end of the danceability axis, the graph indicates that there is a sweet spot for creating the most popular songs.
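One way to build a heat map of this kind is to bin both features and summarize streams within each cell. The report does not show whether the original fill is a mean, a total, or a count, so the sketch below assumes the mean. It uses simulated data, and the bin widths are arbitrary choices.

```r
library(ggplot2)

set.seed(5)
# Simulated tracks; not the real data
sim <- data.frame(
  danceability = runif(900, 20, 100),
  bpm          = runif(900, 60, 210),
  streams      = rlnorm(900, meanlog = log(3e8), sdlog = 1)
)

# Bin both features (bin widths are arbitrary choices for this sketch)
sim$dance_bin <- cut(sim$danceability, breaks = seq(20, 100, by = 10), include.lowest = TRUE)
sim$bpm_bin   <- cut(sim$bpm, breaks = seq(60, 210, by = 25), include.lowest = TRUE)

# Average streams within each (danceability bin, bpm bin) cell
cells <- aggregate(streams ~ dance_bin + bpm_bin, data = sim, FUN = mean)

# Darker tiles mark cells with higher average streams
p <- ggplot(cells, aes(x = dance_bin, y = bpm_bin, fill = streams)) +
  geom_tile() +
  scale_fill_gradient(
    low = "lightyellow", high = "darkred",
    labels = scales::label_number(scale = 1e-6, suffix = "M")
  ) +
  labs(
    title = "Average Streams by Danceability and BPM (simulated)",
    x = "Danceability",
    y = "BPM",
    fill = "Avg. Streams"
  ) +
  theme_minimal()
print(p)
```

Cells with few tracks can look dark because of a single hit, so checking the per-cell counts alongside the averages is worthwhile when reading such a plot.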

Graph 8

This PCA plot explores how audio features relate to a song’s popularity, measured by streams. Songs are projected onto two principal components, with the color gradient indicating streams—high-stream songs (red) align with features like “energy” and “danceability,” while low-stream songs (blue) are associated with “acousticness” and “instrumentalness.” By reducing dimensionality, this plot reveals patterns between features and popularity, providing a clear visualization of which attributes might influence a song’s success.

Graph 9

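library(dplyr)
library(ggplot2)
library(reshape2)

# Tag songs by the playlists they appear in. A song that appears on both
# platforms is included in both groups, so the two groups overlap.
playlistspot <- spotify_data %>% filter(in_spotify_playlists > 0) %>% mutate(playlist = "Spotify")
playlistapple <- spotify_data %>% filter(in_apple_playlists > 0) %>% mutate(playlist = "Apple")

combo <- bind_rows(playlistspot, playlistapple)

features <- c("danceability", "energy", "valence", "bpm")

# Reshape to long format: one row per song, playlist, and feature
ans <- melt(combo, id.vars = "playlist", measure.vars = features)

# Side-by-side box plots of each feature, split by playlist
ggplot(ans, aes(x = variable, y = value, fill = playlist)) +
  geom_boxplot() +
  labs(
    title = "Song Features in Spotify vs. Apple Playlists",
    x = "Features",
    y = "Value",
    fill = "Playlist"
  )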

The box plot comparing song features between Spotify and Apple playlists reveals only small differences. The danceability scores are extremely close, with both medians around 65-70. Spotify has a wider distribution of energy, while the Apple songs show a higher level of energy. For valence, the songs in each playlist are extremely similar. For BPM, both playlists have similar tempos, but the distribution is much broader in Apple playlists: they may include more tracks with extreme tempos, while Spotify playlists seem more consistent in tempo range. Overall, the graph shows that the features are extremely similar across both apps.
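Because a song can appear in both playlist groups, it helps to check how much the groups overlap before reading much into the similarity of the box plots. The sketch below does this on a made-up four-row data frame. Only the column names come from the dataset; the values and the names `toy`, `overlap`, and `medians` are illustrative.

```r
# Toy data (illustrative values only); column names follow the dataset
toy <- data.frame(
  in_spotify_playlists = c(100, 50, 0, 20),
  in_apple_playlists   = c(5, 0, 3, 8),
  danceability         = c(60, 70, 80, 65),
  energy               = c(50, 70, 90, 60)
)
features <- c("danceability", "energy")

# Same grouping rule as the box plot: a positive playlist count on each platform
spot  <- toy[toy$in_spotify_playlists > 0, ]
apple <- toy[toy$in_apple_playlists > 0, ]

# Number of songs that fall into both groups
overlap <- sum(toy$in_spotify_playlists > 0 & toy$in_apple_playlists > 0)

# Median of each feature per group, as a small comparison table
medians <- rbind(
  Spotify = sapply(spot[features], median),
  Apple   = sapply(apple[features], median)
)

cat("Songs in both groups:", overlap, "of", nrow(toy), "\n")
print(medians)
```

If most songs appear in both groups in the real data, near-identical distributions are expected, and the comparison says little about differences in curation.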

Conclusions and Takeaways

Research Questions

What are the characteristics of the most-streamed songs on Spotify?

The most streamed songs tend to balance high energy with moderate danceability, typically falling in the 60–80% range for both attributes. Heat maps and scatterplots show that tracks with bpm values between 100 and 150 are especially prevalent in the top performing songs. Songs that exhibit a blend of accessibility and rhythm, without being too extreme on either end of the spectrum, appear to resonate most with listeners. Acousticness and instrumentalness are less correlated with high amounts of streams, indicating that tracks with strong vocals and engaging beats are preferred. The distribution of streams is highly skewed, with a small subset of songs achieving extraordinary success while the majority stay under 1 billion streams. This suggests that a combination of rhythmic features and marketing success likely drives a song’s performance. Attributes like valence (positivity) seem to play a secondary role compared to energy and rhythm.

Are there patterns in song features (e.g., danceability, energy) based on release year or playlists?

The PCA reveals that newer songs display greater diversity in audio features, spreading more widely across the principal components, while older tracks cluster more tightly. This suggests evolving production techniques and a broader range of styles in modern music compared to the simpler structures of earlier songs. Tracks in popular playlists like Spotify’s tend to share common traits, including high danceability and energy levels, reflecting a consistency in curatorial preferences. While streaming platforms favor tracks with broad appeal, modern playlists increasingly showcase experimental styles, expanding listener exposure. Release year trends also show shifts toward higher energy and more complex compositions over time. These findings highlight how playlist curation and evolving listener preferences drive changes in music production.

How do song features compare across popular playlists (e.g. Spotify vs. Apple)?

Songs included in Spotify and Apple playlists generally share high energy and danceability scores, aligning with listener demand for rhythmic and engaging tracks. However, Spotify playlists tend to favor tracks with slightly higher bpm and energy levels, reflecting their appeal to a more mainstream and global audience. Apple playlists, while broadly similar, may include more varied tracks, reflecting different curation strategies. Across platforms, playlist presence correlates strongly with higher streaming numbers, highlighting the role of curation in a song’s success. Visualizations show a strong clustering of high-stream tracks within playlists, suggesting that inclusion is a key driver of visibility and reach. Although specific stylistic differences exist, the overlap in characteristics points to shared listener preferences across major platforms.

Highlight the most surprising or interesting insights

One of the most surprising insights of the entire project is the weak negative correlation between danceability and streams, shown in Graph 4. Our entire group assumed that the top streaming charts would be filled with highly danceable songs. The data show that streams do not increase as danceability increases; if anything, they decrease slightly. Additionally, the correlation heat map shows a moderately strong relationship among BPM, energy, valence, and danceability. This makes sense, as these variables work together to define a song. It was surprising that speechiness and acousticness showed little to no correlation with the other variables. Speechiness and acousticness typically account for much of a song’s lyrical character, so the data could be suggesting that lyrics may not play a major role in a song’s popularity.

Limitations of the analysis

Every analysis has limitations. One major limitation of this dataset and analysis is that it includes only a handful of audio features, such as bpm and energy. While these features are important to a song’s production, the dataset does not include other influential factors such as genre or lyrics. Many people would argue that a song’s lyrics are as important as its audio features, or even more so. Beyond the limited features, other elements such as cultural or seasonal influence are not captured; Christmas songs, for example, may be popular because of the holiday season. External factors such as marketing, social media influence, and the popularity of the original artist are also not accounted for. The lack of listener data is limiting as well: demographics such as age, location, or musical interests could provide valuable context for understanding why certain songs achieve high streaming counts.

Suggest directions for further exploration of this subject

This analysis highlights key patterns in Spotify’s top tracks but leaves room for deeper exploration. The focus on most-streamed songs may overlook insights from less popular or niche tracks, limiting the generalizability of findings. The analysis also does not account for regional or cultural variations in musical preferences, which may significantly influence streaming trends. Additionally, the observed decline in streams for recent tracks suggests a potential bias due to limited time for newer releases to accumulate streams, warranting further investigation with updated datasets.

Future work could examine how playlist algorithms shape streaming success and whether they disproportionately favor specific artists or genres. Incorporating listener demographic data, such as age or region, could uncover more detailed patterns of musical preference. Expanding the analysis to include lyrics, genres, and collaborations could provide a richer understanding of factors driving popularity. Finally, predictive modeling could identify the attributes most likely to lead to streaming success, offering actionable insights for artists and industry professionals.