Abstract

A university fight song is a musical composition that serves as a symbol of school spirit and pride for a college or university. These songs are typically played at athletic events, pep rallies, and other school-related functions, and are often associated with a particular sports team or tradition.

Today, fight songs are an integral part of the college experience, and are often considered to be one of the most important traditions of a school. They provide a sense of unity and identity for students, alumni, and fans alike, and are a powerful reminder of the passion and pride that people feel for their alma mater.

For researchers, studying university fight songs can provide valuable insights into the history and culture of a school, as well as the broader social and musical trends of the time. By examining the dates, lyrics, and tempo (beats per minute, bpm) of these songs, we can gain a deeper understanding of what makes them so meaningful and enduring, and how they continue to shape the identity and spirit of a university community.

Data Description and Overview

University fight songs typically incorporate themes related to school spirit, pride, and loyalty. They often reference the school’s history, traditions, and sports teams, as well as the values and ideals that the school represents. In this data set, most fight songs contain specific words, such as victory and fight.

The origins of the fight songs in this data set vary widely: some date back over a century, while others were created more recently. Many of the earliest fight songs were composed in the late 19th and early 20th centuries, when college football was becoming increasingly popular in the United States. The melodies often contain repetitive phrases and simple chord progressions that are easy to sing and remember, making them ideal for use in a stadium or other large group setting.

The college fight songs dataset contains 65 rows, each describing the fight song of one college. There are 23 variables in total. The variables we are interested in record whether the song is official, whether it was chosen as the result of a contest, and other details such as its tempo and the specific words it contains (the full variable specification is in the appendix).

Whether a university fight song is designated as the official school song, or was chosen through a contest, can depend on a variety of factors. In this report, we examine the characteristics of such songs and how their lyrics relate to these designations.

Research Question 1: What Factors Determine If The Fight Songs Are Chosen As Official Fight Songs?

Universities often have multiple fight songs that are associated with different sports teams or traditions, but not all of these songs may be considered official fight songs.

A variety of factors may be considered when deciding whether to designate a particular fight song as an official school fight song, including the song’s lyrics and musical composition, its performance traditions, and its role in university events and traditions.

In this data set, we can first look at the most popular themes of fight songs by making word clouds of their song names.

Two main topics emerge from the word clouds: “fight” and “song”. Other themes include “victor”, “win”, and “glori”. The largest cluster of words centers on “fight”, which is often used in fight songs to represent the determination and perseverance of the school’s athletes and fans. It conveys a sense of battling against adversity and overcoming obstacles to achieve victory. This theme is often reflected in the lyrics of the song, which may encourage the team to “fight on” or “fight for the win”.
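
A word cloud is driven by word frequencies, so the counting step can be sketched in a few lines. The report's figures were produced in R; the Python below is only an illustration, not the authors' code. The titles are hypothetical, and the stemming that appears to have produced forms like “victor” and “glori” is not reproduced here.

```python
import re
from collections import Counter

def word_counts(titles):
    """Count lowercase word tokens across a list of song titles."""
    counts = Counter()
    for title in titles:
        # Keep runs of letters and apostrophes; drop punctuation and digits.
        counts.update(re.findall(r"[a-z']+", title.lower()))
    return counts

# Hypothetical titles for illustration only, not rows from the dataset.
titles = ["Fight On", "Victory March", "Fight Song", "Glory Song"]
counts = word_counts(titles)
print(counts.most_common(3))
```

The resulting counts are what a word-cloud routine would use to size each word.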

Because the word “fight” is so common among song names, it is worth investigating whether the words in a song’s lyrics are related to its being an official song.

As mentioned above, “fight”, “win”, and “victor” appear frequently in the song names. To study their role in the lyrics, we use faceted bar charts to examine whether these key words are associated with a song being an official fight song.

In the bar chart above, the bars for official and non-official fight songs are split into four panels according to whether the lyrics include the word “victory” and whether they include “win” (or “won”). Within each panel, the bars are colored by whether the lyrics include the word “fight”.

In the graph above, a large percentage of official fight songs and a small percentage of non-official songs contain the word “fight” in their lyrics. Among the official songs, most contain both “victory” and “win” (or “won”), and some contain “win” (or “won”) but not “victory”, which may suggest that official fight songs favor more colloquial wording that is easier to remember.

Beyond lyrics, the beats per minute (bpm) and duration of a song can provide insight into the musical and cultural context in which the song was created and help us understand its significance as a cultural artifact. They may therefore also be related to whether a fight song is classified as an official fight song.

The dot plot below shows ‘bpm’ on the x-axis and duration on the y-axis, with the average of each variable drawn as a line. Each dot represents one song in the data set, colored according to whether it is an official fight song.

The plot shows two clusters of official songs: one with bpm around 150 and duration around 70 seconds, and another with bpm around 70 and varied durations. All of the non-official fight songs lie far from the average lines of both duration and bpm. Some of the non-official fight songs have a bpm of around 120, which falls between the two clusters, and one has a very long duration and can be seen as an outlier.

Compared with duration, ‘bpm’ appears to be the more informative factor in distinguishing official fight songs from non-official ones.

Research Question 2: What Factors Determine If The Fight Songs Are Chosen As The Result Of A Contest?

After exploring the difference between official and non-official fight songs as recognized by the school, another characteristic worth studying is the potential difference between songs that were chosen as the result of a contest and songs that were not.

Intuitively, the words that tend to appear in fight songs relate to victory, call out opponents, or energize the school’s own players by singing about the school. Therefore, the total number of tropes such as “fight”, “win” or “won”, and mentions of opponents might differ between songs that result from a contest and songs that do not.

First, we look directly at the distribution of the number of tropes for contest and non-contest songs with a density ridgeline plot.

Based on the density curves, the distributions for contest and non-contest songs both appear to be bimodal. The spread for non-contest songs appears larger than that for contest songs. The center of the distribution of the number of tropes for contest songs lies to the right of the center for non-contest songs, suggesting that the two groups might have different means. To check whether the difference is significant, we perform a t-test. Our null hypothesis is that the mean number of tropes is the same for songs chosen through a contest and songs that were not, and the alternative hypothesis is that the two means differ. The results of the test are as follows:

## 
##  Welch Two Sample t-test
## 
## data:  contestY and contestN
## t = 2.4565, df = 13.271, p-value = 0.02852
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  0.1568592 2.4067771
## sample estimates:
## mean of x mean of y 
##  4.700000  3.418182

The p-value is 0.02852, which is smaller than 0.05, so we reject the null hypothesis that the mean number of tropes is the same for contest and non-contest songs. The data therefore provide evidence that the two means differ, with contest songs averaging 4.70 tropes against 3.42 for non-contest songs. Next, we study in more detail how the appearance of individual words differs between the two groups.
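
For readers who want to see what the Welch test computes, here is a minimal sketch of the statistic and the Welch-Satterthwaite degrees of freedom. The output above comes from R; this plain-Python version is only an illustration and not the authors' code. The function name `welch_t` and both samples are hypothetical, so the numbers it prints will not match the output above.

```python
import math
from statistics import mean, variance

def welch_t(x, y):
    """Return (t, df) for Welch's unequal-variance two-sample t-test."""
    nx, ny = len(x), len(y)
    # Squared standard error of each sample mean (sample variance / n).
    vx, vy = variance(x) / nx, variance(y) / ny
    t = (mean(x) - mean(y)) / math.sqrt(vx + vy)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (vx + vy) ** 2 / (vx ** 2 / (nx - 1) + vy ** 2 / (ny - 1))
    return t, df

# Hypothetical trope counts, NOT the values from the fight-song dataset.
contest_yes = [5, 4, 6, 5, 3, 5]
contest_no = [3, 4, 2, 5, 3, 4, 3, 2]
t_stat, df = welch_t(contest_yes, contest_no)
print(f"t = {t_stat:.4f}, df = {df:.3f}")
```

Because the two group variances are not pooled, the degrees of freedom are generally not an integer, which is why the output above reports df = 13.271.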

To be more specific, we ran a t-test on the number of times the word “fight” appears in songs chosen as the result of a contest versus those that were not. The p-value is 0.7745, which is larger than 0.05, so we fail to reject the null hypothesis: there is no evidence that the mean number of “fight” differs between the two groups (result not shown). The tropes counted in ‘trope_count’ include “win-won”, “opponent”, “men”, and similar words, so we can further ask whether the appearance of these words depends on whether the song was chosen as the result of a contest. The variables recording the appearance of words like “win-won”, “victory”, and “opponent” are categorical in this dataset, so we cannot use a t-test to compare means; instead, we perform a chi-squared test for independence between ‘contest’ and each of ‘victory_win_won’, ‘men’, and ‘opponents’ separately. Our null hypothesis is that the appearance of the word and whether the song was chosen as the result of a contest are independent, and the alternative is that there is some dependence between the two variables. All three chi-squared tests have a p-value greater than 0.05, so we find no evidence of dependence in any of the three cases (not shown).
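
The chi-squared test for a 2x2 table can be sketched as follows. This is an illustration in plain Python, not the authors' R code; the function name and the table counts are hypothetical. It computes the uncorrected Pearson statistic, whereas R's `chisq.test` applies a continuity correction to 2x2 tables by default, so its results would differ slightly.

```python
import math

def chi2_independence_2x2(table):
    """Pearson chi-squared test (no continuity correction) for a 2x2 table."""
    row = [sum(r) for r in table]
    col = [sum(c) for c in zip(*table)]
    n = sum(row)
    stat = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of rows and columns.
            expected = row[i] * col[j] / n
            stat += (table[i][j] - expected) ** 2 / expected
    # With 1 degree of freedom the upper tail is erfc(sqrt(stat / 2)).
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Hypothetical counts: rows = contest yes/no, columns = word present/absent.
table = [[6, 4], [30, 25]]
stat, p_value = chi2_independence_2x2(table)
print(f"X-squared = {stat:.4f}, p-value = {p_value:.4f}")
```

When one group is small, some expected counts can fall below 5, in which case the chi-squared approximation is rough and an exact test may be a safer choice.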

Research Question 3: How Do Fight Songs Distribute In The Higher Dimensional Space?

In the sections above, we explored which factors might be related to whether a particular fight song is an official song or was chosen as the result of a contest. Since some meaningful associations were indicated, it is natural to ask whether there are clusters of fight songs that share common characteristics, and whether some of these features are closely correlated with each other.

In this section, we discuss clusters of fight songs and the potential correlations between quantitative variables in the higher-dimensional space.

Dendrogram With Complete Linkage Clustering

We use all quantitative variables in our dataset (‘bpm’, ‘sec_duration’, ‘number_fights’, ‘trope_count’) except for ‘year’. Many rows have the value “Unknown” for ‘year’, and removing them would cause a large loss of data, which is undesirable given that the original sample size is already small. Before conducting the higher-dimensional analysis, we first standardize the columns of the selected data. Below are the first few rows of the standardized dataset.

##           bpm sec_duration number_fights trope_count
## [1,] 4.584848     2.554277     0.3094224    3.583840
## [2,] 2.292424     3.951147     1.2376897    2.986534
## [3,] 4.675339     2.195082     1.5471121    2.389227
## [4,] 4.132396     2.474456     0.0000000    1.791920
## [5,] 2.413078     2.674009     1.8565345    1.791920
## [6,] 4.615012     1.476691     0.0000000    1.194613
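
Column standardization can be sketched as below, in plain Python for illustration only (the report itself was produced in R). The values printed above are all non-negative and zeros remain zero, which is what one would see if columns were divided by a scale factor without subtracting the mean; that reading is an inference from the output, not something the report states. The function name, the `center` switch, and the rows are hypothetical.

```python
from statistics import mean, stdev

def standardize(rows, center=True):
    """Scale each column by its sample standard deviation.

    With center=True the column mean is subtracted first (z-scores);
    with center=False values are only divided by the standard deviation.
    """
    cols = list(zip(*rows))
    mus = [mean(c) if center else 0.0 for c in cols]
    sds = [stdev(c) for c in cols]
    return [[(v - m) / s for v, m, s in zip(r, mus, sds)] for r in rows]

# Hypothetical rows: (bpm, sec_duration, number_fights, trope_count).
rows = [(150, 60, 5, 4), (75, 100, 4, 5), (160, 55, 0, 3), (120, 70, 2, 2)]
z_scores = standardize(rows)
scaled_only = standardize(rows, center=False)
print(scaled_only[0])
```

Either variant puts the four variables on comparable scales, so that a variable with large raw values such as bpm does not dominate the Euclidean distances used later.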

After standardizing the data, the next step is to compute the distance matrix needed for clustering. Using this distance matrix, we build a dendrogram that groups the fight songs based on the given variables. Here we apply complete linkage instead of single linkage, as complete linkage tends to produce compact, well-separated clusters and is less prone to the chaining effect that single linkage can exhibit.

Above is the dendrogram, with the tree colored by the clusters it is cut into (here we set the number of clusters to 2). Let’s now explore how well the two clusters align with the categorical variables in our dataset, particularly ‘official_song’ and ‘contest’, since they are central features of our data and each has two categories.

Here, blue represents the fight songs that are formally recognized as official songs and red represents the rest. The clusters from our complete-linkage unsupervised clustering generally align with the variable ‘official_song’. Specifically, both the dendrogram and the clustering indicate two major clusters, which correspond roughly to official and non-official songs. The majority of blue (official songs) is located in the right portion of the cluster, which corresponds to the green section of the dendrogram, while the majority of red (non-official songs) is located in the left portion, which corresponds to the red section of the dendrogram. Thus, the clustering appears to be consistent with ‘official_song’.

Another important categorical variable (also discussed in the previous research question) is ‘contest’. We can do the same analysis here to explore how the clusters relate to whether a particular song was chosen as the result of a contest.

In this case, red (fight songs that are not the result of any contest) occupies the majority of the cluster, which corresponds to the green section of the dendrogram. The rest is blue (fight songs that are the result of a contest), which corresponds to the red section of the dendrogram. Thus, the clustering again appears to be consistent with this variable.
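
The idea of complete linkage can be sketched in plain Python as follows. This is a deliberately simple, unoptimized illustration and not the authors' R code; the function name and the toy points are hypothetical. It merges clusters until `k` remain, which corresponds to cutting the dendrogram into `k` groups.

```python
import math
from itertools import combinations

def complete_linkage(points, k):
    """Merge clusters agglomeratively until k remain; returns lists of indices."""
    clusters = [[i] for i in range(len(points))]

    def cluster_distance(a, b):
        # Complete linkage: the LARGEST pairwise Euclidean distance.
        return max(math.dist(points[i], points[j]) for i in a for j in b)

    while len(clusters) > k:
        # Pick the pair of clusters whose complete-linkage distance is smallest.
        a, b = min(
            combinations(range(len(clusters)), 2),
            key=lambda ab: cluster_distance(clusters[ab[0]], clusters[ab[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]
        del clusters[b]  # a < b, so index a is unaffected by the deletion
    return clusters

# Hypothetical 2-D points forming two well-separated groups.
points = [(0, 0), (0.5, 0.2), (0.1, 0.6), (5, 5), (5.4, 5.1), (4.9, 5.6)]
groups = complete_linkage(points, k=2)
print(groups)
```

Replacing `max` with `min` inside `cluster_distance` would give single linkage, which can join clusters through a chain of close neighbors.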

Pairs Plot Over Quantitative Variables

After the clustering analysis, we explore how our selected quantitative variables are potentially correlated. To do so, we first compute Euclidean distances and plot the first two dimensions from multi-dimensional scaling (MDS) on a scatterplot.

On their own, the MDS dimensions are not easy to interpret, so it is useful to see how they relate to the original data. Let’s focus on the first dimension returned by the MDS (denoted as \(MDS_1\)). We start by fitting a linear regression with \(MDS_1\) as the outcome and the quantitative variables in the dataset as the covariates. Below is the output of the regression.
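
As a rough illustration of what classical MDS does with Euclidean distances, the sketch below recovers the first MDS coordinate by double-centering the squared distance matrix and finding its leading eigenvector with power iteration. It is plain Python for illustration, not the authors' R code; the function name, the iteration count, and the toy points are assumptions, and a real analysis would use a library eigen-solver.

```python
import math

def classical_mds_first_axis(points, iters=500):
    """First classical-MDS coordinate of each point (sign is arbitrary)."""
    n = len(points)
    # Squared Euclidean distances between all pairs of points.
    d2 = [[math.dist(p, q) ** 2 for q in points] for p in points]
    row_mean = [sum(r) / n for r in d2]
    grand_mean = sum(row_mean) / n
    # Double centering: B[i][j] = -0.5 * (d2 - row mean - column mean + grand mean).
    b = [[-0.5 * (d2[i][j] - row_mean[i] - row_mean[j] + grand_mean)
          for j in range(n)] for i in range(n)]
    # Power iteration for the leading eigenvector of B.
    v = [float(i + 1) for i in range(n)]
    for _ in range(iters):
        w = [sum(b[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Rayleigh quotient gives the eigenvalue; coordinates are sqrt(eig) * v.
    eig = sum(v[i] * sum(b[i][j] * v[j] for j in range(n)) for i in range(n))
    return [math.sqrt(max(eig, 0.0)) * x for x in v]

# Hypothetical 2-D points that vary mostly along the first coordinate.
points = [(0, 0), (1, 0.2), (2, -0.1), (4, 0.3), (7, 0.0)]
mds1 = classical_mds_first_axis(points)
print([round(x, 3) for x in mds1])
```

For points on a line, the returned coordinates are simply the centered positions, which shows that with Euclidean distances an MDS axis is a linear function of the input variables.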

## 
## Call:
## lm(formula = mds1 ~ bpm + sec_duration + number_fights + trope_count, 
##     data = fight_songs)
## 
## Residuals:
##        Min         1Q     Median         3Q        Max 
## -4.332e-15 -2.255e-16 -3.440e-17  3.061e-16  2.246e-15 
## 
## Coefficients:
##                 Estimate Std. Error    t value Pr(>|t|)    
## (Intercept)   -3.184e-01  5.673e-16 -5.613e+14   <2e-16 ***
## bpm           -1.364e-02  2.923e-18 -4.668e+15   <2e-16 ***
## sec_duration   2.358e-03  3.801e-18  6.204e+14   <2e-16 ***
## number_fights  1.964e-01  3.167e-17  6.201e+15   <2e-16 ***
## trope_count    3.726e-01  6.050e-17  6.158e+15   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 7.509e-16 on 60 degrees of freedom
## Multiple R-squared:      1,  Adjusted R-squared:      1 
## F-statistic: 4.205e+31 on 4 and 60 DF,  p-value: < 2.2e-16

In this regression table, ‘bpm’ has a negative coefficient, while the other three variables (‘sec_duration’, ‘number_fights’, and ‘trope_count’) have positive coefficients. Note that the fit is exact (R-squared of 1, residuals on the order of 1e-15): when classical MDS is computed from Euclidean distances, each MDS dimension is a linear combination of the input variables, so the reported standard errors and p-values reflect numerical round-off rather than statistical evidence. The coefficients are best read as the weights with which each variable contributes to \(MDS_1\). Since all four selected variables contribute to \(MDS_1\), we include them in our final pairs plot, colored by ‘official_song’.

In this pairs plot, among fight songs that are formally recognized as official songs, there appears to be a negative correlation between ‘number_fights’ and ‘bpm’ and a positive correlation between ‘number_fights’ and ‘trope_count’, as the absolute values of their correlation coefficients are greater than 0.15. For songs that are not official, there appear to be correlations between all pairs of variables, as the absolute values of their correlation coefficients are all greater than 0.15. However, the number of unofficial songs in our dataset is small (<10), so these apparent correlations may not be as strong as they look.
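
The per-group correlation coefficients shown in a colored pairs plot can be computed as in the sketch below. It is plain Python for illustration, not the authors' R code; the helper names and the records are hypothetical, with keys chosen to mirror the dataset's column names.

```python
import math

def pearson_r(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def r_by_group(records, x_key, y_key, group_key):
    """Compute Pearson r between two fields separately for each group."""
    groups = {}
    for rec in records:
        groups.setdefault(rec[group_key], []).append(rec)
    return {
        g: pearson_r([r[x_key] for r in rows], [r[y_key] for r in rows])
        for g, rows in groups.items()
    }

# Hypothetical records, NOT rows from the fight-song dataset.
records = [
    {"official_song": "Yes", "number_fights": 5, "bpm": 70},
    {"official_song": "Yes", "number_fights": 0, "bpm": 150},
    {"official_song": "Yes", "number_fights": 2, "bpm": 140},
    {"official_song": "Yes", "number_fights": 4, "bpm": 80},
    {"official_song": "No", "number_fights": 1, "bpm": 120},
    {"official_song": "No", "number_fights": 3, "bpm": 118},
    {"official_song": "No", "number_fights": 6, "bpm": 90},
]
result = r_by_group(records, "number_fights", "bpm", "official_song")
print({g: round(r, 3) for g, r in result.items()})
```

With fewer than ten points in a group, a coefficient needs to be roughly 0.6 or larger in absolute value before it is distinguishable from chance at the 0.05 level, so small-group values near 0.15 should be read with caution.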

Conclusion

In this report, we investigated the factors associated with fight songs being chosen as official songs, the factors associated with fight songs being chosen as the result of a contest, and the distribution of fight songs in a higher-dimensional space.

In the first section, we find that most official songs contain both “victory” and “win” (or “won”), and some contain “win” (or “won”) but not “victory”, which may suggest that official fight songs favor more colloquial wording that is easier to remember. We also find that bpm appears to be more informative than duration in distinguishing official fight songs from non-official ones.

In the second section, we summarize the distribution of the number of tropes for contest and non-contest songs with a density ridgeline plot. To test whether the difference is significant, we perform a t-test, whose p-value of 0.02852 indicates that the mean number of tropes differs between the two groups. We also perform chi-squared tests for independence between ‘contest’ and each of ‘victory_win_won’, ‘men’, and ‘opponents’ separately; none of them shows significant evidence of dependence.

In the last section, we cluster all fight songs into two main groups using a dendrogram with complete linkage, and find that the clusters broadly align with the variables ‘official_song’ and ‘contest’. To further investigate the potential correlations between the quantitative variables in our dataset, we apply multi-dimensional scaling and a pairs plot to display the distribution of fight songs in the higher-dimensional space. The results suggest that, among official songs, there is a negative correlation between ‘number_fights’ and ‘bpm’ and a positive correlation between ‘number_fights’ and ‘trope_count’. Among unofficial songs, there appear to be correlations between all pairs of selected variables. However, no firm conclusion can be drawn for that group, since the number of unofficial songs in our dataset is small.

Future Discussion

Limitations

Overall, the dataset is small, with only 65 observations from five college football conferences. The results would be more generalizable if we could include fight songs from college football teams outside these conferences.
The dataset has only a handful of categorical variables indicating whether the song includes words like “rah”, “victory”, and other fight-related words. Including the counts of these words and the lyrics of each fight song would allow a more comprehensive analysis, as we could carry out text analysis and try to identify interesting patterns in the lyrics.

Future Research

Because the dataset is small and most of its variables are categorical, several questions remain open as directions for future research. For example, for the second research question, we found that the mean number of tropes differs between songs that result from a contest and those that do not, but the dataset lacks counts of the individual words, so we cannot identify which exact words appear significantly more often in one group than in the other. In addition, if we could obtain the lyrics of each fight song, we could apply topic modeling to see whether there are latent categories of fight songs related to whether a song is official and whether it resulted from a contest. Finally, with more data from other colleges, we could try to identify cliché elements of fight songs: for example, whether fight songs generally mention the name of the opponent, whether they tend to be fast, and what their typical length is.

Appendix

Below is a short specification of all variables contained in our dataset:

school: School name

conference: School college football conference

song_name: Song title

writers: Song author

year: Year the song was written. Some values are Unknown

student_writer: Was the author a student? Some values are Unknown

official_song: Is the song the official fight song according to the university?

contest: Was the song chosen as the result of a contest?

bpm: Beats per minute

sec_duration: Duration of song in seconds

fight: Does the song say “fight”?

number_fights: Number of times the song says “fight”

victory: Does the song say “victory”?

win_won: Does the song say “win” or “won”?

victory_win_won: Does the song say “victory,” “win” or “won”?

rah: Does the song say “rah”?

nonsense: Does the song use nonsense syllables (e.g. “Whoo-Rah” or “Hooperay”)?

colors: Does the song mention the school colors?

men: Does the song refer to a group of men (e.g. men, boys, sons, etc.)?

opponents: Does the song mention any opponents?

spelling: Does the song spell anything?

trope_count: Total number of tropes (fight, victory, win_won, rah, nonsense, colors, men, opponents, and spelling)

spotify_id: Spotify ID for the song

36315 Final Project

Zhuyun Jin, Tianyi Zhang, Yixin Pan

2023-04-28