
Introduction

One of the most intense survival shows on television today, “Alone” follows a variety of contestants who all share the same goal: surviving some of the harshest conditions in nature. In the show, contestants are dropped off in remote locations, where they must withstand 100 days in the Arctic winter relying solely on their survival skills. The show attracts thousands of viewers across the United States per season.

The data used in this report come from the Alone data package, available from a public data repository. The package contains contestant and viewer information for the survival TV show “Alone.” This report analyzes several variables from two of the datasets within the package: the survivalists dataset (contestant information across 9 seasons) and the episodes dataset (viewer data for each episode).

The survivalists data contains data on 94 contestants and 16 variables: season, age, name, gender, city, state, country, result, days_lasted, medically_evacuated, reason_tapped_out, reason_category, team, day_linked_up, profession, and url.

The variables of interest from this dataset are as follows:

state: state the contestant is from (categorical variable)

age: contestant’s age (quantitative variable)

gender: contestant’s gender (categorical variable)

medically_evacuated: whether or not the contestant was medically evacuated from the game (categorical variable)

reason_category: the simplified category of the reason a contestant tapped out (categorical variable)

days_lasted: the number of days the contestant lasted in the game before tapping out or winning (quantitative variable)

The episodes data contains data on 98 episodes and 11 variables: version, season, episode_number_overall, episode, title, air_date, viewers, quote, author, imdb_rating, and n_ratings. Since the dataset has incomplete data on Season 9 of the show, episodes from Season 9 were not included in the analysis.

The variables of interest from this dataset are as follows:

episode_number_overall: episode number across seasons (quantitative variable)

viewers: number of viewers in the US, in millions (quantitative variable)

imdb_rating: IMDb rating of the episode (quantitative variable)

n_ratings: number of ratings for the episode (quantitative variable)

Based upon the data, we are interested in which factors influence contestant success as well as general reception of the television show “Alone.” Given the intense nature of this reality show, we are particularly interested in contestant attributes that contribute to success, as well as trends in viewership and ratings across different seasons.

Specifically, the research questions of interest are:

1. What is the demographic breakdown of the contestants on the show?

2. How do age, gender, medically_evacuated, and reason_category affect days_lasted?

3. What trends are there in terms of viewership and ratings across different seasons of the show?

Research Questions

Question 1: What is the demographic breakdown of the contestants on the show?

To answer the first research question, we created various graphs to understand the distributions of the predictor variables. The first graph we created was a map of where the contestants are from.

The map above shows where the survivalists are from, by state, not including Alaska and Hawaii. Sixteen states have no contestants. The state with the largest number of contestants is Maine; this makes sense, as Maine has a lot of nature, so contestants from there might be better trained. Washington and Utah have the second-largest number of contestants, which also makes sense by the same logic. The contestants come from a diverse range of states, so where a contestant is from might not have an effect in the linear model.

The graph above shows the distribution of the age variable, which we examined to understand how it could be used in the analysis. The distribution is approximately normal and has no outliers, which suggests that age is a reasonable candidate predictor for the linear regression.

The bar graph above shows the gender distribution of all the contestants. There are 40 more male contestants than female contestants. Although the distribution is unequal, this variable is still important: every winner so far has been male, so the contestants who lasted the longest in each season were men. Gender might therefore be an important variable in determining how many days a contestant lasts on the show, so it should still be included.

This plot shows the distribution of age, grouped by whether or not the contestant was medically evacuated. This is important, as there might be a relationship between medical evacuation and days lasted that should be accounted for in the regression. As seen in the plot, for both contestants who were medically evacuated and those who were not, the distribution of age is approximately normal and unimodal. The distribution for contestants who were medically evacuated is slightly skewed left, so medical evacuation might be a good variable to include in the linear model.

The last graph for this question shows the distribution of days lasted by the category of the reason the contestant left the show. The median for loss of inventory is much lower than the medians for family/personal and medical/health. The ranges for family/personal and loss of inventory are about the same, but the range for medical/health is larger than both. The median for medical/health is higher than the median for family/personal. It is worth noting that there is an N/A category, which is for contestants who won; they have no reason for leaving, so their reason is recorded as N/A. Overall, the reason categories have somewhat different distributions of days lasted, which suggests that this variable may be useful as a predictor.
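The comparison of medians and ranges described above can be sketched in base R. The data frame below is invented purely for illustration (it is not the survivalists data), the exact spelling of the "Family / personal" label is an assumption, and only the column names days_lasted and reason_category come from the report.

```r
# Illustrative toy data only; the values are invented, not taken from the show.
surv_toy <- data.frame(
  days_lasted     = c(10, 30, 50, 5, 8, 40, 60, 80, 100, 77),
  reason_category = c("Family / personal", "Family / personal", "Family / personal",
                      "Loss of inventory", "Loss of inventory",
                      "Medical / health", "Medical / health", "Medical / health",
                      NA, NA),
  stringsAsFactors = FALSE
)

# Winners have no tap-out reason, so give their NA an explicit label
# (hypothetical label text) instead of letting tapply() drop them.
surv_toy$reason_label <- ifelse(is.na(surv_toy$reason_category),
                                "N/A (winner)", surv_toy$reason_category)

# Median and range (max minus min) of days lasted within each category.
medians <- tapply(surv_toy$days_lasted, surv_toy$reason_label, median)
ranges  <- tapply(surv_toy$days_lasted, surv_toy$reason_label,
                  function(x) diff(range(x)))

print(medians)
print(ranges)
```

Labelling the missing category before summarizing keeps the winners visible in the output, which matches how the N/A group appears in the plot.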

Question 2: How do age, gender, medically_evacuated, and reason_category affect days_lasted?

To further analyze how a contestant’s age and gender relate to success, we chose to investigate the variables age, gender, and days_lasted in the survivalists dataset. There may be a relationship between age and the number of days lasted, since younger people are typically expected to last more days than older people due to their physical health. There may also be a relationship between gender and the number of days lasted, since men may be more attuned to survival situations (due to gender norms, physicality, etc.). Thus, we are interested in factors that may predict or be associated with the number of days lasted by each contestant; these factors could help us deduce which contestants are likely to last the longest.

The plot above displays age on the x-axis and the number of days lasted on the y-axis, colored by gender. For males, there seems to be a potentially strong, positive relationship between age and days lasted; as age increases, the number of days lasted increases. For females, there seems to be a negative relationship between age and days lasted; as age increases, the number of days lasted decreases. This presents an interesting difference between the trends for males and females concerning age.

Based upon the initial exploration of the variables of interest, there may be a relationship between the number of days lasted and age, gender, whether the contestant was medically evacuated, and the reason the contestant tapped out. Thus, we chose to conduct a multiple linear regression to analyze the association between these predictor variables and the response variable (days_lasted).

## 
## Call:
## lm(formula = days_lasted ~ gender + age + medically_evacuated + 
##     reason_category, data = survivalists)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -39.967 -19.928   1.064  17.093  47.847 
## 
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)   
## (Intercept)                       46.31573   14.40742   3.215  0.00190 **
## genderMale                       -19.80800    6.65601  -2.976  0.00389 **
## age                                0.03548    0.30767   0.115  0.90848   
## medically_evacuatedTRUE          -16.11304    7.55754  -2.132  0.03615 * 
## reason_categoryLoss of inventory   0.90523   14.86822   0.061  0.95161   
## reason_categoryMedical / health   13.01920    6.93492   1.877  0.06421 . 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 24.67 on 78 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.1581, Adjusted R-squared:  0.1041 
## F-statistic:  2.93 on 5 and 78 DF,  p-value: 0.0178
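For readers who want to reproduce output of this shape, here is a minimal sketch of the model call. Only the formula is taken from the Call line above; the data frame `sim` is simulated for illustration, so its estimates will not match the reported ones.

```r
# Simulated stand-in data (NOT the real survivalists data). Every combination
# of the three categorical predictors is repeated four times.
grid <- expand.grid(
  gender              = c("Female", "Male"),
  medically_evacuated = c(FALSE, TRUE),
  reason_category     = c("Family / personal", "Loss of inventory",
                          "Medical / health"),
  stringsAsFactors    = TRUE
)
sim <- grid[rep(seq_len(nrow(grid)), times = 4), ]
rownames(sim) <- NULL

set.seed(1)
sim$age         <- 20 + (seq_len(nrow(sim)) * 7) %% 40   # ages between 20 and 59
sim$days_lasted <- round(runif(nrow(sim), min = 0, max = 100))

# Same formula as in the report's Call line.
fit <- lm(days_lasted ~ gender + age + medically_evacuated + reason_category,
          data = sim)

# Estimate, Std. Error, t value and Pr(>|t|) for each coefficient.
coef_table <- summary(fit)$coefficients
print(round(coef_table, 3))
```

With factor predictors, R uses the first level as the baseline, which is why the output above has no row for female contestants or for the family/personal category.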

Based on the linear regression summary output, the number of days lasted is impacted by various factors. By conducting t-tests, we found that the intercept, gender when male, and medical evacuation were statistically significant. The interpretations of the coefficients are as follows:

We reject the null hypothesis that the beta coefficient for the intercept is equal to 0 since the p-value is less than the significance level of 0.05 (t-test, t = 3.215, p = 0.00190). When the contestant is female, the reason the contestant tapped out is family/personal, the contestant was not medically evacuated, and age is 0, the predicted number of days lasted is approximately 46 days.

We reject the null hypothesis that the beta coefficient for genderMale is equal to 0 since the p-value is less than the significance level of 0.05 (t-test, t = -2.976, p = 0.00389). Gender is associated with the number of days lasted: holding the other variables constant, a male contestant is predicted to last approximately 20 fewer days than a female contestant.

We reject the null hypothesis that the beta coefficient for medically_evacuatedTRUE is equal to 0 since the p-value is less than the significance level of 0.05 (t-test, t = -2.132, p = 0.03615). Medical evacuation is associated with the number of days lasted: holding the other variables constant, a contestant who was medically evacuated is predicted to last approximately 16 fewer days.
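To make these interpretations concrete, the sketch below plugs the reported estimates into the fitted equation for a hypothetical contestant. The coefficient values are copied from the regression output; the contestant profile (age 40, not medically evacuated, family/personal reason) and the short vector names are made up for illustration.

```r
# Coefficient estimates copied from the regression output in this report.
b <- c(intercept               = 46.31573,
       genderMale              = -19.80800,
       age                     = 0.03548,
       medically_evacuatedTRUE = -16.11304)

# Hypothetical 40-year-old male, not medically evacuated, who tapped out for a
# family/personal reason (the baseline category, so no reason term is added).
pred_male <- b[["intercept"]] + b[["genderMale"]] + b[["age"]] * 40

# Same hypothetical profile, but female (baseline gender).
pred_female <- b[["intercept"]] + b[["age"]] * 40

# The gap between the two predictions is exactly the genderMale coefficient.
gap <- pred_male - pred_female

print(c(male = pred_male, female = pred_female, gap = gap))
```

The two predictions come out to about 27.9 and 47.7 days, a gap of about -19.8 days, which is the "approximately 20 fewer days" described above.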

Thus, the results of the linear regression found gender and medical evacuation as factors associated with the number of days lasted; the analysis found that both gender (being male) and medical evacuation are associated with a decrease in the number of days lasted.

Question 3: What trends are there in terms of viewership and ratings across different seasons of the show?

Finally, to investigate the trends in Alone episode viewership and ratings, we used dimension-reduction and clustering techniques: principal component analysis (PCA) and hierarchical clustering, displayed as a dendrogram.

For these plots, we use the episodes dataset from the Alone data, which we filtered to just the quantitative variables: “episode_number_overall,” the overall number of the episode across the entire show; “viewers,” how many viewers each episode had, in millions; “imdb_rating,” the episode’s IMDb rating out of 10; and “n_ratings,” how many ratings the episode received on IMDb. In order to perform PCA, we had to remove all of the NA values, which reduced the dataset to just the first 8 seasons of the show, as there is no viewer data for the ninth season.
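The filtering step can be sketched in base R as follows. The small `episodes_toy` data frame is invented for illustration (it is not the real episodes data); only the four quantitative column names and the season column come from the report.

```r
# Illustrative stand-in for the episodes data; all values are invented.
episodes_toy <- data.frame(
  season                 = c(1, 1, 8, 9, 9),
  episode_number_overall = c(1, 2, 87, 88, 89),
  viewers                = c(1.5, 1.7, 1.1, NA, NA),
  imdb_rating            = c(7.5, 7.7, 8.0, 7.9, NA),
  n_ratings              = c(130, 120, 60, 40, NA),
  title                  = c("a", "b", "c", "d", "e")
)

quant_vars <- c("episode_number_overall", "viewers", "imdb_rating", "n_ratings")

# Keep only rows with no missing value in any of the four quantitative columns.
keep <- complete.cases(episodes_toy[, quant_vars])
episodes_quant  <- episodes_toy[keep, quant_vars]
episodes_season <- episodes_toy$season[keep]   # kept aside to color plots by season

print(episodes_quant)
print(episodes_season)
```

Keeping the season labels in a separate vector means they stay aligned with the filtered rows without being fed into the PCA itself.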

First, we created a scree plot to determine how many principal components to keep and, in turn, how many clusters to use for our dendrogram.

Based on the plot, we can see that two is the best choice for our dataset. The first two principal components together account for 80.9% of the variance, and they are the only two components that lie above the 1/p dotted line.
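The quantities read off a scree plot can be computed directly. The sketch below uses simulated data in place of the episodes data, and standardizing the variables with `scale. = TRUE` is an assumption, since the report does not state how the PCA was configured.

```r
set.seed(315)  # simulated stand-in data; not the real episodes data
n <- 40
episode <- seq_len(n)
toy <- data.frame(
  episode_number_overall = episode,
  viewers     = 2 - 0.02 * episode + rnorm(n, sd = 0.1),
  imdb_rating = 7.5 + 0.01 * episode + rnorm(n, sd = 0.2),
  n_ratings   = 150 - 2 * episode + rnorm(n, sd = 10)
)

# PCA on centered, standardized variables (standardizing is an assumption).
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance per component: what a scree plot draws.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var  <- cumsum(prop_var)

# The 1/p reference line: the share each component would explain if all
# p components contributed equally.
threshold <- 1 / ncol(toy)
n_keep    <- sum(prop_var > threshold)

print(round(prop_var, 3))
print(round(cum_var, 3))
print(n_keep)
```

Here p is the number of variables (four in this analysis), so the reference line sits at 0.25.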

Next, we created a PCA plot using the first two components, which we chose based on the scree plot.

Based on the plot, the first two components appear to separate the seasons. For example, season 1, shown in orange, has mostly negative values for both PC1 and PC2; season 4, in pink, has PC1 values close to zero (slightly positive or slightly negative) and all positive values for PC2; and season 7, in navy blue, has all positive values for PC1 and all negative values for PC2. Most of the data points within each season are also clustered together.

Finally, we created a dendrogram to clarify how the clusters are created.

Based on the dendrogram, we can once again see that the small sub-clusters are primarily made up of one season; we see season 1 (orange) all the way on the left in its own sub-cluster, and season 3 (sky blue) makes up the rest of the left, red cluster. We can see this pattern continue throughout the rest of the dendrogram.
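A dendrogram like this one can be built with hierarchical clustering in base R. In the sketch below the data are simulated, and the Euclidean distance, complete linkage, and cut at two clusters are all assumptions, since the report does not state which settings were used.

```r
set.seed(315)  # simulated stand-in data with three toy "seasons"
season <- rep(1:3, each = 10)
X <- matrix(rnorm(30 * 4), ncol = 4,
            dimnames = list(NULL, c("episode_number_overall", "viewers",
                                    "imdb_rating", "n_ratings")))
X <- X + season * 3   # shift each toy season so the groups differ

# Standardize, compute pairwise Euclidean distances, then cluster.
d  <- dist(scale(X))
hc <- hclust(d, method = "complete")

# Cut the tree into two clusters and compare cluster membership with season.
clusters   <- cutree(hc, k = 2)
membership <- table(cluster = clusters, season = season)
print(membership)

# plot(hc) would draw the dendrogram itself.
```

Cross-tabulating clusters against season is a quick numerical check of what the colored dendrogram shows visually: whether episodes from the same season end up in the same branch.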

From all of these plots, we can see that the four quantitative variables in the episodes data vary systematically with season, although we did not formally test this. This suggests that we can use episode number, viewer numbers, IMDb ratings, and number of ratings to see differences between the seasons of Alone.

Conclusion

Through answering the research questions, we came to a few conclusions about the dataset. First, the exploratory plots suggested relationships between many of the variables, which motivated a linear regression with days_lasted as the response variable. From that regression, we found that gender and medical evacuation are associated with the number of days lasted: holding the other variables constant, male contestants and contestants who were medically evacuated are predicted to last fewer days. We also wanted to see if there were any trends in viewership and ratings; the plots we created show that episode number, viewer numbers, IMDb ratings, and number of ratings differ between seasons of “Alone.”

Discussion

Many questions could be answered through future work. We focused mostly on the survivalists data, and we limited the analysis to two datasets because we wanted to show the relationships within each of them. The package also includes a loadouts dataset and a season dataset that we did not analyze. Future work could examine those datasets and could look for relationships between datasets as well as within them.

It could also be interesting to explore differences by country; much of this data is from the US, but the show runs in multiple countries. We would need more data for this, as the data for other countries was very limited and not enough to draw formal conclusions from. Since there are only 9 seasons, and therefore only 9 winners, it would be compelling to revisit the data once the show has produced more winners. All of the winners so far are male, so future work could focus on the winners and examine whether there is a relationship between gender and winning. As we can see, there are many possibilities for future work, and the Alone datasets offer plenty of material for further analysis.

36-315 Final Project - Beyond the Wilderness: Determining Success and Trends in Alone

Ananya Manglik, Tasnim Rida, Jordan Brown

2023-12-11