One of the most intense survival shows on television today, “Alone” follows a variety of contestants who all share the same goal: surviving some of the harshest conditions in nature. Contestants are dropped off in remote locations, where they must withstand 100 days in the Arctic winter, relying solely on their survival skills. The show attracts thousands of viewers across the United States per season.
The data used in this report come from the Alone data package, obtained from a public data repository. The package contains contestant and viewer information for the survival TV show “Alone.” This report analyzes variables from two of the datasets within the package: the survivalists dataset (contestant information across 9 seasons) and the episodes dataset (viewer data for each episode).
The survivalists dataset contains 94 contestants and 16 variables: season, age, name, gender, city, state, country, result, days_lasted, medically_evacuated, reason_tapped_out, reason_category, team, day_linked_up, profession, and url.
The variables of interest from this dataset are as follows:
state: state the contestant is from (categorical variable)
age: contestant’s age (quantitative variable)
gender: contestant’s gender (categorical variable)
medically_evacuated: whether or not the contestant was medically evacuated from the game (categorical variable)
reason_category: the simplified category of the reason a contestant tapped out (categorical variable)
days_lasted: the number of days the contestant lasted in the game before tapping out or winning (quantitative variable)
The episodes dataset contains 98 episodes and 11 variables: version, season, episode_number_overall, episode, title, air_date, viewers, quote, author, imdb_rating, and n_ratings. Because the dataset has incomplete data for Season 9 of the show, episodes from Season 9 were excluded from the analysis.
The variables of interest from this dataset are as follows:
episode_number_overall: episode number across seasons (quantitative variable)
viewers: number of viewers in the US, in millions (quantitative variable)
imdb_rating: IMDb rating of the episode (quantitative variable)
n_ratings: number of ratings for the episode (quantitative variable)
Based upon the data, we are interested in which factors influence contestant success as well as general reception of the television show “Alone.” Given the intense nature of this reality show, we are particularly interested in contestant attributes that contribute to success, as well as trends in viewership and ratings across different seasons.
Specifically, the research questions of interest are:
1. What is the demographic breakdown of the contestants on the show?
2. How are age, gender, medically_evacuated, and reason_category associated with days_lasted?
3. What trends are there in terms of viewership and ratings across different seasons of the show?
To answer the first research question, we created various graphs to understand the distributions of the predictor variables. The first graph we created was a map of where the contestants are from.
The map above shows where the survivalists are from, by state, excluding Alaska and Hawaii. Sixteen states have no contestants. The state with the largest number of contestants is Maine; this is plausible, since Maine has a lot of wilderness, so contestants from there might be better trained. Washington and Utah have the next-largest numbers of contestants, which is plausible by the same logic. Because the contestants come from a diverse range of states, home state might not have much of an effect in the linear model.
The graph above shows the distribution of the age variable, which we examined to understand how age could be used in the analysis. The distribution is approximately normal with no outliers, which suggests that age is a reasonable candidate predictor for the linear regression.
The bar graph above shows the gender distribution of all the contestants. There are 40 more male contestants than female contestants. Although the split is unequal, the variable is still of interest: every winner so far has been male, and the winners are the contestants who lasted the longest in their seasons. Gender might therefore help explain how many days a contestant lasts on the show, so we include it.
This plot shows the distribution of age, grouped by whether or not the contestant was medically evacuated. This matters because medical evacuation might be related to days lasted, which would be relevant to the regression. For both contestants who were medically evacuated and those who were not, the distribution of age is approximately normal and unimodal. The distribution for contestants who were medically evacuated is slightly skewed left, so medically_evacuated might be a useful variable to include in the linear model.
The last graph for this question shows the distribution of days lasted by the category of the reason the contestant left the show. The median for loss of inventory is much lower than the medians for family/personal and medical/health. The ranges for family/personal and loss of inventory are about the same, but the range for medical/health is larger than both. The median for medical/health is higher than the median for family/personal. Note that there is an N/A category, which covers contestants who won: they have no reason for leaving, so their value is N/A. Overall, the reason categories have somewhat different distributions of days lasted, which suggests the variable may be useful as a predictor.
To further analyze the impacts of age and gender of the contestant on contestant success, we chose to investigate the variables age, gender, and days_lasted in the survivalists dataset. There may potentially be a relationship between age and the number of days lasted, since younger people are typically expected to last more days due to their physical health compared to older people. There may also potentially be a relationship between gender and the number of days lasted, since men may be more attuned to survival situations (due to gender norms, physicality, etc.). Thus, we are interested in factors that may predict or be associated with the number of days lasted by each contestant; these factors could help us deduce which contestants are likely to last the longest.
The plot above displays age on the x-axis and the number of days lasted on the y-axis, colored by gender. For males, there seems to be a potentially strong, positive relationship between age and days lasted: as age increases, the number of days lasted increases. For females, there seems to be a negative relationship between age and days lasted: as age increases, the number of days lasted decreases. This presents an interesting difference between the trends for males and females with respect to age.
Based upon the initial exploration into the variables of interest, there may potentially be a relationship between age, gender, whether the contestant was medically evacuated, the reason the contestant tapped out, and the number of days lasted. Thus, we chose to conduct a multiple linear regression to analyze the association between the previously mentioned predictor variables and the response variable (days_lasted).
##
## Call:
## lm(formula = days_lasted ~ gender + age + medically_evacuated +
##     reason_category, data = survivalists)
##
## Residuals:
##     Min      1Q  Median      3Q     Max
## -39.967 -19.928   1.064  17.093  47.847
##
## Coefficients:
##                                   Estimate Std. Error t value Pr(>|t|)
## (Intercept)                       46.31573   14.40742   3.215  0.00190 **
## genderMale                       -19.80800    6.65601  -2.976  0.00389 **
## age                                0.03548    0.30767   0.115  0.90848
## medically_evacuatedTRUE          -16.11304    7.55754  -2.132  0.03615 *
## reason_categoryLoss of inventory   0.90523   14.86822   0.061  0.95161
## reason_categoryMedical / health   13.01920    6.93492   1.877  0.06421 .
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 24.67 on 78 degrees of freedom
##   (10 observations deleted due to missingness)
## Multiple R-squared:  0.1581, Adjusted R-squared:  0.1041
## F-statistic:  2.93 on 5 and 78 DF,  p-value: 0.0178
Based on the linear regression summary output, the number of days lasted is associated with several factors. Using t-tests on the coefficients, we found that the intercept, gender (male), and medical evacuation were statistically significant at the 0.05 level. The interpretations of the significant predictors are as follows:
genderMale: We reject the null hypothesis that the coefficient is equal to 0, since the p-value is less than the significance level of 0.05 (t-test, t = -2.976, p = 0.00389). Gender is associated with the number of days lasted: holding the other predictors constant, when the contestant is male, the number of days lasted decreases by approximately 20 days.
medically_evacuatedTRUE: We reject the null hypothesis that the coefficient is equal to 0, since the p-value is less than the significance level of 0.05 (t-test, t = -2.132, p = 0.03615). Medical evacuation is associated with the number of days lasted: holding the other predictors constant, if the contestant was medically evacuated, the number of days lasted decreases by approximately 16 days.
Thus, the linear regression identified gender and medical evacuation as factors associated with the number of days lasted; both being male and being medically evacuated are associated with a decrease in the number of days lasted.
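As a quick consistency check on the interpretation above, the t statistics and p-values can be recomputed from the estimates and standard errors printed in the summary table. The sketch below copies those numbers by hand rather than refitting the model, and the 95% confidence intervals are an addition that does not appear in the original output; the variable names (`est`, `se`, `df_resid`) are illustrative only.

```r
# Estimates and standard errors copied from the printed lm() summary.
est <- c(genderMale = -19.80800, medically_evacuatedTRUE = -16.11304)
se  <- c(genderMale =   6.65601, medically_evacuatedTRUE =   7.55754)

# Residual degrees of freedom as reported in the output.
df_resid <- 78

# t statistic: estimate divided by its standard error.
t_stat <- est / se

# Two-sided p-value from the t distribution.
p_val <- 2 * pt(-abs(t_stat), df = df_resid)

# 95% confidence intervals (not shown in the original output).
crit <- qt(0.975, df = df_resid)
ci <- cbind(lower = est - crit * se, upper = est + crit * se)

print(round(t_stat, 3))
print(signif(p_val, 3))
print(round(ci, 2))
```

If both intervals exclude zero, that agrees with rejecting the null hypothesis at the 0.05 level for these two coefficients.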
Finally, to investigate the trends in Alone episode viewership and ratings, we used dimension-reduction and clustering techniques: principal component analysis (PCA) and dendrogram plots.
For these plots, we used the episodes dataset from the Alone data, filtered to just the quantitative variables: “episode_number_overall,” the overall number of the episode across the entire show; “viewers,” how many viewers each episode had, in millions; “imdb_rating,” the episode’s IMDb rating out of 10; and “n_ratings,” how many ratings the episode received on IMDb. To perform PCA, we had to remove all of the NA values, which reduced the dataset to the first 8 seasons of the show, as there is no viewer data for the ninth season.
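The filtering step described above can be sketched in base R as follows. The data frame `episodes_toy` and every value in it are made up for illustration (they are not taken from the Alone data); only the four column names come from the report.

```r
# Toy stand-in for the episodes data; all values are invented for illustration.
episodes_toy <- data.frame(
  season                 = c(1, 1, 2, 2, 9, 9),
  episode_number_overall = c(1, 2, 12, 13, 88, 89),
  title                  = c("a", "b", "c", "d", "e", "f"),
  viewers                = c(1.6, 1.7, 1.2, 1.1, NA, NA),
  imdb_rating            = c(7.5, 7.7, 7.9, 8.0, 8.1, 7.8),
  n_ratings              = c(135, 110, 60, 58, 40, 35)
)

quant_vars <- c("episode_number_overall", "viewers", "imdb_rating", "n_ratings")

# Keep season (for colouring plots later) plus the four quantitative columns,
# then drop every row that has a missing value in any of them.
episodes_quant <- na.omit(episodes_toy[, c("season", quant_vars)])

print(nrow(episodes_quant))
print(unique(episodes_quant$season))
```

In this toy example the rows with missing `viewers` are exactly the season 9 rows, mirroring the situation described in the text.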
First, we created a scree plot to determine how many principal components to retain, which we also used as the number of clusters for our dendrogram.
Based on the plot, two is the best choice for our dataset. The first two principal components together account for 80.9% of the variance, and they are the only components above the 1/p dotted line.
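The quantities behind a scree plot can be computed directly with `prcomp()`. The sketch below uses simulated data with the same four column names, so its numbers will not match the 80.9% reported above; standardizing the variables (`scale. = TRUE`) is an assumption about how the PCA was run. With p standardized variables, each contributes 1/p of the total variance, which is why 1/p serves as a reference line: a component above it explains more than a single original variable would.

```r
set.seed(1)
n <- 40

# Simulated stand-in for the four quantitative episode variables.
toy <- data.frame(episode_number_overall = 1:n)
toy$viewers     <- 2 - 0.02 * toy$episode_number_overall + rnorm(n, sd = 0.1)
toy$imdb_rating <- 7.5 + 0.01 * toy$episode_number_overall + rnorm(n, sd = 0.2)
toy$n_ratings   <- 150 - 2 * toy$episode_number_overall + rnorm(n, sd = 10)

# PCA on centred, standardized variables (assumed, not stated in the report).
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance explained by each component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)

# Reference line used in the scree plot: 1 / number of variables.
threshold <- 1 / ncol(toy)

print(round(var_explained, 3))
print(round(cumsum(var_explained), 3))
print(which(var_explained > threshold))
```

The cumulative sum gives the share of variance captured by the first k components, which is the figure quoted when choosing how many to keep.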
Next, we created a PCA plot using the first two components, a choice based on the scree plot.
Based on the plot, the first two components appear to separate the seasons. For example, season 1, shown in orange, has mostly negative values for both PC1 and PC2; season 4, in pink, has PC1 values close to zero (slightly positive or negative) and all positive values for PC2; and season 7, in navy blue, has all positive values for PC1 and all negative values for PC2. Most of the data points within each season are also clustered together.
Finally, we created a dendrogram to clarify how the clusters are formed.
Based on the dendrogram, we can once again see that the small sub-clusters are primarily made up of one season each: season 1 (orange) sits all the way on the left in its own sub-cluster, and season 3 (sky blue) makes up the rest of the left, red cluster. This pattern continues throughout the rest of the dendrogram.
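A dendrogram of this kind can be sketched with base R's `dist()`, `hclust()`, and `cutree()`. The report does not state the distance measure or linkage method, so Euclidean distance on standardized variables and complete linkage (the `hclust()` default) are assumptions here, as are the simulated data and the choice of two clusters.

```r
set.seed(2)

# Simulated stand-in: three "seasons" of five episodes each, with made-up values.
toy <- data.frame(
  season      = rep(1:3, each = 5),
  viewers     = c(rnorm(5, 1.6, 0.05), rnorm(5, 1.2, 0.05), rnorm(5, 0.9, 0.05)),
  imdb_rating = c(rnorm(5, 7.5, 0.10), rnorm(5, 8.0, 0.10), rnorm(5, 8.5, 0.10))
)

# Standardize so that no variable dominates the distance calculation.
scaled <- scale(toy[, c("viewers", "imdb_rating")])

# Hierarchical clustering: Euclidean distance, complete linkage (assumed).
hc <- hclust(dist(scaled), method = "complete")

# Cut the tree into two clusters and compare against season labels.
clusters <- cutree(hc, k = 2)
print(table(season = toy$season, cluster = clusters))

# To draw the dendrogram with season labels on the leaves:
# plot(hc, labels = toy$season)
```

The cross-tabulation is a compact way to check the visual impression that sub-clusters line up with seasons.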
From all of these plots, we can see that the four quantitative variables in the episodes data vary systematically with the individual seasons. This suggests that we can use episode number, viewer numbers, IMDb ratings, and number of ratings to see differences between the seasons of Alone.
Through answering the research questions, we came to a few conclusions about the dataset. First, we saw that there were relationships between many of the variables, which suggested that a linear regression with days_lasted as the response variable could tell us more about the dataset. From that linear regression, we found that gender and medical evacuation are associated with the number of days lasted: male contestants, and contestants who were medically evacuated, tend to last fewer days. We also wanted to see if there were any trends in viewership and ratings; the plots we created show that episode number, viewer numbers, IMDb ratings, and number of ratings differ between seasons of “Alone.”
Many questions remain for future work. We focused on the survivalists and episodes datasets because we wanted to show the relationships within each of them; the package also contains a loadouts dataset and a seasons dataset that we did not examine. Future work could explore those datasets and could also look for relationships between datasets, not just within them.
It could also be interesting to explore differences by country; most of this data is from the US, but the show runs in multiple countries. We would need more data for this, as the data for other countries was very limited and not enough to draw formal conclusions from. Since there are only 9 seasons, and therefore only 9 winners, it would be compelling to revisit the data once the show has more winners. All of the winners so far are male, so future work could focus on the winners and examine whether gender is related to winning. As we can see, there are many possibilities for future work, and the Alone datasets offer plenty of material for further analysis.