Final Project: An Analysis of the Hollywood dataset

Introduction

Our final report analyzes a dataset of 1162 rows and 13 columns containing information on Hollywood movies: release year, movie name, director, and the actors' names, genders, and birth dates, along with other related variables. The report explores three research questions: how the gender of the director relates to the gender of the main actors, how the representation of diverse genders and sexual orientations in Hollywood movies has changed over time, and how the ages of the actors relate to the release year of the movie. By examining these questions, the report aims to provide insight into trends and patterns in the Hollywood movie industry and to identify limitations and areas for future research.

  • The dataset has 13 columns and 1162 rows.

  • It covers gender, age, and name information for the protagonists and the director.

The world of Hollywood is full of fascinating questions, some of which have the potential to challenge our preconceptions about the industry. How does the gender of a director impact the choice of main actors and actresses for a movie? Has the representation of diverse genders and sexual orientations in Hollywood movies changed over the years? And what about the impact of a movie’s release year on the age range of the actors cast? These questions can help us better understand the complex dynamics that shape the movies we love, and uncover insights about Hollywood that we may never have considered before. Whether you’re a film buff or simply curious about the industry, exploring these questions can lead to a greater appreciation of the art and business of making movies.

  • How does the gender of the director influence the gender of the main actor/actress in the movie?
  • How has the representation of diverse genders and sexual orientations in Hollywood movies changed over the years?
  • How does the release year of the movie impact the age range of the actors cast in it?

Question 1

The first question we would like to examine is “How does the gender of the director influence the gender of the main actor/actress in the movie?”. To dig into this question, we first look at the gender distribution of movie directors by year.

The number of both male and female movie directors in the dataset increases substantially after 1990. The trend is particularly striking for female directors, as very few women appear in this role prior to 1990.

Since 1990, there has been a noticeable increase in gender diversity among main movie characters. Prior to this, heterosexual characters were clearly the predominant depiction. Furthermore, from 2000 to 2025, there has been a significant rise in the portrayal of male-male character pairings. This trend reflects Hollywood’s increasing efforts towards inclusivity with regard to gender representation.

It is uncertain whether the rise of female directors is causally related to the increase in gender inclusivity during the 1990s, or if it is simply coincidental. Although there has been a substantial increase in the number of female directors since the 1990s, it is difficult to determine if this trend was the driving force behind the push towards gender inclusivity in Hollywood, or if it was simply a reflection of larger societal changes. Nonetheless, the concurrent rise of both female directors and gender inclusivity in Hollywood during the 1990s suggests that there may be some association between the two.

## 
##  Pearson's Chi-squared test
## 
## data:  tab1a
## X-squared = 7.3472, df = 4, p-value = 0.1186

We conducted a chi-square test to examine the relationship between the gender of movie directors and the gender of main movie characters, using the information on the gender of directors and of main actors/actresses in the Hollywood dataset. The test compares the observed frequency of each combination of director gender and main-character gender category against the frequency expected if the two variables were unrelated. Note that the 4 degrees of freedom reported for tab1a imply a table larger than 2 × 2 (for example 3 × 3), so the character variable has more categories than a simple male/female split. With a p-value of 0.1186, which is above 0.05, we fail to reject the null hypothesis: the data show no statistically significant association between the gender of movie directors and the gender of main movie characters. This finding applies only to the sample of movies included in the dataset and may not generalize to the entire movie industry. Further research is needed to explore this relationship in more detail and to determine whether other factors, such as genre or audience demographics, play a role in shaping the representation of gender in Hollywood movies.
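As a rough illustration of what the chi-square test computes, the sketch below builds the statistic from expected frequencies and recomputes the reported p-value. It is written in Python using only the standard library, since the report's own R code is not shown. The counts in `hypothetical_table` are invented for illustration and are not the contents of tab1a, and the p-value helper uses a closed form that only holds for even degrees of freedom.

```python
import math


def chi_squared_statistic(table):
    """Return Pearson's chi-squared statistic and degrees of freedom for a table of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            statistic += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return statistic, df


def chi_squared_p_value_even_df(statistic, df):
    """Upper-tail chi-squared probability; this closed form is valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this closed form only covers positive, even df")
    half = statistic / 2.0
    series = sum(half ** k / math.factorial(k) for k in range(df // 2))
    return math.exp(-half) * series


# Hypothetical 3 x 3 counts (NOT the report's tab1a), chosen so that df = 4.
hypothetical_table = [[30, 20, 10], [20, 25, 15], [10, 15, 35]]
hypothetical_statistic, hypothetical_df = chi_squared_statistic(hypothetical_table)
print(hypothetical_statistic, hypothetical_df)

# Recompute the p-value from the statistic and df printed in the report.
reported_p_value = chi_squared_p_value_even_df(7.3472, 4)
print(reported_p_value)
```

Plugging in the reported X-squared = 7.3472 with df = 4 gives a p-value that agrees with the 0.1186 in the output above.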

Question 2

How has the representation of sexual orientations in Hollywood movies changed over the years? The assumption would be that this representation has increased, but we cannot know without running the numbers. We started by adding an extra variable to the dataset recording whether each couple was heterosexual or homosexual. Then we just had to crunch some numbers and make some graphs.

A quick table here shows the overall proportions of sexualities of couples within the data (these are the demonstrated sexualities of the characters, not the actors).

## 
## Heterosexual   Homosexual 
##         1138           23

Only 1.98% of the couples in the dataset are homosexual; the remaining 98.02% are heterosexual. We will therefore worry less about the absolute numbers of couples and more about the changes in proportion.
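The derived variable and the percentages can be sketched as follows, in Python for illustration. The rows in `hypothetical_couples` and the rule in `couple_type` (same gender means homosexual) are assumptions, because the report does not show the dataset's column names or the exact rule used. The counts in `reported_counts` are taken from the table above.

```python
from collections import Counter


def couple_type(gender_1, gender_2):
    """Label a couple by whether the two characters' genders match (assumed rule)."""
    return "Homosexual" if gender_1 == gender_2 else "Heterosexual"


def percentages(counts):
    """Convert a mapping of label -> count into label -> percentage, rounded to 2 places."""
    total = sum(counts.values())
    return {label: round(100 * n / total, 2) for label, n in counts.items()}


# Hypothetical rows of (character 1 gender, character 2 gender); not real data.
hypothetical_couples = [("man", "woman"), ("woman", "man"), ("man", "man"), ("woman", "woman")]
hypothetical_labels = Counter(couple_type(a, b) for a, b in hypothetical_couples)
print(hypothetical_labels)

# Counts as printed in the report's table.
reported_counts = {"Heterosexual": 1138, "Homosexual": 23}
reported_shares = percentages(reported_counts)
print(reported_shares)
```

With the reported counts, the shares come out at 98.02% and 1.98% of 1161 couples.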

Sexualities over time

This graph gives us a few key observations. The yellow line, which shows the density of all movie couples in the dataset over time, is nearly identical to the density of strictly heterosexual movie couples. This is expected, since heterosexual couples make up about 98% of the data. A two-sample KS test shows no statistically significant difference between the two distributions.

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  het and all
## D = 0.008583, p-value = 1
## alternative hypothesis: two-sided

We also notice that the distribution of homosexual couples over time appears very different from that of heterosexual couples. Another two-sample KS test supports this: the two distributions differ to a statistically significant degree.

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  hom and het
## D = 0.43325, p-value = 0.000422
## alternative hypothesis: two-sided
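For readers unfamiliar with the KS test, the sketch below shows the two pieces behind the output above: the statistic D, which is the largest gap between the two empirical distribution functions, and the asymptotic p-value from the limiting Kolmogorov distribution. It is written in Python for illustration and is not the code used for the report. The sample sizes 23 and 1138 are an assumption taken from the table of couples earlier in this section.

```python
import math
from bisect import bisect_right


def ks_statistic(sample_a, sample_b):
    """Largest vertical gap between the empirical CDFs of two samples."""
    a, b = sorted(sample_a), sorted(sample_b)
    d = 0.0
    for value in a + b:  # the empirical CDFs only change at observed values
        ecdf_a = bisect_right(a, value) / len(a)
        ecdf_b = bisect_right(b, value) / len(b)
        d = max(d, abs(ecdf_a - ecdf_b))
    return d


def ks_asymptotic_p_value(d, n_a, n_b, terms=100):
    """Two-sided p-value from the limiting Kolmogorov distribution."""
    lam = d * math.sqrt(n_a * n_b / (n_a + n_b))
    if lam < 0.2:
        # The series converges slowly here, and the p-value is 1 to machine precision.
        return 1.0
    total = sum((-1) ** (k - 1) * math.exp(-2 * k * k * lam * lam)
                for k in range(1, terms + 1))
    return min(1.0, max(0.0, 2 * total))


# Recompute the p-value for "hom and het" from the reported D,
# assuming 23 homosexual and 1138 heterosexual couples.
p_hom_vs_het = ks_asymptotic_p_value(0.43325, 23, 1138)
print(p_hom_vs_het)
```

Under that sample-size assumption, the recomputed p-value agrees with the 0.000422 reported for the comparison of homosexual and heterosexual couples.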

We see heterosexual couples going all the way back to the first film of the dataset in ___, and peaking in 2004, the year that accounts for the plurality of movies in the data. The first movie with a homosexual couple is from 1997, and the peak for these couples is shifted later, to 2010. This is roughly when many polls started showing that about 50% of Americans believed that gays and lesbians should be allowed to marry.

## [1] 2004
## [1] 2010

Gender granularities

## 
##   man woman 
##    12    11

Out of the 23 same-sex couples, 12 are male and 11 are female. We can see their individual distributions over time with two violin plots. We also show boxplots to get a better sense of the values within each distribution.

These give us a simpler and more engaging look at when gay and lesbian couples were shown in movies. We see that, besides one male gay couple in 1997, the male couples are highly concentrated in the late 2000s to early 2010s. The lesbian couples show a more uniform distribution, starting in 2003 and continuing past 2020.

Homosexual couples within movies

Now that we’ve looked into the distributions of couples over time, let’s focus on the central variable of the dataset: the age gap within each couple.

These two distributions look rather similar, so it may seem that age gaps in Hollywood movies stay relatively consistent whether the relationships they represent are straight or gay. We can check this by performing another KS test to compare the distributions, and a t-test to compare the means.

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  movies_hom$`Age Difference` and movies_het$`Age Difference`
## D = 0.14262, p-value = 0.7488
## alternative hypothesis: two-sided
## 
##  Welch Two Sample t-test
## 
## data:  movies_hom$`Age Difference` and movies_het$`Age Difference`
## t = 1.265, df = 22.543, p-value = 0.2188
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -1.839771  7.615426
## sample estimates:
## mean of x mean of y 
##  13.30435  10.41652

Both tests have p-values above 0.05 (0.7488 for the KS test and 0.2188 for the t-test), so we fail to reject the null hypothesis in each case. We find no statistically significant difference between homosexual and heterosexual couples in the distribution of age gaps, nor in the mean age gap, in Hollywood movies. With only 23 homosexual couples in the data, however, these tests have limited power, so a real difference cannot be ruled out.
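The Welch t-test used above can be sketched as follows, in Python for illustration. The function computes the t statistic and the Welch-Satterthwaite degrees of freedom for two samples with unequal variances. The two short lists are hypothetical, since the underlying age gaps are not reproduced in this report. The last lines only reuse numbers printed in the output above to check that they are mutually consistent.

```python
import math
from statistics import mean, variance


def welch_t(sample_x, sample_y):
    """Welch's t statistic, Welch-Satterthwaite df, and standard error of the difference."""
    n_x, n_y = len(sample_x), len(sample_y)
    var_term_x = variance(sample_x) / n_x  # sample variance (n - 1 denominator)
    var_term_y = variance(sample_y) / n_y
    standard_error = math.sqrt(var_term_x + var_term_y)
    t = (mean(sample_x) - mean(sample_y)) / standard_error
    df = (var_term_x + var_term_y) ** 2 / (
        var_term_x ** 2 / (n_x - 1) + var_term_y ** 2 / (n_y - 1)
    )
    return t, df, standard_error


# Hypothetical age gaps, for illustration only.
hypothetical_x = [1, 2, 3, 4, 5]
hypothetical_y = [2, 4, 6, 8, 10]
print(welch_t(hypothetical_x, hypothetical_y))

# Consistency check using only values printed in the report's output.
reported_difference = 13.30435 - 10.41652       # difference of the two sample means
ci_midpoint = (-1.839771 + 7.615426) / 2         # centre of the 95% confidence interval
implied_standard_error = reported_difference / 1.265  # difference divided by reported t
print(reported_difference, ci_midpoint, implied_standard_error)
```

The confidence interval is centred on the difference in means, as it should be, and the non-integer df of 22.543 in the output is what the Welch-Satterthwaite formula produces when the two groups differ in size and variance.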

Question 3

How does the release year of the movie impact the age range of actors cast? With this research question, we want to delve deeper into how the release year of a movie relates to the ages of the actors cast in it. The variables involved are Release.Year, Age.Difference, Actor.1.Age, and Actor.2.Age.

The first key visualization shows the distribution of movies by release year. This helps us understand how the number of Hollywood movies in the dataset changed over time.

Looking at this first histogram, we notice key patterns in the number of movies in this dataset by release year. From about 1925 to 1990 far fewer movies were made, and there is a drastic increase from 1990 to the present. We summarize the Release Year distribution by its mean and standard deviation, shown below. To highlight more interesting patterns, we then zoom in on a portion of the data by creating a subset of the original data that only covers release years from 1990 to the present.

## [1] 2000.594
## [1] 16.60167

Now, we take a look at a series of three scatterplots understanding the relationship between Release Year and Actor 1 Age, Release Year and Actor 2 Age, and Release Year and Actor Age Difference using a subset of our original data.

Looking at all three scatterplots, there does not seem to be any particular relationship between release year and actor age or actor age difference. Over time, there does not seem to be a change in the preference for actor age or actor age difference. This is interesting, because one might expect these to have changed as audiences, styles, and other factors evolved over different decades and time periods.

Lastly, to test whether there is a significant relationship between Release Year and Actor Age Difference, we created a linear regression model. We display the above scatterplot with the fitted regression line and then summarize the model to inspect whether we get significant p-values.

## 
## Call:
## lm(formula = Age.Difference ~ Release.Year, data = movies_post1990)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -10.075  -6.376  -2.014   4.305  40.320 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)
## (Intercept)  70.57652   67.48882   1.046    0.296
## Release.Year -0.03036    0.03364  -0.903    0.367
## 
## Residual standard error: 8.142 on 985 degrees of freedom
## Multiple R-squared:  0.0008262,  Adjusted R-squared:  -0.0001881 
## F-statistic: 0.8145 on 1 and 985 DF,  p-value: 0.367

Looking at the scatterplot and the linear regression model, we find no significant relationship between Release Year and Actor Age Difference. The fitted line is almost completely horizontal, with an estimated slope of -0.03036 years of age difference per release year. In the model summary, the p-value for the Release Year variable is 0.367, which is greater than 0.05, so the slope is not significantly different from zero. In addition, the R-squared value is approximately zero (0.0008), meaning that release year explains almost none of the variation in age difference.
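To make the regression output easier to read, the sketch below fits a one-predictor least-squares line by hand and then shows how the summary numbers relate to each other. It is written in Python for illustration and is not the code that produced the model. The small `hypothetical_x` and `hypothetical_y` lists are invented. The final lines use only values printed in the summary above.

```python
from statistics import mean


def simple_ols(x, y):
    """Least-squares slope, intercept, and R-squared for a single predictor."""
    x_bar, y_bar = mean(x), mean(y)
    s_xx = sum((xi - x_bar) ** 2 for xi in x)
    s_xy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = s_xy / s_xx
    intercept = y_bar - slope * x_bar
    ss_residual = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    r_squared = 1 - ss_residual / ss_total
    return slope, intercept, r_squared


# Hypothetical data, for illustration only.
hypothetical_x = [0, 1, 2, 3]
hypothetical_y = [1, 3, 5, 7]
print(simple_ols(hypothetical_x, hypothetical_y))

# Relationships among the values printed in the report's model summary.
estimate, std_error, residual_df = -0.03036, 0.03364, 985
t_value = estimate / std_error                    # t value = estimate / standard error
f_statistic = t_value ** 2                        # with one predictor, F = t squared
r_squared_from_t = t_value ** 2 / (t_value ** 2 + residual_df)
print(t_value, f_statistic, r_squared_from_t)
```

These identities reproduce the reported t value of -0.903, the F-statistic of 0.8145, and the multiple R-squared of 0.0008262. This is also why the overall F-test p-value equals the p-value for Release.Year (both 0.367) in a model with a single predictor.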

Conclusion

In conclusion, the Hollywood movie dataset offers a valuable resource for exploring trends and patterns in the film industry over time. This report analyzed 1162 rows and 13 columns of information about Hollywood movies to examine how the gender of the director relates to the gender of the main actor or actress in a movie, how the representation of sexual orientations has changed over time, and how release year relates to the ages of the actors cast. For the first question, the chi-square test found no significant association between the gender of movie directors and the gender of main movie characters, although the plots show more female directors and more varied gender pairings, including male-male pairings, since the 1990s. For the second question, homosexual couples make up only about 2% of the data, first appear in 1997, and are distributed significantly later in time than heterosexual couples, while the age gaps of homosexual and heterosexual couples do not differ significantly. For the third question, the regression on movies released since 1990 found no significant relationship between release year and the age difference between actors.

Limitations and future work

However, there are still limitations and opportunities for future work with this dataset. For instance, the dataset covers only Hollywood movies, contains few movies released before 1990, and includes only 23 homosexual couples, so more data from other regions and time periods would be needed to increase the generalizability of the results. Additionally, the team could explore more advanced statistical techniques, such as machine learning algorithms, to better understand the complex relationships between the variables. Furthermore, future research could examine the impact of other variables, such as demographic factors, on the relationships identified in the current dataset. Overall, these limitations and opportunities provide important directions for improving the quality and scope of research based on this dataset.