Dataset Description

Our report uses data from the New York Squirrel Census. Members of the Squirrel Census (https://www.thesquirrelcensus.com/) count and observe squirrels in New York City’s parks. This particular dataset contains information from the 2018 Central Park census. It consists of 3,023 rows, each representing a single squirrel sighting, so the same squirrel may have been recorded more than once. There are 36 columns, corresponding to the variables recorded for each sighting. For this analysis, we focus mainly on geographic location, date sighted, age, and primary fur color, as well as all of the binary behavioral variables (e.g. running and foraging). These variables of interest are explained in further detail below. The link to this dataset can be found here.

  • long: a quantitative variable indicating the longitude at which the squirrel was sighted
  • lat: a quantitative variable indicating the latitude at which the squirrel was sighted
  • date: the date on which the sighting occurred
  • age: a categorical variable indicating whether the squirrel sighted is an adult or a juvenile
  • primary_fur_color: a categorical variable indicating whether the squirrel’s primary fur color is gray, cinnamon, or black
  • running: a binary variable indicating whether or not the squirrel was seen running
  • chasing: a binary variable indicating whether or not the squirrel was seen chasing
  • climbing: a binary variable indicating whether or not the squirrel was seen climbing
  • eating: a binary variable indicating whether or not the squirrel was seen eating
  • foraging: a binary variable indicating whether or not the squirrel was seen foraging
  • kuks: a binary variable indicating whether or not the squirrel was heard kukking (kuks are a sound made by squirrels for many reasons)
  • quaas: a binary variable indicating whether or not the squirrel was heard quaaing (quaas are a sound made by squirrels in the presence of a ground predator)
  • moans: a binary variable indicating whether or not the squirrel was heard moaning (moans are a sound made by squirrels in the presence of an air predator)
  • tail_flags: a binary variable indicating whether or not the squirrel was seen flagging its tail
  • tail_twitches: a binary variable indicating whether or not the squirrel was seen twitching its tail
  • approaches: a binary variable indicating whether or not the squirrel was seen approaching humans
  • indifferent: a binary variable indicating whether or not the squirrel was indifferent to human presence
  • runs_from: a binary variable indicating whether or not the squirrel was seen running from humans

Research Questions

Below, we enumerate the overarching questions we wished to explore with this squirrel census dataset.

  1. Are older squirrels more likely to exhibit certain behaviors than younger squirrels?
  2. Is there a relationship between the location where the squirrel was sighted and primary fur color?
  3. Does the distribution of how far above the ground a squirrel was first seen vary across age groups?

With the questions above, we hope to get a better understanding of some of the underlying behaviors and patterns exhibited by squirrels in New York City.

Question 1

First, we are interested in the differences in observed behavior between adult and juvenile squirrels. For this portion of the report, we therefore filter out any rows with blank entries for the variable age.
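The filtering step can be sketched as follows. This is an illustrative Python version using only the standard library, not the code behind the report. The helper name `drop_blank_age` and the sample rows are hypothetical, and treating `None` and empty strings as "blank" is an assumption about how missing ages are encoded.

```python
def drop_blank_age(rows):
    """Keep only the sightings whose 'age' field is present and non-empty."""
    kept = []
    for row in rows:
        age = row.get("age")
        # Assumption: None, empty, and whitespace-only values all count as blank.
        if age is not None and str(age).strip() != "":
            kept.append(row)
    return kept


# Hypothetical rows for illustration only (not taken from the census data).
sample_rows = [
    {"age": "Adult", "foraging": True},
    {"age": "", "foraging": False},
    {"age": "Juvenile", "foraging": True},
    {"age": None, "foraging": False},
]
filtered_rows = drop_blank_age(sample_rows)
print(len(filtered_rows), "of", len(sample_rows), "rows kept")
```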

Stacked Bar Charts

We will begin our analysis by creating a series of stacked bar charts of the conditional distribution of each behavioral variable given age on the proportion scale. The only variables for which there is a clear visual difference in proportion between adult and juvenile squirrels are whether or not they were seen foraging and whether or not they were indifferent to human presence. The stacked bar charts for these two variables are represented below.

The stacked bar chart for foraging indicates that the proportion of adult squirrels seen foraging is higher than the proportion of juvenile squirrels seen foraging.

The stacked bar chart for indifferent indicates that the proportion of adult squirrels observed being indifferent to human presence is higher than the proportion of juvenile squirrels observed being indifferent to human presence.
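The quantity each stacked bar displays is a conditional proportion: the share of sightings within an age group for which a behavior was recorded. A minimal Python sketch of that computation follows. The helper name `conditional_proportions` and the toy rows are hypothetical and are not census values.

```python
from collections import Counter


def conditional_proportions(rows, group_key, flag_key):
    """Return {group: share of rows in that group where flag_key is true}."""
    totals = Counter()
    positives = Counter()
    for row in rows:
        group = row[group_key]
        totals[group] += 1
        if row[flag_key]:
            positives[group] += 1
    return {group: positives[group] / totals[group] for group in totals}


# Toy rows for illustration only (not census data).
toy_rows = [
    {"age": "Adult", "foraging": True},
    {"age": "Adult", "foraging": True},
    {"age": "Adult", "foraging": False},
    {"age": "Adult", "foraging": False},
    {"age": "Juvenile", "foraging": True},
    {"age": "Juvenile", "foraging": False},
    {"age": "Juvenile", "foraging": False},
    {"age": "Juvenile", "foraging": False},
]
foraging_by_age = conditional_proportions(toy_rows, "age", "foraging")
print(foraging_by_age)
```

Working on the proportion scale matters here because the two age groups differ greatly in size, so raw counts would not be comparable.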

Chi-Squared Test For Independence

We now proceed by conducting a chi-squared test for independence between age and whether or not the squirrel was seen foraging, as well as age and whether or not the squirrel was seen being indifferent to human presence.

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(squirrels_subset$age, squirrels_subset$foraging)
## X-squared = 21.807, df = 1, p-value = 3.015e-06
## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(squirrels_subset$age, squirrels_subset$indifferent)
## X-squared = 3.5803, df = 1, p-value = 0.05847

Only the test between age and foraging yields a p-value (3.015e-06) below the alpha level of 0.05. We therefore reject the null hypothesis that a squirrel’s age and whether or not it was seen foraging are independent. We do not, however, reject the null hypothesis that a squirrel’s age and whether or not it was indifferent to human presence are independent (p = 0.05847), although this p-value falls only slightly above the threshold.
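To make the test concrete, here is a minimal Python sketch of the Pearson chi-squared statistic with Yates’ continuity correction for a 2x2 table. It uses only the standard library and is not the code that produced the output above. The function names and the placeholder counts are hypothetical. The p-value helper relies on the fact that, with 1 degree of freedom, the upper tail of the chi-squared distribution equals `erfc(sqrt(x / 2))`.

```python
import math


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


def chi_squared_yates_2x2(table):
    """Return (statistic, p_value) for a 2x2 table [[a, b], [c, d]] of counts."""
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    n = sum(row_totals)
    statistic = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / n
            deviation = abs(table[i][j] - expected)
            # Yates' correction: shrink each deviation by 0.5, never past zero.
            corrected = deviation - min(0.5, deviation)
            statistic += corrected ** 2 / expected
    return statistic, chi2_sf_df1(statistic)


# Placeholder counts for illustration only (not the census contingency table).
placeholder_table = [[10, 20], [30, 40]]
statistic, p_value = chi_squared_yates_2x2(placeholder_table)
print(f"X-squared = {statistic:.5f}, p-value = {p_value:.4f}")

# P-values recomputed from the two statistics reported in the output above.
p_foraging = chi2_sf_df1(21.807)
p_indifferent = chi2_sf_df1(3.5803)
```

Passing the reported statistics 21.807 and 3.5803 to `chi2_sf_df1` gives values that closely agree with the p-values printed in the output.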

Mosaic Plot

We now use a mosaic plot colored by Pearson residuals to identify the particular relationship between a squirrel’s age and the foraging behavior.

The mosaic plot shows that the observed count of juvenile squirrels not seen foraging is significantly higher than expected under independence, and that the observed count of juvenile squirrels seen foraging is significantly lower than expected under independence. As this is consistent with the stacked bar chart, we conclude that juvenile squirrels are less likely than adult squirrels to be observed foraging.
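The Pearson residual behind the shading of each mosaic cell is (observed - expected) / sqrt(expected), where the expected count assumes independence. A minimal Python sketch follows, with a hypothetical function name and placeholder counts that are not the census table.

```python
import math


def pearson_residuals(table):
    """Return Pearson residuals (observed - expected) / sqrt(expected) per cell."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        residual_row = []
        for j, observed in enumerate(row):
            # Expected count under independence of the row and column variables.
            expected = row_totals[i] * col_totals[j] / n
            residual_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(residual_row)
    return residuals


# Placeholder counts for illustration only (rows: age groups, columns: foraging no/yes).
placeholder_table = [[10, 20], [30, 40]]
residuals = pearson_residuals(placeholder_table)
for row in residuals:
    print([round(value, 3) for value in row])
```

A positive residual marks a cell with more sightings than independence predicts, and a negative residual marks a cell with fewer. A common rule of thumb treats magnitudes above roughly 2 as notable.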

Question 2

We were also interested in whether there was a relationship between where a squirrel was sighted in Central Park and its fur color. One way we can explore this is by creating a locational density plot of the squirrels conditioned on their primary fur color. Before we do that, let’s check out the locational density plot for all squirrels.

Locational Squirrel Density Plot

In this plot, we see the estimated density of all squirrel sightings in Central Park, which gives us an idea of where squirrels are sighted most often. The contours show that sightings are most concentrated in the very center of the park. This could indicate that the center of the park is more habitable than the areas closer to the street, which would be worth investigating if further data is collected. It could also mean that squirrels are simply easier to sight in that part of the park.

Now that we have an idea of the density of squirrels in Central Park, we want to see how that differs by their fur color. To explore this, we make a locational density plot of squirrels conditioned on their primary fur color.

Locational Squirrel Density Plot Conditioned on Primary Fur Color

The plot above adds information about how sightings are distributed for each primary fur color. The first pattern we notice concerns squirrels whose primary fur color is cinnamon: they do not appear in the northern side of Central Park. This could suggest that squirrels’ locational preferences are associated with their primary fur color, with squirrels of certain colors preferring different areas of the park. As in the previous density plot of all squirrels sighted in Central Park, sighting density is greatest closer to the center of the park regardless of fur color.

Question 3

Finally, we are interested in whether the distribution of how far above the ground squirrels were first sighted varies across ages. For this portion of the report, we therefore filter our data to include only the observations in which the squirrels were seen above ground level.

Conditional Smoothed Density Plot

We begin our analysis with a visual comparison using a conditional smoothed density plot of the squirrels’ height above the ground given age. We note that the distributions have different shapes and that the center of the distribution is shifted right for juvenile squirrels.
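A smoothed density curve of this kind is typically a kernel density estimate computed separately for each group. The Python sketch below uses a Gaussian kernel with a bandwidth chosen by hand. The function name, the bandwidth value, and the placeholder heights are all assumptions for illustration, and the sketch does not reproduce whatever bandwidth rule the plotting software used.

```python
import math


def gaussian_kde(sample, bandwidth):
    """Return a function estimating the density of `sample` with a Gaussian kernel."""
    if bandwidth <= 0:
        raise ValueError("bandwidth must be positive")
    norm = len(sample) * bandwidth * math.sqrt(2.0 * math.pi)

    def density(x):
        # Sum one Gaussian bump per observation, then normalize to integrate to 1.
        total = sum(math.exp(-0.5 * ((x - xi) / bandwidth) ** 2) for xi in sample)
        return total / norm

    return density


# Placeholder heights for illustration only (not census data), one list per age group.
heights_by_age = {
    "Adult": [3.0, 5.0, 10.0, 12.0],
    "Juvenile": [8.0, 15.0, 20.0],
}
densities = {age: gaussian_kde(h, bandwidth=2.0) for age, h in heights_by_age.items()}

grid = [i * 0.1 for i in range(-200, 501)]  # evaluation points from -20 to 50
curves = {age: [f(x) for x in grid] for age, f in densities.items()}
```

Plotting each entry of `curves` against `grid` on shared axes gives the conditional comparison described above.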

Bartlett Test

We now proceed with a series of statistical analyses to ascertain if these visually observed differences in distribution are statistically significant. We begin with the Bartlett test, which checks if all the variances are equal.

## 
##  Bartlett test of homogeneity of variances
## 
## data:  above_ground_sighter_measurement by age
## Bartlett's K-squared = 28.278, df = 1, p-value = 1.051e-07

The p-value of the Bartlett test for homoscedasticity is below the alpha level of 0.05, so we reject the null hypothesis that the variance of the squirrels’ height above ground is the same for adult and juvenile squirrels.
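For reference, Bartlett’s statistic compares the log of the pooled variance with the weighted logs of the group variances. A minimal Python sketch follows. The function name and the toy samples are hypothetical, and the p-value shortcut through `erfc` applies only when there are two groups (1 degree of freedom), as in this report.

```python
import math
import statistics


def bartlett_statistic(groups):
    """Return (K-squared, degrees of freedom) for Bartlett's test."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    variances = [statistics.variance(g) for g in groups]  # sample variances
    total = sum(sizes)
    pooled = sum((n - 1) * v for n, v in zip(sizes, variances)) / (total - k)
    numerator = (total - k) * math.log(pooled) - sum(
        (n - 1) * math.log(v) for n, v in zip(sizes, variances)
    )
    correction = 1.0 + (
        sum(1.0 / (n - 1) for n in sizes) - 1.0 / (total - k)
    ) / (3.0 * (k - 1))
    return numerator / correction, k - 1


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2.0))


# Toy samples for illustration only (not census heights).
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
k_squared, df = bartlett_statistic([group_a, group_b])
print(f"K-squared = {k_squared:.4f}, df = {df}, p-value = {chi2_sf_df1(k_squared):.4f}")

# P-value recomputed from the statistic reported in the output above.
p_reported = chi2_sf_df1(28.278)
```

Bartlett’s test is known to be sensitive to departures from normality, so a rejection can reflect non-normal data as well as unequal variances.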

One-way ANOVA test

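Next, we conduct a one-way ANOVA test to see whether the mean height above ground is the same across age groups. Because the Bartlett test rejected the hypothesis of equal variances, we use the version of the test that does not assume equal variances.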

## 
##  One-way analysis of means (not assuming equal variances)
## 
## data:  above_ground_sighter_measurement and age
## F = 5.7617, num df = 1.00, denom df = 214.92, p-value = 0.01723

The p-value is below the alpha level of 0.05, so we reject the null hypothesis that the mean height above ground is the same for adult and juvenile squirrels.
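The "not assuming equal variances" form of the test is usually called Welch’s ANOVA: each group mean is weighted by its sample size divided by its variance. The Python sketch below computes the F statistic and both degrees of freedom. The function name and toy samples are hypothetical, and the p-value is omitted because the standard library has no F distribution function.

```python
import statistics


def welch_anova(groups):
    """Return (F, numerator df, denominator df) for Welch's one-way ANOVA."""
    k = len(groups)
    sizes = [len(g) for g in groups]
    means = [statistics.mean(g) for g in groups]
    variances = [statistics.variance(g) for g in groups]  # sample variances
    weights = [n / v for n, v in zip(sizes, variances)]
    weight_sum = sum(weights)
    weighted_mean = sum(w * m for w, m in zip(weights, means)) / weight_sum
    between = sum(
        w * (m - weighted_mean) ** 2 for w, m in zip(weights, means)
    ) / (k - 1)
    # Adjustment term that also determines the denominator degrees of freedom.
    lam = sum(
        (1 - w / weight_sum) ** 2 / (n - 1) for w, n in zip(weights, sizes)
    ) / (k ** 2 - 1)
    f_stat = between / (1 + 2 * (k - 2) * lam)
    return f_stat, k - 1, 1 / (3 * lam)


# Toy samples for illustration only (not census heights).
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
f_stat, df_num, df_den = welch_anova([group_a, group_b])
print(f"F = {f_stat:.4f}, num df = {df_num}, denom df = {df_den:.2f}")
```

With two groups, this F statistic equals the square of Welch’s t statistic. The fractional denominator degrees of freedom, like the 214.92 in the output above, come from the variance weighting.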

Two-sample KS Test

Lastly, we run a two-sample KS test to assess whether the distribution of squirrels’ height above ground is different across age groups.

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  squirrels_adult$above_ground_sighter_measurement and squirrels_juvenile$above_ground_sighter_measurement
## D = 0.099315, p-value = 0.3101
## alternative hypothesis: two-sided

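The p-value for the two-sample KS test is not below the alpha level of 0.05, so we cannot reject the null hypothesis that the distribution of squirrels’ height above ground is the same for adult and juvenile squirrels. Based on these three statistical analyses, we conclude that the variance and mean of squirrels’ height above ground differ between adult and juvenile squirrels. The KS test, however, does not detect a significant difference between the overall distributions. Failing to reject the null hypothesis is not evidence that the distributions are equal: two distributions with different means or variances cannot be identical, so this result may instead reflect the KS test’s limited power to detect the difference, particularly given the comparatively small number of juvenile sightings.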
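The KS statistic D is the largest vertical gap between the two empirical distribution functions. The Python sketch below computes D and an asymptotic p-value from the Kolmogorov limiting series. The function name, the toy samples, and the small-value cutoff of 0.2 are assumptions of this sketch, and it does not handle the exact small-sample calculation that statistical software may use.

```python
import math
from bisect import bisect_right


def ks_two_sample(sample_a, sample_b, terms=100):
    """Return (D, asymptotic two-sided p-value) for the two-sample KS test."""
    a = sorted(sample_a)
    b = sorted(sample_b)
    n, m = len(a), len(b)
    d_stat = 0.0
    for x in a + b:
        # Empirical CDFs: share of each sample that is <= x.
        cdf_a = bisect_right(a, x) / n
        cdf_b = bisect_right(b, x) / m
        d_stat = max(d_stat, abs(cdf_a - cdf_b))
    lam = d_stat * math.sqrt(n * m / (n + m))
    if lam < 0.2:
        # The series converges slowly here and the p-value is essentially 1.
        return d_stat, 1.0
    series = sum(
        (-1) ** (j - 1) * math.exp(-2.0 * (j * lam) ** 2)
        for j in range(1, terms + 1)
    )
    return d_stat, min(1.0, max(0.0, 2.0 * series))


# Toy samples for illustration only (not census heights).
group_a = [1.0, 2.0, 3.0, 4.0, 5.0]
group_b = [2.0, 4.0, 6.0, 8.0, 10.0]
d_stat, p_value = ks_two_sample(group_a, group_b)
print(f"D = {d_stat:.4f}, p-value = {p_value:.4f}")
```

Because D depends only on the single largest gap between the two curves, the test can miss differences that a test aimed specifically at means or variances would detect.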

Conclusion

Closing this analysis, we summarize the main findings as follows:

Our first question asked if there were differences in observed behavior between juvenile and adult squirrels. After creating stacked bar charts for each binary behavioral variable, we found that only foraging and indifference to human presence presented a notable visual difference between adults and juveniles. Using chi-squared tests for independence on both variables, we found significant evidence that age and foraging are dependent, but no significant evidence of dependence between age and indifference. The mosaic plot corroborates both the chi-squared result for foraging and the stacked bar chart. Ultimately, we conclude that juvenile squirrels are less likely to be seen foraging than adult squirrels.

Our second question asked if a relationship existed between the squirrels’ fur color and their location. Using a locational density plot, we first observed that sightings were concentrated toward the center of the park. Conditioning on fur color, we found that cinnamon squirrels do not appear in the north of the park. This could suggest that squirrels have locational preferences based on fur color.

Finally, our last question asked whether the distribution of how far above ground squirrels were first sighted varies by age group. We created a conditional smoothed density plot of height above ground given the age group. This graph revealed that the shape of the curve differs between juvenile and adult squirrels and that the center of the distribution is shifted right for juvenile squirrels. Using the Bartlett test, one-way ANOVA test, and two-sample KS test, we found that the variance and mean differ between the two groups; the KS test, however, did not detect a significant difference between the overall distributions.

Limitations and Further Research

Though these conclusions are useful for understanding the preferences and behaviors exhibited by squirrels in Central Park, flaws in the dataset and its methodology limit them. Because this is an observational study, the dataset may say more about the sighters’ behaviors than the squirrels’. We should also note that the dataset contains far more sightings of adult squirrels (2,568) than of juvenile squirrels (330), which may reduce the validity of our claims regarding age. For further research, we recommend observing more juvenile squirrels. In addition, because animals are heavily affected by different climate conditions throughout the year, we recommend observing squirrels throughout the entire year instead of during just a two-week period. This would allow for a better understanding of the seasonality of different variables. Finally, future research could also record squirrels’ species and examine what differences and similarities emerge.