Hoops and Homework:

Introduction:

As college athletics have become more competitive, many have questioned the relationship between academics and sports. We wanted to understand this relationship better and see how it plays out in college basketball. In this report, we explore whether a school’s academic ranking is associated with the performance of its college basketball team. More specifically, we use data from six randomly selected ACC schools and examine whether their game performance differs between weekday games and weekend games.

To answer this question, we conducted a multilevel analysis comparing teams from schools ranked in the Top 30 against those ranked 31-100, focusing on their win rates, scoring averages, and field goal attempts. We randomly selected three Top 30 schools: Duke (7), UNC (22), and the University of Virginia (24). We also randomly selected three schools ranked between 31 and 100: Boston College (39), NC State (60), and the University of Pittsburgh (67). We chose this topic because we believe that understanding the balance between academic achievement and athletic success is important for the overall development of student-athletes. This study could reveal trends and patterns that help design educational and athletic programs that support student-athletes. Insights from this research could also inform recruitment strategies, where institutions align their academic and athletic offerings to attract the best talent. Ultimately, a more integrated approach to student-athlete development could help schools strengthen their reputation and competitiveness in both academics and athletics.

Data:

To answer this question, we used college basketball data from the hoopR package in R, specifically the ‘load_mbb_team_box’ function, which returns men’s college basketball team box scores. We filtered the team_abbreviation column to include only Duke, UNC, UVA, Boston College, NC State, and the University of Pittsburgh. Since our study focuses on weekday games versus weekend games, we need a column that contains the day of the week on which each game was played. However, the hoopR data only contains the game date. Thus, we used the game_date column to create a new column, named game_day, that translates the date of each game into the corresponding day of the week.
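The day-of-week step can be sketched in base R as follows. This is a minimal illustration, not the original code: the toy `games` data frame and the `day_names` vector are assumptions, and only the five `game_date` values are taken from the printed output. It relies on the `"%u"` format code, which returns the ISO weekday number (1 = Monday) and does not depend on the system locale.

```r
# Toy stand-in for the filtered team box scores (illustrative only; the dates
# are the five game_date values that appear in the printed tibble).
games <- data.frame(
  game_id   = c(401638632, 401638622, 401638586, 401625413, 401625411),
  game_date = as.Date(c("2024-03-28", "2024-03-23", "2024-03-21",
                        "2024-03-16", "2024-03-15"))
)

# "%u" gives the ISO weekday number as text: "1" = Monday, ..., "7" = Sunday.
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
iso_day <- as.integer(format(games$game_date, "%u"))

# Store game_day as a factor with Monday first, so that Monday becomes the
# reference level when the column is used in a regression.
games$game_day <- factor(day_names[iso_day], levels = day_names)

print(games)
```

Storing `game_day` as a factor with an explicit level order is a design choice: without it, R orders the levels alphabetically and Friday would become the reference level.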

## # A tibble: 5 × 58
##     game_id season season_type game_date  game_date_time      team_id team_uid  
##       <int>  <int>       <int> <date>     <dttm>                <int> <chr>     
## 1 401638632   2024           3 2024-03-28 2024-03-28 21:54:00     153 s:40~l:41…
## 2 401638622   2024           3 2024-03-23 2024-03-23 17:45:00     153 s:40~l:41…
## 3 401638586   2024           3 2024-03-21 2024-03-21 14:45:00     153 s:40~l:41…
## 4 401625413   2024           2 2024-03-16 2024-03-16 20:30:00     153 s:40~l:41…
## 5 401625411   2024           2 2024-03-15 2024-03-15 19:00:00     153 s:40~l:41…
## # ℹ 51 more variables: team_slug <chr>, team_location <chr>, team_name <chr>,
## #   team_abbreviation <chr>, team_display_name <chr>,
## #   team_short_display_name <chr>, team_color <chr>,
## #   team_alternate_color <chr>, team_logo <chr>, team_home_away <chr>,
## #   team_score <int>, team_winner <lgl>, assists <int>, blocks <int>,
## #   defensive_rebounds <int>, fast_break_points <chr>, field_goal_pct <dbl>,
## #   field_goals_made <int>, field_goals_attempted <int>, …

Above we see the first 5 rows of our finalized data. The data includes a unique identifier for each game (game_id), as well as the date (game_date) and time (game_date_time) at which each game was played. The dataset also contains identifiers for the teams (team_id, team_uid) and a slug representing the team name (team_slug), which provides a straightforward way to reference the competing institutions, such as ‘nc-state-wolfpack’, ‘duke-blue-devils’, and ‘north-carolina-tar-heels’.

Exploratory Data Analysis:

Our first plot shows the average field goal percentage on weekdays and weekends, grouped by Top 30 and 31-100 ranked schools. The data suggests a negligible difference in performance between schools in the Top 30 academic rank and those ranked 31 to 100, on both weekdays and weekends. Specifically, the Top 30 schools have a slightly higher field goal percentage on weekdays, while schools ranked 31 to 100 have a marginal edge on weekends. This could imply that while academic ranking may be related to overall athletic success, as the win counts discussed next suggest, its direct relationship with a specific metric such as field goal percentage is less pronounced. The nearly equal performance across days of the week also hints at a consistent level of play regardless of academic ranking. Factors other than academic ranking, such as individual skill level, team dynamics, and specific game strategies, may have a more direct influence on shooting accuracy.

Next, we explored the win count on weekdays and weekends, grouped by Top 30 and 31-100 ranked schools. According to the bar chart, schools ranked within the Top 30 academically have more wins on both weekdays and weekends than those ranked between 31 and 100. This could indicate that higher academic standards are associated with more effective athletic programs, or that the environment and resources at top-ranked institutions contribute positively to their athletic success. However, it is important to consider other variables that might influence this outcome, such as the recruitment of athletes, investment in sports facilities, and the quality of the coaching staff.

Methods:

Our modeling was divided into three phases: a level-one naive model, a level-two model using the Two-Stage Modeling Approach, and a set of multilevel models with varying intercepts and slopes. Because our dataset is structured much like the NFL passing data covered in the course, we followed the analysis techniques used in demos 6 and 7 in particular.

The first phase is a naive logistic regression model, fit with the glm() function in R. This model predicts the probability of a team winning from the game day and the team’s top-30 status. Its assumptions include independence of observations and a binomial distribution for the response variable (team_winner). However, because our data are nested, with multiple games for each team, the independence assumption is likely violated, which motivates the more sophisticated approaches that follow. This initial model serves as a baseline that shows simple effects without any adjustment for intra-team correlation.
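A minimal sketch of such a naive model is shown here on simulated data, since the original code is not part of this report. The team abbreviations, the simulated effect sizes, and the object names are all assumptions made for illustration; only the model formula mirrors the description above.

```r
set.seed(1)

# Simulated stand-in data: 6 hypothetical teams x 50 games (NOT the hoopR data).
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")
n_games <- 300
games <- data.frame(
  team     = rep(c("DUKE", "UNC", "UVA", "BC", "NCST", "PITT"), each = 50),
  game_day = factor(rep(day_names, length.out = n_games), levels = day_names)
)
games$top_30 <- as.integer(games$team %in% c("DUKE", "UNC", "UVA"))

# In this simulation the win probability depends only on top_30
# (assumed log-odds effect of 0.5); game_day has no true effect.
games$team_winner <- rbinom(n_games, size = 1,
                            prob = plogis(0.3 + 0.5 * games$top_30))

# Naive logistic regression: every game is treated as an independent observation.
naive_fit <- glm(team_winner ~ game_day + top_30,
                 family = binomial(link = "logit"), data = games)

# One intercept, six game-day contrasts against Monday, and one top_30 effect.
print(summary(naive_fit)$coefficients)
```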

In the second phase, we implement a Two-Stage Modeling Approach. In the first stage, a separate logistic regression model is fit for each team, which captures team-specific behavior and the variability in winning probabilities across game days. In the second stage, we analyze the distribution of the intercepts and slopes from these individual models to identify broader patterns and potential outliers across teams. The main assumption, and the foundation of this approach, is that teams are independent of each other. We chose this approach as a first step toward addressing within-team correlation without fully modeling it, providing a transition from the naive model to more complex structures. It balances complexity and interpretability, allowing us to isolate team-specific effects before considering deeper structure in the data.
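The two stages can be sketched in base R as follows. This is an illustration on simulated data, not the original analysis: the team abbreviations, the assumed baselines, and the use of a binary `weekend` predictor (rather than the full day-of-week factor) are simplifications chosen to keep the example short.

```r
set.seed(2)

# Simulated stand-in data: 6 hypothetical teams x 60 games each.
teams <- c("DUKE", "UNC", "UVA", "BC", "NCST", "PITT")
games <- data.frame(
  team    = rep(teams, each = 60),
  weekend = rep(c(0, 1), times = 180)
)

# Each team gets its own baseline log-odds of winning (assumed values).
team_baseline <- setNames(c(1.0, 0.8, 0.6, 0.0, 0.2, 0.4), teams)
games$team_winner <- rbinom(
  nrow(games), size = 1,
  prob = plogis(team_baseline[as.character(games$team)])
)

# Stage 1: fit one logistic regression per team and keep its coefficients.
stage_one <- do.call(rbind, lapply(split(games, games$team), function(d) {
  coef(glm(team_winner ~ weekend, family = binomial, data = d))
}))

# Stage 2: summarise the distribution of team-level intercepts and slopes.
stage_two <- data.frame(
  term = colnames(stage_one),
  mean = colMeans(stage_one),
  sd   = apply(stage_one, 2, sd)
)

print(stage_one)
print(stage_two)
```

With only a handful of games per team and day, stage one can produce very large estimates with huge standard errors, which is why the stage-two summaries should be read alongside the per-team standard errors.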

Phase three advances to multilevel models with varying intercepts and varying slopes, where we not only account for the non-independence of observations within teams but also allow the effect of a predictor to vary by team. Because these models are more flexible, we also introduced a new variable, ‘field_goals_attempted’, since we believe its relationship with winning probability may differ from team to team. The assumptions include normally distributed random effects for the intercepts and slopes and a logit link function for the binary outcome of winning. These models are motivated by our sports question of how different factors influence game outcomes, and multilevel modeling matches the hierarchical structure of our data (games nested within teams), so that variation within and between teams is modeled explicitly.
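A hedged sketch of the three multilevel specifications, using `glmer()` from the lme4 package (assumed to be installed) on simulated data. The object names, the simulated effects, and the standardised `fga_z` column are assumptions for illustration; with only six teams, lme4 may report singular fits or convergence warnings, which is itself informative about how little team-level information is available.

```r
library(lme4)  # assumed to be installed; provides glmer()

set.seed(3)

# Simulated stand-in data: 6 hypothetical teams x 60 games each.
teams <- c("DUKE", "UNC", "UVA", "BC", "NCST", "PITT")
n <- 360
games <- data.frame(
  team    = factor(rep(teams, each = 60)),
  weekend = rep(c(0, 1), times = 180),
  field_goals_attempted = round(rnorm(n, mean = 58, sd = 6))
)
games$top_30 <- as.integer(games$team %in% c("DUKE", "UNC", "UVA"))

# Centre and scale the count so the optimiser works with a unit-scale predictor.
games$fga_z <- as.numeric(scale(games$field_goals_attempted))

# Team-specific shifts in the log-odds of winning (assumed sd of 0.5).
team_effect <- rnorm(6, sd = 0.5)[as.integer(games$team)]
games$team_winner <- rbinom(n, 1, plogis(0.3 + 0.4 * games$top_30 + team_effect))

# Varying intercepts only.
intercepts_only <- glmer(team_winner ~ 1 + (1 | team),
                         data = games, family = binomial)

# Fixed effects plus varying intercepts.
with_covariates <- glmer(team_winner ~ fga_z + top_30 + weekend + (1 | team),
                         data = games, family = binomial)

# The slope of field goals attempted also varies by team.
varying_slopes <- glmer(team_winner ~ fga_z + top_30 + weekend +
                          (1 + fga_z | team),
                        data = games, family = binomial)

# Lower AIC indicates a better trade-off between fit and complexity.
aic_table <- AIC(intercepts_only, with_covariates, varying_slopes)
print(aic_table)

# Variance components of the random effects.
print(VarCorr(varying_slopes))
```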

To evaluate and compare the models from the three phases, we used the Akaike Information Criterion (AIC) for model selection, together with an analysis of residuals and the significance of the model coefficients. The AIC is particularly useful for comparing models with different numbers of parameters because it balances goodness of fit against model complexity; lower values are preferred. For the multilevel models, we also look at the variance components of the random effects to evaluate whether these models adequately capture the hierarchical structure of the data. Because we fit several different types of models, we wanted a single, simple criterion that applies to all of them. These choices reflect the nature of sports analytics data, where models must deal with hierarchical structure and effects that vary across teams.
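To make the criterion concrete, AIC is defined as 2k - 2 log(L), where k is the number of estimated parameters and L is the maximised likelihood. The sketch below computes it by hand for two nested logistic regressions fit to simulated data and can be checked against R's built-in `AIC()`. The data and the helper name `aic_by_hand` are assumptions for illustration.

```r
set.seed(4)

# Simulated stand-in data (illustrative only).
n <- 300
games <- data.frame(
  top_30  = rep(c(0, 1), each = 150),
  weekend = rep(c(0, 1), times = 150)
)
games$team_winner <- rbinom(n, 1, plogis(0.2 + 0.5 * games$top_30))

small_fit <- glm(team_winner ~ top_30, family = binomial, data = games)
large_fit <- glm(team_winner ~ top_30 + weekend, family = binomial, data = games)

# AIC = 2k - 2 * log-likelihood, with k taken from the logLik object.
aic_by_hand <- function(fit) {
  ll <- logLik(fit)
  k  <- attr(ll, "df")
  2 * k - 2 * as.numeric(ll)
}

print(c(small = aic_by_hand(small_fit), large = aic_by_hand(large_fit)))
```

An extra parameter lowers the AIC only if it improves the log-likelihood by more than one unit, which is why adding a weak predictor usually makes the AIC slightly worse.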

Our analysis also quantifies the uncertainty of all model estimates. Standard errors, taken directly from the summary output, measure the variability in our estimates due to sampling error. From these standard errors, we compute confidence intervals for each coefficient, which give a range of plausible values for the true effect and allow us to assess the precision of our estimates. Within the multilevel modeling framework, we use the output of the lmer and glmer functions to obtain estimates of the variance components for the random effects. These estimates are crucial for understanding the uncertainty at the different levels of the data hierarchy (individual games within teams) and are particularly relevant when interpreting the random slopes, which describe how the relationship between predictors and outcomes varies across teams. Because the data are hierarchical, we report uncertainty at both the game level and the team level.
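As a sketch of how confidence intervals follow from standard errors, the example below builds 95% Wald intervals (estimate plus or minus 1.96 standard errors) from the coefficient table of a logistic regression. The simulated data and object names are assumptions; the interval construction itself is the standard Wald formula, which is also what `confint.default()` computes.

```r
set.seed(5)

# Simulated stand-in data (illustrative only).
n <- 300
games <- data.frame(top_30 = rep(c(0, 1), each = 150))
games$team_winner <- rbinom(n, 1, plogis(0.2 + 0.5 * games$top_30))

fit <- glm(team_winner ~ top_30, family = binomial, data = games)

# Coefficient table columns: Estimate, Std. Error, z value, Pr(>|z|).
coefs <- coef(summary(fit))

# 95% Wald interval on the log-odds scale.
z <- qnorm(0.975)
wald_ci <- cbind(
  lower = coefs[, "Estimate"] - z * coefs[, "Std. Error"],
  upper = coefs[, "Estimate"] + z * coefs[, "Std. Error"]
)

print(wald_ci)       # log-odds scale
print(exp(wald_ci))  # odds-ratio scale
```

An interval on the log-odds scale that contains 0 (equivalently, an odds-ratio interval that contains 1) means the data are compatible with no effect.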

Results:

The initial analysis used a naive logistic regression model to predict the probability of a team winning from the day of the game and whether the school was ranked in the top 30. This model served as a baseline for the influence of game scheduling and top-30 status, without accounting for the correlation among games played by the same team. Our summary output sets Monday games as the baseline. The intercept was 1.23356 with a standard error of 0.66217 (p=0.0625), which is suggestive but not significant at the 0.05 level. The coefficients for the other game days (Tuesday to Sunday) varied, with Tuesday games showing the largest decrease in the log-odds of winning (coefficient = -1.34008, p=0.0601), again short of significance at the 0.05 level. Being in the top 30 was associated with higher odds of winning (coefficient = 0.50343, p=0.0965), which hints at a competitive advantage for teams from higher-ranked schools but is not statistically significant at the 0.05 level.
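Because these coefficients are on the log-odds scale, it can help to translate them. The sketch below uses only the reported values (the intercept, the Tuesday and top-30 coefficients, and the top-30 standard error of 0.303 from the coefficient table); the variable names are illustrative.

```r
# Values copied from the reported Phase 1 summary output.
intercept    <- 1.23356   # Monday game, school not in the top 30
tuesday_coef <- -1.34008
top_30_coef  <- 0.50343
top_30_se    <- 0.303

# Inverse logit turns log-odds into a probability.
p_monday_other <- plogis(intercept)                # roughly 0.77
p_monday_top30 <- plogis(intercept + top_30_coef)  # roughly 0.85

# Exponentiating a coefficient gives an odds ratio.
or_top30   <- exp(top_30_coef)    # roughly 1.65
or_tuesday <- exp(tuesday_coef)   # roughly 0.26

# 95% Wald interval for the top-30 effect on the log-odds scale.
ci_top30 <- top_30_coef + c(-1, 1) * qnorm(0.975) * top_30_se

print(c(p_monday_other = p_monday_other, p_monday_top30 = p_monday_top30))
print(c(or_top30 = or_top30, or_tuesday = or_tuesday))
print(ci_top30)
```

The interval for the top-30 effect spans zero, which matches its p-value of 0.0965 being above 0.05.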

The second phase used a two-stage modeling approach. First, separate logistic regression models were fit for each team to explore within-team variability. Then, the intercepts and slopes from these models were aggregated and analyzed to capture overarching trends across teams. The per-team estimates of winning probability varied substantially across game days, suggesting that team performance is not uniform throughout the week. When the intercepts and slopes were aggregated, some teams showed consistently higher or lower performance irrespective of the game day. The first histogram shows the distribution of regression estimates for the different game days. In this plot, the estimates for weekend games are spread out with multiple peaks, suggesting varying effects on team performance. Some per-team estimates are extreme: in the coefficient table, the game-day estimates for BC are around -17 with standard errors near 2797, a pattern that typically indicates separation in a small sample rather than a real effect. To see the relationship more clearly, we cut off the outliers by setting a threshold of 20 on the absolute value of the estimates, as shown in plot 2. The remaining estimates are distributed around a central value with slight right skewness, suggesting that for some teams, Mondays might have a slightly positive impact on performance. The estimates for weekends are spread around zero, with most values slightly negative, which could suggest a slight disadvantage or more variable performance in weekend games.

The third phase used multilevel models to better capture the dependency structure within the data, combining fixed effects for known covariates with random effects for individual teams. For these models, we categorized Fridays through Mondays as weekends and Tuesdays through Thursdays as weekdays. The first model included only varying intercepts for each team, capturing baseline differences in winning probability among teams without game day effects or other covariates. Its AIC is 276.9, which serves as the reference point for the multilevel models. The second model added field_goals_attempted, top_30 status, and weekend as fixed effects alongside the varying intercepts. Its AIC is 278.5; the increase despite the additional covariates indicates that the added complexity did not meaningfully improve model fit. Lastly, the varying slopes model allows the effect of field_goals_attempted to vary by team, recognizing that different teams might respond differently to game dynamics. This model had the highest AIC, 281.3, which suggests overfitting or unnecessary complexity given the available data.
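The weekend grouping used in this phase (Friday through Monday counted as weekend) can be sketched as a simple recoding of the day-of-week column. The vectors below are illustrative; only the grouping rule comes from the text.

```r
day_names <- c("Monday", "Tuesday", "Wednesday", "Thursday",
               "Friday", "Saturday", "Sunday")

# One example row per day of the week (illustrative).
game_day <- factor(day_names, levels = day_names)

# Grouping rule from the text: Friday through Monday count as weekend.
weekend_days <- c("Friday", "Saturday", "Sunday", "Monday")
weekend <- as.integer(game_day %in% weekend_days)

print(data.frame(game_day, weekend))
```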

The varying intercepts model points to differences in baseline winning probabilities across teams. Adding game-specific covariates such as field_goals_attempted did not improve the AIC, and allowing the effect of field_goals_attempted to vary by team increased it further, so the data provide little support for team-specific slopes. Comparing across phases, the AIC values of the multilevel models (276.9 to 281.3) are all higher than that of the naive model (AIC = 275.11). Since lower AIC is preferred, the multilevel models do not improve fit by this criterion; their value lies in accounting for the nesting of games within teams, which the naive model ignores. Any differences in how top_30 and field_goals_attempted relate to winning across teams should therefore be treated as tentative.
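The reported AIC values can be placed side by side, keeping in mind that lower values indicate a better trade-off between fit and complexity. The sketch below uses only the four AIC values stated in the text; the model labels are illustrative, and the comparison assumes all models were fit to the same set of games.

```r
# AIC values as reported in the text.
aic <- c(
  naive_glm                  = 275.11,
  varying_intercepts         = 276.9,
  intercepts_plus_covariates = 278.5,
  varying_slopes             = 281.3
)

# Difference from the best (lowest) AIC.
delta <- aic - min(aic)

comparison <- data.frame(model = names(aic), AIC = aic,
                         delta_AIC = delta, row.names = NULL)
print(comparison)

# Name of the model with the lowest AIC.
best_model <- names(which.min(aic))
print(best_model)
```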

Our chosen Model 1, which combines fixed effects (including field goals attempted) with random intercepts for each team, is the most straightforward of the covariate models and has a lower AIC than the more complex varying slopes model. Under this model, neither playing on a weekend nor the number of field goals attempted significantly changes the odds of winning once uncertainty is taken into account: the standard errors for these estimates imply a wide range of plausible effects. While being a top 30 school appears to increase the odds of winning, the confidence interval around this estimate means the effect is not statistically conclusive. These findings highlight how complex game outcomes are and suggest that future analyses should explore additional variables that might affect game results, ideally with enough data to yield narrower confidence intervals.

## # A tibble: 14 × 10
##    model   term  estimate std.error statistic p.value team_abbreviation Estimate
##    <chr>   <chr>    <dbl>     <dbl>     <dbl>   <dbl> <chr>                <dbl>
##  1 Phase … (Int…  1.23e+0     0.662    1.86    0.0625 <NA>                    NA
##  2 Phase … game… -1.34e+0     0.713   -1.88    0.0601 <NA>                    NA
##  3 Phase … game… -1.73e-1     0.756   -0.228   0.819  <NA>                    NA
##  4 Phase … game… -8.88e-1     0.852   -1.04    0.297  <NA>                    NA
##  5 Phase … game…  3.89e-2     0.848    0.0459  0.963  <NA>                    NA
##  6 Phase … game… -8.46e-1     0.683   -1.24    0.216  <NA>                    NA
##  7 Phase … game… -1.16e+0     1.00    -1.16    0.248  <NA>                    NA
##  8 Phase … top_…  5.03e-1     0.303    1.66    0.0965 <NA>                    NA
##  9 Phase … game… -1.73e+1  2797.      NA       0.995  BC                      NA
## 10 Phase … game… -1.73e+1  2797.      NA       0.995  BC                      NA
## 11 Phase … game… -1.83e+1  2797.      NA       0.995  BC                      NA
## 12 Phase … game…  7.10e-9  3956.      NA       1.00   BC                      NA
## 13 Phase … game… -1.77e+1  2797.      NA       0.995  BC                      NA
## 14 Phase … game… -1.76e+1  2797.      NA       0.995  BC                      NA
## # ℹ 2 more variables: `Std. Error` <dbl>, `Pr(>|z|)` <dbl>

Discussion:

Our study offers some insight into the relationship between academic rankings and athletic performance. Game timing does not significantly affect outcomes, and there is a tendency, though not a statistically conclusive one, for teams from top-ranked universities to have better odds of winning, potentially due to better recruitment, funding, and motivation. Although Model 1 offers a balanced analysis, it does not account for variables such as coaching and individual player performance. A future study could analyze the effect of individual student-athletes’ academic progress on athletic performance, potentially using GPA as a proxy for academic commitment and examining its correlation with sports success. It would also be useful to explore the impact of coaching by evaluating the relationship between coaching staff credentials and team performance metrics.