Dataset Description

The data is a daily record of the top trending Youtube videos, as determined by Youtube. Note that according to the official Youtube Help page, Youtube determines whether a video is trending using a combination of data, including views, location of viewers, and age of video.

The data were taken from the Kaggle dataset “Trending YouTube Video Statistics.” The complete Kaggle dataset consists of separate .csv files for several countries, but our data combine only the files for the United States and Canada.

In total, there are 28,512 rows and 16 columns. Each row corresponds to a trending video, and the columns hold quantitative or qualitative data about that video. The qualitative variables are video_id, title, channel_title, category_id, tags, thumbnail_link, comments_disabled, ratings_disabled, video_error_or_removed, and description. The quantitative variables are trending_date, publish_time, views, likes, dislikes, and comment_count.

Research Questions

We will focus on answering the following three questions and will elaborate on the motivation / approach further along in this report.

How do the publication times of trending Youtube videos in the major video categories vary over the course of a year? How about over the course of a day?
What is the consensus on YouTube content about Obama, Trump, and Biden, and what does this content generally contain?
How do viewers react to trending music videos on Youtube in terms of their likes and comments? How do viewers react differently to VEVO and non-VEVO music videos on Youtube?

Research Question 1

To begin, we wanted to take a closer look at general patterns in Youtube videos. In particular, we were interested in exploring both yearly and daily uploading trends. The research question of interest, then, is: How do the publication times of trending Youtube videos in the major video categories vary over the course of a year? How about over the course of a day?

To approach this research question, we zoom in on the five most frequent Youtube video categories in our data set: Sports, People & Blogs, Comedy, Entertainment, and News & Politics.

New variables were created to isolate the month a video was published and the time of day a video was published from the publish_time attribute which was of the form yyyy-mm-dd hh:mm:ss.
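The derivation of these two variables can be sketched as follows. This is a minimal Python illustration rather than the code used for the report (whose output appears to come from R); the function name `month_and_hour` and the sample timestamps are hypothetical, and only the yyyy-mm-dd hh:mm:ss format is taken from the text.

```python
from datetime import datetime


def month_and_hour(publish_time):
    """Return (month, hour) from a 'yyyy-mm-dd hh:mm:ss' timestamp string."""
    parsed = datetime.strptime(publish_time, "%Y-%m-%d %H:%M:%S")
    return parsed.month, parsed.hour


# Hypothetical timestamps for illustration; these are not rows from the dataset.
samples = ["2018-02-14 16:05:30", "2017-11-30 03:59:59"]
derived = [month_and_hour(s) for s in samples]
```

Each video then carries an integer month (1 to 12) and an integer hour of day (0 to 23), which are the quantities plotted in the graphs for this question.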

Here, we display the distribution of the month in which trending videos were published, with each density curve colored by the video's category. To avoid overcrowding the graphic, we examine only the five most frequent video categories. As can be seen in the graph, there is a strange dip in density for all categories from about June to October. Outside of this dip, the publish month appears to peak in January for trending news & politics videos, around February for trending sports and comedy videos, around March for trending people & blogs videos, and around May for trending entertainment videos.

To examine the dip in the graph further, we have zoomed in to the months July to October to better understand the trends in these lower density months. As seen in the above graph, the densities are not exactly zero for these months. Comedy videos and people & blogs videos seem to peak around September. This could potentially be due to a peak in video content corresponding to the beginning of the academic school year. It is unclear why there is a dip in the densities for these months. We hypothesize that the data were not collected evenly and perhaps there were gaps in the data collection process for these summer months. Due to this strange pattern in the data, it is difficult to concretely conclude how the trends in the Youtube video trending months correspond with annual trends and attribute peaks or valleys to annual events.

For clarification, all times in the data set are recorded in UTC (Coordinated Universal Time). To put the times in context (given that the data we are working with are all videos from the US and Canada), EST is UTC−05:00 and EDT is UTC−04:00. Across all five graphs we can observe a dip in publication time frequency around 09:00 UTC, which is understandable because this corresponds to the early morning hours in the US/Canada, when most people are probably asleep.

For the sports category, although a large share of videos is published around 16:00-21:00 UTC, there appears to be a more prominent peak around 03:00-04:00 UTC. This suggests that many trending sports videos are published from late morning through the afternoon in the US/Canada (11:00-16:00 EST), but the largest share is published in the late evening (22:00-23:00 EST). However, for the rest of the categories, the trend appears to differ.

For most of the other categories (specifically people & blogs, comedy, and entertainment), there appears to be a peak around 16:00 UTC, which corresponds to 11:00 EST or 12:00 EDT. This could be partially attributed to the fact that many popular content creators/Youtubers post at certain key times during the day, such as 12:00 EDT, which is also 09:00 PDT, another popular publish time for many California-based (specifically Los Angeles-based) content creators/Youtubers. Additionally, there seems to be another block of frequent publish times ranging from 01:00 to 05:00 UTC.
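The UTC conversions quoted in this section can be checked with a short Python sketch. The fixed offsets for EST and EDT come from the text; the PDT offset of UTC−07:00 is inferred from the statement that 12:00 EDT is 09:00 PDT. The helper `local_hour` and the placeholder date are hypothetical.

```python
from datetime import datetime, timedelta, timezone

# Fixed offsets only; no daylight-saving rules are applied here.
EST = timezone(timedelta(hours=-5), "EST")
EDT = timezone(timedelta(hours=-4), "EDT")
PDT = timezone(timedelta(hours=-7), "PDT")  # inferred from 12:00 EDT = 09:00 PDT


def local_hour(utc_hour, zone):
    """Convert an hour of day in UTC to the hour of day in a fixed-offset zone."""
    stamp = datetime(2018, 1, 1, utc_hour, tzinfo=timezone.utc)  # arbitrary date
    return stamp.astimezone(zone).hour


# Local hour of the 16:00 UTC peak in each zone.
peak = {zone.tzname(None): local_hour(16, zone) for zone in (EST, EDT, PDT)}
# Local hour of the 09:00 UTC dip on the East Coast.
dip_est = local_hour(9, EST)
```

Because the data do not record each publisher's time zone, a conversion like this only bounds the plausible local times; it cannot recover the true local publish time of any single video.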

For news & politics videos, the peak around 16:00 UTC is not as extreme. In fact, there appear to be three modes at 02:00, 16:00 and 22:00 UTC. This could be attributed to the fact that news videos get published around the clock with morning, afternoon, and evening news programs.

Thus, we can conclude that the majority of trending Youtube videos seem to be published around noon time although we are unable to pinpoint a specific time because times are specific to the publishers’ individual time zones.

Research Question 2

After looking at Youtube categories on a broader level, we hoped to take a deeper dive into videos in the political space.

While the events of the 2020 election are distant, it is still worth viewing general public consensus on specific presidents during their time in office. One area of influence that is particularly important to view is social media. For many netizens, social media sites are the primary way that they provide feedback on the success/failures of those in office. Looking to specific sites, one popular one is YouTube. This site features a like/dislike system, a key component used in this section of the study of YouTube data. The research question of interest, then, is: What is the consensus on YouTube content about Obama, Trump, and Biden, and what does this content generally contain? This research question suggests we should examine video titles as well as individual video dislike/like ratios.

The original data set was filtered to include only videos related to Obama, Trump, or Biden. A new variable was added to the data set holding each video’s dislike/like ratio. A log transformation was then applied to the dislike/like ratio so that the data better satisfy the normality assumption. Entries for which the transformed value is undefined (such as log(0), which arises when a video has no dislikes) were removed from the resulting data frame.
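The transformation and cleaning step might look like the following Python sketch. It assumes a natural logarithm (the base is not stated in the report) and treats any video with zero likes or zero dislikes as undefined; the function name and the sample counts are hypothetical.

```python
import math


def log_dislike_like_ratio(likes, dislikes):
    """Return log(dislikes / likes), or None when the value is undefined."""
    if likes <= 0 or dislikes <= 0:
        return None  # division by zero or log(0)
    return math.log(dislikes / likes)


# Hypothetical (likes, dislikes) pairs; these are not rows from the dataset.
videos = [(1000, 100), (500, 0), (0, 20), (50, 200)]
ratios = [log_dislike_like_ratio(likes, dislikes) for likes, dislikes in videos]
cleaned = [r for r in ratios if r is not None]
```

A negative value means the video has more likes than dislikes, and a positive value means the opposite, which is how the violin plot for this question is read.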

Here, we display a violin plot of the log(Dislike/Like ratio) for videos with Biden, Obama, and Trump in the title. As seen, the median for each plot is between -2 and -3, and the interquartile ranges of all three plots lie entirely below zero. This indicates that most videos that include each respective president’s name have a dislike/like ratio below 1 (i.e. more likes than dislikes). However, we see that both Obama and Trump have a wider distribution of ratios than Biden. We also note that a few Youtube videos with Obama or Trump in the title have a dislike/like ratio above 1 (i.e. more dislikes than likes). The wider distributions for Obama and Trump indicate that videos with these presidents' names in the title tend to have a more polarized reception from viewers.

To further explore the differences in these ratios between presidents, we perform a one-way ANOVA test for a difference in mean dislike/like ratio. As the normality assumption is violated for the raw ratios, we perform a log transformation on the data and remove invalid values. Despite issues of unequal variance across groups, the residuals of our ANOVA model are approximately normally distributed (as can be seen in the histogram and Normal QQ plot), so we continue with interpretation of the analysis. The test output and visualizations can be seen below.

##               Df Sum Sq Mean Sq F value Pr(>F)   
## name           2   13.2   6.579   4.826 0.0081 **
## Residuals   2226 3034.4   1.363                  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

As seen from the ANOVA output, the p-value (0.0081) is below the 0.01 significance level. This indicates that there is a significant difference in the mean log(Dislike/Like ratio) for videos depending on whether the title contains Biden, Trump, or Obama.
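For readers who want to see how the F value in such a table arises, the following Python sketch computes a one-way ANOVA F statistic from scratch. It is an illustration of the technique on made-up numbers, not a reproduction of the report's model; `one_way_anova` and `toy_groups` are hypothetical names.

```python
def one_way_anova(groups):
    """Return (df_between, df_within, F) for a list of numeric groups."""
    all_values = [x for group in groups for x in group]
    grand_mean = sum(all_values) / len(all_values)
    means = [sum(group) / len(group) for group in groups]

    # Variation of the group means around the grand mean.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of the observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)

    df_between = len(groups) - 1
    df_within = len(all_values) - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return df_between, df_within, f_stat


# Made-up log ratios for three groups; these are not values from the dataset.
toy_groups = [[-3.0, -2.5, -2.0], [-2.0, -1.5, -1.0], [-3.5, -3.0, -2.5]]
df_between, df_within, f_stat = one_way_anova(toy_groups)
```

The same structure appears in the output above: each Mean Sq is the Sum Sq divided by its Df, and the F value is the ratio of the two mean squares.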

In addition to understanding the feedback on president-related YouTube content, we also hoped to understand the tone/type of this content as well. We then look to create a visualization based on the landscape of content in video titles related to the three observed presidents. To filter the data, an algorithm was used to only retain entries that feature the president’s first/last name in the video title, tags, or description. The following word cloud was created to visualize the results.
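The report does not spell out the filtering algorithm, so the following Python sketch shows one plausible version under stated assumptions: matching is case-insensitive, only the three last names are used as keywords, and word frequencies are counted over the titles of the retained videos. The names `NAMES`, `mentions_president`, and `word_counts`, and the sample rows, are all hypothetical.

```python
import re
from collections import Counter

NAMES = ("obama", "trump", "biden")  # assumed keyword list (last names only)


def mentions_president(*fields):
    """True if any keyword appears as a whole word in any of the text fields."""
    text = " ".join(fields).lower()
    return any(re.search(rf"\b{name}\b", text) for name in NAMES)


def word_counts(titles):
    """Count lower-cased words across titles; a word cloud scales words by these counts."""
    words = re.findall(r"[a-z']+", " ".join(titles).lower())
    return Counter(words)


# Made-up rows for illustration; these are not rows from the dataset.
rows = [
    {"title": "Trump speech live on CNN", "tags": "news", "description": ""},
    {"title": "Best cat videos", "tags": "cats|funny", "description": ""},
    {"title": "The Daily Show", "tags": "comedy", "description": "Obama interview"},
]
kept = [r for r in rows if mentions_president(r["title"], r["tags"], r["description"])]
counts = word_counts(r["title"] for r in kept)
```

In practice, common stop words (and the names themselves) would likely be removed before drawing the cloud so that they do not dominate it.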

As can be seen from the results, the most frequent words (aside from “president” and first name) involve discussion clips of president-related topics on talk shows. Words like “daily,” “show,” “live,” “CNN,” etc. all are related to news/discussion TV channels. While this is not surprising, it is interesting to see how users on YouTube most commonly interact with president-related content. Some words to note are “lies,” “truth,” “racist,” and “fake,” which also hint at the tone of some content as well.

Research Question 3

For our last research question, we look specifically at videos in the music category (category_id = 10) to see what kinds of music are included in this set and what viewer trends we can extract from it. Our variables of interest are views, comment_count, and likes. We also create a new categorical variable called “VEVO” that classifies our videos as VEVO or non-VEVO videos (VEVO videos = “YES”, non-VEVO videos = “NO”). VEVO is an American multinational music video network that provides music videos on Youtube. Since VEVO is well known for its large viewership and for creating high-quality music videos through its established partnerships with major record companies, independent artists, and other premium content owners, we want to see how viewers respond to these videos differently from the rest of the non-VEVO videos.
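The rule used to assign the VEVO label is not stated, so the following Python sketch assumes the simplest one: a video is a VEVO video when its channel title contains "VEVO" (case-insensitive). The function name `vevo_flag` and the channel name `ExampleArtistVEVO` are hypothetical.

```python
def vevo_flag(channel_title):
    """Return "YES" if the channel title contains "VEVO" (assumed rule), else "NO"."""
    return "YES" if "vevo" in channel_title.lower() else "NO"


# "ExampleArtistVEVO" is a placeholder, not a channel taken from the dataset.
channels = ["ExampleArtistVEVO", "ibighit", "SMTOWN"]
flags = {name: vevo_flag(name) for name in channels}
```

A substring rule like this would miss any VEVO-affiliated channel whose title does not include the word, so it should be read as an approximation.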

The research question of interest, then, is: How do viewers react to trending music videos on Youtube in terms of their likes and comments? How do viewers react differently to VEVO and non-VEVO music videos on Youtube?

Before diving in further, we created a bar chart to explore the different types of music included in the dataset. To do so, we grouped the dataset by channel and selected the 20 channels with the most trending videos.
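The grouping and ranking step can be sketched in Python with a counter over channel titles. This is an illustration on made-up channel names, not the report's code; `top_channels` is a hypothetical helper.

```python
from collections import Counter


def top_channels(channel_titles, n=20):
    """Return the n channels with the most trending videos as (channel, count) pairs."""
    return Counter(channel_titles).most_common(n)


# One entry per trending video; channel names here are placeholders.
titles = ["a", "b", "a", "c", "a", "b"]
ranking = top_channels(titles, n=2)
```

The resulting pairs map directly onto a bar chart, with one bar per channel and the count as the bar height.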

The PTX official channel has the most trending videos, followed by Charlie Puth, ibighit, SMTOWN, the Billboard channel, and so on. Most of the channels belong to independent artists/musicians in the U.S., while some belong to music/radio agencies in the U.S. Interestingly, 3 of the top 20 channels are Kpop channels (ibighit, SMTOWN, jyp entertainment). Among the top 20 channels, 8 are VEVO channels.

To approach our original research question, we created scatter plots of comments vs. views and likes vs. views to examine the relationship between each pair of variables. We also distinguished VEVO and non-VEVO videos by color (VEVO videos = mint, non-VEVO videos = red). There is a positive linear relationship between the number of likes and views, as well as between the number of comments and views. For number of likes vs. views, the slope is steeper for non-VEVO videos than for VEVO videos, which means that viewers tend to leave more likes per view on non-VEVO videos than on VEVO videos. Similarly, for number of comments vs. views, the slope is steeper for non-VEVO videos than for VEVO videos, which means that viewers tend to leave more comments per view on non-VEVO videos than on VEVO videos. We can therefore conclude that viewers tend to respond better to non-VEVO videos than to VEVO videos in terms of likes and comments.

## 
## Call:
## lm(formula = likes ~ views, data = music_video_vevo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1387810   -43907   -28281    15252  1482248 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 4.621e+04  7.433e+03   6.217 9.93e-10 ***
## views       1.986e-02  4.289e-04  46.303  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 164800 on 558 degrees of freedom
## Multiple R-squared:  0.7935, Adjusted R-squared:  0.7931 
## F-statistic:  2144 on 1 and 558 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = comment_count ~ views, data = music_video_vevo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -133301   -2289   -1450    1167  172386 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 2.147e+03  7.512e+02   2.858  0.00442 ** 
## views       1.522e-03  4.335e-05  35.103  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 16660 on 558 degrees of freedom
## Multiple R-squared:  0.6883, Adjusted R-squared:  0.6877 
## F-statistic:  1232 on 1 and 558 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = likes ~ views, data = music_video_non_vevo)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -1206730   -23187   -14771      211  2053174 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 1.533e+04  4.367e+03   3.511 0.000459 ***
## views       2.882e-02  5.134e-04  56.138  < 2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 160400 on 1519 degrees of freedom
## Multiple R-squared:  0.6748, Adjusted R-squared:  0.6745 
## F-statistic:  3151 on 1 and 1519 DF,  p-value: < 2.2e-16

## 
## Call:
## lm(formula = comment_count ~ views, data = music_video_non_vevo)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -292628   -1785     556    1417  786481 
## 
## Coefficients:
##               Estimate Std. Error t value Pr(>|t|)    
## (Intercept) -8.986e+02  8.996e+02  -0.999    0.318    
## views        3.602e-03  1.058e-04  34.055   <2e-16 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 33040 on 1519 degrees of freedom
## Multiple R-squared:  0.4329, Adjusted R-squared:  0.4326 
## F-statistic:  1160 on 1 and 1519 DF,  p-value: < 2.2e-16

To confirm our previous analysis, we constructed 4 regression models: number of likes vs. views and number of comments vs. views, for each of the VEVO and non-VEVO datasets.

For VEVO videos: According to our first two regression models, each additional view is associated with about 0.01986 additional likes and about 0.001522 additional comments. Viewers are more likely to click like after watching than to leave a comment.

For non-VEVO videos: According to our third and fourth regression models, each additional view is associated with about 0.02882 additional likes and about 0.003602 additional comments. Viewers are more likely to click like after watching than to leave a comment.

From our regression analysis, we confirm our previous conclusion that music video viewers tend to respond better to non VEVO videos than to VEVO videos in terms of likes and comments. Moreover, across all music videos, viewers are more likely to click on likes than to leave a comment.

From our regression analysis, we also see that the R-squared value is higher for VEVO videos than for non-VEVO videos in both models. This suggests that the number of views accounts for viewer responses relatively well for VEVO videos, and that other factors may be involved in explaining viewer responses for non-VEVO videos.
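The two quantities interpreted above, the slope on views and R-squared, can be computed for a simple linear regression as in the following Python sketch. It uses the standard least-squares formulas on made-up data; `fit_line` and the sample values are hypothetical and do not reproduce the report's models.

```python
def fit_line(x, y):
    """Ordinary least squares for y = intercept + slope * x; returns (slope, intercept, r2)."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    sxx = sum((xi - mean_x) ** 2 for xi in x)
    sxy = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))

    slope = sxy / sxx
    intercept = mean_y - slope * mean_x

    # R-squared: share of the variation in y explained by the fitted line.
    ss_res = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    ss_tot = sum((yi - mean_y) ** 2 for yi in y)
    return slope, intercept, 1 - ss_res / ss_tot


# Made-up views and likes; these are not values from the dataset.
views = [100000, 200000, 300000, 400000]
likes = [2000, 4100, 5900, 8000]
slope, intercept, r2 = fit_line(views, likes)
```

Here the slope is the estimated number of additional likes per additional view, which is the quantity compared between the VEVO and non-VEVO models.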

Conclusion

In this project, we sought to understand uploading trends across Youtube categories and dive deeper into Youtube videos in the political and music space.

In Research Question #1, we wanted to explore uploading trends annually and daily. To look closely into annual trends, we used a density plot to look at when trending videos in the top 5 Youtube categories were published during the year. Notably, there was a strange dip in density from July to October, which we speculated could stem from gaps in data collection; within those months, comedy and people & blogs videos peaked around September, possibly reflecting the beginning of the academic year. Then, we explored how publishing times differ across the top 5 Youtube categories using a bar plot. We noticed different patterns across categories, but noted that uploads tended to peak around noon; we also warned that readers should account for publishers’ individual time zones.

In Research Question #2, we wanted to explore the reception of Youtube videos in the political space and the words associated with the three most recent U.S. presidents. First, we created a violin plot displaying the log(Dislike/Like ratio) for videos with Biden, Obama, and Trump in the title. Most videos had more likes than dislikes. Additionally, Obama and Trump had wider distributions and had a few videos with more dislikes than likes. This suggests that videos containing Obama or Trump in the title were more controversial and had a wider range of negative/positive reception from viewers. According to the ANOVA test we then performed, there was a significant difference in mean log(Dislike/Like ratio) across the three presidents. We then created a word cloud displaying the most commonly found words in titles of videos that mentioned any of the three presidents. The most common words were those relating to news/discussion TV channels, news validity, and other hot topics.

In Research Question #3, we wanted to explore the reception of Youtube videos in the music category and to see whether viewers’ responses depended on whether the channel was a VEVO channel. We first looked at the top 20 channels in the music category using a bar chart, and we saw that 8 of the 20 were VEVO channels. Then, we looked at the relationships between likes and views and between comments and views using scatterplots, distinguishing between VEVO and non-VEVO channels. All of the relationships in the scatterplots had a positive, linear trend, and we concluded that viewers tend to respond better to non-VEVO videos than to VEVO videos in terms of likes and comments.

Future Work

There is great opportunity to dive deeper into our research questions in the future.

In Research Question #1, one issue with the publication times was that all of them were recorded in UTC. We therefore cannot be completely sure that the trends we observed in the bar plots reflect true local publication time trends. We would alter our original research question to ask: Accounting for individual time zones, how do publication times vary over the course of a day for major video categories? This question is left as future work because it requires data that are not converted into UTC.

In Research Question #2, we explored feedback on video-based content within Youtube. However, it is also worthwhile to view president feedback/reviews on other popular discussion-based websites such as Twitter, Instagram, etc. How are Biden, Trump, and Obama perceived differently on alternate social media sites? This question is left as future work as it requires more data from other social media platforms.

In Research Question #3, we explored differences in feedback on music videos depending on whether they were posted on a VEVO channel. We noticed that among the top 20 channels, most belonged to independent artists/musicians, music/radio agencies, or Kpop labels. It would be interesting to gauge feedback depending on these categories: How does feedback depend on the category of the channel a music video is posted on? This question is left as future work because it requires categorizing all of the channels within the music category, which is not information in our dataset and would potentially require manual categorization.

36315: Analyzing USA/Canada Youtube Trending Videos