YouTube has become an integral part of everyday life, hosting videos for education, music, entertainment, and many other categories. One of its features is a list of trending videos, chosen based on views and popularity as well as how quickly both are growing.

Our dataset contains statistics on trending US YouTube videos from 2017/12/01 to 2018/05/31. It includes a total of 16 variables. The quantitative variables are ‘trending_date’, ‘publish_time’, ‘views’, ‘likes’, ‘dislikes’, and ‘comment_count’. The qualitative/categorical variables are ‘video_id’, ‘title’, ‘channel_title’, ‘category_id’ (factor), ‘tags’, ‘thumbnail_link’, ‘comments_disabled’, ‘ratings_disabled’, ‘video_error_or_removed’, and ‘description’.

The following are our main research questions about trending US YouTube videos: (1) What does the distribution of video views look like, and how does it differ depending on the time of day the videos were published? (2) How did the number of views change as time passed? What is the trend? (3) What is the relationship between the number of likes and the number of comments? (4) What is the relationship between the number of likes and the number of dislikes? (5) What are the three most frequently used words in the titles of these trending videos?

  1. First, we looked at the distribution of views. Because the raw counts were awkward to work with directly, we plotted the logged number of views instead.
library(ggplot2)
# Histogram of log-transformed views; the fill colour is set once, outside aes()
ggplot(youtube, aes(x = log(views))) +
  geom_histogram(fill = "blue") +
  labs(title = "Distribution of Logged Number of Views of Videos", x = "Views (log)") +
  theme(plot.title = element_text(hjust = 0.5))

The distribution of logged views is approximately normal and unimodal, with no visually apparent outliers. In other words, the logged view counts cluster around their mean.
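To make the effect of the log transformation concrete, the sketch below compares sample skewness before and after taking logs. It does not use the `youtube` data frame: `sim_views` is simulated, hypothetical data drawn from a log-normal distribution, and `skewness()` is a small helper written for this illustration, so the numbers it produces say nothing about the real dataset.

```r
# Simulated, hypothetical view counts standing in for youtube$views.
set.seed(1)
sim_views <- round(rlnorm(5000, meanlog = 13, sdlog = 1.5)) + 1

# Sample skewness: the mean of the cubed standardized values.
# It is close to 0 for a symmetric distribution.
skewness <- function(x) {
  z <- (x - mean(x)) / sd(x)
  mean(z^3)
}

raw_skew <- skewness(sim_views)        # strongly right-skewed
log_skew <- skewness(log(sim_views))   # roughly symmetric

print(c(raw = raw_skew, logged = log_skew))
```

Running the same helper on `youtube$views` and `log(youtube$views)` would give a numeric check on the visual impression from the histogram.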

Continuing our focus on views, we expected a relationship between the time of day a video was uploaded and the number of views it received. To explore this, we created a variable ‘time_of_day’ that marks each video as posted in the “Morning” (4AM - 12PM), “Afternoon” (12PM - 8PM), or “Night” (8PM - 4AM).

First, we looked at the distribution of logged views conditional on the time of day the video was uploaded.

library(dplyr)
# Extract the hour of publication (0-23) as a number
youtube$time_published <- as.numeric(format(as.POSIXct(youtube$publish_time, format = "%Y-%m-%d %H:%M:%S", tz = ""), format = "%H"))
# Bucket the hour into the three time-of-day categories
youtube <- mutate(youtube, time_of_day = ifelse(time_published >= 20 | time_published < 4, "Night",
                                         ifelse(time_published < 12, "Morning", "Afternoon")))
# Histogram of logged views, filled by time of day
ggplot(youtube, aes(x = log(views), fill = time_of_day)) +
  geom_histogram() +
  labs(x = "Views (log)", y = "Count", fill = "Time of Day Uploaded",
       title = "Distribution of Logged Video Views Given Time of Day Uploaded")

Similar to the overall distribution of logged views, all three time categories appear approximately normally distributed with similar modes.
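The bucketing rule described above is easy to get wrong at the boundaries (4AM, 12PM, 8PM), so here is a small self-contained check. `time_of_day_from_hour()` is a hypothetical helper that mirrors the nested `ifelse()` rule, and the timestamps are made up; the UTC time zone is an assumption for the example only.

```r
# Hypothetical helper: map an hour (0-23) to the report's three categories.
time_of_day_from_hour <- function(hour) {
  ifelse(hour >= 20 | hour < 4, "Night",
         ifelse(hour < 12, "Morning", "Afternoon"))
}

# Made-up timestamps chosen to sit on or next to each boundary.
stamps <- as.POSIXct(c("2018-01-05 03:59:00", "2018-01-05 04:00:00",
                       "2018-01-05 11:59:59", "2018-01-05 12:00:00",
                       "2018-01-05 19:30:00", "2018-01-05 20:00:00"),
                     format = "%Y-%m-%d %H:%M:%S", tz = "UTC")

# Extract the hour as a number, then classify it.
hours  <- as.numeric(format(stamps, "%H"))
labels <- time_of_day_from_hour(hours)

print(data.frame(hour = hours, time_of_day = labels))
```

Each lower bound is inclusive and each upper bound exclusive, so 04:00 counts as "Morning" and 20:00 as "Night".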

We also wanted to see how the videos were split across the time categories, so we created a pie chart.

# A single stacked bar in polar coordinates gives a pie chart
ggplot(data = youtube, aes(x = factor(1), fill = time_of_day)) +
  geom_bar(width = 1) +
  coord_polar(theta = "y") +
  labs(title = "Pie Chart of Videos by Time of Day", x = "", y = "Count of Videos", fill = "Time of Day Uploaded") +
  theme(axis.text.y = element_blank(), axis.ticks = element_blank())

The largest share of videos was uploaded in the afternoon, followed by night and then morning. This could serve as a hint to an aspiring YouTuber that publishing in the afternoon, rather than at night or in the morning, may help their videos get more attention from viewers.
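A pie chart makes the ordering visible but not the exact shares. As a minimal sketch, the proportions can be tabulated directly with base R; the vector `tod` below is made-up data standing in for `youtube$time_of_day`.

```r
# Made-up time-of-day labels, standing in for youtube$time_of_day.
tod <- c("Afternoon", "Night", "Afternoon", "Morning", "Afternoon", "Night")

# Count the videos in each category, then convert counts to proportions.
counts <- table(tod)
shares <- prop.table(counts)

# Percentages, largest first.
print(round(100 * sort(shares, decreasing = TRUE), 1))
```

Applying `prop.table(table(youtube$time_of_day))` to the real data would show whether the afternoon share is an outright majority or only the largest of the three.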

  2. Another thing that interested us was how the number of views, more precisely the mean number of views per day, changed over time. We converted the trending_date variable to a Date so that we could plot the daily mean views, together with a 7-day moving average, from our earliest to most recent dates.
library(ggseas)
# trending_date is stored as yy.dd.mm; parse it into a Date
youtube <- mutate(youtube, trending_date = as.Date(trending_date, format = "%y.%d.%m"))
# Mean views per trending day (trending_date is already a Date here)
views_per_day <- youtube %>%
  group_by(trending_date) %>%
  summarize(mean_views = mean(views))
# Daily mean (blue) with a right-aligned 7-day rolling mean (orange)
ggplot(data = views_per_day, aes(trending_date, mean_views)) + geom_line(color = "blue") +
  stat_rollapplyr(width = 7, align = "right", size = 2, alpha = .6, color = "orange") + labs(
  title = "Mean Views of Trending Videos by Day Trending",
  x = "Month",
  y = "Mean Views"
) + theme(plot.title = element_text(hjust = 0.5))

We found that the mean number of views increased only slightly until April, when it began to rise at a much faster rate than in the preceding months. In other words, videos trending in April and May had, on average, more views than those trending between December and March.
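For readers unfamiliar with rolling means, the sketch below shows what a right-aligned 7-day window computes, using only base R. The `daily` vector is made-up data, and `stats::filter()` is used here as a stand-in for illustration; it is not the function the plot itself calls.

```r
# Made-up daily mean views for 10 consecutive days.
daily <- c(10, 12, 11, 13, 15, 14, 16, 20, 22, 25)

# Right-aligned rolling mean: the value at day t averages days t-6 through t.
# sides = 1 makes the filter use only the current and earlier observations,
# so the first six positions have no complete window and come back as NA.
roll7 <- as.numeric(stats::filter(daily, rep(1 / 7, 7), sides = 1))

print(round(roll7, 2))
```

The window is defined over rows, not calendar days, so this interpretation assumes one row per consecutive day. If some trending dates were missing from the data, a 7-row window would span more than a week.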

  3. Next, we explored the relationship between the logged number of likes and the logged comment counts.
ggplot(youtube, aes(x = log(likes), y = log(comment_count))) + geom_point(alpha = 0.5, color = "blue") + labs(title = "Distribution of Logged Likes and Logged Comments for Trending Videos", x = "Likes (log)", y = "Comment Count (log)")

There is a very strong positive relationship between the number of likes and the number of comments, which makes sense in context: people are willing to share their thoughts on a video they enjoyed by leaving feedback and appreciation.
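The strength of a relationship like this can also be summarized numerically with a correlation on the log scale. The sketch below uses simulated, hypothetical data (`sim_likes`, `sim_comments`) rather than the `youtube` data frame, and it drops zero counts first because `log(0)` is `-Inf`.

```r
# Simulated, hypothetical likes and comment counts (not the real data).
set.seed(42)
sim_likes    <- round(rlnorm(2000, meanlog = 9, sdlog = 2))
sim_comments <- round(sim_likes * rlnorm(2000, meanlog = -2, sdlog = 0.7))

# Keep only rows where both counts are positive, since log(0) is -Inf
# and would make the correlation undefined.
keep <- sim_likes > 0 & sim_comments > 0

# Pearson correlation between the logged variables.
r_log <- cor(log(sim_likes[keep]), log(sim_comments[keep]))
print(round(r_log, 3))
```

On the real data, the same filter would matter for videos with comments disabled, which presumably report a comment count of zero.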

  4. In addition to the relationship between likes and comment counts, we wanted to see whether the number of words in a video’s title was related to either quantity.
# Count the words in each title by splitting on runs of non-word characters
youtube$words_in_title <- lengths(strsplit(as.character(youtube$title), "\\W+"))
# Likes vs. comment counts (both logged), coloured by title length
ggplot(data = youtube, aes(x = log(likes), y = log(comment_count))) + geom_point(aes(color = words_in_title), alpha = 1) + labs(
  title = "log(likes) and log(comment count) in Trending Youtube Videos by Title Length",
  x = "Likes (log)",
  y = "Comment Count (log)",
  color = "Number of Words in Title"
)

Although there are too many data points to draw a precise conclusion from the graph above, we noticed a darker cluster of points (fewer words in the title) near the top right, where both comment counts and likes are high. Roughly, videos with around 5 to 10 words in the title make up this area.
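One detail of the word count is worth knowing when reading the plot: splitting on `"\\W+"` produces an empty first element when a title starts with punctuation, so such titles are counted one word too high. The sketch below shows this on three made-up titles and compares it with an alternative that matches words directly; the alternative is an illustration, not the method used in the analysis above.

```r
# Made-up titles; the third starts with a non-word character.
titles <- c("Official Trailer", "My day in NYC - vlog #12", "(Live) Concert")

# Approach used in the analysis: split on runs of non-word characters.
# A leading "(" yields an empty string as the first piece.
split_count <- lengths(strsplit(titles, "\\W+"))

# Alternative: count runs of word characters instead of splitting.
match_count <- lengths(regmatches(titles, gregexpr("\\w+", titles)))

print(data.frame(title = titles, split_count, match_count))
```

The difference is at most one word per title, so it is unlikely to change the overall picture, but it can shift individual points between colour bands.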

  5. Finally, we explored which words appeared most often in the titles of these trending videos.
library(tm)
library(SnowballC)
library(wordcloud)
# Build a corpus from the titles and normalize the text
youtube.title <- Corpus(VectorSource(youtube$title))
youtube.title <- tm_map(youtube.title, content_transformer(tolower))
youtube.title <- tm_map(youtube.title, removePunctuation)
youtube.title <- tm_map(youtube.title, stripWhitespace)
# Term counts per title, with stop words removed and terms stemmed
dtm.youtube.title <- DocumentTermMatrix(youtube.title, control = list(stopwords = TRUE, stemming = TRUE))
title.youtube.words <- dtm.youtube.title$dimnames$Terms
title.youtube.freqs <- colSums(as.matrix(dtm.youtube.title))
# Word cloud of the 80 most frequent terms
wordcloud(words = title.youtube.words, freq = title.youtube.freqs, max.words = 80, random.order = FALSE, colors = "red")

We found that “official”, followed by “video” and “trailer”, were the most common words in the titles of these trending videos.
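A word cloud shows relative frequency by size, but the exact ranking is easier to read from a sorted table. The following is a minimal base R sketch on made-up titles. It skips the stop-word removal and stemming steps used above, so it is a simplified stand-in for the `tm` pipeline rather than a reproduction of it.

```r
# Made-up titles for illustration only.
titles <- c("Official Video", "Official Trailer", "Official Music Video",
            "Trailer 2", "Official Video Teaser")

# Lowercase, split on anything that is not a letter, and drop empty pieces.
words <- unlist(strsplit(tolower(titles), "[^a-z]+"))
words <- words[nzchar(words)]

# Frequency table sorted from most to least common.
freqs <- sort(table(words), decreasing = TRUE)
top3  <- names(freqs)[1:3]

print(head(freqs, 3))
```

With the real pipeline, the equivalent step would be `head(sort(title.youtube.freqs, decreasing = TRUE), 3)`, which returns the top three terms and their counts.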

Exploring this YouTube dataset, we arrived at the following takeaways:

  1. The distribution of the logged number of views is approximately normal and unimodal, with no visible outliers. The distributions of the logged number of views for videos uploaded in the morning, in the afternoon, and at night are all approximately normal as well.
  2. Over the period studied, the average number of views of trending videos tended to increase, with views rising at a faster rate starting in April.
  3. There is a strong positive relationship between the number of likes and the comment counts.
  4. There is a strong positive relationship between the number of likes and the number of dislikes.
  5. The three most frequently used words in the titles of these trending US YouTube videos are “official”, followed by “video” and “trailer.”

For future work, we would like to examine which categories of videos were more popular than others (based on the number of views), and to study the trends of those categories further: whether their views increased or decreased over time and, if so, starting when. We could then also study how category relates to the number of likes, dislikes, and most frequently used title words, similar to what we have studied so far.