YouTube has become an integral part of everyday life, hosting video sharing and viewing across education, music, entertainment, and many other categories. YouTube has a feature that displays the top trending videos, based on views and popularity as well as how quickly both are increasing.
Our dataset contains statistics on trending US YouTube videos from 2017/12/01 to 2018/05/31. It includes a total of 16 variables. The quantitative variables are ‘trending_date’, ‘publish_time’, ‘views’, ‘likes’, ‘dislikes’, and ‘comment_count’. The qualitative/categorical variables are ‘video_id’, ‘title’, ‘channel_title’, ‘category_id’ (factor), ‘tags’, ‘thumbnail_link’, ‘comments_disabled’, ‘ratings_disabled’, ‘video_error_or_removed’, and ‘description’.
The following are our main research questions about trending US YouTube videos: (1) What does the distribution of video views look like, and how does it differ depending on the time of day the videos were published? (2) How did the number of views change over time? What is the trend? (3) What is the relationship between the number of likes and the number of comments? (4) What is the relationship between the number of likes and the number of dislikes? (5) What are the three most frequently used words among these trending videos?
library(ggplot2)
# Histogram of log-transformed view counts; the fill is set once, as a fixed color
ggplot(youtube, aes(x = log(views))) + geom_histogram(fill = "blue") + labs(title = "Distribution of Logged Number of Views of Videos", x = "Views (log)") + theme(plot.title = element_text(hjust = 0.5))
We can see that the distribution of logged views is approximately normal and unimodal, with no visible outliers. This suggests that the logged view counts of trending videos cluster around their mean.
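As a rough numeric companion to the visual check above, a sample skewness close to zero is consistent with a roughly symmetric distribution. The helper below is a minimal sketch in base R: `sample_skewness` is a hypothetical name that is not part of our original analysis, and the toy vectors merely stand in for `log(youtube$views)`.

```r
# Hypothetical helper (not from the original analysis): moment-based
# sample skewness. Values near 0 suggest a roughly symmetric distribution.
sample_skewness <- function(x) {
  x <- x[is.finite(x)]                  # drop NA and -Inf (e.g. log(0))
  centered <- x - mean(x)
  mean(centered^3) / mean(centered^2)^1.5
}

# Toy data standing in for log(youtube$views)
symmetric_values <- c(1, 2, 3, 4, 5)
right_skewed_values <- c(1, 1, 1, 2, 10)

sample_skewness(symmetric_values)       # zero for perfectly symmetric data
sample_skewness(right_skewed_values)    # positive for a long right tail
```

On the real data this would be called as `sample_skewness(log(youtube$views))`; it supplements the histogram rather than replacing it.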
Continuing our focus on views, we expected a relationship between the time of day a video was uploaded and the number of views the trending video received. To examine this, we created a variable ‘time_of_day’, which marks each video as posted in the “Morning” (4AM - 12PM), “Afternoon” (12PM - 8PM), or “Night” (8PM - 4AM).
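The binning rule described above can be sketched as a standalone function. `classify_time_of_day` is a hypothetical helper name introduced only for illustration; it assumes the hour is given on a 24-hour clock (0 to 23), either as a number or as a string such as "04".

```r
# Hypothetical helper mirroring the bins described above:
# Morning = 04:00-11:59, Afternoon = 12:00-19:59, Night = 20:00-03:59.
classify_time_of_day <- function(hour) {
  hour <- as.numeric(hour)
  ifelse(hour >= 4 & hour < 12, "Morning",
         ifelse(hour >= 12 & hour < 20, "Afternoon", "Night"))
}

# Boundary hours fall into the later bin: 4 is Morning, 12 is Afternoon, 20 is Night
classify_time_of_day(c(0, 3, 4, 11, 12, 19, 20, 23))
```

Missing hours stay `NA` rather than being assigned to a bin, which keeps unparsed timestamps visible.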
First, we examined the distribution of logged views conditional on the time of day the video was uploaded.
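library(dplyr)
# Extract the hour of publication ("00" to "23") from the timestamp
youtube$time_published <- format(as.POSIXct(youtube$publish_time, format = "%Y-%m-%d %H:%M:%S", tz = ""), format = "%H")
# Bin the hour into Night (20-4), Morning (4-12), and Afternoon (12-20)
youtube <- mutate(youtube, time_of_day = ifelse(as.numeric(time_published) >= 20 | as.numeric(time_published) < 4, "Night", ifelse(as.numeric(time_published) >= 4 & as.numeric(time_published) < 12, "Morning", "Afternoon")))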
ggplot(youtube, aes(x = log(views), fill = time_of_day)) +
geom_histogram() +
labs(x = "Views (log)", y = "Count", fill = "Time of Day Uploaded",
title = "Distribution of Logged Video Views by Time of Day Uploaded")
Similar to the overall distribution of logged views, the distributions for all three time categories appear approximately normal, with similar modes.
We also wanted to see how the trending videos were split across the time categories, so we created a pie chart.
ggplot(data = youtube, aes(x = factor(1), fill = time_of_day)) +
geom_bar(width = 1) +
coord_polar(theta = "y") +
labs(title = "Pie Chart of Videos by Time of Day", x = "", y = "Count of Videos", fill = "Time of Day Uploaded") +
theme(axis.text.y = element_blank(), axis.ticks = element_blank())
The largest share of trending videos was uploaded in the afternoon, followed by night and then morning. This could hint to an aspiring YouTuber that their videos may be more likely to get attention from viewers if published in the afternoon rather than at night or in the morning, although the chart shows only counts of trending videos, not how many videos were uploaded in each period overall.
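Because slices of a pie chart are hard to compare by eye, the exact shares can also be tabulated. The snippet below is a minimal sketch in base R; the short `time_of_day` vector is made-up toy data standing in for `youtube$time_of_day`, not values from our dataset.

```r
# Toy vector standing in for youtube$time_of_day (made-up values)
time_of_day <- c("Afternoon", "Afternoon", "Night", "Morning", "Afternoon", "Night")

# Count the videos in each category, then convert counts to proportions
counts <- table(time_of_day)
shares <- prop.table(counts)

# Largest share first
sort(round(shares, 2), decreasing = TRUE)
```

On the real data, the same two calls applied to `youtube$time_of_day` would give the share behind each slice of the pie chart.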
# Parse trending_date into Date objects (two-digit year, then day, then month)
youtube <- mutate(youtube, trending_date = as.Date(trending_date, format = "%y.%d.%m"))
# Average the view counts of all videos trending on each day
views_per_day <- youtube %>%
group_by(trending_date) %>%
summarize(mean_views = mean(views))
library(ggseas)
ggplot(data = views_per_day, aes(trending_date, mean_views)) + geom_line(color = "blue") +
stat_rollapplyr(width = 7, align = "right", size = 2, alpha=.6, color = "orange") + labs(
title = "Mean Views of Trending Videos by Day Trending",
x = "Month",
y = "Mean Views"
) + theme(plot.title = element_text(hjust = 0.5))
We found that the mean number of views increased only slightly until April 2018, when it began to rise at a much faster rate than in the preceding months. In other words, videos trending in the last two months of the period had more views on average than those trending earlier.
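The orange line in the plot above is a 7-day, right-aligned rolling summary of the daily means. The sketch below shows what such a smoother computes, assuming the summary function is the mean (which we believe is the default for `stat_rollapplyr`). `rolling_mean_right` is a hypothetical helper written in base R and the input values are toy data.

```r
# Hypothetical helper: right-aligned rolling mean. Each output value averages
# the current observation and the (width - 1) observations before it, so the
# first (width - 1) outputs are NA.
rolling_mean_right <- function(x, width = 7) {
  as.numeric(stats::filter(x, rep(1 / width, width), sides = 1))
}

# Toy daily means (made-up values)
daily_means <- c(10, 20, 30, 40, 50, 60, 70, 80)
rolling_mean_right(daily_means, width = 7)
```

Because the window looks only backwards, the smoothed line lags sudden changes such as the rise in April by a few days.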
ggplot(youtube, aes(x = log(likes), y = log(comment_count))) + geom_point(alpha = 0.5, color = "blue") + labs(title = "Distribution of Logged Likes and Logged Comments for Trending Videos", x = "Likes (log)", y = "Comment Count (log)")
We can clearly see a very strong positive relationship between the logged number of likes and the logged number of comments. This makes sense in context, as people are likely to share their thoughts on a video they enjoyed by leaving feedback and appreciation.
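The strength of this relationship can be quantified with a correlation coefficient on the log scale. The snippet below is a minimal sketch: `log_scale_correlation` is a hypothetical helper, the vectors are toy data, and dropping videos with zero likes or zero comments is our assumption for handling `log(0)`.

```r
# Hypothetical helper: Pearson correlation between log(x) and log(y),
# keeping only pairs where both values are positive (log(0) is -Inf).
log_scale_correlation <- function(x, y) {
  keep <- !is.na(x) & !is.na(y) & x > 0 & y > 0
  cor(log(x[keep]), log(y[keep]))
}

# Toy data (made-up values); the fourth pair is dropped because likes is 0
likes <- c(10, 100, 1000, 0, 10000)
comment_count <- c(1, 10, 100, 5, 1000)
log_scale_correlation(likes, comment_count)
```

On the real data this would be called as `log_scale_correlation(youtube$likes, youtube$comment_count)`; a value close to 1 would support the visual impression from the scatterplot.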
# Count the words in each title (as.character guards against factor input)
youtube$words_in_title <- lengths(strsplit(as.character(youtube$title), "\\W+"))
# Logged likes against logged comments, colored by the number of words in the title
ggplot(data = youtube, aes(x = log(likes), y = log(comment_count))) + geom_point(aes(color = words_in_title), alpha = 1) + labs(
title = "log(likes) and log(comment count) in Trending YouTube Videos by Title Length",
x = "Likes (log)",
y = "Comment Count (log)",
color = "Number of Words in Title"
)
Although there are too many overlapping data points to draw a precise conclusion from the graph above, we noticed a darker cluster of points (fewer words in the title) near the top right, where both the comment count and the number of likes are high. By eye, the videos in this region have roughly 5 to 10 words in their titles.
library(tm)
library(SnowballC)
library(wordcloud)
youtube.title = Corpus(VectorSource(youtube$title))
youtube.title = tm_map(youtube.title, content_transformer(tolower))
youtube.title = tm_map(youtube.title, removePunctuation)
youtube.title = tm_map(youtube.title, stripWhitespace)
dtm.youtube.title = DocumentTermMatrix(youtube.title, control = list(stopwords = TRUE, stemming = TRUE))
title.youtube.words = dtm.youtube.title$dimnames$Terms
title.youtube.freqs = colSums(as.matrix(dtm.youtube.title))
wordcloud(words = title.youtube.words, freq = title.youtube.freqs, max.words = 80, random.order = FALSE, colors = "red")
We found that “official”, followed by “video” and “trailer”, were the most common words in the titles of trending videos.
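A word cloud shows relative sizes, so the exact ranking is easier to read from the sorted frequency vector. The snippet below is a minimal sketch: `top_terms` is a hypothetical helper and `toy_freqs` holds made-up placeholder counts, not frequencies from our data.

```r
# Hypothetical helper: return the n most frequent terms from a named
# vector of term frequencies.
top_terms <- function(freqs, n = 3) {
  head(sort(freqs, decreasing = TRUE), n)
}

# Placeholder terms and counts (made up for illustration only)
toy_freqs <- c(alpha = 12, beta = 40, gamma = 7, delta = 25, epsilon = 3)
top_terms(toy_freqs)
```

In the analysis above, the same call on a vector like `title.youtube.freqs` would list the three most frequent (stemmed) title terms along with their counts.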
Exploring this YouTube dataset, we arrived at the following takeaways:
For future work, we are interested in examining which categories of videos were more popular than others (based on the number of views), and in further studying the trends of such videos: whether their views increased or decreased over time and, if so, starting when. We could then also study how these categories relate to the number of likes, the number of dislikes, and the most frequently used words, similar to what we have studied so far.