Introduction

Mashable is an online news website and entertainment platform that publishes numerous news articles on a variety of topics from phone reviews to celebrity drama. The dataset we are working with, provided by the UC Irvine Machine Learning Repository, collected information on 39,797 Mashable news articles from 2013 to 2015. While the dataset does not include article content, it does include metadata such as token counts, length averages, counts of web page elements, publishing date, data channel (category), polarities, and number of shares. Each row in the dataset is one article taken from Mashable and the columns are the aforementioned metadata variables.

News publishers are generally interested in the number of shares an article receives, which is indicative of the article’s popularity. Mashable has been a prominent source of tech and entertainment news (among other categories) and blogging since its inception in 2005. Publishers such as Mashable may be interested in the factors that make an article popular, which is what our questions aim to answer.

Question 1

Our first question of interest is: how do vocabulary complexity and length influence the popularity of an article? In the first graph, we aimed to quantify vocabulary complexity using the rate of unique words and the average word length, since a higher rate of unique words and a longer average word length could both indicate more complex vocabulary. The number of shares an article gets is a natural measure of its popularity. Thus, the following graph attempts to answer the related but more specific question: among articles with fewer than 10,000 shares, how does the number of shares an article gets depend on the average word length and the rate of unique words?

The above graph plots the rate of unique words on the x-axis and the average word length on the y-axis, with points colored by the number of shares each article received. Note that this graph only plots news articles with fewer than 10,000 shares, to focus on non-outliers. Points that are white received little to no shares, whereas points with a darker hue of red received many more. If there were a relationship between these two variables and the number of shares, we would expect to see distinct clusters that are either mostly white or mostly red. Instead, there is one main cluster in the center of the graph with a seemingly random mix of whitish and reddish points. Since no distinguishable pattern or clustering is visible, the graph suggests that the number of shares cannot be predicted from the average word length and the rate of unique words alone.

However, let’s formally test whether there actually is a relationship between these two variables and the number of shares. We will fit a multiple linear model with the rate of unique words and the average word length as predictors and the log of the number of shares as the response. We take the log because the number of shares is highly right-skewed. The model has the following form: \(\log(shares) = \beta_0 + \beta_1 \cdot average\_token\_length + \beta_2 \cdot n\_unique\_tokens + \epsilon\)

Note that n_unique_tokens is actually encoded as a rate, not a count, in the original dataset. After fitting the model, we checked a plot of the residuals vs. fitted values, as seen in the following plots:
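The model above was presumably fit with standard statistical software; as a rough, hypothetical sketch of the same model form, the following Python snippet fits \(\log(shares)\) on the two predictors by ordinary least squares. All function names and generated values here are illustrative synthetic data, not the real Mashable dataset; only the column names `average_token_length`, `n_unique_tokens`, and `shares` come from the dataset.

```python
import numpy as np

def fit_log_shares_model(avg_token_length, n_unique_tokens, shares):
    """OLS fit of log(shares) ~ intercept + average_token_length + n_unique_tokens."""
    X = np.column_stack([np.ones(len(shares)), avg_token_length, n_unique_tokens])
    y = np.log(shares)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    return beta  # [intercept, slope for word length, slope for unique-token rate]

# Synthetic illustration (NOT the Mashable data): generate data where longer
# average words slightly depress log-shares, then recover the coefficients.
rng = np.random.default_rng(42)
n = 5000
atl = rng.uniform(4.0, 5.5, n)      # average token length
uniq = rng.uniform(0.3, 0.9, n)     # rate of unique tokens
log_shares = 7.7 - 0.05 * atl + rng.normal(0.0, 0.2, n)
beta = fit_log_shares_model(atl, uniq, np.exp(log_shares))
print(np.round(beta, 3))
```

On this synthetic data the recovered slope for average token length should land near the -0.05 used to generate it, mirroring the kind of estimate the report finds on the real data.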

Looking at the residuals vs. fitted graph, we can examine the assumptions of multiple linear regression. We do see a small oddity: some groups of fitted values are far apart with no fitted values in between, but this should not interfere with our assumptions. First, we do not see any noticeable non-linearities, and the residuals appear to be centered around zero. Furthermore, the vertical spread appears to be mostly constant across the fitted values, so we can reasonably assume homoscedasticity. Finally, there are no patterns above or below the mean of zero, which suggests there is no autocorrelation between the residuals. We can also examine the normal Q-Q plot of the residuals, where the residuals appear to have much heavier tails than a normal distribution. However, since we have far more than 30 observations, the central limit theorem lets us claim that the sampling distributions of the coefficient estimates are approximately normal nonetheless. Thus, our assumptions hold and we can reasonably trust the t-tests seen in the summary output:

Observations          39644
Dependent variable    log(shares)
Type                  OLS linear regression

F(2, 39641)           41.48
p                     0.00
Adj. R²               0.00

                        Est.   S.E.   t val.      p
(Intercept)             7.70   0.03   301.13   0.00
average_token_length   -0.05   0.01    -9.05   0.00
n_unique_tokens         0.00   0.00     1.23   0.22

Standard errors: OLS

The t-tests seen above test the coefficients of the predictor terms in our regression model. The null hypothesis is that a coefficient (beta) equals zero; the alternative hypothesis is that it does not. We assume a significance level of 0.05. We found that the average token length was significantly associated with the number of shares (t(39641) = -9.055, p < 2e-16). Our model estimates that for every one-character increase in the average word (token) length, the log of the number of shares decreases by 0.050079 on average, holding all else constant. To translate this back to the number of shares rather than its log, we exponentiate: exp(-0.050079) = 0.9511543, which means the number of shares decreases by about 5% on average, holding all else constant.
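The back-transformation step above is just arithmetic, and can be checked directly (the coefficient -0.050079 is taken from the summary output):

```python
import math

# Back-transforming the log-scale coefficient into a multiplicative effect on shares.
b1 = -0.050079                       # estimated coefficient from the summary output
multiplier = math.exp(b1)            # expected shares multiplier per +1 character
pct_change = (multiplier - 1) * 100  # expressed as a percentage
print(round(multiplier, 7), round(pct_change, 1))  # → 0.9511543 -4.9
```

This confirms the roughly 5% expected decrease in shares per one-character increase in average word length.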

On the other hand, we did not find that the rate of unique tokens in the article was significantly associated with the number of shares (t(39641) = 1.23, p = 0.219).

Thus, while our initial scatterplot did not show any obvious clusters or patterns, the t-tests on our regression coefficients revealed that average token/word length does in fact carry some information about the number of shares: an increase in average word length is associated with a decrease in shares. We may therefore suggest that article writers keep the average word length low to improve an article’s chances of being shared.

On the other hand, one could argue that the decrease in shares is not especially sizable, since an article loses only around 5% of its expected shares for each one-character increase in average word length, and our adjusted R-squared is only 0.002038, meaning that a great deal of variation in the response cannot be explained by these two variables alone.

Continuing on with our analysis, we wanted to examine if any other variables related to vocabulary complexity shared a relationship with the number of shares. Many readers tend to focus mainly on article titles, and determine whether or not to read the remainder of the article depending on their interest in the title. Thus, we will now focus on certain variables related to an article’s title. More specifically, let’s look at how the number of words in the title relates to the number of shares.

The above graph plots the number of words in the title on the x-axis and the number of shares on the y-axis. It suggests that articles with around 10 to 11 words in the title tend to receive the most shares, so it may be optimal for writers to aim for 10 to 11 words when crafting titles. More words in a title may indicate a more complex topic, which can better capture readers’ interest; however, beyond some point the title and topic may become so complex that most readers move on to other articles.
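The aggregation behind such a graph can be sketched as a group-by of shares on title word count. The numbers below are made-up toy values, not the real Mashable data, and the variable names are our own:

```python
import numpy as np

# Hypothetical toy data: shares grouped by the number of words in the title.
title_len = np.array([8, 10, 10, 11, 12, 8, 11, 10])
shares    = np.array([900, 2500, 3100, 2800, 1200, 1100, 2600, 2900])

# Mean shares for each distinct title length.
mean_shares = {int(k): float(shares[title_len == k].mean())
               for k in np.unique(title_len)}
best_length = max(mean_shares, key=mean_shares.get)
print(best_length, round(mean_shares[best_length]))  # → 10 2833
```

On the real data, the same aggregation would reveal the 10-11 word peak described above.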

A limitation of this graph is that we did not run a formal statistical analysis; this could be done in the future to confirm our suspicions.

Question 2

Another question of interest is: how popular is Mashable overall as a news website over time? Do any metadata variables mirror this trend?

Sometimes, investors may want to invest in or even purchase news websites, which has in fact happened with Mashable already. In order to determine whether or not to actually invest, we need to gauge how successful and popular Mashable is. We can do this by graphing the number of shares that Mashable gets per day as a time series:

The above graph shows the total number of shares across all articles published on a given day, with the x-axis measuring the number of days before the dataset was created. In the trend panel, there is a sizable dip downwards in the number of shares, but this is expected: more recent articles have had less time to accumulate shares. The first panel shows the raw data, which reveals that Mashable has had several outlier articles that were massively popular, as indicated by the very tall peaks; however, it has not had an extremely successful article in the past 200 days. The seasonal panel shows peaks and troughs that seem to be decreasing in magnitude, with an apparent cutoff at the 200-day mark, so something appears to have changed that keeps the series closer to its 30-day moving average. Overall, Mashable’s popularity does not appear to be changing much: the total shares for articles published on the same day seem to hover around 200,000 without any sizable movement upwards or downwards. For someone looking for a stable investment, we might recommend Mashable.
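The 30-day moving average used in these decomposition plots can be sketched as a simple trailing mean. The daily totals below are hypothetical placeholders; the real series would sum `shares` over all articles published on each day:

```python
import numpy as np

def moving_average(series, window=30):
    """Trailing moving average, matching the 30-day smoothing used in the plot."""
    kernel = np.ones(window) / window
    return np.convolve(np.asarray(series, dtype=float), kernel, mode="valid")

# Hypothetical daily share totals (40 days of slowly rising values).
daily_totals = 200_000.0 + np.arange(40.0)
smooth = moving_average(daily_totals, window=30)
print(len(smooth), round(smooth[0], 1))  # → 11 200014.5
```

Each smoothed point is the mean of the preceding 30 days, which is why short-lived spikes from single viral articles are damped in the trend panel.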

Furthermore, let’s examine if certain variables follow the same or opposing trends and patterns, which may give us more information about why Mashable’s overall shares are moving the way they are.

We will specifically look at the average word length variable, and plot the mean of the average word length across all articles published on a certain day.

As we can see in the above graph, around the 200-day mark the mean of the average word length of articles published on a given day begins decreasing, from 4.6 down to 4.2. This tells us that word lengths in Mashable articles are getting shorter, perhaps signifying a decrease in vocabulary complexity. Furthermore, the seasonal and irregular decompositions reveal progressively larger peaks and troughs, meaning the average word length is straying further from its 30-day average. This may indicate a wider variety of topics, or perhaps different writers with different writing styles, producing larger differences in average word length. These peaks and troughs also grow around the 200-day mark, mirroring the shrinking peaks and troughs in the seasonality of the total number of shares. So while the trend for the mean of the average word length does not exactly mirror the trend for total shares per day, some features of the two time series do match up, which might eventually help predict total shares across all Mashable articles. Of course, we are limited in the conclusions we can draw: we cannot claim that the mean of the average word length causes any particular trend in the number of shares. Future work could examine how other variables’ trends relate to the trend in total shares.

Question 3

Our third question of interest is whether these factors remain constant or vary between types of content (e.g., the Tech vs. Entertainment data channels). We used a time series graph to visualize and compare temporal changes in article title length across different channels.

This plot displays a moving average of token counts (with a width of 30 days) in article titles, faceted by channel, revealing both seasonality and trend patterns unique to each channel. While there is low seasonality across channels, indicating a consistent approach to title creation throughout the time period of interest, there is an observable overall increasing trend in title length. This suggests that the site’s editorial strategy might be evolving, with a tendency towards more descriptive or comprehensive titles over time. As sites pursue better search engine optimization (SEO), they may be driven to include more distinct tokens to maximize the chance of an article appearing in search results for related queries; this is a common practice among publishers aiming to capture a wider audience via search engines. Channels such as ‘world’, ‘entertainment’, ‘tech’, ‘other’, and ‘business’ exhibit more variability in title length, while channels like ‘social media’ and ‘lifestyle’ show less variability, which could suggest a more consistent content type or a stable titling convention within these genres.

The use of a 30-day moving average is particularly useful in this analysis because it smooths out short-term fluctuations and reveals longer-term trends and patterns. This approach minimizes biases that could occur due to variation within the week, such as weekday-weekend cycles, and allows for a clearer view of trends in editorial practices across content channels. While we found an increase, no channel’s moving average trends higher than 15 words, which makes sense given that the sweet spot for shares appears to be 10-11 words in the title. Thus, while many would argue that it is hard to give a formula for online popularity, having more words in a title seems desirable, but only up to a point.

Our analysis through box plots reveals that titles tend to exhibit more extreme polarity compared to the overall content. Furthermore, articles with titles that possess a slightly above-neutral polarity tend to garner more shares, suggesting that a moderate level of emotional engagement in titles may correlate with higher social sharing.

From the box plots, we also observe that the pattern of title polarity being more extreme than the global sentiment is consistent across different channels. While the median polarity does not vary much between article content and article titles, noticeably more titles fall outside the interquartile range. Additionally, the global sentiment polarity of articles is generally slightly higher, indicating a tendency towards more positive content. The similarity of polarity between channels perhaps provides evidence for the consistency of Mashable’s editing, and the consistent lean towards positive polarity makes sense given that slightly above-neutral articles tend to get the most shares.

Question 4

The fourth question that we wanted to answer from this dataset is: how does the sentiment associated with an article relate to its popularity, and do other factors like the type of channel affect the overall sentiment of the article? Sentiment polarity means the overall sentiment towards the subject: a high polarity near 1 is quite positive and a low polarity near -1 is quite negative. The motivation behind this question is the idea that articles with a high degree of sentiment polarity would be more popular and have more shares than articles with more neutral sentiment polarity. The second part of the question is motivated by the idea that different channels tend to report different things, so a social media or lifestyle channel may tend to have more positive sentiment than a channel like world news, which may report more polarizing stories. The first graph that we observe is the faceted dot plot of shares vs. the two different types of sentiment. One thing to note about this graph is that we filtered out articles with more than 50,000 shares, since they removed a lot of detail from the graph due to scaling. Overall, we found that shares were particularly high for articles with a slightly positive or neutral polarity for both title and global sentiment. However, global sentiment polarity seemed to be limited to a range of (-0.5, 0.75), while title sentiment polarity spanned the full (-1, 1). As a result, most of the shares for global sentiment polarity were near the center, around 0.1. For title sentiment polarity, there were spikes at the extremes such as -1, 1, 0.5, and -0.5, with most of the shares in the slightly positive zone around 0.1. Another interesting aspect of these two plots is that no articles had a global sentiment polarity outside the range of -0.5 to 0.75.
This is likely due to the fact that it’s incredibly difficult to write an entire article that is overly positive or overly negative without using some degree of neutral or opposite sentiment words. However, the title sentiments can be more extreme given that titles tend to summarize the main ideas that the articles are trying to convey or are purposefully made more extreme to drive more traffic to the website. In terms of what this graph means in the context of the question, we found that articles seem to have the most shares when they have a slightly positive sentiment polarity for both the global and the title sentiments. However, we did observe that the title sentiment polarity seemed to have more shares in the extremes compared to global sentiment polarity where the articles with the most shares seemed to be in the range of 0 and 0.5.

The second graph that we are looking at is a histogram of combined sentiment polarity, combining both global and title sentiment. From this graph, we find that the vast majority of posted articles have a sentiment polarity of around 0 to 0.25, with the three largest buckets falling between 0 and 0.3. Similar to the previous section, this suggests that the overall sentiment of most articles is neutral to slightly positive.
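The report does not specify how global and title polarity were combined; one plausible combination, sketched below on made-up toy values, is the simple mean of the two, binned into 0.25-wide histogram buckets over [-1, 1]:

```python
import numpy as np

# ASSUMPTION: "combined" polarity is the mean of global and title polarity.
# All values below are illustrative, not the real Mashable data.
global_pol = np.array([0.10, 0.15, -0.05, 0.30, 0.05, 0.20])
title_pol  = np.array([0.00, 0.50, -0.20, 0.10, 0.00, 0.40])
combined = (global_pol + title_pol) / 2

# Bin into 0.25-wide buckets over [-1, 1], as a histogram would.
counts, edges = np.histogram(combined, bins=np.arange(-1.0, 1.01, 0.25))
print(counts.tolist())  # → [0, 0, 0, 1, 3, 2, 0, 0]
```

Even on these toy values, the mass concentrates in the neutral-to-slightly-positive buckets, matching the shape described above.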

The final graph that we can observe for this question is the box plot graph from the previous section. We were interested in this graph because in our motivation, we mentioned that there may have been some degree of relationship between channels and their sentiment polarity which may also in turn give some insight as to why different articles had more shares. However, from this boxplot, we found that across almost all of the channel types, the sentiment polarity had a median of about 0 and a range of around 0.1 to 0.3. Across all the channels, we found that the title polarity seemed to have a lot more outliers than global sentiment polarity which supports the idea from the first graph that titles seem to have more extreme language. In terms of the association between channels and sentiment polarity, the similarity between all the different types of channels in terms of the range of the sentiment polarity suggests that there isn’t a meaningful relationship between sentiment polarity and channel.

Overall, across the three graphics we observed, we found that the overwhelming majority of highly shared articles had a slightly positive sentiment polarity. We also found that global sentiment polarity has a much narrower range than title sentiment polarity, with title sentiment polarity exhibiting significantly more outliers. Finally, from the graphics we made, there did not appear to be any significant relationship between sentiment polarity and the channel of the article.

One possible explanation for these results is that the narrow range of global sentiment stems from the difficulty of sustaining an extremely negative or positive sentiment throughout an article without some degree of neutral language; most articles therefore settle into a neutral to slightly positive tone. Many articles also attempt to present themselves as factual, especially in channels like world, business, and tech news, so they likely maintain a more neutral tone to seem impartial and trustworthy. In terms of shares and global sentiment polarity, the only relationship visible in these graphs is that the most-shared articles have relatively neutral to positive global sentiment. This pattern also seems to hold for title sentiment, though with more leeway: we suspect that regardless of what the title says, the actual content of the article determines whether the reader shares it.

Conclusion

In conclusion, our project revealed insights into the elements that contribute to Mashable’s online popularity. Our statistical analysis showed that, under our specific linear model, articles with shorter average word lengths tend to get more shares. Moreover, we observed a preference for brevity in title length, with titles of around 10-11 words (though title lengths show an increasing trend) being the most popular. This suggests that while succinctness is appreciated, there is a growing inclination towards slightly more descriptive titles, which we believe may be driven by editing for SEO. We also found that global sentiment polarity has a narrower range than title sentiment polarity, with title sentiment polarity exhibiting significantly more outliers; an explanation may be that titles are crafted with more emotion in order to capture reader attention. These trends were highly consistent across data channels, pointing to an optimization of content for broader appeal and shareability regardless of the specific channel, although genres like social media and lifestyle may have a more consistent content type or a stable titling convention. Interestingly, global (article) and title polarity also showed remarkable consistency across data channels, and articles with the most shares tended to have a slightly positive sentiment polarity.

The data is limited to 2013-2015 articles and Mashable’s set of channels. Expanding the data set to include a broader range of dates and additional channels might reveal different trends or patterns, especially for a more longitudinal study. The data is also limited because it only comes from Mashable, so the inclusion of other publishers could validate if these patterns are specific to Mashable or generalizable across the industry. However, this data was not in UCI’s dataset, so further questions along those lines would have to come from further projects.

The graph titled “How is Shares related to vocabulary complexity? (shares < 10,000)” is limited in that it only plots data points with fewer than 10,000 shares and therefore does not include the full dataset. We chose to do this because the outliers took too much focus away from the main body of data. Furthermore, since we have so many data points, it is somewhat difficult to see all of them, even though they were made transparent.

Moreover, in question two, when we explored how the trend in the mean of the average word length of articles published on a given day related to the trend in overall shares of Mashable articles published on that day, the similarities we found were hard to translate into actionable advice that writers could use to increase the number of shares their articles receive. More work could be done by analyzing other metadata variables, as we only analyzed one.