Our data consist of tweets about COVID-19 vaccines (Pfizer/BioNTech, Sinopharm, Sinovac, Moderna, Oxford/AstraZeneca, Covaxin, and Sputnik V) posted from 12 December 2020 to 22 April 2021. The dataset, taken from Kaggle, contains 69,718 tweets; for 14,383 of them (about 21%) we could extract a meaningful user location.

There are 16 variables in the original dataset, but some variables, such as username, were irrelevant to our research purposes. We added new variables derived from the existing data, such as the country/continent of the user, the longitude and latitude of each city, and whether the tweet was sent on a workday or weekend.
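As an illustration of how a workday/weekend variable can be derived from a tweet timestamp, here is a minimal Python sketch. The timestamp format and the function name `day_type` are assumptions made for this example, not details taken from the dataset, and public holidays are not treated specially.

```python
from datetime import datetime

def day_type(timestamp: str) -> str:
    """Label a timestamp as 'weekend' (Saturday/Sunday) or 'workday'."""
    # Assumed format; the real dataset's date column may differ.
    parsed = datetime.strptime(timestamp, "%Y-%m-%d %H:%M:%S")
    # datetime.weekday(): Monday is 0, ..., Saturday is 5, Sunday is 6
    return "weekend" if parsed.weekday() >= 5 else "workday"

# Hypothetical rows standing in for the tweet data
tweets = [
    {"date": "2020-12-12 09:30:00"},  # a Saturday
    {"date": "2021-04-22 18:00:00"},  # a Thursday
]
for tweet in tweets:
    tweet["day_type"] = day_type(tweet["date"])
```

The same idea carries over to any dataframe library: parse the date column once, then map the day of the week to a two-level factor.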

We wanted to investigate the ways that tweets differ with respect to their text semantics, user location, trends over time and popularity of the tweets. We posed these four questions:

  • In what ways do weekend and workday tweets differ?
  • Do tweets from the same geographical location contain similar content?
  • Which countries do most vaccine tweets come from and how does the number of tweets across different continents change over time?
  • What variables appear to be related to popularity?

In what ways do weekend and workday tweets differ?

We wanted to learn how weekend and workday tweets differ. One possibility is that the difference depends on the popularity of the user. Verification status on Twitter is one measure of popularity, so we faceted the tweets by whether they were posted on a workday or a weekend and by whether the user was verified. We cleaned the text by removing stopwords and retrieved word sentiments from the bing lexicon.

The resulting word clouds suggest that workday tweets differ from weekend tweets in how much press coverage or breaking news they reflect, since more news is covered during the week. Verified users tend to tweet about the negative consequences of COVID-19, whereas non-verified users are more positive, describing themselves as “grateful” for the “safe” and “effective” vaccine.
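The cleaning and lexicon lookup described above can be sketched as follows. This is a Python illustration under stated assumptions, not the code used for the analysis: the stopword set and `TOY_LEXICON` are tiny hand-made stand-ins for a full stopword list and the bing lexicon, and the function names are hypothetical.

```python
import re
from collections import Counter

# Toy stand-ins (assumptions): the real analysis used a full stopword
# list and the bing sentiment lexicon.
STOPWORDS = {"the", "a", "an", "is", "and", "for", "of", "to", "i", "am", "so"}
TOY_LEXICON = {
    "safe": "positive",
    "effective": "positive",
    "grateful": "positive",
    "risk": "negative",
    "death": "negative",
}

def tokenize(text):
    """Lowercase the text, split it into words, and drop stopwords."""
    words = re.findall(r"[a-z']+", text.lower())
    return [w for w in words if w not in STOPWORDS]

def sentiment_counts(text):
    """Count positive and negative words found in the toy lexicon."""
    return Counter(TOY_LEXICON[w] for w in tokenize(text) if w in TOY_LEXICON)

example = "I am so grateful for the safe and effective vaccine"
counts = sentiment_counts(example)
```

Words that are absent from the lexicon, such as "vaccine" here, simply contribute no sentiment, which mirrors what an inner join against a lexicon does.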

More nuanced sentiments such as joy or fear add granularity and numerical concreteness to the comparison between workday and weekend tweets. We first extracted each word from the weekend and workday tweet texts and joined the words to the sentiments defined by the “nrc” lexicon in R. We then made a mosaic plot to check whether the count of each sentiment is independent of the day type. The mosaic plot is shaded by the Pearson residuals: there are significantly more “anger”, “fear”, “negative”, and “sadness” words in tweets made on weekends (shaded blue), and significantly fewer “anticipation”, “positive”, and “trust” words in tweets made on weekends (shaded red). Interestingly, the “joy” and “surprise” sentiments show no significant association with the day type. Overall, the Pearson residuals indicate a notable difference in sentiment between workday and weekend tweets.
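To make the shading concrete, the Pearson residual of a cell is (observed - expected) / sqrt(expected), where the expected count under independence is (row total × column total) / grand total. The Python sketch below computes these residuals for a small table; the counts are invented for illustration and are not the report's data, and `pearson_residuals` is a hypothetical helper name.

```python
from math import sqrt

def pearson_residuals(table):
    """Return (observed - expected) / sqrt(expected) for every cell.

    `table` maps a row label to a dict of column label -> count.
    Expected counts assume rows and columns are independent.
    """
    row_totals = {r: sum(cols.values()) for r, cols in table.items()}
    col_totals = {}
    for cols in table.values():
        for c, n in cols.items():
            col_totals[c] = col_totals.get(c, 0) + n
    grand_total = sum(row_totals.values())

    residuals = {}
    for r, cols in table.items():
        residuals[r] = {}
        for c, observed in cols.items():
            expected = row_totals[r] * col_totals[c] / grand_total
            residuals[r][c] = (observed - expected) / sqrt(expected)
    return residuals

# Invented counts, for illustration only
word_counts = {
    "workday": {"fear": 30, "trust": 70},
    "weekend": {"fear": 50, "trust": 50},
}
res = pearson_residuals(word_counts)
```

A positive residual means the cell holds more words than independence would predict, and a negative one means fewer. Shaded mosaic plots conventionally colour only cells whose residual is large in absolute value (a cutoff of about 2 is common).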

Do tweets from the same geographical location contain similar content?

Turning to geography, we studied whether tweets from similar locations discuss similar vaccines. In the point-referenced map plot, the Moderna hashtag (green) is used most prominently by Twitter users on the coasts of the United States (US). The covid19 hashtag (brown) is scattered throughout the coastal regions of the US, the UK and other parts of Europe, and some regions of India and the Middle East. Oxford-AstraZeneca vaccine hashtags (shades of blue) are prevalent in the UK and Eastern Europe. The Covaxin, Sinovac, and Sputnik vaccine hashtags (shades of pink) cluster mostly in India, are scattered throughout Europe (though less so in the UK), and appear in some parts of Africa. This pattern may reflect vaccine availability: users who lack access to the more common mRNA vaccine by Moderna may instead be talking about Sputnik or Sinovac. Moderna is widely distributed in the US, Oxford-AstraZeneca is being distributed in the UK, and other vaccines such as Sputnik and Sinovac may be provided in other parts of Europe and in India.
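A tabular counterpart to the map is a tally of vaccine hashtags per country. The following Python sketch shows one way to build it; the hashtag spellings in `VACCINE_TAGS`, the sample rows, and the function name are assumptions for illustration and do not come from the dataset.

```python
import re
from collections import Counter, defaultdict

# Assumed hashtag spellings; the actual tags in the data may differ.
VACCINE_TAGS = {
    "moderna", "covaxin", "sinovac", "sputnikv",
    "oxfordastrazeneca", "pfizerbiontech", "sinopharm",
}

def hashtags_by_country(rows):
    """Tally vaccine hashtags per country from (country, text) pairs."""
    tallies = defaultdict(Counter)
    for country, text in rows:
        for tag in re.findall(r"#(\w+)", text.lower()):
            if tag in VACCINE_TAGS:
                tallies[country][tag] += 1
    return tallies

# Hypothetical rows
rows = [
    ("United States", "Got my shot today #Moderna #COVID19"),
    ("India", "#Covaxin rollout begins"),
    ("United States", "Second dose done #moderna"),
]
tallies = hashtags_by_country(rows)
```

Lowercasing before matching treats `#Moderna` and `#moderna` as the same tag, which matters when users capitalize hashtags inconsistently.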

Which countries do most vaccine tweets come from and how does the number of tweets across different continents change over time?

To learn which countries tweet the most about vaccines and how tweet frequency changes across continents over time, we first extracted the country and continent of each tweet and displayed the counts in bar charts. To keep the country chart concise, we show only the 10 countries with the most tweets.
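Selecting the countries with the most tweets amounts to counting and sorting. Here is a minimal Python sketch with invented data; `top_countries` is a hypothetical helper, and tweets without a usable location are represented as `None`.

```python
from collections import Counter

def top_countries(countries, n=10):
    """Return the n most frequent countries as (country, count) pairs."""
    # Skip tweets whose location could not be resolved (None or empty).
    counted = Counter(c for c in countries if c)
    return counted.most_common(n)

# Invented sample: one entry per tweet
sample = [
    "United States", "India", None, "United States",
    "United Kingdom", "India", "United States",
]
top = top_countries(sample, n=2)
```

The resulting pairs can be passed directly to any bar-chart routine, with the country on one axis and the count on the other.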

Among the tweets for which we could extract a location, most come from North America, especially the United States. The U.S. is followed by India and the United Kingdom. At the continent level, the shares of Asia and Europe are larger than those of India and the U.K. alone, because tweets from other countries on those continents add to the totals.

Tweet volume generally increases from the beginning of February 2021, and the increase is especially volatile in North America. There is also a large spike in tweets from Asia at the beginning of March 2021.
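A time trend like this one rests on counting tweets per continent per time period. The sketch below bins by calendar month in Python; the rows are invented, the function name is hypothetical, and the report's plots may have used a different bin width.

```python
from collections import Counter
from datetime import date

def monthly_counts(rows):
    """Count tweets per (continent, 'YYYY-MM') from (date, continent) pairs."""
    counts = Counter()
    for tweet_date, continent in rows:
        counts[(continent, tweet_date.strftime("%Y-%m"))] += 1
    return counts

# Invented rows, for illustration only
rows = [
    (date(2021, 2, 3), "North America"),
    (date(2021, 2, 17), "North America"),
    (date(2021, 3, 1), "Asia"),
    (date(2021, 3, 2), "Asia"),
    (date(2021, 3, 2), "Europe"),
]
counts = monthly_counts(rows)
```

Plotting one line per continent over the sorted month keys gives a trend chart; narrower bins such as days or weeks would show spikes more sharply.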

Conclusions

To identify differences and groupings among COVID-19 vaccine tweets, we asked in what ways tweets are distinct based on weekend or workday posting, geographical location, and popularity. COVID-19 vaccine tweets offer locational, textual and temporal information with which to analyze relationships and differences among tweets. We found that retweets and favorites are related to popularity, while frequency of tweets may not be significantly associated with popularity. Vaccine tweets increase from the beginning of February 2021, with most of them coming from the US and India, and we observed similarities among tweets from similar geographical locations. Workday tweets tend to be more positive, while weekend tweets tend to be more negative.

Although we explored differences among tweets in geography, workday/weekend sentiment, country/continent tweet volume, and popularity by source, future work could identify semantic similarities between countries, since pairs of countries may tweet in a similar fashion. Analyzing mean popularity at the continent level rather than the country level might also reveal whether popularity is correlated with tweet frequency. Finally, extracting user occupations from user descriptions could help identify correlations between the most popular users and their occupations.