Dataset

Our team found this Starbucks dataset on Kaggle; the data was last updated in 2015. The dataset describes Starbucks drinks in terms of their beverage categories (coffee, Frappuccino, etc.) and beverage prep (size and type of milk), as well as their nutritional values. The original dataset has 18 variables (15 quantitative and 3 categorical) and 242 rows. After removing null and invalid values, 218 rows are left. Some of the variables were renamed to clean up and simplify the strings.

Quantitative Variables

  • Calories: The number of calories in a drink. The minimum value of calories in this dataset is 3 and the maximum is 510
  • Total.Fat: The amount of total fat in a drink (recorded in grams). The minimum value of total fat in this dataset is 0g and the maximum is 15g
  • Trans.Fat: The amount of trans fat in a drink (recorded in grams). The minimum value of trans fat in this dataset is 0g and the maximum is 9g
  • Saturated.Fat: The amount of saturated fat in a drink (recorded in grams). The minimum value of saturated fat in this dataset is 0g and the maximum is 3g
  • Sodium: The amount of sodium in a drink (recorded in milligrams). The minimum value of sodium in this dataset is 0mg and the maximum is 40mg
  • Total.Carbohydrates: The total amount of carbohydrates in a drink (recorded in grams). The minimum value of total carbohydrates in this dataset is 0g and the maximum is 340g
  • Cholesterol: The amount of cholesterol in a drink (recorded in milligrams). The minimum value of cholesterol in this dataset is 0mg and the maximum is 90mg
  • Dietary.Fibre: The amount of dietary fibre in a drink (recorded in grams). The minimum value of dietary fibre in this dataset is 0g and the maximum is 8g
  • Sugars: The amount of sugar in a drink (recorded in grams). The minimum value of sugar in this dataset is 0g and the maximum is 84g
  • Protein: The amount of protein in a drink (recorded in grams). The minimum value of protein in this dataset is 0g and the maximum is 20g
  • Vitamin.A: Percentage of your daily value of Vitamin A in a drink. The minimum value of Vitamin A in this dataset is 0% and the maximum is 0.5%
  • Vitamin.C: Percentage of your daily value of Vitamin C in a drink. The minimum value of Vitamin C in this dataset is 0% and the maximum is 1.0%
  • Calcium: Percentage of your daily value of Calcium in a drink. The minimum value of Calcium in this dataset is 0% and the maximum is 0.6%
  • Iron: Percentage of your daily value of Iron in a drink. The minimum value of Iron in this dataset is 0% and the maximum is 0.5%
  • Caffeine: The amount of caffeine in a drink (recorded in milligrams). The minimum value of caffeine in this dataset is 0mg and the maximum is 410mg

Categorical Variables

  • Beverage: The name of a drink. In this dataset, there is a total of 30 unique beverage names
  • Beverage_category: The category of a drink. There are 9 groups – some example data for this include the following: Classic Espresso Drinks, Coffee, Frappuccino® Blended Coffee, and Shaken Iced Beverages.
  • Beverage_prep: The preparation of beverages, which includes the type of Milk and the Size. In this column, the data formatting varies in the sense that some of them have only the Size, while others may have either the Milk or both the Size and the Milk. Currently, there is a total of 13 unique values in the column.

New Categorical Variables

It made the most sense to clean some of the values up. As mentioned earlier, the original Beverage_prep included a combination of information in each value and the values are not consistent. To combat this issue, the columns Size and Milk are created by extracting information from Beverage_prep. Likewise, the information about total fat might not be easily understandable by the audience, which resulted in the creation of Fat_levels. Below are the newly created variables:

  • Fat_levels: After exploring the data, it made sense to create fat levels from very low to very high, with each level spanning 4g of total fat. Hence, there is a total of 4 levels
  • Size: The original dataset contains four main sizes of beverages: Grande, Short, Tall, and Venti. However, there are some values that are unclassified (no labeling of size). The new Size column therefore contains 5 levels.
  • Milk: The original dataset contains four main types of milk: 2%, Nonfat, Soymilk, and Whole Milk. However, there are some values that are unclassified (no labeling of milk). The new Milk column also contains 5 levels.
  • Bev_class: With various Beverage_category groups, Bev_class simplifies the original group into 4 levels: Classic Espresso, Frappuccino, Signature Espresso, and Miscellaneous.
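The derived columns described above could be built along the following lines. This is only a minimal sketch on made-up rows, not the code used for the report: the helper `extract_label`, the toy `Beverage_prep` strings, the exact break points, and the two middle level names ("Low" and "High") are all assumptions for illustration.

```r
# Toy rows only; the strings and values are illustrative, not the real data.
sbux_toy <- data.frame(
  Beverage_prep = c("Short Nonfat Milk", "2% Milk", "Soymilk",
                    "Grande", "Venti Nonfat Milk", "Whole Milk"),
  Total.Fat = c(0.1, 3.5, 6, 9, 13, 15),
  stringsAsFactors = FALSE
)

# Hypothetical helper: return the first match of `pattern` in each string,
# or "Unclassified" when the string contains no match.
extract_label <- function(x, pattern) {
  hit <- regexpr(pattern, x)
  out <- rep("Unclassified", length(x))
  out[hit > 0] <- regmatches(x, hit)
  out
}

sbux_toy$Size <- extract_label(sbux_toy$Beverage_prep,
                               "Short|Tall|Grande|Venti")
sbux_toy$Milk <- extract_label(sbux_toy$Beverage_prep,
                               "2%|Nonfat|Soymilk|Whole Milk")

# Four fat levels in 4g steps; break points and middle labels are assumed.
sbux_toy$Fat_levels <- cut(
  sbux_toy$Total.Fat,
  breaks = c(0, 4, 8, 12, 16),
  labels = c("Very Low", "Low", "High", "Very High"),
  include.lowest = TRUE
)

print(sbux_toy)
```

Rows whose prep string names only a milk type end up with an "Unclassified" size, and vice versa, which is how a fifth level appears in each new column.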

Goals

There are a few questions we wanted to explore. First, we would like to explore whether different beverage classes have different caffeine content and see which one(s) has the highest caffeine content. We expect that the espresso categories would have the highest caffeine content, and we do not expect that milk type will change the caffeine content. Second, we would like to identify factors that make a drink healthy, where we define healthy as low in calories. We suspect that milk, size, and the type of drink will impact the calories. Lastly, we would like to determine the relationship between various nutritional values such as vitamins, calcium, iron, and fiber across drink types.

Analysis

EDA: Distributions of Beverage Categories and Beverage Prep

Word Cloud to analyze beverage names

From the word cloud above, the words “whip”, “cream”, “latt[e]”, and “without” are very common in beverage names. After looking through the names, we noticed that it was common for drinks to specify “without whipped cream”, and that many drinks are a type of latte. Other common words include “tea”, “mocha”, “caramel”, “vanilla”, and many more, which represent flavors or syrups in the drinks and are also unsurprising.
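The word counts behind a word cloud like this can be sketched in base R. The beverage names below are made up for illustration, and no stemming is applied here (the word cloud in the report appears to use stemmed words such as “whip”), so this is only an approximation of that step.

```r
# Illustrative beverage names only, not taken from the dataset.
names_toy <- c(
  "Caffe Latte",
  "Vanilla Latte",
  "Caramel Frappuccino (Without Whipped Cream)",
  "Mocha (Without Whipped Cream)"
)

# Lower-case, split on anything that is not a letter, drop empty tokens.
words <- unlist(strsplit(tolower(names_toy), "[^a-z]+"))
words <- words[nchar(words) > 0]

# Frequency table, most common words first.
freq <- sort(table(words), decreasing = TRUE)
print(freq)
```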

Research Question 1

What’s the beverage class that has the highest caffeine content?

This plot shows the conditional distribution of caffeine given milk and beverage class. The center of the caffeine distribution remains roughly the same across milk types, suggesting that milk does not necessarily change the caffeine content. However, when we tested the hypothesis with a one-way ANOVA, we found that the type of milk does affect the calories (p-value = 0.000492). Comparing beverage classes with one another is less clear, although caffeine for classic espresso seems to be centered below 200, Frappuccino slightly above 200, signature espresso around 200, and miscellaneous below 200.

##              Df  Sum Sq Mean Sq F value   Pr(>F)    
## Milk          3  180409   60136   6.159 0.000492 ***
## Residuals   214 2089329    9763                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
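For readers who want to reproduce this kind of table, a one-way ANOVA can be fit with `aov()`. The sketch below uses simulated data; the group means, the sample sizes, and the choice of `Calories` as the response are assumptions for illustration, not the report's actual model call.

```r
set.seed(1)

# Simulated data: four milk types, ten drinks each (values are invented).
toy <- data.frame(
  Milk = factor(rep(c("2%", "Nonfat", "Soymilk", "Whole Milk"), each = 10)),
  Calories = c(rnorm(10, 200, 30), rnorm(10, 150, 30),
               rnorm(10, 180, 30), rnorm(10, 230, 30))
)

# One-way ANOVA: does the mean response differ across milk types?
fit <- aov(Calories ~ Milk, data = toy)
tab <- summary(fit)[[1]]   # data frame with Df, Sum Sq, Mean Sq, F value, Pr(>F)
p_value <- tab[["Pr(>F)"]][1]

print(tab)
```

The F value in such a table is the ratio of the between-group mean square to the residual mean square, which is how the columns of the output above relate to one another.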

The boxplot addresses our hypothesis from the previous graph and suggests that the beverage class does not have a significant effect on the calories. The medians for the calories are as follows, from highest to lowest: classic espresso, Frappuccino, signature espresso, and miscellaneous. Frappuccino has the smallest variance, while miscellaneous has the largest, which makes sense since the miscellaneous class includes both non-caffeinated and caffeinated drinks.

Research Question 2

What factors contribute to making drinks healthier vs. unhealthier? Beverage category? Size? Milk?

Does the size affect calories for various types of beverage classes differently?

We see a linear trend for both sugar content and calories across beverage categories, where the amount of sugar has a direct relationship with the number of calories. In relation to our research questions, this demonstrates the importance of sugar as a factor contributing to the nutrition value/calorie levels of drinks, regardless of their type.

This relationship between sugar and total fat is similar to what we expected: there is a positive and significant association between the amount of sugar and fat in a drink (p-value = 5.94e-09). However, it is surprising that the various sizes have similar slopes and sugar content to one another, since we would expect venti (the largest size) to have more sugar, and hence fat, than short (the smallest size). That being said, the coefficients for tall, grande, and venti are not significantly different from the baseline size (short), whereas unclassified is much different and significant, as indicated by the p-value (p = 3.85e-07).

## 
## Call:
## lm(formula = Total.Fat ~ Sugars + Size, data = sbux)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -4.2697 -1.3804 -0.1604  1.0275  8.3706 
## 
## Coefficients:
##                   Estimate Std. Error t value Pr(>|t|)    
## (Intercept)      -0.189638   0.590879  -0.321    0.749    
## Sugars            0.047760   0.007873   6.066 5.94e-09 ***
## SizeGrande       -0.594429   0.726230  -0.819    0.414    
## SizeTall         -0.377215   0.716703  -0.526    0.599    
## SizeVenti        -0.892986   0.762304  -1.171    0.243    
## SizeUnclassified  3.284833   0.626764   5.241 3.85e-07 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 2.161 on 212 degrees of freedom
## Multiple R-squared:  0.4954, Adjusted R-squared:  0.4835 
## F-statistic: 41.63 on 5 and 212 DF,  p-value: < 2.2e-16
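Note that the additive formula `Total.Fat ~ Sugars + Size` fits one common sugar slope and lets each size shift only the intercept. To check whether the slope itself differs by size, one option is to compare it with an interaction model. The following is a sketch on simulated data (the coefficients and noise level are invented), not an analysis of the real dataset.

```r
set.seed(2)
n <- 60

# Simulated drinks: three sizes, with "Short" as the baseline level.
toy <- data.frame(
  Size = factor(rep(c("Short", "Tall", "Grande"), each = 20),
                levels = c("Short", "Tall", "Grande")),
  Sugars = runif(n, 0, 80)
)
toy$Total.Fat <- 0.05 * toy$Sugars + rnorm(n, sd = 1)

# Additive model: common Sugars slope, size-specific intercept shifts.
additive <- lm(Total.Fat ~ Sugars + Size, data = toy)

# Interaction model: each size gets its own Sugars slope.
interact <- lm(Total.Fat ~ Sugars * Size, data = toy)

# F-test for whether the size-specific slopes improve the fit.
cmp <- anova(additive, interact)
print(cmp)
```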

This plot shows that drink size does affect the calories for different drink types. For example, the smaller drinks (short and tall) consistently have the lowest calorie distributions compared with the other sizes. As drink size increases, the calorie range increases for all drink types as well, even with combined sizes.

There seem to be 2 modes in the data, with many of the drinks having carbohydrate content of around 50 and 125 grams, and calories of around 200 on average. This plot examines the levels of these variables across Starbucks drinks.

From the heat map, there is a trend where cholesterol increases as the fat level increases from very low to very high, although the trend is not consistent among all beverage classes. In Signature Espresso, the very low fat level seems to have one of the highest cholesterol contents, and cholesterol somewhat decreases as the fat level increases. After exploring this further using a one-way ANOVA and a box plot, we were able to determine that there is a difference between the population mean cholesterol of different beverage classes, since the p-value is less than 2e-16; specifically, Classic Espresso has a lower mean cholesterol.

##              Df Sum Sq Mean Sq F value Pr(>F)    
## Bev_class     3  35580   11860   43.02 <2e-16 ***
## Residuals   214  59002     276                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Standardized

Complete Dendrogram, K = 6

## 
##   1   2   3   4   5   6 
## 133   5  13  24  37   6

Complete Dendrogram, K = 6, Leaves Colored by Bev_cat

Beverage category explains the variation among certain drinks, but only to a limited degree. The blue cluster consists mostly of Frappuccino drinks, which have green labels.
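A six-cluster, complete-linkage solution on standardized variables like the one in this section can be sketched with `hclust()` and `cutree()`. The data here are random numbers and the column names are placeholders, so the cluster sizes will not match the table above.

```r
set.seed(3)

# Random stand-in for the quantitative nutrition columns (40 drinks, 5 variables).
toy <- matrix(rnorm(40 * 5), nrow = 40, ncol = 5,
              dimnames = list(NULL, paste0("nutrient", 1:5)))

# Standardize each column, then cluster on Euclidean distances.
scaled <- scale(toy)
tree <- hclust(dist(scaled), method = "complete")

# Cut the dendrogram into K = 6 clusters and count members per cluster.
clusters <- cutree(tree, k = 6)
sizes <- table(clusters)
print(sizes)
```

Cross-tabulating `clusters` against a categorical column with `table()` is one way to check how well a label such as beverage category lines up with the clusters.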

PCA & Biplot

Standardized and Centered

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     2.4715 1.5835 1.4521 1.16984 1.07465 0.88856 0.54014
## Proportion of Variance 0.4072 0.1672 0.1406 0.09124 0.07699 0.05264 0.01945
## Cumulative Proportion  0.4072 0.5744 0.7150 0.80620 0.88319 0.93583 0.95528
##                            PC8     PC9    PC10    PC11    PC12    PC13    PC14
## Standard deviation     0.51786 0.43590 0.33292 0.24555 0.15849 0.11681 0.04168
## Proportion of Variance 0.01788 0.01267 0.00739 0.00402 0.00167 0.00091 0.00012
## Cumulative Proportion  0.97315 0.98582 0.99321 0.99723 0.99890 0.99981 0.99993
##                           PC15
## Standard deviation     0.03235
## Proportion of Variance 0.00007
## Cumulative Proportion  1.00000

The first component explains about 41% of the total variation, and the second component explains about 17%.

We pick the elbow at 4 components, which together explain about 81% of the total variation.
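The proportions in the table follow directly from the standard deviations: each component's share is its variance divided by the total variance, and with 15 standardized variables the total variance is 15. The sketch below runs `prcomp()` on random data (an assumption for illustration) and then checks the reported PC1 figure by hand.

```r
set.seed(4)

# Random stand-in data: 50 rows, 6 variables, two of them correlated.
toy <- matrix(rnorm(50 * 6), nrow = 50, ncol = 6)
toy[, 2] <- toy[, 1] + rnorm(50, sd = 0.3)

# Center and scale before PCA, as in the report.
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Proportion of variance per component and its running total.
prop_var <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(prop_var)
print(round(rbind(prop_var, cum_var), 4))

# Check against the reported table: PC1 standard deviation 2.4715,
# 15 standardized variables, so total variance is 15.
reported_prop_pc1 <- 2.4715^2 / 15
print(round(reported_prop_pc1, 4))
```

Plotting `prop_var` against the component index gives the scree plot from which an elbow is chosen.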

Research Question 3

Is there a relationship between nutrients (vitamins, calcium, iron, protein, fiber) across drinks?

Many of the scatterplots have positive linear relationships, as seen with variable combinations such as Calcium & Protein, Vitamin A & Protein, and Vitamin A & Calcium. These were the strongest positive relationships and most Vitamin/Fiber + Protein variable combinations had a strong correlation coefficient. This graph allows us to see that there is a significant relationship between nutrients (especially the combinations mentioned earlier) across drinks.
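The correlation coefficients behind a scatterplot matrix can be computed with `cor()`, and a single pair can be tested with `cor.test()`. The data below are simulated (the relationship between the toy `Protein` and `Calcium` columns is invented), so the numbers are not those of the report.

```r
set.seed(5)
n <- 100

# Simulated nutrients: Calcium is built to track Protein, Iron is unrelated.
protein <- runif(n, 0, 20)
toy <- data.frame(
  Protein = protein,
  Calcium = 0.03 * protein + rnorm(n, sd = 0.05),
  Iron = runif(n, 0, 0.5)
)

# Pairwise Pearson correlations for all columns.
cors <- cor(toy)
print(round(cors, 3))

# Significance test for one pair of nutrients.
pair_test <- cor.test(toy$Protein, toy$Calcium)
print(pair_test$p.value)
```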

Conclusion

Through the observations, we gained various insights about the data and were able to answer our research questions. Here is a summary of the answers:

  • Beverage class and milk type do not have a significant relationship with the caffeine content of a drink.
  • In terms of healthiness (calories and cholesterol), the factors related to it include sugar content, size, carbohydrates, and fat level. However, the beverage class does seem to affect the measured values, since the Classic Espresso group has lower carbohydrates and calories than the other classes.
  • There seems to be some relationship between nutritional measurements; the strongest are between calcium and protein (correlation coefficient = 0.841), protein and Vitamin A (correlation coefficient = 0.80), and fibre and Vitamin C (correlation coefficient = 0.736).

Prior to the project, we had some ideas about the nutritional and health values of drinks: the larger the size, the more calories a drink has, and the sweeter the drink (more sugar), the more calories it has. However, we did not understand the true relationship between these variables, so being able to explore how various drinks had different qualities was helpful for us.

One issue that we ran into was the lack of consistency in the data. In the future, it would be nice to have up-to-date data on all the drinks available at Starbucks so that we can provide better analysis to future customers. Additionally, it would be helpful to have data in specific and consistent formats in order to prevent “unclassified” values in our data, which could affect the results. To extend our work in the future, we could look at whether these levels are consistent across Starbucks chains internationally. We could also include more variables relating to customer information and research whether customers of certain demographics are more likely to order certain types of drinks at Starbucks.