Our team found this Starbucks dataset on Kaggle; it was last updated in 2015. The dataset describes Starbucks drinks in terms of their beverage category (coffee, Frappuccino, etc.) and beverage prep (size and type of milk), as well as their nutritional values. The original dataset has 18 variables (15 quantitative and 3 categorical) and 242 rows. After removing null and invalid values, 218 rows remain. Some variables were renamed to clean up and simplify the strings.
We also needed to clean up some of the values. As mentioned earlier, the original Beverage_prep column combines several pieces of information in each value, and its values are not consistent. To address this, we created the columns Size and Milk by extracting information from Beverage_prep. Likewise, raw total fat values might not be easy for the audience to interpret, so we created Fat_levels. Below are the newly created variables:
There are a few questions we wanted to explore. First, we would like to know whether caffeine content differs across beverage classes and which class or classes have the highest caffeine content. We expect the espresso categories to have the highest caffeine content, and we do not expect milk type to change the caffeine content. Second, we would like to identify factors that affect how healthy a drink is, where we treat a drink that is high in calories as less healthy. We suspect that milk, size, and the type of drink will impact the calories. Lastly, we would like to determine the relationship between various nutritional values such as vitamins, calcium, iron, and fiber across drink types.
From the word cloud above, the words “whip”, “cream”, “latt[e]”, and “without” are very common in beverage names. Looking through the names, we noticed that many drinks specify “without whipped cream” and that many are some type of latte. Other common words include “tea”, “mocha”, “caramel”, “vanilla”, and many more, which represent flavors or syrups in the drinks and are therefore unsurprising.
This plot shows the conditional distribution of caffeine given milk type and beverage class. The center of the caffeine distribution stays roughly the same across milk types, suggesting that milk type does not necessarily change the caffeine content. A one-way ANOVA does show, however, that milk type has a significant effect on calories (p-value = 0.000492). Differences between beverage classes are less clear: caffeine seems to be centered below 200 for classic espresso, slightly above 200 for Frappuccino, around 200 for signature espresso, and below 200 for miscellaneous.
## Df Sum Sq Mean Sq F value Pr(>F)
## Milk 3 180409 60136 6.159 0.000492 ***
## Residuals 214 2089329 9763
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
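As a rough illustration of how a table like the one above can be produced, the sketch below runs a one-way ANOVA with `aov()` on synthetic data. The data frame `demo`, the milk level names, and the group means are assumptions made up for the example; they are not the report's data.

```r
# Synthetic stand-in for the cleaned data (all names and values are hypothetical).
set.seed(1)
milk_levels <- c("Nonfat", "2%", "Soy", "Whole")
demo <- data.frame(
  Milk = factor(rep(milk_levels, each = 30), levels = milk_levels)
)

# Give each milk type a different mean calorie count so an effect is present.
group_means <- c(150, 190, 170, 230)
demo$Calories <- rnorm(
  nrow(demo),
  mean = group_means[as.integer(demo$Milk)],
  sd = 40
)

# One-way ANOVA: does mean Calories differ across Milk types?
fit_milk <- aov(Calories ~ Milk, data = demo)
anova_table <- summary(fit_milk)[[1]]
print(anova_table)

# p-value for the Milk term (first row of the ANOVA table).
p_milk <- anova_table[["Pr(>F)"]][1]
```

With four milk types the Milk term has 3 degrees of freedom, which matches the `Df` column in the output above.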
The boxplot addresses our hypothesis from the previous graph: beverage class does not appear to have a significant effect on calories. The median calories, from highest to lowest, are: classic espresso, Frappuccino, signature espresso, and miscellaneous. Frappuccino has the smallest variance, while miscellaneous has the largest, which makes sense since the miscellaneous class includes both non-caffeinated and caffeinated drinks.
We see a linear trend between sugar content and calories across beverage categories: as the amount of sugar increases, so does the number of calories. In relation to our research questions, this demonstrates the importance of sugar as a factor contributing to the nutritional value and calorie levels of drinks, regardless of their type.
The relationship between sugar and total fat is similar to what we expected: there is a positive and significant association between the amount of sugar and fat in a drink (p-value = 5.94e-09). However, it is surprising that the various sizes look so similar to one another, since we would expect venti (the largest size) to have more sugar and hence more fat than short (the smallest size). In this additive model the sugar slope is shared by all sizes, and each size coefficient is an offset relative to the baseline size (short, the level not listed in the output). The offsets for tall, grande, and venti are not significantly different from short, whereas unclassified is much different and significant (p = 3.85e-07).
##
## Call:
## lm(formula = Total.Fat ~ Sugars + Size, data = sbux)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.2697 -1.3804 -0.1604 1.0275 8.3706
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -0.189638 0.590879 -0.321 0.749
## Sugars 0.047760 0.007873 6.066 5.94e-09 ***
## SizeGrande -0.594429 0.726230 -0.819 0.414
## SizeTall -0.377215 0.716703 -0.526 0.599
## SizeVenti -0.892986 0.762304 -1.171 0.243
## SizeUnclassified 3.284833 0.626764 5.241 3.85e-07 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.161 on 212 degrees of freedom
## Multiple R-squared: 0.4954, Adjusted R-squared: 0.4835
## F-statistic: 41.63 on 5 and 212 DF, p-value: < 2.2e-16
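To make the reading of the coefficients above concrete, here is a minimal sketch on synthetic data that fits the same kind of additive model and compares it with an interaction model. The interaction model and the F-test are an illustration of how one could check whether the sugar slope differs by size; they are not part of the original analysis, and the data frame `demo` is invented.

```r
# Synthetic data (hypothetical values); column names mirror the model formula above.
set.seed(2)
sizes <- c("Short", "Tall", "Grande", "Venti")
demo <- data.frame(
  Size   = factor(rep(sizes, each = 25), levels = sizes),
  Sugars = runif(100, min = 0, max = 80)
)
demo$Total.Fat <- 0.05 * demo$Sugars + rnorm(100, sd = 2)

# Additive model: one shared Sugars slope plus an intercept shift per Size,
# each shift measured relative to the first factor level ("Short").
fit_add <- lm(Total.Fat ~ Sugars + Size, data = demo)

# Interaction model: allows the Sugars slope itself to differ by Size.
fit_int <- lm(Total.Fat ~ Sugars * Size, data = demo)

print(coef(summary(fit_add)))

# F-test on the nested models; a large p-value would suggest that a single
# shared slope is adequate for these data.
print(anova(fit_add, fit_int))
```

In the additive fit, a row such as `SizeVenti` describes how much higher or lower total fat is for that size at the same sugar level, not a different slope.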
This plot lets us see that drink size does impact the calories for different drink types. For example, the smaller sizes (short and tall) consistently have the lowest calorie distributions compared to the other sizes. As drink size increases, the calorie range increases for all drink types as well, even with combined sizes.
There seem to be 2 modes in the data, with many of the drinks having carbohydrate content clustered around 50 and 125 grams, respectively, and calories around 200 on average. The plot shows how these two variables are jointly distributed across Starbucks drinks.
From the heat map, there is a trend where cholesterol increases as the fat level increases from very low to very high, although the trend is not consistent across all beverage classes. In Signature Espresso, drinks with a very low fat level seem to have some of the highest cholesterol content, and cholesterol somewhat decreases as fat level increases. After exploring this further using a one-way ANOVA and a box plot, we determined that the population mean cholesterol differs between beverage classes (p-value less than 2e-16); specifically, Classic Espresso has a lower mean cholesterol.
## Df Sum Sq Mean Sq F value Pr(>F)
## Bev_class 3 35580 11860 43.02 <2e-16 ***
## Residuals 214 59002 276
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
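An ANOVA table like the one above only says that at least one class mean differs. One common follow-up for finding which classes differ is Tukey's HSD, sketched below on synthetic data. This is an illustration, not the method used in the report (which relied on a box plot); the data frame `demo` and the class means are made up.

```r
# Synthetic data (hypothetical): one class is given a lower mean on purpose.
set.seed(3)
classes <- c("Classic Espresso", "Frappuccino",
             "Signature Espresso", "Miscellaneous")
demo <- data.frame(
  Bev_class = factor(rep(classes, each = 30), levels = classes)
)
class_means <- c(10, 30, 30, 30)
demo$Cholesterol <- rnorm(
  nrow(demo),
  mean = class_means[as.integer(demo$Bev_class)],
  sd = 8
)

# One-way ANOVA followed by Tukey's HSD pairwise comparisons.
fit_class <- aov(Cholesterol ~ Bev_class, data = demo)
tukey <- TukeyHSD(fit_class, "Bev_class")
print(tukey)

# Matrix with one row per pair of classes: difference, CI bounds, adjusted p-value.
pairs_tbl <- tukey$Bev_class
```

Each row of `pairs_tbl` compares two classes, so pairs involving the lower-mean class should show adjusted p-values near zero.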
Standardized
Complete Dendrogram, K = 6
##
## 1 2 3 4 5 6
## 133 5 13 24 37 6
Complete Dendrogram, K = 6, Leaf Colored by Bev_cat
Beverage category explains the clustering of certain drinks, but only to a limited degree. The blue cluster consists mostly of Frappuccino drinks, which have green labels.
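The clustering steps summarized above (standardize, complete linkage, cut at K = 6, compare with beverage category) can be sketched roughly as follows. The data are random and the column and category names are hypothetical; Euclidean distance is an assumption, since the distance measure is not stated here.

```r
# Synthetic stand-in: 60 hypothetical drinks with 5 numeric nutrition columns.
set.seed(4)
nutr <- matrix(rnorm(60 * 5), nrow = 60,
               dimnames = list(NULL, paste0("var", 1:5)))
bev_cat <- factor(sample(c("Coffee", "Frappuccino", "Tea"), 60, replace = TRUE))

# Standardize each column, then compute pairwise Euclidean distances (assumed).
nutr_std <- scale(nutr)
d <- dist(nutr_std)

# Complete-linkage hierarchical clustering, cut into K = 6 clusters.
hc <- hclust(d, method = "complete")
clusters <- cutree(hc, k = 6)

# Cluster sizes, and a cross-tabulation against the category labels.
print(table(clusters))
print(table(clusters, bev_cat))

# plot(hc) would draw the dendrogram itself.
```

A cluster dominated by one category in the cross-tabulation is what the text describes for the mostly-Frappuccino cluster.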
Standardized and Centered
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 2.4715 1.5835 1.4521 1.16984 1.07465 0.88856 0.54014
## Proportion of Variance 0.4072 0.1672 0.1406 0.09124 0.07699 0.05264 0.01945
## Cumulative Proportion 0.4072 0.5744 0.7150 0.80620 0.88319 0.93583 0.95528
## PC8 PC9 PC10 PC11 PC12 PC13 PC14
## Standard deviation 0.51786 0.43590 0.33292 0.24555 0.15849 0.11681 0.04168
## Proportion of Variance 0.01788 0.01267 0.00739 0.00402 0.00167 0.00091 0.00012
## Cumulative Proportion 0.97315 0.98582 0.99321 0.99723 0.99890 0.99981 0.99993
## PC15
## Standard deviation 0.03235
## Proportion of Variance 0.00007
## Cumulative Proportion 1.00000
The 1st component explains about 41% of the total variation, and the 2nd component explains about 17%.
We pick the elbow at 4 components, which together explain about 81% of the total variation.
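For reference, an importance table like the one above comes from `summary()` on a `prcomp()` fit. The sketch below uses random data with invented column names; the 80% threshold is only a hypothetical numeric rule shown for comparison, whereas the elbow in the report was picked visually.

```r
# Synthetic data: 100 hypothetical drinks, 6 numeric columns, two of them correlated.
set.seed(5)
nutr <- matrix(rnorm(100 * 6), nrow = 100)
nutr[, 2] <- nutr[, 1] + rnorm(100, sd = 0.3)
colnames(nutr) <- paste0("var", 1:6)

# PCA on centered and standardized columns.
pca <- prcomp(nutr, center = TRUE, scale. = TRUE)

# Rows: standard deviation, proportion of variance, cumulative proportion.
imp <- summary(pca)$importance
print(round(imp, 4))

# Smallest number of components whose cumulative proportion reaches 80%
# (a hypothetical threshold, not the report's criterion).
cum_var <- imp["Cumulative Proportion", ]
n_keep <- which(cum_var >= 0.80)[1]

# screeplot(pca, type = "lines") would draw the scree plot used to look for an elbow.
```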
Many of the scatterplots show positive linear relationships, as seen with variable combinations such as Calcium & Protein, Vitamin A & Protein, and Vitamin A & Calcium. These were the strongest positive relationships, and most combinations of a vitamin or fiber variable with Protein had a strong correlation coefficient. This graph lets us see that there is a significant relationship between nutrients (especially the combinations mentioned above) across drinks.
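Correlation coefficients such as those behind the scatterplot matrix can be computed with `cor()`, and a single pair can be tested with `cor.test()`. The sketch below uses invented data in which Calcium is constructed to track Protein; none of the numbers correspond to the report's results.

```r
# Synthetic data (hypothetical): Calcium is built to depend on Protein.
set.seed(6)
n <- 80
protein <- runif(n, min = 0, max = 20)
demo <- data.frame(
  Protein = protein,
  Calcium = 2 * protein + rnorm(n, sd = 5),
  Fiber   = runif(n, min = 0, max = 8)
)

# Pairwise Pearson correlation matrix.
cor_mat <- cor(demo)
print(round(cor_mat, 3))

# Significance test for one pair of variables.
ct <- cor.test(demo$Protein, demo$Calcium)
print(ct$estimate)
print(ct$p.value)
```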
Through these observations, we gained various insights about the data and were able to answer our research questions. Here is a summary of the answers:

- Beverage class and milk type do not have a significant relationship with caffeine content in a drink.
- In terms of healthiness (calories and cholesterol), the factors that have a relationship with them include sugar content, size, carbohydrates, and fat level. However, beverage class does seem to affect the measured values, since the Classic Espresso group has lower carbohydrate and calorie values than the other classes.
- There seems to be some relationship between nutritional measurements. The strongest are calcium and protein (correlation coefficient = 0.841), protein and Vitamin A (correlation coefficient = 0.80), and fiber and Vitamin C (correlation coefficient = 0.736).
Prior to the project, we had some ideas about the nutritional and health values of drinks, such as that the larger the size, the more calories a drink has, and the sweeter the drink (more sugar), the more calories it has. However, we did not understand the true relationship between these variables, so being able to explore how various drinks differ was helpful for us.
One issue we ran into was the lack of consistency in the data. In the future, it would be nice to have an up-to-date list of all the drinks available at Starbucks so we can provide better analysis to future customers. Additionally, it would be helpful to have data in specific and consistent formats in order to prevent “unclassified” values, which could affect the results. To extend our work, we could look at whether these levels are consistent across Starbucks locations internationally. We could also include more variables relating to customer information, and research whether customers of certain demographics are more likely to order certain kinds of drinks at Starbucks.