Final Project : Statistical Visualization of the Sephora Dataset

Explanation of the Dataset

General explanation of the Sephora dataset we will be working on

The Sephora dataset is a collection of products offered on the Sephora website in April 2020. It has 9168 rows, one for each product on the website, and 21 columns, of which 15 are categorical and 5 are quantitative. Each column holds some information about the products sold on the Sephora website. Some categorical variables actually represent quantitative values, and with some revision they can be converted into quantitative variables. Some columns, such as the URL, were irrelevant to our analysis because they only contain text values.

Research Question

From the base data, we constructed a new column called ‘access’, which shows where each product is sold. Using the given columns ‘online_only’ and ‘exclusive’, we created a categorical variable with four levels: ‘Only in Sephora Website’, ‘Only in Online’, ‘Everywhere’ and ‘Only in Sephora Website and Store’. Based on where products are sold, we decided to look deeper into how Sephora uses product availability as a strategy to sell its products.
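As a rough illustration of how the ‘access’ column can be derived from the two flags, here is a minimal Python sketch. The report does not spell out which combination of ‘online_only’ and ‘exclusive’ maps to which label, so the mapping below is an assumption (one plausible reading of the label names), and the function name access_label is hypothetical.

```python
# ASSUMPTION: this flag-to-label mapping is one plausible reading of the
# label names; the report does not state the exact mapping it used.
ACCESS_LABELS = {
    (1, 1): "Only in Sephora Website",            # online only and exclusive
    (1, 0): "Only in Online",                     # online only, not exclusive
    (0, 1): "Only in Sephora Website and Store",  # exclusive, not online only
    (0, 0): "Everywhere",                         # neither flag set
}


def access_label(online_only, exclusive):
    """Return the 'access' label for one product from its two 0/1 flags."""
    return ACCESS_LABELS[(int(online_only), int(exclusive))]


# Hypothetical rows standing in for the dataset's (online_only, exclusive) columns.
rows = [(0, 0), (1, 0), (0, 1), (1, 1)]
access = [access_label(o, e) for o, e in rows]
```

Keeping the mapping in a single dictionary makes it easy to swap in the actual rule if it differs from the one assumed here.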



Question 1 : Do the product descriptions reflect the characteristics of each access strategy?

We first wanted to find common characteristics or differences between products depending on where they are sold, so we looked at the ‘detail’ column, which holds the product details shown on the Sephora website. We made four word clouds, one for each accessibility category, to show the words that occur most often in the details.
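A word cloud is essentially a picture of word frequencies, so the counting step behind it can be sketched as below. This is only an illustration in Python: the stop-word list is a tiny hypothetical one, and the helper name top_words is not from the project.

```python
import re
from collections import Counter

# Hypothetical, deliberately small stop-word list; a real analysis would
# use a much longer one.
STOP_WORDS = {"a", "an", "and", "the", "for", "of", "to", "in", "is", "this", "with"}


def top_words(details, n=10):
    """Return the n most frequent non-stop-words in a block of detail text."""
    words = re.findall(r"[a-z']+", details.lower())
    counts = Counter(w for w in words if w not in STOP_WORDS)
    return counts.most_common(n)


# Usage: join the 'detail' text of one access group and count its words.
sample = "Rich cream for dry skin. This cream hydrates."
most_common = top_words(sample, n=3)
```

Running this once per accessibility category gives the word counts that each of the four word clouds would display.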

Graph 1

After looking at the four word clouds, we were not able to find any word specific to each accessibility category. There were common words, but it was difficult to say that they gave an intuitive explanation of the accessibility. We believe this happened because the details are so descriptive that many of them contain irrelevant words. Additionally, because so few products are sold only on the Sephora website, that word cloud was small and was made up of irrelevant words, which made it hard to catch a pattern.

Question 2 : What proportion of the products falls under each accessibility category?

The following graph shows the proportion of each category that we constructed in the ‘access’ column. As mentioned above, using the ‘online_only’ and ‘exclusive’ columns, we were able to determine where each product was sold. The stacked bar chart shows the proportion of products by where they are sold.

Graph 2

To begin with, we wanted to visualize what proportion of the products fell under each accessibility category. Unsurprisingly, more than half of the products in our dataset were sold everywhere. Among the rest, the largest group was sold only in the Sephora website and store, followed by products sold only online. Only a very small number of products were sold only on the Sephora website.
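The proportions shown in the stacked bar chart are simply each category's count divided by the total number of products. A minimal Python sketch of that calculation follows; the function name and the toy labels are illustrative and do not reproduce the dataset's actual counts.

```python
from collections import Counter


def access_proportions(labels):
    """Map each access label to its share of all products (shares sum to 1)."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: n / total for label, n in counts.items()}


# Toy labels only; these are NOT the real counts from the Sephora data.
toy_labels = ["Everywhere"] * 3 + ["Only in Online"]
proportions = access_proportions(toy_labels)
```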

Question 3 : How does the size of the brand affect its product popularity?

The following graphs show the relationship between brand size and the popularity of the brand’s products, using tree maps of subsets split by access (everywhere, only in online, only in Sephora website, and only in Sephora website and store). First, for each access data set (every, online, website, webStore), we created a subset of the top 20 brands, determined by the number of products each brand has on Sephora. From each subset, we used the ‘brand’, ‘love’ and ‘count’ variables to build a tree map in which the box size (count) represents how frequently the brand’s products appear on Sephora and the box color (love) represents their popularity. We added a legend for ‘love’ for the viewer’s convenience.
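The data preparation behind each tree map (count products per brand, keep the largest brands, attach a popularity value) can be sketched in Python as follows. The report does not say whether the box color uses the total or the average ‘love’ per brand; the average is assumed here, and the function name top_brands is hypothetical.

```python
from collections import defaultdict


def top_brands(products, n=20):
    """Summarize (brand, love) pairs as (brand, count, mean_love), largest brands first.

    ASSUMPTION: popularity is taken as the mean 'love' per brand; the report
    does not state whether it used the mean or the total.
    """
    stats = defaultdict(lambda: [0, 0])  # brand -> [product count, summed love]
    for brand, love in products:
        stats[brand][0] += 1
        stats[brand][1] += love
    summary = [(brand, count, total / count) for brand, (count, total) in stats.items()]
    # Sort by product count (descending), breaking ties alphabetically by brand.
    summary.sort(key=lambda row: (-row[1], row[0]))
    return summary[:n]


# Toy input; brand names and love counts are made up for illustration.
toy = [("A", 10), ("A", 30), ("B", 5)]
largest = top_brands(toy, n=1)
```

In the resulting rows, the count would drive the box size and the mean love would drive the box color.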

Graph 3

For products with no limitation on access (the ‘every’ data set), we can’t see a strong correlation between brand size and popularity, because brands with darker colors don’t necessarily have comparably large boxes. For products with online access, the tree map suggests that ‘The Ordinary’ is both the most popular and the most frequent brand among the top 20. Moreover, slightly darker boxes tend to be bigger than the others, which suggests that for online access items brand size is correlated with popularity. For products with website access, the overall trend of popularity follows the trend of box sizes, except for ‘FENTY BEAUTY by Rihanna’. This brand has a comparably small box but one of the deepest colors, which means that it is more popular than most brands in the ‘website’ data set even though it has relatively few products. For products with Sephora website and offline store access, the trend of box size follows the trend of box color: ‘Sephora collection’ is the biggest and most popular brand, and ‘FENTY BEAUTY by Rihanna’ is the third biggest brand and the second most popular. Overall, the tree maps suggest that brand size is associated with product popularity for restricted access (online, website, and webStore), but not necessarily for products that are not restricted (every).

Question 4 : Which categories occurred most often across the different accessibility strategies, and did the strategy have a particular effect on popularity?

We created a mosaic plot to visualize two things: the marginal distribution of the categories, and whether category and accessibility are independent. We subsetted the dataset to contain only the top 5 categories in the initial Sephora dataset. We wanted to check how each category is distributed across the accessibility categories, and whether any combination has noticeably more or fewer products than expected.

Graph 4

From the mosaic plot, the marginal distribution shows that, overall, most products are sold everywhere. By eye, the Values & Gift category seems to have a fairly even number of products across all accessibility categories. According to the Pearson residuals, the perfume category has significantly more products than expected under independence when sold everywhere, and significantly fewer when sold only in the Sephora website and store. The Values & Gift category also departs from independence, but in the opposite direction: it has significantly fewer products than expected when sold everywhere and significantly more when sold only online or only on the Sephora website. The remaining categories do not seem to have significantly more or fewer products than expected, regardless of where they are sold. From this, we could see that more perfumes are sold everywhere than we would expect under independence. Because ‘everywhere’ includes offline stores as well as online ones, we could assume that Sephora expects people to be more inclined to buy perfumes offline, where they can smell the scent before purchase. Thus, we decided to look at the popularity of perfumes alone across the different accessibility categories.

Because perfume stood out in the mosaic plot, we decided to check its popularity for each accessibility category. For each category, we summed the ‘love’ counts and divided by the number of products to get the average love. We created a pie chart to compare these averages across the accessibility categories.
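For reference, the Pearson residual of a cell in a two-way table is (observed - expected) / sqrt(expected), where the expected count under independence is (row total × column total) / grand total. The Python sketch below computes these residuals for a small table; the function name is illustrative and the toy counts are not taken from the Sephora data.

```python
import math


def pearson_residuals(table):
    """Pearson residuals for a two-way table of counts (list of rows).

    Assumes every row and column total is positive, so expected counts are nonzero.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    grand_total = sum(row_totals)
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count if category and accessibility were independent.
            expected = row_totals[i] * col_totals[j] / grand_total
            res_row.append((observed - expected) / math.sqrt(expected))
        residuals.append(res_row)
    return residuals


# Toy 2x2 table (rows = categories, columns = access levels); made-up counts.
toy_table = [[20, 10], [10, 20]]
toy_residuals = pearson_residuals(toy_table)
```

A positive residual means more products than expected under independence and a negative one means fewer. As a common rule of thumb, cells with an absolute residual above roughly 2 are the ones flagged as noteworthy in a shaded mosaic plot.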
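The averaging step described above can be sketched in Python as follows. The helper names and toy numbers are illustrative only; the second function shows how averages would be turned into pie-chart shares, which is an assumption about how the chart was built.

```python
def mean_love_by_access(rows):
    """Average 'love' per access label from (access, love) pairs."""
    totals = {}  # access -> (summed love, product count)
    for access, love in rows:
        love_sum, count = totals.get(access, (0, 0))
        totals[access] = (love_sum + love, count + 1)
    return {access: love_sum / count for access, (love_sum, count) in totals.items()}


def pie_shares(means):
    """Convert the per-access averages into fractions of their total.

    ASSUMPTION: the pie slices are proportional to the average love per category.
    """
    total = sum(means.values())
    return {access: value / total for access, value in means.items()}


# Toy perfume rows; the love counts are made up for illustration.
toy_rows = [("Everywhere", 100), ("Everywhere", 300), ("Only in Online", 100)]
toy_means = mean_love_by_access(toy_rows)
toy_shares = pie_shares(toy_means)
```

Because each slice is based on an average, a category with many products does not automatically get a larger slice than one with few.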

Graph 5

The average love was highest for perfumes that were sold everywhere. Because the chart compares averages rather than totals, the larger number of products sold everywhere does not by itself explain this result. This is consistent with the idea that Sephora makes most perfumes available everywhere so that customers can smell the scent and get the full experience of the product, and it is reinforced by the higher average love that perfumes receive in the ‘everywhere’ category compared with the other categories.

Question 5 : Do discounts on limited editions differ depending on where they are sold?

We also wanted to know how Sephora handles pricing. Specifically, we wanted to see whether discounts differ for limited editions. We used the ‘value_price’ and ‘price’ columns to measure the discount, and the ‘limited_edition’ column to compare the two groups. Limited editions are generally considered exclusive, so we wanted to check whether they had different discount rates from products that are not limited editions, depending on where they are sold.
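One way to express the discount from the two price columns is as a rate relative to the value price. The report does not give its exact formula, so the definition below, (value_price - price) / value_price, is an assumption, and the function name is hypothetical.

```python
def discount_rate(value_price, price):
    """Fraction of the value price that is discounted.

    ASSUMPTION: discount = (value_price - price) / value_price; the report
    does not state the formula it used. Returns None when value_price is not
    positive, since the rate is undefined there.
    """
    if value_price <= 0:
        return None
    return (value_price - price) / value_price


# A product valued at 50 and sold for 40 has a 20% discount under this definition.
example_rate = discount_rate(50.0, 40.0)
```

A product whose price equals its value price gets a rate of 0, which makes undiscounted products easy to separate from discounted ones.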

Graph 6

Overall, regardless of where they are sold, many of the limited edition products had discounted prices, that is, a price below the value price. When we fit a linear regression to each group, the line for limited edition products lies underneath the line for products that are not limited edition. This suggests that limited edition products tend to be discounted more than products that are not limited edition. Looking at the individual scatterplots for each accessibility category, we observed the largest discounts from the original price for products sold only in the Sephora store and website. Even though the price range there wasn’t high, the discount rate was the highest. From this graph, we could conclude that Sephora applies larger discounts to limited edition products than to other products, perhaps to increase customers’ interest in buying them.
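To make the comparison of the two regression lines concrete, here is a minimal ordinary least squares sketch in Python. It assumes the scatterplots put ‘value_price’ on the x-axis and ‘price’ on the y-axis, which the report does not state; the data points are made up and the function name fit_line is hypothetical.

```python
def fit_line(xs, ys):
    """Ordinary least squares fit of y = intercept + slope * x.

    Requires at least two distinct x values.
    """
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


# ASSUMPTION: x = value_price, y = price. Toy points, not real data.
value_prices = [10.0, 20.0, 30.0]
limited_prices = [8.0, 16.0, 24.0]    # hypothetical limited edition prices
regular_prices = [10.0, 20.0, 30.0]   # hypothetical non-limited prices

limited_fit = fit_line(value_prices, limited_prices)
regular_fit = fit_line(value_prices, regular_prices)
```

Under this axis assumption, a fitted line that sits lower for one group means that, at the same value price, that group tends to sell for less, which is what a larger discount looks like.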

Conclusion

To summarize, in this project we examined the relationship between product popularity and Sephora’s marketing strategies from several angles: the wording of product descriptions, brand size, differences in accessibility by product category, and the discounts applied under different accessibility. From the visualizations, we were able to conclude that, for products with restricted access, the more products a brand has, the more popular its products tend to be, and that Sephora assigns different accessibility to certain categories such as perfume, possibly because consumers tend to prefer testing perfumes in person to buying them online. Moreover, we found that Sephora applies a higher discount rate to items exclusive to its own website and store than to items that are not, possibly because Sephora wants to draw more attention to its exclusive products and sell more of them.

Although we were able to find several interesting marketing strategies in the Sephora dataset, our analysis has limitations. First, the data was collected about 2 years ago, in April 2020, when the pandemic was only beginning. Thus, our results might not reflect the most recent marketing strategies used by Sephora, because since then more of the sales would have come from the website than from offline stores due to social distancing. Second, we weren’t able to access a direct measure of popularity, such as the number of products that were actually sold. The data set we chose is the biggest and most useful Sephora data set available online, but it offers few options for evaluating a product’s real popularity, so we had to use the ‘love’ variable as a measure of popularity.

Therefore, in future work, it would be more beneficial to collect data from a more recent period, with more variables that reflect the real popularity of the products and that open up further aspects to look into.