Introduction

Abstract

In today's world, smartphones have become an essential part of our everyday lives, and mobile applications, or "apps," play a significant role in our daily activities. Apps have transformed the way we communicate, learn, work, and entertain ourselves; from social media and messaging to productivity and utilities, there is an app for almost everything. Apps have made our lives easier and more convenient by providing instant access to information and services. They have also enabled us to stay connected with friends and family, and they have created new opportunities for businesses to reach customers and for entrepreneurs to launch startups. However, several factors go into determining the success of an app, and that success can be examined through an app's ratings and number of downloads. For our project, we explored the Google Play Store Apps dataset from Kaggle through statistical and graphical analysis to better understand what determines the success of such apps.

Description of Dataset

  • App.Name: name of the application.
  • App.Id: unique identifier of the app.
  • Category: the type of application in terms of its use.
  • Rating: a value between 0 and 5, with one decimal place.
  • Rating.Count: the number of ratings the app has received.
  • Installs: how many downloads the app has, stored as a string.
  • Minimum.Installs: approximate minimum app install count.
  • Maximum.Installs: approximate maximum app install count.
  • Free: TRUE if the app is free.
  • Price: the price of the app, with two decimal places.
  • Currency: the currency of the price.
  • Size: an alphanumeric string representing the space required to install the app.
  • Minimum.Android: the oldest Android version compatible with the app.
  • Developer.Id: name of the developer.
  • Developer.Website: website of the app developer.
  • Developer.Email: email of the app developer.
  • Released: the date the app was released on the Google Play Store.
  • Privacy.Policy: link to the website with the privacy policy.
  • Last.Updated: the most recent date the app was updated.
  • Content.Rating: the age group the app is appropriate for.
  • Ad.Supported: TRUE if the app shows advertisements.
  • In.App.Purchases: TRUE if there are features that can be purchased within the app.
  • Editor.Choice: TRUE if the app is shown in the Editors' Choice section of the Play Store.
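
The dataset can be loaded in R roughly as follows; the file name is an assumption based on the Kaggle download, and the later sketches in this report reuse the resulting apps data frame.

library(tidyverse)

# Read the Kaggle CSV; read.csv replaces spaces in the original column
# names with dots, giving names like Rating.Count and Content.Rating.
apps <- read.csv("Google-Playstore.csv")
str(apps)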

Research Questions

What is the relationship between the density of app installations and content rating? Is there a difference in the distribution of app installations across different content ratings?

We created a stacked bar graph to show the conditional distribution of Content.Rating given app Category. The graph shows that most categories are dominated by a content rating of Everyone, but in the Social category the majority of apps are rated Teen. We also plotted the distribution of the number of apps by Content.Rating. Most apps are rated Everyone, with Teen the second most common content rating, though the gap between the two is very large.
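
A minimal sketch of these two charts, assuming the apps data frame from the loading step above (position = "fill" stacks to proportions, which is one way to read off the majority content rating per category):

ggplot(apps, aes(x = Category, fill = Content.Rating)) +
  geom_bar(position = "fill") +   # proportion of each content rating per category
  coord_flip() +                  # categories read more easily on the y-axis
  labs(y = "Proportion of apps", fill = "Content rating")

ggplot(apps, aes(x = Content.Rating)) +
  geom_bar() +                    # marginal count of apps per content rating
  labs(x = "Content rating", y = "Number of apps")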

We also created density ridgeline plots to examine the distribution of the number of installs across the different content ratings. The x-axis represents the number of app installations and the y-axis shows the density for each content rating. The plot shows that the majority of installation counts fall between 0 and 10 on this scale for apps with a content rating of Teen, Everyone 10+, or Everyone. The distribution is positively skewed, with a long tail to the right.

## Picking joint bandwidth of 3.97
## Warning: Removed 244842 rows containing non-finite values
## (`stat_density_ridges()`).
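
The ridgeline plot can be produced with ggridges along these lines; the log transform is our assumption (log of zero installs yields -Inf, which would explain the non-finite values warning above):

library(ggridges)

ggplot(apps, aes(x = log(Maximum.Installs), y = Content.Rating)) +
  geom_density_ridges() +   # prints the "Picking joint bandwidth" message
  labs(x = "log(number of installs)", y = "Content rating")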

We conducted a one-way ANOVA (aov) and observed a very small p-value. Since the p-value is below the 0.05 significance level, we reject the null hypothesis that the mean maximum install count is the same across all Content.Rating levels. As such, there is sufficient evidence that Content.Rating is associated with the number of installations an application receives.

## [1] "Everyone"        "Mature 17+"      "Teen"            "Everyone 10+"   
## [5] "Unrated"         "Adults only 18+"
##                    Df    Sum Sq   Mean Sq F value   Pr(>F)    
## Content.Rating      5 1.033e+16 2.066e+15   4.392 0.000533 ***
## Residuals      333893 1.571e+20 4.705e+14                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
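
The output above corresponds to a one-way ANOVA along the following lines:

unique(apps$Content.Rating)   # the six content-rating levels listed above

fit <- aov(Maximum.Installs ~ Content.Rating, data = apps)
summary(fit)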

Is the Rating of an app independent of the way it is monetized?

It is important to understand how to monetize an app appropriately. It would not be ideal if an app rated at 2 stars, indicating that most people did not like it much, were filled with in-app purchases and ads; users would be inclined to simply delete it. To examine this question, we look at three parameters: Free, In.App.Purchases, and Ad.Supported. For each of the three parameters, we create a plot counting TRUE and FALSE values across the different rating levels of the apps. Since we wish to study how apps monetize, we remove all apps with fewer than a thousand ratings; this way, we only keep data on relatively popular apps that could generate meaningful revenue.
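
A sketch of this filtering and plotting step, where popular is our name for the filtered subset:

popular <- subset(apps, Rating.Count >= 1000)   # keep relatively popular apps

ggplot(popular, aes(x = factor(Rating), fill = Ad.Supported)) +
  geom_bar() +
  labs(x = "Rating", y = "Number of apps", fill = "Ad supported")

ggplot(popular, aes(x = factor(Rating), fill = In.App.Purchases)) +
  geom_bar() +
  labs(x = "Rating", y = "Number of apps", fill = "In-app purchases")
# (the plot for Free is analogous)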

From the plots above, we observe that for both Ad.Supported and In.App.Purchases, the TRUE and FALSE counts follow a unimodal distribution that tracks the total number of apps at each rating. Contrary to our intuition, we therefore predict that the rating of an app may not be correlated with how it is monetized. We also decide that statistical testing on the Free parameter would not be useful, as almost all apps are free, so we remove it from consideration. Instead, we create a new parameter, AdPur, the four-level factor formed by the 2x2 combinations of TRUE and FALSE for Ad.Supported and In.App.Purchases.
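
One way to construct AdPur is with interaction(), which forms the four TRUE/FALSE combinations:

popular$AdPur <- interaction(popular$Ad.Supported, popular$In.App.Purchases,
                             sep = " / ")
table(popular$AdPur)   # counts for the four monetization combinations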

From examining the plot above, we again notice that all four levels follow a unimodal distribution that tracks the total number of apps at each rating. We additionally predict that AdPur is independent of Rating. To fully examine these relationships, we conduct chi-squared tests.

## 
##  Pearson's Chi-squared test
## 
## data:  Adp
## X-squared = 641.23, df = 39, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test
## 
## data:  InAppp
## X-squared = 294.71, df = 39, p-value < 2.2e-16
## 
##  Pearson's Chi-squared test
## 
## data:  AdPurp
## X-squared = 936.55, df = 117, p-value < 2.2e-16
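
The three tests correspond to contingency tables of each variable against Rating; the object names follow the output above:

Adp    <- table(popular$Ad.Supported,     popular$Rating)
InAppp <- table(popular$In.App.Purchases, popular$Rating)
AdPurp <- table(popular$AdPur,            popular$Rating)

chisq.test(Adp)      # df = 39 is consistent with a 2 x 40 table
chisq.test(InAppp)
chisq.test(AdPurp)   # df = 117 is consistent with a 4 x 40 table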

After conducting chi-squared tests on the three variables, we obtain p-values below 0.05 for all three tests. Therefore, we have enough evidence to reject the null hypothesis that Rating is independent of Ad.Supported, In.App.Purchases, and AdPur. Contrary to our prediction, all three have a statistically significant association with the rating of the app.

Do all applications share the same distribution of Rating? Is there any covariate that discriminates between different Rating distributions?

We first created two graphs showing the distribution of Rating: the left uses the raw Rating data and the right uses the transformed data. On the first graph, the most frequent rating is 0; more than 40% of apps have a rating of 0. The rest of the distribution is left-skewed, with a high proportion of apps rated 5. We suspected that the many mobile applications with few installs and reviews inflate the counts at both 0 and 5. To see a cleaner distribution of Rating, we removed applications with fewer than 1,000 installs.
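
A sketch of the two histograms, using the numeric Minimum.Installs column for the filter (an assumption; the string Installs column carries the same information):

ggplot(apps, aes(x = Rating)) +
  geom_histogram(binwidth = 0.1) +   # raw ratings, with spikes at 0 and 5
  labs(y = "Number of apps")

ggplot(subset(apps, Minimum.Installs >= 1000), aes(x = Rating)) +
  geom_histogram(binwidth = 0.1) +   # apps with at least 1,000 installs
  labs(y = "Number of apps")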

After the transformation, the number of both 0 and 5 ratings decreased significantly. Compared to the bimodal graph of the original data, this graph shows a much clearer distribution of Rating.

We then tried to find a covariate that better explains the distribution of Rating. We started with Installs, since filtering on it greatly decreased the number of possibly false ratings at both 0 and 5. We also intuitively expected a strong relationship between Installs and Rating: as more people install an app, it tends to accumulate more ratings, which pulls the score toward the average rather than the extremes of 0 or 5.

Before examining the distributions of Rating, we transformed Installs into 5 factor levels, because it contains more than 10 unique string values; one way to perform this binning is sketched after the list:

  • Less than 1K
  • Less than 10K
  • Less than 100K
  • Less than 1M
  • More than 1M
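
A sketch of the binning step with cut(); Install.Group is our name for the new factor, and we again use the numeric Minimum.Installs column:

apps$Install.Group <- cut(
  apps$Minimum.Installs,
  breaks = c(-Inf, 1e3, 1e4, 1e5, 1e6, Inf),
  labels = c("Less than 1K", "Less than 10K", "Less than 100K",
             "Less than 1M", "More than 1M")
)
table(apps$Install.Group)   # number of apps in each install group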

Then, we plotted the conditional distribution of Rating for the original Rating data, without any transformation. Each panel represents the group of mobile applications with a similar number of installs. The panel for apps with fewer than 1K installs shows a great number of 0 ratings and a slightly elevated number of 5 ratings, as expected from the filtering effect we saw earlier. As the number of installs increases, the number of 0 ratings drops off sharply and the distribution gradually takes on a unimodal shape. The applications with more than 1 million installs form a distribution similar to the true Rating distribution we hypothesized.
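
These panels can be produced by faceting on the install group:

ggplot(apps, aes(x = Rating)) +
  geom_histogram(binwidth = 0.1) +
  facet_wrap(~ Install.Group, scales = "free_y") +   # one panel per install group
  labs(y = "Number of apps")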

Having found a strong relationship between Rating and Installs, we also tested our initial assumption that Rating.Count should be related to Installs. The scatter plot below shows that apps with more than 1 million installs generally have higher values of Rating.Count.
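
A sketch of the scatter plot; the log scales are an assumption, used because both variables are heavily right-skewed:

ggplot(apps, aes(x = Minimum.Installs, y = Rating.Count)) +
  geom_point(alpha = 0.1) +   # heavy overplotting, so use transparency
  scale_x_log10() +
  scale_y_log10() +
  labs(x = "Installs", y = "Rating count")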

Lastly, we created a violin plot showing the conditional distribution of Rating given the number of Installs. The apps with fewer than 1K installs show an extreme number of 0 ratings and a fairly large population at rating 5. This plot contains the same information as the previous one, but it makes the volume of each population easier to see: as the number of installs increases, the volume of the violin around the central ratings increases as well.
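
The violin plot can be drawn directly from the binned factor:

ggplot(apps, aes(x = Install.Group, y = Rating)) +
  geom_violin() +
  labs(x = "Number of installs", y = "Rating")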

Finally, we merged our assumptions about the true distribution of Rating with our findings. The left bar chart is the expected distribution of Rating we assumed earlier, after removing applications with fewer than 1,000 installs or a rating of 0. The right bar chart is the distribution for applications with more than 1 million installs; given their massive install counts, their ratings should be well standardized and representative of the true distribution of Rating. Both charts have a unimodal shape, with the number of apps falling off at high ratings around 4.5 to 5.0.
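
A sketch of this side-by-side comparison, using gridExtra (one of several ways to arrange the two charts):

library(gridExtra)

p_expected <- ggplot(subset(apps, Minimum.Installs >= 1000 & Rating > 0),
                     aes(x = Rating)) +
  geom_histogram(binwidth = 0.1)   # our assumed "true" distribution

p_popular <- ggplot(subset(apps, Minimum.Installs > 1e6),
                    aes(x = Rating)) +
  geom_histogram(binwidth = 0.1)   # apps with more than 1M installs

grid.arrange(p_expected, p_popular, ncol = 2)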

Conclusion

We initially constructed three research questions to analyze the Google Play Store mobile application dataset. We explored the data, transformed variables, and created new ones to better answer these questions. Since our dataset includes many categorical variables with high dimensionality, reducing that dimensionality by binning values into intervals was a crucial part of producing the right visuals.

Main Takeaways

  • Research Question 1. The results provide sufficient evidence of a statistically significant association between the Content.Rating of an application and its Maximum.Installs.

  • Research Question 2. The relationship between Rating and the monetization parameters In.App.Purchases and Ad.Supported (with Free set aside because almost all apps are free) was plotted and tested; the results suggest a statistically significant association between Rating and how an app is monetized.

  • Research Question 3. The true distribution of Rating has a left-skewed, unimodal shape, and this shape becomes more apparent as the number of Installs increases.

Future Research

We chose app store data whose variables are mostly categorical. Since we had a limited number of quantitative variables, we had difficulty applying methods such as PCA or contour plots to our dataset: PCA is most useful when dealing with high-dimensional quantitative data, and a contour plot requires quantitative variables.

In the future, we could look for a new dataset with more quantitative values, including app functionality or user-behavior metrics. We might also use those variables to construct a model predicting whether an app supports advertisements or in-app purchases; a hypothetical starting point is sketched below.
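
One such starting point is a logistic regression on the covariates already in the dataset; the flag construction below assumes Ad.Supported is stored as the strings "True"/"False" in the CSV:

# Binary response: does the app show advertisements?
apps$AdFlag <- apps$Ad.Supported %in% c("True", "TRUE", TRUE)

model <- glm(AdFlag ~ Rating + log1p(Minimum.Installs) + Content.Rating,
             data = apps, family = binomial)
summary(model)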