Research Questions
What is the relationship between the density of app installations
and content rating? Is there a difference in the distribution of app
installations across different content ratings?
We created a stacked bar graph to show the conditional distribution
of apps per category given Content.Rating. The bar graph shows that in
most categories the dominant content rating is Everyone, but in the
Social category the majority of apps are rated Teen. We also plotted
the number of apps for each Content.Rating. Most apps are rated
Everyone, and Teen is the second most common rating, but the gap
between the two is very large.
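The observation about the Social category is a statement about the
share of each content rating within a category. A minimal sketch of
that calculation follows, in Python for illustration only (the report's
own analysis was run in R). The function name and the toy rows are
assumptions made for this example and are not taken from the dataset.

```python
from collections import Counter, defaultdict


def rating_shares_by_category(rows):
    """Return, for each category, the share of apps per content rating.

    rows: iterable of (category, content_rating) pairs.
    """
    counts = defaultdict(Counter)
    for category, rating in rows:
        counts[category][rating] += 1
    return {
        category: {r: n / sum(ratings.values()) for r, n in ratings.items()}
        for category, ratings in counts.items()
    }


# Toy rows, invented for illustration only (not real dataset values).
toy_rows = [
    ("Social", "Teen"), ("Social", "Teen"), ("Social", "Everyone"),
    ("Tools", "Everyone"), ("Tools", "Everyone"), ("Tools", "Teen"),
]
shares = rating_shares_by_category(toy_rows)
dominant_social = max(shares["Social"], key=shares["Social"].get)
```

Each inner dictionary sums to 1, which is what a stacked bar scaled to
proportions displays for one category.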
We also created ridgeline density plots to look at the distribution of
the number of installs across content ratings. The x-axis represents
the number of app installations and the y-axis represents the density
of app installations. The graph shows that, for apps with a Content
Rating of Teen, Everyone 10+, or Everyone, the majority of app
installations fall between 0 and 10. The distribution is positively
skewed, with a long tail to the right.
## Picking joint bandwidth of 3.97
## Warning: Removed 244842 rows containing non-finite values
## (`stat_density_ridges()`).
We conducted a one-way ANOVA (aov) and observed a small p-value
(0.000533). Since the p-value is less than the 0.05 significance level,
we reject the null hypothesis that the mean of the maximum number of
installs is the same across all levels of Content.Rating. As such,
there is sufficient evidence that the number of installations of
applications differs by Content.Rating.
## [1] "Everyone" "Mature 17+" "Teen" "Everyone 10+"
## [5] "Unrated" "Adults only 18+"
##                    Df    Sum Sq   Mean Sq F value   Pr(>F)
## Content.Rating      5 1.033e+16 2.066e+15   4.392 0.000533 ***
## Residuals      333893 1.571e+20 4.705e+14
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
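The F value in the table is the ratio of the between-group mean square
to the residual mean square. The sketch below, in Python for
illustration only (the report itself used R's aov), re-derives F from
the printed sums of squares and shows the same calculation on toy
groups. The function name and the toy data are assumptions made for
this example.

```python
def one_way_anova(groups):
    """Return (F, df_between, df_within) for a list of numeric groups."""
    all_values = [x for g in groups for x in g]
    n_total = len(all_values)
    grand_mean = sum(all_values) / n_total
    means = [sum(g) / len(g) for g in groups]
    # Variation of group means around the grand mean, weighted by group size.
    ss_between = sum(len(g) * (m - grand_mean) ** 2 for g, m in zip(groups, means))
    # Variation of observations around their own group mean.
    ss_within = sum((x - m) ** 2 for g, m in zip(groups, means) for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    f_stat = (ss_between / df_between) / (ss_within / df_within)
    return f_stat, df_between, df_within


# Re-deriving F from the rounded sums of squares printed in the table.
f_from_table = (1.033e16 / 5) / (1.571e20 / 333893)

# Toy groups, invented for illustration only.
f_toy, df_b, df_w = one_way_anova([[1, 2, 3], [2, 3, 4], [5, 6, 7]])
```

Because the table reports rounded sums of squares, f_from_table agrees
with the printed 4.392 only to about two decimal places.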
Is the Rating of an app independent of the way it is monetized?
It is important to understand how to monetize an app appropriately.
It would not be ideal if an app rated at 2 stars, indicating that most
people did not like it much, were filled with in-app purchases and ads:
users would be inclined to simply delete the app. To examine this
question, we look at three parameters: Free, In.App.Purchases, and
Ad.Supported. For each of the three parameters, we create a plot of the
TRUE and FALSE counts across the different rating levels of the apps.
Because we want information on how apps monetize, we remove all apps
with fewer than a thousand ratings. This way, we only gather data on
relatively popular apps that would generate sufficient revenue.
From the plots above, we observe that Ad.Supported and
In.App.Purchases each show a unimodal distribution that follows the
total number of apps at each rating. We therefore predict that,
contrary to our intuition, the rating of an app may not be associated
with how it is monetized. Additionally, we decide that statistical
testing on the Free parameter may not be useful, as almost all apps are
free, and we remove this parameter from our consideration. Instead, we
create a new parameter, AdPur, which encodes the four (2x2)
combinations of TRUE and FALSE for the Ad.Supported and
In.App.Purchases parameters.
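One way to picture AdPur is as a single four-level label built from the
two boolean parameters. The sketch below is in Python for illustration
only. The function name and the label format are hypothetical, since
the report does not state how the levels were named.

```python
def ad_pur_level(ad_supported, in_app_purchases):
    """Combine two booleans into one of four labels (label format is hypothetical)."""
    return f"Ad={ad_supported}, Pur={in_app_purchases}"


# Enumerate every combination to confirm there are exactly four levels.
levels = sorted(
    {ad_pur_level(a, p) for a in (True, False) for p in (True, False)}
)
```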
Examining the plot above, we again notice a unimodal distribution that
follows the total number of apps at each rating for all four AdPur
levels we plot. We therefore also predict that AdPur is independent of
Rating. To examine these relationships formally, we conduct chi-squared
tests.
##
## Pearson's Chi-squared test
##
## data: Adp
## X-squared = 641.23, df = 39, p-value < 2.2e-16
##
## Pearson's Chi-squared test
##
## data: InAppp
## X-squared = 294.71, df = 39, p-value < 2.2e-16
##
## Pearson's Chi-squared test
##
## data: AdPurp
## X-squared = 936.55, df = 117, p-value < 2.2e-16
After conducting chi-squared tests on the three variables, we get
p-values below 2.2e-16, far under the 0.05 significance level, for all
three tests. Therefore, we have enough evidence to reject the null
hypothesis that Rating is independent of Ad.Supported,
In.App.Purchases, and AdPur. Contrary to our prediction, all three have
a statistically significant association with the rating of the app.
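For reference, the Pearson chi-squared statistic compares each observed
count with the count expected under independence, and its degrees of
freedom are (rows - 1) x (columns - 1). The sketch below is in Python
for illustration only (the report used R). The function name and the
toy table are assumptions made for this example, and no continuity
correction is applied.

```python
def chi_squared(table):
    """Pearson chi-squared statistic and degrees of freedom.

    table: list of rows, each a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / total
            stat += (observed - expected) ** 2 / expected
    df = (len(table) - 1) * (len(table[0]) - 1)
    return stat, df


# Toy 3x2 table, invented for illustration only.
stat, df = chi_squared([[10, 20], [20, 10], [15, 15]])
```

The reported degrees of freedom fit this formula if Rating has 40
distinct levels, which is an inference from the output rather than
something the report states: (40 - 1) x (2 - 1) = 39 for the two
TRUE/FALSE parameters and (40 - 1) x (4 - 1) = 117 for the four-level
AdPur.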
Do all applications share the same distribution for Rating? Is there
any covariate to discriminate different Rating distributions?
We first created two graphs to show the distribution of Rating. The
left graph shows the raw Rating data and the right graph shows the
transformed Rating data. On the first graph, the most frequent rating
is 0: more than 40% of the apps have a rating of 0. The remaining
ratings form a left-skewed distribution with a high proportion at
Rating 5. We assumed that the many mobile applications with few
installs and reviews inflate the counts at both Rating 0 and Rating 5.
To get a better view of the distribution of Rating, we removed
applications with fewer than 1,000 installs.
After the transformation, the number of both 0 and 5 ratings decreased
significantly. Compared with the bimodal graph of the original data,
this graph shows the distribution of Rating more clearly.
We then looked for a covariate that could better explain the
distribution of Rating. We started with Installs, since filtering on it
greatly decreased the number of possibly false ratings at both 0 and 5.
We also intuitively expected a strong relationship between Installs and
Rating: as more people install an app, it is more likely to receive
more ratings, which pulls the average score away from the extremes of 0
and 5.
Before looking at the distributions of Rating, we transformed Installs
into 5 factor levels, because it contains more than 10 unique string
values:
- Less than 1K
- Less than 10K
- Less than 100K
- Less than 1M
- More than 1M
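The binning into these five levels can be sketched as follows, in
Python for illustration only. It assumes the raw Installs values are
strings such as "1,000+" and that each boundary value belongs to the
next level up (for example, "1,000,000+" maps to More than 1M). The
report states neither detail, so both are assumptions, as is the
function name.

```python
def install_level(installs):
    """Map a raw installs string such as "1,000+" to one of five levels.

    Assumes the string is a count with optional commas and a trailing "+".
    """
    n = int(installs.replace(",", "").replace("+", "").strip())
    if n < 1_000:
        return "Less than 1K"
    if n < 10_000:
        return "Less than 10K"
    if n < 100_000:
        return "Less than 100K"
    if n < 1_000_000:
        return "Less than 1M"
    return "More than 1M"


# Example strings, invented for illustration only.
examples = {s: install_level(s) for s in ["500+", "1,000+", "50,000+", "1,000,000+"]}
```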
Then, we created the conditional distribution of Rating given the
install level, using the original Rating data without any
transformation. Each graph represents the group of mobile applications
with a similar number of installs. The graph for apps with fewer than
1K installs shows a great number of 0 ratings and a slightly elevated
number of 5 ratings. This was expected, as we had previously seen the
effect of filtering on Installs. As the number of installs increases,
the number of 0 ratings decreases sharply and the graph gradually takes
on a unimodal shape. The applications with more than 1 million installs
form a distribution similar to the true Rating distribution we assumed.
After finding the strong relationship between Rating and Installs, we
also tested our initial assumption that Rating.Count should be related
to Installs. The bottom scatter plot shows that apps with more than 1
million installs generally have higher values of Rating.Count.
Lastly, we created a violin plot showing the conditional distribution
of Rating given the number of Installs. The apps with fewer than 1K
installs show an extreme number of 0 ratings and a fairly large
population at rating 5. This graph contains the same information as the
previous graph of the different Rating distributions, but here we can
better see the volume of each population. As the number of installs
increases, the volume of the violin increases too.
Finally, we compared our assumption about the true distribution of
Rating with our findings. The left bar chart is the expected
distribution of Rating we assumed earlier, for which we removed the
applications with fewer than 1,000 installs and whose rating is 0. The
right bar chart is the distribution for the applications with more than
1 million installs. Because these apps have a massive number of
installs, we expect their ratings to be stable and to approximate the
true distribution of Rating. Both charts have a unimodal shape, with
the number of apps decreasing at the highest ratings, around 4.5 to
5.0.
Conclusion
We initially constructed three research questions to analyze the
Google Play Store mobile application dataset. We explored and
transformed the data and created new variables to better answer the
research questions. Since our dataset includes many categorical
variables with a large number of levels, reducing that number by
binning values into intervals was crucial to producing readable
visuals.
Main Takeaways
Research Question 1. The results show sufficient evidence of a
statistically significant association between the Content.Rating of
applications and Maximum.Installs.
Research Question 2. The relationship between the Rating parameter and
monetization parameters such as In.App.Purchases, Ad.Supported, and
Free was plotted, and In.App.Purchases, Ad.Supported, and their
combination AdPur were tested; the results suggest a statistically
significant association between Rating and each of the tested
parameters.
Research Question 3. The true distribution of Rating appears to be
left-skewed and unimodal, and this shape becomes more apparent as the
number of Installs increases.
Future Research
We chose app store data whose variables are mostly categorical.
Because we had a limited number of quantitative variables, we had
difficulty applying methods such as PCA or contour plots to our
dataset: PCA is most useful for high-dimensional quantitative data, and
a contour plot requires quantitative variables.
In the future, we could look for a new dataset with more quantitative
variables, including metrics related to app functionality or user
behavior. We might also use those variables to construct a model
predicting whether an app supports advertisements or in-app payments.