In our project, we explored the “Kickstarter Projects” dataset from Kaggle, which contains attributes for 378661 Kickstarter projects. There are 14 variables total, including:
ID
name: name of project
category: a specific category that the project falls into (ex: Food Trucks, Indie Rock)
main_category: a broader category that the project falls into (ex: Food, Music)
currency: type of currency used to support the campaign
state: current condition the project is in
country: country of project
deadline: deadline date for crowdfunding time
goal: fundraising goal
launched: date launched
pledged: amount pledged by crowd
backers: number of backers
usd.pledged: amount of money pledged (in USD)
usd_goal_real
However, we focused specifically on the usd.pledged, backers, state, category, main_category, pledged, launched, deadline, and usd_goal_real variables in our research.
We came up with the following 4 research questions:
Within this research question, we looked at the possible associations between each of the usd.pledged, backers, category, and main_category attributes and the state variable.
First, we wanted to determine whether Kickstarter projects that are “successful” have more money pledged in USD and more backers on average than “unsuccessful” projects. We defined “successful” projects as those that have a “successful” state, while “unsuccessful” projects are those that have a “failed,” “canceled,” or “suspended” state. In addition, we only worked with the subset of the data that excluded “live” and “undefined” states, as they do not necessarily fall into a successful or unsuccessful category. We also created a new variable, success, which is TRUE if the project is “successful” and FALSE if the project is not.
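The filtering and labeling step described above can be sketched as follows. This is a minimal Python illustration rather than the code used in the analysis; the sample rows and the helper name add_success_flag are hypothetical, and only the state values and the success variable come from the text.

```python
# Hypothetical sample rows; the real dataset has 378661 projects.
projects = [
    {"name": "A", "state": "successful"},
    {"name": "B", "state": "failed"},
    {"name": "C", "state": "live"},
    {"name": "D", "state": "canceled"},
    {"name": "E", "state": "undefined"},
    {"name": "F", "state": "suspended"},
]

# States that are neither clearly successful nor clearly unsuccessful.
EXCLUDED_STATES = {"live", "undefined"}


def add_success_flag(rows):
    """Drop live/undefined projects and add a boolean 'success' field."""
    kept = []
    for row in rows:
        if row["state"] in EXCLUDED_STATES:
            continue
        # success is True only for the "successful" state;
        # failed, canceled, and suspended all map to False.
        kept.append({**row, "success": row["state"] == "successful"})
    return kept


labeled = add_success_flag(projects)
for row in labeled:
    print(row["name"], row["state"], row["success"])
```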
We wanted to create a scatterplot of usd.pledged vs backers colored by state. However, this scatterplot is difficult to interpret: there are many outliers with an extremely high amount pledged or number of backers, and most of the points are stacked on top of each other in the bottom left corner, so we are unable to determine their states. We noticed that most of the outliers are “successful” projects, and among the “unsuccessful” projects, the highest values of usd.pledged are 4005111.4 and 721036.5, and the highest values of backers are 20632 and 9895. So, we decided to create a scatterplot based on a subset of the data where usd.pledged is less than 800000 and backers is less than 10000, eliminating the “unsuccessful” outlier project with the highest usd.pledged and backers values, as well as the “successful” projects above this threshold.
(Slightly related fact: The “unsuccessful” outlier project with the highest number of backers and amount in USD pledged was a campaign for a laser razor, and was suspended by Kickstarter for violating their rules.)
Looking at this scatterplot, we find that the “unsuccessful” projects are almost all concentrated in the bottom left corner, while the “successful” projects are somewhat concentrated in the bottom left corner but also spread throughout the rest of the plot. In addition, since the outliers with high numbers of backers and high amounts of USD pledged that we eliminated were also almost all “successful” projects, it seems that the “successful” projects do have more money pledged and more backers on average than the “unsuccessful” projects.
In order to strengthen this claim, we performed a t-test on both backers and usd.pledged to determine whether each has different means across success. Both of our t-tests had p-values close to 0, so we rejected the null hypothesis in both cases. We have sufficient evidence that the average number of backers differs between “successful” and “unsuccessful” projects, and that the average amount of money pledged in USD differs between “successful” and “unsuccessful” projects.
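To make the two-sample comparison concrete, here is a minimal Python sketch of the t statistic. It assumes Welch's unequal-variance form, which is the default of R's t.test(); the report does not say which variant was used. The function name welch_t and the backer counts are made up for illustration.

```python
import math
from statistics import mean, variance


def welch_t(sample_a, sample_b):
    """Return (t statistic, Welch-Satterthwaite degrees of freedom)."""
    n_a, n_b = len(sample_a), len(sample_b)
    # Squared standard error of each group mean (sample variance / n).
    se2_a = variance(sample_a) / n_a
    se2_b = variance(sample_b) / n_b
    t = (mean(sample_a) - mean(sample_b)) / math.sqrt(se2_a + se2_b)
    # Welch-Satterthwaite approximation to the degrees of freedom.
    df = (se2_a + se2_b) ** 2 / (
        se2_a ** 2 / (n_a - 1) + se2_b ** 2 / (n_b - 1)
    )
    return t, df


# Made-up backer counts, for illustration only.
backers_success = [120, 85, 240, 60, 310, 150]
backers_fail = [3, 0, 12, 7, 1, 25]

t_stat, dof = welch_t(backers_success, backers_fail)
print(f"t = {t_stat:.2f}, df = {dof:.1f}")
```

A large absolute t value relative to the t distribution with the computed degrees of freedom corresponds to a small p-value, which is the situation the text reports for both backers and usd.pledged.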
Next, we wanted to examine if projects of certain categories end up performing better than others. We divided up the dataset into a “successful” subset and an “unsuccessful” subset, as defined earlier, and created word clouds of the “successful” [left] and “unsuccessful” [right] subsets of the data with the category variable, which is the specific category for each project.
We then looked at the most frequent categories in the successful and unsuccessful subsets of the dataset. “Product Design,” “Tabletop Games,” and “Documentary” are the most common categories among successful projects, while “Product Design,” “Documentary,” and “Video Games” are the most common categories among unsuccessful projects. While this shows that certain categories make up higher proportions of the successful and unsuccessful subsets, it does not prove that certain categories succeed more often than others, since a category with many projects can be common in both subsets. We would need to construct a bar graph of the proportion of successful projects within each category in order to show which categories have a higher success rate than others.
Though we looked at word clouds of the category variable in the previous graph, there are too many possible categories to display them effectively in a bar graph. We instead constructed a bar graph of the main_category variable in the following graph, which is the broader category for each project.
From our bar graph, we find that some of the most successful projects are in the “Dance,” “Comics,” and “Theater” main categories, where each of those categories has a proportion of successful projects above 50%. It seems that certain main categories are more successful than others, but we want to proceed with a statistical test to determine if the proportion of successful projects is significantly different across the main_category variable.
Using prop.test() to test the null hypothesis that the proportions of success across the main categories are equal, we see that our resulting p-value is nearly 0, and we therefore reject the null hypothesis. We have sufficient evidence that the proportion of successful projects is significantly different among the different main categories.
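For readers curious what a test of equal proportions computes, the following Python sketch calculates a Pearson chi-square statistic for the null hypothesis that every group shares one success rate. It is an illustration of the idea, not a reimplementation of prop.test(); the function name equal_proportions_chi2 and the counts are made up and are not taken from the Kickstarter data.

```python
def equal_proportions_chi2(successes, totals):
    """Pearson chi-square statistic for H0: all groups share one success rate.

    Returns (statistic, degrees of freedom).
    """
    pooled = sum(successes) / sum(totals)  # success rate under H0
    stat = 0.0
    for s, n in zip(successes, totals):
        expected_success = n * pooled
        expected_failure = n * (1 - pooled)
        # Contribution of the success cell and the failure cell.
        stat += (s - expected_success) ** 2 / expected_success
        stat += ((n - s) - expected_failure) ** 2 / expected_failure
    return stat, len(totals) - 1


# Made-up (successes, total projects) per main category.
counts = {"Dance": (60, 100), "Technology": (20, 100), "Music": (45, 100)}
successes = [s for s, _ in counts.values()]
totals = [n for _, n in counts.values()]

stat, df = equal_proportions_chi2(successes, totals)
print(f"chi-square = {stat:.2f} on {df} degrees of freedom")
```

The statistic is compared against a chi-square distribution with one fewer degree of freedom than the number of groups; the further the observed success rates are from the pooled rate, the larger the statistic and the smaller the p-value.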
This research question examines whether different categories of kickstarters bring in more pledged money than others.
The boxplot demonstrates the amount of pledged money brought in by the top 10 most popular categories of our kickstarter dataset. The data used to create these boxplots examines the middle 50% of the pledged data in order to better visualize the average effects of category on pledged values (as well as to discard outlier pledges).
Focusing only on the median pledged values for each category (represented by the solid black line in each white box), it seemed that there may be a difference in pledged values between the categories (especially when comparing the spreads of the categories “Fiction” and “Tabletop Games”).
The following barplot represents just the median pledged values of each of the top 10 most popular categories. This barplot better displays the disparities in the median pledge values for each of these categories, where “Tabletop Games” has a median pledge value which is far above the median pledge value of “Fiction”.
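The quantity behind such a barplot (the median pledged value for each of the most popular categories) can be sketched in Python as follows. The helper name median_pledged_top_categories and the sample rows are hypothetical; "popular" is assumed here to mean the categories with the most projects, and the example keeps the top 2 rather than the top 10 for brevity.

```python
from collections import Counter, defaultdict
from statistics import median


def median_pledged_top_categories(rows, top_n):
    """Median pledged amount for the top_n most frequent categories."""
    counts = Counter(row["category"] for row in rows)
    top = [cat for cat, _ in counts.most_common(top_n)]
    pledged = defaultdict(list)
    for row in rows:
        if row["category"] in top:
            pledged[row["category"]].append(row["pledged"])
    return {cat: median(pledged[cat]) for cat in top}


# Made-up rows, for illustration only.
rows = [
    {"category": "Tabletop Games", "pledged": 5000.0},
    {"category": "Tabletop Games", "pledged": 12000.0},
    {"category": "Tabletop Games", "pledged": 800.0},
    {"category": "Fiction", "pledged": 50.0},
    {"category": "Fiction", "pledged": 300.0},
    {"category": "Fiction", "pledged": 10.0},
    {"category": "Webseries", "pledged": 100.0},
]

medians = median_pledged_top_categories(rows, top_n=2)
for category, value in medians.items():
    print(category, value)
```

Using the median rather than the mean keeps a handful of very large campaigns from dominating the comparison, which matches the text's stated aim of discarding outlier pledges.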
Above is a plot of the average pledge amounts of kickstarters by day, smoothed over a 30 day period (monthly). Some outlier points have been removed to better demonstrate the trend. As we can see, the overall trend is positive both in the data and the smoothed line from 2009 to 2018, although the trend appears to have plateaued in recent years. In addition, there appears to be more variability in project size in later years.
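One way to smooth a daily series over a 30 day window is a trailing moving average, sketched below in Python. This is an assumption made for illustration: the report does not state which smoother produced the line in the plot, and the function name trailing_mean and the sample values are hypothetical.

```python
from collections import deque


def trailing_mean(values, window):
    """Trailing moving average.

    Each output is the mean of the current value and up to window - 1
    preceding values; early points average over what is available.
    """
    buf = deque(maxlen=window)
    running = 0.0
    out = []
    for v in values:
        if len(buf) == window:
            # The oldest value is about to be evicted by append().
            running -= buf[0]
        buf.append(v)
        running += v
        out.append(running / len(buf))
    return out


# Made-up daily average pledge amounts; a real series would use window=30.
daily_avg_pledged = [1, 2, 3, 4, 5]
smoothed = trailing_mean(daily_avg_pledged, window=3)
print(smoothed)
```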
We were also interested in seeing whether or not the distribution of project categories changed over the timeframe (2009-18). As seen in the above mosaic plot, the proportions of project categories did indeed change: some categories, such as Film and Video, shifted from positive to negative residuals, while others, such as Design, shifted from negative to positive residuals.
Is there a difference between the categories of kickstarter projects? Do specific kickstarter project types require more funding than others? Below we investigate the relationship between how many days the kickstarter campaign lasts and what their goals were for each category.
As seen above, there are quite a few outliers in our data, which is to be expected from such a large dataset. To get a better sense of the overall data, we will only look at the projects that took less than 100 days to fund and put everything on a logarithmic scale. We are now looking to see if there is a relationship between the log(fundraising goal) and the number of days funded.
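The preparation step described above can be sketched in Python as follows. Several details are assumptions, since the report does not spell them out: the campaign length is taken as deadline minus launched, the dates are assumed to be in YYYY-MM-DD form, the goal is taken from usd_goal_real, and the logarithm is the natural log. The helper name duration_and_log_goal and the sample rows are hypothetical.

```python
import math
from datetime import datetime

DATE_FORMAT = "%Y-%m-%d"  # assumed date format


def duration_and_log_goal(rows, max_days=100):
    """Return (days, log goal) pairs for campaigns shorter than max_days."""
    points = []
    for row in rows:
        launched = datetime.strptime(row["launched"], DATE_FORMAT)
        deadline = datetime.strptime(row["deadline"], DATE_FORMAT)
        days = (deadline - launched).days
        # Skip long campaigns and non-positive goals (log is undefined there).
        if 0 <= days < max_days and row["usd_goal_real"] > 0:
            points.append((days, math.log(row["usd_goal_real"])))
    return points


# Made-up rows, for illustration only.
rows = [
    {"launched": "2016-01-01", "deadline": "2016-01-31", "usd_goal_real": 1000.0},
    {"launched": "2016-01-01", "deadline": "2016-06-01", "usd_goal_real": 5000.0},
]

points = duration_and_log_goal(rows)
print(points)
```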
From the faceted graphs above, we see there appears to be a common trend of points for each category; however, it appears to be quite flat or possibly slightly positively sloped. The somewhat flat line could imply that projects tend to stay within a certain range for funding. As the variance differs considerably across all of the graphs, it is hard to conclude that the linear regression is statistically significant. One interesting thing to note, however, is that there appears to be a wide range of goal amounts specifically at 30 and 60 days for the majority of the categories, which could mean a lot of projects only last 1 or 2 months. Also, there appears to be much less of a cluster of points after 60 days for each category, with the exception of “Music”, which could mean lots of film and video projects require funding for more than 2 months.
We unfortunately did not find much of a trend between the log(fundraising goal) and the days of funding, but does each type of project require a different amount of money? For example, do dance projects require more money than technology projects? To investigate, we turn to the boxplot below.
We once again use log(goals) for the y-axes in order to make the data more readable. There appear to be a lot of outliers in our data; notably, Art and Publishing include projects with extremely low fundraising goals. Additionally, there appear to be fewer drastic outliers for Dance compared to the other categories, which appear to have many outliers across a large range. Looking at the boxplots themselves, they mostly seem to overlap; however, the Technology boxplot appears to be much higher than most of the other categories, which could imply that Technology projects tend to request more funding than other categories of projects. That being said, overall the boxplots are all very close in heights and medians, which could imply that the amounts of funding requested are similar for each category of project.
Overall, we concluded that “successful” projects have more backers and more money pledged (in USD) on average than “unsuccessful” projects, and that within the main_category variable, projects of certain categories are more successful than projects of other categories. Some of the main categories with the highest proportions of successful Kickstarter projects, all of which have over a 50% success rate, are “Dance,” “Comics,” and “Theater.” In addition, we found that the graphs supported the notion that different categories of kickstarters can affect the amount of pledges that are received, which can affect other attributes of the kickstarters. We can also see that the variables in the dataset are changing over time. Specifically, as analyzed above, pledged amounts are increasing over time, and the proportions of projects within each category are changing as well. Finally, from the given graphs above, we see that there did not appear to be a relationship between the duration of fundraising and the goal amount, and there also did not appear to be a significant difference between the various fundraising goal amounts for each project. We can conclude that all of the categories appear to have very similar fundraising goals.