Bike Sharing in Seoul and D.C.

Introduction
Data Description
Research Questions
Question 1: Does the shared bike industry perform differently in Seoul and in D.C.?
Question 2: How do the weather conditions change over the year for both Seoul and D.C.?
- Time Series
Question 3: What factors would affect people’s choice on using shared bikes?
Conclusion

Introduction

With the worldwide popularity of bike sharing comes the convenience of renting a bike at one location and returning it at another. Bike-sharing systems provide a sustainable alternative for short-distance trips and help solve the last-mile problem. However, is that the whole picture? Do people living in different regions use shared bikes for different purposes? If so, what factors affect their choices?

To address these questions, our project explores the performance of the bike-sharing industry in different regions and how external factors, including environmental conditions and culture, affect people’s bike-rental behavior.

Specifically, we first explore the differences between locations and examine the variables that potentially influence people’s choice of using shared bikes. Then, we explore how these variables change over time. Finally, we use these observations to examine the relationship between the potentially influential variables and the total number of rides in each region, and we compare the two regions.

Data Description

Our datasets are obtained from UCI. They contain hourly and daily counts of rental bikes, together with the corresponding weather and seasonal information, for D.C. between 2011 and 2012 and for Seoul between 2017 and 2018, respectively.

To examine the two original datasets together, we clean them and combine them into one. We first pick the variables that both datasets share. The Seoul dataset contains many other variables, such as dew point temperature and functional days, which we do not explore. We convert variables such as date and wind speed so that they use the same format and units. We also multiply humidity from D.C. by 41 because the original dataset divided the values by 41 to normalize them. The Seoul data record ride counts for each hour rather than each day, so we sum the rides within each day and take the daily mean of temperature, humidity, and wind speed to bring Seoul’s data into the same format as D.C.’s. The result is the “combined” dataset, which we mainly use for this project.
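The hourly-to-daily step described above can be sketched as follows. This is a minimal illustration in plain Python rather than the code used for the report; the function name and the dictionary keys (`date`, `cnt`, `temperature`, `humidity`, `wind_speed`) are hypothetical stand-ins for the real column names, and the two example records are made up.

```python
from collections import defaultdict
from statistics import mean


def aggregate_hourly_to_daily(hourly_rows):
    """Collapse hourly records into one record per date.

    Ride counts are summed; temperature, humidity, and wind speed are averaged.
    The dictionary keys used here are hypothetical column names.
    """
    rows_by_date = defaultdict(list)
    for row in hourly_rows:
        rows_by_date[row["date"]].append(row)

    daily_rows = []
    for date, rows in rows_by_date.items():
        daily_rows.append({
            "date": date,
            "cnt": sum(r["cnt"] for r in rows),
            "temperature": mean([r["temperature"] for r in rows]),
            "humidity": mean([r["humidity"] for r in rows]),
            "wind_speed": mean([r["wind_speed"] for r in rows]),
        })
    return daily_rows


# Two made-up hourly records for a single day, for illustration only.
example = [
    {"date": "1/1/2018", "cnt": 100, "temperature": 2.0, "humidity": 40.0, "wind_speed": 1.0},
    {"date": "1/1/2018", "cnt": 150, "temperature": 4.0, "humidity": 50.0, "wind_speed": 2.0},
]
print(aggregate_hourly_to_daily(example))
```

Summing the counts while averaging the weather variables keeps the daily record comparable to a dataset that already reports one row per day.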

In particular, we are interested in exploring the relationships between the total number of rental bike rides and temperature, holidays, seasons, wind speed, and humidity in each location. Specifically, we use the following variables in our project:

cnt: Number of rides recorded on a particular date.
Date: month-day-year (m/d/yyyy)
Seasons: Winter, Spring, Summer, Fall
Temperature: Temperature in Celsius
Wind speed: Wind speed in m/s
Humidity: Humidity in %
Holiday: Holiday/No holiday
Location: Seoul/D.C.

Research Questions

We attempt to address these three research questions with this data set:

Does the shared bike industry perform differently in Seoul and in D.C.?

Does the total count of shared bike rides in Seoul and in D.C. follow different distributions?

How do the weather conditions change over the year for both Seoul and D.C.?

How do the quantitative variables, including Temperature, Wind speed, and Humidity change over the year for both locations?
How do these changes influence people’s choice of riding in both locations?

What factors would affect people’s choice on using shared bikes?

What factors would affect people’s choice on using shared bikes, in Seoul and in D.C., respectively?
How do the same factors affect people’s use of shared bikes for different regions?

Question 1: Does the shared bike industry perform differently in Seoul and in D.C.?

To gain preliminary insights into how the variables are related in both locations, we first perform some Exploratory Data Analysis on the subset for each location.

We first look at how the number of rides is distributed in the two locations.

We can see that the distribution of the number of bike rides in Seoul is bimodal, with one peak at around 6000 and another at around 26000. The distribution for D.C. is unimodal, with a peak at around 5000 rides a day. D.C.’s ride count appears to follow a normal distribution, which we test later. The two vertical lines mark the mean of each distribution and show clearly that the number of bike rides in Seoul is much greater than in D.C. One reason might be that Seoul’s data were collected about six years later than D.C.’s, so people may have become more used to shared bikes; alternatively, people in Seoul may simply be fonder of them. Seoul also has a fair number of days with no bike rides at all, unlike D.C.

To test if the two distributions are the same, we use a KS test on them.

## 
##  Asymptotic two-sample Kolmogorov-Smirnov test
## 
## data:  Seoul$cnt and DC$cnt
## D = 0.65927, p-value < 2.2e-16
## alternative hypothesis: two-sided

The null hypothesis for this KS test is that the ride counts in Seoul and D.C. come from the same distribution. The p-value is much smaller than our significance level alpha, so we reject the null hypothesis and conclude that the two distributions differ. Based on the plot above, we hypothesize that D.C.’s “cnt” follows a normal distribution, and we perform a one-sample KS test to check.

## 
##  Asymptotic one-sample Kolmogorov-Smirnov test
## 
## data:  DC$cnt
## D = 0.047058, p-value = 0.07852
## alternative hypothesis: two-sided

The null hypothesis is that the distribution is normal. With a p-value greater than our alpha, we do not have enough evidence to reject the null hypothesis. The number of bike rides in D.C. is therefore consistent with a normal distribution, although failing to reject the null does not prove normality.
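The D statistic reported by the two-sample test above is the largest vertical gap between the two empirical cumulative distribution functions. The sketch below computes that statistic in plain Python, assuming two lists of daily counts; the function name is hypothetical, and the p-value calculation that a statistical package performs is deliberately left out.

```python
from bisect import bisect_right


def ks_two_sample_statistic(sample_a, sample_b):
    """Largest absolute gap between the empirical CDFs of two samples."""
    a_sorted = sorted(sample_a)
    b_sorted = sorted(sample_b)
    n_a, n_b = len(a_sorted), len(b_sorted)

    largest_gap = 0.0
    # Both empirical CDFs are step functions that only change at observed
    # values, so checking every observed value is enough to find the maximum.
    for value in a_sorted + b_sorted:
        cdf_a = bisect_right(a_sorted, value) / n_a
        cdf_b = bisect_right(b_sorted, value) / n_b
        largest_gap = max(largest_gap, abs(cdf_a - cdf_b))
    return largest_gap


# Made-up samples: fully separated samples give D = 1, identical ones give D = 0.
print(ks_two_sample_statistic([1, 2, 3], [4, 5, 6]))
print(ks_two_sample_statistic([1, 2, 3], [1, 2, 3]))
```

A value of D close to 1, as with the 0.65927 reported above, means that at some ride count the two locations' cumulative distributions are far apart.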

Question 2: How do the weather conditions change over the year for both Seoul and D.C.?

Time Series

We would like to explore how the number of rides changes over time and how the number of rides is influenced by or related to the weather conditions, such as temperature and wind speed. To achieve this, we create time series plots for the Seoul and D.C. subsets of the original shared bike data. For each location, we also create the time series plots for the number of rides, the temperature, and the wind speed across the date the data were collected.

The time series plots are plotted as below:

As we can see from the top-left plot, the number of rides in Seoul generally increases from January 2018 to July 2018 and then decreases until the end of 2018. To see why this trend occurs, we also plot the time series of temperature and wind speed. From the middle-left plot, we can see that the temperature in Seoul increases from January 2018 to August 2018 and then decreases until the end of the year. From the bottom-left plot, the oscillations of the wind speed in Seoul are large from January 2018 to April 2018, smaller from April 2018 to October 2018, and larger again until the end of the year. These observations suggest that as the weather in Seoul gets warmer and the wind speed decreases, people tend to ride more. The number of rides in Seoul peaks in July 2018, when the weather is warm and the wind speed is at its minimum. The number of rides then decreases as winter approaches and the wind speed increases.

The same reasoning applies to Washington D.C. As we can see from the right side of the time series plot, the overall trend of temperature follows the general trend of the number of rides in D.C., with an “M” shape. This makes sense, since people tend to ride more when the weather is warm and less when it is cold. The general trend of wind speed seems random, but the oscillations of wind speed are smaller around July 2011 and July 2012 than at other times. These periods correspond to the peaks in the number of rides in D.C., which is consistent with people riding more when the wind speed is lower.

Hence, from the time series plots above, we observe that people in Seoul and in D.C. seem to share the same riding habits: they tend to ride more as the weather gets warmer and the wind speed drops and, conversely, ride less as it gets colder and the wind speed rises. This addresses one aspect of our second research question.

Question 3: What factors would affect people’s choice on using shared bikes?

Finally, we want to determine what factors affect people’s use of shared bikes in different locations. In particular, we will examine how the variables Temperature, Seasons, and Holiday affect the total count of rental bikes, with respect to each location. Although there are three quantitative variables in our dataset, including Temperature, Humidity, and Wind speed, we decide to put our main focus on examining the variable Temperature as we think it would be the most relevant factor influencing people’s choice of renting a bike based on the time series plot we discussed in the previous research question.

Humidity and Wind speed

We first take a look at the distribution of Humidity and Wind speed. The National Weather Service gives comfort thresholds for dew point: a value less than or equal to 55 is “dry and comfortable”, between 55 and 65 is “becoming ‘sticky’ with muggy evenings”, and greater than or equal to 65 is “lots of moisture in the air, becoming oppressive.” We borrow these cutoffs for our humidity values and divide the humidity data into three subsets accordingly. The National Weather Service also suggests that a wind speed less than 3 is very soft and pleasant, so we divide the wind speed data accordingly.
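The grouping described above can be sketched as a pair of small functions. This is an illustration under assumptions, not the report's own code: the function names and the short band labels are hypothetical, a value of exactly 65 is placed in the top band, and the wind speed is assumed to be in the same unit as the dataset.

```python
def humidity_band(humidity):
    """Assign a humidity value to one of the three bands used in the report."""
    if humidity <= 55:
        return "dry and comfortable"
    if humidity < 65:
        return "becoming sticky"
    return "oppressive"


def wind_band(wind_speed):
    """Split wind speed at the cutoff of 3 used in the report."""
    return "soft" if wind_speed < 3 else "stronger"


# Made-up readings, for illustration only.
for humidity, wind_speed in [(40, 1.2), (60, 3.5), (80, 2.9)]:
    print(humidity, humidity_band(humidity), wind_speed, wind_band(wind_speed))
```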

We then present these boxplots to see how humidity and wind speed affect the number of bike rides.

We can see that both humidity and wind speed are associated with differences in the number of rides. This suggests that people tend to prefer a comfortable humidity and a softer wind when riding a shared bike. However, humidity and wind speed may themselves be associated with temperature, and most people would not decide whether to ride by checking the humidity and wind speed of that day. We therefore discuss the relationship between temperature and the total count of rental bikes in detail below.

Temperature

The scatterplots below show the relationship between temperature (in Celsius) and the number of total rental bikes, colored by seasons and faceted by holiday, for each location. As expected, Summer has the highest temperature, in general, while Winter has the lowest. Hence, we will focus on how ‘Temperature’, ‘Seasons’, and ‘Holiday’ influence the count of total rental bikes and discuss the differences of the performance in the two locations in our EDA.

From the scatterplots above, we can observe that rental behavior in Seoul differs from that in D.C.: the points are mostly concentrated in non-holidays for Seoul and in holidays for D.C. This indicates that riding habits may differ across cultures. For instance, people in Seoul may tend to rent a bike for commuting, while people in D.C. may use shared bikes for leisure trips.

In addition, we can observe that, except in Summer, temperature is positively associated with the total count of rental bikes. As an exception, for non-holidays in D.C. the plot also shows a negative association between temperature and the count of rental bikes in Fall; since the sample size for this category (86 entries) is relatively small compared to the others, we treat it as an anomaly. In general, consistent with the common-sense expectation that people prefer to ride in warmer weather, the number of rides tends to increase until approximately 25 degrees Celsius and to decrease as the temperature continues to rise (i.e., as the weather gets hot).

In conclusion, from the pair of scatterplots above, we observe that rental patterns differ by location: people in Seoul tend to rent a bike during non-holidays, while people in D.C. tend to ride during holidays. Also, the total count of rental bikes tends to increase as the weather gets warmer, and to decrease once the temperature rises past approximately 25 degrees Celsius. Hence, in general, temperature and the count of rental bikes are negatively associated in Summer and positively associated in Winter, Spring, and Fall.

We also perform a linear regression analysis between temperature and the number of bike rides in both locations as below:

Linear Regression Model for Seoul:

## 
## Call:
## lm(formula = Temperature.C. ~ cnt * Seasons, data = Seoul)
## 
## Residuals:
##      Min       1Q   Median       3Q      Max 
## -11.8530  -3.0770  -0.4167   3.5346  13.8031 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)        1.070e+01  1.163e+00   9.195  < 2e-16 ***
## cnt                1.742e-04  5.341e-05   3.261 0.001224 ** 
## SeasonsSpring     -4.636e+00  1.618e+00  -2.866 0.004415 ** 
## SeasonsSummer      1.732e+01  2.175e+00   7.963 2.59e-14 ***
## SeasonsWinter     -1.999e+01  1.972e+00 -10.139  < 2e-16 ***
## cnt:SeasonsSpring  2.247e-04  7.852e-05   2.861 0.004482 ** 
## cnt:SeasonsSummer -2.658e-04  8.818e-05  -3.014 0.002772 ** 
## cnt:SeasonsWinter  1.074e-03  2.842e-04   3.779 0.000186 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 4.762 on 338 degrees of freedom
## Multiple R-squared:  0.8208, Adjusted R-squared:  0.8171 
## F-statistic: 221.2 on 7 and 338 DF,  p-value: < 2.2e-16

From the linear regression model relating temperature and the total count of rental bikes in Seoul, we observe a negative association between temperature and the count of rental bikes in Summer, while they are positively associated in Spring and Winter, as expected. The regression model uses Fall as the reference category. The p-values of all interaction terms are much smaller than the significance level alpha, and the p-value of the F-test is less than 2.2e-16, which is nearly zero. We therefore reject the null hypothesis and have sufficient evidence to conclude that the association between temperature and the total count of rental bikes, given the season, is statistically significant in Seoul.
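Written out, the formula `Temperature.C. ~ cnt * Seasons` in the output above corresponds to a model of the following form. The symbols are notation introduced here for illustration and do not appear in the original analysis: beta_0 and beta_1 are the `(Intercept)` and `cnt` rows, gamma_s the `Seasons` rows, and delta_s the `cnt:Seasons` rows.

```latex
\[
\text{Temperature}_i
  = \beta_0 + \beta_1\,\text{cnt}_i
  + \sum_{s \in \{\text{Spring},\,\text{Summer},\,\text{Winter}\}}
      \bigl(\gamma_s + \delta_s\,\text{cnt}_i\bigr)\,
      \mathbf{1}[\text{Season}_i = s]
  + \varepsilon_i
\]
```

For the reference season (Fall) every indicator is zero, so the fitted line has intercept beta_0 and slope beta_1. For any other season s, the intercept is beta_0 + gamma_s and the slope is beta_1 + delta_s.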

Linear Regression Model for D.C.:

## 
## Call:
## lm(formula = Temperature.C. ~ cnt * Seasons, data = DC)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -9.7806 -2.5194 -0.0365  2.4788 12.3444 
## 
## Coefficients:
##                     Estimate Std. Error t value Pr(>|t|)    
## (Intercept)       12.4481444  0.8189425  15.200  < 2e-16 ***
## cnt                0.0010344  0.0001630   6.344 3.94e-10 ***
## SeasonsSpring      2.8133950  1.1781920   2.388 0.017201 *  
## SeasonsSummer     16.8791307  1.3526344  12.479  < 2e-16 ***
## SeasonsWinter     -5.4838917  1.0035233  -5.465 6.39e-08 ***
## cnt:SeasonsSpring  0.0003795  0.0002289   1.658 0.097767 .  
## cnt:SeasonsSummer -0.0010997  0.0002464  -4.464 9.33e-06 ***
## cnt:SeasonsWinter  0.0009791  0.0002552   3.837 0.000136 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 3.687 on 723 degrees of freedom
## Multiple R-squared:  0.761,  Adjusted R-squared:  0.7587 
## F-statistic: 328.9 on 7 and 723 DF,  p-value: < 2.2e-16

The same logic and conclusions apply to the linear regression model relating temperature and the total count of rental bikes in D.C. The p-value of the Spring interaction term (0.0978) is somewhat larger, so we use a significance level of alpha = 0.1 in our analysis. Since the p-values are all smaller than 0.1 and the p-value of the F-test is less than 2.2e-16, which is nearly zero, we reject the null hypothesis and have sufficient evidence to conclude that the association between temperature and the total count of rental bikes, given the season, is statistically significant in D.C.

Therefore, we can write down the linear regression models for each location:

Seoul:

Fall: Temperature = 1.070e+01 + 1.742e-04 * Count
Spring: Temperature = (1.070e+01 + (-4.636e+00)) + (1.742e-04 + 2.247e-04) * Count
Summer: Temperature = (1.070e+01 + 1.732e+01) + (1.742e-04 + (-2.658e-04)) * Count
Winter: Temperature = (1.070e+01 + (-1.999e+01)) + (1.742e-04 + 1.074e-03) * Count

Visualization of the linear regression model:

## Conditions used in construction of plot
## cnt: 16610.5

This visualization of the linear regression model for Seoul illustrates the relationship between Temperature and Count given Seasons. All four regression lines appear nearly “flat”, which is consistent with the fairly small slopes in our fitted model.

D.C.:

Fall: Temperature = 12.4481444 + 0.0010344 * Count
Spring: Temperature = (12.4481444 + 2.8133950) + (0.0010344 + 0.0003795) * Count
Summer: Temperature = (12.4481444 + 16.8791307) + (0.0010344 + (-0.0010997)) * Count
Winter: Temperature = (12.4481444 + (-5.4838917)) + (0.0010344 + 0.0009791) * Count
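The per-season lines can be checked by adding each season's adjustment to the reference (Fall) intercept and slope from the coefficient tables. The sketch below does that arithmetic in plain Python; the coefficient values are copied from the two regression outputs, while the function and variable names are hypothetical.

```python
def season_lines(coef, seasons=("Spring", "Summer", "Winter"), reference="Fall"):
    """Return {season: (intercept, slope)} for a `Temperature ~ cnt * Seasons` fit."""
    lines = {reference: (coef["(Intercept)"], coef["cnt"])}
    for season in seasons:
        intercept = coef["(Intercept)"] + coef[f"Seasons{season}"]
        slope = coef["cnt"] + coef[f"cnt:Seasons{season}"]
        lines[season] = (intercept, slope)
    return lines


# Estimates copied from the regression output for each location.
seoul_coef = {
    "(Intercept)": 1.070e+01, "cnt": 1.742e-04,
    "SeasonsSpring": -4.636e+00, "SeasonsSummer": 1.732e+01, "SeasonsWinter": -1.999e+01,
    "cnt:SeasonsSpring": 2.247e-04, "cnt:SeasonsSummer": -2.658e-04, "cnt:SeasonsWinter": 1.074e-03,
}
dc_coef = {
    "(Intercept)": 12.4481444, "cnt": 0.0010344,
    "SeasonsSpring": 2.8133950, "SeasonsSummer": 16.8791307, "SeasonsWinter": -5.4838917,
    "cnt:SeasonsSpring": 0.0003795, "cnt:SeasonsSummer": -0.0010997, "cnt:SeasonsWinter": 0.0009791,
}

for name, coef in [("Seoul", seoul_coef), ("D.C.", dc_coef)]:
    for season, (intercept, slope) in season_lines(coef).items():
        print(f"{name} {season}: Temperature = {intercept:.4f} + {slope:.7f} * Count")
```

In both locations the Summer slope comes out slightly negative and the other three slopes positive, which matches the signs discussed in the text.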

Visualization of the linear regression model:

## Conditions used in construction of plot
## cnt: 4548

This visualization of the linear regression model for D.C. illustrates the relationship between Temperature and Count given Seasons. As in the plot for Seoul, all four regression lines appear nearly “flat”, which is again consistent with the fairly small slopes in our fitted model.

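With these linear regression models, we can estimate the temperature associated with a given number of rides in each season. Because temperature is the response variable in the fitted models, predicting the number of rides from temperature would require refitting the model with cnt as the response.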

Holiday

We would also like to know how people in Seoul and in D.C. ride bikes during holidays and non-holidays. Thus, we perform another EDA on the variable ‘Holiday’ and the number of rides.

Interestingly, in Seoul the count of bike rides is higher on non-holidays than on holidays, whereas in D.C. the count is higher on holidays. This may suggest that people in these two regions use shared bikes for quite different reasons: people in D.C. tend to use them on holidays, for recreation, while people in Seoul mainly use the shared bike system for daily transport, for example commuting to work.

To test whether this difference is significant in each location, we use Welch’s two-sample t-test, which does not assume equal variances in the two groups.

T-test results for Seoul:

## 
##  Welch Two Sample t-test
## 
## data:  SeoulHoliday$cnt and SeoulNonHoliday$cnt
## t = -1.8536, df = 18.797, p-value = 0.07956
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##  -10135.7906    618.5569
## sample estimates:
## mean of x mean of y 
##  11994.17  16752.78

T-test results for D.C.:

## 
##  Welch Two Sample t-test
## 
## data:  DCHoliday$cnt and DCNonHoliday$cnt
## t = 3.3383, df = 19.162, p-value = 0.003423
## alternative hypothesis: true difference in means is not equal to 0
## 95 percent confidence interval:
##   513.7204 2237.9724
## sample estimates:
## mean of x mean of y 
##  4540.110  3164.263

The interpretation of these two tests depends on the alpha we choose. With alpha = 0.10, both differences are significant: D.C.’s bike rides are higher on holidays, while Seoul’s are higher on non-holidays. With alpha = 0.05, only the D.C. difference (p = 0.003423) is significant, since the Seoul p-value is 0.07956. This gives us insight into how the two groups of people use shared bikes differently.
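The t statistic and the fractional degrees of freedom in the two Welch outputs above come from the group means, sample variances, and group sizes. A minimal sketch of that calculation in plain Python follows; the function name and the two small samples are made up for illustration, and the p-value lookup is omitted.

```python
from math import sqrt
from statistics import mean, variance


def welch_t(x, y):
    """Welch's t statistic and Welch-Satterthwaite degrees of freedom."""
    n_x, n_y = len(x), len(y)
    # Squared standard error contributed by each group (sample variance / n).
    se2_x = variance(x) / n_x
    se2_y = variance(y) / n_y

    t = (mean(x) - mean(y)) / sqrt(se2_x + se2_y)
    df = (se2_x + se2_y) ** 2 / (se2_x ** 2 / (n_x - 1) + se2_y ** 2 / (n_y - 1))
    return t, df


# Made-up samples with unequal variances, for illustration only.
t_stat, deg_free = welch_t([1, 2, 3, 4, 5], [2, 4, 6, 8, 10])
print(round(t_stat, 4), round(deg_free, 4))
```

Because each group keeps its own variance term, the degrees of freedom are generally not an integer, which is why the outputs above report values such as 18.797 and 19.162.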

Therefore, based on the plots and EDA above, we can conclude that people in both locations tend to prefer traveling by bike in a pleasant climate (i.e., comfortable humidity, a softer wind, and warm weather). We also notice that people in Seoul tend to rent bikes on non-holidays, while people in D.C. prefer bike trips on holidays. This suggests that cultural differences may also affect when people choose to use rental bikes, which future surveys could explore in order to improve the bike-sharing industry in different regions.

Conclusion

From our analysis above, we found that people in Seoul and in D.C. exhibit different patterns in using shared bike systems. People in Seoul tend to ride more on non-holidays, while people in D.C. tend to ride more on holidays. This might suggest that people in Seoul ride shared bikes more for getting to work, whereas people in D.C. ride them more for leisure.

We also found that weather conditions play a significant role in how frequently people ride. For example, in both Seoul and D.C., people tend to ride more as the weather gets warmer and the wind speed drops. We also fit a linear regression model relating temperature and the number of rides, by season, for both Seoul and D.C., which can be used for estimation.

We acknowledge that there is a lot of room for improvement in our research. Since we combine two entirely separate datasets for Seoul and D.C. into one merged dataset, the sample sizes for the two locations are unequal. The original D.C. dataset includes two years of data, from 2011 to 2012, leading to a larger sample size than Seoul’s. The Seoul dataset was also collected in 2018, which leaves a gap of approximately six years between the two datasets. However, we still think it is meaningful to compare and contrast the shared bike systems in Seoul and in D.C., and we did make some notable observations. In future research, we would like to gather datasets that include more variables, such as feedback from customers, which could be useful for text analysis.