Load Dataset

Data Description and Overview:

Why do transit-infrastructure projects in New York cost 20 times more per kilometer than in Seoul? Since New York is one of the most developed cities in the world, this outcome is surprising. What might explain such a large difference? We investigate this question using the dataset transit_cost. It includes hundreds of transit projects, spans more than 50 countries, and in total covers more than 11,000 km of urban rail built since the late 1990s.

Overall Theme:

The goal of this report is to provide statistical support for strategies that deliver cost-efficient, timely, and high-capacity transit projects in different countries. We approach the problem from three perspectives:

1. In the first part, we explore the factors that relate to the construction period, that is, the time efficiency of transit projects.
2. In the second part, we examine whether a country’s economic inflation is related to its transit projects. More specifically, we narrow the exploration down to the relationship between the number of stations and the purchasing power parity rate of each city on average.
3. In the last part, we explore relationships in higher-dimensional space through clustering, and we try to identify which continuous variables are deterministic and how they relate to one another.

Our dataset transit_cost has 544 transit projects and 20 features. Some features indicate the location of a transit project, such as country, city, and line; others describe the project itself, including start_year, end_year, rr, length, tunnel_per, tunnel, and stations; and others provide economic context, such as cost_km_millions, cost, currency, ppp_rate, and real_cost.

In the following code chunk, we show all 20 variables in transit_cost:

## [1] 544  20

##  [1] "e"                "country"          "city"             "line"            
##  [5] "start_year"       "end_year"         "rr"               "length"          
##  [9] "tunnel_per"       "tunnel"           "stations"         "source1"         
## [13] "cost"             "currency"         "year"             "ppp_rate"        
## [17] "real_cost"        "cost_km_millions" "source2"          "reference"

Definitions of Variables:

- rr: whether the transit project is a railroad (1 = railroad)
- tunnel_per: share of the line’s length that is in tunnel (tunnel divided by length)
- ppp_rate: purchasing power parity (PPP) rate, based on the midpoint of construction
- real_cost: real cost of the transit project in millions of USD
- cost_km_millions: cost per km in millions of USD
- stations: number of stations where passengers can board/leave per location (city)

Exploratory Data Analysis:

We begin with some exploratory data analysis. The following code chunk generates the number of transit projects in each country. China has by far the highest count, 253, while most other countries have about 5 or fewer.

## 
##  AE  AR  AT  AU  BD  BE  BG  BH  BR  CA  CH  CL  CN  CZ  DE  DK  EC  EG  ES  FI 
##   3   1   3   4   3   1   6   2   3  10   3   2 253   1  13   1   1   7  15   2 
##  FR  GR  HU  ID  IL  IN  IR  IT  JP  KR  KW  MX  MY  NL  NO  NZ  PA  PE  PH  PK 
##  15   4   1   1   2  29   3  11  15   6   1   2   2   1   2   1   3   1   4   1 
##  PL  PT  QA  RO  RU  SA  SE  SG  TH  TR  TW  UA  UK  US  UZ  VN 
##   4   2   1   1   5   9   5   3   8  20  12   3   3  13   4   5
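A per-country count like the one above can be produced with base R's table(). The sketch below is only illustrative: the country vector is a small hypothetical stand-in, not the real transit_cost column.

```r
# Hypothetical stand-in for the `country` column (not the real data)
country <- c("CN", "CN", "CN", "US", "US", "KR", "IN", "IN")

# Count projects per country code
counts <- table(country)

# Sort from most to fewest projects
sorted_counts <- sort(counts, decreasing = TRUE)
print(sorted_counts)

# Country code with the most projects
top_country <- names(sorted_counts)[1]
```

Sorting the table makes a dominant country easy to spot, which is harder in the alphabetical layout.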

Since the goal of constructing transit lines is to build connections between different regions, plotting the values on a map is a good way to visualize and convey the distribution.

In the following code chunk, we plot the average cost per kilometer (i.e., the variable cost_km_millions) on the world map.

According to the graph, the southern parts of North America have the highest cost per kilometer of construction (about 900 million per km or higher), while some European countries have the lowest cost (approximately below 150 million per km). Costs in the regions adjacent to China, in Russia, and in South America were also relatively low, approximately below 600 million per km.

The descriptive map shows a difference in average cost per kilometer between countries. In the rest of our report, we want to find out whether any variables contribute to such a difference.

Question 1: What factors determine the Construction Period?

For this section of the project, we study what predicts the construction time of the proposed lines. It is important to know what factors might affect the construction time, because shortening it helps reduce the cost of building transit throughout the cities. In the transit cost dataset, the variables that might be correlated with the construction time are the length of the proposed line, the number of stations, and whether the line is a railroad. We would like to explore whether these variables are correlated with the construction time or whether factors outside this dataset are affecting it.

We first converted start_year and end_year into numeric variables and then turned the construction period into a categorical variable. The graph above shows the distribution of the construction period. For most cities, the proposed lines were built within six years. There is a peak at four to six years: more than 150 of the proposed lines were built in that period. The least common construction period is one to two years, with fewer than 25 lines. The distribution appears right-skewed.
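The conversion and binning step can be sketched as follows. The rows, break points, and category labels are assumptions chosen for illustration; only the column names start_year and end_year and the name construcPeriod come from the report's output.

```r
# Hypothetical rows; years are stored as text, and one entry is not a year
projects <- data.frame(
  start_year = c("2015", "2010", "2018", "2005", "unknown"),
  end_year   = c("2016", "2015", "2025", "2020", "2020"),
  stringsAsFactors = FALSE
)

# Convert to numeric; entries that are not plain years become NA
projects$start_year <- suppressWarnings(as.numeric(projects$start_year))
projects$end_year   <- suppressWarnings(as.numeric(projects$end_year))

# Construction period in years
projects$years <- projects$end_year - projects$start_year

# Bin into categories (break points are an assumption, not the report's)
projects$construcPeriod <- cut(
  projects$years,
  breaks = c(0, 2, 4, 6, 8, 10, Inf),
  labels = c("1-2 years", "2-4 years", "4-6 years",
             "6-8 years", "8-10 years", "10+ years")
)

# Distribution of the categories, including rows that could not be binned
print(table(projects$construcPeriod, useNA = "ifany"))
```

Rows whose years cannot be parsed end up as NA rather than being silently assigned to a bin.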

The graph above shows the average line length for each construction period. The average length does not increase steadily with the construction period. From one to six years, the average length increases steadily as the construction period increases. However, the average length drops for 8 to 10 years and then increases again for more than 10 years. We suspect that other factors, such as weather conditions or a lack of labor, also affect the construction period.

For the mosaic plot above, we chose the construction periods from one to six years and the lengths from 0 to 60 km because they account for the highest proportion of all the categories. For a construction period of one to two years, there is a relatively high number of lines of 0-20 km and a relatively low number of lines of 20-40 km. This makes sense because shorter lines tend to have a shorter construction period.

## 
## 100 stations  20 stations  40 stations  60 stations  80 stations 
##            4          410          104            6            5

The graph above compares the number of stations with the construction period. Most of the points are clustered in the bottom left corner, meaning most cities have around 0 to 10 stations with a construction time of about 5 years. We do not see any clear correlation between stations and construction period, so the plot gives no evidence that the construction period depends on the number of stations.

## `summarise()` has grouped output by 'construcPeriod'. You can override using
## the `.groups` argument.
## geom_path: Each group consists of only one observation. Do you need to adjust
## the group aesthetic?

For the graph above, we compared the average length of the transit lines across the construction periods, separately for lines with and without railroad. We see similar trends for both groups. The main difference is that lines with railroad have a shorter average length for each construction period, which suggests that a given length of railroad takes longer to construct.

Question 2: What factors are related to economic inflation?

Topic Introduction

For this section of our project, we take a look at economic inflation. To examine it, we use the purchasing power parity rate from the transit dataset, which is based on the midpoint of construction. Purchasing power parity (PPP) is an economic theory that compares different countries’ currencies through a “basket of goods” approach, similar to the Consumer Price Index (CPI). The PPP rate allows comparison of economic productivity and standards of living between countries. With these variables in hand, we try to see whether ppp_rate, which we use as our proxy for economic inflation, is correlated with the transit system, as measured by the number of stations (stations). With that being said, we load our dataset and functions below.

Because the dataset has many observations but very few cities for large countries such as the United States and the United Kingdom, we filter the dataset by narrowing down to one country with multiple cities and stations. For example, the United States has only six cities with stations, even though it is roughly the same size as China. Filtering through all the countries, China is the only large country with numerous cities available for examination. To work around this weakness, we divide the dataset into Chinese cities and non-Chinese cities, roughly a 50-50 split of the data (as the dim() output below confirms).

Number of Stations

## [1] 253  25

## [1] 284  25
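The split into the two subsets can be sketched with logical indexing. The data frame below is a hypothetical toy example; only the names transit_CN and transit_other appear in the report's regression output.

```r
# Hypothetical toy data (not the real transit_cost rows)
transit <- data.frame(
  country  = c("CN", "CN", "US", "KR", "CN", "FR"),
  stations = c(12, 20, 3, 15, 8, 10),
  stringsAsFactors = FALSE
)

# Chinese projects versus all other projects
transit_CN    <- transit[transit$country == "CN", ]
transit_other <- transit[transit$country != "CN", ]

# Rows and columns of each subset
print(dim(transit_CN))
print(dim(transit_other))
```

Rows with a missing country code would need separate handling, since comparisons with NA do not evaluate to TRUE or FALSE.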

Now we fit and plot two regression models, one for Chinese cities and one for other cities. For each model, we define beta0 and beta1 as follows:

beta0 = The estimated PPP rate on average when there are no stations in a given city.

beta1 = The estimated change in PPP rate on average for each additional transit station in a given city.
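Written out, the simple linear regression implied by these definitions, together with the usual two-sided test on the slope, would be as follows. The index i (one transit project) and the error term are notation added here for illustration.

```latex
\begin{align*}
\text{ppp\_rate}_i &= \beta_0 + \beta_1 \,\text{stations}_i + \varepsilon_i \\
H_0 &: \beta_1 = 0 \qquad \text{versus} \qquad H_1 : \beta_1 \neq 0
\end{align*}
```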

Chinese Cities

## `geom_smooth()` using formula 'y ~ x'

From the plot above, we observe a slightly increasing linear trend between the number of stations and the PPP rate. However, many points lie far outside the 99% confidence band around the fitted line. Therefore, we examine the summary of the linear regression model to see whether the relationship is significant.

## 
## Call:
## lm(formula = ppp_rate ~ stations, data = transit_CN)
## 
## Residuals:
##       Min        1Q    Median        3Q       Max 
## -0.032213 -0.020219 -0.016021  0.007734  0.116079 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept) 0.2534716  0.0038323  66.140   <2e-16 ***
## stations    0.0004498  0.0002291   1.963   0.0507 .  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 0.03304 on 247 degrees of freedom
##   (4 observations deleted due to missingness)
## Multiple R-squared:  0.01537,    Adjusted R-squared:  0.01138 
## F-statistic: 3.855 on 1 and 247 DF,  p-value: 0.05072

From the regression summary above, the p-value for beta1, the slope of the regression line, is 0.0507, which is above our usual alpha level of 0.05. Therefore, we fail to reject the null hypothesis that there is no linear relationship between ppp_rate and the number of stations in Chinese cities at the 5% significance level.
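The decision rule used here can be sketched in R. The toy data frame is hypothetical and serves only to show how the slope and its p-value are read from summary(lm(...)). The last lines apply the 0.05 threshold to the two slope p-values reported in this section.

```r
# Hypothetical toy data (not the real transit_cost values)
toy <- data.frame(
  stations = c(2, 5, 8, 10, 14, 18, 22, 30),
  ppp_rate = c(1.30, 1.10, 1.20, 0.90, 1.00, 0.80, 0.85, 0.60)
)

# Fit the simple linear regression of PPP rate on number of stations
fit <- lm(ppp_rate ~ stations, data = toy)

# Pull the slope estimate, its p-value, and R-squared from the summary
coefs   <- summary(fit)$coefficients
slope   <- coefs["stations", "Estimate"]
p_value <- coefs["stations", "Pr(>|t|)"]
r2      <- summary(fit)$r.squared

# Reject H0: beta1 = 0 when the p-value is below alpha
alpha <- 0.05
reject_null <- p_value < alpha

# The same rule applied to the slope p-values reported in this section
reported  <- c(china = 0.0507, other = 0.0249)
decisions <- reported < alpha
print(decisions)
```

Reporting R-squared next to the p-value helps separate statistical significance from the strength of the relationship.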

Other Cities

## `geom_smooth()` using formula 'y ~ x'

From the plot above, we observe a decreasing linear trend between the number of stations and the PPP rate for cities outside of China. Many points lie far outside the 99% confidence band around the fitted line, and most of these points are below the fitted regression line. Therefore, we again examine the summary of the linear regression model to see whether the relationship is significant.

## 
## Call:
## lm(formula = ppp_rate ~ stations, data = transit_other)
## 
## Residuals:
##     Min      1Q  Median      3Q     Max 
## -1.1077 -0.9051 -0.0912  0.2739  4.0516 
## 
## Coefficients:
##              Estimate Std. Error t value Pr(>|t|)    
## (Intercept)  1.116389   0.080601  13.851   <2e-16 ***
## stations    -0.008397   0.003724  -2.255   0.0249 *  
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## Residual standard error: 1.041 on 276 degrees of freedom
##   (6 observations deleted due to missingness)
## Multiple R-squared:  0.01809,    Adjusted R-squared:  0.01453 
## F-statistic: 5.086 on 1 and 276 DF,  p-value: 0.02491

From the regression summary above, the p-value for beta1, the slope of the regression line, is 0.0249, which is less than our usual alpha level of 0.05. Therefore, we reject the null hypothesis that there is no linear relationship between ppp_rate and the number of stations in other cities at the 5% significance level.

Summary

To sum up our findings for this question, there is a statistically significant negative relationship between the number of stations and the PPP rate for cities outside of China. At the 5% significance level, having more transit stations in a city is associated with a lower PPP rate (lower inflation rate) on average. The relationship is weak, however, with an R-squared of about 0.018. For Chinese cities, we cannot reach a conclusive statement about the relationship between these two variables.

Question 3: How do the transit lines distribute in higher-dimensional space?

Topic Introduction

From the above EDA section, one can observe some differences and similarities between different groups of urban transits. For example, the North American countries, i.e., the United States and Canada, are building the most expensive transits per kilometer in the world; the United Kingdom retains the largest number of city transits that are connected with railroad. These clues from the exploratory analysis suggest that there may be clusters of city transits with similar characteristics and that some features of the transits may be tightly correlated with one another.

In this section, we are going to explore the potential clusters of these transits, the hidden correlation behind the quantitative variables, and the deterministic effect of each of the quantitative variables.

Data Clean-up and Variable Selection

We first remove the incomplete cases in the dataset. Since this dataset was found in the wild, it is not as clean and tidy as some others we used in class. Therefore, we remove the rows with NA’s or missing values so that measurements are consistent in the later analysis.

For the PCA and clustering analysis, we use only the quantitative variables from the original transit cost dataset and therefore drop the columns e, country, line, city, rr, source1, currency, source2, and reference. Since the column cost represents the cost of the individual transit in local currency, which cannot be compared across the entire dataset, we drop this column as well. We also have to transform some character columns to numeric. Below is a segment of the dataset we use for the rest of this report, which retains 427 observations of 10 quantitative variables.

## # A tibble: 6 × 10
##   start_year end_year length tunnel_per tunnel stations  year ppp_rate real_cost
##        <dbl>    <dbl>  <dbl>      <dbl>  <dbl>    <dbl> <dbl>    <dbl>     <dbl>
## 1       2020     2025    5.7      0.877    5          6  2018     0.84     2377.
## 2       2009     2017    8.6      1        8.6        6  2013     0.81     2592 
## 3       2020     2030    7.8      1        7.8        3  2018     0.84     4620 
## 4       2020     2030   15.5      0.57     8.8       15  2019     0.84     7201.
## 5       2020     2030    7.4      1        7.4        6  2020     0.84     4704 
## 6       2003     2018    9.7      0.73     7.1        8  2009     1.3      4030 
## # … with 1 more variable: cost_km_millions <dbl>
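A minimal sketch of this clean-up in base R is given below, assuming a hypothetical raw data frame. The rows are invented for illustration; transit_cost_quant is the name used later in this report.

```r
# Hypothetical raw rows with text columns and missing values
raw <- data.frame(
  country    = c("AA", "BB", "CC", "DD"),
  start_year = c("2020", "2009", "unknown", "2015"),
  length     = c(5.0, 8.0, 3.0, NA),
  real_cost  = c("2000", "2500.5", "1000", "500"),
  stringsAsFactors = FALSE
)

# Keep only the quantitative columns
quant_cols <- c("start_year", "length", "real_cost")
transit_cost_quant <- raw[, quant_cols]

# Convert every column to numeric; unparseable text becomes NA
transit_cost_quant[] <- lapply(
  transit_cost_quant,
  function(col) suppressWarnings(as.numeric(col))
)

# Drop rows with any missing value
keep <- complete.cases(transit_cost_quant)
transit_cost_quant <- transit_cost_quant[keep, ]
print(transit_cost_quant)
```

Converting before filtering matters: text entries that are not numbers only turn into NA after as.numeric(), so filtering first would let them through.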

Complete Linkage Clustering and Dendrogram

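We first have to standardize the columns of our cleaned-up dataset. Here are the first few rows of the resulting dataset. The columns appear to have been rescaled without mean-centering (each value is the original divided by a column-specific constant), which is why the year columns remain near 300 to 350. Centering would not change the pairwise distances used in the clustering.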

##   start_year end_year    length tunnel_per    tunnel  stations     year
## 1   301.9715 321.5543 0.2589025   2.403011 0.3203957 0.4267189 350.2417
## 2   300.3271 320.2840 0.3906249   2.739410 0.5510806 0.4267189 349.3739
## 3   301.9715 322.3483 0.3542877   2.739410 0.4998173 0.2133595 350.2417
## 4   301.9715 322.3483 0.7040332   1.561464 0.5638964 1.0667973 350.4152
## 5   301.9715 322.3483 0.3361191   2.739410 0.4741857 0.4267189 350.5888
## 6   299.4302 320.4428 0.4405885   1.999769 0.4549619 0.5689585 348.6796
##    ppp_rate real_cost cost_km_millions
## 1 0.9868948 0.4898524         1.488647
## 2 0.9516486 0.5341147         1.075815
## 3 0.9868948 0.9520100         2.114211
## 4 0.9868948 1.4839240         1.658370
## 5 0.9868948 0.9693193         2.269011
## 6 1.5273372 0.8304330         1.482977
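The large values in the year columns are what one would get from dividing each column by its standard deviation without subtracting the mean. Whether the report did exactly this is an assumption; the sketch below uses a hypothetical toy table and shows that such scaling gives the same Euclidean distances as full standardization with scale().

```r
# Hypothetical toy table with one year-like and one length-like column
x <- data.frame(
  start_year = c(2020, 2009, 2003, 2015),
  length     = c(5.7, 8.6, 9.7, 15.5)
)

# Scale only: divide each column by its standard deviation (no centering)
scaled_only <- as.data.frame(lapply(x, function(col) col / sd(col)))

# Full standardization: subtract the mean, then divide by the sd
standardized <- as.data.frame(scale(x))

# Pairwise Euclidean distances are identical under both versions
d_scaled <- dist(scaled_only)
d_std    <- dist(standardized)
print(all.equal(as.numeric(d_scaled), as.numeric(d_std)))
```

Because hierarchical clustering only uses the distance matrix, either version would yield the same dendrogram.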

Next up, we calculate the distance matrix that will be necessary for the clustering algorithm.

Using this distance matrix, we apply the clustering algorithm to the transit dataset of continuous variables. Unlike single linkage, where a long, continuous chain of data points may be grouped into one cluster, complete linkage tends to produce compact clusters and is less prone to this chaining effect.
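The distance and clustering steps can be sketched with dist(), hclust(), and cutree() from base R. The points below are hypothetical, and the choice of k = 2 is only for this toy example.

```r
# Hypothetical 2-D points forming two tight groups
pts <- rbind(
  c(0.0, 0.0), c(0.2, 0.1), c(0.1, 0.3),  # group near the origin
  c(5.0, 5.0), c(5.2, 4.9), c(4.8, 5.1)   # group near (5, 5)
)

# Euclidean distance matrix between all pairs of points
d <- dist(pts)

# Complete linkage: distance between two clusters is the largest
# pairwise distance between their members
hc <- hclust(d, method = "complete")

# Cut the tree into k clusters (k = 2 only for this toy example)
clusters <- cutree(hc, k = 2)
print(clusters)

# plot(hc) would draw the dendrogram
```

With method = "single" the same calls would merge clusters by their closest members instead, which is what makes chaining possible.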

Judging by the shape of this naive clustering dendrogram, the data points in higher-dimensional space are distributed such that a large amount of data forms a dense center, with some smaller clusters on the outskirts.

Let’s also explore how well the clusters found by our algorithm align with the transits’ locations on the various continents. Here, we use the countrycode library to add a continent label to our dataset. By coloring the x-axis labels by continent, we can see how our clustering matches the continent classification. Since there are five unique continents (i.e., Americas, Europe, Asia, Oceania, and Africa) in the dataset, and since an overwhelming share of the data comes from China, we adopt 6 clusters: the five continents, with China treated separately.

[NOTE] Color scheme adopted in the x-axis label:

Americas: Green;
Europe: Grey;
China: Red;
Oceania: Blue;
Africa: Black;
Asia: Yellow;

Based on our complete-linkage unsupervised clustering model, the continent labels align roughly with the clusters to some degree. For example, the majority of Chinese transits (denoted in red), together with other Asian transits, fall in the large central cluster (denoted in purple in the dendrogram). A large number of European transit lines (denoted in grey) are also in the central cluster. On the other hand, some transits from the Americas (denoted in green) form a distinct cluster of their own on the left side of the dendrogram, far from the major cluster in purple.

Analysis of Deterministic Continuous Variables and Their Relationships

Since we are processing high-dimensional data, dimension reduction is necessary to answer the question of which variables are more deterministic in differentiating between transits. For this section, a traditional dimension reduction strategy, Principal Component Analysis, is applied to the transit_cost_quant dataset.

## Importance of components:
##                           PC1    PC2    PC3     PC4     PC5     PC6     PC7
## Standard deviation     1.8745 1.6198 1.1338 0.99482 0.95532 0.50662 0.37857
## Proportion of Variance 0.3514 0.2624 0.1285 0.09897 0.09126 0.02567 0.01433
## Cumulative Proportion  0.3514 0.6138 0.7423 0.84126 0.93252 0.95819 0.97252
##                            PC8     PC9    PC10
## Standard deviation     0.31695 0.31223 0.27720
## Proportion of Variance 0.01005 0.00975 0.00768
## Cumulative Proportion  0.98257 0.99232 1.00000

##             PC1         PC2        PC3        PC4        PC5         PC6
## [1,]  0.1362006  1.93672742  0.1661943 -0.2304216  0.1039463 -0.06142024
## [2,]  1.1047098  0.01308087  0.7287169 -0.1502038 -0.1814996 -0.13760613
## [3,] -0.3838855  2.37628174  0.7106566 -0.7054536  0.1254544  0.14104594
## [4,] -1.1822304  1.74324182 -0.3974263 -0.6359402  0.6977871  0.07303859
## [5,] -0.5628885  2.51858935  0.7025938 -0.8223737  0.2184169 -0.08677346
## [6,]  1.3682331 -0.77600285  0.6688533 -0.6759151  0.7573926  0.03511950
##              PC7        PC8        PC9        PC10
## [1,] -0.11238999  0.3905153 0.09389801 -0.04247573
## [2,]  0.17562227 -0.1329226 0.17052833  0.08733422
## [3,] -0.07827806  0.4604120 0.72779892 -0.03044298
## [4,]  0.09731863  0.3516388 0.62298219 -0.35689730
## [5,] -0.05262706  0.2595876 0.57600944 -0.05616311
## [6,]  0.09030943 -0.1859597 0.87690293 -0.20122048

As one can expect, the amount of variation in the original dataset that can be explained by each PC diminishes for the later PC’s. With PC1 accounting for 35.14% of the variance by itself, PC2 for 26.24%, and PC3 for 12.85%, further analysis only needs to incorporate the first 2 to 3 PC’s. This can be further demonstrated by the elbow plot below:

As one can tell, the elbow occurs approximately between dimension 2 and dimension 3. However, for further analysis, only the first 2 PC’s will be used.
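The proportions of variance in the PCA summary follow directly from the reported standard deviations: each PC's share is its squared standard deviation divided by the sum of all squared standard deviations. The sketch below reproduces that arithmetic from the printed values; the variable names are our own.

```r
# Standard deviations of the 10 PCs, copied from the summary above
sdev <- c(1.8745, 1.6198, 1.1338, 0.99482, 0.95532,
          0.50662, 0.37857, 0.31695, 0.31223, 0.27720)

# Variance of each PC is the squared standard deviation
pc_var <- sdev^2

# Share of total variance per PC, and the running total
prop_var <- pc_var / sum(pc_var)
cum_var  <- cumsum(prop_var)

# First three PCs, rounded as in the summary table
print(round(prop_var[1:3], 4))
print(round(cum_var[1:3], 4))
```

The squared standard deviations sum to approximately 10, the number of variables, as expected when PCA is run on variables scaled to unit variance.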

Are these continuous variables correlated with one another? And how strongly do they distinguish the various transits and explain the variation among them? To answer these questions, a biplot is drawn below with the package factoextra.

There are three clearly defined groups of variables in the biplot. The time variables start_year, end_year, and year form a tight group pointing to the second quadrant. Construction scale variables such as real_cost, length, tunnel, and stations form another closely bound group pointing to the third quadrant. Both of these groups contain vectors of large magnitude, signifying strong correlation with PC1 and PC2. The third group consists of cost_km_millions, ppp_rate, and tunnel_per, which are weaker vectors lying in the vicinity of the origin.

There are a few key observations from the plot. The construction scale variables are all strongly positively correlated with each other, so longer transit lines tend to have more stations, more tunnels, and higher cost. The construction scale variables also have large magnitude, so they are deterministic variables for the variance in the original dataset. This property is shared by the time variables. Another key observation is that the time vectors and the construction scale vectors are roughly perpendicular to each other, meaning that they are approximately uncorrelated. The construction scale vectors point in the opposite direction from ppp_rate, which hints that economies with higher purchasing power tend to do less infrastructure construction. Since ppp_rate is a weak vector in this plot, this reading is tentative.

Future Work Discussion:

Conclusion:

From the analysis above, we see that when considering the cost in time, we might want to take the length of transit projects into account rather than the number of stations. However, the number of stations still matters: for cities outside of China, there is a significant negative correlation between the number of stations and the country’s purchasing power parity rate, which we relate to the economic inflation rate of the country or city. (We also note that the beta1 p-value for Chinese cities, 0.0507, is above the 0.05 significance level by only about 0.001, so the result could be marked significant under a slightly different significance level.) Finally, taking a broader view of the dataset, we noticed that the time variables, which describe the construction period and time, are strongly correlated with each other, as are the construction scale variables such as real_cost, length, tunnel, and stations. However, judging from the biplot, the time vectors and the construction scale vectors seem to be uncorrelated.

Future Work:

There are still more questions we want to address, either by gathering more data or by running more models and analyses.

First of all, we are missing variables that describe the success of these transit projects. This is important because we normally want to know how heavily these lines are used; passenger flow, for instance, might be a good way to quantify whether a transit line has high or low usage. With variables of this kind, we could explore whether higher cost is related to higher usage and reflect on whether it is really worth that much to construct transit in certain regions. This question might be crucial and meaningful for local governments or city planners who want to maximize the benefits of their investment.

Additionally, the collection method of the dataset could be biased depending on the source. The lack of data for cities outside of China and the missing entries (NA’s) in the dataset are potentially problematic. Gathering an unbiased dataset from a single source of information would be preferable if given the chance to do so.

Lastly, we might want to apply some machine learning models to further explore the correlation between variables and to make predictions from sample data. For instance, when looking at which variables are correlated with a high or low construction period, decision trees might be helpful because they provide a detailed sequence of splits on each variable that leads to a high or low construction period. This is useful for researchers who want to see a transparent splitting process inside their models.

Ending Note:

In all, we hope that our report will be a useful resource for elected officials, planners, researchers, journalists, and other people who are interested in or passionate about transit infrastructure, and that it provides meaningful statistical support for their further efforts.

36315 Final Project

Roxena Liu, Steven Shou, Victor Wen, Irene Gao

05/02/2022
