Why do transit-infrastructure projects in New York cost 20 times more per kilometer than in Seoul? Given that New York is one of the most developed cities in the world, this outcome is surprising. What might explain such a large difference? We investigate this question using the dataset called ‘transit_cost’. It includes hundreds of transit projects, spans more than 50 countries, and in total covers more than 11,000 km of urban rail built since the late 1990s.
The goal of this report is to provide statistical support for possible strategies to deliver transit projects that are more cost-efficient, more time-efficient, and high-capacity across different countries. We approach the problem from three perspectives:
1. In the first part of the report, we explore the factors that relate to the construction period, that is, the time efficiency of transit projects.
2. In the second part, we ask whether a country's economic inflation affects its transit projects. More specifically, we narrow the question down to the relationship between the number of stations and the average purchasing power of each city.
3. In the last part, we explore relationships in a higher-dimensional space through clustering, and try to work out which continuous variables are most deterministic and how they relate to one another.
Our dataset transit_cost has 544 unique transit projects and 20 features. Some features indicate the location of the transit project, such as country, city, and line. Others describe the status of the project, including start_year, end_year, rr, length, tunnel_per, tunnel, and stations. The remaining variables provide economic context, such as cost_km_millions, cost, currency, ppp_rate, and real_cost.
In the following code chunk, we show all 20 variables in transit_cost:
## [1] 544 20
## [1] "e" "country" "city" "line"
## [5] "start_year" "end_year" "rr" "length"
## [9] "tunnel_per" "tunnel" "stations" "source1"
## [13] "cost" "currency" "year" "ppp_rate"
## [17] "real_cost" "cost_km_millions" "source2" "reference"
Definitions of Variables:
- rr: whether the transit project is a railroad (1 = railroad)
- tunnel_per: percent of length completed
- ppp_rate: purchasing power parity (PPP), based on the midpoint of construction
- real_cost: real cost of the transit project in millions of USD
- cost_km_millions: cost per km in millions of USD
- stations: number of stations where passengers can board/leave per location (city)
First, we do some exploratory data analysis. The following code chunk tabulates the number of transit projects in each country. China has by far the highest count (253), while most other countries have 5 or fewer.
##
## AE AR AT AU BD BE BG BH BR CA CH CL CN CZ DE DK EC EG ES FI
## 3 1 3 4 3 1 6 2 3 10 3 2 253 1 13 1 1 7 15 2
## FR GR HU ID IL IN IR IT JP KR KW MX MY NL NO NZ PA PE PH PK
## 15 4 1 1 2 29 3 11 15 6 1 2 2 1 2 1 3 1 4 1
## PL PT QA RO RU SA SE SG TH TR TW UA UK US UZ VN
## 4 2 1 1 5 9 5 3 8 20 12 3 3 13 4 5
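A tally of this kind can be produced with base R's `table()`. The following is a minimal sketch on a small made-up stand-in (`toy`) rather than the real data; only the column name `country` comes from the dataset.

```r
# Hypothetical stand-in for transit_cost; only the `country` column name
# comes from the dataset, the rows are made up for illustration.
toy <- data.frame(
  country = c("CN", "CN", "CN", "IN", "IN", "US", "KR"),
  stringsAsFactors = FALSE
)

# Count projects per country (sorted alphabetically by country code)
counts <- table(toy$country)
print(counts)

# Sort by count to see which country dominates the sample
top <- sort(counts, decreasing = TRUE)
print(names(top)[1])
```

Sorting the table by count makes an imbalance like China's share visible at a glance, which matters later when interpreting clusters.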
Since the goal of constructing transit lines is to build connections between different regions, plotting the values on a map is a natural way to visualize their distribution.
In the following code chunk, we plot the average cost per kilometer (i.e., the variable cost_km_millions) on a world map.
According to the map, the southern part of North America has the highest construction cost per kilometer (about 900 million USD per km or higher), while some European countries have the lowest (below roughly 150 million per km). Costs in regions adjacent to China, Russia, and South America are also relatively low, below roughly 600 million per km.
The descriptive map shows a clear difference in average cost per kilometer between countries. In the rest of the report, we look for variables that may contribute to this difference.
In this section of the project, we study what predicts the construction time of the proposed lines. Knowing which factors affect construction time matters because shortening it helps reduce the cost of building transit in cities. In the transit cost dataset, the variables that might be correlated with construction time are the length of the proposed line, the number of stations, and whether the line is a railroad. We explore whether these variables are correlated with construction time, or whether factors outside this dataset are affecting it.
We first converted start_year and end_year to numeric variables, then turned the construction period into a categorical variable. The graph above shows the distribution of the construction period. For most cities, the proposed lines were built within six years. The distribution peaks at four to six years, a range that covers more than 150 of the proposed lines. The least common construction period is one to two years, with fewer than 25 lines. The distribution appears to be right-skewed.
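The conversion and binning described above might look like the following minimal sketch. The rows, the break points, and the names `duration` and `construcPeriod` are assumptions made for illustration; the report's actual bins may differ.

```r
# Hypothetical stand-in rows; the year columns are stored as text here.
toy <- data.frame(
  start_year = c("2015", "2010", "2018", "2005", "2012"),
  end_year   = c("2017", "2015", "2021", "2018", "2020"),
  stringsAsFactors = FALSE
)

# Convert text years to numbers and compute the construction time in years
toy$start_year <- as.numeric(toy$start_year)
toy$end_year   <- as.numeric(toy$end_year)
toy$duration   <- toy$end_year - toy$start_year

# Bin the durations; these break points are an assumption, not the report's bins
toy$construcPeriod <- cut(
  toy$duration,
  breaks = c(0, 2, 4, 6, 8, 10, Inf),
  labels = c("1-2", "3-4", "5-6", "7-8", "9-10", "10+"),
  right  = TRUE
)
print(table(toy$construcPeriod))
```

Because `cut()` returns an ordered set of factor levels, the categories plot in duration order rather than alphabetically.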
The graph above shows the average line length for each construction period. The average length does not increase steadily with the construction period. For periods from one to six years, the average length increases steadily. However, it drops for 8 to 10 years and then increases again for more than 10 years. I think this is because other factors, such as weather conditions or labor shortages, may also affect the construction time.
For the mosaic plot above, I chose the construction periods from one to six years and the lengths from 0 to 60 km because they cover the largest share of the data. For a construction period of one to two years, there is a relatively high number of lines of 0-20 km and a relatively low number of lines of 20-40 km. This makes sense because shorter lines tend to have a shorter construction period.
##
## 100 stations 20 stations 40 stations 60 stations 80 stations
## 4 410 104 6 5
The graph above compares the number of stations with the construction period. Most of the points cluster in the bottom-left corner, meaning most projects have around 0 to 10 stations and a construction time of about 5 years. We do not see any clear correlation between the number of stations and the construction period, which suggests that the construction period does not depend strongly on the number of stations.
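A visual impression from a scatter plot can be backed by a rank correlation. The sketch below uses made-up numbers, not values from transit_cost, and the choice of Spearman's method is an assumption rather than something the report states.

```r
# Made-up numbers for illustration only; not taken from transit_cost.
stations <- c(3, 6, 8, 10, 15, 22, 30)
duration <- c(5, 4, 6, 5, 7, 4, 6)

# Spearman's rank correlation is less sensitive to a few very large projects
# than Pearson's. exact = FALSE avoids the warning about tied ranks.
res <- cor.test(stations, duration, method = "spearman", exact = FALSE)
print(res$estimate)  # rank correlation coefficient (rho)
print(res$p.value)   # p-value for the null hypothesis of no association
```

A coefficient near zero with a large p-value would support the reading of the plot; it would not prove independence.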
For the graph above, we compared the average length of the transit lines across the different construction periods. We see similar trends for lines with and without railroads. The only difference is that lines with railroads have a shorter average length for each construction period, meaning that lines with railroads take longer to construct for a given length.
From the EDA section above, one can observe some differences and similarities between groups of urban transit projects. For example, the North American countries, i.e., the United States and Canada, are building the most expensive transit per kilometer in the world, and the United Kingdom has the largest number of city transit projects that are connected with railroads. These exploratory clues suggest that there may be clusters of transit projects with similar characteristics, and that some features may be tightly correlated with one another.
In this section, we are going to explore the potential clusters of these transits, the hidden correlation behind the quantitative variables, and the deterministic effect of each of the quantitative variables.
We first remove the incomplete cases in the dataset. Since this dataset was found in the wild, it is not as clean and tidy as some others we used in class. We therefore remove the rows with NA’s or missing values so that the measurements are consistent for further analysis.
For the PCA and clustering analysis, we use only the quantitative variables from the original transit cost dataset, and therefore drop the columns e, country, line, city, rr, source1, currency, source2, and reference. Since the column cost represents the cost of the individual transit project in local currency, which cannot be compared across the entire dataset, we drop this column as well. We also have to convert some character columns to numeric. Below is a segment of the dataset we use for the rest of this report, which retains 427 observations of 10 quantitative variables.
## # A tibble: 6 × 10
## start_year end_year length tunnel_per tunnel stations year ppp_rate real_cost
## <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2020 2025 5.7 0.877 5 6 2018 0.84 2377.
## 2 2009 2017 8.6 1 8.6 6 2013 0.81 2592
## 3 2020 2030 7.8 1 7.8 3 2018 0.84 4620
## 4 2020 2030 15.5 0.57 8.8 15 2019 0.84 7201.
## 5 2020 2030 7.4 1 7.4 6 2020 0.84 4704
## 6 2003 2018 9.7 0.73 7.1 8 2009 1.3 4030
## # … with 1 more variable: cost_km_millions <dbl>
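The cleaning steps described above could be sketched as follows. The data frame `toy` and its non-numeric entries are made up for illustration; only the column names are taken from the dataset.

```r
# Hypothetical stand-in with the same kinds of problems as the raw data:
# identifier columns, years stored as text, and missing values.
toy <- data.frame(
  country    = c("CA", "CA", "US", "KR"),
  start_year = c("2020", "2009", "not start", "2003"),
  end_year   = c("2025", "2017", "X", "2018"),
  length     = c(5.7, 8.6, NA, 9.7),
  cost       = c(2830, 3200, 1000, 5000),
  real_cost  = c(2377, 2592, 900, 4030),
  stringsAsFactors = FALSE
)

# Convert text years to numeric; non-numeric entries become NA
# (the coercion warning is expected, so it is silenced here)
toy$start_year <- suppressWarnings(as.numeric(toy$start_year))
toy$end_year   <- suppressWarnings(as.numeric(toy$end_year))

# Drop identifier columns and the local-currency cost column
drop_cols <- c("country", "cost")
quant <- toy[, !(names(toy) %in% drop_cols)]

# Keep only the rows without any missing value
quant <- quant[complete.cases(quant), ]
print(dim(quant))
```

Converting before filtering matters: entries that fail the numeric conversion turn into NA and are then removed together with the originally missing values.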
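We first have to standardize the columns of our cleaned-up dataset. Here are the first few rows of the resulting standardized dataset. Judging by the magnitudes (for example, start_year values near 300), the columns appear to have been divided by their standard deviations without first being centered: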
## start_year end_year length tunnel_per tunnel stations year
## 1 301.9715 321.5543 0.2589025 2.403011 0.3203957 0.4267189 350.2417
## 2 300.3271 320.2840 0.3906249 2.739410 0.5510806 0.4267189 349.3739
## 3 301.9715 322.3483 0.3542877 2.739410 0.4998173 0.2133595 350.2417
## 4 301.9715 322.3483 0.7040332 1.561464 0.5638964 1.0667973 350.4152
## 5 301.9715 322.3483 0.3361191 2.739410 0.4741857 0.4267189 350.5888
## 6 299.4302 320.4428 0.4405885 1.999769 0.4549619 0.5689585 348.6796
## ppp_rate real_cost cost_km_millions
## 1 0.9868948 0.4898524 1.488647
## 2 0.9516486 0.5341147 1.075815
## 3 0.9868948 0.9520100 2.114211
## 4 0.9868948 1.4839240 1.658370
## 5 0.9868948 0.9693193 2.269011
## 6 1.5273372 0.8304330 1.482977
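The following sketch contrasts two ways of standardizing. Which one the report used is an inference from the output above, not something the text states; the three rows reuse start_year and length values shown earlier, so the scaled numbers will not match the report's.

```r
# Three rows reusing values shown earlier in the report (start_year, length);
# with so few rows the scaled numbers will not match the report's output.
x <- data.frame(
  start_year = c(2020, 2009, 2003),
  length     = c(5.7, 8.6, 9.7)
)

# Option A: divide each column by its standard deviation, no centering.
# Large raw values such as years stay large.
scaled_only <- as.data.frame(lapply(x, function(col) col / sd(col)))

# Option B: center and scale (z-scores); each column gets mean 0 and sd 1.
z <- as.data.frame(scale(x))

print(scaled_only)
print(z)

# Pairwise Euclidean distances agree, because centering only shifts the points.
print(all.equal(as.vector(dist(scaled_only)), as.vector(dist(z))))
```

For distance-based clustering the two options are equivalent, so the lack of centering would not change the dendrogram.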
Next up, we calculate the distance matrix that will be necessary for the clustering algorithm.
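As a small illustration of this step, base R's `dist()` computes pairwise distances between rows. The matrix below is made up, and the Euclidean metric is an assumption, since the report does not name the metric it used.

```r
# Made-up, already-scaled values for four projects (rows) and two variables.
x <- matrix(
  c(0.0, 0.0,
    3.0, 4.0,
    0.0, 1.0,
    6.0, 8.0),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("A", "B", "C", "D"), c("length", "stations"))
)

# Euclidean distance between every pair of rows (lower triangle only)
d <- dist(x, method = "euclidean")
print(d)

# The same information as a full symmetric matrix
m <- as.matrix(d)
print(m["A", "B"])  # sqrt(3^2 + 4^2) = 5
```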
Using this distance matrix, we can apply hierarchical clustering to the continuous variables of the transit dataset. We use complete linkage: unlike single linkage, where a long, continuous chain of data points may be grouped into one cluster, complete linkage tends to produce more compact clusters.
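The difference between the two linkage rules can be seen on a tiny made-up example: a chain of points with slowly growing gaps, plus a distant pair. This is a sketch for intuition only, not the report's data.

```r
# Made-up one-dimensional data: a chain with growing gaps plus a distant pair.
pts <- c(0, 1, 2.1, 3.3, 4.6, 6.0, 20, 20.5)
d <- dist(pts)

hc_complete <- hclust(d, method = "complete")
hc_single   <- hclust(d, method = "single")

# Cut each tree into three clusters and compare the cluster sizes
size_complete <- as.vector(table(cutree(hc_complete, k = 3)))
size_single   <- as.vector(table(cutree(hc_single, k = 3)))
print(size_complete)
print(size_single)
```

On this toy chain, complete linkage splits the chain into two compact groups, while single linkage keeps most of the chain together and peels off a single point.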
Judging by the shape of this naive clustering dendrogram, the data points in the higher-dimensional space are distributed so that a large amount of data forms a dense center, with some smaller clusters on the outskirts.
Let’s also explore how well the clusters found by our algorithm align with the locations of the transit projects across continents. Here, we use the countrycode library to add a countrycode label (the continent of each country) to our dataset. By coloring the x-axis labels by this label, we can see how our clustering matches the continent classification. Since there are five unique continents (i.e., Americas, Europe, Asia, Oceania, and Africa) in the dataset, and since an overwhelming share of the data comes from China, we adopt 6 clusters: the five continents plus China as a separate group.
[NOTE] Color scheme adopted in the x-axis label:
Based on our complete-linkage unsupervised clustering model, the continent labels align with the clusters only roughly. For example, the majority of Chinese transit projects (denoted in red), together with other Asian projects, fall in the large central cluster (denoted in purple in the dendrogram). A large number of European transit lines (denoted in grey) are also in this central cluster. On the other hand, some projects from the Americas (denoted in green) form a distinct cluster of their own on the left side of the dendrogram, far from the major purple cluster.
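The visual comparison between clusters and regions can also be made numerically with a cross-tabulation. The sketch below uses made-up features and hand-assigned, hypothetical region labels; it cuts the tree into 3 groups because the toy data has three obvious groups, whereas the report uses 6.

```r
# Made-up scaled features for eight projects and hypothetical region labels.
x <- matrix(
  c(0.1, 0.2,
    0.2, 0.1,
    0.3, 0.3,
    5.0, 5.1,
    5.2, 4.9,
    9.8, 0.2,
    10.1, 0.1,
    9.9, 0.4),
  ncol = 2, byrow = TRUE
)
region <- c("China", "China", "Asia", "Europe", "Europe",
            "Americas", "Americas", "Americas")

hc <- hclust(dist(x), method = "complete")

# Cut the tree into k groups; k = 3 suits this toy example
cluster <- cutree(hc, k = 3)

# Rows are regions, columns are cluster ids
tab <- table(region, cluster)
print(tab)
```

A region whose counts sit in a single column matches a cluster well; a region spread across columns, or several regions sharing one column, indicates a rough alignment like the one described above.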
Since we are processing high-dimensional data, dimension reduction is needed to answer which variables are most deterministic in differentiating between transit projects. For this section, a traditional dimension reduction strategy, Principal Component Analysis (PCA), is applied to the transit_cost_quant dataset.
## Importance of components:
## PC1 PC2 PC3 PC4 PC5 PC6 PC7
## Standard deviation 1.8745 1.6198 1.1338 0.99482 0.95532 0.50662 0.37857
## Proportion of Variance 0.3514 0.2624 0.1285 0.09897 0.09126 0.02567 0.01433
## Cumulative Proportion 0.3514 0.6138 0.7423 0.84126 0.93252 0.95819 0.97252
## PC8 PC9 PC10
## Standard deviation 0.31695 0.31223 0.27720
## Proportion of Variance 0.01005 0.00975 0.00768
## Cumulative Proportion 0.98257 0.99232 1.00000
## PC1 PC2 PC3 PC4 PC5 PC6
## [1,] 0.1362006 1.93672742 0.1661943 -0.2304216 0.1039463 -0.06142024
## [2,] 1.1047098 0.01308087 0.7287169 -0.1502038 -0.1814996 -0.13760613
## [3,] -0.3838855 2.37628174 0.7106566 -0.7054536 0.1254544 0.14104594
## [4,] -1.1822304 1.74324182 -0.3974263 -0.6359402 0.6977871 0.07303859
## [5,] -0.5628885 2.51858935 0.7025938 -0.8223737 0.2184169 -0.08677346
## [6,] 1.3682331 -0.77600285 0.6688533 -0.6759151 0.7573926 0.03511950
## PC7 PC8 PC9 PC10
## [1,] -0.11238999 0.3905153 0.09389801 -0.04247573
## [2,] 0.17562227 -0.1329226 0.17052833 0.08733422
## [3,] -0.07827806 0.4604120 0.72779892 -0.03044298
## [4,] 0.09731863 0.3516388 0.62298219 -0.35689730
## [5,] -0.05262706 0.2595876 0.57600944 -0.05616311
## [6,] 0.09030943 -0.1859597 0.87690293 -0.20122048
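Output of this shape is what base R's `prcomp()` produces through `summary()` and the score matrix `$x`. Whether the report used `prcomp()` is an assumption; the sketch below runs it on made-up data with three variables.

```r
# Made-up data: two related "scale" variables and one unrelated "time" variable.
set.seed(1)
n <- 50
length_km <- runif(n, 2, 40)
toy <- data.frame(
  length     = length_km,
  stations   = round(length_km * 0.8 + rnorm(n, sd = 2)),
  start_year = sample(1995:2020, n, replace = TRUE)
)

# center = TRUE and scale. = TRUE put all variables on a comparable footing
pca <- prcomp(toy, center = TRUE, scale. = TRUE)

print(summary(pca))  # standard deviation and proportion of variance per PC
print(head(pca$x))   # PC scores for the first six observations
```

With scaled inputs, the PC variances sum to the number of variables, which is consistent with the ten squared standard deviations in the report's summary summing to about 10.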
As one can expect, the amount of variation in the original data set that can be explained by each PC diminishes for the later PC’s. With PC1 accounting for 35.14% of the variance by itself, PC2 for 26.24%, and PC3 for 12.85%, further analysis only needs to incorporate the first 2 to 3 PC’s. This can be further demonstrated by the elbow plot below:
As one can tell, the elbow occurs approximately between dimension 2 and dimension 3. However, for further analysis, only the first 2 PC’s will be used.
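The proportions quoted above follow directly from the standard deviations in the PCA summary: each PC's share is its variance divided by the total variance. The short sketch below redoes that arithmetic using the ten standard deviations copied from the summary output.

```r
# Standard deviations of the ten PCs, copied from the PCA summary output
sdev <- c(1.8745, 1.6198, 1.1338, 0.99482, 0.95532,
          0.50662, 0.37857, 0.31695, 0.31223, 0.27720)

# Each PC's variance is sdev^2; its share is that variance over the total
prop <- sdev^2 / sum(sdev^2)
cum  <- cumsum(prop)

print(round(prop[1:3], 4))
print(round(cum[1:3], 4))

# Drop in explained variance from one PC to the next; the elbow sits
# where the drops stop being large
print(round(-diff(prop), 4))
```

In these numbers the largest single drop is from PC2 to PC3, which agrees with placing the elbow between dimensions 2 and 3.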
Are these continuous variables correlated? And how strongly do they distinguish the transit projects and explain the variation among them? To answer these questions, a biplot is drawn below with the package factoextra.
There are three clearly defined groups of variables in the biplot. The time variables start_year, end_year, and year form a tight group pointing to the second quadrant. Construction-scale variables such as real_cost, length, tunnel, and stations form another closely bound group pointing to the third quadrant. Both of these groups contain vectors of large magnitude, signifying strong correlation with PC1 and PC2. The third group involves cost_km_millions, ppp_rate, and tunnel_per, which are weaker vectors lying in the vicinity of the origin.
There are a few key observations from the plot. The construction-scale variables are all strongly positively correlated with each other, so longer transit lines tend to have more stations, more tunnels, and higher cost. The construction-scale variables also have large magnitude, so they account for much of the variance in the original dataset. The time/year variables share this property. Another key observation is that the time vectors and the construction-scale vectors are perpendicular to each other, meaning that they are uncorrelated. The construction-scale vectors point in the opposite direction to ppp_rate, suggesting that economies with higher purchasing power may tend to do less infrastructure construction, although ppp_rate is a short vector, so this reading is tentative.
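The rule used above (small angle means positive correlation, right angle means no correlation) rests on a property of PCA on standardized data: inner products of the variable vectors reproduce the correlation matrix when all PCs are kept. The sketch below checks this on made-up data with base R; the variable names are borrowed from the dataset, but the values are simulated.

```r
# Made-up data for illustration: two related variables and one unrelated one.
set.seed(42)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  length   = a,
  stations = a + rnorm(n, sd = 0.5),
  year     = rnorm(n)
)

pca <- prcomp(toy, center = TRUE, scale. = TRUE)

# Variable coordinates: loadings scaled by the PC standard deviations
coords <- pca$rotation %*% diag(pca$sdev)

# Using all PCs, inner products of these vectors reproduce the correlations
print(round(coords %*% t(coords), 3))
print(round(cor(toy), 3))

# A 2D biplot draws only the first two coordinates, so angles are approximate
cos_angle <- function(u, v) sum(u * v) / sqrt(sum(u^2) * sum(v^2))
cos_ls <- cos_angle(coords["length", 1:2], coords["stations", 1:2])
cos_ly <- cos_angle(coords["length", 1:2], coords["year", 1:2])
print(cos_ls)
print(cos_ly)
```

Because the biplot shows only two of the dimensions, angles between short vectors (such as ppp_rate in the report) are the least reliable to read.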
From the analysis above, we see that when considering the cost in time, we might want to take the length of transit projects into account rather than the number of stations. However, the number of stations still matters: for cities outside of China, there is a significant negative correlation between the number of stations and the country’s purchasing power parity rate, which means the number of stations is related to the economic inflation rate of the country or city the project belongs to. (We also note that the beta1 p-value for Chinese cities is only marginally above the significance level of ~0.001. This could count as a significant result under a different significance level.) Finally, when we explore the dataset from a broader view, we notice that the time variables, which describe the construction period and timing, are strongly correlated with each other, as are the construction-scale variables real_cost, length, tunnel, and stations. However, reading the biplot, the time vectors and the construction-scale vectors seem to be uncorrelated with each other.
There are still more questions we want to address, either by gathering more data or by fitting more models and running further analysis.
First of all, we are missing variables that describe the success of these transit projects. This matters because we normally want to know how heavily these lines are used; passenger flow, for instance, might be a good measure of whether a line has high or low usage. With this kind of variable, we could explore whether higher cost is related to higher usage, and reflect on whether it is really worth that much to construct transit in certain regions. This question might be crucial and meaningful for local governments or city planners who want to maximize the benefits of their investment.
Additionally, the collection method of the data set could be biased depending on the source. The lack of data for cities outside of China and the missing entries (NA’s) in the data set are potentially problematic. Gathering an unbiased data set from a single source of information would be preferable if given the chance.
Lastly, we might want to apply some machine learning models to further explore the relationships between variables and to make predictions on new sample data. For instance, when looking at which variables are associated with a long or short construction period, decision trees might be helpful because they show in detail how each variable is split on to arrive at a long or short construction period. This is useful for researchers who want a transparent view of the splitting process inside the model.
In all, we hope that our report will be a useful resource for elected officials, planners, researchers, journalists, and other people who are interested in or passionate about transit infrastructure, and that it provides meaningful statistical support for their further efforts.