Data Overview

The dataset we use in this report is Yellow Taxi Cab trip data made available through the New York City Taxi and Limousine Commission (TLC). The dataset contains information on itemized fares and rates, payment types, pick-up and drop-off locations, pick-up and drop-off times, and driver-reported passenger counts. The TLC collected this information from technology providers authorized under the Taxicab and Livery Passenger Enhancement Programs (TPEP). Each observation (row) in the dataset represents one yellow taxi cab trip.

The columns in the dataset correspond, in order, to the following information: the ID of the TPEP vendor that provided the data, the date and time the taxi meter was engaged, the date and time the taxi meter was disengaged, the number of passengers in the vehicle (driver-entered), the elapsed trip distance in miles reported by the taximeter, the longitude where the meter was engaged, the latitude where the meter was engaged, the final rate code in effect at the end of the trip, a flag variable indicating whether or not the trip record was held in vehicle memory before being sent to the vendor, the longitude where the meter was disengaged, the latitude where the meter was disengaged, a numeric code signifying how the passenger paid for the trip, the time-and-distance fare calculated by the meter, miscellaneous extras and surcharges, the Metropolitan Transportation Authority (MTA) tax, the improvement surcharge assessed on trips, the tip amount paid by credit card users, the amount of tolls paid, and the total amount charged to passengers, excluding cash tips.

Because of limitations in R’s processing capabilities given the extremely large size of the data, we focus largely on data from January 2017.

Research Questions

In our report, we aim to address the following research questions.

Exploratory Textual Analysis

The dataset does not contain elements that are, at a basic level, “textual.” Therefore, to perform text analysis, we use the tpep_dropoff_datetime variable, which indicates when the passengers traveling in the taxi were dropped at their destination. We convert this variable to a character variable holding the date, and use the resulting date labels as the “words” in a word cloud. This gives us an idea of which times of the month show increased taxi demand. The word cloud shows that no part of the month stands out sharply from the others. The fewest trips occurred on January 2nd. Note that this word cloud is built from a reduced version of the dataset with fewer observations, so the conclusions drawn have limited external validity.
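The conversion described above can be sketched in base R as follows. This is only an illustration under stated assumptions: taxi_demo is a simulated stand-in for the real trip data, and the wordcloud package is an assumed choice of plotting tool that is used only if it happens to be installed.

```r
# Simulated stand-in for the January 2017 data (hypothetical, not the TLC file).
set.seed(1)
month_start <- as.POSIXct("2017-01-01 00:00:00", tz = "UTC")
taxi_demo <- data.frame(
  tpep_dropoff_datetime = month_start + runif(5000, min = 0, max = 31 * 24 * 3600 - 1)
)

# Convert each drop-off timestamp to a character date label, e.g. "2017-01-05".
dropoff_day <- format(taxi_demo$tpep_dropoff_datetime, "%Y-%m-%d", tz = "UTC")

# Count trips per day; these counts play the role of word frequencies.
day_freq <- sort(table(dropoff_day), decreasing = TRUE)
head(day_freq)

# Draw the cloud only if the optional 'wordcloud' package is available.
if (requireNamespace("wordcloud", quietly = TRUE)) {
  wordcloud::wordcloud(words = names(day_freq), freq = as.vector(day_freq),
                       min.freq = 1, random.order = FALSE)
}
```

Because every day of the month is a separate "word", days with similar trip counts are drawn at similar sizes, which is why a fairly even month produces a cloud with no dominant label.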

The Relationship Between Payment Type and Total Trip Costs

In this section, we look to determine how the payment type chosen by the passenger(s) is related to the total amount paid for the trip. Before investigating this relationship, we look at how each payment type is distributed across pick-up time.

The stacked bar plot above indicates that, overall, credit card is the most used method of payment. The second most frequent payment method in the dataset is cash. These two methods account for nearly all observations, with the remaining modes of payment (no charge, dispute, unknown, and voided trip) making up a very small fraction of observations in the dataset.

We then categorize the total amount charged for the trip into three categories (representing high, medium, and low overall costs for the trip), and perform a Chi-squared test to check for independence between this categorized total amount and payment type. The Chi-squared test statistic is very large, with a p-value of less than 0.05, indicating that we have statistical evidence to reject the null hypothesis that the variables are independent in favor of the alternative hypothesis that they are related. We confirm this conclusion by constructing a mosaic plot colored by Pearson residuals.
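A minimal sketch of this categorize-then-test procedure is given below. The data are simulated, and the tercile cut points are an assumption made for illustration; the report does not state which thresholds define the low, medium, and high categories.

```r
# Simulated stand-in data (hypothetical values, not the TLC records).
set.seed(2)
n <- 3000
taxi_demo <- data.frame(
  payment_type = sample(c("credit card", "cash"), n, replace = TRUE,
                        prob = c(0.65, 0.35)),
  total_amount = round(rgamma(n, shape = 2, scale = 7), 2)
)

# Assumed cut points: split the total amount at its terciles.
breaks <- quantile(taxi_demo$total_amount, probs = c(0, 1 / 3, 2 / 3, 1))
taxi_demo$cost_level <- cut(taxi_demo$total_amount, breaks = breaks,
                            labels = c("low", "medium", "high"),
                            include.lowest = TRUE)

# Cross-tabulate cost level against payment type and test for independence.
tab <- table(taxi_demo$cost_level, taxi_demo$payment_type)
chi_res <- chisq.test(tab)
chi_res$statistic
chi_res$p.value
```

With three cost levels and two payment types the test has (3 - 1) * (2 - 1) = 2 degrees of freedom. On real data with millions of rows, even a weak association yields a very large statistic, so the size of the statistic alone says little about the strength of the relationship.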

As shown in the mosaic plot, two combinations have much lower counts than we would expect under independence: medium total trip amounts paid in cash, and low total trip amounts paid by credit card. The opposite is true for the complementary combinations: medium total trip amounts paid by credit card and low total trip amounts paid in cash have unexpectedly high counts. This is not the pattern we would expect if the variables were actually independent. The mosaic plot therefore supports the result of the Chi-squared test, and we conclude that payment type and the total trip amount charged to customers were not independent in January 2017.
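To make the coloring of the mosaic plot concrete, the sketch below computes Pearson residuals, (observed - expected) / sqrt(expected), for a small table. The counts are invented purely for illustration and were chosen only so that the signs mirror the pattern described above; they are not the report's data.

```r
# Illustrative counts only (hypothetical, not taken from the TLC data).
tab <- as.table(matrix(
  c(420, 310, 270,   # cash:        low, medium, high
    280, 390, 330),  # credit card: low, medium, high
  nrow = 3,
  dimnames = list(cost_level = c("low", "medium", "high"),
                  payment_type = c("cash", "credit card"))
))

chi_res <- chisq.test(tab)

# Pearson residual for each cell: positive means more trips than expected
# under independence, negative means fewer.
pearson <- (chi_res$observed - chi_res$expected) / sqrt(chi_res$expected)
round(pearson, 2)

# Base R mosaic plot shaded by these residuals.
mosaicplot(tab, shade = TRUE, main = "Cost level by payment type")
```

The squared residuals sum to the Chi-squared statistic, so the most strongly shaded cells are the ones contributing most to the rejection of independence.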

Future Work

There are socioeconomic implications to the analysis performed in this section of the report. Different populations in New York City are likely to favor different payment methods depending on their financial and economic standing, their destination, and other factors. Relationships between method of payment and the amount charged for the trip may therefore be meaningful to those with lower socioeconomic standing, who may pay more frequently with cash than with credit. More complete data reflecting passenger demographics would be helpful, as it would allow a more detailed analysis of who uses specific payment types and how their experience is affected by that choice.

Variations in Trip Features As Passenger Counts Vary

The research question that motivates this section of our report is as follows. How do features of a yellow taxi trip change as the number of passengers varies? Note that in this section we limit the data to a random sample of one hundred thousand observations because of memory and runtime constraints in R.
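One way to draw such a sample reproducibly is sketched below. The data frame taxi_full and the seed value are hypothetical; the report does not say how its sample was drawn.

```r
# Hypothetical stand-in for the full monthly data.
set.seed(2017)
taxi_full <- data.frame(
  trip_id = seq_len(250000),
  passenger_count = sample(1:6, 250000, replace = TRUE)
)

# Sample 100,000 distinct rows without replacement.
keep_rows <- sample(nrow(taxi_full), size = 100000)
taxi <- taxi_full[keep_rows, ]
nrow(taxi)
```

Fixing the seed before sampling means the same subset is selected on every run, which keeps figures and test results stable between renders of the report.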

A simple exploratory figure shows that passenger counts did not vary much over time in January 2017. An additive model fit using the gam specification in R is also displayed in the figure. This curve is quite flat, indicating that average passenger counts did not vary meaningfully with time.

Variation in Fare/Rate Variables Across Number of Passengers

In this grouping of figures we can see that, on average, the total dollar amount of the trip decreases as the number of passengers increases. More specifically, the key variable changing the total dollar amount of the trip is the overall fare amount. Fare amount seems to notably decrease on average as the number of passengers increases. Tip amount seems to remain relatively constant, as do tolls as evidenced by the plot. Note that the y-axis on these plots is limited to exclude outlier trips that are not representative of the overall population.

These plots leave us curious as to why fares are lower on average for trips with more passengers. To better answer this facet of our research question, we make use of a heat map of passenger count against trip distance, which indicates whether trips are longer or shorter depending on the number of passengers in the car.

From the above heat map, we can see that density is concentrated in, and spread fairly evenly across, a handful of combinations of low passenger counts and short trip distances. The combination of the two variables with the greatest density is one passenger with a very short trip of approximately two miles. The combinations with the next greatest densities are one passenger and a near-zero-mile trip, one passenger and an approximately three-mile trip, and two passengers and an approximately two-mile trip. This plot is consistent with our original expectation that, on average, the length of a yellow taxi trip in New York City decreases as the number of passengers increases. This would contribute to the decreased total costs and decreased overall fares associated with greater numbers of passengers that were identified in the earlier figures.
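The quantity behind a heat map of this kind is the share of trips falling in each (passenger count, distance bin) cell. The sketch below computes it in base R on simulated data; the one-mile bins, the cap at ten miles, and the simulated distributions are all assumptions made for illustration.

```r
# Simulated stand-in data (hypothetical, not the TLC records).
set.seed(3)
n <- 5000
passenger_count <- factor(
  sample(1:6, n, replace = TRUE, prob = c(0.70, 0.15, 0.05, 0.03, 0.04, 0.03)),
  levels = 1:6
)
trip_distance <- round(rexp(n, rate = 1 / 2.5), 1)

# Assumed binning: one-mile bins, with trips over 10 miles folded into the last bin.
dist_bin <- cut(pmin(trip_distance, 10), breaks = 0:10,
                include.lowest = TRUE, right = FALSE)

# Share of all trips in each (passenger count, distance bin) cell.
counts <- table(passenger_count, dist_bin)
cell_share <- counts / sum(counts)

# Locate the densest cell and draw the heat map.
which(cell_share == max(cell_share), arr.ind = TRUE)
image(x = 1:6, y = 0:9 + 0.5, z = unclass(cell_share),
      xlab = "Passenger count", ylab = "Trip distance (miles)")
```

Because single-passenger trips dominate the data, the raw cell shares are largest in the one-passenger column almost by construction. Dividing each row by its row total (for example with prop.table(counts, margin = 1)) would show how distance is distributed within each passenger count, which is the comparison the argument relies on.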

We now consider whether there is statistical evidence for the relationship between passenger count and trip distance that we suggested is responsible for the decrease in total fares. We do this by performing a Fisher test of independence between these two variables. The test operates on the contingency table of passenger count by trip distance, so each distinct recorded distance is treated as its own category.

fisher.test(table(taxi$passenger_count, taxi$trip_distance), simulate.p.value = TRUE)
## 
##  Fisher's Exact Test for Count Data with simulated p-value (based on
##  2000 replicates)
## 
## data:  table(taxi$passenger_count, taxi$trip_distance)
## p-value = 0.0004998
## alternative hypothesis: two.sided

From the Fisher test, we obtain a simulated p-value of 0.0004998, which is less than 0.05 and is the smallest value attainable with 2000 replicates (1/2001). Therefore, we reject the null hypothesis that the two variables are independent in favor of the alternative hypothesis that they are not. We conclude that there is a statistically detectable association between Yellow Taxi passenger count and trip distance, and that this may be one reason the cost of a taxi trip decreases as passenger count increases.
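As a variation on the call above, the sketch below bins trip distance into a few groups before testing, so that the contingency table has a manageable number of columns. The data are simulated and the bin edges are assumptions; this is not the analysis that produced the output shown earlier.

```r
# Simulated stand-in data in which the two variables are independent by design.
set.seed(4)
n <- 2000
passenger_count <- sample(1:4, n, replace = TRUE, prob = c(0.70, 0.15, 0.10, 0.05))
trip_distance <- round(rexp(n, rate = 1 / 2.5), 1)

# Assumed distance groups (in miles).
dist_group <- cut(trip_distance, breaks = c(0, 1, 2, 5, Inf),
                  labels = c("<1", "1-2", "2-5", "5+"),
                  include.lowest = TRUE, right = FALSE)

tab <- table(passenger_count, dist_group)

# Monte Carlo version of Fisher's test with B simulated tables.
B <- 2000
res <- fisher.test(tab, simulate.p.value = TRUE, B = B)
res$p.value

# The simulated p-value can never fall below 1 / (B + 1).
min_p <- 1 / (B + 1)
signif(min_p, 4)
```

A reported p-value equal to this floor therefore means that none of the simulated tables was as extreme as the observed one; increasing B would be needed to resolve how small the p-value really is.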

Future Work

There is considerable room for future work with regard to this research question. Namely, we were forced to limit the size of the working dataset used in this section because of limitations with runtime and functionality in R. Using software with greater capacity for extremely large datasets would make this analysis more accurate and the conclusions more externally valid. More generally, this dataset was extremely large and presented in a format that was not quite “tidy” by typical standards, so considerable effort was needed simply to clean and trim the data to a usable format. More nuanced statistical techniques, and software better suited to data of this size, could make it more practical to obtain population-level conclusions.

Apart from technical and programming limitations, the conclusions drawn in this section of the report also invite further, more nuanced inquiries. In particular, we would be interested in better understanding why trip distance decreases as the number of passengers increases. Additional data would help answer this question. For example, if the dataset contained a variable representing the category of the location where the passengers were dropped off (e.g., “bar”, “restaurant”, “residence”, etc.), then we could better interpret variations in combinations of trip distance and passenger count.

Yellow Taxi Cab Market Changes Over Time

One strength of the data made available through the Taxi and Limousine Commission is that it covers an extremely large number of trips across an extremely wide range of dates. Therefore, we were curious to investigate the progression of the taxi ecosystem in New York City over time. To do so, we compare our base dataset, January 2017’s data, to the data for January 2022. In this comparison we primarily investigate overall trip pricing.

First, to identify simple differences in the total amount paid for taxi trips in 2017 and 2022, we filter the dataset to include trips paid using cash or credit with non-zero payment, and with a total amount paid within three standard deviations of the mean. This cleaning of the data allows us to eliminate outlier trips that are not representative of the average yellow taxi trip. Then, we calculate the inflation-adjusted total amount of each trip using the United States Bureau of Labor Statistics’ Inflation Calculator. This allows for more straightforward comparisons between the two years.
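The cleaning and adjustment steps can be sketched as below. Several details are assumptions: clean_trips is a hypothetical helper, the mean and standard deviation are computed after the payment filter (the report does not state the order), payment codes 1 and 2 are taken to mean credit card and cash as in the TLC data dictionary, and cpi_ratio_2022_to_2017 is a placeholder value that should be replaced with the figure given by the BLS calculator.

```r
# Hypothetical helper: keep cash/credit trips with a positive total that lies
# within three standard deviations of the mean total.
clean_trips <- function(trips) {
  keep <- trips$payment_type %in% c(1, 2) & trips$total_amount > 0
  trips <- trips[keep, , drop = FALSE]
  mu <- mean(trips$total_amount)
  s <- sd(trips$total_amount)
  trips[abs(trips$total_amount - mu) <= 3 * s, , drop = FALSE]
}

# Simulated 2022 trips with a few invalid and extreme totals mixed in.
set.seed(5)
trips_2022 <- data.frame(
  payment_type = sample(1:4, 1000, replace = TRUE,
                        prob = c(0.70, 0.25, 0.03, 0.02)),
  total_amount = c(round(rgamma(995, shape = 2, scale = 9), 2) + 3,
                   0, -5, 400, 650, 900)
)
cleaned <- clean_trips(trips_2022)

# Placeholder ratio of 2022 prices to 2017 prices (NOT the actual BLS figure).
cpi_ratio_2022_to_2017 <- 1.15
cleaned$total_2017_dollars <- cleaned$total_amount / cpi_ratio_2022_to_2017
summary(cleaned$total_2017_dollars)
```

Dividing the 2022 amounts by the price ratio expresses them in 2017 dollars, so the two years can be compared on the same scale.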

The figure below is an overlaid smoothed density graph of the total amount paid for a yellow taxi trip in 2017 versus 2022. Note that in this section, the data used for each of 2017 and 2022 is limited to one million observations due to processing limitations.

Performing this comparison indicates that rides have generally become more expensive over time, even after accounting for inflation. The average ride cost 13.68 dollars in 2017, compared to 14.36 dollars in 2022 after adjusting for inflation. This translates to a 4.99% increase in prices. This increase can be attributed to many factors, one of the most obvious being a potential increase in tipping amounts and frequency of tipping. Unfortunately, because the total amount paid for the trip includes the amount tipped, the two are directly related, and a linear model of one on the other would not be informative. Therefore, we construct a scatterplot of the tipped amount and fare amount for each year. We also add contour lines to better visualize where tips are centered in each year.

These scatterplots demonstrate some interesting trends. First, we can see that people tend to tip in whole-number increments, evidenced by the horizontal lines of points at whole numbers. Additionally, people tend to tip set percentages of the fare amount, as evidenced by the diagonal sets of points in each graph. Lastly, and perhaps most importantly, we see that people tended to tip more in 2022, as evidenced by the contour lines. The center of the data is also much higher vertically in 2022.

We see that the average tip in 2017 was 1.35 dollars compared to 1.66 dollars in 2022 after inflation, which is an increase of 22.89%. This offers a potential explanation for some, but not all, of the increased taxi trip prices in 2022.
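The percentage increases quoted in this section follow from the usual relative-change formula, 100 * (new - old) / old. The sketch below applies it to the rounded averages given in the text; pct_increase is a hypothetical helper name. The results differ slightly from the reported 4.99% and 22.89%, presumably because the report's figures were computed from unrounded means, though that is an assumption.

```r
# Hypothetical helper: percentage increase from an old value to a new value.
pct_increase <- function(old, new) {
  100 * (new - old) / old
}

# Average total amount: 13.68 dollars (2017) vs 14.36 dollars (2022, adjusted).
round(pct_increase(13.68, 14.36), 2)  # 4.97

# Average tip: 1.35 dollars (2017) vs 1.66 dollars (2022, adjusted).
round(pct_increase(1.35, 1.66), 2)    # 22.96
```

In dollar terms, the average tip rose by 0.31 dollars while the average total rose by 0.68 dollars, which is consistent with tipping accounting for some, but not all, of the overall increase.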

In the remainder of this section, we further investigate this question by looking into whether or not the taxi system has begun to charge people differently over time. To do this, we construct two linear models to predict the amount paid using passenger count, trip distance, and payment type as predictors. We obtain the following results for 2017:

We obtain the following inflation-adjusted results for 2022:

To visually confirm this finding, we construct a hexagonal heat map of distance traveled and total cost for each year.

Examining these two heat maps, we can see that the TLC has increased trip prices roughly uniformly, regardless of the distance travelled. This supports our earlier findings. The increase in the base price of taxis, paired with the lower cost-per-mile rate, created a 2.42 cent increase in average trip cost. We also determined that 13.7% more people used credit cards in 2022, which corresponds to a 38.24 cent increase in prices. Therefore, we conclude that the increase in overall taxi prices and the changes in the taxi market from 2017 to 2022 are largely related to an increase in tipping and wider adoption of credit cards.
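For reference, a linear model of the kind described in this section (amount paid predicted from passenger count, trip distance, and payment type) could be fit as in the sketch below. The data are simulated with made-up coefficients, so the estimates printed here say nothing about the real 2017 or 2022 markets; only the model formula reflects the description in the text.

```r
# Simulated stand-in data with known, made-up coefficients.
set.seed(6)
n <- 2000
demo <- data.frame(
  passenger_count = sample(1:6, n, replace = TRUE),
  trip_distance = round(rexp(n, rate = 1 / 2.5), 2),
  payment_type = factor(sample(c("credit card", "cash"), n, replace = TRUE))
)
demo$total_amount <- 4 + 2.5 * demo$trip_distance +
  1.5 * (demo$payment_type == "credit card") + rnorm(n, sd = 2)

# Intercept ~ base price, trip_distance slope ~ cost per mile,
# payment_type coefficient ~ average difference for credit card vs cash.
fit <- lm(total_amount ~ passenger_count + trip_distance + payment_type,
          data = demo)
round(coef(fit), 2)
```

Fitting the same formula separately to each year's inflation-adjusted data allows the intercept, the per-mile slope, and the credit card coefficient to be compared across years.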

Future Work

This dataset excludes cash tips from both the tip_amount field and the total_amount field, which understates the actual amount paid by passengers who tip in cash. Therefore, future inquiries with regard to the research question posed in this section would be best conducted with more complete data that reflects all the money paid by customers to their driver, across all payment methods. The conclusions drawn in this section, while substantiated by the data, may not accurately represent population-level trends and relationships because of this gap in the information provided in the dataset.

Conclusion

The Taxi and Limousine Commission’s yellow taxi cab trip data is a large dataset that offers relatively detailed information on yellow taxi trips in New York City and certain key variables associated with a taxi trip, such as the cost of the ride, taxes charged, and the pick-up and drop-off details of the trip. In this report, we were interested in contextualizing this data and analyzing three of its facets. Namely, we analyze how payment type is related to the overall cost of the trip, how elements of the trip vary based on the number of passengers on the trip, and how the yellow taxi market has changed over time. We find that payment type is related to the overall cost of the trip. We also find that the overall price of a trip decreases on average as the number of passengers increases, likely because the distances travelled by larger groups are shorter than those travelled by smaller groups. Finally, we find that the cost of a yellow taxi trip has increased from 2017 to 2022 beyond what inflation alone would explain, likely because of an overall increase in tipping and in credit card usage. With regard to future work, we would ideally like to use a more complete dataset and more capable statistical software to perform a deeper analysis, with more factors related to the features of the passengers and inclusion of all payments in the calculation of costs.