To better understand the resale car market, we explored a dataset gathered from Craigslist, the world’s largest collection of used vehicles for sale, that contains information about used cars listed for sale in the U.S. The dataset has 26 columns and 426,880 samples in total. Its variables describe features of the vehicles listed on Craigslist, including region, price, condition, year, manufacturer, model, type, and other categories. We are interested in examining the kinds of words sellers tend to use in their sales descriptions for specific brands, the most popular vehicles resold, and the sale levels in each state.
To start, we cleaned the data by removing samples with null values and dropping columns, such as url and region url, that are not needed in our analysis. After cleaning, we are left with 115,435 samples and 19 variables. Each variable is described below:
region: this describes where the vehicle is listed
price: in dollars ($)
year: year the vehicle was manufactured
manufacturer: manufacturer of the vehicle listed
model: model of the vehicle listed
condition: condition of the vehicle listed
cylinders: number of cylinders for the vehicle listed
fuel: fuel type of the vehicle listed
odometer: odometer of the vehicle listed
title_status: status of the title of the vehicle listed
transmission: transmission of the vehicle listed
drive: drive of the vehicle listed
type: type of the vehicle listed (truck, pickup, ...)
paint_color: color of the paint of the vehicle listed
description: description of the vehicle from the seller
state: state where the vehicle was listed
lat: latitude of the area where the vehicle was listed
long: longitude of the area where the vehicle was listed
posting_date: date the vehicle was posted on Craigslist
In this project, we will investigate these research questions:
What are the similarities and differences among the most popular manufacturers?
Are used car markets different across states?
What kind of words do sellers like to use in their descriptions?
We build our research questions from a top-down perspective. First, we attempt to form a general picture of the national resale vehicle market by studying the best-selling manufacturers and types. Next, we move to smaller markets, studying them by geographical location and examining any differences in business indicators and consumers’ preferences across these submarkets. Finally, we dig into individual seller behavior. Using text analysis, we learn more about how sellers promote their vehicles and what types of words they use in the process.
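As a point of reference for the analysis that follows, the cleaning step described above (dropping unneeded columns and removing samples with null values) might look roughly like the sketch below. It uses only the Python standard library. The sample rows, the column name `region_url`, and the choice to drop columns before filtering rows are assumptions made for illustration, not a record of how the cleaning was actually done.

```python
# Hypothetical sample rows; the real dataset has 26 columns and 426,880 samples.
rows = [
    {"url": "u1", "region_url": "r1", "price": "15000", "manufacturer": "ford", "state": "ca"},
    {"url": "u2", "region_url": "r2", "price": "", "manufacturer": "toyota", "state": "ny"},
    {"url": "u3", "region_url": "r3", "price": "8000", "manufacturer": None, "state": "tx"},
    {"url": "u4", "region_url": "r4", "price": "22000", "manufacturer": "chevrolet", "state": "wv"},
]

# Columns assumed to be unneeded for the analysis (names are illustrative).
DROP_COLUMNS = {"url", "region_url"}


def clean(rows, drop_columns):
    """Drop unneeded columns, then drop any row that still has a missing value."""
    cleaned = []
    for row in rows:
        kept = {key: value for key, value in row.items() if key not in drop_columns}
        # Treat both None and the empty string as missing.
        if all(value not in (None, "") for value in kept.values()):
            cleaned.append(kept)
    return cleaned


cleaned_rows = clean(rows, DROP_COLUMNS)
print(len(cleaned_rows), "rows kept,", len(cleaned_rows[0]), "columns kept")
```

Dropping the unneeded columns first matters: a missing value in a column that is about to be discarded should not cause an otherwise complete row to be removed.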
We were interested in exploring some of the most popular manufacturers in the US and the types of vehicles sold under them. We found that the most popular brands listed in this dataset are Chevrolet, Ford, and Toyota, and that the most popular types of vehicle are SUVs, sedans, and pickup trucks.
This scatterplot shows how price changes across manufacturing years from 1950 to 2022 for the three most popular brands resold: Chevrolet, Ford, and Toyota. We can see that Ford and Chevrolet have the most resales of older vehicles made before 1960. After 2000, prices appear to increase for all three manufacturers. Chevrolet resales follow a clear U-shaped pattern: prices are highest for vehicles made before approximately 1970 and after 2000, with a drop in the years between.
As we can see here, most of the vehicles are fueled by gas and some by diesel, while other fuels such as electric and hybrid appear less often in our data. We also see that pickups, sedans, and SUVs are the most common types among gas vehicles, and that white, grey, and black appear to be the most common colors. From this we can assess the general preference for vehicles and which types are popular among customers. This data could also be analyzed together with sales in future work.
The stacked bar plot shows the three most popular manufacturers (Toyota, Chevrolet, and Ford) and the three most popular vehicle types (pickup, sedan, and SUV). We can see that Ford is the most resold manufacturer across all types, with over 10,000 resales, and Toyota has the fewest. Ford has its highest number of resales in SUVs and its lowest in sedans. Chevrolet has approximately the same number of resales across pickups, sedans, and SUVs. Toyota has its most resales in sedans and its fewest in pickups. Among pickup trucks, Ford and Chevrolet have the most resales. Sedans are resold the most by Toyota but are also popular for Chevrolet and Ford. SUVs have the highest resale frequency for Ford, with approximately 3,750 listings, and the lowest for Toyota, with under 2,500 listings.
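The counts behind a stacked bar plot like this one come from a cross-tabulation of manufacturer against vehicle type. A minimal sketch of that tabulation is given below, using only the standard library. The listing pairs are made up for illustration and are not drawn from the dataset, and the function names are hypothetical.

```python
from collections import Counter

# Hypothetical (manufacturer, type) pairs, one per listing.
listings = [
    ("ford", "pickup"), ("ford", "SUV"), ("ford", "SUV"), ("ford", "sedan"),
    ("chevrolet", "pickup"), ("chevrolet", "sedan"), ("chevrolet", "SUV"),
    ("toyota", "sedan"), ("toyota", "sedan"), ("toyota", "SUV"),
    ("honda", "sedan"), ("ford", "van"),
]

TOP_TYPES = ("pickup", "sedan", "SUV")


def top_manufacturers(listings, n):
    """Return the n manufacturers with the most listings."""
    counts = Counter(manufacturer for manufacturer, _ in listings)
    return [manufacturer for manufacturer, _ in counts.most_common(n)]


def cross_tab(listings, manufacturers, types):
    """Count listings for each (manufacturer, type) combination of interest."""
    counts = Counter(
        (m, t) for m, t in listings if m in manufacturers and t in types
    )
    return {m: {t: counts[(m, t)] for t in types} for m in manufacturers}


top3 = top_manufacturers(listings, 3)
table = cross_tab(listings, top3, TOP_TYPES)
for manufacturer in top3:
    print(manufacturer, table[manufacturer])
```

Each inner dictionary corresponds to one bar, and each of its values to one stacked segment.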
After learning about the most popular manufacturers in the market, we wanted to find out whether the used car market behaves the same way across the United States.
The first part of our investigation into differences between state markets focuses on the key response variable: whether prices are the same. Because the price distribution is right-skewed and extremely high listed prices can distort the average, we use the median price in each local market as a more stable metric. This colored map shows the median price of used cars in each state, with red meaning high and blue meaning low. From the plot, we can see that West Virginia, at $22,300, appears to have the highest median price for used cars. Central and Northeastern states have lower prices than Northern and Western states.
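To illustrate why the median is the more stable choice here, the sketch below computes both the median and the mean price per state on a small made-up sample in which one state contains a single extreme listing. The state codes and prices are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean, median

# Hypothetical (state, price) pairs; "ca" includes one extreme listing price.
listings = [
    ("wv", 18000), ("wv", 20000), ("wv", 25000),
    ("ca", 9000), ("ca", 12000), ("ca", 15000), ("ca", 1_000_000),
]


def price_summary_by_state(listings):
    """Group prices by state and report the median and mean for each."""
    prices = defaultdict(list)
    for state, price in listings:
        prices[state].append(price)
    return {
        state: {"median": median(values), "mean": mean(values)}
        for state, values in prices.items()
    }


summary = price_summary_by_state(listings)
for state, stats in summary.items():
    print(state, stats)
```

In this toy sample the single extreme listing pulls the "ca" mean to 259,000 while its median stays at 13,500, which is the effect the median is meant to guard against.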
We also wanted to learn more about the volume of used car sales in each state. We therefore counted the number of Craigslist posts in each state and plotted the counts on a colored U.S. map, where blue means fewer posts and red means more. The highly saturated red shows that California, with 12,731 listings, has significantly more used vehicle listings than other states. Other states with a relatively large number of used vehicle posts are Florida, New York, Texas, and Ohio. No significant regional differences are observed.
Another topic we are interested in is whether buyers favor particular features of a used car depending on their geographical location. This map illustrates the most popular drive in each state, and the trend is visually noticeable. Red, representing 4wd, is the most prevalent color in the country. The Northeast, North, and Central states all have four-wheel drive as the most popular, except Ohio, the only blue state surrounded by red. Louisiana is the only state in the U.S. that has more rear-wheel drive used vehicles than any other drive. Front-wheel drive cars are popular in Western and Southern states.
One possible explanation is that four-wheel drive performs better in snowy conditions, which residents of northern states are more likely to encounter. Four-wheel drive vehicles may therefore sell better there, which would leave more used ones on the market.
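The "most popular drive in each state" is the mode of the drive variable within each state. A minimal sketch of that computation follows, using only the standard library. The state codes and drive values are invented for illustration and do not reflect the actual counts in the dataset.

```python
from collections import Counter, defaultdict

# Hypothetical (state, drive) pairs, one per listing.
listings = [
    ("mi", "4wd"), ("mi", "4wd"), ("mi", "fwd"),
    ("fl", "fwd"), ("fl", "fwd"), ("fl", "4wd"),
    ("la", "rwd"), ("la", "rwd"), ("la", "fwd"),
]


def most_common_drive_by_state(listings):
    """Return the most frequent drive type among the listings in each state."""
    counts = defaultdict(Counter)
    for state, drive in listings:
        counts[state][drive] += 1
    # most_common(1) returns a one-element list of (drive, count) pairs.
    return {state: c.most_common(1)[0][0] for state, c in counts.items()}


popular_drive = most_common_drive_by_state(listings)
print(popular_drive)
```

If two drive types are tied within a state, this sketch reports whichever was encountered first, so a real analysis would need an explicit tie-breaking rule.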
As we can see from the GMC graph, many of the most common words relate to customers, which suggests that GMC vehicles are often used as cabs and are generally seen as a cheaper, more efficient option. “Safety”, “engine”, and “power” are some of the words that stand out the most.
As we can see from the BMW graph, the words that stand out are more about the cars themselves. Descriptive words such as “new”, “clean”, and “sport” are used to describe the condition and functionality of the vehicles.
When describing vehicles, sellers tend to mention the main selling point of the vehicles depending on the brand’s target customers and usage of the cars.
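A word-frequency comparison of this kind can be sketched as follows: tokenize each description, discard common filler words, and count what remains for each manufacturer. The descriptions, the stopword list, and the tokenization rule below are all assumptions chosen for illustration, and the original analysis may have used different choices.

```python
import re
from collections import Counter

# A deliberately small, hypothetical stopword list.
STOPWORDS = {"the", "a", "and", "is", "in", "with", "this", "for", "of", "to", "has", "it"}

# Hypothetical (manufacturer, description) pairs.
listings = [
    ("bmw", "Clean BMW with new tires. Sport package, clean interior."),
    ("bmw", "Like new, sport seats and clean title."),
    ("gmc", "Powerful engine and great safety rating for this GMC."),
    ("gmc", "Engine runs strong. Safety features included."),
]


def top_words(listings, manufacturer, n=3):
    """Return the n most frequent non-stopword tokens for one manufacturer."""
    counts = Counter()
    for make, description in listings:
        if make != manufacturer:
            continue
        # Lowercase the text and keep runs of letters (and apostrophes) as tokens.
        words = re.findall(r"[a-z']+", description.lower())
        # Skip stopwords and the brand name itself, which would otherwise dominate.
        counts.update(w for w in words if w not in STOPWORDS and w != manufacturer)
    return counts.most_common(n)


print("bmw:", top_words(listings, "bmw"))
print("gmc:", top_words(listings, "gmc", n=2))
```

The resulting counts are the kind of input a word cloud or bar chart of common words would be drawn from.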
First, we looked at the most popular manufacturers and types of vehicles resold. We discovered that Chevrolet, Ford, and Toyota were the most resold brands and that SUVs, sedans, and pickup trucks were the most resold types. Ford had the most resales overall, which is unsurprising, as it is one of the earliest manufacturers. Among the top three types of used vehicles, SUVs are the most resold. Among the most popular brands and types, Ford had the most resales in pickup trucks and SUVs, while Toyota and Chevrolet had more resales of sedans.
Then we looked at vehicle resale trends geographically. Specifically, we looked into the median price, the number of listings, and the most popular drive by state, and observed several interesting trends. We found that West Virginia has the highest median price for used cars and that California has the most used cars for sale. Four-wheel drive is the dominant drive across the country, but some states in the South and West prefer front-wheel drive, as shown by their listings.
Lastly, we looked at how sellers describe the vehicles they are listing. We were interested in analyzing the types of words or language sellers use to boost their sales. Interestingly, sellers use distinct language depending on the vehicle type and the target customer. For instance, GMC is known for its focus on trucks and utility vehicles, while BMW is known for its luxury vehicles. Thus, when describing GMC vehicles, sellers use words that emphasize efficiency and customer-focused features, while performance, service, and cleanliness are mentioned more for BMW.
Although we were able to find many interesting trends through our research questions, we encountered some limitations that restricted our analysis. One limitation is that the dataset only includes listings in the US, so our findings can only be generalized to the US market. It would be interesting to see whether vehicles sold in other parts of the world show similar manufacturers, markets, and advertising styles.
Because of the large size of the dataset and our time constraints, we were not able to explore many of the other variables that could be relevant to our research questions or that could motivate new ones. In the future, we can explore more detailed characteristics of vehicles, such as condition, title status, fuel, and odometer reading, to determine how they affect the listing price. We can also compare the popularity of regular versus luxury vehicles resold and see whether there is a trend.