Introduction

Air quality is an important factor in public health, with notable implications for where individuals choose to live and for property values. Several factors determine how “healthy” air is, including concentrations of pollutants such as particulate matter (PM), nitrogen dioxide (NO2), ozone (O3), sulfur dioxide (SO2), and carbon monoxide (CO), as well as other contaminants. The Air Quality Index (AQI) is a commonly used metric that combines the concentrations of these pollutants into a single overall measure of air quality. Higher AQI values indicate worse air quality.

We would like to investigate how air quality influences housing choices, using home prices as a proxy. Our hypothesis is that areas with healthier air should be more desirable to live in and should therefore have higher home prices. To answer this question, we look at the relationship between several key variables, including air quality, geographic location, time, home price, and population.
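To make the idea of combining pollutant concentrations into one index concrete, here is a minimal sketch of the usual construction: each pollutant concentration is mapped onto the index scale by linear interpolation between breakpoints, and the overall AQI is the largest of the per-pollutant sub-indices. The function names and the breakpoint tables below are hypothetical placeholders chosen for illustration; they are not official EPA values.

```python
def sub_index(concentration, breakpoints):
    """Map one pollutant concentration onto the index scale.

    breakpoints is a list of (c_low, c_high, i_low, i_high) tuples; the
    concentration is linearly interpolated within the band that contains it.
    """
    for c_low, c_high, i_low, i_high in breakpoints:
        if c_low <= concentration <= c_high:
            slope = (i_high - i_low) / (c_high - c_low)
            return round(slope * (concentration - c_low) + i_low)
    raise ValueError("concentration outside the supplied breakpoint table")


def overall_aqi(concentrations, tables):
    """Overall index = worst (largest) sub-index across pollutants."""
    return max(sub_index(concentrations[p], tables[p]) for p in concentrations)


# Placeholder breakpoints for illustration only (NOT official EPA values).
EXAMPLE_TABLES = {
    "pm25": [(0.0, 10.0, 0, 50), (10.0, 30.0, 50, 100)],
    "o3": [(0.0, 0.05, 0, 50), (0.05, 0.10, 50, 100)],
}
```

Because the overall value is a maximum, a single badly elevated pollutant is enough to produce a high AQI even when the others are low.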

Data

Our air quality data were collected by the EPA at outdoor monitors across the US and are hosted on Kaggle at https://www.kaggle.com/datasets/epa/epa-historical-air-quality.

From this dataset, our variables of interest are Location (Longitude, Latitude, by city), AQI, and Time (Year).

Housing and rental price data come from Zillow House Price Data and are broken down by city, state, and number of bedrooms. Our variables of interest are Location (Longitude, Latitude, by city), Time (Year), and Home Price.

We also use annual population data from the US Census at the State and Metro level. Our population data include the variables City, State, Population, and Year.

Research Questions

Below, we have written the questions we will explore throughout this report.

  • How does air quality depend on geographic location in the US?
  • Does population impact AQI on a state level?
  • How does air quality impact housing prices by region?

Question 1: How does Air Quality Depend on Geographical Location?

We first investigate how air quality depends on location in the United States. The first graph shows how many air quality sensors there are in the U.S. and where they are located. The color of each point indicates how poor the air quality is, as measured by that sensor.

We see that around the year 2000 there is a spike in the number of locations being measured, but the overall spatial trend in air quality seems relatively stable over time. Interestingly, the number of sensors has decreased in recent years, but the coverage of the country is still relatively uniform. This may be because the EPA no longer needs as many sensors to obtain the information it requires.

To further investigate this spatial dependence, we fit an additive model of the form

\[ \mathbb{E}\left[\text{AQI}\mid \text{Lat},\text{Lon},\text{Year}\right] = f(\text{Lat},\text{Lon},\text{Year}) = f_1(\text{Lat}) + f_2(\text{Lon}) + f_3(\text{Lat},\text{Lon})+f_4(\text{Year}) \]

where each \(f_i\) is a linear smoother, specifically a smoothing spline. We use this model to interpolate to regions in which no sensors are located, and to get an idea of how the expected yearly averaged air quality depends on latitude and longitude. We include the year as a covariate to smooth out potential temporal dependence, but nonetheless find that the function \(f_4\) contributes very little to the sum.

The final plot overlays the interpolated contours on a map of the United States for the year 2021, which happens to be the year with the most data. However, one of the main takeaways from the additive model is that the choice of year is not very important for understanding the relative air quality between areas.

Note that the units on the color scale represent the normalized air quality, that is, the value after subtracting the (interpolated) mean and dividing by the (interpolated) standard deviation. We also place points at the five largest cities in the U.S. for reference.

We notice that simply being located in or around a large city does not necessarily mean an area will suffer from poor air quality. This is most likely because the AQI is influenced by a number of other factors that depend heavily on the environment and weather of a region. Nonetheless, we can conclude that air quality depends strongly on geographical region, especially on longitude. We do not display the partial response functions of the GAM individually, but it is worth noting that the contours of the function \(f_3\) were typically quite vertical, and the contributions of the function \(f_2\) far outweigh those of \(f_1\).
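As a rough illustration of how an additive model of this kind can be fit, the sketch below uses backfitting: each component is repeatedly re-estimated by smoothing the partial residuals left over by the other components. It makes several simplifying assumptions that differ from the model above: it swaps in a Gaussian kernel smoother (also a linear smoother) for the smoothing splines, handles only univariate components (so there is no bivariate \(f_3\) term), and the function names, bandwidth, and iteration count are hypothetical choices.

```python
import math


def kernel_smooth(x, r, bandwidth):
    """Gaussian kernel (Nadaraya-Watson) smoother of r on x, evaluated at each x."""
    fitted = []
    for xi in x:
        weights = [math.exp(-0.5 * ((xi - xj) / bandwidth) ** 2) for xj in x]
        total = sum(weights)
        fitted.append(sum(w * rj for w, rj in zip(weights, r)) / total)
    return fitted


def backfit(columns, y, bandwidth=0.3, n_iter=20):
    """Fit y ~ intercept + sum_j f_j(columns[j]) by backfitting.

    Returns the intercept and one list of fitted component values per column.
    """
    n = len(y)
    intercept = sum(y) / n
    components = [[0.0] * n for _ in columns]
    for _ in range(n_iter):
        for j, x in enumerate(columns):
            # Partial residuals: remove the intercept and every other component.
            partial = [
                y[i] - intercept
                - sum(components[k][i] for k in range(len(columns)) if k != j)
                for i in range(n)
            ]
            smooth = kernel_smooth(x, partial, bandwidth)
            # Center each component so the intercept stays identifiable.
            mean_smooth = sum(smooth) / n
            components[j] = [s - mean_smooth for s in smooth]
    return intercept, components
```

Comparing the spread of each fitted component is one simple way to judge how much a term such as the year effect contributes to the sum.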
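The normalization can be sketched as a standard z-score over the interpolated values. This is a minimal illustration that assumes the mean and standard deviation are taken over the interpolated grid values and that the population standard deviation is used; the function name is hypothetical.

```python
import statistics


def standardize(interpolated_values):
    """Return z-scores: subtract the mean, then divide by the standard deviation."""
    mean = statistics.fmean(interpolated_values)
    sd = statistics.pstdev(interpolated_values)  # assumption: population SD
    return [(value - mean) / sd for value in interpolated_values]
```

Under this convention, a score of 0 corresponds to the average interpolated value, and the scores are unitless.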

Question 2: How does population impact air quality on a state level?

Can we see any visual trends relating population growth to a significant change in air quality? The intuitive answer is that an increasing population leads to more cities, more pollution, and thus worse air quality. To answer this question, we combined US census data dating back to 2010 with the relevant data in our air quality dataset.

Combining these datasets involved removing all pre-2010 observations from the AQI dataset, reshaping the census data into (state, year, population_estimate) format, and then merging the two tables on state and year.

First, we created side-by-side boxplots to see if there is a visual trend in AQI values between states. If population size and AQI are indeed correlated, we would expect clear differences in the boxplot ranges of AQI as population increases. To keep the plot clear and coherent, we focused on the most populous and least populous states, since the difference in AQI should be most apparent when comparing the observations furthest apart in terms of our input variable.

Within this plot, we do see a trend: the medians and interquartile ranges generally climb the y-axis as the boxes move along the x-axis toward larger populations. The difference is clear enough that the interquartile ranges of the least populous and most populous states do not overlap, with the exception of Alaska, which stands out as a clear outlier. The most likely explanation is that Alaska lies outside the contiguous US and is thus subject to very different conditions than the other states displayed.

To remedy this type of issue, we examined whether population change, rather than population size, brings about a significant change in AQI. To do this, we created two time series plots by state: one showing the year-to-year percentage change in average AQI and one showing the year-to-year percentage change in estimated population. This puts every state on the same scale of measurement. Once again, to keep the graphic clean and direct the viewer’s attention to where change should be most apparent, we highlighted the 5 states with the largest absolute year-to-year percentage changes in AQI. The idea is that if a change in AQI comes with a change in population and vice versa, the two plots should, in a way, mirror one another.

Between these two plots, there are few similarities. For example, Utah exhibits the single most drastic increase in average AQI, with its 2021 value double its 2020 value. If population drove this spike, we would expect at least a modest change in Utah’s population at around the same time. However, the population growth plot shows no distinct spike in Utah’s growth at the same point on the x-axis. This suggests that population growth does not directly drive year-to-year changes in a state’s AQI.
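The cleaning and merging steps can be sketched as follows using plain Python dictionaries. Apart from the (state, year, population_estimate) layout described above, the field names (such as "aqi"), the wide census layout with one column per year, and the function names are assumptions made for illustration.

```python
def melt_census(wide_rows):
    """Reshape wide census rows (one column per year) into long
    (state, year, population_estimate) records."""
    long_rows = []
    for row in wide_rows:
        for key, value in row.items():
            if key != "state":
                long_rows.append(
                    {"state": row["state"], "year": int(key), "population_estimate": value}
                )
    return long_rows


def merge_on_state_year(aqi_rows, census_rows, first_year=2010):
    """Drop AQI rows before first_year, then inner-join on (state, year)."""
    population = {
        (r["state"], r["year"]): r["population_estimate"] for r in census_rows
    }
    merged = []
    for row in aqi_rows:
        key = (row["state"], row["year"])
        if row["year"] >= first_year and key in population:
            merged.append({**row, "population_estimate": population[key]})
    return merged
```

An inner join is used here, so any state-year pair missing from either table is dropped rather than filled in.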
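A minimal sketch of the year-to-year percentage change, and of picking the states with the largest absolute swings, is shown below. The function names and the input layout (one list of yearly values per state, in chronological order) are hypothetical.

```python
def percent_change(values):
    """Year-to-year percentage change for a chronologically ordered series."""
    return [100.0 * (curr - prev) / prev for prev, curr in zip(values, values[1:])]


def states_with_largest_swings(series_by_state, k=5):
    """Return the k states whose largest absolute yearly percent change is greatest."""
    peak = {
        state: max(abs(change) for change in percent_change(values))
        for state, values in series_by_state.items()
    }
    return sorted(peak, key=peak.get, reverse=True)[:k]
```

With this definition, a value that doubles from one year to the next shows up as a change of 100 percent, and the same function applies to both the AQI series and the population series.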

Question 3: How Does Air Quality affect Home Value?

Although better air quality is desirable, areas with high population density generally have both higher home values and worse air quality. We would like to explore how air quality and home value are related, and what role population size may play in this relationship.

For this section, our data points are the various metropolitan areas of the United States. We use metropolitan areas from the Core-Based Statistical Areas (CBSA), geographic areas defined by the Office of Management and Budget that include an urban center and neighboring regions with high levels of integration, like suburbs or sister cities. By combining our different datasets, we can get Population, Air Quality, and Home Value for each metropolitan area.

We explore two different aspects of our question. First, we explore how air quality and home value vary together at a single point in time. Then, we look at how changes in air quality relate to changes in home value over time.
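One common way to separate the role of population from the relationship between air quality and home value is a partial correlation, which measures the association between two variables after the linear effect of a third has been removed. The sketch below is illustrative only and is not necessarily the method used in this report; the function names are hypothetical.

```python
import math


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    r_xy, r_xz, r_yz = pearson(x, y), pearson(x, z), pearson(y, z)
    return (r_xy - r_xz * r_yz) / math.sqrt((1 - r_xz ** 2) * (1 - r_yz ** 2))
```

Here x, y, and z would stand for AQI, home value, and population across metropolitan areas. A partial correlation much closer to zero than the raw correlation would indicate that population accounts for most of the observed association.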

Conclusions

We will summarize our main findings below:

  1. Our first question examines how air quality depends on location in the United States, using a contour plot overlaid on a geographical map. We find that air quality is highly dependent on geographic location. There doesn’t appear to be any worsening of air quality over time, as shown by the consistency in the coloring of the contours. The scale on the map ranges from -1.6 to +1.6 and represents the AQI normalized around its mean value. Because a higher AQI means worse air quality, values above 0 indicate worse-than-average air quality and values below 0 indicate better-than-average air quality. Based on the visualization, areas along the coast and in the midwest/northeast have the worst records of air quality. However, in the west, near areas such as Wyoming and Montana, air quality is notably good. One possible explanation, which we did not test, is that areas of higher elevation have fewer pollutants in the air.

  2. Our second question discusses how population impacts air quality on a state level. Side-by-side box plots let us compare AQI in the most populous and least populous states more closely. Wyoming shows the lowest typical AQI, indicated by its lower median and shorter box, suggesting generally good air quality and little variation over the observed years. Alaska, although sparsely populated, has a higher typical AQI and a larger range, suggesting more variability in air quality. Among the more populous states, California has a notably higher typical AQI and a wide interquartile range, indicating significant variability and generally poorer air quality. The other two graphs offer insight into the relationship between air quality and population changes in Utah, Montana, Alaska, Oregon, and Nevada. The graph of percent change in AQI shows significant year-to-year variability, with values above 0 indicating a decline in air quality compared to the previous year and values below 0 indicating improvement. Given these fluctuations, we see no consistent long-term trend in air quality across the states. The graph of percent change in population displays much less drastic changes, with more gradual lines. Each state appears to show a unique pattern of air quality change, likely influenced by policy and by environmental factors such as wildfires. The large variability in Nevada and Oregon could be investigated in the context of specific events or policy changes. Although both air quality and population change from year to year, air quality is significantly more volatile.

  3. For our third question, we focus on air quality and home prices for specific metropolitan areas. We investigate how home price and air quality relate at a single point in time, as well as how changes over time in home price and air quality relate to each other. Looking at 2019 data, we find that AQI and mean home value are moderately correlated, but much of this correlation can be accounted for by differences in population. Looking at changes in air quality and home value from 2010 to 2020, we find that worsening air quality is associated with increasing home value, even after accounting for population. While this result is counterintuitive, we hypothesize that a confounding factor such as economic activity might be correlated with both higher home value and worse air quality. Overall, we find that the data do not support our original hypothesis that better air quality is associated with higher home value.

Limitations and Future Research

While our original goal was to investigate the link between air quality and housing price, we find that there are many confounding variables to account for, like geographic region and change over time. Although we investigate some of these confounding factors, we find that our results are still counterintuitive, indicating that there are still important variables we have yet to take into account.

A major limitation of our methodology is our relatively coarse-grained analysis. Since our analysis focuses on city, state, or regional differences, many confounding variables that are difficult to take into account might be at play.

One way we could address this in future work is by focusing on a smaller region, for example by studying the relationship between home price and air quality within a single city. This would reduce the need to account for city-specific variables. While it would require finer-grained data than the government-collected samples we have, we could analyze sources like PurpleAir that have geographically dense air quality data.