## Rows: 32561 Columns: 15
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (9): workclass, education, marital_status, occupation, relationship, rac...
## dbl (6): age, fnlwgt, education_num, capital_gain, capital_loss, hour_per_week

Introduction:

Census data serves as a cornerstone in offering rich demographic details across diverse populations. This project intends to utilize census data to explore various socioeconomic factors, aiming to gain insights into how these elements interact and influence each other within society.

Our exploration delves into multiple dimensions of the data, including income, education levels, race, and gender. By examining these variables, we seek to identify trends that may not be immediately apparent, thereby offering a more nuanced understanding of the socioeconomic landscape.

Data Overview:

This dataset was extracted by Barry Becker from the 1994 Census database. The extraction kept only clean records that met the following criteria: age over 16, adjusted gross income over $100, final weight greater than one, and hours worked per week greater than zero. Variables in the data include:

-Age (age): An integer representing the age of the individual.

-Work Class (workclass): A categorical variable indicating the employment type of the individual.

-Final Weight (fnlwgt): An integer representing the number of people the census believes the entry represents.

-Education (education): A categorical variable denoting the highest level of education attained by the individual.

-Education Number (education_num): An integer from 1 to 16 that encodes the education level in increasing order of attainment; we treat it as an approximate measure of years of education.

-Marital Status (marital_status): A categorical variable describing the marital status of the individual.

-Occupation (occupation): A categorical variable describing the individual’s occupation.

-Relationship (relationship): A categorical variable indicating the individual’s role in the family.

-Race (race): A categorical variable stating the individual’s race.

-Sex (sex): A binary variable indicating the gender of the individual, either Female or Male.

-Capital Gain (capital_gain): An integer representing the total income from investment sources, apart from wages/salary.

-Capital Loss (capital_loss): An integer indicating the total loss from investment sources.

-Hours per Week (hour_per_week): An integer representing the typical number of hours worked per week by the individual.

-Native Country (native_country): A categorical variable representing the country of origin of the individual.

-Income (income): A binary variable indicating whether the individual’s income exceeds $50,000.

Research Questions:

How do Gender and Race Influence Income Levels?

We are conducting this research to explore the extent and causes of income inequality across gender and racial demographics. Our aim is to uncover systemic issues and biases within the job market and other areas that contribute to economic disparities.

How Does an Individual’s Native Region Influence Their Educational Attainment and Income Levels?

We are researching this topic to analyze how the socioeconomic background associated with an individual’s country of origin affects their educational achievements and economic status. This research is crucial for understanding the specific challenges faced by persons of differing nationalities, which can inform the development of more effective educational and economic policies.

How Does Age Influence Income and Weekly Work Hours?

We are investigating how age impacts income and the number of hours individuals dedicate to work each week. This research seeks to identify the challenges various age groups encounter when entering specific professions or striving for work-life balance. We will also explore how career trajectories and working hours change through different life stages, potentially influenced by evolving priorities and capabilities.

Exploratory Data Analysis:

To enhance our understanding of the research questions, we will perform comprehensive exploratory data analysis across various variables. This process will include examining distributions and identifying patterns to assess the underlying structure of the data. We will utilize statistical summaries and visual tools to investigate relationships and trends among the variables.

Exploring Income, Race, and Gender

To investigate the relationships among income, race, and gender, we will utilize a range of statistical analyses and graphical representations. Our goal is to uncover patterns and associations that may indicate disparities among different demographic groups.

Chi-Squared Tests

We will begin our analysis by conducting a series of chi-squared tests. These tests will help us determine if there are statistically significant associations between income levels and the variables of race and gender. Identifying these associations is crucial for understanding potential income disparities across different demographic groups.

## 
##  Pearson's Chi-squared test
## 
## data:  table(census_data$income, census_data$race)
## X-squared = 330.92, df = 4, p-value < 2.2e-16

## 
##  Pearson's Chi-squared test with Yates' continuity correction
## 
## data:  table(census_data$income, census_data$sex)
## X-squared = 1517.8, df = 1, p-value < 2.2e-16

Both tests yield p-values below 2.2e-16, far below our chosen alpha of 0.05, so we reject independence: income bracket is statistically significantly associated with race and with gender. These associations are unlikely to be due to random variation alone, although the tests do not by themselves tell us how large the differences are or what causes them.
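As a minimal sketch of how a test like the ones above can be run, the following uses a small made-up data frame (`toy`, with hypothetical values) in place of `census_data`. Only `table()` and `chisq.test()` are taken from the output shown; everything else is an illustrative assumption.

```r
# Hypothetical toy data standing in for census_data (values are made up).
set.seed(1)
toy <- data.frame(
  income = sample(c("<=50K", ">50K"), 200, replace = TRUE, prob = c(0.75, 0.25)),
  sex    = sample(c("Female", "Male"), 200, replace = TRUE)
)

# Cross-tabulate the two categorical variables, then test for independence.
tab <- table(toy$income, toy$sex)
res <- chisq.test(tab)  # Yates' continuity correction is the default for 2x2 tables

res$statistic  # X-squared
res$parameter  # degrees of freedom: (rows - 1) * (columns - 1)
res$p.value    # compare against the chosen alpha
```

The degrees of freedom follow from the table shape, which is why the race test above reports df = 4 (two income brackets by five race categories) and the sex test reports df = 1.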

Barplot for Income vs. Race

In this graphic, we present a bar plot that displays the distribution of different races across two income brackets. The plot features large purple bars which indicate that individuals identified as White are the most frequently represented race within the dataset. Conversely, the races Amer-Indian-Eskimo and “Other” appear with much lower frequency, indicated by much smaller sections in the plot. This visualization highlights the uneven distribution of racial categories in our dataset, which is crucial for interpreting any race-related data analyses accurately. Such disparities suggest the need for caution when generalizing results across races, as the representation is not uniform. This information is essential for understanding the context of our findings and for designing future studies that aim to explore the intersection of race and income more comprehensively.

Mosaic Plot for Income and Sex

This mosaic plot represents the distribution of income across two gender categories. The size of each colored section reflects the proportion of individuals within that specific income and gender category. The plot visually suggests that the number of males in the greater than 50K income bracket is larger than that of females, and the same holds for the less than 50K bracket, although the difference between genders in the higher income bracket seems less pronounced.

Exploring the Impact of Native Region on Years of Education and Income Levels

We will now conduct several analyses to examine how an individual’s native region influences the amount of education they receive and the income they earn.

Histogram Displaying Years of Education with Income

Here, we present a histogram of education level split by income bracket. The chart shows that as education level increases, the proportion of individuals earning more than $50,000 also rises, which points to a positive association between education and income. It also shows that both income brackets contain far more individuals with education levels above 8 than below 8.

Empirical Cumulative Distribution Function

Here, we present an Empirical Cumulative Distribution Function (ECDF) graph of education level for the subset of individuals who earn more than $50,000 annually. From this subset, we also calculate the mean and an estimate of the standard deviation to gain insight into the distribution within this income group. The graph shows where high earners are concentrated: the curve rises sharply at certain education levels, meaning that a large share of high-income individuals sit at exactly those levels. The first notable rise occurs at an education level of 10, which corresponds to some college. A more pronounced jump occurs at an education level of 13, which corresponds to a bachelor's degree. These jumps suggest that educational milestones are associated with higher earning potential, underlining the value of educational advancement.
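To make the ECDF concrete, here is a small sketch with made-up `education_num` values for a hypothetical high-income subset (`edu_high` is an assumed name, and the numbers are not taken from the census data). A jump in the ECDF at a level equals the share of observations at exactly that level.

```r
# Hypothetical education_num values for high earners (made up for illustration).
edu_high <- c(9, 9, 10, 10, 10, 13, 13, 13, 13, 14, 16)

F_hat <- ecdf(edu_high)  # step function: share of observations <= x

F_hat(10)                # cumulative share at or below level 10
F_hat(13) - F_hat(12)    # size of the jump at level 13

mean(edu_high)           # sample mean of the subset
sd(edu_high)             # sample standard deviation of the subset
```

Calling `plot(F_hat)` draws the step function, and the tallest steps mark the education levels where the subset is most concentrated.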

Violin Plot of Income vs. Education in Years and Region

The violin plot provides a visual representation of the correlation between education levels and income across various global regions. A notable observation from the graph is the stark contrast in the education levels of individuals earning less than 50k compared to those earning 50k or more. Particularly striking is the density of educational attainment in Asia for those earning under 50k, which is markedly higher relative to similar income bands in other regions. Despite higher education levels, this demographic still falls within the lower income bracket. A somewhat similar, though less pronounced, pattern is observed in North America, where there is a greater diversity in years of education among those earning less than 50k, yet a narrower distribution for those with incomes at or above $50,000. This suggests a potential disconnect between education and income levels in these regions.

Analyzing the Relationship Between Age, Hours Worked, and Income

Finally, we will explore how age, hours worked, and income are related. This analysis aims to understand the interconnected dynamics between these variables, which could provide insights into career progression, work-life balance, and economic stability across different life stages.

Histogram of Age and Income

Here, we look at a histogram showing how the two income groups are distributed across age. The main finding is that the less than 50k income group spans nearly the full age range in the data, from the late teens to about 90, while the greater than 50k group is concentrated between roughly 25 and 75. This matters for our analysis of age and income because some age ranges contain few or no observations in the higher income group.

Density plot of Hours Worked Per Week and Income

The density plot illustrates the distribution of hours worked per week within each income category, either below or above 50k. For those earning 50k or less, the plot displays several distinct peaks, indicating clusters of commonly reported weekly hours, including one around 40 hours per week. Conversely, the density curve for those earning above $50,000 shows a single, pronounced peak at the 40-hour mark, indicating a common workweek duration among higher income earners.

ANOVA Testing to Examine the Relationships Between Age, Weekly Work Hours, and Income

##                  Df  Sum Sq Mean Sq F value Pr(>F)    
## hour_per_week     1   28639   28639   162.9 <2e-16 ***
## income            1  304626  304626  1732.4 <2e-16 ***
## Residuals     32558 5724894     176                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

This ANOVA output treats age as the response and tests its relationship with hours worked per week and income. Both terms have p-values below 2e-16, far under 0.05, so age is statistically significantly associated with each of them. Income has the larger F value (1732.4, compared with 162.9 for hours per week). This suggests that different age groups experience distinct economic outcomes and work patterns, although the test establishes association rather than the direction of influence.
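The table above is consistent with a model of the form `age ~ hour_per_week + income`: the residual degrees of freedom, 32558, equal the 32561 rows minus three estimated parameters. The sketch below assumes that formula and uses made-up data (`toy` is hypothetical) to show how such a table can be produced with `aov()`.

```r
# Hypothetical toy data; the relationship below is invented for illustration.
set.seed(2)
n <- 300
toy <- data.frame(
  hour_per_week = round(rnorm(n, mean = 40, sd = 10)),
  income        = factor(sample(c("<=50K", ">50K"), n, replace = TRUE))
)
toy$age <- 30 + 0.1 * toy$hour_per_week +
  8 * (toy$income == ">50K") + rnorm(n, sd = 12)

# Age is the response; terms are tested sequentially in the order written.
fit <- aov(age ~ hour_per_week + income, data = toy)
summary(fit)

df.residual(fit)  # rows minus estimated parameters (intercept + 2 terms)
```

Because `aov()` uses sequential sums of squares, swapping the order of `hour_per_week` and `income` can change each term's Sum Sq and F value.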

Logistic Regression and Predictive Analysis:

Income Prediction Logistic Regression

We developed a logistic regression model to predict the likelihood of an individual earning above or below $50,000 annually. This model incorporates variables such as age, sex, region of birth, race, education level, hours worked per week, and work class, allowing us to assess the impact of these factors on income outcomes.

Logistic Regression Coefficients
	Coefficient	Pr(>|z|)
(Intercept)	-7.4467710	0.0000000
age	0.0466649	0.0000000
education11th	0.1052377	0.5954420
education12th	0.4561947	0.0652936
education1st-4th	-0.8150823	0.0677167
education5th-6th	-0.6924197	0.0366945
education7th-8th	-0.6248010	0.0063225
education9th	-0.3893149	0.1304150
educationAssoc-acdm	1.7794357	0.0000000
educationAssoc-voc	1.7286610	0.0000000
educationBachelors	2.4322628	0.0000000
educationDoctorate	3.3820232	0.0000000
educationHS-grad	1.0221504	0.0000000
educationMasters	2.8527730	0.0000000
educationPreschool	-11.8604294	0.9225116
educationProf-school	3.4725581	0.0000000
educationSome-college	1.4593122	0.0000000
raceAsian-Pac-Islander	0.4188790	0.0788836
raceBlack	0.1813796	0.3783666
raceOther	-0.1569968	0.6300644
raceWhite	0.5735236	0.0036294
sexMale	1.1623719	0.0000000
regionEurope	0.1399724	0.4600806
regionNorth America	0.0307746	0.8398087
regionOceania	-13.1417005	0.9516972
regionSouth America	-1.3749259	0.0015561
hour_per_week	0.0334517	0.0000000

Logistic Regression Model Accuracy
5 fold accuracy	Base rate accuracy
0.7965576	0.751053
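The two figures in the accuracy table can be computed along the following lines. This is a sketch on made-up data with only two predictors; the 0.5 classification threshold, the fold assignment, and the names `toy` and `high_income` are assumptions rather than the report's actual code.

```r
# Hypothetical toy data; the data-generating process is invented for illustration.
set.seed(3)
n <- 500
toy <- data.frame(
  age           = round(runif(n, 17, 90)),
  hour_per_week = round(rnorm(n, 40, 10))
)
p <- plogis(-4 + 0.05 * toy$age + 0.03 * toy$hour_per_week)
toy$high_income <- rbinom(n, 1, p)

# Base rate: accuracy of always predicting the more common class.
base_rate <- max(mean(toy$high_income), 1 - mean(toy$high_income))

# 5-fold cross-validation: fit on four folds, score on the held-out fold.
k <- 5
fold <- sample(rep(1:k, length.out = n))
acc <- sapply(1:k, function(i) {
  fit  <- glm(high_income ~ age + hour_per_week, family = binomial,
              data = toy[fold != i, ])
  prob <- predict(fit, newdata = toy[fold == i, ], type = "response")
  mean((prob > 0.5) == (toy$high_income[fold == i] == 1))
})

c(cv_accuracy = mean(acc), base_rate = base_rate)
```

Comparing the cross-validated accuracy with the base rate shows how much the predictors add over a classifier that ignores them entirely.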

From our analysis of the logistic regression model coefficients, we observe that age is positively associated with income. Specifically, holding the other predictors fixed, the expected difference in the log-odds of earning more than $50,000 between individuals who differ by one year in age is 0.0467. Additionally, our model demonstrates reasonable performance in terms of out-of-sample accuracy. If we predicted the majority class (individuals earning $50,000 or less) for everyone, our accuracy would be about 75%. Our model surpasses this baseline with a 5-fold cross-validation accuracy of approximately 80%. This indicates that the predictors carry information about income beyond the majority-class baseline, although the improvement of roughly 4.5 percentage points is modest.
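The age coefficient is easier to read on the odds scale. As a short worked step, writing $p$ for the probability of earning more than $50,000 and $\beta_{\text{age}}$ for the age coefficient from the table (this notation is introduced here and is not the report's), exponentiating the coefficient gives the multiplicative change in the odds for a one-year difference in age, with the other predictors held fixed:

$$
\log\frac{p}{1-p} = \beta_0 + \beta_{\text{age}}\,\text{age} + \dots
\quad\Longrightarrow\quad
\frac{\text{odds}(\text{age}+1)}{\text{odds}(\text{age})} = e^{\beta_{\text{age}}} = e^{0.0466649} \approx 1.048
$$

Under the fitted model, each additional year of age is therefore associated with odds of high income that are about 4.8% higher.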

Highest Education Prediction

We conducted a multinomial logistic regression analysis to model educational outcomes based on several predictors: age, race, sex, region of origin, and hours worked per week.

## # weights:  208 (180 variable)
## initial  value 83596.322564 
## iter  10 value 63196.313056
## iter  20 value 62839.504212
## iter  30 value 61822.027904
## iter  40 value 61103.738872
## iter  50 value 59830.039668
## iter  60 value 59356.155948
## iter  70 value 59233.676755
## iter  80 value 59195.012206
## iter  90 value 59175.369668
## iter 100 value 59172.850299
## final  value 59172.850299 
## stopped after 100 iterations

## # weights:  208 (180 variable)
## initial  value 41796.774988 
## iter  10 value 34151.762882
## iter  20 value 33727.374438
## iter  30 value 33053.535207
## iter  40 value 31666.170916
## iter  50 value 30164.849104
## iter  60 value 29860.336226
## iter  70 value 29777.343177
## iter  80 value 29762.391654
## iter  90 value 29756.431946
## iter 100 value 29753.276705
## final  value 29753.276705 
## stopped after 100 iterations
## # weights:  208 (180 variable)
## initial  value 41799.547576 
## iter  10 value 33786.634391
## iter  20 value 33330.158195
## iter  30 value 32824.083203
## iter  40 value 31515.123813
## iter  50 value 29985.680516
## iter  60 value 29526.910361
## iter  70 value 29372.515701
## iter  80 value 29340.719090
## iter  90 value 29330.517913
## iter 100 value 29324.122016
## final  value 29324.122016 
## stopped after 100 iterations

Region coefficients
	regionEurope	regionNorth America	regionOceania	regionSouth America
11th	-0.7157470	-0.6372022	-0.9743065	-1.3468681
12th	0.9882774	0.1839269	-0.3499489	0.2423979
1st-4th	1.4755748	-0.6081388	-0.3076999	-0.2487729
5th-6th	1.1836838	-0.2609360	-0.3740031	-0.3845135
7th-8th	0.3583671	-0.6236102	-0.8292603	-0.8565429
9th	0.1070572	-0.3736848	3.1583473	-0.2395502
Assoc-acdm	0.8707809	0.0164486	-0.9141522	-0.1093420
Assoc-voc	1.1501425	0.5416911	-0.9011715	0.5201371
Bachelors	0.0462274	-0.6249684	2.2807833	-1.5698207
Doctorate	-1.3078320	-2.6400423	-1.4302589	-3.1102769
HS-grad	0.3129203	0.0274183	1.7687843	-0.2284376
Masters	-0.7457084	-1.3136273	-2.0026686	-2.0402370
Preschool	-3.0504334	-2.3357974	-0.1283189	-3.0753249
Prof-school	-0.8726039	-1.3119564	-0.9251973	-1.5252307
Some-college	0.2387012	0.0568516	2.9438124	-0.4900494

Education model accuracy
2-fold accuracy	Base rate accuracy
0.338861	0.3261915

The coefficients from the multinomial logistic regression are log-odds relative to the baseline education level and to the Asian reference region. Individuals from the European region tend to have higher odds of most education levels than those from the Asian region, with negative coefficients for 11th grade, Preschool, and the advanced degrees (Masters, Doctorate, and Prof-school). The North American group shows slightly higher odds of high school completion and some college education relative to the Asian reference group, but those coefficients are close to zero (0.027 and 0.057). Data from the Oceania region is too sparse to draw meaningful conclusions, and individuals from the South American region have negative coefficients for most education levels, with 12th grade and Assoc-voc as the exceptions.

Despite these observations, the model’s overall performance is only slightly better than the base rate of always predicting the most common education level (a 2-fold accuracy of about 33.9% against 32.6%), and every fit stopped at the 100-iteration limit rather than converging. This introduces considerable uncertainty into our findings and prevents us from drawing reliable conclusions about the relationship between region of origin and educational outcomes for this sample.
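The `# weights:` and `stopped after 100 iterations` lines in the fitting log match the output format of `nnet::multinom()`, so the sketch below assumes that function. It uses made-up data with three education levels (`toy` is hypothetical) and shows one way to raise the iteration cap and check convergence before interpreting coefficients.

```r
library(nnet)  # recommended package bundled with standard R installations

# Hypothetical toy data (values are made up).
set.seed(4)
n <- 300
toy <- data.frame(
  education = factor(sample(c("HS-grad", "Some-college", "Bachelors"),
                            n, replace = TRUE)),
  age       = round(runif(n, 17, 90)),
  region    = factor(sample(c("Asia", "Europe", "North America"),
                            n, replace = TRUE))
)

# maxit raises the iteration cap above the default of 100;
# trace = FALSE suppresses the iteration log.
fit <- multinom(education ~ age + region, data = toy,
                maxit = 500, trace = FALSE)

fit$convergence  # 0 if converged, 1 if the iteration limit was reached
coef(fit)        # one row per non-baseline education level

pred <- predict(fit, newdata = toy)                     # most probable class
mean(as.character(pred) == as.character(toy$education)) # in-sample accuracy
max(prop.table(table(toy$education)))                   # base rate
```

The first factor level acts as the baseline outcome, so `coef(fit)` has one fewer row than there are education levels, and each entry is a log-odds contrast against that baseline.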

Discussion: What Are Some Things Not Answered?

Is Older Age the Primary Cause of Fluctuating Hours per Week Worked in Individuals?

To determine if older age is the primary factor influencing fluctuations in weekly work hours, further empirical research is required due to the current limitations in our data resources. Existing data may suggest a correlation between age and variability in work hours, but establishing causality necessitates controlled studies. A dedicated research team could conduct experiments to explore not only the direct impact of age but also other contributing variables such as income level and employee satisfaction. This approach would clarify whether age itself or a combination of factors accounts for the differences in weekly work hours across age groups.

Does Access to Higher Education Equate to Job Market Success in Different Regions?

To fully understand the relationship between higher education access and job market outcomes across various regions, additional targeted research is essential. Our current dataset suggests a potential link, but the dynamics of job market success are influenced by numerous factors including economic policies, market demands, and the quality of education. Researchers should investigate not just the presence of higher education institutions but also their efficacy and relevance to the job market. A cross-regional comparative study could shed light on how different educational systems prepare individuals for the economic realities of their respective regions.

Conclusion:

The insights derived from the census dataset provide a valuable overview of socioeconomic trends and disparities. It’s evident that income distribution is uneven across racial and regional lines. The connection between education and income is complex and varies significantly by region, with some areas showing high levels of education not necessarily correlating with higher income brackets.

These findings suggest that while education is a critical factor, it is not the sole determinant of economic prosperity, and other factors such as economic policies, labor market dynamics, and perhaps systemic biases play crucial roles. The patterns observed also raise important questions about the economic value of education in different contexts and the challenges of translating educational achievements into financial stability.

Uncovering Socioeconomic Patterns Through Census Data

Angelina Ohlinger, Nathan Barretto, & Noah Gonzalez