Description of the data set
Data source: https://cmustatistics.github.io/data-repository/medicine/vietnam-health.html
This dataset contains 2068 valid observations, offering a detailed look at factors that influence healthcare behaviors. The data is organized into four main categories: socioeconomic factors, personal information, attitudes, and health exam behaviors. Socioeconomic factors include Jobstt (job status), Edu (education level), and HealthIn (health insurance), which provide insight into people’s social and economic backgrounds. Personal information covers details such as Age, Sex, height, weight, and BMI, which are important for understanding individual health profiles. Attitude variables focus on perceptions, such as concerns about diseases, trust in healthcare providers, and satisfaction with the quality of health information. Lastly, the behavioral data looks at patterns in health exam participation, including recent check-ups, preventive exams, follow-ups, and treatment-seeking habits. These categories together provide a well-rounded view of the factors that shape healthcare behaviors and practices.
Motivation Section
We are interested in this dataset because it ties closely to our curiosity about why regular health check-ups aren’t more common in Vietnam. We’ve learned some basics about public health, and we are excited to see if what we’ve studied matches up with what the data reveals. The variety of variables in the dataset makes it especially interesting, as it allows us to explore how different factors like socioeconomic status, personal traits, and attitudes might connect. We think this dataset could answer a lot of important questions about public health and give insights into improving healthcare participation.
Research Questions
We aimed to uncover the interplay between demographic, social, and experiential factors in shaping proactive healthcare practices, offering actionable insights to improve preventive healthcare participation.
Research Question 1: How does personal condition (Sex, Age, health condition) relate to the frequency of taking health exams?
Research Question 2: How do social factors (marriage, education level, job status, insurance) affect the frequency of taking exams for non-illness reasons?
Research Question 3: How does the past experience (attitude) of health exams affect the frequency of taking exams?
Research Question: How does personal condition (Sex, Age, health condition) relate to the frequency of taking health exams?
To address this research question, we need to select two sets of variables: one to represent the frequency of taking health exams and another to reflect personal health conditions. We have chosen “RecExam” and “RecPerExam” as indicators of the frequency of medical check-ups. “RecExam” refers to the time since the last visit to a doctor when experiencing symptoms of a disease, while “RecPerExam” refers to the time since the last visit without symptoms. Both variables are categorical and have four possible values: less12 = less than 12 months, b1224 = between 12 and 24 months, g24 = over 24 months, and unknow = respondent doesn’t know. Although these variables do not directly measure the frequency of medical check-ups, a longer time since the last visit generally implies a longer interval between visits and, consequently, a lower frequency.
Our explanatory variables include “Age,” “Sex,” “height,” “weight,” “BMI,” “EvalExer,” and “PerTrmt.” “Age,” “Sex,” “height,” and “weight” provide basic personal information about the respondents, while “BMI” represents the body mass index derived from height and weight. “EvalExer” indicates the respondent’s level of physical activity, with values ranging from verysuff = ‘more than enough’ to trivial = ‘none or almost none.’ Finally, “PerTrmt” shows whether the respondent is receiving long-term medical treatment.
Before creating visualizations and performing tests, we need to clean the data for the “RecExam” and “RecPerExam” variables. Specifically, we will exclude data labeled as “unknow,” as it does not provide meaningful information for analyzing the relationship between our explanatory and response variables. By removing these entries, we can ensure more accurate and insightful analysis.
We would first like to explore the distribution of proportions for the time since the last visit, both with and without symptoms, given “Sex” and “PerTrmt.”
RecExam plot: Among respondents receiving long-term treatments, we notice a smaller proportion who visited a doctor for symptoms within the last 12 months. However, they exhibit a higher proportion who took a check-up within the last 12 to 24 months compared to those not receiving long-term treatments. The “Sex” variable does not appear to influence “RecExam,” as the distributions in the faceted graphs for males and females seem similar.
RecPerExam plot: For respondents receiving long-term treatments, there is a smaller proportion who had a check-up without symptoms in the last 12 months. The other proportions remain consistent across “PerTrmt” groups. Similar to “RecExam,” the variable “Sex” does not seem to have an impact on “RecPerExam,” as the distributions across males and females show little variation.
We observed a general pattern: receiving long-term treatment seems to be linked to a higher proportion of visits occurring between 12 and 24 months ago, but a lower proportion of visits within the last 12 months, regardless of whether the visit was due to symptoms or not. To check whether this difference is statistically significant, we perform chi-square tests of independence.
##
## Pearson's Chi-squared test
##
## data: RecExam_PerTrmt_table
## X-squared = 25.657, df = 2, p-value = 2.683e-06
##
## Pearson's Chi-squared test
##
## data: RecPerExam_PerTrmt_table
## X-squared = 14.009, df = 2, p-value = 0.0009079
The chi-square tests examine whether there is an association between receiving long-term treatment (“PerTrmt”) and the two response variables, “RecExam” and “RecPerExam.” Both tests yield very low p-values (2.683e-06 and 0.0009079, respectively), which are far below the significance level of 0.05. This provides strong evidence of a statistically significant association between “PerTrmt” and both “RecExam” and “RecPerExam.” The tests themselves only establish that an association exists; the direction comes from the proportions in the plots above, where receiving long-term treatment is associated with a higher proportion of visits occurring between 12 and 24 months ago and a lower proportion of visits within the last 12 months. One possible explanation is that individuals on long-term treatment often follow a schedule that spans more than a year and may not feel the need for additional medical check-ups during that time. We now move on to the other variables.
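As a minimal sketch of how such a test can be run in R, the example below applies `chisq.test` to a small contingency table. The counts, the object name `RecExam_PerTrmt_demo`, and the "no"/"yes" labels for PerTrmt are hypothetical and chosen only for illustration; they are not the survey data and will not reproduce the output above.

```r
# Hypothetical counts for illustration only; these are NOT the study's data.
# Rows: PerTrmt (long-term treatment); columns: RecExam after dropping "unknow".
RecExam_PerTrmt_demo <- matrix(
  c(620, 150, 90,   # not on long-term treatment
    130,  70, 30),  # on long-term treatment
  nrow = 2, byrow = TRUE,
  dimnames = list(PerTrmt = c("no", "yes"),
                  RecExam = c("less12", "b1224", "g24"))
)

# Row proportions make the two treatment groups directly comparable.
round(prop.table(RecExam_PerTrmt_demo, margin = 1), 3)

# Pearson's chi-squared test of independence.
demo_test <- chisq.test(RecExam_PerTrmt_demo)

demo_test$parameter        # degrees of freedom: (2 - 1) * (3 - 1) = 2
demo_test$p.value < 0.05   # TRUE means independence is rejected at alpha = 0.05
```

The degrees of freedom match the reported tests because removing “unknow” leaves three response categories crossed with two treatment groups.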
We drew two mosaic plots: one for the table of “RecExam” and “EvalExer,” and the other for the table of “RecPerExam” and “EvalExer.” These plots were created to look for potential relationships. No standardized residual deviates strongly from 0, since every mosaic tile is white. Based on this, we find no evidence that exercise level, “EvalExer,” is related to either “RecExam” or “RecPerExam.”
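The residual check behind a shaded mosaic plot can be sketched as follows. The counts are hypothetical, and the two middle EvalExer labels are placeholders because only the end points (verysuff, trivial) are named in the text. Note that base R's `mosaicplot` shades by Pearson residuals by default, which are close to but not identical to the standardized residuals returned by `chisq.test`.

```r
# Hypothetical counts for illustration only; these are NOT the study's data.
# "mid_a" and "mid_b" are placeholder names for the intermediate levels.
exer_demo <- matrix(
  c(100, 30, 20,
    200, 62, 39,
    150, 44, 31,
     50, 15, 10),
  nrow = 4, byrow = TRUE,
  dimnames = list(EvalExer = c("verysuff", "mid_a", "mid_b", "trivial"),
                  RecExam  = c("less12", "b1224", "g24"))
)

# Standardized residuals: (observed - expected) / estimated standard error.
std_res <- chisq.test(exer_demo)$stdres
round(std_res, 2)

# A common rule of thumb flags cells with |residual| > 2.
any(abs(std_res) > 2)

# With shade = TRUE, mosaicplot uses cut points at 2 and 4, so tiles stay
# unshaded when the residuals lie within +/- 2.
mosaicplot(exer_demo, shade = TRUE, main = "EvalExer vs RecExam (demo)")
```

In this made-up table the row proportions are nearly identical, so all tiles stay unshaded, which is the same visual pattern described above.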
We next investigate how personal health information is related to the timing of the most recent medical check-up, categorized into three groups. Two dendrograms hierarchically group the subjects based on their “Age,” “height,” “weight,” and “BMI,” which we treat as four key indicators of a person’s health condition. Dendrograms are useful here because they place subjects with similar values of these four variables close together, which lets us check whether similar subjects also share similar check-up timing. We aim to explore whether individuals with poorer health conditions are more likely to have had a check-up more recently. The labels in the dendrograms are colored to represent the time range of the most recent check-up, both with and without symptoms. Specifically, for the “RecPerExam” variable, the color scheme is as follows: blue for those who had a check-up in the last 12 months (“less12”), green for those between 12 and 24 months (“b1224”), and red for those over 24 months (“g24”). The same color scheme applies to “RecExam.”
From the dendrograms, we observe that there is no clear dominance of any one color in either graph, indicating that health conditions, based on age, height, weight, and BMI, do not seem to be strongly related to the timing of the most recent check-up.
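The dendrogram reasoning can be sketched in base R as follows. The data are synthetic, the units (cm, kg) are assumed, and the complete-linkage method and the cut into three groups are illustrative assumptions rather than the settings used for the figures above.

```r
set.seed(1)  # synthetic data for illustration only; NOT the study's data
n <- 60
demo <- data.frame(
  Age     = round(runif(n, 18, 70)),
  height  = round(rnorm(n, 162, 8)),   # cm (assumed unit)
  weight  = round(rnorm(n, 58, 9)),    # kg (assumed unit)
  RecExam = sample(c("less12", "b1224", "g24"), n, replace = TRUE)
)
demo$BMI <- demo$weight / (demo$height / 100)^2

# Standardize so that no variable dominates the Euclidean distance
# simply because of its measurement units.
X <- scale(demo[, c("Age", "height", "weight", "BMI")])

# Hierarchical clustering (complete linkage is an assumed choice).
hc <- hclust(dist(X), method = "complete")
plot(hc, labels = FALSE, main = "Dendrogram (synthetic demo)")

# Cut the tree into 3 groups and cross-tabulate against RecExam. If the
# clusters tracked check-up timing, each row would concentrate in one column.
clusters <- cutree(hc, k = 3)
table(cluster = clusters, RecExam = demo$RecExam)
```

A cross-tabulation like the last line is a numerical counterpart to looking for a dominant label color within each branch.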
In conclusion, “PerTrmt” (indicating whether the subject is receiving long-term treatment) is the only factor found to be associated with both “RecExam” and “RecPerExam.” In other words, it is the only personal information variable that appears to relate to how often people take check-ups, whether with or without symptoms. The other variables and relationships we explored did not reveal any significant associations.
Research Question: How do social factors (marriage, education level, job status, insurance) affect the frequency of taking exams for non-illness reasons?
Our second research question is driven by the need to understand how social factors, such as marital status, education level, job status, and insurance coverage, affect the frequency of health exams conducted for non-illness-related reasons. We use the frequency of health exams conducted for non-illness-related reasons as the dependent variable because, based on the previous analysis, RecPerExam and RecExam are highly correlated. We then evaluate the effect of the variables Jobstt, Edu, and HealthIn.
We first clean the data for RecPerExam and Edu so that their categories are easy to understand.
Next, we want a basic understanding of how the data are distributed so that we can design better visualizations. Specifically, we looked at the distributions of the Edu and Jobstt variables to see whether there are any majority or minority groups that we need to take into account.
##         Education_Level Count
## 1         Middle School   416
## 2           High School   142
## 3 College or University  1383
## 4         Post-Graduate   127
##   Job_Status Count
## 1     Stable  1123
## 2   Unstable   171
## 3  Housewife    85
## 4    Student   548
## 5    Retirer    37
## 6      Other   104
From the two tables above, we can see that the distributions of education level and job status are uneven. Most participants have higher education (college or university), and most either hold a stable job or are students; these groups dominate the sample.
Now we want to investigate how Jobstt, Edu, and HealthIns are related to each other. Because Jobstt and Edu are unevenly distributed, we display proportions instead of raw counts, which avoids the distortion that the uneven group sizes would otherwise cause.
Since education level, job status, and whether a person has health insurance are very likely to be related, we draw a plot to examine their relationship and their distributions. Given the uneven group sizes noted above, we use proportions instead of raw counts so that the groups can be compared directly. From the plot, we can see that, in general, more participants have health insurance than not. Within the college or university and post-graduate categories, the proportion of participants who have insurance is much larger than the proportion who do not. The same holds within the stable job and student categories. This suggests that people with a higher education level, and people who hold a stable job or are students, tend to have health insurance.
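A small sketch of why within-group proportions are easier to compare than raw counts is given below. The counts and the "yes"/"no" insurance labels are hypothetical and used only for illustration; they are not the survey data.

```r
# Hypothetical counts for illustration only; these are NOT the study's data.
edu_ins_demo <- matrix(
  c( 90,  60,
     30,  20,
    400, 100,
     45,   5),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Edu = c("Middle School", "High School",
            "College or University", "Post-Graduate"),
    HealthIns = c("yes", "no")
  )
)

# Raw counts are dominated by the largest education group.
edu_ins_demo

# Proportions within each education level (each row sums to 1), so groups
# of very different sizes can be compared on the same scale.
within_group <- prop.table(edu_ins_demo, margin = 1)
round(within_group, 2)

# Stacked bars of the within-group proportions.
barplot(t(within_group), legend.text = TRUE,
        ylab = "Proportion", main = "Insurance by education (demo)")
```

Here the post-graduate group has the fewest insured respondents in absolute terms but the highest insured proportion, which raw counts alone would hide.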
Then, we want to answer: How do those social factors affect their frequency of taking exams?
We choose to use a Dendrogram to investigate the clustering.
Then, we did some analysis for the Dendrogram to get the information for the groups.
## clusters
## 1 2 3 4 5 6 7 8
## 1001 447 251 65 54 153 75 22
## # A tibble: 8 × 5
##   Cluster Common_RecPerExam Common_Edu            Common_Jobstt Common_HealthInc
##     <int> <chr>             <fct>                 <chr>         <chr>
## 1       1 < 12              college or university stable        yes
## 2       2 unknow            college or university student       yes
## 3       3 < 12              college or university stable        yes
## 4       4 < 12              college or university other         no
## 5       5 unknow            middle school         housewife     yes
## 6       6 unknow            middle school         stable        yes
## 7       7 < 12              middle school         other         yes
## 8       8 unknow            college or university other         yes
From the tables, we can see that the first cluster is by far the largest (1001 of the 2068 participants). Its most common profile is a participant who visited a doctor for a non-illness reason within the last 12 months and has a college or university degree, a stable job, and health insurance.
It is also noticeable that the one cluster whose most common job status is student (cluster 2) most often reports “unknow” for the time since the last non-illness visit. This may indicate that students have fewer non-illness check-ups, or that their parents arrange the check-ups, so they do not remember the exact time.
Additionally, in two of the three clusters where middle school is the most common education level (clusters 5 and 6), the most common answer is also “unknow.” This may indicate that these participants pay less attention to regular health checks.
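A per-cluster summary like the one above can be sketched as follows: count the cluster sizes, then take the most frequent value of each variable within each cluster. The data frame and the helper `most_common` are hypothetical and only illustrate the idea; they are not the code or data behind the table.

```r
# Hypothetical mini data set for illustration only; NOT the study's data.
demo <- data.frame(
  cluster    = c(1, 1, 1, 2, 2, 2, 2),
  RecPerExam = c("< 12", "< 12", "unknow", "unknow", "unknow", "< 12", "unknow"),
  Jobstt     = c("stable", "stable", "other", "student", "student", "student", "stable")
)

# Cluster sizes.
table(clusters = demo$cluster)

# Most frequent value of a vector (the first one in sorted order wins ties).
most_common <- function(x) names(which.max(table(x)))

# Apply the helper to every variable within each cluster.
cluster_summary <- aggregate(
  demo[, c("RecPerExam", "Jobstt")],
  by  = list(cluster = demo$cluster),
  FUN = most_common
)
cluster_summary
```

Because the summary keeps only the single most common value, it describes a typical member of each cluster rather than every member, which is worth keeping in mind when interpreting it.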
Research Question: How does the past experience (attitude) of health exams affect the frequency of taking exams?
Our third research question is motivated by the practical importance of understanding how individuals’ perceptions of healthcare experiences influence their willingness to engage in regular check-ups. It addresses a critical concern: whether fostering a better healthcare environment and improving the quality of healthcare services could increase the frequency of health exams. To investigate this, we use “RecExam” (time since the last check-up with symptoms) and “RecPerExam” (time since the last routine check-up) as indicators of check-up frequency, as they capture both reactive and proactive behaviors towards healthcare visits. These variables, though categorical, are practical proxies for frequency. We relate them to two groups of attitude variables. The first group measures trust in medical staff: Tangibles (perceived quality of medical equipment and personnel at check-ups), Reliability (perceived ability of examiner to perform medical services that meet the patient’s), and Empathy (perceived thoughtfulness and sense of responsibility of medical staff). The second group measures satisfaction with informational quality: SuffInfo (respondent’s rating of the sufficiency of information they received in check-ups), AttractInfo (respondent’s rating of the attractiveness of that information), and ImpressInfo (respondent’s rating of the impressiveness of that information). Our research aims to reveal actionable insights for improving patient experiences and promoting healthier check-up habits.
We began with this heatmap to explore the relationship between respondents’ perceptions of healthcare experience attributes (e.g., Tangibles, Reliability, Empathy) and their frequency of check-ups (RecExam and RecPerExam). This plot provides an initial overview of how different healthcare attributes are rated across various time intervals since the last check-up, helping to identify key attributes that might influence health exam frequency.
From this heatmap, we can observe a clear trend: respondents who have had a more recent check-up (less than 12 months ago) tend to rate these attributes higher across all categories, particularly Tangibles, Reliability, and Empathy. The less12 column generally shows deeper colors, such as red and orange, than the other two columns for all six variables. Conversely, those who have not had a check-up in over 24 months generally report lower scores. For Empathy, Reliability, and Tangibles, the between-12-and-24-months column shows a slightly deeper color than the over-24-months column. The pattern is similar whether or not the respondents visited the doctor because of disease symptoms. This suggests that a more positive perception of healthcare quality is associated with more frequent check-ups: people who trust the staff and the healthcare they receive tend to have check-ups more often than others.
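One way to build the matrix behind such a heatmap is sketched below, assuming each cell is the mean rating of an attribute within a RecExam group; the report does not state which summary statistic was plotted. The ratings are synthetic, and the 1 to 5 scale is an assumption made for illustration.

```r
set.seed(2)  # synthetic ratings for illustration only; NOT the study's data
n_per_group <- 30
groups <- c("less12", "b1224", "g24")
attrs  <- c("Tangibles", "Reliability", "Empathy")

demo <- data.frame(
  RecExam     = factor(rep(groups, each = n_per_group), levels = groups),
  Tangibles   = sample(1:5, 3 * n_per_group, replace = TRUE),  # assumed 1-5 scale
  Reliability = sample(1:5, 3 * n_per_group, replace = TRUE),
  Empathy     = sample(1:5, 3 * n_per_group, replace = TRUE)
)

# Mean rating of every attribute within every RecExam group:
# rows are attributes, columns are groups.
mean_ratings <- sapply(split(demo[, attrs], demo$RecExam), colMeans)
round(mean_ratings, 2)

# Heatmap of the means without reordering or rescaling, so colors can be
# compared across columns as in the discussion above.
heatmap(mean_ratings, Rowv = NA, Colv = NA, scale = "none",
        main = "Mean rating by RecExam (demo)")
```

Setting `scale = "none"` keeps the colors on the original rating scale; the default row scaling would instead highlight differences within each attribute.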
After identifying trends in perceived healthcare attributes through the heatmap, we created faceted ridgeline plots to visualize the distribution of ratings for each healthcare attribute (Tangibles, Reliability, Empathy, SuffInfo, AttractInfo, ImpressInfo) across the check-up frequency categories. We focus only on RecExam, since the previous section showed that RecExam and RecPerExam follow similar patterns. This approach allows us to compare how respondents’ perceptions of these attributes vary with their RecExam categories, providing a clear and detailed view of the patterns for each attribute in a single, cohesive visualization.
The ridgeline plots reveal how healthcare attribute ratings vary across RecExam frequencies (less12, b1224, g24). AttractInfo and ImpressInfo show higher scores for less12 and b1224 compared to g24, indicating more positive perceptions among those with recent check-ups. SuffInfo has similar distributions across all categories, suggesting consistency in perceived informational sufficiency. Empathy and Reliability have higher ratings for less12, reflecting stronger perceptions of interpersonal and trust-related factors among frequent check-up respondents, while b1224 and g24 show similar lower scores. Tangibles exhibit consistent patterns across all groups, implying less variation influenced by check-up frequency. Overall, interpersonal and trust-based attributes are more influenced by recent interactions, while structural attributes like tangibles remain stable over time.
We used PCA at the end of the analysis to explore the overall structure and interrelationships among the healthcare experience attributes. We decided to focus mainly on visits that occurred with symptoms of disease. By reducing the dimensionality of the data, PCA allows us to visualize how these attributes interact and to identify which factors contribute the most to variation in healthcare perceptions.
From this plot, we can see that the first principal component (PC1) captures the majority of the variability, with strong positive contributions from all attributes. This indicates that higher ratings in these attributes are strongly associated with more frequent check-ups (less12, represented by blue points). The second principal component (PC2) captures less variability but highlights distinctions between attribute groups. Specifically, ImpressInfo, SuffInfo, and AttractInfo have smaller PC2 contributions compared to Empathy, Tangibles, and Reliability, suggesting that the latter group of attributes exhibits greater variability and differentiation. In contrast, the information-related attributes (ImpressInfo, SuffInfo, AttractInfo) are more closely aligned, reflecting their similarity. This pattern reinforces the relationship between attitude attributes and exam frequency, highlighting the greater variability observed in Empathy, Reliability, and Tangibles. This is understandable, as respondents are likely to provide more nuanced and varied evaluations of interpersonal and tangible aspects of healthcare experiences. When asked about the information they received during check-ups, however, respondents tended to give more consistent ratings, reflecting the uniform nature of their experiences in this area.
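The quantities discussed above (variance captured by each component and the loadings on PC1 and PC2) can be obtained with base R's `prcomp`, as in the sketch below. The ratings are synthetic: they are generated from a hypothetical shared "overall impression" factor plus noise, purely so that PC1 has something to pick up, and they are not the survey data.

```r
set.seed(3)  # synthetic ratings for illustration only; NOT the study's data
n <- 100
overall <- rnorm(n)  # hypothetical shared "overall impression" factor

ratings <- data.frame(
  Tangibles   = overall + rnorm(n, sd = 0.8),
  Reliability = overall + rnorm(n, sd = 0.8),
  Empathy     = overall + rnorm(n, sd = 0.8),
  SuffInfo    = overall + rnorm(n, sd = 0.8),
  AttractInfo = overall + rnorm(n, sd = 0.8),
  ImpressInfo = overall + rnorm(n, sd = 0.8)
)

# Center and scale each attribute so all contribute on a comparable footing.
pca <- prcomp(ratings, center = TRUE, scale. = TRUE)

# Proportion of total variance captured by each principal component.
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
round(var_explained, 3)

# Loadings: how each attribute contributes to PC1 and PC2.
# The overall sign of a component is arbitrary.
round(pca$rotation[, 1:2], 2)

# Biplot of observations (scores) and attributes (loading arrows).
biplot(pca, main = "PCA biplot (synthetic demo)")
```

Loadings of the same sign and similar size on PC1 correspond to the "all attributes contribute" reading above, while differences in the PC2 loadings are what separate groups of attributes.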
In conclusion, the analysis reveals that respondents with more recent check-ups rate healthcare experience attributes higher, especially Tangibles, Reliability, and Empathy. The heatmap emphasizes this trend, while the regression plots show that Empathy, AttractInfo, and ImpressInfo consistently influence both routine and symptom-driven check-ups, whereas practical factors like Reliability and Tangibles are more relevant for symptom-driven visits. PCA confirms that higher ratings across all attributes correlate with frequent check-ups, with greater variability in interpersonal and tangible aspects. Overall, positive healthcare perceptions are strongly linked to check-up frequency, which points to a need to enhance patient experiences in order to encourage regular health exams.
For our first research question, we find that personal health information, as indicated by age, height, weight, and BMI, does not appear to have a strong relationship with the timing of the last medical check-up without symptoms. The lack of a dominant color in the dendrogram groups suggests that individuals’ physical health conditions, as captured by these variables, are not strongly predictive of how recently they had a check-up. This implies that we need to consider other factors that potentially affect the frequency of check-ups.
For the second research question, social factors like education level, job status, and health insurance are related to the frequency of non-illness check-ups. Participants with college or university degrees, stable jobs, and health insurance are more likely to have recent non-illness check-ups. In contrast, students often report “unknow” for their last non-illness visit, which suggests that they engage in routine check-ups less often. This outcome is understandable, because individuals with higher education levels, stable jobs, and health insurance are more likely to prioritize routine check-ups.
For the third question, we find that individuals with more recent check-ups tend to rate healthcare experience attributes higher, particularly Tangibles, Reliability, and Empathy, as highlighted in the heatmap. The ridgeline plots further illustrate that positive healthcare perceptions, particularly attributes like Empathy, Reliability, and AttractInfo, are strongly associated with more frequent check-ups. PCA further demonstrates that positive healthcare perceptions, especially interpersonal and informational attributes, are strongly linked to check-up frequency, emphasizing the importance of enhancing patient experiences to encourage regular health exams.
While our analysis addressed the three research questions and provided meaningful insights, certain aspects remain open for future exploration due to limitations in data and methodology. For instance, we separated factors into personal information, socioeconomic status, and perceptions of previous experience but did not analyze their interconnections, as this requires more nuanced statistical techniques, such as structural equation modeling, which were beyond the scope of this project. Furthermore, identifying the most impactful factors driving frequent check-ups (within the last 12 months) would benefit from additional data and predictive models such as machine learning approaches. Also, since the dataset was collected through a survey, it captures respondents at only a single point in time. Longitudinal data could reveal changes in perceptions and provide deeper insights into the third research question.