We examine factors related to cirrhosis, which is chronic liver damage caused by diseases and conditions such as hepatitis and alcoholism. Cirrhosis causes scarring of the liver and, in the worst cases, liver failure. The damage cannot be repaired; symptoms can only be alleviated by controlling certain factors, such as the ones we study here. In severe cases, a liver transplant may be necessary.
The Cirrhosis Prediction Dataset that we use contains data collected from the Mayo Clinic trial conducted between 1974 and 1984. While the data are somewhat old, this is one of the most complete and comprehensive datasets available on cirrhosis; therefore, we still find merit in studying it.
The data itself contains 418 rows, each representing an individual patient, and there are 20 variables. Below are the variables:
The variable ID is an identification variable that we will not need. The variable Stage will be our main variable of interest (the response), because higher stages mean greater severity of cirrhosis and we want to study what could be related to that.
Also, note that we converted Age from days into years by dividing each value by 365, as years are a more conventional unit. We also created a new categorical variable called Age_Group, which classifies each patient as Under 40, 40-49, 50-59, 60-69, or Above 70. We chose these groups because the minimum age in our dataset is around 26 and the maximum is around 79. Note that partial ages (e.g. 49.18) are grouped by the number before the decimal point (e.g. 49 for 49.18).
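The conversion and grouping rule described above can be sketched as follows. This is only an illustration in Python using the standard library, not the code used for the analysis; the function names and the example ages are hypothetical, and placing patients aged exactly 70 in the top group is our assumption.

```python
import math

DAYS_PER_YEAR = 365  # the report divides by 365, so leap years are ignored


def age_in_years(age_days):
    """Convert an age recorded in days to (fractional) years."""
    return age_days / DAYS_PER_YEAR


def age_group(age_years):
    """Assign an age group using only the whole-number part of the age."""
    whole_years = math.floor(age_years)
    if whole_years < 40:
        return "Under 40"
    if whole_years < 50:
        return "40-49"
    if whole_years < 60:
        return "50-59"
    if whole_years < 70:
        return "60-69"
    # Assumption: a patient aged exactly 70 belongs to the top group.
    return "Above 70"


# Hypothetical ages in days, for illustration only (not taken from the dataset).
example_days = [9500, 17950, 21900, 28000]
for days in example_days:
    years = age_in_years(days)
    print(f"{days} days -> {years:.2f} years -> {age_group(years)}")
```

Because the group depends only on the whole-number part, an age of 49.18 falls in 40-49 rather than being rounded toward 50-59.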
Below are the main questions we want to answer, along with sub-questions to guide us:
Next, we wanted to check whether other body conditions may be associated with the stages of cirrhosis, and we used mosaic plots to do so. We created 4 mosaic plots, one for each of the 4 conditions included in the dataset (Ascites, Hepatomegaly, Spiders, and Edema), and checked whether the number of patients with each condition at each stage of cirrhosis is significantly higher or lower than expected if the condition and the stage were independent. All of these mosaic plots showed significantly more patients with the condition at stage 4 than expected under independence. This indicates that each of these 4 body conditions is associated with, and therefore not independent of, the Stage variable.
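To make "expected under independence" concrete, the sketch below computes expected cell counts and Pearson residuals for a two-way table, which are the quantities that shaded mosaic plots commonly use to colour cells. It is a minimal illustration in standard-library Python; the function names are hypothetical and the counts are invented, not taken from the dataset.

```python
import math


def expected_counts(table):
    """Expected cell counts if the row and column variables were independent."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def pearson_residuals(table):
    """(observed - expected) / sqrt(expected) for every cell."""
    expected = expected_counts(table)
    return [
        [(o - e) / math.sqrt(e) for o, e in zip(obs_row, exp_row)]
        for obs_row, exp_row in zip(table, expected)
    ]


# Hypothetical counts, NOT taken from the dataset.
# Rows: condition absent, condition present; columns: stages 1 to 4.
observed = [
    [20, 80, 130, 90],
    [1, 5, 20, 50],
]

residuals = pearson_residuals(observed)
# The Pearson chi-square statistic is the sum of squared residuals.
chi_square = sum(r * r for row in residuals for r in row)

for label, row in zip(["absent", "present"], residuals):
    print(label, [round(r, 2) for r in row])
print("chi-square:", round(chi_square, 2))
```

A large positive residual in the "present, stage 4" cell corresponds to the pattern described above: more patients with the condition at stage 4 than independence would predict. A residual with absolute value above roughly 2 is a common rule of thumb for a notable departure.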
From these 3 graphs, we believe that age and gender are associated with cirrhosis, but not necessarily with more severe stages of cirrhosis. Our first stacked bar plot shows more female patients than male patients at every stage. In terms of age, the 40-49 and 50-59 age groups are the most common across all 4 stages. This may mean that females are more likely than males to have cirrhosis in general, but not necessarily a higher stage of it. Likewise, patients aged 40-49 and 50-59 of either gender appear more likely to have cirrhosis than other age groups, but not necessarily a higher stage of it.
Gender and age may not be associated with more severe stages of the disease, but certain body conditions are. The 4 body conditions given in this dataset (Ascites, Hepatomegaly, Spiders, and Edema) seem to be associated with higher stages of cirrhosis, especially stage 4, where our mosaic plots show a significantly higher number of patients with each condition than expected under independence.
To examine the Bilirubin level of patients at each stage over time, we draw a scatterplot and fit a regression line to see the trend. The scatterplot of Bilirubin level against number of days registered shows a declining trend. The slope for stage 4 patients is the steepest of all stages, followed by that for stage 2 patients. The general declining trend suggests that patients registered for more days were potentially treated for more days, which would lead to lower levels of Bilirubin.
The steepest slope for stage 4 patients could simply reflect that those who were treated for fewer days had higher Bilirubin levels to start with.
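Comparing slopes across stages amounts to fitting a separate least-squares line within each stage. The sketch below shows one way to do that in standard-library Python. It is illustrative only: the function names are hypothetical, and the records are invented values chosen to mimic the pattern described, not data from the trial.

```python
from collections import defaultdict


def ols_slope_intercept(xs, ys):
    """Ordinary least-squares slope and intercept for a simple linear fit."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def slopes_by_group(records):
    """Fit one line per group; records are (group, x, y) tuples."""
    grouped = defaultdict(lambda: ([], []))
    for group, x, y in records:
        grouped[group][0].append(x)
        grouped[group][1].append(y)
    return {g: ols_slope_intercept(xs, ys)[0] for g, (xs, ys) in grouped.items()}


# Hypothetical (stage, days registered, bilirubin in mg/dl) records,
# invented for illustration and NOT taken from the dataset.
records = [
    (1, 1000, 0.8), (1, 2000, 0.7), (1, 3000, 0.6),
    (4, 500, 6.0), (4, 1500, 3.5), (4, 3000, 1.0),
]

slopes = slopes_by_group(records)
for stage, slope in sorted(slopes.items()):
    print(f"stage {stage}: slope = {slope:.5f} mg/dl per day")
```

In this toy example the stage 4 line is steeper mainly because its short-registration patients start with much higher values, which is the caveat raised above.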
Here we examine whether the drug the patients were treated with played any role in the level of Bilirubin. There are too few data points for stage 1 patients to produce conclusive regression lines. For patients in stages 2 and 3, those treated with placebo show a more negative association between Bilirubin and number of days registered. For patients in stage 4, those treated with D-penicillamine show a slightly more negative association between Bilirubin and number of days registered; however, the association is close to that of the placebo group.
We see two modes in the plot; we omit the upper one since no contour lines surround it. The remaining mode is centered around 54 years of age and 1.4 mg/dl Bilirubin, at stage 4. We think this plot is informative because it shows that the majority of patients are concentrated in one area, and it provides a snapshot of how the data are distributed with respect to age and Bilirubin levels and how stages are distributed among patients. Next, we separate the patients by stage.
We look at the groups of patients at different stages with a faceted heat map. For the heat map, midpoints were chosen based on the scales of density generated by R, and outliers were omitted by restricting the range of the y-axis to focus on the main group of patients in each stage. We see that the majority of Stage 1 patients center around 46 years of age and 0.6 mg/dl Bilirubin; Stage 2 patients center around 52 years and 0.75 mg/dl Bilirubin; Stage 3 patients center around 51 years and 0.8 mg/dl Bilirubin; and Stage 4 patients center around 60 years and 1.2 mg/dl Bilirubin.
From the above plots, we found that Bilirubin is negatively associated with patients’ number of days registered, especially for those in stage 4. We also studied more closely whether the drug played a role in this, and interestingly, the placebo group showed the stronger association in stages 2 and 3. This was unexpected, since D-penicillamine has been used to treat neonatal hyperbilirubinaemia, a condition in which the serum bilirubin level is excessive. In the heat map faceted by stage, we also find that the center of the patients in each stage generally shifts toward greater age and a higher level of bilirubin as we move from stage 1 to stage 4.
Through our analysis, we have come to several interesting conclusions. We found that age and gender may be associated with cirrhosis itself, but not necessarily with disease stage. There were more female patients, and those aged 40-59 appear to be at the greatest risk of the disease, but not necessarily at greater risk of severe cirrhosis. We did, however, find that individuals with other body conditions such as ascites, hepatomegaly, spiders, and edema may be more prone to severe cirrhosis. We also concluded that cholesterol does not appear to be indicative of disease severity.
Next, we studied bilirubin, a substance produced during the normal breakdown of red blood cells and processed by the liver. We decided to focus on it because a higher bilirubin level is indicative of liver problems. We found that bilirubin level tends to decrease as the number of days a patient has been registered increases. This may mean that these patients received longer periods of effective treatment. Interestingly, however, the placebo group appears to show a stronger decline than the group given the actual drug. We also found that bilirubin level was higher for more severe disease stages.
Lastly, we found that the quantitative measurements alone may not be indicative of the severity of cirrhosis. Some stage 3 and 4 patients appear to share similar characteristics with patients at earlier stages.
Overall, more study has to be done in this area. Important pieces of information were not available in this dataset, including the cause of the cirrhosis. For example, cirrhosis caused by alcohol abuse versus hepatitis may display very different patient attributes. More study is also needed to identify clearer distinctions between the disease groups, particularly between stage 3 and stage 4 patients, as they appear to share very similar characteristics. Expanding the sample pool may be beneficial.