Introduction

We examine the factors related to cirrhosis, which is chronic liver damage caused by liver diseases and conditions such as hepatitis and alcoholism. Cirrhosis causes scarring of the liver and, in the worst cases, liver failure. The damage cannot be repaired, and the symptoms can only be alleviated by controlling certain factors like the ones that we study here. In severe cases, a liver transplant may be necessary.

The Data

The Cirrhosis Prediction Dataset that we use contains data collected from the Mayo Clinic trial conducted between 1974 and 1984. While the data may be slightly old, it is one of the most complete and comprehensive data sets available on cirrhosis; therefore, we still find merit in studying it.

The data contains 418 rows, each representing an individual patient, and 20 variables, listed below:

  1. ID: unique identifier
  2. N_Days: number of days between registration and the earlier of death, transplantation, or study analysis time in July 1986
  3. Status: status of the patient C (censored), CL (censored due to liver transplant), or D (death)
  4. Drug: type of drug D-penicillamine or placebo
  5. Age: age in [days]
  6. Sex: M (male) or F (female)
  7. Ascites: presence of ascites N (No) or Y (Yes)
  8. Hepatomegaly: presence of hepatomegaly N (No) or Y (Yes)
  9. Spiders: presence of spiders N (No) or Y (Yes)
  10. Edema: presence of edema N (no edema and no diuretic therapy for edema), S (edema present without diuretics, or edema resolved by diuretics), or Y (edema despite diuretic therapy)
  11. Bilirubin: serum bilirubin in [mg/dl]
  12. Cholesterol: serum cholesterol in [mg/dl]
  13. Albumin: albumin in [gm/dl]
  14. Copper: urine copper in [ug/day]
  15. Alk_Phos: alkaline phosphatase in [U/liter]
  16. SGOT: SGOT in [U/ml]
  17. Triglycerides: triglycerides in [mg/dl]
  18. Platelets: platelets per cubic [ml/1000]
  19. Prothrombin: prothrombin time in seconds [s]
  20. Stage: histologic stage of disease (1, 2, 3, or 4)

The variable ID is only an identifier, so we do not use it in the analysis. The variable Stage is our main variable of interest (the response), because higher stages mean greater severity of cirrhosis and we want to study what could be related to that.

Also, note that we converted Age from days into years by dividing each value by 365, as years are a more conventional unit. We also created a new categorical variable called Age_Group, which records whether a patient is Under 40, 40-49, 50-59, 60-69, or Above 70. We chose these groups because the minimum age in our dataset is around 26 and the maximum is around 79. Note that all partial ages (e.g. 49.18) are grouped based on the number before the decimal (e.g. 49 for 49.18).
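The conversion and grouping described above can be sketched as follows. This is a minimal illustration rather than the code used for the analysis: the function name `age_group` and the sample ages are hypothetical, and placing patients aged exactly 70 in the "Above 70" group is an assumption, since the text does not say where that boundary falls.

```python
import math


def age_group(age_days: float) -> str:
    """Map an age in days to one of the five age groups used in the report."""
    # Divide by 365 and keep only the number before the decimal (49.18 -> 49).
    years = math.floor(age_days / 365)
    if years < 40:
        return "Under 40"
    if years >= 70:
        # Assumption: a patient aged exactly 70 belongs to the top group.
        return "Above 70"
    decade = (years // 10) * 10
    return f"{decade}-{decade + 9}"


# Hypothetical ages in days, not values taken from the dataset.
for days in (9600, 17950, 21900, 28800):
    print(days, round(days / 365, 2), age_group(days))
```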

Research Questions

Below are the main questions we want to answer, along with sub-questions to guide us:

  1. Are certain groups of people more likely to have severe cirrhosis?
  • Are gender and age related to disease stage?
  • Does the presence of certain body conditions (e.g. Ascites, Hepatomegaly) relate to disease stage?
  • Is high cholesterol, one of the most commonly checked body metrics, related to disease stage?
  2. What is the relationship between serum bilirubin level, age, and disease stage?
  • How does serum bilirubin level change over time for each disease stage?
  • Does drug treatment play a role in any association between serum bilirubin level and number of days registered?
  • Does serum bilirubin level show a consistent pattern across ages for different disease stages?
  3. How are the quantitative covariates related to the cirrhosis disease stage of a patient?
  • Can we use a clustering algorithm to define the different groups of disease stages?
  • Are any of the quantitative measures more clearly associated with certain disease stages?

Are certain groups of people more likely to have severe cirrhosis?

Does the presence of certain body conditions relate to disease stage?

Graph 2. Mosaic plot showing the presence of a body condition or not for each cirrhosis disease stage.

Next, we used mosaic plots to test whether other body conditions are associated with the stage of cirrhosis. We created one mosaic plot for each of the 4 conditions included in the dataset (Ascites, Hepatomegaly, Spiders, and Edema) and checked whether the number of patients with each condition at each stage is significantly higher or lower than expected if the condition and the stage were independent. All 4 mosaic plots show significantly more patients with the condition at stage 4 than expected under independence. This indicates that each of these body condition variables is associated with the Stage variable.
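To make "higher than expected under independence" concrete, the sketch below computes expected counts and Pearson residuals for a contingency table, which is the quantity that shaded mosaic plots commonly color by. It assumes that this is how significance was read from the plots. The counts are hypothetical, not the report's data, and the cutoff of 2 is only a rough rule of thumb.

```python
import math


def pearson_residuals(table):
    """Return (chi-square statistic, residual matrix) for a table of counts.

    table is a list of rows; each row is a list of observed counts.
    """
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    residuals = []
    for i, row in enumerate(table):
        res_row = []
        for j, observed in enumerate(row):
            # Expected count if the row and column variables were independent.
            expected = row_totals[i] * col_totals[j] / n
            r = (observed - expected) / math.sqrt(expected)
            res_row.append(r)
            chi2 += r * r
        residuals.append(res_row)
    return chi2, residuals


# Hypothetical counts: rows = condition absent / present, columns = stage 1 to 4.
counts = [
    [20, 85, 140, 95],
    [1, 5, 15, 50],
]
chi2, res = pearson_residuals(counts)
print(f"chi-square = {chi2:.2f}")
for label, row in zip(("absent", "present"), res):
    print(label, [round(r, 2) for r in row])
```

A residual well above 2 in the "present" row at stage 4 would correspond to a cell with more patients than independence predicts.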

Findings

From these 3 graphs, we believe that age and gender are associated with cirrhosis, but not necessarily with more severe stages of cirrhosis. In our first stacked bar plot, female patients generally outnumber male patients at all stages. In terms of age, the 40-49 and 50-59 age groups are the most common across all 4 stages. This may mean that females are more likely than males to have cirrhosis in general, but not necessarily a higher stage of it. Likewise, patients aged 40-49 and 50-59 of both genders may be more likely to have cirrhosis than other age groups, but not necessarily a higher stage of it.

Gender and age may not be associated with more severe stages of the disease, but certain body conditions are. The 4 body conditions given in this dataset (Ascites, Hepatomegaly, Spiders, and Edema) seem to be associated with higher stages of cirrhosis, especially stage 4, where our mosaic plots show significantly more patients with each condition than expected under independence.

What is the relationship between serum bilirubin level, age, and disease stage?

How does serum bilirubin level change over time for each disease stage?

Graph 4. Scatter plot of serum bilirubin level for each disease stage with regression lines

To examine how the Bilirubin level of patients in each stage changes over time, we drew a scatterplot of Bilirubin against the number of days registered and fit a regression line for each stage. The overall trend is declining. The slope for stage 4 patients is the steepest, followed by the slope for stage 2 patients. The general decline suggests that patients registered for more days were potentially treated for longer, which could lead to lower Bilirubin levels. The steep slope for stage 4 could simply reflect that stage 4 patients registered for fewer days had higher Bilirubin levels to start with.
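The per-stage regression lines can be illustrated with an ordinary least-squares fit computed separately for each stage. This is a sketch under the assumption that the lines in the plot are simple linear fits of Bilirubin on N_Days. The function names and the records are hypothetical and do not come from the dataset.

```python
from collections import defaultdict


def ols_slope_intercept(xs, ys):
    """Least-squares slope and intercept for y = intercept + slope * x."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxx = sum((x - mean_x) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("all x values are identical; the slope is undefined")
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return slope, mean_y - slope * mean_x


def slopes_by_stage(records):
    """Fit one line per stage from (stage, n_days, bilirubin) tuples."""
    groups = defaultdict(lambda: ([], []))
    for stage, n_days, bilirubin in records:
        groups[stage][0].append(n_days)
        groups[stage][1].append(bilirubin)
    return {
        stage: ols_slope_intercept(xs, ys)
        for stage, (xs, ys) in sorted(groups.items())
    }


# Hypothetical records: (stage, days registered, bilirubin in mg/dl).
records = [
    (1, 1000, 0.8), (1, 3000, 0.6),
    (4, 500, 6.0), (4, 1500, 3.0), (4, 3500, 1.0),
]
for stage, (slope, intercept) in slopes_by_stage(records).items():
    print(f"stage {stage}: slope = {slope:.5f} mg/dl per day, intercept = {intercept:.2f}")
```

A more negative slope for a stage means Bilirubin falls faster with days registered in that group. The slope alone does not say why.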

Does drug treatment play a role in any association between serum bilirubin level and number of days registered?

Graph 5. Faceted scatter plot of serum bilirubin level for each disease stage with regression lines based on drug treatment.

Here we examine whether the drug a patient was treated with played any role in the level of Bilirubin. For stage 1, there are too few data points to generate conclusive regression lines. In stages 2 and 3, patients treated with placebo show a more negative association between Bilirubin and number of days registered than those treated with D-penicillamine. In stage 4, patients treated with D-penicillamine show a slightly more negative association, but it is close to that of the placebo group.

Does serum bilirubin level show a consistent pattern across ages for different disease stages?

Graph 6. Contour plot showing one mode for serum bilirubin level and age.

We see two modes in the plot; we omit the upper one since no contour lines surround it. The remaining mode is centered around 54 years of age and 1.4 mg/dl Bilirubin, in stage 4. We think this plot is informative because it shows that the majority of patients are concentrated in one area, and it provides a snapshot of how the data is distributed with respect to age and Bilirubin level and of how stages are distributed among patients. Next, we separate the patients by stage.
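The idea of a mode in a contour plot can be illustrated with a small two-dimensional kernel density estimate evaluated on a grid. This is only a sketch: it assumes a Gaussian product kernel with hand-picked bandwidths, which may differ from what the plotting software chose, and the (age, Bilirubin) pairs are hypothetical.

```python
import math


def kde_mode(points, bw_x, bw_y, grid_x, grid_y):
    """Return (x, y, density) at the grid point with the highest estimated density.

    Uses a 2D Gaussian product kernel with bandwidths bw_x and bw_y.
    """
    norm = 1.0 / (len(points) * 2 * math.pi * bw_x * bw_y)
    best = None
    for gx in grid_x:
        for gy in grid_y:
            dens = norm * sum(
                math.exp(-0.5 * (((gx - px) / bw_x) ** 2 + ((gy - py) / bw_y) ** 2))
                for px, py in points
            )
            if best is None or dens > best[2]:
                best = (gx, gy, dens)
    return best


# Hypothetical (age in years, bilirubin in mg/dl) pairs, including one isolated point.
points = [(52, 1.2), (54, 1.4), (55, 1.5), (56, 1.3), (53, 1.6), (38, 0.7), (70, 9.0)]
grid_x = [30 + i for i in range(51)]              # ages 30 to 80
grid_y = [round(0.1 * j, 1) for j in range(101)]  # bilirubin 0.0 to 10.0
# Bandwidths are chosen by hand for this illustration (an assumption).
mode = kde_mode(points, bw_x=3.0, bw_y=0.5, grid_x=grid_x, grid_y=grid_y)
print("mode near age", mode[0], "and bilirubin", mode[1])
```

An isolated point such as (70, 9.0) produces a small bump in the density but does not move the main mode.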

Graph 7. Heatmaps of serum bilirubin level and age for each cirrhosis disease level.

We look more closely at the patients in each stage with a faceted heat map. For the heat map, midpoints were chosen based on the density scales generated by R, and outliers were omitted by restricting the range of the y-axis to focus on the main group of patients in each stage. The majority of Stage 1 patients center around 46 years of age and 0.6 mg/dl Bilirubin; Stage 2 patients around 52 years and 0.75 mg/dl; Stage 3 patients around 51 years and 0.8 mg/dl; and Stage 4 patients around 60 years and 1.2 mg/dl.

Findings

From the above plots, we found that Bilirubin is negatively associated with patients' number of days registered, especially for those in stage 4. We also studied more closely whether the drug had an effect on this and, interestingly, the placebo group showed the more negative association in stages 2 and 3. This was interesting since D-penicillamine has been used to treat neonatal hyperbilirubinaemia, a condition in which the serum bilirubin level is excessive. In the heat map faceted by stage, we also find that the center of the patients shifts toward greater age and higher bilirubin level as we move from stage 1 to stage 4.

Conclusion

Through our analysis, we have come to several interesting conclusions. We found that age and gender may be associated with cirrhosis itself, but not necessarily with disease stage. There were more female patients, and those aged 40-59 appear to be at the greatest risk of the disease, but not necessarily at greater risk of severe cirrhosis. We did, however, find that individuals with other body conditions such as ascites, hepatomegaly, spiders, and edema may be more prone to severe cirrhosis. We also concluded that cholesterol does not appear to be indicative of disease severity.

Next, we studied bilirubin, a substance produced during the normal breakdown of red blood cells and processed by the liver. We decided to focus on it because a higher bilirubin level is indicative of liver problems. We found that bilirubin level tends to decrease as the number of days a patient has been registered increases. This may mean that these patients received longer periods of effective treatment. However, what is interesting is that the decline appears more pronounced in the placebo group than in the group given the actual drug. Also, we found that bilirubin level was higher for more severe disease stages.

Lastly, we found that the quantitative measurements may not be indicative of the severity of cirrhosis. It appears that some stage 3 and 4 patients share similar characteristics with those in earlier stages.

Next Steps

Overall, more study has to be done in this area. Important pieces of information were not available in this dataset, including the cause of the cirrhosis. For example, cirrhosis caused by alcohol abuse versus hepatitis may display very different patient attributes. Also, more study is needed to identify clearer distinctions between the disease groups, particularly stage 3 and 4 patients, as they appear to share very similar characteristics. Perhaps expanding the sample pool would be beneficial.