This project analyzes clinical data related to Inflammatory Bowel Disease (IBD). It explores relationships between demographics, disease phenotype, treatment types, and outcomes using statistical analysis, machine learning models, and visualizations.
The project involves:
- Collecting and preprocessing IBD clinical data.
- Performing Exploratory Data Analysis (EDA) to understand the data.
- Applying statistical tests (t-tests, ANOVA) and logistic regression for analysis.
- Creating interactive visualizations to analyze the results.
- Building a web-based dashboard for users to interact with the data.
The dataset consists of clinical trial data related to IBD and may come from either a real or a simulated clinical data source. Possible sources include:
- Clinical trials related to IBD (e.g., clinicaltrials.gov).
- Open datasets like IBD Registry or datasets available through academic institutions.
- Simulated IBD datasets that mimic clinical trial results.
The dataset contains the following key features:
- Demographic Information: Age, gender, ethnicity.
- Disease Phenotype: Disease severity, location, complications.
- Treatment Information: Type of treatment (medication, surgery), treatment response.
- Clinical Outcomes: Success or failure of the treatment, other clinical measures.
The preprocessing steps included:
- Handling Missing Values: Filling missing numerical values with the median and dropping rows with too many missing values.
- Feature Engineering: Adding new features like age groups and disease severity classification.
- Data Transformation: Converting categorical features into numerical (e.g., encoding gender as 0 or 1).
We used one-hot encoding to transform categorical variables like treatment type and age group for logistic regression modeling.
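The preprocessing steps above can be sketched with pandas as follows. This is a minimal illustration, not the project's actual code: the toy data, the column names (`age`, `gender`, `severity_score`, `treatment_type`), the missing-value threshold, the age bins, and the severity cutoff are all assumptions made for the example.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are assumptions for illustration.
df = pd.DataFrame({
    "age": [23, 47, np.nan, 65, 34, np.nan],
    "gender": ["F", "M", "F", None, "M", None],
    "severity_score": [2.0, 5.5, 7.0, np.nan, 3.5, np.nan],
    "treatment_type": ["medication", "surgery", "medication",
                       "medication", "surgery", None],
})

# Drop rows with too many missing values
# (assumed rule: keep rows with at least 3 non-missing fields).
df = df.dropna(thresh=3).copy()

# Fill remaining missing numerical values with the column median.
num_cols = ["age", "severity_score"]
df[num_cols] = df[num_cols].fillna(df[num_cols].median())

# Feature engineering: age groups (assumed bin edges, left-inclusive)
# and a binary severity class (assumed cutoff of 5).
df["age_group"] = pd.cut(
    df["age"], bins=[0, 30, 50, 120],
    labels=["<30", "30-49", "50+"], right=False,
)
df["severe"] = (df["severity_score"] >= 5).astype(int)

# Data transformation: encode gender as 0 or 1.
df["gender"] = df["gender"].map({"F": 0, "M": 1})

# One-hot encode treatment type and age group for logistic regression.
encoded = pd.get_dummies(df, columns=["treatment_type", "age_group"], dtype=int)
print(encoded.head())
```

Dropping sparse rows before imputing keeps rows that are mostly empty from influencing the medians; the opposite order is also possible if that effect is not a concern.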
- Age Distribution: A histogram was used to visualize the distribution of patients’ ages.
- Gender Distribution: Bar charts were used to visualize the gender distribution across different age groups.
- A boxplot was used to show the relationship between age groups and disease severity.
- A heatmap was created to visualize correlations between various numerical features such as age, disease severity, and treatment success.
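The four plots listed above could be produced along these lines. The sketch assumes matplotlib as the plotting library and uses randomly generated data with hypothetical column names; the plotting library and data used in the project itself are not specified here.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the script runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic data for illustration only; column names are assumptions.
rng = np.random.default_rng(0)
n = 200
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "gender": rng.choice(["F", "M"], size=n),
    "severity_score": rng.uniform(0, 10, size=n),
    "treatment_success": rng.integers(0, 2, size=n),
})
labels = ["<30", "30-49", "50+"]
df["age_group"] = pd.cut(df["age"], bins=[0, 30, 50, 120], labels=labels, right=False)

fig, axes = plt.subplots(2, 2, figsize=(11, 8))

# Histogram of patient ages.
axes[0, 0].hist(df["age"], bins=15)
axes[0, 0].set(title="Age distribution", xlabel="Age", ylabel="Patients")

# Bar chart of gender counts within each age group.
gender_counts = pd.crosstab(df["age_group"], df["gender"])
gender_counts.plot(kind="bar", ax=axes[0, 1], rot=0)
axes[0, 1].set(title="Gender by age group", xlabel="Age group", ylabel="Patients")

# Boxplot of disease severity per age group.
groups = [df.loc[df["age_group"] == lab, "severity_score"] for lab in labels]
axes[1, 0].boxplot(groups)
axes[1, 0].set_xticklabels(labels)
axes[1, 0].set(title="Severity by age group", xlabel="Age group", ylabel="Severity score")

# Heatmap of correlations between numerical features.
corr = df[["age", "severity_score", "treatment_success"]].corr()
im = axes[1, 1].imshow(corr.to_numpy(), vmin=-1, vmax=1, cmap="coolwarm")
axes[1, 1].set_xticks(range(len(corr.columns)))
axes[1, 1].set_xticklabels(corr.columns, rotation=45, ha="right")
axes[1, 1].set_yticks(range(len(corr.columns)))
axes[1, 1].set_yticklabels(corr.columns)
axes[1, 1].set_title("Correlation heatmap")
fig.colorbar(im, ax=axes[1, 1])

fig.tight_layout()
fig.savefig("eda_overview.png")  # hypothetical output file name
```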
We performed a one-way ANOVA to compare disease severity across age groups. The results showed significant differences in disease severity between groups, indicating that age group is associated with disease severity.
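As a rough sketch of this test, the comparison can be run with `scipy.stats.f_oneway`. The data below is synthetic and built so that the group means differ; the column names, group labels, and the 0.05 significance level are assumptions, and the output does not reproduce the project's results.

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic data for illustration: group means differ by construction.
rng = np.random.default_rng(42)
df = pd.DataFrame({
    "age_group": np.repeat(["<30", "30-49", "50+"], 50),
    "severity_score": np.concatenate([
        rng.normal(4.0, 1.0, 50),
        rng.normal(5.0, 1.0, 50),
        rng.normal(6.0, 1.0, 50),
    ]),
})

# One sample of severity scores per age group.
samples = [g["severity_score"].to_numpy() for _, g in df.groupby("age_group")]

# One-way ANOVA: the null hypothesis is that all group means are equal.
f_stat, p_value = stats.f_oneway(*samples)
print(f"F = {f_stat:.2f}, p = {p_value:.4g}")

if p_value < 0.05:  # assumed significance level
    print("Mean severity differs between at least two age groups.")
```

A significant result only says that at least one group mean differs; a post-hoc comparison would be needed to say which groups differ from each other.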
We applied logistic regression to predict the likelihood of severe disease based on variables like age, gender, and treatment type.
- The model's accuracy, precision, recall, and confusion matrix were used to evaluate its performance.
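A minimal sketch of this modeling step with scikit-learn is shown below. The data is synthetic, and the column names, the label-generating rule, the train/test split, and the hyperparameters are assumptions chosen for the example rather than the project's actual settings.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import (accuracy_score, confusion_matrix,
                             precision_score, recall_score)
from sklearn.model_selection import train_test_split

# Synthetic data for illustration only; column names are assumptions.
rng = np.random.default_rng(0)
n = 400
df = pd.DataFrame({
    "age": rng.integers(18, 80, size=n),
    "gender": rng.integers(0, 2, size=n),  # already encoded as 0 or 1
    "treatment_type": rng.choice(["medication", "surgery"], size=n),
})
# Assumed rule that makes severe disease more likely with age and surgery.
logit = 0.05 * (df["age"] - 50) + 0.8 * (df["treatment_type"] == "surgery")
df["severe"] = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# One-hot encode treatment type; age and gender are used as-is.
X = pd.get_dummies(df[["age", "gender", "treatment_type"]],
                   columns=["treatment_type"], drop_first=True, dtype=int)
y = df["severe"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0, stratify=y
)

model = LogisticRegression(max_iter=1000)
model.fit(X_train, y_train)
y_pred = model.predict(X_test)

# Evaluate on the held-out test set.
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred, zero_division=0)
recall = recall_score(y_test, y_pred, zero_division=0)
cm = confusion_matrix(y_test, y_pred)

print(f"accuracy={accuracy:.2f} precision={precision:.2f} recall={recall:.2f}")
print(cm)  # rows: true class, columns: predicted class
```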
- An interactive boxplot was created to visualize disease severity across different age groups.
- The graph lets users filter the data by age group, showing how disease severity varies from one age group to another.
A Dash-based web app was developed to allow users to interact with the data. Key features include:
- Age Group Selector: A dropdown for selecting an age group to explore disease severity.
- Interactive Graphs: A dynamic boxplot that updates based on user selection.
We built a logistic regression model to predict whether a patient’s disease is severe based on certain features:
- Features Used: Age, gender, treatment type.
- Evaluation Metrics: Accuracy, precision, recall, and confusion matrix were used to evaluate the model's performance.
We used Dash, a Python web framework for building interactive web applications. The app allows users to:
- Select age groups to filter data.
- View interactive plots based on selected filters.
- Understand the relationships between demographic factors and disease severity.
- Run the script with `python app.py`.
- Visit http://127.0.0.1:8050/ in your browser to interact with the dashboard.
This project explores how demographic factors relate to disease severity and clinical outcomes in IBD patients. It also demonstrates how to use Python for data preprocessing, statistical analysis, and visualization in healthcare research.
- Expand Data: Include more diverse data sources and larger datasets for deeper insights.
- Add More Models: Integrate additional machine learning models such as random forests or support vector machines (SVMs) for predictive analysis.
- Improve Dashboard: Enhance the web app with more interactivity and additional insights.