This project uses machine learning to predict a patient's 10-year risk of coronary heart disease (CHD). By analyzing health indicators and lifestyle factors, the model can help healthcare professionals identify high-risk individuals and put preventive measures in place.
- Python
- Data Analysis: Using Python libraries such as Pandas and NumPy for data manipulation and analysis.
- Data Visualization: Employing Matplotlib and Seaborn for creating insightful visualizations of the dataset.
- Machine Learning: Implementing various classification algorithms including Logistic Regression, K-Nearest Neighbors, Decision Trees, and Random Forests using Scikit-learn.
- Feature Engineering: Performing data preprocessing tasks such as handling missing values, encoding categorical variables, and scaling numerical features.
- Model Evaluation: Utilizing cross-validation techniques and various performance metrics (accuracy, F1-score, ROC-AUC) to assess model performance.
- Hyperparameter Tuning: Applying Grid Search to optimize model parameters for improved performance.
- Statistical Analysis: Using libraries like StatsModels for in-depth statistical modeling.
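
The evaluation and tuning steps listed above can be sketched roughly as follows. This is a minimal illustration, not the project's actual code: the data is a synthetic stand-in generated with `make_classification`, and the hyperparameter grid, fold count, and model settings are assumptions chosen only to keep the example small.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, cross_validate
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced binary data standing in for the CHD dataset (assumption).
X, y = make_classification(
    n_samples=600, n_features=10, n_informative=5,
    weights=[0.85, 0.15], random_state=0,
)

# Stratified folds keep the class ratio similar in every split.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scoring = ["accuracy", "f1", "roc_auc"]

# The four classifier families named above; scaling is wrapped in a pipeline
# so it is re-fit inside each training fold.
models = {
    "logistic_regression": make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000)),
    "knn": make_pipeline(StandardScaler(), KNeighborsClassifier()),
    "decision_tree": DecisionTreeClassifier(random_state=0),
    "random_forest": RandomForestClassifier(n_estimators=100, random_state=0),
}

# Mean cross-validated score per model and metric.
results = {}
for name, model in models.items():
    scores = cross_validate(model, X, y, cv=cv, scoring=scoring)
    results[name] = {m: float(np.mean(scores[f"test_{m}"])) for m in scoring}

# Grid search over a small, illustrative parameter grid (assumption).
grid = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [3, None]},
    scoring="roc_auc",
    cv=cv,
)
grid.fit(X, y)
best_params = grid.best_params_

for name, metrics in results.items():
    print(name, {m: round(v, 3) for m, v in metrics.items()})
print("best random forest parameters:", best_params)
```

On imbalanced targets, accuracy alone can look good for a model that rarely predicts the positive class, which is one reason to report F1 and ROC-AUC alongside it.
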
- Early Risk Identification: By accurately predicting the 10-year risk of coronary heart disease, healthcare providers can identify high-risk patients early, enabling timely interventions and preventive care.
- Resource Allocation: Healthcare systems can use these predictions to allocate resources more efficiently, focusing on patients with higher risk profiles.
- Personalized Care Plans: The model's insights can help in developing personalized treatment and lifestyle modification plans for patients based on their risk factors.
- Cost Reduction: Early intervention and prevention strategies based on accurate risk prediction can potentially reduce the long-term healthcare costs associated with treating advanced heart disease.
The dataset used in this project describes each patient with 17 columns: an identifier, 15 predictor features, and the target variable. The columns are:
- id: Unique identifier for each patient
- age: Age of the patient
- education: Education level (1-4)
- sex: Gender (M/F)
- is_smoking: Smoking status (YES/NO)
- cigsPerDay: Number of cigarettes smoked per day
- BPMeds: Blood pressure medication usage (0/1)
- prevalentStroke: History of stroke (0/1)
- prevalentHyp: Presence of hypertension (0/1)
- diabetes: Presence of diabetes (0/1)
- totChol: Total cholesterol level
- sysBP: Systolic blood pressure
- diaBP: Diastolic blood pressure
- BMI: Body Mass Index
- heartRate: Heart rate
- glucose: Glucose level
- TenYearCHD: 10-year risk of coronary heart disease (target variable, 0/1)
The dataset combines demographic information, lifestyle factors, medical history, and clinical measurements, which together form the input for the predictive model.
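
As a rough sketch of how columns like these might be prepared for modeling, the example below encodes the two text columns, imputes missing values, and scales the continuous measurements. The column names come from the list above, but the four rows are invented for illustration, and the split into continuous and discrete columns and the choice of imputation strategies are assumptions rather than the project's documented approach.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Invented rows for illustration only; not taken from the real dataset.
df = pd.DataFrame({
    "id": [0, 1, 2, 3],
    "age": [39, 46, 48, 61],
    "education": [4.0, 2.0, np.nan, 3.0],
    "sex": ["M", "F", "M", "F"],
    "is_smoking": ["NO", "NO", "YES", "YES"],
    "cigsPerDay": [0.0, 0.0, 20.0, np.nan],
    "BPMeds": [0.0, 0.0, np.nan, 0.0],
    "prevalentStroke": [0, 0, 0, 0],
    "prevalentHyp": [0, 0, 0, 1],
    "diabetes": [0, 0, 0, 0],
    "totChol": [195.0, 250.0, 245.0, np.nan],
    "sysBP": [106.0, 121.0, 127.5, 150.0],
    "diaBP": [70.0, 81.0, 80.0, 95.0],
    "BMI": [26.97, 28.73, np.nan, 28.58],
    "heartRate": [80.0, 95.0, 75.0, 65.0],
    "glucose": [77.0, 76.0, 70.0, np.nan],
    "TenYearCHD": [0, 0, 0, 1],
})

# Drop the identifier (no predictive meaning) and separate the target.
X = df.drop(columns=["id", "TenYearCHD"])
y = df["TenYearCHD"]

# Encode the two text columns as 0/1.
X["sex"] = X["sex"].map({"F": 0, "M": 1})
X["is_smoking"] = X["is_smoking"].map({"NO": 0, "YES": 1})

# Assumed grouping: continuous measurements vs. ordinal/binary columns.
continuous = ["age", "cigsPerDay", "totChol", "sysBP", "diaBP", "BMI", "heartRate", "glucose"]
discrete = ["education", "sex", "is_smoking", "BPMeds", "prevalentStroke", "prevalentHyp", "diabetes"]

preprocess = ColumnTransformer([
    # Median imputation followed by standardization for continuous columns.
    ("continuous", Pipeline([
        ("impute", SimpleImputer(strategy="median")),
        ("scale", StandardScaler()),
    ]), continuous),
    # Most-frequent imputation keeps ordinal and binary columns on valid levels.
    ("discrete", SimpleImputer(strategy="most_frequent"), discrete),
])

X_prepared = preprocess.fit_transform(X)
print(X_prepared.shape)
```

When training a model, the transformer would normally be fit on the training split only (for example by placing it inside a Scikit-learn `Pipeline` with the classifier), so that imputation and scaling statistics do not leak from the test data.
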
- Exploratory Data Analysis (EDA): Thorough analysis to uncover patterns and relationships in the dataset.
- Feature Engineering: Advanced techniques to capture complex interactions between health indicators.
- Machine Learning Models: Implementation of various models to predict CHD risk.
- Model Evaluation and Interpretation: Providing insights into the most important risk factors and the model’s performance.
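
One possible way to approach the interpretation step is permutation importance, sketched below. This is a swapped-in illustration, not necessarily the method used in the project. The data is synthetic: the feature names are borrowed from the dataset, but the effect sizes (`2.0` for `age`, `0.5` for `sysBP`) are arbitrary assumptions with no clinical meaning.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.inspection import permutation_importance
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)

# Synthetic features; the names are borrowed from the dataset for readability only.
feature_names = ["age", "sysBP", "BMI", "heartRate"]
X = rng.normal(size=(800, 4))

# Arbitrary, made-up effect sizes: the first column drives the outcome most,
# the second contributes weakly, the last two are pure noise.
logit = 2.0 * X[:, 0] + 0.5 * X[:, 1]
y = (logit + rng.logistic(size=800) > 0).astype(int)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

model = RandomForestClassifier(n_estimators=200, random_state=0)
model.fit(X_train, y_train)

# Held-out discrimination, measured with ROC-AUC.
test_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# Permutation importance: the drop in ROC-AUC when one column is shuffled.
perm = permutation_importance(
    model, X_test, y_test, scoring="roc_auc", n_repeats=10, random_state=0
)
ranking = sorted(
    zip(feature_names, perm.importances_mean), key=lambda pair: pair[1], reverse=True
)

print(f"test ROC-AUC: {test_auc:.3f}")
for name, score in ranking:
    print(f"{name}: {score:.3f}")
```

Permutation importance is computed on held-out data and works with any fitted classifier, so the same ranking code could be reused across the different model types. Correlated features (such as systolic and diastolic blood pressure) can share importance between them, which is worth keeping in mind when reading the ranking.
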
Thank you. Let’s keep learning and growing together!