Employee Health Risk Analysis and Cholesterol Modeling
Project Summary
This notebook investigates which health and demographic factors most strongly relate to total cholesterol in an employee wellness dataset. The work emphasizes structured EDA, anomaly correction, feature engineering, and a Random Forest feature-importance pass to rank the main drivers of cholesterol variation.
Tech Stack
- pandas
- NumPy
- scikit-learn
- seaborn
- Matplotlib
Dataset Scope
- 1,336 employee records
- 15 original variables spanning demographics, blood pressure, BMI, glucose, triglycerides, fat metrics, and employment duration
Problem Framing
The objective is to understand which measurable health indicators are most associated with total cholesterol. Instead of jumping directly into a model, the notebook first performs extensive validation and correction so downstream analysis is not distorted by obvious data-entry issues.
Objectives
Our primary objective is to characterize how the measured health factors relate to total cholesterol (CT) levels. The analysis proceeds through the following steps:
- Descriptive Statistics: Begin by exploring the characteristics of each variable through descriptive statistics, providing a comprehensive understanding of the data distribution and identifying potential patterns.
- Correlation Analysis: Examine the relationships between variables by calculating correlation coefficients. This will shed light on the strength and direction of associations between CT levels and other health indicators.
- Modeling CT Levels: Employ machine learning techniques to construct predictive models that estimate CT levels based on the other measured variables. This will enable us to quantify the impact of each factor on CT levels.
- Identifying Key Determinants: Uncover the most influential factors affecting CT levels by analyzing the results of the predictive models. This will provide valuable insights into targeted interventions to improve employee health.
- Interpretation and Implications: Interpret the findings from the modeling process, drawing meaningful conclusions and highlighting the practical implications for employee health management.
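The correlation step listed above could look roughly like the sketch below. It assumes the column names shown in the data preview; the helper `correlations_with_target` and the synthetic demo frame are hypothetical and exist only for illustration, not taken from the notebook.

```python
import numpy as np
import pandas as pd

TARGET = "Cholesterol Total (mg/dL)"  # target column name as it appears in the dataset


def correlations_with_target(df: pd.DataFrame, target: str = TARGET) -> pd.Series:
    """Pearson correlation of each numeric column with the target,
    ordered from strongest to weakest absolute association."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic demo data (NOT the real dataset): triglycerides drive the target.
rng = np.random.default_rng(0)
n = 200
trig = rng.normal(110, 40, n)
demo = pd.DataFrame({
    "Usia": rng.integers(21, 40, n).astype(float),
    "Trigliserida (mg/dL)": trig,
    TARGET: 150 + 0.3 * trig + rng.normal(0, 5, n),
})

ranked = correlations_with_target(demo)
print(ranked)
```

Sorting by absolute value keeps strong negative associations near the top instead of burying them at the bottom of the list.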
| | Responden | Jenis Kelamin | Usia | Tekanan darah (S) | Tekanan darah (D) | Tinggi badan (cm) | Berat badan (kg) | IMT (kg/m2) | Lingkar perut (cm) | Glukosa Puasa (mg/dL) | Cholesterol Total (mg/dL) | Trigliserida (mg/dL) | Fat | Visceral Fat | Masa Kerja | Tempat lahir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | 19.0 | 126.0 | 88.0 | 172.5 | 49.5 | 16.53 | 66.0 | 84.0 | 187.0 | 99.0 | 26.4 | 6.0 | 0.97 | Purworejo |
| 1 | 2 | M | 19.0 | 120.0 | 80.0 | 158.0 | 53.6 | 21.50 | 71.0 | 84.0 | 187.0 | 99.0 | 26.4 | 6.0 | 0.60 | Bogor |
| 2 | 3 | M | 19.0 | 120.0 | 80.0 | 170.0 | 59.5 | 20.59 | 80.0 | 80.0 | 187.0 | 99.0 | 26.4 | 6.0 | 1.37 | bandung |
| 3 | 4 | F | 19.0 | 100.0 | 70.0 | 149.0 | 45.1 | 20.31 | 62.0 | 81.0 | 187.0 | 99.0 | 30.5 | 3.5 | 1.00 | Jakarta |
| 4 | 5 | M | 19.0 | 110.0 | 70.0 | 171.6 | 62.4 | 21.19 | 78.0 | 84.0 | 187.0 | 99.0 | 26.4 | 6.0 | 4.00 | Teluk Betung |
| | Responden | Usia | Tekanan darah (S) | Tekanan darah (D) | Tinggi badan (cm) | Berat badan (kg) | IMT (kg/m2) | Lingkar perut (cm) | Glukosa Puasa (mg/dL) | Cholesterol Total (mg/dL) | Trigliserida (mg/dL) | Fat | Visceral Fat | Masa Kerja |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 |
| mean | 670.000000 | 28.597461 | 113.147872 | 74.009709 | 164.940851 | 64.620500 | 23.693727 | 80.441972 | 84.571322 | 187.995519 | 106.982823 | 26.203510 | 6.231367 | 6.401837 |
| std | 386.680316 | 4.767230 | 10.164592 | 7.718752 | 7.386617 | 12.799096 | 4.021585 | 10.688215 | 11.522057 | 21.104834 | 44.143456 | 3.678467 | 2.431923 | 4.554438 |
| min | 1.000000 | 19.000000 | 80.000000 | 58.000000 | 138.500000 | 38.500000 | 14.850000 | 54.000000 | 65.000000 | 103.000000 | 34.000000 | 5.800000 | 0.500000 | 0.000000 |
| 25% | 335.500000 | 25.000000 | 110.000000 | 70.000000 | 160.000000 | 55.275000 | 20.855000 | 72.000000 | 84.000000 | 187.000000 | 99.000000 | 26.400000 | 6.000000 | 4.000000 |
| 50% | 670.000000 | 28.000000 | 110.000000 | 72.000000 | 165.000000 | 62.500000 | 23.200000 | 80.000000 | 84.000000 | 187.000000 | 99.000000 | 26.400000 | 6.000000 | 6.000000 |
| 75% | 1004.500000 | 31.000000 | 120.000000 | 80.000000 | 170.000000 | 71.775000 | 26.000000 | 87.000000 | 84.000000 | 187.000000 | 99.000000 | 26.400000 | 6.000000 | 8.000000 |
| max | 1339.000000 | 39.000000 | 170.000000 | 100.000000 | 187.500000 | 139.750000 | 44.100000 | 128.000000 | 321.000000 | 308.000000 | 634.000000 | 40.900000 | 23.000000 | 31.000000 |
Data Quality Review
The notebook audits multiple columns for implausible values and inconsistent relationships, then applies targeted corrections before any feature analysis.
- Detected age anomalies below the expected 21-65 range, then normalized those records for consistency.
- Flagged BMI records where calculated and stored IMT diverged materially, then imputed the inconsistent cases with recalculated values.
- Reviewed unrealistic combinations of age and employment duration, then corrected rows where tenure values likely reflected entry mistakes.
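A minimal sketch of the three checks above, under stated assumptions: BMI is recomputed as weight (kg) divided by height (m) squared, `Masa Kerja` is taken to be in years, and the tolerance `bmi_tol` and the earliest plausible starting age `min_start_age` are hypothetical thresholds chosen for illustration. The source does not say how flagged ages or tenures were corrected, so this version only flags them.

```python
import pandas as pd


def audit_and_fix(df, min_age=21, max_age=65, bmi_tol=1.0, min_start_age=15):
    """Flag suspicious rows and re-impute inconsistent BMI values.

    bmi_tol and min_start_age are illustrative assumptions, not values
    taken from the notebook.
    """
    out = df.copy()

    # 1. Ages outside the expected range (flag only).
    out["age_out_of_range"] = ~out["Usia"].between(min_age, max_age)

    # 2. Stored BMI vs. BMI recomputed from height and weight.
    height_m = out["Tinggi badan (cm)"] / 100.0
    bmi_calc = out["Berat badan (kg)"] / height_m ** 2
    mismatch = (bmi_calc - out["IMT (kg/m2)"]).abs() > bmi_tol
    out["bmi_mismatch"] = mismatch
    out.loc[mismatch, "IMT (kg/m2)"] = bmi_calc[mismatch].round(2)

    # 3. Tenure longer than the years available since min_start_age (flag only).
    out["tenure_implausible"] = out["Masa Kerja"] > (out["Usia"] - min_start_age)
    return out


# Made-up rows: one trips the age check, one the BMI check, one the tenure check.
demo = pd.DataFrame({
    "Usia": [19.0, 30.0, 25.0],
    "Tinggi badan (cm)": [172.5, 170.0, 160.0],
    "Berat badan (kg)": [49.5, 59.5, 64.0],
    "IMT (kg/m2)": [16.53, 30.0, 25.0],
    "Masa Kerja": [0.97, 6.0, 31.0],
})
checked = audit_and_fix(demo)
print(checked[["age_out_of_range", "bmi_mismatch", "tenure_implausible", "IMT (kg/m2)"]])
```

Keeping the flag columns alongside the corrected values makes it possible to review how many rows each rule touched before any downstream analysis.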
Feature Engineering
After cleaning, the notebook derives binned demographic features, BMI-category indicators, geospatial birthplace coordinates, and interaction terms such as BMI-blood pressure products and age-triglyceride combinations.
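The binning and interaction features described above might be derived along these lines. The age bin edges, the standard WHO BMI cut-offs (18.5, 25, 30), and the output column names are assumptions for illustration; the notebook's actual choices are not stated. Birthplace geocoding is left out because the source does not say which coordinate lookup was used.

```python
import pandas as pd


def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add binned, categorical, and interaction features (illustrative choices)."""
    out = df.copy()

    # Age bins: edges are hypothetical, right-inclusive intervals.
    out["age_bin"] = pd.cut(
        out["Usia"],
        bins=[0, 25, 30, 35, 120],
        labels=["<=25", "26-30", "31-35", ">35"],
    )

    # BMI categories using WHO cut-offs, left-inclusive intervals.
    out["bmi_category"] = pd.cut(
        out["IMT (kg/m2)"],
        bins=[0, 18.5, 25, 30, float("inf")],
        right=False,
        labels=["underweight", "normal", "overweight", "obese"],
    )
    # One indicator column per BMI category.
    out = pd.concat([out, pd.get_dummies(out["bmi_category"], prefix="bmi")], axis=1)

    # Interaction terms.
    out["bmi_x_systolic"] = out["IMT (kg/m2)"] * out["Tekanan darah (S)"]
    out["age_x_triglycerides"] = out["Usia"] * out["Trigliserida (mg/dL)"]
    return out


# Made-up rows for demonstration only.
demo = pd.DataFrame({
    "Usia": [19.0, 32.0],
    "IMT (kg/m2)": [16.53, 27.0],
    "Tekanan darah (S)": [126.0, 120.0],
    "Trigliserida (mg/dL)": [99.0, 150.0],
})
feats = engineer_features(demo)
print(feats.T)
```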
Modeling and Importance Analysis
A Random Forest regressor is used as a compact, interpretable baseline to estimate cholesterol and surface the most influential engineered features.
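A feature-importance pass of this kind can be sketched with scikit-learn as follows. The hyperparameters, the 80/20 split, the helper name `rank_features`, and the synthetic data are all assumptions made for illustration; the notebook's own settings are not given in this summary.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split


def rank_features(X: pd.DataFrame, y: pd.Series, n_estimators: int = 100, random_state: int = 0):
    """Fit a Random Forest and return (importances sorted descending, held-out R^2)."""
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.2, random_state=random_state
    )
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X_train, y_train)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False), model.score(X_test, y_test)


# Synthetic demo data (NOT the real dataset): only triglycerides carry signal.
rng = np.random.default_rng(42)
n = 300
X_demo = pd.DataFrame({
    "Trigliserida (mg/dL)": rng.normal(110, 40, n),
    "Usia": rng.integers(21, 40, n).astype(float),
    "noise": rng.normal(0, 1, n),
})
y_demo = 150 + 0.3 * X_demo["Trigliserida (mg/dL)"] + rng.normal(0, 5, n)

importances, r2 = rank_features(X_demo, y_demo)
print(importances)
print(f"held-out R^2: {r2:.2f}")
```

`feature_importances_` reports impurity-based importances, which can favor continuous or high-cardinality features; scikit-learn's `permutation_importance` on held-out data is a common cross-check.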
Visual Exploration
The notebook closes with distribution plots and relationship views that connect age, BMI, and cholesterol patterns back to the cleaned feature space.