Employee Health Risk Analysis and Cholesterol Modeling

Project Summary

This notebook investigates which health and demographic factors most strongly relate to total cholesterol in an employee wellness dataset. The work emphasizes structured EDA, anomaly correction, feature engineering, and a Random Forest feature-importance pass to rank the main drivers of cholesterol variation.

Tech Stack

  • pandas
  • NumPy
  • scikit-learn
  • seaborn
  • Matplotlib

Dataset Scope

  • 1,336 employee records
  • 15 original variables spanning demographics, blood pressure, BMI, glucose, triglycerides, fat metrics, and employment duration

Problem Framing

The objective is to understand which measurable health indicators are most associated with total cholesterol. Instead of jumping directly into a model, the notebook first performs extensive validation and correction so downstream analysis is not distorted by obvious data-entry issues.

Objectives

The primary objective is to understand how the measured health factors relate to total cholesterol (CT) levels. The analysis proceeds in five steps:

  1. Descriptive Statistics: Summarize each variable (count, mean, spread, and range) to understand its distribution and spot implausible values.

  2. Correlation Analysis: Compute correlation coefficients between CT and the other health indicators to measure the strength and direction of each association.

  3. Modeling CT Levels: Fit machine learning models that estimate CT from the other measured variables, so that each factor's contribution can be quantified.

  4. Identifying Key Determinants: Rank the factors by their importance in the fitted models to find those most strongly associated with CT. This ranking can inform targeted interventions to improve employee health.

  5. Interpretation and Implications: Interpret the modeling results and draw out their practical implications for employee health management.
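Steps 1 and 2 can be sketched in a few lines of pandas. This is a minimal illustration on a tiny synthetic frame, not the notebook's actual code; the helper name `rank_correlations` is hypothetical, and only the column names are taken from the dataset.

```python
import pandas as pd

TARGET = "Cholesterol Total (mg/dL)"  # CT column name as it appears in the dataset


def rank_correlations(df: pd.DataFrame, target: str = TARGET) -> pd.Series:
    """Pearson correlation of every numeric column with the target,
    ordered from strongest to weakest absolute association."""
    numeric = df.select_dtypes("number")
    corr = numeric.corr()[target].drop(target)
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic rows for illustration only (not real records).
demo = pd.DataFrame({
    "Usia": [25, 30, 35, 40, 45],
    "IMT (kg/m2)": [21.0, 23.5, 22.0, 27.0, 29.5],
    TARGET: [170.0, 185.0, 180.0, 210.0, 225.0],
})

print(demo.describe())          # step 1: descriptive statistics
ranking = rank_correlations(demo)
print(ranking)                  # step 2: correlations with CT
```

Sorting by absolute value keeps strong negative associations near the top instead of burying them at the end of the list.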

In [13]
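|   | Responden | Jenis Kelamin | Usia | Tekanan darah (S) | Tekanan darah (D) | Tinggi badan (cm) | Berat badan (kg) | IMT (kg/m2) | Lingkar perut (cm) | Glukosa Puasa (mg/dL) | Cholesterol Total (mg/dL) | Trigliserida (mg/dL) | Fat | Visceral Fat | Masa Kerja | Tempat lahir |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | M | 19.0 | 126.0 | 88.0 | 172.5 | 49.5 | 16.53 | 66.0 | 84.0 | 187.0 | 99.0 | 26.4 | 6.0 | 0.97 | Purworejo |
| 1 | 2 | M | 19.0 | 120.0 | 80.0 | 158.0 | 53.6 | 21.50 | 71.0 | 84.0 | 187.0 | 99.0 | 26.4 | 6.0 | 0.60 | Bogor |
| 2 | 3 | M | 19.0 | 120.0 | 80.0 | 170.0 | 59.5 | 20.59 | 80.0 | 80.0 | 187.0 | 99.0 | 26.4 | 6.0 | 1.37 | bandung |
| 3 | 4 | F | 19.0 | 100.0 | 70.0 | 149.0 | 45.1 | 20.31 | 62.0 | 81.0 | 187.0 | 99.0 | 30.5 | 3.5 | 1.00 | Jakarta |
| 4 | 5 | M | 19.0 | 110.0 | 70.0 | 171.6 | 62.4 | 21.19 | 78.0 | 84.0 | 187.0 | 99.0 | 26.4 | 6.0 | 4.00 | Teluk Betung |

Column names are in Indonesian: Responden = respondent ID, Jenis Kelamin = sex, Usia = age, Tekanan darah (S)/(D) = systolic/diastolic blood pressure, Tinggi badan = height, Berat badan = weight, IMT = BMI, Lingkar perut = waist circumference, Glukosa Puasa = fasting glucose, Trigliserida = triglycerides, Masa Kerja = length of employment, Tempat lahir = birthplace.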
In [15]
|   | Responden | Usia | Tekanan darah (S) | Tekanan darah (D) | Tinggi badan (cm) | Berat badan (kg) | IMT (kg/m2) | Lingkar perut (cm) | Glukosa Puasa (mg/dL) | Cholesterol Total (mg/dL) | Trigliserida (mg/dL) | Fat | Visceral Fat | Masa Kerja |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 | 1339.000000 |
| mean | 670.000000 | 28.597461 | 113.147872 | 74.009709 | 164.940851 | 64.620500 | 23.693727 | 80.441972 | 84.571322 | 187.995519 | 106.982823 | 26.203510 | 6.231367 | 6.401837 |
| std | 386.680316 | 4.767230 | 10.164592 | 7.718752 | 7.386617 | 12.799096 | 4.021585 | 10.688215 | 11.522057 | 21.104834 | 44.143456 | 3.678467 | 2.431923 | 4.554438 |
| min | 1.000000 | 19.000000 | 80.000000 | 58.000000 | 138.500000 | 38.500000 | 14.850000 | 54.000000 | 65.000000 | 103.000000 | 34.000000 | 5.800000 | 0.500000 | 0.000000 |
| 25% | 335.500000 | 25.000000 | 110.000000 | 70.000000 | 160.000000 | 55.275000 | 20.855000 | 72.000000 | 84.000000 | 187.000000 | 99.000000 | 26.400000 | 6.000000 | 4.000000 |
| 50% | 670.000000 | 28.000000 | 110.000000 | 72.000000 | 165.000000 | 62.500000 | 23.200000 | 80.000000 | 84.000000 | 187.000000 | 99.000000 | 26.400000 | 6.000000 | 6.000000 |
| 75% | 1004.500000 | 31.000000 | 120.000000 | 80.000000 | 170.000000 | 71.775000 | 26.000000 | 87.000000 | 84.000000 | 187.000000 | 99.000000 | 26.400000 | 6.000000 | 8.000000 |
| max | 1339.000000 | 39.000000 | 170.000000 | 100.000000 | 187.500000 | 139.750000 | 44.100000 | 128.000000 | 321.000000 | 308.000000 | 634.000000 | 40.900000 | 23.000000 | 31.000000 |

Data Quality Review

The notebook audits multiple columns for implausible values and inconsistent relationships, then applies targeted corrections before any feature analysis.

In [22]
Detected age anomalies below the expected 21-65 range, then normalized those records for consistency.
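A range check of this kind might look like the sketch below. The 21-65 bounds come from the description above; the notebook does not say how the flagged records were normalized, so the clipping step here is purely an assumption, and both function names are hypothetical.

```python
import pandas as pd

AGE_COL = "Usia"
AGE_MIN, AGE_MAX = 21, 65  # expected range stated in the notebook


def flag_age_anomalies(df: pd.DataFrame, low: int = AGE_MIN, high: int = AGE_MAX) -> pd.Series:
    """True for rows whose age is outside [low, high]; missing ages are flagged too."""
    return ~df[AGE_COL].between(low, high)


def clip_ages(df: pd.DataFrame, low: int = AGE_MIN, high: int = AGE_MAX) -> pd.DataFrame:
    """ASSUMED correction: pull out-of-range ages to the nearest bound.
    The original notebook may have used a different rule."""
    out = df.copy()
    out[AGE_COL] = out[AGE_COL].clip(lower=low, upper=high)
    return out


# Synthetic ages for illustration only.
ages = pd.DataFrame({AGE_COL: [19.0, 21.0, 39.0, 70.0]})
age_mask = flag_age_anomalies(ages)
ages_clipped = clip_ages(ages)
print(age_mask.tolist())
print(ages_clipped[AGE_COL].tolist())
```

Keeping detection and correction in separate functions makes it easy to review the flagged rows before deciding how to treat them.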
In [32]
Flagged BMI records where calculated and stored IMT diverged materially, then imputed the inconsistent cases with recalculated values.
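The consistency check rests on the standard formula BMI = weight (kg) / height (m)^2. The sketch below recomputes BMI from the height and weight columns and overwrites stored values that differ by more than a tolerance. The 0.5 kg/m2 threshold and the function names are assumptions; the notebook does not state what counted as a material divergence.

```python
import pandas as pd

HEIGHT, WEIGHT, BMI = "Tinggi badan (cm)", "Berat badan (kg)", "IMT (kg/m2)"


def recompute_bmi(df: pd.DataFrame) -> pd.Series:
    """BMI = weight in kg divided by squared height in metres."""
    height_m = df[HEIGHT] / 100.0
    return df[WEIGHT] / height_m**2


def fix_bmi(df: pd.DataFrame, tol: float = 0.5):
    """Replace stored BMI with the recalculated value where they differ by more than tol.
    tol=0.5 is an ASSUMED threshold, chosen only for illustration."""
    out = df.copy()
    calc = recompute_bmi(out)
    mismatch = (out[BMI] - calc).abs() > tol
    out.loc[mismatch, BMI] = calc[mismatch].round(2)
    return out, mismatch


# First two rows mirror the preview table; the third is a made-up bad entry.
bmi_demo = pd.DataFrame({
    HEIGHT: [172.5, 170.0, 160.0],
    WEIGHT: [49.5, 59.5, 64.0],
    BMI: [16.53, 20.59, 35.0],
})
bmi_fixed, bmi_mismatch = fix_bmi(bmi_demo)
print(bmi_mismatch.tolist())
print(bmi_fixed[BMI].tolist())
```

A tolerance is needed because stored values carry rounding differences; the first preview row, for example, recomputes to about 16.64 against a stored 16.53.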
In [45]
Reviewed unrealistic combinations of age and employment duration, then corrected rows where tenure values likely reflected entry mistakes.
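One way to express the age-tenure check is to compare tenure against the longest career a person of that age could plausibly have had. The sketch assumes that Masa Kerja is measured in years and that nobody started work before age 15; neither assumption is confirmed by the notebook, and the function name is hypothetical. It only flags rows, since the correction rule used in the original is not described.

```python
import pandas as pd

AGE_COL, TENURE_COL = "Usia", "Masa Kerja"


def flag_tenure_anomalies(df: pd.DataFrame, min_start_age: float = 15) -> pd.Series:
    """True where tenure is negative or longer than (age - min_start_age).
    ASSUMES tenure is in years; min_start_age=15 is an illustrative choice."""
    max_tenure = df[AGE_COL] - min_start_age
    return (df[TENURE_COL] < 0) | (df[TENURE_COL] > max_tenure)


# Synthetic rows: the last pairs the dataset's maximum age (39) with its maximum tenure (31).
tenure_demo = pd.DataFrame({
    AGE_COL: [19.0, 30.0, 39.0],
    TENURE_COL: [0.97, 8.0, 31.0],
})
tenure_mask = flag_tenure_anomalies(tenure_demo)
print(tenure_mask.tolist())
```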

Feature Engineering

After cleaning, the notebook derives binned demographic features, BMI-category indicators, geospatial birthplace coordinates, and interaction terms such as BMI-blood pressure products and age-triglyceride combinations.
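The binned, categorical, and interaction features described above could be derived roughly as follows. The bin edges, labels, and new column names are assumptions made for illustration (the BMI cut-offs are the general WHO ones), and the birthplace geocoding step is omitted because it depends on an external lookup.

```python
import pandas as pd


def add_features(df: pd.DataFrame) -> pd.DataFrame:
    """Add binned, categorical, and interaction features (all names hypothetical)."""
    out = df.copy()
    # Age bins: (0, 25], (25, 30], (30, 35], (35, 120]; edges are ASSUMED.
    out["age_group"] = pd.cut(
        out["Usia"], bins=[0, 25, 30, 35, 120],
        labels=["<=25", "26-30", "31-35", ">35"],
    )
    # BMI categories with left-closed intervals, e.g. [18.5, 25) is "normal".
    out["bmi_category"] = pd.cut(
        out["IMT (kg/m2)"], bins=[0, 18.5, 25, 30, 100], right=False,
        labels=["underweight", "normal", "overweight", "obese"],
    )
    # One indicator column per BMI category.
    out = out.join(pd.get_dummies(out["bmi_category"], prefix="bmi"))
    # Interaction terms named in the text.
    out["bmi_x_sbp"] = out["IMT (kg/m2)"] * out["Tekanan darah (S)"]
    out["age_x_tg"] = out["Usia"] * out["Trigliserida (mg/dL)"]
    return out


# Synthetic rows for illustration only.
feat_demo = pd.DataFrame({
    "Usia": [19.0, 28.0, 39.0],
    "IMT (kg/m2)": [16.53, 23.2, 31.0],
    "Tekanan darah (S)": [126.0, 110.0, 120.0],
    "Trigliserida (mg/dL)": [99.0, 99.0, 150.0],
})
features = add_features(feat_demo)
print(features[["age_group", "bmi_category", "bmi_x_sbp", "age_x_tg"]])
```

Because a Random Forest can learn interactions on its own, explicit product terms mainly help when the importance ranking should name a combined effect directly.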

In [54]
In [60]

Modeling and Importance Analysis

A Random Forest regressor is used as a compact, interpretable baseline to estimate cholesterol and surface the most influential engineered features.
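A feature-importance pass with scikit-learn's `RandomForestRegressor` can be sketched as below. The hyperparameters, the helper name, and the synthetic data are assumptions for illustration; the notebook's actual settings and feature set are not shown.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor


def rank_feature_importance(X: pd.DataFrame, y: pd.Series,
                            n_estimators: int = 200, random_state: int = 0) -> pd.Series:
    """Fit a Random Forest and return impurity-based importances, largest first."""
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X, y)
    importances = pd.Series(model.feature_importances_, index=X.columns)
    return importances.sort_values(ascending=False)


# Synthetic data: the target depends on "signal" only, so it should rank first.
rng = np.random.default_rng(0)
X_demo = pd.DataFrame({
    "signal": rng.normal(size=300),
    "noise": rng.normal(size=300),
})
y_demo = 3.0 * X_demo["signal"] + rng.normal(scale=0.1, size=300)

importances = rank_feature_importance(X_demo, y_demo)
print(importances)
```

Impurity-based importances sum to 1 and are cheap to compute, but they can favor features with many distinct values; `sklearn.inspection.permutation_importance` on held-out data is a common cross-check.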

In [68]
Output visualization

Visual Exploration

The notebook closes with distribution plots and relationship views that connect age, BMI, and cholesterol patterns back to the cleaned feature space.

In [73]
Output visualization
Output visualization
In [79]
Output visualization
Output visualization
In [83]
Output visualization