Employee Health Risk Analysis and Cholesterol Modeling

Project Summary

This notebook investigates which health and demographic factors most strongly relate to total cholesterol in an employee wellness dataset. The work emphasizes structured EDA, anomaly correction, feature engineering, and a Random Forest feature-importance pass to rank the main drivers of cholesterol variation.

Tech Stack

pandas
NumPy
scikit-learn
seaborn
Matplotlib

Dataset Scope

1,336 employee records
15 original variables spanning demographics, blood pressure, BMI, glucose, triglycerides, fat metrics, and employment duration

Problem Framing

The objective is to understand which measurable health indicators are most associated with total cholesterol. Instead of jumping directly into a model, the notebook first performs extensive validation and correction so downstream analysis is not distorted by obvious data-entry issues.

Objectives

The primary objective is to understand how the measured health factors relate to total cholesterol (CT, the "Cholesterol Total (mg/dL)" column). The analysis proceeds through the following steps:

Descriptive Statistics: Summarize each variable's distribution to understand its range and spread and to spot potential patterns.
Correlation Analysis: Compute correlation coefficients to measure the strength and direction of the association between CT levels and the other health indicators.
Modeling CT Levels: Fit machine learning models that estimate CT levels from the other measured variables, so that each factor's contribution to the predictions can be quantified.
Identifying Key Determinants: Rank the most influential factors from the model results to suggest where targeted interventions might improve employee health.
Interpretation and Implications: Interpret the modeling results and draw out their practical implications for employee health management.
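
The correlation step listed above can be sketched as follows. This is a minimal illustration, not the notebook's actual cell: the helper name `correlations_with_target` is hypothetical, and the toy values are made up purely to exercise the function (only the column names come from the dataset).

```python
import pandas as pd

TARGET = "Cholesterol Total (mg/dL)"

def correlations_with_target(df: pd.DataFrame, target: str = TARGET) -> pd.Series:
    """Pearson correlation of each numeric column with the target, strongest first."""
    numeric = df.select_dtypes(include="number")   # drops text columns such as sex
    corr = numeric.corr()[target].drop(target)     # remove the trivial self-correlation
    # Order by absolute strength while keeping the sign of each coefficient.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)

# Made-up toy values, for illustration only (not taken from the dataset).
toy = pd.DataFrame({
    "Usia": [19.0, 25.0, 31.0, 36.0, 39.0],
    "IMT (kg/m2)": [16.53, 21.50, 20.59, 26.00, 30.10],
    "Cholesterol Total (mg/dL)": [150.0, 170.0, 187.0, 210.0, 230.0],
    "Jenis Kelamin": ["M", "M", "F", "F", "M"],
})
result = correlations_with_target(toy)
print(result)
```

Sorting by absolute value keeps strong negative associations near the top instead of burying them at the bottom of the ranking.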

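In [13]

Column names are in Indonesian: Responden (respondent ID), Jenis Kelamin (sex), Usia (age), Tekanan darah S/D (systolic/diastolic blood pressure), Tinggi badan (height), Berat badan (weight), IMT (BMI), Lingkar perut (waist circumference), Glukosa Puasa (fasting glucose), Trigliserida (triglycerides), Masa Kerja (length of employment), Tempat lahir (place of birth).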

	Responden	Jenis Kelamin	Usia	Tekanan darah (S)	Tekanan darah (D)	Tinggi badan (cm)	Berat badan (kg)	IMT (kg/m2)	Lingkar perut (cm)	Glukosa Puasa (mg/dL)	Cholesterol Total (mg/dL)	Trigliserida (mg/dL)	Fat	Visceral Fat	Masa Kerja	Tempat lahir
0	1	M	19.0	126.0	88.0	172.5	49.5	16.53	66.0	84.0	187.0	99.0	26.4	6.0	0.97	Purworejo
1	2	M	19.0	120.0	80.0	158.0	53.6	21.50	71.0	84.0	187.0	99.0	26.4	6.0	0.60	Bogor
2	3	M	19.0	120.0	80.0	170.0	59.5	20.59	80.0	80.0	187.0	99.0	26.4	6.0	1.37	bandung
3	4	F	19.0	100.0	70.0	149.0	45.1	20.31	62.0	81.0	187.0	99.0	30.5	3.5	1.00	Jakarta
4	5	M	19.0	110.0	70.0	171.6	62.4	21.19	78.0	84.0	187.0	99.0	26.4	6.0	4.00	Teluk Betung

In [15]

	Responden	Usia	Tekanan darah (S)	Tekanan darah (D)	Tinggi badan (cm)	Berat badan (kg)	IMT (kg/m2)	Lingkar perut (cm)	Glukosa Puasa (mg/dL)	Cholesterol Total (mg/dL)	Trigliserida (mg/dL)	Fat	Visceral Fat	Masa Kerja
count	1339.000000	1339.000000	1339.000000	1339.000000	1339.000000	1339.000000	1339.000000	1339.000000	1339.000000	1339.000000	1339.000000	1339.000000	1339.000000	1339.000000
mean	670.000000	28.597461	113.147872	74.009709	164.940851	64.620500	23.693727	80.441972	84.571322	187.995519	106.982823	26.203510	6.231367	6.401837
std	386.680316	4.767230	10.164592	7.718752	7.386617	12.799096	4.021585	10.688215	11.522057	21.104834	44.143456	3.678467	2.431923	4.554438
min	1.000000	19.000000	80.000000	58.000000	138.500000	38.500000	14.850000	54.000000	65.000000	103.000000	34.000000	5.800000	0.500000	0.000000
25%	335.500000	25.000000	110.000000	70.000000	160.000000	55.275000	20.855000	72.000000	84.000000	187.000000	99.000000	26.400000	6.000000	4.000000
50%	670.000000	28.000000	110.000000	72.000000	165.000000	62.500000	23.200000	80.000000	84.000000	187.000000	99.000000	26.400000	6.000000	6.000000
75%	1004.500000	31.000000	120.000000	80.000000	170.000000	71.775000	26.000000	87.000000	84.000000	187.000000	99.000000	26.400000	6.000000	8.000000
max	1339.000000	39.000000	170.000000	100.000000	187.500000	139.750000	44.100000	128.000000	321.000000	308.000000	634.000000	40.900000	23.000000	31.000000

Data Quality Review

The notebook audits multiple columns for implausible values and inconsistent relationships, then applies targeted corrections before any feature analysis.

In [22]

Flagged ages below the expected 21-65 range (the raw minimum in the summary statistics is 19), then normalized those records for consistency.
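
A minimal sketch of this check, assuming the 21-65 range stated above. The text does not say how the records were normalized, so the clipping step below is only one hypothetical option, and both helper names are illustrative.

```python
import pandas as pd

AGE_MIN, AGE_MAX = 21, 65  # expected range stated in the text

def flag_age_anomalies(df: pd.DataFrame, col: str = "Usia") -> pd.Series:
    """True where the age falls outside [AGE_MIN, AGE_MAX]; missing ages are flagged too."""
    return ~df[col].between(AGE_MIN, AGE_MAX)

def clip_ages(df: pd.DataFrame, col: str = "Usia") -> pd.DataFrame:
    """Hypothetical correction: pull out-of-range ages to the nearest bound."""
    out = df.copy()
    out[col] = out[col].clip(lower=AGE_MIN, upper=AGE_MAX)
    return out

ages = pd.DataFrame({"Usia": [19.0, 19.0, 28.0, 39.0]})
mask = flag_age_anomalies(ages)
cleaned = clip_ages(ages)
print(mask.sum(), "anomalous rows")
```

Keeping the flag separate from the correction makes it easy to inspect the affected rows before deciding how to treat them.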

In [32]

Flagged records where the stored BMI (the IMT column) diverged materially from the value recalculated from height and weight, then replaced the inconsistent entries with the recalculated values.
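
The consistency check can be sketched with the standard formula BMI = weight (kg) / height (m)^2. The tolerance of 0.5 kg/m2 is an assumption, since the text does not say what counted as a material divergence. The first three sample rows come from the preview output, and the fourth is a made-up inconsistent record.

```python
import pandas as pd

BMI_COL = "IMT (kg/m2)"

def recompute_bmi(df: pd.DataFrame) -> pd.Series:
    """BMI from weight in kg and height in cm."""
    height_m = df["Tinggi badan (cm)"] / 100.0
    return df["Berat badan (kg)"] / height_m ** 2

def fix_inconsistent_bmi(df: pd.DataFrame, tol: float = 0.5):
    """Replace stored BMI with the recalculated value where they differ by more than tol.

    tol is a hypothetical threshold. Rows with a missing stored BMI are not flagged.
    """
    out = df.copy()
    calc = recompute_bmi(out)
    inconsistent = (out[BMI_COL] - calc).abs() > tol
    out.loc[inconsistent, BMI_COL] = calc[inconsistent].round(2)
    return out, inconsistent

sample = pd.DataFrame({
    "Tinggi badan (cm)": [172.5, 158.0, 170.0, 165.0],
    "Berat badan (kg)": [49.5, 53.6, 59.5, 62.5],
    "IMT (kg/m2)": [16.53, 21.50, 20.59, 32.00],  # last value is deliberately wrong
})
fixed, inconsistent = fix_inconsistent_bmi(sample)
print(fixed[BMI_COL].tolist())
```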

In [45]

Reviewed unrealistic combinations of age (Usia) and employment duration (Masa Kerja), then corrected rows where the tenure value likely reflected a data-entry mistake.
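
One way to express this check is to derive an implied starting age and flag rows where it is implausibly low. The sketch assumes Masa Kerja is recorded in years and uses a hypothetical minimum starting age of 18. The correction rule is not given in the text, so flagged tenures are simply set to missing here for later handling.

```python
import pandas as pd

MIN_START_AGE = 18  # hypothetical threshold, not taken from the notebook

def flag_implausible_tenure(df: pd.DataFrame) -> pd.Series:
    """True where tenure is negative or implies starting work before MIN_START_AGE."""
    implied_start_age = df["Usia"] - df["Masa Kerja"]
    return (df["Masa Kerja"] < 0) | (implied_start_age < MIN_START_AGE)

def blank_implausible_tenure(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical correction: set flagged tenure values to NaN."""
    out = df.copy()
    out["Masa Kerja"] = out["Masa Kerja"].mask(flag_implausible_tenure(out))
    return out

# Illustrative age/tenure pairs; the third row is a made-up extreme case.
records = pd.DataFrame({
    "Usia": [19.0, 19.0, 39.0, 28.0],
    "Masa Kerja": [0.97, 4.00, 31.00, 6.00],
})
flags = flag_implausible_tenure(records)
corrected = blank_implausible_tenure(records)
print(records[flags])
```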

Feature Engineering

After cleaning, the notebook derives binned demographic features, BMI-category indicators, geospatial birthplace coordinates, and interaction terms such as BMI-blood pressure products and age-triglyceride combinations.
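
A sketch of the binning, category-indicator, and interaction steps follows. The age bins, the BMI cut-offs (the common 18.5 / 25 / 30 thresholds), and the new column names are all assumptions, because the text does not list the values the notebook used. The geospatial step is omitted since it depends on an external lookup. The first two sample rows come from the preview output and the third is made up.

```python
import numpy as np
import pandas as pd

def engineer_features(df: pd.DataFrame) -> pd.DataFrame:
    out = df.copy()
    # Binned demographic feature (bin edges are hypothetical).
    out["age_group"] = pd.cut(
        out["Usia"], bins=[0, 25, 30, 35, np.inf], right=False,
        labels=["<25", "25-29", "30-34", "35+"],
    )
    # BMI categories with assumed cut-offs, then one indicator column per category.
    out["bmi_category"] = pd.cut(
        out["IMT (kg/m2)"], bins=[0, 18.5, 25, 30, np.inf], right=False,
        labels=["underweight", "normal", "overweight", "obese"],
    )
    out = pd.concat([out, pd.get_dummies(out["bmi_category"], prefix="bmi")], axis=1)
    # Interaction terms (column names are illustrative).
    out["bmi_x_sbp"] = out["IMT (kg/m2)"] * out["Tekanan darah (S)"]
    out["age_x_tg"] = out["Usia"] * out["Trigliserida (mg/dL)"]
    return out

sample = pd.DataFrame({
    "Usia": [19.0, 19.0, 32.0],
    "IMT (kg/m2)": [16.53, 21.50, 27.00],
    "Tekanan darah (S)": [126.0, 120.0, 130.0],
    "Trigliserida (mg/dL)": [99.0, 99.0, 150.0],
})
features = engineer_features(sample)
print(features.columns.tolist())
```

Using `right=False` makes each bin include its lower edge, so a BMI of exactly 25 is labeled overweight rather than normal.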

Modeling and Importance Analysis

A Random Forest regressor serves as a compact baseline for estimating cholesterol, and its feature importances are used to surface the most influential engineered features.
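
The importance pass can be sketched as below. The data is synthetic and constructed so that only the triglyceride column carries signal. It demonstrates the mechanics and says nothing about the real dataset. The helper name `rank_features` and the hyperparameters are assumptions.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

def rank_features(X: pd.DataFrame, y: pd.Series,
                  n_estimators: int = 100, random_state: int = 0) -> pd.Series:
    """Fit a Random Forest and return impurity-based importances, largest first."""
    model = RandomForestRegressor(n_estimators=n_estimators, random_state=random_state)
    model.fit(X, y)
    return pd.Series(model.feature_importances_, index=X.columns).sort_values(ascending=False)

# Synthetic data: the target depends on one column only (illustration, not a finding).
rng = np.random.default_rng(0)
n = 300
X = pd.DataFrame({
    "Trigliserida (mg/dL)": rng.normal(107, 44, n),
    "IMT (kg/m2)": rng.normal(23.7, 4.0, n),
    "noise": rng.normal(0, 1, n),
})
y = 150 + 0.3 * X["Trigliserida (mg/dL)"] + rng.normal(0, 5, n)

importances = rank_features(X, y)
print(importances)
```

Impurity-based importances sum to 1 and can be inflated for correlated or engineered features, so `sklearn.inspection.permutation_importance` on held-out data is a reasonable cross-check.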

Visual Exploration

The notebook closes with distribution plots and relationship views that connect age, BMI, and cholesterol patterns back to the cleaned feature space.
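
A minimal Matplotlib sketch of two such views follows: a cholesterol histogram and a BMI-versus-cholesterol scatter colored by age. The layout and the helper name `plot_overview` are assumptions, and the notebook's own plots may differ. The first three sample rows come from the preview output and the last two are made up.

```python
import matplotlib.pyplot as plt
import pandas as pd

TARGET = "Cholesterol Total (mg/dL)"

def plot_overview(df: pd.DataFrame):
    """Histogram of cholesterol plus BMI vs cholesterol, colored by age."""
    fig, (ax_hist, ax_scatter) = plt.subplots(1, 2, figsize=(10, 4))

    ax_hist.hist(df[TARGET], bins=20)
    ax_hist.set_xlabel(TARGET)
    ax_hist.set_ylabel("Count")

    points = ax_scatter.scatter(df["IMT (kg/m2)"], df[TARGET], c=df["Usia"])
    ax_scatter.set_xlabel("IMT (kg/m2)")
    ax_scatter.set_ylabel(TARGET)
    fig.colorbar(points, ax=ax_scatter, label="Usia")  # adds a third axes for the colorbar

    fig.tight_layout()
    return fig

sample = pd.DataFrame({
    "Usia": [19.0, 19.0, 19.0, 31.0, 36.0],
    "IMT (kg/m2)": [16.53, 21.50, 20.59, 27.40, 30.20],
    "Cholesterol Total (mg/dL)": [187.0, 187.0, 187.0, 212.0, 240.0],
})
fig = plot_overview(sample)
```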
