Fraud Detection on Fintech Loan Transaction Data

Project Summary

This portfolio notebook rebuilds the RISTEK Datathon fraud-detection workflow using the locally cached competition data so the web UI can render real outputs. The objective is to flag fraudulent fintech borrowers by combining anonymized profile variables (pc0-pc16) with loan-activity behavior.

What This Version Emphasizes

Dataset integrity checks and table-level overview
Exploratory analysis on class imbalance and sentinel values
Loan-history feature engineering from loan_activities.csv
A reproducible LightGBM baseline with compact validation metrics

In [1]

Loaded local competition files from the repo cache.

	table	rows	columns
0	train	857899	19
1	test	367702	18
2	loan_activities	4300999	4
3	non_borrower_user	1048575	18
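
The overview table above can be reproduced with a few lines of pandas. The following is a minimal sketch rather than the notebook's actual cell: the helper name table_overview and the tiny stand-in frames are hypothetical, and a real run would pass frames loaded with pd.read_csv from the local cache.

```python
import pandas as pd


def table_overview(tables: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Return one row per table with its row and column counts."""
    return pd.DataFrame(
        [
            {"table": name, "rows": len(df), "columns": df.shape[1]}
            for name, df in tables.items()
        ]
    )


# Tiny stand-in frames (hypothetical values) so the sketch runs without the data.
demo_tables = {
    "train": pd.DataFrame(
        {"user_id": [3, 5, 9], "pc0": [1.0, 0.0, 1.0], "label": [0, 0, 1]}
    ),
    "loan_activities": pd.DataFrame(
        {
            "user_id": [3, 3],
            "reference_contact": [10, 11],
            "loan_type": [1, 1],
            "ts": [671, 89],
        }
    ),
}

overview = table_overview(demo_tables)
print(overview)
```

Comparing the column counts of train and test in this summary is a quick integrity check: the only extra column in train should be the label.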

Data Snapshot

The training set contains one fraud label per user_id, while loan_activities.csv provides a longer behavioral history. Both are useful: the profile variables capture anonymized user attributes, and the loan table helps recover contact-network and activity-volume signals that are not visible in the main train table.

In [2]

	user_id	pc0	pc1	pc2	pc3	pc4	pc5	pc6	pc7	pc8	pc9	pc10	pc11	pc12	pc13	pc14	pc15	pc16
0	3	1.0000	1.0000	0.2750	0.2550	0.9273	0.4000	0.2600	0.0400	0.2540	0.9769	1.0000	0.0727	0.0231	0.0784	0.7500	0.0182	0.2500
1	5	0.0000	0.0000	0.4300	0.3650	0.8488	0.4000	1.2530	0.2100	1.2350	0.9856	1.0000	0.1512	0.0144	0.0548	0.5000	0.0116	0.2500
2	9	1.0000	3.0000	1.3150	0.8250	0.6274	0.9000	2.3850	0.1280	2.2700	0.9518	1.0000	0.3726	0.0482	0.0545	0.7778	0.0038	0.1111

	user_id	reference_contact	loan_type	ts
0	2223129	903716	1	671
1	1380939	484583	1	89
2	2724411	1185034	1	230

Class Imbalance Review

Fraud is rare in this dataset (about 1.27% of labeled users), so accuracy alone would be misleading: a model that predicts "not fraud" for every user would already score about 98.7%. Average Precision and threshold-tuned recall are more informative because they measure performance on the minority class.

In [3]

	label	count	share_pct
0	0	847042	98.7300
1	1	10857	1.2700
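
A label summary like the one above could be computed as follows. This is an illustrative sketch on a synthetic label column; the helper name label_summary is hypothetical, and the column name label is taken from the table header.

```python
import pandas as pd


def label_summary(labels: pd.Series) -> pd.DataFrame:
    """Count each class and express it as a percentage of all rows."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame(
        {
            "label": counts.index,
            "count": counts.to_numpy(),
            "share_pct": (counts / counts.sum() * 100).round(2).to_numpy(),
        }
    )


# Synthetic labels (hypothetical): 97 legitimate users and 3 fraudulent ones.
demo_labels = pd.Series([0] * 97 + [1] * 3, name="label")
summary = label_summary(demo_labels)
print(summary)
```

The share of the positive class is also the Average Precision a random ranking would achieve, which makes it a useful reference point when reading the validation metrics.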

Sentinel Pattern in Profile Variables

Many anonymized pc columns contain the value -1, which behaves like a structured missing-value marker. Measuring how often each feature equals -1 helps identify columns where "absence of information" may itself carry predictive value. The table below lists the ten most affected columns, with minus_one_rate expressed as a percentage of rows.

In [4]

	feature	minus_one_rate
0	pc16	56.1600
1	pc15	56.1600
2	pc14	41.0900
3	pc11	39.6000
4	pc12	39.6000
5	pc13	38.8400
6	pc7	38.8400
7	pc5	38.8400
8	pc3	35.6100
9	pc9	35.6100
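
The sentinel rates above amount to a column-wise mean of an equality mask. A minimal sketch, assuming the profile columns all share the pc prefix; the helper name minus_one_rate and the demo values are hypothetical.

```python
import pandas as pd


def minus_one_rate(df: pd.DataFrame, prefix: str = "pc") -> pd.DataFrame:
    """Percentage of rows equal to -1 for every column starting with `prefix`."""
    pc_cols = [c for c in df.columns if c.startswith(prefix)]
    rate = df[pc_cols].eq(-1).mean().mul(100).round(2)
    return (
        rate.sort_values(ascending=False)
        .rename_axis("feature")
        .reset_index(name="minus_one_rate")
    )


# Hypothetical demo frame: pc0 is -1 in 3 of 4 rows, pc1 in 1 of 4 rows.
demo = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "pc0": [0.5, -1.0, -1.0, -1.0],
        "pc1": [-1.0, 0.2, 0.3, 0.4],
    }
)
sentinel_rates = minus_one_rate(demo)
print(sentinel_rates)
```

Columns that share an identical rate, such as pc15 and pc16 in the table above, may be missing together for the same users, which would be worth checking before adding separate missing-value indicators for each.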

Loan-Activity Feature Engineering

The most useful behavioral features come from aggregating each borrower's history:

how often the user appears in the loan log
how many unique emergency contacts and loan types they have used
whether those contacts are statistically associated with fraud in the labeled training set

In [5]

	count	mean	std	25%	50%	75%	max
loan_count	857,899.0000	1.7148	1.3963	1.0000	2.0000	3.0000	6.0000
unique_reference_contacts	857,899.0000	1.7148	1.3963	1.0000	2.0000	3.0000	6.0000
unique_loan_types	857,899.0000	1.3427	0.9947	1.0000	1.0000	2.0000	6.0000
reference_fraud_avg	857,899.0000	0.0035	0.0560	0.0000	0.0000	0.0000	1.0000
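
The aggregation described above can be sketched with a single groupby over the loan log. The feature names follow the tables in this notebook, but the logic is an assumption: in particular, this sketch treats reference_contact as a user_id that can be looked up in the labeled training set and defines reference_fraud_avg as the mean label of a user's labeled contacts. The original cell may define it differently.

```python
import pandas as pd


def build_loan_features(train: pd.DataFrame, loans: pd.DataFrame) -> pd.DataFrame:
    """Aggregate the loan log to one row per training user."""
    # Assumption: a contact's fraud label is its own label in the training set.
    label_by_user = train.set_index("user_id")["label"]
    loans = loans.assign(contact_label=loans["reference_contact"].map(label_by_user))

    agg = loans.groupby("user_id").agg(
        loan_count=("ts", "count"),  # rows with a non-null timestamp
        unique_reference_contacts=("reference_contact", "nunique"),
        unique_loan_types=("loan_type", "nunique"),
        avg_ts=("ts", "mean"),
        max_ts=("ts", "max"),
        reference_fraud_avg=("contact_label", "mean"),  # unlabeled contacts are skipped
    )

    out = train[["user_id"]].join(agg, on="user_id")
    count_cols = ["loan_count", "unique_reference_contacts", "unique_loan_types"]
    # Users absent from the loan log get zero counts and a zero contact fraud rate.
    out[count_cols] = out[count_cols].fillna(0).astype(int)
    out["reference_fraud_avg"] = out["reference_fraud_avg"].fillna(0.0)
    return out


# Hypothetical demo data: user 2 is fraudulent and is a contact of users 1 and 3.
train_demo = pd.DataFrame({"user_id": [1, 2, 3, 4], "label": [0, 1, 0, 0]})
loans_demo = pd.DataFrame(
    {
        "user_id": [1, 1, 3, 3, 3],
        "reference_contact": [2, 5, 2, 2, 4],
        "loan_type": [1, 2, 1, 1, 1],
        "ts": [10, 30, 5, 7, 9],
    }
)
features = build_loan_features(train_demo, loans_demo).set_index("user_id")
print(features)
```

Any feature derived from training labels can leak the target into validation scores, so in a stricter setup the contact labels would be taken only from the training fold when features are built for the validation fold.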

In [6]

	loan_count_bucket	users	fraud_rate_pct
0	0	182467	3.1130
1	1	241130	1.2940
2	2	216908	0.6980
3	3-5	205039	0.2570
4	6-10	12355	0.1210
5	11-25	0	NaN
6	26+	0	NaN
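
A bucketed fraud-rate table of this shape can be built with pd.cut. The bucket edges below are inferred from the labels in the table and are an assumption, as is the helper name fraud_rate_by_bucket. Buckets with no users keep a count of 0 and a NaN rate, which matches the last two rows above.

```python
import numpy as np
import pandas as pd


def fraud_rate_by_bucket(loan_count: pd.Series, label: pd.Series) -> pd.DataFrame:
    """Users and fraud rate (in percent) per loan-count bucket."""
    # Right-closed bins: (-1, 0], (0, 1], (1, 2], (2, 5], (5, 10], (10, 25], (25, inf]
    bins = [-1, 0, 1, 2, 5, 10, 25, np.inf]
    names = ["0", "1", "2", "3-5", "6-10", "11-25", "26+"]
    bucket = pd.cut(loan_count, bins=bins, labels=names)

    # observed=False keeps empty buckets in the result.
    grouped = label.groupby(bucket, observed=False)
    table = pd.DataFrame(
        {
            "users": grouped.size(),
            "fraud_rate_pct": (grouped.mean() * 100).round(3),
        }
    )
    return table.rename_axis("loan_count_bucket").reset_index()


# Hypothetical demo: six users, one fraud case among users with no loans.
demo_counts = pd.Series([0, 0, 1, 2, 3, 6])
demo_labels = pd.Series([1, 0, 0, 0, 0, 0])
bucket_table = fraud_rate_by_bucket(demo_counts, demo_labels)
print(bucket_table)
```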

Baseline Model

A compact LightGBM classifier is trained on a stratified sample of 200,000 users, and the validation report below covers 40,000 held-out users. Sampling keeps execution practical for the portfolio while still surfacing the main signal from the engineered features and anonymized profile columns.
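
The sampling step can be sketched with scikit-learn's train_test_split. This is an assumption-laden illustration, not the notebook's cell: the helper name stratified_sample_split, the seed, and the 20% validation fraction are hypothetical, although a 20% holdout from 200,000 users would match the 40,000-row support in the validation report. The model fit itself is omitted.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split


def stratified_sample_split(
    df: pd.DataFrame,
    label_col: str = "label",
    n_sample: int = 200_000,
    valid_frac: float = 0.2,
    seed: int = 42,
) -> tuple[pd.DataFrame, pd.DataFrame]:
    """Draw a label-stratified sample, then split it into train and validation."""
    if n_sample < len(df):
        # Keep n_sample rows while preserving the class ratio.
        sample, _ = train_test_split(
            df, train_size=n_sample, stratify=df[label_col], random_state=seed
        )
    else:
        sample = df
    train_part, valid_part = train_test_split(
        sample, test_size=valid_frac, stratify=sample[label_col], random_state=seed
    )
    return train_part, valid_part


# Hypothetical demo: 1,000 users with a 5% positive rate, sampled down to 500.
demo = pd.DataFrame({"user_id": np.arange(1000), "label": [1] * 50 + [0] * 950})
train_part, valid_part = stratified_sample_split(demo, n_sample=500, valid_frac=0.2)
print(len(train_part), len(valid_part), valid_part["label"].mean())
```

Stratifying both steps matters here because, at roughly 1% prevalence, an unstratified 40,000-row holdout could end up with a noticeably different number of fraud cases from run to run.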

In [7]

Average Precision: 0.0378
ROC AUC: 0.7955
Best threshold by F1: 0.7600

	precision	recall	f1-score	support
0	0.9897	0.9373	0.9628	39,493.0000
1	0.0466	0.2387	0.0779	507.0000
accuracy			0.9284	40,000.0000
macro avg	0.5181	0.5880	0.5204	40,000.0000
weighted avg	0.9777	0.9284	0.9516	40,000.0000
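
The three headline metrics can be computed from validation scores with scikit-learn. The sketch below uses a small hand-made score vector rather than model output, and the helper name best_f1_threshold is hypothetical; it scans the thresholds returned by precision_recall_curve and keeps the one with the highest F1.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
)


def best_f1_threshold(y_true: np.ndarray, y_score: np.ndarray) -> tuple[float, float]:
    """Return the score threshold that maximizes F1, together with that F1."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final precision/recall point has no matching threshold, so drop it.
    p, r = precision[:-1], recall[:-1]
    denom = p + r
    f1 = np.divide(2 * p * r, denom, out=np.zeros_like(denom), where=denom > 0)
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


# Hypothetical scores: three positives, one of them ranked below a negative.
y_true = np.array([0, 0, 0, 0, 1, 0, 1, 1])
y_score = np.array([0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90])

ap = average_precision_score(y_true, y_score)
auc = roc_auc_score(y_true, y_score)
threshold, f1 = best_f1_threshold(y_true, y_score)
print(f"Average Precision: {ap:.4f}")
print(f"ROC AUC: {auc:.4f}")
print(f"Best threshold by F1: {threshold:.4f} (F1 = {f1:.4f})")
```

Choosing the threshold on the same validation set used for reporting makes the reported F1 slightly optimistic; a separate tuning split would give a cleaner estimate.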

In [8]

	feature	importance
21	max_ts	1773
20	avg_ts	1709
13	pc13	979
7	pc7	899
1	pc1	899
2	pc2	824
4	pc4	754
3	pc3	747
11	pc11	740
8	pc8	719
6	pc6	717
15	pc15	709

Conclusion

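This refreshed notebook now renders real outputs for every cell instead of empty ones. In the baseline, the loan-timing features max_ts and avg_ts have the highest LightGBM importance, ahead of every anonymized profile column, and fraud rate falls steadily as loan count rises, from 3.11% for users with no loan history to 0.12% for users in the 6-10 bucket. Absolute performance is still modest: the Average Precision of 0.0378 is roughly three times the 1.27% fraud prevalence, and at the F1-optimal threshold the model reaches 4.7% precision at 23.9% recall. The baseline is therefore a starting point for the loan-history story rather than a finished detector.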