Fraud Detection on Fintech Loan Transaction Data

Project Summary

This portfolio notebook rebuilds the RISTEK Datathon fraud-detection workflow using the locally cached competition data so the web UI can render real outputs. The objective is to flag fraudulent fintech borrowers by combining anonymized profile variables (pc0-pc16) with loan-activity behavior.

What This Version Emphasizes

  • Dataset integrity checks and table-level overview
  • Exploratory analysis on class imbalance and sentinel values
  • Loan-history feature engineering from loan_activities.csv
  • A reproducible LightGBM baseline with compact validation metrics
In [1]
Loaded local competition files from the repo cache.
table rows columns
0 train 857899 19
1 test 367702 18
2 loan_activities 4300999 4
3 non_borrower_user 1048575 18
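
The table-level overview above can be reproduced with a few lines of pandas. The sketch below is illustrative only: `table_overview` and `integrity_report` are hypothetical helper names, the two checks are examples rather than the notebook's actual integrity checks, and the toy frames stand in for the cached CSV files.

```python
import pandas as pd


def table_overview(tables: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Return one row per table with its row and column counts."""
    records = [
        {"table": name, "rows": df.shape[0], "columns": df.shape[1]}
        for name, df in tables.items()
    ]
    return pd.DataFrame(records, columns=["table", "rows", "columns"])


def integrity_report(df: pd.DataFrame, key: str) -> dict[str, int]:
    """Count duplicated keys and fully empty rows (illustrative checks only)."""
    return {
        "duplicate_keys": int(df[key].duplicated().sum()),
        "all_null_rows": int(df.isna().all(axis=1).sum()),
    }


# Toy frames standing in for the real competition files.
toy_train = pd.DataFrame(
    {"user_id": [3, 5, 9], "pc0": [1.0, 0.0, 1.0], "label": [0, 0, 0]}
)
toy_loans = pd.DataFrame(
    {
        "user_id": [3, 3, 5],
        "reference_contact": [7, 8, 7],
        "loan_type": [1, 1, 2],
        "ts": [10, 20, 30],
    }
)

overview = table_overview({"train": toy_train, "loan_activities": toy_loans})
report = integrity_report(toy_train, key="user_id")
print(overview)
print(report)
```

With the real data, the same dictionary would hold the four loaded frames (train, test, loan_activities, non_borrower_user).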

Data Snapshot

The training set contains one fraud label per user_id, while loan_activities.csv provides a longer behavioral history. Both are useful: the profile variables capture anonymized user attributes, and the loan table helps recover contact-network and activity-volume signals that are not visible in the main train table.

In [2]
user_id pc0 pc1 pc2 pc3 pc4 pc5 pc6 pc7 pc8 pc9 pc10 pc11 pc12 pc13 pc14 pc15 pc16 label
0 3 1.0000 1.0000 0.2750 0.2550 0.9273 0.4000 0.2600 0.0400 0.2540 0.9769 1.0000 0.0727 0.0231 0.0784 0.7500 0.0182 0.2500 0
1 5 0.0000 0.0000 0.4300 0.3650 0.8488 0.4000 1.2530 0.2100 1.2350 0.9856 1.0000 0.1512 0.0144 0.0548 0.5000 0.0116 0.2500 0
2 9 1.0000 3.0000 1.3150 0.8250 0.6274 0.9000 2.3850 0.1280 2.2700 0.9518 1.0000 0.3726 0.0482 0.0545 0.7778 0.0038 0.1111 0
user_id reference_contact loan_type ts
0 2223129 903716 1 671
1 1380939 484583 1 89
2 2724411 1185034 1 230

Class Imbalance Review

Fraud is rare in this dataset: only 1.27% of training users are labeled fraudulent, so a model that predicts "not fraud" for everyone would already reach 98.73% accuracy. Average Precision and threshold-tuned recall are more informative because they measure performance on the minority class.

In [3]
label count share_pct
0 0 847042 98.7300
1 1 10857 1.2700
Output visualization
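
To make the imbalance concrete, the following sketch rebuilds the label table from the counts shown above and computes the accuracy of a classifier that always predicts the majority class. The helper names `label_summary` and `majority_baseline_accuracy` are hypothetical and not taken from the notebook.

```python
import pandas as pd


def label_summary(labels: pd.Series) -> pd.DataFrame:
    """Count each class and express it as a percentage of all rows."""
    counts = labels.value_counts().sort_index()
    return pd.DataFrame(
        {
            "label": counts.index,
            "count": counts.to_numpy(),
            "share_pct": (counts / counts.sum() * 100).round(2).to_numpy(),
        }
    )


def majority_baseline_accuracy(labels: pd.Series) -> float:
    """Accuracy of a model that always predicts the most frequent class."""
    return float(labels.value_counts(normalize=True).max())


# Label counts copied from the table above: 847,042 non-fraud, 10,857 fraud.
labels = pd.Series([0] * 847042 + [1] * 10857, name="label")

summary = label_summary(labels)
baseline_acc = majority_baseline_accuracy(labels)
print(summary)
print(f"Majority-class accuracy: {baseline_acc:.4f}")
```

Any accuracy figure reported for a model on this data should be read against that majority-class baseline.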

Sentinel Pattern in Profile Variables

Many anonymized pc columns contain the value -1, which behaves like a structured missing-value marker. Measuring how often each feature hits -1 helps identify columns where "absence of information" may itself carry predictive value.

In [4]
feature minus_one_rate
0 pc16 56.1600
1 pc15 56.1600
2 pc14 41.0900
3 pc11 39.6000
4 pc12 39.6000
5 pc13 38.8400
6 pc7 38.8400
7 pc5 38.8400
8 pc3 35.6100
9 pc9 35.6100
Output visualization
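
A sentinel-rate table like the one above can be computed by comparing each column against -1 and averaging. The sketch below assumes the sentinel is exactly -1 in every pc column; `sentinel_rate` and `add_sentinel_flags` are hypothetical helpers, and the indicator columns are one possible way to expose "absence of information" to a model, not necessarily what the notebook does.

```python
import pandas as pd


def sentinel_rate(df: pd.DataFrame, columns: list[str], sentinel: float = -1) -> pd.DataFrame:
    """Percentage of rows equal to the sentinel, per column, highest first."""
    rates = df[columns].eq(sentinel).mean().mul(100)
    return (
        rates.sort_values(ascending=False)
        .rename_axis("feature")
        .reset_index(name="minus_one_rate")
    )


def add_sentinel_flags(df: pd.DataFrame, columns: list[str], sentinel: float = -1) -> pd.DataFrame:
    """Append one 0/1 indicator column per feature marking sentinel hits."""
    flags = df[columns].eq(sentinel).astype("int8").add_suffix("_is_missing")
    return pd.concat([df, flags], axis=1)


# Toy data standing in for the profile columns.
toy = pd.DataFrame(
    {
        "pc15": [-1, -1, 0.5, -1],
        "pc3": [0.2, -1, 0.1, 0.4],
        "pc0": [1.0, 0.0, 1.0, 0.0],
    }
)

rates = sentinel_rate(toy, ["pc0", "pc3", "pc15"])
flagged = add_sentinel_flags(toy, ["pc15"])
print(rates)
```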

Loan-Activity Feature Engineering

The most useful behavioral features come from aggregating each borrower's history:

  • how often the user appears in the loan log
  • how many unique emergency contacts (reference_contact) and loan types (loan_type) they have used
  • whether those contacts are statistically associated with fraud in the labeled training set
In [5]
count mean std min 25% 50% 75% max
loan_count 857,899.0000 1.7148 1.3963 0.0000 1.0000 2.0000 3.0000 6.0000
unique_reference_contacts 857,899.0000 1.7148 1.3963 0.0000 1.0000 2.0000 3.0000 6.0000
unique_loan_types 857,899.0000 1.3427 0.9947 0.0000 1.0000 1.0000 2.0000 6.0000
reference_fraud_avg 857,899.0000 0.0035 0.0560 0.0000 0.0000 0.0000 0.0000 1.0000
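
The aggregates summarized above could be built roughly as follows. This is a sketch under assumptions: the feature names match the describe() output, but the logic for `reference_fraud_avg` assumes that a reference_contact id can be looked up as a user_id in the labeled training table, which the notebook does not state. `build_loan_features` is a hypothetical function name.

```python
import pandas as pd


def build_loan_features(train: pd.DataFrame, loans: pd.DataFrame) -> pd.DataFrame:
    """Aggregate the loan log to one row per training user."""
    # Volume and diversity of each borrower's loan history.
    agg = loans.groupby("user_id").agg(
        loan_count=("ts", "count"),
        unique_reference_contacts=("reference_contact", "nunique"),
        unique_loan_types=("loan_type", "nunique"),
    )

    # Assumption: reference_contact shares the id space of train.user_id,
    # so each contact's fraud label (if known) can be looked up directly.
    label_by_user = train.set_index("user_id")["label"]
    contact_label = loans["reference_contact"].map(label_by_user)
    # Mean label over a borrower's labeled contacts; unlabeled contacts are skipped.
    agg["reference_fraud_avg"] = contact_label.groupby(loans["user_id"]).mean()

    # Users without loan rows (or without labeled contacts) get zeros.
    features = train[["user_id"]].merge(agg.reset_index(), on="user_id", how="left")
    return features.fillna(0)


toy_train = pd.DataFrame({"user_id": [1, 2, 3, 4], "label": [0, 1, 0, 0]})
toy_loans = pd.DataFrame(
    {
        "user_id": [1, 1, 3, 3, 3],
        "reference_contact": [2, 4, 2, 2, 9],
        "loan_type": [1, 2, 1, 1, 1],
        "ts": [5, 9, 1, 2, 3],
    }
)

features = build_loan_features(toy_train, toy_loans)
print(features)
```

Because `reference_fraud_avg` is derived from training labels, computing it inside each training fold rather than on the full table would likely be needed to avoid leaking validation labels.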
In [6]
loan_count_bucket users fraud_rate_pct
0 0 182467 3.1130
1 1 241130 1.2940
2 2 216908 0.6980
3 3-5 205039 0.2570
4 6-10 12355 0.1210
5 11-25 0 NaN
6 26+ 0 NaN
Output visualization
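
The bucketed fraud-rate table can be sketched with `pd.cut`. The bin edges below are inferred from the bucket labels in the output (0, 1, 2, 3-5, 6-10, 11-25, 26+) and are an assumption, as is the helper name `fraud_rate_by_loan_bucket`. Passing `observed=False` keeps empty buckets in the result, which is why they show 0 users and a NaN rate.

```python
import numpy as np
import pandas as pd

# Right-closed bins: (-1, 0], (0, 1], (1, 2], (2, 5], (5, 10], (10, 25], (25, inf]
BINS = [-1, 0, 1, 2, 5, 10, 25, np.inf]
LABELS = ["0", "1", "2", "3-5", "6-10", "11-25", "26+"]


def fraud_rate_by_loan_bucket(df: pd.DataFrame) -> pd.DataFrame:
    """Users and fraud rate (in percent) per loan_count bucket."""
    bucket = pd.cut(df["loan_count"], bins=BINS, labels=LABELS)
    grouped = df.groupby(bucket, observed=False)["label"]
    out = pd.DataFrame(
        {
            "users": grouped.size(),                    # 0 for empty buckets
            "fraud_rate_pct": grouped.mean().mul(100),  # NaN for empty buckets
        }
    )
    return out.rename_axis("loan_count_bucket").reset_index()


toy = pd.DataFrame(
    {
        "loan_count": [0, 0, 1, 2, 4, 7],
        "label": [1, 0, 0, 0, 0, 0],
    }
)

bucket_table = fraud_rate_by_loan_bucket(toy)
print(bucket_table)
```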

Baseline Model

A compact LightGBM classifier is trained on a stratified sample of 200,000 users. This keeps execution practical for the portfolio while still surfacing the main signal from the engineered features and anonymized profile columns. The validation report below covers 40,000 users, which is consistent with a 20% holdout from that sample.
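
One way to draw a stratified sample and a stratified holdout with pandas alone is sketched below. The helper names, the seed, and the 20% validation fraction are assumptions for illustration; the notebook may well use a different splitting utility.

```python
import pandas as pd


def stratified_sample(df: pd.DataFrame, label_col: str, n_total: int, seed: int = 42) -> pd.DataFrame:
    """Sample about n_total rows while preserving the class proportions."""
    frac = min(1.0, n_total / len(df))
    return df.groupby(label_col, group_keys=False).sample(frac=frac, random_state=seed)


def stratified_holdout(df: pd.DataFrame, label_col: str, valid_frac: float = 0.2, seed: int = 42):
    """Split df into (train, valid), drawing valid_frac of each class for validation."""
    valid = df.groupby(label_col, group_keys=False).sample(frac=valid_frac, random_state=seed)
    train = df.drop(valid.index)  # requires a unique index
    return train, valid


# Toy population: 1,000 users with a 5% positive rate.
population = pd.DataFrame({"user_id": range(1000), "label": [1] * 50 + [0] * 950})

sample = stratified_sample(population, "label", n_total=200)
train_part, valid_part = stratified_holdout(sample, "label", valid_frac=0.2)
print(len(sample), len(train_part), len(valid_part))
```

Sampling within each class keeps the fraud share in the sample and in the holdout close to the share in the full table, which matters when positives are this rare.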

In [7]
Average Precision: 0.0378
ROC AUC: 0.7955
Best threshold by F1: 0.7600
              precision  recall  f1-score  support
0                0.9897  0.9373    0.9628   39,493
1                0.0466  0.2387    0.0779      507
accuracy                           0.9284   40,000
macro avg        0.5181  0.5880    0.5204   40,000
weighted avg     0.9777  0.9284    0.9516   40,000
Output visualization
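
The "Best threshold by F1" figure suggests a sweep over candidate cutoffs. A minimal pure-Python sketch of such a sweep follows; the 0.01 grid and the function names are assumptions, and the toy scores are made up for illustration rather than taken from the model.

```python
def f1_at_threshold(y_true, scores, threshold):
    """F1 for the positive class when predicting 1 if score >= threshold."""
    tp = fp = fn = 0
    for y, s in zip(y_true, scores):
        predicted_positive = s >= threshold
        if predicted_positive and y == 1:
            tp += 1
        elif predicted_positive:
            fp += 1
        elif y == 1:
            fn += 1
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_f1_threshold(y_true, scores, grid=None):
    """Return (threshold, f1) for the first grid point with the highest F1."""
    if grid is None:
        grid = [i / 100 for i in range(1, 100)]  # 0.01, 0.02, ..., 0.99
    best = max(grid, key=lambda t: f1_at_threshold(y_true, scores, t))
    return best, f1_at_threshold(y_true, scores, best)


# Made-up validation labels and predicted probabilities.
y_true = [0, 0, 0, 0, 1, 0, 1, 1]
scores = [0.05, 0.10, 0.20, 0.30, 0.35, 0.60, 0.70, 0.90]

best_t, best_f1 = best_f1_threshold(y_true, scores)
print(f"Best threshold by F1: {best_t:.2f} (F1 = {best_f1:.4f})")
```

A threshold tuned on the same split that is used for reporting will tend to look optimistic, so a separate tuning split may be preferable when the numbers are used for comparison.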
In [8]
feature importance
21 max_ts 1773
20 avg_ts 1709
13 pc13 979
7 pc7 899
1 pc1 899
2 pc2 824
4 pc4 754
3 pc3 747
11 pc11 740
8 pc8 719
6 pc6 717
15 pc15 709
Output visualization

Conclusion

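The notebook now renders real outputs for every cell. In the LightGBM importance table, the loan-history timing features max_ts and avg_ts rank first, ahead of every anonymized profile column. Loan frequency also separates the classes: the fraud rate falls from 3.11% for users with a loan count of 0 to 0.12% in the 6-10 bucket. The contact-network features are not among the twelve importances listed, so their contribution cannot be read off that table. The baseline remains weak in absolute terms, with an Average Precision of 0.0378 against a fraud prevalence of about 1.27%, so the engineered features provide a starting signal rather than a finished detector.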