Fraud Detection on Fintech Loan Transaction Data
Project Summary
This portfolio notebook rebuilds the RISTEK Datathon fraud-detection workflow using the locally cached competition data so the web UI can render real outputs. The objective is to flag fraudulent fintech borrowers by combining anonymized profile variables (pc0-pc16) with loan-activity behavior.
What This Version Emphasizes
- Dataset integrity checks and table-level overview
- Exploratory analysis on class imbalance and sentinel values
- Loan-history feature engineering from loan_activities.csv
- A reproducible LightGBM baseline with compact validation metrics
Loaded local competition files from the repo cache.
| | table | rows | columns |
|---|---|---|---|
| 0 | train | 857899 | 19 |
| 1 | test | 367702 | 18 |
| 2 | loan_activities | 4300999 | 4 |
| 3 | non_borrower_user | 1048575 | 18 |
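An overview like the one above can be produced with a small helper. The following is a minimal sketch, not the notebook's actual code: the function name `table_overview`, the extra `duplicate_rows` and `null_cells` columns, and the toy frames are all illustrative assumptions. In practice each frame would come from `pd.read_csv` on the cached competition files.

```python
import pandas as pd


def table_overview(tables: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Summarize each table's shape plus two basic integrity counts."""
    rows = []
    for name, df in tables.items():
        rows.append(
            {
                "table": name,
                "rows": len(df),
                "columns": df.shape[1],
                # Fully identical rows often point to a bad export or join.
                "duplicate_rows": int(df.duplicated().sum()),
                # True NaN cells, as opposed to the -1 sentinel.
                "null_cells": int(df.isna().sum().sum()),
            }
        )
    return pd.DataFrame(rows)


# Toy stand-ins for the real tables (hypothetical values).
toy_tables = {
    "train": pd.DataFrame(
        {"user_id": [1, 2, 3], "pc0": [0.1, -1.0, 0.4], "label": [0, 0, 1]}
    ),
    "loan_activities": pd.DataFrame(
        {
            "user_id": [1, 1, 2, 2],
            "reference_contact": [10, 11, 10, 10],
            "loan_type": [1, 2, 1, 1],
            "ts": [5, 9, 3, 3],
        }
    ),
}

overview = table_overview(toy_tables)
print(overview)
```

Keeping the duplicate and null counts next to the shapes makes it easy to spot a table that loaded with the wrong delimiter or lost rows in transit.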
Data Snapshot
The training set contains one fraud label per user_id, while loan_activities.csv provides a longer behavioral history. Both are useful: the profile variables capture anonymized user attributes, and the loan table helps recover contact-network and activity-volume signals that are not visible in the main train table.
| | user_id | pc0 | pc1 | pc2 | pc3 | pc4 | pc5 | pc6 | pc7 | pc8 | pc9 | pc10 | pc11 | pc12 | pc13 | pc14 | pc15 | pc16 | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 3 | 1.0000 | 1.0000 | 0.2750 | 0.2550 | 0.9273 | 0.4000 | 0.2600 | 0.0400 | 0.2540 | 0.9769 | 1.0000 | 0.0727 | 0.0231 | 0.0784 | 0.7500 | 0.0182 | 0.2500 | 0 |
| 1 | 5 | 0.0000 | 0.0000 | 0.4300 | 0.3650 | 0.8488 | 0.4000 | 1.2530 | 0.2100 | 1.2350 | 0.9856 | 1.0000 | 0.1512 | 0.0144 | 0.0548 | 0.5000 | 0.0116 | 0.2500 | 0 |
| 2 | 9 | 1.0000 | 3.0000 | 1.3150 | 0.8250 | 0.6274 | 0.9000 | 2.3850 | 0.1280 | 2.2700 | 0.9518 | 1.0000 | 0.3726 | 0.0482 | 0.0545 | 0.7778 | 0.0038 | 0.1111 | 0 |
| | user_id | reference_contact | loan_type | ts |
|---|---|---|---|---|
| 0 | 2223129 | 903716 | 1 | 671 |
| 1 | 1380939 | 484583 | 1 | 89 |
| 2 | 2724411 | 1185034 | 1 | 230 |
Class Imbalance Review
Fraud is rare in this dataset: only 1.27% of labeled users are fraudulent, so a model that predicts "not fraud" for everyone would already reach 98.73% accuracy. Average Precision and threshold-tuned recall are more informative because they reflect performance on the minority class.
| | label | count | share_pct |
|---|---|---|---|
| 0 | 0 | 847042 | 98.7300 |
| 1 | 1 | 10857 | 1.2700 |
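The class shares, and the accuracy of a trivial majority-class predictor, follow directly from the label counts. The sketch below rebuilds a label column from the two counts in the table rather than loading the real file. The helper name `class_balance` is an assumption for illustration.

```python
import pandas as pd


def class_balance(labels: pd.Series) -> pd.DataFrame:
    """Return per-class counts and their share of all rows in percent."""
    counts = labels.value_counts().sort_index()
    out = counts.rename("count").rename_axis("label").reset_index()
    out["share_pct"] = (100 * out["count"] / out["count"].sum()).round(2)
    return out


# Synthetic label column using the counts reported in the table above.
labels = pd.Series([0] * 847_042 + [1] * 10_857, name="label")
balance = class_balance(labels)

# Accuracy of always predicting the majority class ("not fraud").
majority_accuracy = balance["count"].max() / balance["count"].sum()

print(balance)
print(f"Majority-class accuracy: {majority_accuracy:.4f}")
```

Any model whose accuracy sits near the majority-class figure has not necessarily learned anything about fraud, which is why the notebook reports ranking metrics instead.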
Sentinel Pattern in Profile Variables
Many anonymized pc columns contain the value -1, which behaves like a structured missing-value marker. The table below reports, for each feature, the share of rows (in percent) equal to -1. Measuring this rate helps identify columns where "absence of information" may itself carry predictive value.
| | feature | minus_one_rate |
|---|---|---|
| 0 | pc16 | 56.1600 |
| 1 | pc15 | 56.1600 |
| 2 | pc14 | 41.0900 |
| 3 | pc11 | 39.6000 |
| 4 | pc12 | 39.6000 |
| 5 | pc13 | 38.8400 |
| 6 | pc7 | 38.8400 |
| 7 | pc5 | 38.8400 |
| 8 | pc3 | 35.6100 |
| 9 | pc9 | 35.6100 |
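A sentinel-rate table of this shape can be computed in a few lines. This is a sketch under assumptions: the helper names are invented for illustration, and the `_is_missing` indicator columns are one possible way to expose "absence of information" to a model, not something the notebook is confirmed to do.

```python
import pandas as pd


def minus_one_rates(df: pd.DataFrame, sentinel: float = -1.0) -> pd.DataFrame:
    """Percentage of rows equal to the sentinel, per pc column, highest first."""
    pc_cols = [c for c in df.columns if c.startswith("pc")]
    rates = df[pc_cols].eq(sentinel).mean().mul(100).round(2)
    return (
        rates.sort_values(ascending=False)
        .rename("minus_one_rate")
        .rename_axis("feature")
        .reset_index()
    )


def add_sentinel_flags(df: pd.DataFrame, sentinel: float = -1.0) -> pd.DataFrame:
    """Add a 0/1 indicator per pc column marking sentinel rows (assumed design)."""
    out = df.copy()
    for col in [c for c in df.columns if c.startswith("pc")]:
        out[f"{col}_is_missing"] = df[col].eq(sentinel).astype("int8")
    return out


# Toy frame with hypothetical values.
toy = pd.DataFrame(
    {
        "user_id": [1, 2, 3, 4],
        "pc0": [0.2, 0.5, 0.1, 0.9],
        "pc15": [-1.0, -1.0, 0.75, -1.0],
        "pc16": [-1.0, 0.25, 0.25, -1.0],
    }
)

rates = minus_one_rates(toy)
flagged = add_sentinel_flags(toy)
print(rates)
```

Tree models such as LightGBM can split on -1 directly, so the explicit flags are optional. They mainly help when the same features feed a linear model.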
Loan-Activity Feature Engineering
The most useful behavioral features come from aggregating each borrower's history:
- how often the user appears in the loan log
- how many unique emergency contacts and loan types they have used
- whether those contacts are statistically associated with fraud in the labeled training set
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| loan_count | 857,899.0000 | 1.7148 | 1.3963 | 0.0000 | 1.0000 | 2.0000 | 3.0000 | 6.0000 |
| unique_reference_contacts | 857,899.0000 | 1.7148 | 1.3963 | 0.0000 | 1.0000 | 2.0000 | 3.0000 | 6.0000 |
| unique_loan_types | 857,899.0000 | 1.3427 | 0.9947 | 0.0000 | 1.0000 | 1.0000 | 2.0000 | 6.0000 |
| reference_fraud_avg | 857,899.0000 | 0.0035 | 0.0560 | 0.0000 | 0.0000 | 0.0000 | 0.0000 | 1.0000 |
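The aggregation described above can be sketched with a single `groupby`. The column names follow the tables in this notebook, but the definition of `reference_fraud_avg` is an assumption: here each contact is scored by the fraud rate of the labeled users who list it, and each user receives the mean score of their contacts. The notebook's actual definition may differ.

```python
import pandas as pd


def build_loan_features(train: pd.DataFrame, loans: pd.DataFrame) -> pd.DataFrame:
    # Per-user volume, diversity, and timing aggregates from the loan log.
    agg = loans.groupby("user_id").agg(
        loan_count=("ts", "size"),
        unique_reference_contacts=("reference_contact", "nunique"),
        unique_loan_types=("loan_type", "nunique"),
        avg_ts=("ts", "mean"),
        max_ts=("ts", "max"),
    )

    # ASSUMPTION: contact score = fraud rate among labeled users listing it.
    pairs = loans[["user_id", "reference_contact"]].drop_duplicates()
    labeled = pairs.merge(train[["user_id", "label"]], on="user_id", how="inner")
    contact_rate = (
        labeled.groupby("reference_contact")["label"].mean().rename("contact_fraud_rate")
    )
    per_user = (
        pairs.join(contact_rate, on="reference_contact")
        .groupby("user_id")["contact_fraud_rate"]
        .mean()
        .rename("reference_fraud_avg")
    )

    feats = train[["user_id"]].merge(
        agg.join(per_user).reset_index(), on="user_id", how="left"
    )
    # Users absent from the loan log get zero counts; timing stays NaN.
    count_cols = ["loan_count", "unique_reference_contacts", "unique_loan_types"]
    feats[count_cols] = feats[count_cols].fillna(0).astype(int)
    feats["reference_fraud_avg"] = feats["reference_fraud_avg"].fillna(0.0)
    return feats


# Toy data with hypothetical values; user 4 has no loan history.
train = pd.DataFrame({"user_id": [1, 2, 3, 4], "label": [0, 1, 0, 0]})
loans = pd.DataFrame(
    {
        "user_id": [1, 1, 2, 2, 3],
        "reference_contact": [10, 11, 10, 10, 12],
        "loan_type": [1, 2, 1, 1, 1],
        "ts": [5, 9, 3, 7, 4],
    }
)

feats = build_loan_features(train, loans)
print(feats)
```

Because this contact score includes each user's own label, it leaks the target when used for training. An out-of-fold or leave-one-out version would be the safer choice before fitting a model on it.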
| | loan_count_bucket | users | fraud_rate_pct |
|---|---|---|---|
| 0 | 0 | 182467 | 3.1130 |
| 1 | 1 | 241130 | 1.2940 |
| 2 | 2 | 216908 | 0.6980 |
| 3 | 3-5 | 205039 | 0.2570 |
| 4 | 6-10 | 12355 | 0.1210 |
| 5 | 11-25 | 0 | NaN |
| 6 | 26+ | 0 | NaN |
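The bucketed fraud rates can be reproduced with `pd.cut` followed by a grouped mean. The bin edges below are inferred from the bucket labels in the table and are an assumption, as are the function name and the toy data. Passing `observed=False` keeps empty buckets in the output, which is why buckets with no users show a count of 0 and a NaN rate.

```python
import numpy as np
import pandas as pd

# Right-inclusive edges: (-1, 0], (0, 1], (1, 2], (2, 5], (5, 10], (10, 25], (25, inf)
BUCKET_EDGES = [-1, 0, 1, 2, 5, 10, 25, np.inf]
BUCKET_LABELS = ["0", "1", "2", "3-5", "6-10", "11-25", "26+"]


def fraud_rate_by_bucket(df: pd.DataFrame) -> pd.DataFrame:
    """Users and fraud rate (percent) per loan_count bucket."""
    bucket = pd.cut(df["loan_count"], bins=BUCKET_EDGES, labels=BUCKET_LABELS)
    out = (
        df.assign(loan_count_bucket=bucket)
        .groupby("loan_count_bucket", observed=False)["label"]
        .agg(users="size", fraud_rate_pct="mean")
        .reset_index()
    )
    out["fraud_rate_pct"] = (100 * out["fraud_rate_pct"]).round(3)
    return out


# Toy data with hypothetical values.
toy = pd.DataFrame(
    {"loan_count": [0, 0, 1, 2, 3, 6], "label": [1, 0, 0, 0, 0, 0]}
)

bucket_table = fraud_rate_by_bucket(toy)
print(bucket_table)
```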
Baseline Model
A compact LightGBM classifier is trained on a stratified sample of 200,000 users. This keeps execution practical for the portfolio while still surfacing the main signal from the engineered features and anonymized profile columns.
Average Precision: 0.0378
ROC AUC: 0.7955
Best threshold by F1: 0.7600
| | precision | recall | f1-score | support |
|---|---|---|---|---|
| 0 | 0.9897 | 0.9373 | 0.9628 | 39,493.0000 |
| 1 | 0.0466 | 0.2387 | 0.0779 | 507.0000 |
| accuracy | | | 0.9284 | 40,000.0000 |
| macro avg | 0.5181 | 0.5880 | 0.5204 | 40,000.0000 |
| weighted avg | 0.9777 | 0.9284 | 0.9516 | 40,000.0000 |
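The three headline metrics can be computed from validation labels and predicted scores with scikit-learn. The sketch below uses synthetic scores, not the notebook's model output, and it picks the F1-optimal threshold from the precision-recall curve. The notebook may instead scan a fixed grid of thresholds, so treat `best_f1_threshold` as one possible approach.

```python
import numpy as np
from sklearn.metrics import (
    average_precision_score,
    precision_recall_curve,
    roc_auc_score,
)


def best_f1_threshold(y_true: np.ndarray, y_score: np.ndarray) -> tuple[float, float]:
    """Return (threshold, f1) maximizing F1 over the precision-recall curve."""
    precision, recall, thresholds = precision_recall_curve(y_true, y_score)
    # The final precision/recall pair has no matching threshold, so drop it.
    precision, recall = precision[:-1], recall[:-1]
    denom = precision + recall
    f1 = np.divide(
        2 * precision * recall, denom, out=np.zeros_like(denom), where=denom > 0
    )
    best = int(np.argmax(f1))
    return float(thresholds[best]), float(f1[best])


# Synthetic validation set: about 1.27% positives with weakly separated scores.
rng = np.random.default_rng(0)
y_val = (rng.random(40_000) < 0.0127).astype(int)
raw = rng.normal(size=y_val.size) + 1.2 * y_val
scores = 1.0 / (1.0 + np.exp(-raw))

ap = average_precision_score(y_val, scores)
auc = roc_auc_score(y_val, scores)
threshold, f1 = best_f1_threshold(y_val, scores)

print(f"Average Precision: {ap:.4f}")
print(f"ROC AUC: {auc:.4f}")
print(f"Best threshold by F1: {threshold:.4f} (F1={f1:.4f})")
```

A threshold tuned on the same split that is used for reporting is optimistic. Tuning it on a separate fold gives a fairer estimate of the F1 score.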
| | feature | importance |
|---|---|---|
| 21 | max_ts | 1773 |
| 20 | avg_ts | 1709 |
| 13 | pc13 | 979 |
| 7 | pc7 | 899 |
| 1 | pc1 | 899 |
| 2 | pc2 | 824 |
| 4 | pc4 | 754 |
| 3 | pc3 | 747 |
| 11 | pc11 | 740 |
| 8 | pc8 | 719 |
| 6 | pc6 | 717 |
| 15 | pc15 | 709 |
Conclusion
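The strongest signals in this baseline come from loan-history timing and frequency: max_ts and avg_ts lead the feature-importance ranking, and the fraud rate falls from 3.11% for users with no logged loans to 0.12% in the 6-10 bucket. No single anonymized profile column matches them on its own, and contact-network features such as reference_fraud_avg do not appear among the twelve most important features shown. With an Average Precision of 0.0378 against a 1.27% fraud base rate, this model is a starting baseline for fraud detection rather than a finished detector.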