Crowd Counting and Head Localization
Project Summary
This notebook addresses dense crowd analysis by predicting both the number of people and the head-center coordinates in each image. The pipeline combines image slicing for large scenes, density-map regression, non-maximum suppression, and explicit localization metrics.
Tech Stack
- PyTorch
- torchvision
- SAHI
- SciPy
- Matplotlib
Key Results
- Three-epoch training run reached a best validation loss of 0.0002
- Inference generated predictions for 400 test images with 1,646 detected heads
- The notebook includes both count-based error metrics and localization precision/recall/F1 evaluation
Problem Setup
The task is to detect every head center in crowded scenes, then evaluate both total count accuracy and point localization quality. Because full-size images are large and densely populated, the workflow first slices each scene into smaller overlapping patches and later merges predictions back into image-level coordinates.
Evaluation Design
Counting quality is measured with error metrics such as MAE and RMSE, while localization quality is measured by matching predicted head centers to ground-truth points within a fixed pixel radius.
Example evaluation on 3 images:

```
MAE: 1.6667
RMSE: 2.8868
Precision: 0.9167
Recall: 0.6471
F1-Score: 0.7586
```
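For reference, here is a minimal sketch of this radius-based matching using SciPy's Hungarian assignment. The `localization_metrics` name and the 10-pixel radius are illustrative assumptions, not the notebook's exact code:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def localization_metrics(pred_pts, gt_pts, radius=10.0):
    """Match predicted head centers to ground-truth points within a fixed
    pixel radius and return precision / recall / F1.
    `radius` is an illustrative value, not the notebook's setting."""
    if len(pred_pts) == 0 or len(gt_pts) == 0:
        return 0.0, 0.0, 0.0
    pred = np.asarray(pred_pts, dtype=float)
    gt = np.asarray(gt_pts, dtype=float)
    # Pairwise Euclidean distances between predictions and ground truth.
    dists = np.linalg.norm(pred[:, None, :] - gt[None, :, :], axis=2)
    # One-to-one assignment that minimizes total matching distance.
    rows, cols = linear_sum_assignment(dists)
    tp = int(np.sum(dists[rows, cols] <= radius))
    precision = tp / len(pred)
    recall = tp / len(gt)
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    return precision, recall, f1
```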
Slice-Based Data Preparation
Large crowd images are cut into overlapping 512 x 512 patches before training. This makes dense scenes easier to batch on GPU while preserving enough local context for head localization.
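The stack lists SAHI for this step; as a library-free illustration of the same idea, here is a minimal NumPy sketch. The 25% overlap and the function names are assumptions, not the notebook's values:

```python
import numpy as np

def _starts(extent, size, stride):
    """Window start offsets that cover `extent`, clamping the last window."""
    starts = list(range(0, max(extent - size, 0) + 1, stride))
    if starts[-1] + size < extent:
        starts.append(extent - size)
    return starts

def slice_with_overlap(image, size=512, overlap=0.25):
    """Yield (patch, (y0, x0)) pairs from an H x W [x C] image.
    Offsets are kept so patch-level predictions can later be shifted
    back to full-image coordinates."""
    h, w = image.shape[:2]
    stride = max(int(size * (1 - overlap)), 1)
    for y0 in _starts(h, size, stride):
        for x0 in _starts(w, size, stride):
            yield image[y0:y0 + size, x0:x0 + size], (y0, x0)
```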
Dataset and Model
The training dataset converts point annotations into density targets, then feeds them into a VGG16-based density estimation network with a frozen feature extractor and a custom convolutional backend.
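A minimal sketch of both pieces, assuming a Gaussian-blurred density target and a CSRNet-style cut of the VGG16 features; the kernel width, layer cut, and backend channel sizes are illustrative choices, not necessarily the notebook's:

```python
import numpy as np
import torch.nn as nn
from scipy.ndimage import gaussian_filter
from torchvision import models

def points_to_density(points, hw, sigma=4.0):
    """Turn head-center (x, y) points into a density map whose sum equals
    the head count. `sigma` is an illustrative kernel width."""
    density = np.zeros(hw, dtype=np.float32)
    for x, y in points:
        xi, yi = int(round(x)), int(round(y))
        if 0 <= yi < hw[0] and 0 <= xi < hw[1]:
            density[yi, xi] += 1.0
    return gaussian_filter(density, sigma)

class DensityNet(nn.Module):
    """Frozen VGG16 feature extractor + small convolutional regressor."""
    def __init__(self):
        super().__init__()
        vgg = models.vgg16(weights=models.VGG16_Weights.DEFAULT)
        self.frontend = vgg.features[:23]   # up to conv4_3, output stride 8
        for p in self.frontend.parameters():
            p.requires_grad = False         # keep the extractor frozen
        self.backend = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1),           # 1-channel density map
        )

    def forward(self, x):
        return self.backend(self.frontend(x))
```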
Training Run
The training loop below shows the compact experiment that was used to verify convergence before moving to inference and evaluation.
```
Training on cuda
Epoch 1/3 | Train Loss: 0.0002 | Val Loss: 0.0002
Epoch 2/3 | Train Loss: 0.0002 | Val Loss: 0.0002
Epoch 3/3 | Train Loss: 0.0002 | Val Loss: 0.0002
Training finished. Best validation loss: 0.0002
```
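The kind of loop that produces a log like this is roughly the following sketch, assuming MSE loss on density maps; the learning rate, checkpoint path, and `train_loader`/`val_loader` objects are hypothetical:

```python
import torch

def train(model, train_loader, val_loader, epochs=3, lr=1e-5,
          ckpt_path="best_model.pth"):
    """Compact loop: MSE on density maps, keep the best checkpoint."""
    device = "cuda" if torch.cuda.is_available() else "cpu"
    print(f"Training on {device}")
    model.to(device)
    opt = torch.optim.Adam(
        (p for p in model.parameters() if p.requires_grad), lr=lr)
    loss_fn = torch.nn.MSELoss()
    best = float("inf")
    for epoch in range(1, epochs + 1):
        model.train()
        tr = 0.0
        for imgs, targets in train_loader:
            imgs, targets = imgs.to(device), targets.to(device)
            opt.zero_grad()
            loss = loss_fn(model(imgs), targets)
            loss.backward()
            opt.step()
            tr += loss.item()
        model.eval()
        va = 0.0
        with torch.no_grad():
            for imgs, targets in val_loader:
                va += loss_fn(model(imgs.to(device)),
                              targets.to(device)).item()
        tr /= len(train_loader)
        va /= len(val_loader)
        print(f"Epoch {epoch}/{epochs} | Train Loss: {tr:.4f} | Val Loss: {va:.4f}")
        if va < best:   # keep only the best-validation checkpoint
            best = va
            torch.save(model.state_dict(), ckpt_path)
    print(f"Training finished. Best validation loss: {best:.4f}")
```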
Inference Snapshot
After loading the best checkpoint, the model predicts density maps on sliced test scenes, merges the detections back to full-image coordinates, and visualizes the most crowded examples.
```
Loaded checkpoint with validation loss: 0.0002
Completed inference on 400 test images. Total detected heads: 1646
```
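A sketch of how per-patch density maps can become full-image head coordinates: local-maxima peak picking followed by a greedy point-level NMS across overlapping slices. The threshold, window size, output stride, and merge radius are illustrative assumptions, and the notebook's suppression step may differ:

```python
import numpy as np
from scipy.ndimage import maximum_filter

def density_to_points(density, threshold=0.05, window=7):
    """Local maxima of a density map above a threshold -> (x, y) points."""
    peaks = (density == maximum_filter(density, size=window)) \
            & (density > threshold)
    ys, xs = np.nonzero(peaks)
    return np.stack([xs, ys], axis=1).astype(float)

def merge_patch_points(patch_points, offsets, stride=8, min_dist=8.0):
    """Shift patch-level points by their slice offsets, then suppress
    near-duplicates from overlapping slices (greedy NMS on points).
    `stride` undoes the network's output downsampling."""
    pts = []
    for p, (y0, x0) in zip(patch_points, offsets):
        if len(p):
            pts.append(p * stride + np.array([x0, y0], dtype=float))
    if not pts:
        return np.empty((0, 2))
    pts = np.concatenate(pts)
    kept = []
    for p in pts:
        if all(np.linalg.norm(p - q) > min_dist for q in kept):
            kept.append(p)
    return np.array(kept)
```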
The most crowded visualized scenes were images 192 (29 heads), 125 (28), 182 (27), 210 (27), and 52 (26).
Full Evaluation Snapshot
The final notebook also evaluates the method across the training image set to expose where counting remains easier than exact localization. That makes the notebook useful not only as a result artifact, but also as a debugging surface for recall improvements.
Aggregated evaluation on 1,100 images:

```
MAE: 13.6527
RMSE: 17.0979
Precision: 0.4239
Recall: 0.1006
F1-Score: 0.1625
```

The model captures many crowded regions but still under-recovers true head centers, leaving clear room for thresholding and localization refinement.
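One low-cost refinement hinted at above is to sweep the detection threshold and keep the value that maximizes F1 on a held-out split; a minimal sketch, reusing the hypothetical `density_to_points` and `localization_metrics` helpers from the earlier snippets:

```python
import numpy as np

def tune_threshold(density_maps, gt_points,
                   thresholds=np.linspace(0.01, 0.2, 20)):
    """Pick the density threshold with the best mean F1 on a held-out split."""
    best_t, best_f1 = None, -1.0
    for t in thresholds:
        f1s = []
        for dm, gt in zip(density_maps, gt_points):
            pred = density_to_points(dm, threshold=t)
            _, _, f1 = localization_metrics(pred, gt)
            f1s.append(f1)
        mean_f1 = float(np.mean(f1s))
        if mean_f1 > best_f1:   # keep the threshold with the best mean F1
            best_t, best_f1 = t, mean_f1
    return best_t, best_f1
```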