
Crowd Counting and Head Localization

Project Summary

This notebook addresses dense crowd analysis by predicting both the number of people and the head-center coordinates in each image. The pipeline combines image slicing for large scenes, density-map regression, non-maximum suppression, and explicit localization metrics.

Tech Stack

  • PyTorch
  • torchvision
  • SAHI
  • SciPy
  • Matplotlib

Key Results

  • Three-epoch training run reached a best validation loss of 0.0002
  • Inference generated predictions for 400 test images with 1,646 detected heads
  • The notebook includes both count-based error metrics and localization precision/recall/F1 evaluation

Problem Setup

The task is to detect every head center in crowded scenes, then evaluate both total count accuracy and point localization quality. Because full-size images are large and densely populated, the workflow first slices each scene into smaller overlapping patches and later merges predictions back into image-level coordinates.

In [12]
Output visualization

Evaluation Design

Counting quality is measured with error metrics such as MAE and RMSE, while localization quality is measured by matching predicted head centers to ground-truth points within a fixed pixel radius.

In [15]
Example evaluation on 3 images
MAE: 1.6667
RMSE: 2.8868
Precision: 0.9167
Recall: 0.6471
F1-Score: 0.7586
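
To make the matching criterion concrete, the sketch below computes the same family of metrics under stated assumptions: predictions and ground truth are per-image lists of (x, y) head centers, matching is one-to-one via SciPy's linear_sum_assignment, and the 10-pixel radius is an illustrative default rather than the notebook's exact setting.

import numpy as np
from scipy.optimize import linear_sum_assignment
from scipy.spatial.distance import cdist

def match_points(pred, gt, radius=10.0):
    """Count true positives via optimal one-to-one matching within a radius."""
    if len(pred) == 0 or len(gt) == 0:
        return 0
    cost = cdist(pred, gt)                    # pairwise pixel distances
    rows, cols = linear_sum_assignment(cost)  # Hungarian assignment
    return int((cost[rows, cols] <= radius).sum())

def evaluate(preds, gts, radius=10.0):
    """Aggregate count errors and localization metrics over a set of images."""
    count_errors, tp, n_pred, n_gt = [], 0, 0, 0
    for p, g in zip(preds, gts):
        count_errors.append(len(p) - len(g))
        tp += match_points(np.asarray(p), np.asarray(g), radius)
        n_pred += len(p)
        n_gt += len(g)
    errs = np.abs(count_errors)
    precision = tp / max(n_pred, 1)
    recall = tp / max(n_gt, 1)
    return dict(
        MAE=errs.mean(),
        RMSE=np.sqrt((errs ** 2).mean()),
        Precision=precision,
        Recall=recall,
        F1=2 * precision * recall / max(precision + recall, 1e-9),
    )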

Slice-Based Data Preparation

Large crowd images are cut into overlapping 512×512 patches before training. This makes dense scenes easier to batch on the GPU while preserving enough local context for head localization.

In [20]
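
For orientation, here is a minimal pure-NumPy version of the tiling logic. The notebook's stack includes SAHI, which provides slicing utilities; the 25% overlap below is an assumed value, and a final clamped offset keeps the image borders covered.

import numpy as np

def _offsets(size, patch, stride):
    # Regular grid of top-left offsets, plus one clamped offset so the
    # border row/column is always covered.
    offs = list(range(0, max(size - patch, 0) + 1, stride))
    if offs[-1] + patch < size:
        offs.append(max(size - patch, 0))
    return offs

def slice_image(img, patch=512, overlap=0.25):
    """Yield (patch_pixels, (x0, y0)) pairs covering img with overlapping tiles."""
    stride = int(patch * (1 - overlap))
    h, w = img.shape[:2]
    for y0 in _offsets(h, patch, stride):
        for x0 in _offsets(w, patch, stride):
            yield img[y0:y0 + patch, x0:x0 + patch], (x0, y0)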

Dataset and Model

The training dataset converts point annotations into density-map targets and pairs them with image patches; these feed a VGG16-based density estimation network with a frozen feature extractor and a custom convolutional backend.

In [23]
In [27]
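
A sketch of the two pieces described above, with assumed layer boundaries, kernel sizes, and Gaussian width: point annotations are blurred into a density map whose integral approximates the head count, and a frozen VGG16 trunk (here truncated at conv4_3) feeds a small convolutional regression head.

import numpy as np
import torch.nn as nn
from torchvision.models import vgg16, VGG16_Weights
from scipy.ndimage import gaussian_filter

def points_to_density(points, shape, sigma=4.0):
    """Place a unit impulse at each head center, then blur; the sum stays ~count."""
    density = np.zeros(shape, dtype=np.float32)
    for x, y in points:
        density[min(int(y), shape[0] - 1), min(int(x), shape[1] - 1)] += 1.0
    return gaussian_filter(density, sigma)

class DensityNet(nn.Module):
    def __init__(self):
        super().__init__()
        # VGG16 features up to conv4_3 (512 channels, stride 8), kept frozen.
        self.features = vgg16(weights=VGG16_Weights.DEFAULT).features[:23]
        for p in self.features.parameters():
            p.requires_grad = False
        # Custom convolutional backend regressing a single-channel density map.
        self.head = nn.Sequential(
            nn.Conv2d(512, 256, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(256, 128, 3, padding=1), nn.ReLU(inplace=True),
            nn.Conv2d(128, 1, 1),
        )

    def forward(self, x):
        return self.head(self.features(x))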

Training Run

The training loop below shows the compact experiment that was used to verify convergence before moving to inference and evaluation.

In [34]
Training on cuda

Epoch 1/3 | Train Loss: 0.0002 | Val Loss: 0.0002
Epoch 2/3 | Train Loss: 0.0002 | Val Loss: 0.0002
Epoch 3/3 | Train Loss: 0.0002 | Val Loss: 0.0002

Training finished.
Best validation loss: 0.0002
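
A compact loop consistent with the log above might look like the following sketch. The Adam optimizer, learning rate, and checkpoint path are assumptions, and the loaders are assumed to yield (image, density-target) pairs with targets already at the model's output resolution.

import torch

def train(model, train_loader, val_loader, epochs=3, device="cuda"):
    model.to(device)
    opt = torch.optim.Adam(
        filter(lambda p: p.requires_grad, model.parameters()), lr=1e-4)
    loss_fn = torch.nn.MSELoss()
    best = float("inf")
    for epoch in range(1, epochs + 1):
        model.train()
        train_loss = 0.0
        for imgs, targets in train_loader:
            imgs, targets = imgs.to(device), targets.to(device)
            opt.zero_grad()
            loss = loss_fn(model(imgs), targets)
            loss.backward()
            opt.step()
            train_loss += loss.item() * imgs.size(0)
        train_loss /= len(train_loader.dataset)

        model.eval()
        val_loss = 0.0
        with torch.no_grad():
            for imgs, targets in val_loader:
                imgs, targets = imgs.to(device), targets.to(device)
                val_loss += loss_fn(model(imgs), targets).item() * imgs.size(0)
        val_loss /= len(val_loader.dataset)

        print(f"Epoch {epoch}/{epochs} | Train Loss: {train_loss:.4f} | Val Loss: {val_loss:.4f}")
        if val_loss < best:  # checkpoint only on improvement
            best = val_loss
            torch.save(model.state_dict(), "best_model.pt")
    print(f"Best validation loss: {best:.4f}")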

Inference Snapshot

After loading the best checkpoint, the model predicts density maps on sliced test scenes, merges the detections back to full-image coordinates, and visualizes the most crowded examples.

In [41]
Loaded checkpoint with validation loss: 0.0002
Completed inference on 400 test images.
Total detected heads: 1646
In [42]
Top visualized detections: 192 (29 heads), 125 (28), 182 (27), 210 (27), 52 (26).
Output visualization
Output visualization
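
The merge step can be sketched as local-maximum extraction plus a distance-based suppression pass over the combined detections; the density threshold and minimum peak distance below are assumed values, not the notebook's tuned settings.

import numpy as np
from scipy.ndimage import maximum_filter

def density_peaks(density, threshold=0.05, min_dist=8):
    """Return (x, y) peaks where the density map is a local maximum above threshold."""
    local_max = maximum_filter(density, size=min_dist) == density
    ys, xs = np.nonzero(local_max & (density > threshold))
    return np.stack([xs, ys], axis=1)

def merge_detections(per_patch, min_dist=8):
    """Shift patch-local peaks to image coordinates, then drop near-duplicates."""
    points = []
    for peaks, (x0, y0) in per_patch:
        points.extend((x + x0, y + y0) for x, y in peaks)
    kept = []
    for p in points:  # greedy distance-based suppression across patch overlaps
        if all((p[0] - q[0]) ** 2 + (p[1] - q[1]) ** 2 >= min_dist ** 2 for q in kept):
            kept.append(p)
    return kept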

Full Evaluation Snapshot

The notebook also evaluates the method across the full 1,100-image training set to expose where counting remains easier than exact localization. That makes it useful not only as a results artifact but also as a debugging surface for recall improvements.

In [43]
Aggregated evaluation on 1,100 images
MAE: 13.6527
RMSE: 17.0979
Precision: 0.4239
Recall: 0.1006
F1-Score: 0.1625

The model captures many crowded regions but recovers only a small fraction of true head centers, leaving clear room for threshold tuning and localization refinement.
Output visualization