This project develops and evaluates two neural network models for recognizing sequences of digits in images, typically house numbers taken from street view imagery. The task matters for applications such as automated postal mail sorting, navigation systems, and urban planning.
The data is provided in '.mat' format. Its annotations include the filename, the digit labels, and a bounding box for each digit in every image. To make the data easier to access during training, the annotations are converted from the '.mat' file to a structured CSV file with columns for filenames, individual digits, and their corresponding bounding boxes.
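The conversion step might look like the following minimal sketch. It assumes the annotations have already been read from the '.mat' file into plain Python dictionaries; the function name `annotations_to_csv`, the field names (`filename`, `boxes`, `label`, `left`, `top`, `width`, `height`), and the example record are all hypothetical and may not match the project's actual layout.

```python
import csv
import io
import json


def annotations_to_csv(records, out):
    """Write one CSV row per image; list-valued fields are JSON-encoded."""
    writer = csv.writer(out)
    writer.writerow(["filename", "digits", "boxes"])
    for rec in records:
        digits = [int(b["label"]) for b in rec["boxes"]]
        # Each box is stored as [left, top, width, height].
        boxes = [[b["left"], b["top"], b["width"], b["height"]] for b in rec["boxes"]]
        writer.writerow([rec["filename"], json.dumps(digits), json.dumps(boxes)])


# Hypothetical hand-written record standing in for one parsed annotation entry.
example_records = [
    {
        "filename": "1.png",
        "boxes": [
            {"label": 1, "left": 246, "top": 77, "width": 81, "height": 219},
            {"label": 9, "left": 323, "top": 81, "width": 96, "height": 219},
        ],
    },
]

buffer = io.StringIO()  # stands in for an open file handle
annotations_to_csv(example_records, buffer)
csv_text = buffer.getvalue()
```

Keeping one row per image and encoding the digit and box lists as JSON strings is only one possible layout; a row per digit would work equally well.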
To help the models generalize and to improve robustness, the following data augmentation and preprocessing techniques are applied:
- Random Rotation: Rotating images within a specified angle range to mimic variations in orientation.
- Color Perturbation: Adjusting the color channels to simulate different lighting conditions.
- Cropping: Employing random cropping based on the bounding boxes to focus on relevant image parts.
- Resizing: Standardizing images to a uniform dimension to maintain consistent input data shape for the neural network.
- Grayscale Conversion: Converting images to grayscale to reduce computational complexity and to emphasize structure over color.
- Normalization: Normalizing images to have zero mean and unit variance to help in faster convergence during training.
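Several of the steps above can be sketched with NumPy alone. This is an illustrative sketch, not the project's pipeline: the function names, the parameter values, the order of the steps, and the synthetic test image are assumptions, and rotation is left out because it is usually delegated to an image library.

```python
import numpy as np


def random_crop_around_box(img, box, max_expand, rng):
    """Crop a region containing box = (left, top, width, height),
    padded by a random margin on each side and clipped to the image."""
    left, top, width, height = box
    h, w = img.shape[:2]
    m = [int(v) for v in rng.integers(0, max_expand + 1, size=4)]
    x0, y0 = max(0, left - m[0]), max(0, top - m[1])
    x1, y1 = min(w, left + width + m[2]), min(h, top + height + m[3])
    return img[y0:y1, x0:x1]


def perturb_color(img, strength, rng):
    """Scale each RGB channel by a random factor in [1 - strength, 1 + strength]."""
    scale = rng.uniform(1 - strength, 1 + strength, size=3)
    return np.clip(img.astype(np.float64) * scale, 0, 255)


def to_grayscale(img):
    """RGB (H, W, 3) -> grayscale (H, W) using ITU-R BT.601 luma weights."""
    return img.astype(np.float64) @ np.array([0.299, 0.587, 0.114])


def resize_nearest(img, out_h, out_w):
    """Nearest-neighbour resize to (out_h, out_w)."""
    h, w = img.shape[:2]
    rows = np.arange(out_h) * h // out_h
    cols = np.arange(out_w) * w // out_w
    return img[rows[:, None], cols]


def normalize(img, eps=1e-8):
    """Shift and scale to zero mean and (approximately) unit variance."""
    img = img.astype(np.float64)
    return (img - img.mean()) / (img.std() + eps)


rng = np.random.default_rng(0)
image = rng.integers(0, 256, size=(64, 128, 3)).astype(np.uint8)  # synthetic image
box = (40, 10, 30, 40)  # hypothetical (left, top, width, height)

patch = random_crop_around_box(image, box, max_expand=5, rng=rng)
patch = perturb_color(patch, strength=0.2, rng=rng)
patch = resize_nearest(to_grayscale(patch), 32, 32)
patch = normalize(patch)
```

The random steps (cropping margin, color scaling) would normally run only at training time, while resizing, grayscale conversion, and normalization apply to every image.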
The preprocess_digits_column function converts the digit annotations into lists within the DataFrame, which simplifies training and evaluation. Images are then preprocessed by cropping to the bounding boxes, resizing to uniform dimensions, and normalizing, so that model inputs are consistent.
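The body of preprocess_digits_column is not shown here, so the following is only a guess at the kind of per-cell parsing it might perform. The helper name `parse_digits` and the assumed cell format (a stringified list such as "[1, 9]") are hypothetical.

```python
import ast


def parse_digits(cell):
    """Turn a CSV cell such as "[1, 9]" into a list of ints."""
    if isinstance(cell, (list, tuple)):
        # Already a sequence: just coerce the entries to int.
        return [int(d) for d in cell]
    value = ast.literal_eval(str(cell))
    if isinstance(value, (int, float)):
        # A single digit stored without brackets.
        value = [value]
    return [int(d) for d in value]


parsed = [parse_digits(c) for c in ["[1, 9]", "7", "[2.0, 3.0, 5.0]"]]
```

With pandas, a helper like this could be applied to a whole column, for example `df["digits"].apply(parse_digits)`, where the column name `digits` is again an assumption.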
Two distinct models are implemented and compared:
- Convolutional Neural Network (CNN): This model learns spatial hierarchies of features through its convolutional layers: early layers capture edges, and deeper layers capture more complex shapes and eventually digit-like features. The CNN does not process its input as a sequence; it learns a collection of features that it associates with individual digits. This approach works well when the digits are well segmented and the main task is to identify each digit regardless of its position in the sequence.
- Convolutional Recurrent Neural Network with CTC Loss (CRNN): Unlike the purely spatial CNN, the CRNN adds a sequential component for inputs where the order of elements (digits) matters. After the convolutional layers extract features, recurrent layers (such as LSTMs or GRUs) process them in a sequence-aware manner. The model is trained with Connectionist Temporal Classification (CTC) loss, which aligns the per-step outputs with the target digit sequence without requiring per-digit position labels, so it can handle sequences where the spacing or size of elements varies. This setup is suited to closely packed or overlapping digits, because it can learn the context of each digit relative to its neighbors.
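To make the CTC output format concrete, here is a minimal sketch of greedy (best-path) CTC decoding: take the highest-scoring class at each time step, collapse consecutive repeats, then drop the blank symbol. Greedy decoding is a common inference choice but not necessarily the one used in this project, and the class layout (digits 0-9 plus a blank at index 10) is an assumption.

```python
def ctc_greedy_decode(frame_scores, blank):
    """Best-path CTC decoding over a list of per-time-step score vectors."""
    # Pick the arg-max class for every time step.
    best_path = [
        max(range(len(scores)), key=scores.__getitem__) for scores in frame_scores
    ]
    decoded = []
    previous = None
    for label in best_path:
        # Keep a label only if it is not blank and differs from the previous step.
        if label != blank and label != previous:
            decoded.append(label)
        previous = label
    return decoded


def one_hot(index, num_classes=11):
    """Toy score vector that puts all mass on one class."""
    return [1.0 if i == index else 0.0 for i in range(num_classes)]


BLANK = 10  # assumed blank index; classes 0-9 are the digits
frames = [one_hot(i) for i in [2, 2, BLANK, 2, 5, 5, BLANK]]
decoded = ctc_greedy_decode(frames, blank=BLANK)  # -> [2, 2, 5]
```

The blank between the second and third frames is what lets the decoder emit the digit 2 twice; without it, the repeated 2s would collapse into one.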
- CNN Model Performance:
  - Individual accuracy: 92.82%
  - Sequence prediction accuracy: 76.33%
  - Coverage: 93.79%
- CRNN Model Performance:
  - Individual accuracy: 89.13%
  - Sequence prediction accuracy: 64.03%
  - Coverage: 89.13%
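The exact metric definitions are not spelled out here, so the sketch below shows one plausible reading of the first two: sequence accuracy as the fraction of images whose predicted digit list matches the target exactly, and individual accuracy as the fraction of target digits predicted correctly at the same position. Both definitions, the function names, and the toy data are assumptions; coverage is omitted because its definition is not given.

```python
def sequence_accuracy(predictions, targets):
    """Fraction of images whose full digit sequence is predicted exactly."""
    exact = sum(1 for p, t in zip(predictions, targets) if list(p) == list(t))
    return exact / len(targets)


def digit_accuracy(predictions, targets):
    """Fraction of target digits matched at the same position."""
    correct = 0
    total = 0
    for p, t in zip(predictions, targets):
        total += len(t)
        # zip stops at the shorter sequence, so missing digits count as errors.
        correct += sum(1 for a, b in zip(p, t) if a == b)
    return correct / total


# Toy predictions and targets, not taken from the project's data.
toy_predictions = [[1, 9], [2, 3, 4], [7]]
toy_targets = [[1, 9], [2, 3, 5], [7, 1]]

seq_acc = sequence_accuracy(toy_predictions, toy_targets)  # 1 of 3 sequences exact
dig_acc = digit_accuracy(toy_predictions, toy_targets)     # 5 of 7 digits correct
```

Because a single wrong digit invalidates the whole sequence, sequence accuracy is expected to sit well below per-digit accuracy, as it does for both models.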
The CNN model achieves higher accuracy on both individual digits and full sequences, which suggests it is the more robust choice when digits can be segmented and recognized precisely. The CRNN model trails by about 3.7 percentage points on individual digits (89.13% vs. 92.82%) and about 12.3 points on sequences (64.03% vs. 76.33%), but it remains applicable in settings where digit segmentation is not possible.