This project is an implementation of the "Siamese Neural Networks for One-shot Image Recognition" paper from scratch. Instead of using the Omniglot dataset, I created a custom dataset consisting of images of myself (collected via webcam) and celebrities (randomly downloaded from the internet).
- Project Overview
- Installation
- Data Preprocessing
- Model Architecture
- Distance Layer
- Key Formula
- Checkpoints
- Training
- Evaluation
- Testing
- Issues
The images are stored in three directories:
- Anchor: Images of myself, captured via webcam.
- Positive: Similar images, also captured via webcam, which serve as the ground truth.
- Negative: Dissimilar images of celebrities, collected randomly from the internet.
This structure is used to simulate a one-shot learning scenario where the model learns to differentiate between similar and dissimilar images.
To run this project, install the following packages:
git clone https://github.com/codeflamer/POSE-ESTIMATION.git
pip install tensorflow==2.10.1 matplotlib opencv-python numpy
As described in the overview, the dataset is structured into three folders: Anchor (webcam images of myself), Positive (additional webcam images of myself, serving as the ground truth for similarity), and Negative (celebrity images downloaded from the internet, representing dissimilar images).
1. Loading Images:
   - Images from the `Anchor`, `Positive`, and `Negative` directories were loaded using TensorFlow's `tf.data.Dataset` API.

2. Decoding and Resizing:
   - Each image was decoded from its raw format, resized to the shape `(105, 105, 3)` to ensure uniformity across the dataset, and rescaled to the range `[0, 1]` for efficient training.

3. Creating Triplets:
   - The datasets were paired to create triplet datasets:
     - `dataset(anchor, positive, label=1)` – pairs of similar images.
     - `dataset(anchor, negative, label=0)` – pairs of dissimilar images.

4. Batching and Prefetching:
   - The datasets were cached for efficient loading, shuffled, batched, and prefetched to improve performance during training.

The resulting datasets have the form:
- Anchor: `[image1_anchor, image2_anchor, ...]`
- Positive: `[image1_positive, image2_positive, ...]`
- Negative: `[image1_negative, image2_negative, ...]`
By structuring and preprocessing the images in this way, the dataset was optimized for training the Siamese Neural Network.
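A minimal sketch of this pipeline, assuming JPEG input and a hypothetical directory layout of `data/anchor`, `data/positive`, and `data/negative` (the batch and shuffle-buffer sizes are also assumptions):

```python
import tensorflow as tf

def preprocess(file_path):
    """Decode a JPEG, resize to (105, 105, 3), and rescale to [0, 1]."""
    raw = tf.io.read_file(file_path)
    img = tf.io.decode_jpeg(raw, channels=3)
    img = tf.image.resize(img, (105, 105))
    return img / 255.0

anchor = tf.data.Dataset.list_files("data/anchor/*.jpg", shuffle=False).map(preprocess)
positive = tf.data.Dataset.list_files("data/positive/*.jpg", shuffle=False).map(preprocess)
negative = tf.data.Dataset.list_files("data/negative/*.jpg", shuffle=False).map(preprocess)

# Label (anchor, positive) pairs 1 and (anchor, negative) pairs 0, then combine.
ones = tf.data.Dataset.from_tensor_slices(tf.ones(170))    # 170 images per category
zeros = tf.data.Dataset.from_tensor_slices(tf.zeros(170))
positives = tf.data.Dataset.zip((anchor, positive, ones))
negatives = tf.data.Dataset.zip((anchor, negative, zeros))
data = positives.concatenate(negatives)

# Cache, shuffle, batch, and prefetch for training throughput.
data = data.cache().shuffle(buffer_size=1024).batch(16).prefetch(tf.data.AUTOTUNE)
```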
The model architecture follows the structure described in the original paper, Siamese Neural Networks for One-shot Image Recognition. The architecture was implemented using TensorFlow's Functional API for flexibility and robustness.
1. Input Layer:
   - Input shape: `(105, 105, 3)`, representing RGB images with dimensions 105x105 pixels.

2. Convolutional Block 1:
   - Convolutional Layer: 64 filters, kernel size `(10, 10)`, activation function: `ReLU`.
   - Max Pooling Layer: Pool size `(2, 2)`.

3. Convolutional Block 2:
   - Convolutional Layer: 128 filters, kernel size `(7, 7)`, activation function: `ReLU`.
   - Max Pooling Layer: Pool size `(2, 2)`.

4. Convolutional Block 3:
   - Convolutional Layer: 128 filters, kernel size `(4, 4)`, activation function: `ReLU`.
   - Max Pooling Layer: Pool size `(2, 2)`.

5. Convolutional Block 4:
   - Convolutional Layer: 256 filters, kernel size `(4, 4)`, activation function: `ReLU`.
   - Flatten Layer: Converts the 2D feature maps into a 1D vector.
Each of these blocks processes the input image, creating an embedding (a numerical representation) of the image. This process is performed by two identical networks, one for the `Anchor` image and one for the comparison image (`Positive` or `Negative`).
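A minimal sketch of this embedding network in TensorFlow's Functional API, following the blocks listed above (the final 4096-unit dense layer comes from the original paper and is an assumption here, since the block list stops at Flatten):

```python
from tensorflow.keras import layers, Model

def make_embedding() -> Model:
    """Embedding tower built from the four blocks listed above."""
    inp = layers.Input(shape=(105, 105, 3), name="input_image")

    x = layers.Conv2D(64, (10, 10), activation="relu")(inp)   # Block 1
    x = layers.MaxPooling2D((2, 2))(x)

    x = layers.Conv2D(128, (7, 7), activation="relu")(x)      # Block 2
    x = layers.MaxPooling2D((2, 2))(x)

    x = layers.Conv2D(128, (4, 4), activation="relu")(x)      # Block 3
    x = layers.MaxPooling2D((2, 2))(x)

    x = layers.Conv2D(256, (4, 4), activation="relu")(x)      # Block 4
    x = layers.Flatten()(x)

    # 4096-unit sigmoid dense layer, as in the original paper (assumption).
    out = layers.Dense(4096, activation="sigmoid")(x)

    return Model(inputs=inp, outputs=out, name="embedding")
```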
- The Distance Layer computes the absolute difference between the embeddings produced by the two networks.
- The final layer is a Dense Layer with a sigmoid activation function, which outputs the probability of similarity between the two images (with values between 0 and 1).
The core idea of the Siamese Neural Network is to compare the two images by computing the distance between their embeddings and determining whether they are similar or dissimilar based on that distance.
The Distance Layer is the key component that determines the similarity between two images by calculating the absolute difference between the embeddings generated by the two neural networks.
1. Embedding Comparison:
   - The two identical neural networks process the `Anchor` and `Positive`/`Negative` images, generating two embeddings (one from each network).
   - These embeddings are high-dimensional vectors that represent the features of the images.

2. Absolute Difference:
   - The Distance Layer calculates the absolute difference between the two embeddings, which represents the dissimilarity between the images.

3. Dense Layer:
   - The output from the Distance Layer is passed through a Dense Layer with a single unit and a Sigmoid Activation Function.
   - This layer converts the distance into a probability that indicates how similar the two images are:
     - A probability close to 1 means the images are similar.
     - A probability close to 0 means the images are dissimilar.
$$\text{Similarity Score} = \sigma\left(W \cdot \left|\text{embedding}_1 - \text{embedding}_2\right|\right)$$

Where:
- $W$ are the learned weights of the Dense Layer.
- $\sigma$ is the Sigmoid Activation Function.
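A minimal sketch of this distance layer and the full two-tower model, reusing the hypothetical `make_embedding` helper from the architecture sketch above:

```python
import tensorflow as tf
from tensorflow.keras import layers, Model

class L1Dist(layers.Layer):
    """Distance Layer: element-wise absolute difference of two embeddings."""
    def call(self, anchor_embedding, comparison_embedding):
        return tf.math.abs(anchor_embedding - comparison_embedding)

def make_siamese(embedding: Model) -> Model:
    """Two shared-weight towers -> L1 distance -> sigmoid similarity score."""
    anchor_in = layers.Input(shape=(105, 105, 3), name="anchor")
    compare_in = layers.Input(shape=(105, 105, 3), name="comparison")

    # The same embedding model processes both inputs, so weights are shared.
    distance = L1Dist()(embedding(anchor_in), embedding(compare_in))

    # Single-unit Dense layer with sigmoid: probability the images are similar.
    similarity = layers.Dense(1, activation="sigmoid")(distance)

    return Model(inputs=[anchor_in, compare_in], outputs=similarity, name="siamese")

siamese_model = make_siamese(make_embedding())
```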
During training, checkpoints were established to save the model at various stages. This ensures that in the event of any interruption, the training can resume from the last saved state.
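A sketch of how such checkpointing might look with `tf.train.Checkpoint` (the directory name and retention count are assumptions):

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam()  # Adam, per the Training section below

# Track the model and optimizer state together.
checkpoint = tf.train.Checkpoint(model=siamese_model, optimizer=optimizer)
manager = tf.train.CheckpointManager(checkpoint, "./training_checkpoints", max_to_keep=3)

# Inside the training loop, save periodically:
manager.save()

# In the event of an interruption, resume from the last saved state:
checkpoint.restore(manager.latest_checkpoint)
```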
- Loss Function: Binary Cross-Entropy.
- Optimizer: Adam Optimizer (Keras).
- Training Duration: 50 epochs, with 170 images per category (510 images in total).
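A minimal sketch of one training step consistent with these settings (the learning rate is an assumption; `data` is the preprocessed dataset from the sketch above):

```python
import tensorflow as tf

loss_fn = tf.keras.losses.BinaryCrossentropy()
optimizer = tf.keras.optimizers.Adam(learning_rate=1e-4)  # learning rate is an assumption

@tf.function
def train_step(anchor, comparison, label):
    """One gradient update on a batch of (anchor, comparison, label) examples."""
    with tf.GradientTape() as tape:
        pred = siamese_model([anchor, comparison], training=True)
        loss = loss_fn(label, pred)
    grads = tape.gradient(loss, siamese_model.trainable_variables)
    optimizer.apply_gradients(zip(grads, siamese_model.trainable_variables))
    return loss

for epoch in range(50):  # 50 epochs, as noted above
    for anchor, comparison, label in data:
        train_step(anchor, comparison, label)
```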
- Metrics: Recall and Precision.
- The model achieved precision and recall scores close to 1 on the evaluation data, indicating strong performance.
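A sketch of how Recall and Precision might be computed with Keras metrics (`test_data` is a hypothetical held-out split in the same format as the training data; both metrics threshold predictions at 0.5 by default):

```python
from tensorflow.keras.metrics import Precision, Recall

precision, recall = Precision(), Recall()

for anchor, comparison, label in test_data:  # hypothetical held-out split
    pred = siamese_model([anchor, comparison], training=False)
    precision.update_state(label, pred)
    recall.update_state(label, pred)

print("Precision:", precision.result().numpy())
print("Recall:", recall.result().numpy())
```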
For testing, images from the test dataset were passed into the model. For example, a picture of myself and a picture of Lionel Messi were fed into the model, which correctly predicted them as dissimilar (output = 0).
Two pictures of myself were fed into the model, which correctly predicted them as similar (output = 1).
However, a picture of myself and one of me wearing a headset was fed into the model, which incorrectly predicted us as dissimilar (output = 0).
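A sketch of how a single comparison like these might be run (the file paths are placeholders, and `preprocess` is the hypothetical helper from the preprocessing sketch):

```python
import tensorflow as tf

a = preprocess("data/test/me.jpg")      # placeholder paths
b = preprocess("data/test/messi.jpg")

# Add a batch dimension and score the pair.
prob = siamese_model([tf.expand_dims(a, 0), tf.expand_dims(b, 0)], training=False)
print("similar" if float(prob[0][0]) > 0.5 else "dissimilar")
```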
Due to the small number of training epochs and the small dataset with little variation, the model does not learn all the necessary features and sometimes fails to recognize me when the lighting is bad or when I have my headset on (haha).