The purpose of this project is to create a Deep Learning Model that is able to differentiate between healthy cells and cells infected with malaria and classify them correctly. The aim is to achieve a very high sensitivity and specifiity on the test data (> 90%).
One way to create such a model is by constructing a Convolutional Neural Network
using Keras
and Tensorflow
.
The data used to train the model comes from US'National Institues of Health and will be imported into the project from Tensorflow-Datasets
. This dataset has exactly 27,558
images of parasitized and uninfected cells from the thin blood smear slide images of segmented cells. The data was collected from 150 patients that were infected with Malaria and 50 healthy patients. The infected cells contain parasites called Plasmodium Falciparum
that are responsible for causing Malaria.
There are equal number of images for both the classes. The data was split into training
, validation
and test
sets using a 80-10-10 split. All sets include almost equal instances of the two classes.
A summary of the model is provided below:
Model: "Malaria Detection"
_________________________________________________________________
Layer (type) Output Shape Param #
=================================================================
resizing (Resizing) (None, 300, 300, 3) 0
conv2d (Conv2D) (None, 298, 298, 16) 448
max_pooling2d (MaxPoolin (None, 149, 149, 16) 0
g2D)
conv2d (Conv2D) (None, 147, 147, 16) 2320
max_pooling2d (MaxPoolin (None, 73, 73, 16) 0
g2D)
conv2d (Conv2D) (None, 69, 69, 32) 12832
max_pooling2d (MaxPoolin (None, 34, 34, 32) 0
g2D)
conv2d (Conv2D) (None, 28, 28, 64) 100416
max_pooling2d (MaxPoolin (None, 14, 14, 64) 0
g2D)
conv2d (Conv2D) (None, 8, 8, 64) 200768
max_pooling2d (MaxPoolin (None, 4, 4, 64) 0
g2D)
flatten (Flatten) (None, 1024) 0
dense (Dense) (None, 256) 262400
dense (Dense) (None, 1) 257
=================================================================
Total params: 579,441
Trainable params: 579,441
Non-trainable params: 0
_________________________________________________________________
In addition to the Convoltional layers and the Dense layers, a resizing layer was added to account for the discrete dimensions of images in the data. This can be fixed by preprocessing the data as well, but adding the Resizing
layer saves time required to do the preprocessing the data. It also works better with the data type that Tensorflow-Datasets
imports the data as.
The batch size for training the model was set to 16
. Due to hardware constraints, it is not possible to allocate memory for a bigger batch size during training. Using a smaller batch size can be considered, but it will increase the training time significantly.
Adam
optimizer was used with an initial Learning Rate of 0.001. Given that this is a Binary Classification problem, Binary Cross-Entropy
was used as the loss function. Other optimizers and loss functions can be tested out for changes in model performance.
The model achieved an accuracy of ~96% on the training set and ~93% on the test set. The model steadily increases an accuracy after a couple epochs and reaches its peak soon after. More epochs could be used in training but it will result in an overfit model that cannot generalise well. Due to memory constraints, it is also not possible to increase the complexity of the model.
- A web application that uses the model to detect parasites.
- Using a more complex model.