Here is a simple neural network architecture to recognise hand-written digits, based on the famous MNIST dataset.
The training images are 28 x 28 pixels, 784 pixels in total. Since the images are greyscale, each pixel takes a value from 0 to 255, where 0 is black and 255 is white.
This is a simple neural network with only 2 layers (the input layer does no computation, so it is not counted):
- Input Layer: 784 nodes, one per pixel
- Hidden Layer: 10 units, ReLU activation function
- Output Layer: 10 output units (each representing 1 digit), Softmax activation function
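Before training, the parameters need to be initialised. Below is a minimal NumPy sketch under a few stated assumptions: the data is already loaded as an array `X` of shape `(784, m)` with one flattened image per column, pixels are scaled from [0, 255] down to [0, 1] (a common preprocessing step, not mandated above), and the 0.01 weight scale is just a conventional default:

```python
import numpy as np

def init_params():
    # Shapes follow the 784 -> 10 -> 10 architecture described above.
    # Small random weights break symmetry; biases can start at zero.
    W1 = np.random.randn(10, 784) * 0.01
    b1 = np.zeros((10, 1))
    W2 = np.random.randn(10, 10) * 0.01
    b2 = np.zeros((10, 1))
    return W1, b1, W2, b2

# Assumed preprocessing: scale pixels from [0, 255] to [0, 1].
# X = X / 255.0
```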
Forward propagation proceeds as follows (a NumPy sketch of these steps follows the list):

- $A^{[0]} = X$ is our input layer; there is no processing here, it is just the 784 pixels.
- $Z^{[1]} = W^{[1]} A^{[0]} + b^{[1]}$ computes the linear transformation for the hidden layer, introducing the weights $W^{[1]}$ and biases $b^{[1]}$.
- $A^{[1]} = g(Z^{[1]}) = ReLU(Z^{[1]})$ applies the ReLU activation function to introduce non-linearity.
- $Z^{[2]} = W^{[2]} A^{[1]} + b^{[2]}$ computes the linear transformation for the output layer, applying the output layer's weights and biases to the hidden layer's activations.
- $A^{[2]} = softmax(Z^{[2]})$ applies the softmax activation function to produce class probabilities.
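A minimal sketch of this forward pass in NumPy, using the shapes from the initialisation sketch above; subtracting the column-wise max inside softmax is a standard numerical-stability trick rather than part of the equations themselves:

```python
def relu(Z):
    return np.maximum(Z, 0)

def softmax(Z):
    # Subtract the per-column max for numerical stability.
    expZ = np.exp(Z - Z.max(axis=0, keepdims=True))
    return expZ / expZ.sum(axis=0, keepdims=True)

def forward(W1, b1, W2, b2, X):
    Z1 = W1 @ X + b1   # Z[1] = W[1] A[0] + b[1]
    A1 = relu(Z1)      # A[1] = ReLU(Z[1])
    Z2 = W2 @ A1 + b2  # Z[2] = W[2] A[1] + b[2]
    A2 = softmax(Z2)   # A[2] = softmax(Z[2])
    return Z1, A1, Z2, A2
```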
Backpropagation then computes the gradients, working backwards from the output (see the sketch after this list). Here $Y$ is the one-hot encoded label matrix and $m$ is the number of training examples:

- $dZ^{[2]} = A^{[2]} - Y$ calculates the derivative of the cost with respect to $Z^{[2]}$.
- $dW^{[2]} = \frac{1}{m} dZ^{[2]} A^{[1]T}$ computes the gradient of the weights for layer 2.
- $db^{[2]} = \frac{1}{m} \sum dZ^{[2]}$ computes the gradient of the bias for layer 2.
- $dZ^{[1]} = W^{[2]T} dZ^{[2]} \odot g'(Z^{[1]})$ calculates the derivative of the cost with respect to $Z^{[1]}$ using the chain rule, where $\odot$ is the element-wise product.
- $dW^{[1]} = \frac{1}{m} dZ^{[1]} X^T$ computes the gradient of the weights for layer 1.
- $db^{[1]} = \frac{1}{m} \sum dZ^{[1]}$ calculates the gradient of the bias for layer 1.
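A matching sketch of the backward pass; `Y` is assumed to be one-hot encoded with shape `(10, m)`, and the $\frac{1}{m} \sum$ terms become sums across the m example columns:

```python
def relu_deriv(Z):
    # g'(Z) for ReLU: 1 where Z > 0, else 0.
    return (Z > 0).astype(float)

def backward(Z1, A1, A2, W2, X, Y):
    m = X.shape[1]
    dZ2 = A2 - Y                                    # dZ[2] = A[2] - Y
    dW2 = (1 / m) * dZ2 @ A1.T                      # dW[2]
    db2 = (1 / m) * dZ2.sum(axis=1, keepdims=True)  # db[2]
    dZ1 = (W2.T @ dZ2) * relu_deriv(Z1)             # chain rule, element-wise
    dW1 = (1 / m) * dZ1 @ X.T                       # dW[1]
    db1 = (1 / m) * dZ1.sum(axis=1, keepdims=True)  # db[1]
    return dW1, db1, dW2, db2
```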
Finally, gradient descent updates the parameters (a full training-loop sketch follows the list):

- $W^{[1]} = W^{[1]} - \alpha \, dW^{[1]}$ updates the weights for layer 1 using the learning rate $\alpha$ and the gradient $dW^{[1]}$.
- $b^{[1]} = b^{[1]} - \alpha \, db^{[1]}$ updates the bias for layer 1.
- $W^{[2]} = W^{[2]} - \alpha \, dW^{[2]}$ updates the weights for layer 2.
- $b^{[2]} = b^{[2]} - \alpha \, db^{[2]}$ updates the bias for layer 2.
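Tying the pieces together, here is a sketch of the update rule and a plain batch gradient-descent loop built on the functions above; the learning rate of 0.1 and the iteration count are illustrative guesses, not values taken from the text:

```python
def update(W1, b1, W2, b2, dW1, db1, dW2, db2, alpha):
    # One gradient-descent step: theta = theta - alpha * d_theta.
    W1 -= alpha * dW1
    b1 -= alpha * db1
    W2 -= alpha * dW2
    b2 -= alpha * db2
    return W1, b1, W2, b2

def gradient_descent(X, Y, alpha=0.1, iters=500):
    W1, b1, W2, b2 = init_params()
    for _ in range(iters):
        Z1, A1, Z2, A2 = forward(W1, b1, W2, b2, X)
        grads = backward(Z1, A1, A2, W2, X, Y)
        W1, b1, W2, b2 = update(W1, b1, W2, b2, *grads, alpha)
    return W1, b1, W2, b2
```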
Overall, this is a pretty simple architecture that achieved 90% accuracy. There is clear room for improvement: the current architecture is a plain 2-layer network (784 -> 10 -> 10), so we could consider adding more layers or increasing the number of neurons in the hidden layer to capture more complex patterns. Beyond that, adding batch normalization after the ReLU activation could improve training stability, and experimenting with other activation functions such as ELU or LeakyReLU in the hidden layer could also boost performance.
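As an illustration of that last suggestion, LeakyReLU is a near drop-in replacement for ReLU in the sketches above; only the activation and its derivative change. The 0.01 negative slope is the conventional default, not a tuned value:

```python
def leaky_relu(Z, slope=0.01):
    # Like ReLU, but keeps a small gradient for negative inputs,
    # which helps avoid "dead" hidden units.
    return np.where(Z > 0, Z, slope * Z)

def leaky_relu_deriv(Z, slope=0.01):
    # Used in place of relu_deriv during backpropagation.
    return np.where(Z > 0, 1.0, slope)
```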






