This is Part 1 of a two-part series of basic neural network posts, covering the mathematics involved, in particular how backpropagation works with one hidden layer and one output layer. Part 2 covers a vectorised implementation of the neural network in Python, based on the formulas covered in this post. If you are not fond of the mathematics behind neural networks, you can skip to Part 2.

## Illustration of a small neural network

Let us assume a neural network with one hidden layer and one output layer, with weights and biases initialised as shown above for illustration purposes. We have a data set with a batch size of 2; its two data points are X1 = (x1: 0.1, x2: 0.1) and X2 = (x1: 0.1, x2: 0.2). The ground-truth labels for X1 and X2 are 0 and 1, respectively. The learning rate is set to 0.1. We will follow this process step by step:

1. Forward function for a hidden layer;
2. Forward function for an output layer;
3. Quadratic loss function;
4. Backpropagation of an output layer;
5. Backpropagation of a hidden layer; and finally,
6. Adjustment of weights and biases: gradient descent.
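The setup above can be written down in Python with NumPy. The initial weight and bias values below are hypothetical placeholders, since the figure with the actual initialisation is not reproduced here; I also assume two hidden units and a single output unit, which matches the scalar labels:

```python
import numpy as np

# Batch of two data points, one row each: X1 = (0.1, 0.1), X2 = (0.1, 0.2)
X = np.array([[0.1, 0.1],
              [0.1, 0.2]])

# Ground-truth labels for X1 and X2
y = np.array([[0.0],
              [1.0]])

learning_rate = 0.1

# Hypothetical initial weights and biases (the actual initial values are
# in the figure above): two hidden units, one output unit.
W_hidden = np.array([[0.1, 0.2],
                     [0.3, 0.4]])   # shape (2 inputs, 2 hidden units)
b_hidden = np.array([0.1, 0.1])
W_output = np.array([[0.5],
                     [0.6]])        # shape (2 hidden units, 1 output unit)
b_output = np.array([0.1])
```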

This process can be loosely visualised as the red arrow below:

### 1. Forward function for a hidden layer

First, the following forward function needs to be calculated:

This results in the following hidden layer:

Then, a sigmoid function is applied to produce the result of the hidden layer:

### 2. Forward function for an output layer

Similarly, the following forward function is calculated for the output layer, based on the result from the hidden layer:

The final result of the output layer is then obtained through a sigmoid function again:

### 3. Quadratic loss function
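Before the loss is computed, the forward pass from the previous two sections can be sketched in NumPy. The weight values are hypothetical placeholders, not the ones from the figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Batch from the example; parameters are hypothetical placeholders.
X = np.array([[0.1, 0.1],
              [0.1, 0.2]])
W_hidden = np.array([[0.1, 0.2],
                     [0.3, 0.4]])
b_hidden = np.array([0.1, 0.1])
W_output = np.array([[0.5],
                     [0.6]])
b_output = np.array([0.1])

# Step 1: hidden-layer forward function z = XW + b, then the sigmoid
z_hidden = X @ W_hidden + b_hidden
a_hidden = sigmoid(z_hidden)

# Step 2: output-layer forward function on the hidden activations
z_output = a_hidden @ W_output + b_output
a_output = sigmoid(z_output)
```

Each row of `a_output` is the network's prediction for one data point, squashed into (0, 1) by the sigmoid.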

Based on the final result of the output layer, the loss is calculated. Here, a quadratic loss function is used.

### 4. Backpropagation of an output layer
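Backpropagation starts from the derivative of the loss. Assuming the common half-squared-error form of the quadratic loss, its derivative with respect to each output activation is simply the difference between prediction and label; a sketch with illustrative output values:

```python
import numpy as np

def quadratic_loss(a_output, y):
    # One common form of the quadratic loss: mean over the batch of 1/2 * (a - y)^2
    return np.mean(0.5 * (a_output - y) ** 2)

def quadratic_loss_derivative(a_output, y):
    # dL/da for the per-sample term 1/2 * (a - y)^2
    return a_output - y

a_output = np.array([[0.7], [0.8]])   # illustrative output activations
y = np.array([[0.0], [1.0]])          # ground-truth labels from the example
loss = quadratic_loss(a_output, y)    # mean of [0.245, 0.02] = 0.1325
```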

The loss calculated above is first broken down into the two output losses:

where:

Then, the chain rule gives the backward function:

### 5. Backpropagation of a hidden layer
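The hidden-layer backward pass reuses the output-layer error term, so here is a sketch of the output-layer chain rule first. All values are hypothetical stand-ins, since the actual weights come from the omitted figure:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Hypothetical values carried over from a forward pass like the one above.
a_hidden = np.array([[0.52, 0.55],
                     [0.53, 0.56]])   # hidden activations, one row per data point
W_output = np.array([[0.5],
                     [0.6]])
b_output = np.array([0.1])
y = np.array([[0.0],
              [1.0]])

z_output = a_hidden @ W_output + b_output
a_output = sigmoid(z_output)

# Chain rule for the output layer:
#   dL/dz_out = (dL/da_out) * (da_out/dz_out) = (a_out - y) * a_out * (1 - a_out)
delta_output = (a_output - y) * a_output * (1.0 - a_output)

# Gradients w.r.t. the output weights and bias, averaged over the batch:
#   dz_out/dW_out = a_hidden, so dL/dW_out = a_hidden^T @ delta_output
grad_W_output = a_hidden.T @ delta_output / len(y)
grad_b_output = delta_output.mean(axis=0)
```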

I find the backward pass for the hidden layer a bit heavier. Essentially, we need to reuse what was calculated previously for the derivative of an individual loss and of the output sigmoid function, which are highlighted in yellow below. For a more concrete example, I will use the weight W1 below:

If we calculate all weights and biases for the hidden layer in a similar manner, we get:

### 6. Adjustment of weights and biases: gradient descent
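The gradients needed for the update come from the hidden-layer backward pass just described: the output error term is pushed back through the output weights, then through the derivative of the hidden sigmoid. A sketch, again with hypothetical parameter values:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Same hypothetical setup as before (the actual initial values are in the figure).
X = np.array([[0.1, 0.1],
              [0.1, 0.2]])
y = np.array([[0.0],
              [1.0]])
W_hidden = np.array([[0.1, 0.2],
                     [0.3, 0.4]])
b_hidden = np.array([0.1, 0.1])
W_output = np.array([[0.5],
                     [0.6]])
b_output = np.array([0.1])

# Forward pass (steps 1 and 2)
a_hidden = sigmoid(X @ W_hidden + b_hidden)
a_output = sigmoid(a_hidden @ W_output + b_output)

# Output-layer error term (step 4)
delta_output = (a_output - y) * a_output * (1.0 - a_output)

# Hidden-layer error term (step 5): push the output error back through
# W_output, then through the derivative of the hidden sigmoid
delta_hidden = (delta_output @ W_output.T) * a_hidden * (1.0 - a_hidden)

# Gradients w.r.t. the hidden weights and biases, averaged over the batch
grad_W_hidden = X.T @ delta_hidden / len(y)
grad_b_hidden = delta_hidden.mean(axis=0)
```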

With a learning rate of 0.1, we can now perform the first gradient-descent step of our small neural network:

Applying this, we eventually get all the adjusted weights and biases as follows:

That’s it! This is just one batch gradient-descent step, performed with a batch of 2 data points. Now, it is time to implement this hairy process in code! You can see the implementation in Part 2. Otherwise, if you have any questions about the process and calculations, feel free to contact me.
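Putting all six steps together, one full batch gradient-descent step can be condensed into a short sketch. The initial parameter values are hypothetical placeholders; Part 2 contains the actual implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Data and learning rate from the example
X = np.array([[0.1, 0.1],
              [0.1, 0.2]])
y = np.array([[0.0],
              [1.0]])
lr = 0.1

# Hypothetical initial parameters (the actual initial values are in the figure)
W1 = np.array([[0.1, 0.2],
               [0.3, 0.4]])
b1 = np.array([0.1, 0.1])
W2 = np.array([[0.5],
               [0.6]])
b2 = np.array([0.1])

# Steps 1-2: forward pass
a1 = sigmoid(X @ W1 + b1)
a2 = sigmoid(a1 @ W2 + b2)

# Step 3: quadratic loss (one common form)
loss_before = np.mean(0.5 * (a2 - y) ** 2)

# Steps 4-5: backpropagation
d2 = (a2 - y) * a2 * (1.0 - a2)
d1 = (d2 @ W2.T) * a1 * (1.0 - a1)

# Step 6: gradient-descent update
n = len(y)
W2 -= lr * (a1.T @ d2) / n
b2 -= lr * d2.mean(axis=0)
W1 -= lr * (X.T @ d1) / n
b1 -= lr * d1.mean(axis=0)

# With a small learning rate, the loss should shrink after the update
a1_new = sigmoid(X @ W1 + b1)
a2_new = sigmoid(a1_new @ W2 + b2)
loss_after = np.mean(0.5 * (a2_new - y) ** 2)
```

Running many such steps in a loop is all that batch gradient descent amounts to; the vectorised version in Part 2 follows the same shape.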