This is Part 1 of a two-part series of basic neural network posts, covering the mathematics involved, in particular how backpropagation works with one hidden layer and one output layer. Part 2 presents a vectorised implementation of the neural network in Python based on the formulas covered in this post. If you are not fond of the mathematics behind neural networks, you can skip to Part 2.

## Illustration of a small neural network

Let us assume a neural network with one hidden layer and one output layer, with weights and biases initialised as above for illustration purposes. We have a data set with a batch size of 2; its two data points are X1 = (x1: 0.1, x2: 0.1) and X2 = (x1: 0.1, x2: 0.2). The ground truth labels for X1 and X2 are 0 and 1, respectively. The learning rate is set to 0.1. We will follow this process step by step:
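The setup above can be sketched in NumPy (Part 2 covers a full vectorised implementation). The data points, labels, and learning rate come from the post; the hidden-layer size (2 units, single sigmoid output) and the initial weight values below are illustrative placeholders, since the actual initial values are given in the figure:

```python
import numpy as np

# Batch of two data points, one row per sample.
X = np.array([[0.1, 0.1],   # X1
              [0.1, 0.2]])  # X2
y = np.array([[0.0],        # ground truth label for X1
              [1.0]])       # ground truth label for X2
lr = 0.1                    # learning rate

# Illustrative initial parameters (the actual values come from the figure);
# a hidden layer with 2 units and a single sigmoid output are assumed here.
W1 = np.array([[0.1, 0.2],
               [0.3, 0.4]])     # input -> hidden weights, shape (2, 2)
b1 = np.array([[0.1, 0.1]])     # hidden-layer biases
W2 = np.array([[0.5],
               [0.6]])          # hidden -> output weights, shape (2, 1)
b2 = np.array([[0.1]])          # output bias
```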

1. Forward function for a hidden layer;
2. Forward function for an output layer;
3. Calculation of the loss;
4. Backpropagation of an output layer; and finally,
5. Backpropagation of a hidden layer.

This process can be loosely visualised as a red arrow below:

### 1. Forward function for a hidden layer
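Both layers in this example use the sigmoid activation, so a small helper (a sketch; the function name is my own choice) is worth defining up front:

```python
import numpy as np

def sigmoid(z):
    """Element-wise logistic sigmoid: 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

# The hidden-layer forward function is then an affine map followed by the
# sigmoid, e.g. for a single unit with weights w1, w2 and bias b:
#   h = sigmoid(w1 * x1 + w2 * x2 + b)
print(sigmoid(np.array([0.0])))  # sigmoid(0) = 0.5
```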

First, the following forward function needs to be calculated for each hidden unit $j$:

$$z^{(1)}_j = w^{(1)}_{j1} x_1 + w^{(1)}_{j2} x_2 + b^{(1)}_j$$

This results in the pre-activation values of the hidden layer. Then, a sigmoid function is applied to produce the result of the hidden layer:

$$h_j = \sigma\left(z^{(1)}_j\right) = \frac{1}{1 + e^{-z^{(1)}_j}}$$

### 2. Forward function for an output layer

Similarly, the following forward function needs to be calculated for the output layer based on the result $h$ from the hidden layer:

$$z^{(2)} = \sum_j w^{(2)}_j h_j + b^{(2)}$$

The final result of the output layer is then calculated through a sigmoid function again:

$$\hat{y} = \sigma\left(z^{(2)}\right)$$

### 3. Calculation of the loss

Based on the final result of the output layer, the loss is calculated. Here, a quadratic loss function over the two data points is used:

$$L = \frac{1}{2}\sum_{i=1}^{2}\left(\hat{y}_i - y_i\right)^2$$

### 4. Backpropagation of an output layer

The loss calculated above is first broken down into the two per-sample losses:

$$L = L_1 + L_2, \qquad L_i = \frac{1}{2}\left(\hat{y}_i - y_i\right)^2$$

Then, the chain rule in the backward function for an output-layer weight $w^{(2)}_j$ is:

$$\frac{\partial L_i}{\partial w^{(2)}_j} = \frac{\partial L_i}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z^{(2)}_i} \cdot \frac{\partial z^{(2)}_i}{\partial w^{(2)}_j} = \left(\hat{y}_i - y_i\right) \cdot \hat{y}_i\left(1 - \hat{y}_i\right) \cdot h_{ij}$$

where $h_{ij}$ denotes the activation of hidden unit $j$ for data point $i$.

### 5. Backpropagation of a hidden layer

I find the backward process for a hidden layer a bit heavier. Essentially, we need to reuse what was calculated previously for the derivatives of an individual loss and of the output sigmoid function, which are highlighted in yellow below. For a more concrete example, take the weight W1 (here taken as the weight connecting the input $x_1$ to the first hidden unit); the chain rule simply extends one layer further:

$$\frac{\partial L_i}{\partial w^{(1)}_{11}} = \frac{\partial L_i}{\partial \hat{y}_i} \cdot \frac{\partial \hat{y}_i}{\partial z^{(2)}_i} \cdot \frac{\partial z^{(2)}_i}{\partial h_{i1}} \cdot \frac{\partial h_{i1}}{\partial z^{(1)}_{i1}} \cdot \frac{\partial z^{(1)}_{i1}}{\partial w^{(1)}_{11}} = \left(\hat{y}_i - y_i\right) \cdot \hat{y}_i\left(1 - \hat{y}_i\right) \cdot w^{(2)}_1 \cdot h_{i1}\left(1 - h_{i1}\right) \cdot x_{i1}$$

If we calculate all weights and biases for a hidden layer in a similar manner, we get the gradients of the loss with respect to every hidden-layer weight and bias.
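Putting the steps together, one full forward-and-backward pass over the two data points can be sketched as below. The data, labels, learning rate, and chain-rule expressions follow this post; the hidden-layer size and the initial parameter values are illustrative assumptions, since the actual initial values come from the figure above:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Data from the example: a batch of two points with labels 0 and 1.
X = np.array([[0.1, 0.1], [0.1, 0.2]])
y = np.array([[0.0], [1.0]])
lr = 0.1

# Illustrative initial parameters (actual values come from the figure).
W1 = np.array([[0.1, 0.2], [0.3, 0.4]])  # input -> hidden
b1 = np.array([[0.1, 0.1]])
W2 = np.array([[0.5], [0.6]])            # hidden -> output
b2 = np.array([[0.1]])

# Forward pass through the hidden and output layers.
z1 = X @ W1 + b1
h = sigmoid(z1)
z2 = h @ W2 + b2
y_hat = sigmoid(z2)

# Quadratic loss summed over the two data points.
loss = 0.5 * np.sum((y_hat - y) ** 2)

# Backpropagation of the output layer:
# dL/dz2 = (y_hat - y) * y_hat * (1 - y_hat), per sample.
delta2 = (y_hat - y) * y_hat * (1 - y_hat)
dW2 = h.T @ delta2
db2 = delta2.sum(axis=0, keepdims=True)

# Backpropagation of the hidden layer: extend the chain rule one layer
# further through W2 and the hidden sigmoid derivative h * (1 - h).
delta1 = (delta2 @ W2.T) * h * (1 - h)
dW1 = X.T @ delta1
db1 = delta1.sum(axis=0, keepdims=True)

# Gradient-descent update with the learning rate of 0.1.
W1 -= lr * dW1; b1 -= lr * db1
W2 -= lr * dW2; b2 -= lr * db2
```

Part 2 wraps this same computation into reusable, vectorised functions.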