When I come across a new mathematical concept, or before I use a canned software package, I like to replicate the calculations in order to get a deeper understanding of what is going on. This kind of first-principles, computation-based approach helped me greatly when I first came across material on artificial neural networks.
In this post, I go through a detailed example of one iteration of the backpropagation algorithm using full formulas from basic principles and actual values. The neural network I use has three input neurons, one hidden layer with two neurons, and an output layer with two neurons.
The following are the (very) high-level steps that I will take in this post. Details on each step follow below.
(1) Initialize weights for the parameters we want to train
(2) Forward propagate through the network to get the output values
(3) Define the error or cost function and its first derivatives
(4) Backpropagate through the network to determine the error derivatives
(5) Update the parameter estimates using the error derivative and the current value
Step 1
The input and target values for this problem are given in the diagram below, along with the weights I will initialize the network with. Generally, you would assign the weights randomly, but for illustration purposes I've chosen these particular numbers.
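Since the diagram carries the actual numbers, the sketch below simply sets up the same 3-2-2 network in Python with placeholder values; the real inputs, targets, and starting weights are whatever the diagram shows.

    # Placeholder values only: substitute the numbers from the diagram.
    x1, x2, x3 = 1.0, 4.0, 5.0                              # input values
    t1, t2 = 0.1, 0.05                                      # target values
    w1, w2, w3, w4, w5, w6 = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6   # input-to-hidden weights
    w7, w8, w9, w10 = 0.7, 0.8, 0.9, 0.1                    # hidden-to-output weights
    b1, b2 = 0.5, 0.5                                       # hidden- and output-layer biases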
Step 2
Mathematically, we have the following relationships between the nodes in the network. For the hidden and output layers, I will use the somewhat strange convention of writing zh1, zh2, zo1, and zo2 for the values before the activation function is applied, and h1, h2, o1, and o2 for the values after the activation function is applied.
Input to hidden layer
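Written out, and assuming that w1, w3, and w5 carry x1, x2, and x3 into h1 while w2, w4, and w6 carry them into h2 (the exact pairing follows the weight labels in the diagram), the relationships are

    zh1 = w1*x1 + w3*x2 + w5*x3 + b1        h1 = sigmoid(zh1)
    zh2 = w2*x1 + w4*x2 + w6*x3 + b1        h2 = sigmoid(zh2)

where sigmoid(z) = 1 / (1 + exp(-z)) is the activation function.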
Hidden layer to output layer
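Similarly, assuming w7 and w8 carry h1 into o1 and o2 respectively, while w9 and w10 carry h2 into o1 and o2,

    zo1 = w7*h1 + w9*h2 + b2        o1 = sigmoid(zo1)
    zo2 = w8*h1 + w10*h2 + b2       o2 = sigmoid(zo2)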
We can use the formulas above to forward propagate through the network and obtain zh1, zh2, h1, h2, zo1, zo2, o1, and o2. I've shown up to four decimal places below but maintained all decimals in the actual calculations.
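A minimal sketch of this forward pass in code, continuing the placeholder setup from Step 1 (so the numbers it prints are illustrative, not the ones from the post's diagram):

    import math

    def sigmoid(z):
        # logistic activation used throughout the network
        return 1.0 / (1.0 + math.exp(-z))

    # hidden layer (weight pairing as assumed above)
    zh1 = w1 * x1 + w3 * x2 + w5 * x3 + b1
    zh2 = w2 * x1 + w4 * x2 + w6 * x3 + b1
    h1, h2 = sigmoid(zh1), sigmoid(zh2)

    # output layer
    zo1 = w7 * h1 + w9 * h2 + b2
    zo2 = w8 * h1 + w10 * h2 + b2
    o1, o2 = sigmoid(zo1), sigmoid(zo2)

    print(round(h1, 4), round(h2, 4), round(o1, 4), round(o2, 4))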
Step 3
We now define the sum of squares error using the target values and the output values from the last layer of the forward propagation.
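With the notation above, and including the conventional factor of one half so that the derivatives come out cleanly, the error and its first derivatives with respect to the outputs are

    E = 0.5 * ((o1 - t1)^2 + (o2 - t2)^2)
    dE/do1 = o1 - t1
    dE/do2 = o2 - t2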
Step 4
We are now ready to backpropagate through the network to compute all the error derivatives with respect to the parameters. Note that although there will be many long formulas, we are not doing anything fancy here. We are just using the basic principles of calculus such as the chain rule.
First we go over some derivatives we will need in this step. The derivative of the sigmoid function is sigmoid'(z) = sigmoid(z) * (1 - sigmoid(z)), so, for example, do1/dzo1 = o1 * (1 - o1) and do2/dzo2 = o2 * (1 - o2). Also, given the expressions for zo1 and zo2 above, the partial derivative of each of them with respect to one of the weights w7, w8, w9, or w10 is just the hidden-layer value (h1 or h2) that the weight multiplies, and the partial derivative of each of them with respect to b2 is 1.
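In code, reusing the sigmoid function from Step 2:

    def sigmoid_prime(z):
        # derivative of the sigmoid, written in terms of the sigmoid itself
        s = sigmoid(z)
        return s * (1.0 - s)

    # local derivatives of the output-layer pre-activations (pairing as assumed above):
    # dzo1/dw7 = h1, dzo1/dw9 = h2, dzo1/db2 = 1
    # dzo2/dw8 = h1, dzo2/dw10 = h2, dzo2/db2 = 1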
We are now ready to calculate dE/dw7, dE/dw8, dE/dw9, and dE/dw10 using the derivatives we have already discussed, starting with dE/dw7.
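Under the weight pairing assumed above, the chain rule for this first derivative reads

    dE/dw7 = dE/do1 * do1/dzo1 * dzo1/dw7
           = (o1 - t1) * o1 * (1 - o1) * h1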
I will omit the details on the next three computations since they are very similar to the one above. Feel free to leave a comment if you are unable to replicate the numbers below.
The error derivative with respect to b2 is a little bit more involved, since a change to b2 affects the error through both o1 and o2.
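Writing out both paths explicitly,

    dE/db2 = dE/do1 * do1/dzo1 * dzo1/db2 + dE/do2 * do2/dzo2 * dzo2/db2
           = (o1 - t1) * o1 * (1 - o1) + (o2 - t2) * o2 * (1 - o2)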
To summarize, we have computed numerical values for the error derivatives with respect to w7, w8, w9, w10, and b2. We will now backpropagate one layer to compute the error derivatives of the parameters connecting the input layer to the hidden layer. These error derivatives are dE/dw1, dE/dw2, dE/dw3, dE/dw4, dE/dw5, dE/dw6, and dE/db1.
I will calculate dE/dw1, dE/dw3, and dE/dw5 first (under the pairing above, these are the weights feeding into h1), since they all flow through the h1 node; for each of these weights w the chain rule takes the form dE/dw = dE/dh1 * dh1/dzh1 * dzh1/dw. The calculation of the first factor, dE/dh1, is a bit more involved than the previous calculations, since h1 affects the error through both o1 and o2.
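Tracing both of those paths, and again using the assumed pairing of the hidden-to-output weights,

    dE/dh1 = dE/do1 * do1/dzo1 * dzo1/dh1 + dE/do2 * do2/dzo2 * dzo2/dh1
           = (o1 - t1) * o1 * (1 - o1) * w7 + (o2 - t2) * o2 * (1 - o2) * w8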
Now I will proceed with the numerical values for the error derivatives above. These have either already been calculated or are similar in style to those calculated above; if anything is unclear, please leave a comment. Plugging the values into the formula above gives the numerical value of dE/dw1.
The calculations for dE/dw3 and dE/dw5 proceed in exactly the same way, with x2 and x3 in place of x1.
I will now calculate dE/dw2, dE/dw4, and dE/dw6 (the weights feeding into h2 under the pairing above), since they all flow through the h2 node; the chain rule has the same form as before, with h2 and zh2 in place of h1 and zh1. The calculation of the first factor, dE/dh2, is again a bit more involved, since h2 affects the error through both o1 and o2.
Plugging the corresponding values into the formula gives the numerical value of dE/dw2.
The calculations for dE/dw4 and dE/dw6 proceed in exactly the same way.
The final error derivative we have to calculate is dE/db1, which is done next.
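Assuming b1 is a single bias shared by both hidden nodes, it influences the error through both h1 and h2, so

    dE/db1 = dE/dh1 * dh1/dzh1 * dzh1/db1 + dE/dh2 * dh2/dzh2 * dzh2/db1
           = dE/dh1 * h1 * (1 - h1) + dE/dh2 * h2 * (1 - h2)

with dE/dh1 and dE/dh2 computed as above.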
We now have all the error derivatives, and we're ready to make the parameter updates after this first iteration of backpropagation. We will use a single fixed learning rate for all of the parameters.
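Each parameter is moved a small step against its own error derivative:

    new value = old value - (learning rate) * (error derivative)

so, for example, w7 becomes w7 - alpha * dE/dw7, where alpha denotes the learning rate.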
So what do we do now? We repeat this process over and over, many times, until the error goes down and the parameter estimates stabilize or converge to some values. We obviously won't be going through all of these calculations manually. I've provided Python code below that codifies the calculations above. Nowadays we wouldn't do any of this manually, but would instead use a machine learning package that is already readily available.
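A minimal, self-contained sketch of such a script is below. It runs the forward pass, the backward pass, and the parameter updates derived above in a loop, using the placeholder inputs, targets, starting weights, and learning rate assumed earlier; substitute the values from the diagram to reproduce the numbers in the post.

    import math

    def sigmoid(z):
        return 1.0 / (1.0 + math.exp(-z))

    # Placeholder inputs, targets, and starting parameters (assumed, as above).
    x1, x2, x3 = 1.0, 4.0, 5.0
    t1, t2 = 0.1, 0.05
    w1, w2, w3, w4, w5, w6 = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6
    w7, w8, w9, w10 = 0.7, 0.8, 0.9, 0.1
    b1, b2 = 0.5, 0.5
    alpha = 0.01  # assumed learning rate

    for i in range(10000):
        # forward pass (weight pairing as assumed in Step 2)
        zh1 = w1 * x1 + w3 * x2 + w5 * x3 + b1
        zh2 = w2 * x1 + w4 * x2 + w6 * x3 + b1
        h1, h2 = sigmoid(zh1), sigmoid(zh2)
        zo1 = w7 * h1 + w9 * h2 + b2
        zo2 = w8 * h1 + w10 * h2 + b2
        o1, o2 = sigmoid(zo1), sigmoid(zo2)

        # sum of squares error
        E = 0.5 * ((o1 - t1) ** 2 + (o2 - t2) ** 2)

        # backward pass: derivatives of E with respect to the pre-activations
        d_zo1 = (o1 - t1) * o1 * (1 - o1)     # dE/dzo1
        d_zo2 = (o2 - t2) * o2 * (1 - o2)     # dE/dzo2
        d_h1 = d_zo1 * w7 + d_zo2 * w8        # dE/dh1
        d_h2 = d_zo1 * w9 + d_zo2 * w10       # dE/dh2
        d_zh1 = d_h1 * h1 * (1 - h1)          # dE/dzh1
        d_zh2 = d_h2 * h2 * (1 - h2)          # dE/dzh2

        # error derivatives for every parameter
        d_w7, d_w9, d_b2 = d_zo1 * h1, d_zo1 * h2, d_zo1 + d_zo2
        d_w8, d_w10 = d_zo2 * h1, d_zo2 * h2
        d_w1, d_w3, d_w5 = d_zh1 * x1, d_zh1 * x2, d_zh1 * x3
        d_w2, d_w4, d_w6 = d_zh2 * x1, d_zh2 * x2, d_zh2 * x3
        d_b1 = d_zh1 + d_zh2

        # gradient descent updates
        w1, w3, w5 = w1 - alpha * d_w1, w3 - alpha * d_w3, w5 - alpha * d_w5
        w2, w4, w6 = w2 - alpha * d_w2, w4 - alpha * d_w4, w6 - alpha * d_w6
        w7, w8 = w7 - alpha * d_w7, w8 - alpha * d_w8
        w9, w10 = w9 - alpha * d_w9, w10 - alpha * d_w10
        b1, b2 = b1 - alpha * d_b1, b2 - alpha * d_b2

        if i % 1000 == 0:
            print(i, round(E, 6))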
I ran 10,000 iterations, and the sum of squares error drops significantly after the first thousand or so iterations.