Backpropagation Example With Numbers Step by Step

When I come across a new mathematical concept, or before I use a canned software package, I like to replicate the calculations myself to get a deeper understanding of what is going on. This computation-based approach from first principles helped me greatly when I first came across material on artificial neural networks.

In this post, I go through a detailed example of one iteration of the backpropagation algorithm using full formulas from basic principles and actual values. The neural network I use has three input neurons, one hidden layer with two neurons, and an output layer with two neurons.

The following are the (very) high-level steps that I will take in this post. Details on each step follow.

(1) Initialize weights for the parameters we want to train

(2) Forward propagate through the network to get the output values

(3) Define the error or cost function and its first derivatives

(4) Backpropagate through the network to determine the error derivatives

(5) Update the parameter estimates using the error derivative and the current value


Step 1

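The input and target values for this problem are x_1=1, x_2=4, x_3=5 and t_1=0.1, t_2=0.05. I will initialize the weights as shown in the diagram below: w_1=0.1, w_2=0.2, w_3=0.3, w_4=0.4, w_5=0.5, w_6=0.6, w_7=0.7, w_8=0.8, w_9=0.9, and w_{10}=0.1, with biases b_1=0.5 and b_2=0.5. Generally, you would assign them randomly, but for illustration purposes I’ve chosen these numbers.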
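As a minimal sketch, the Step 1 setup could be written in Python as follows. The numbers are the ones plugged into the calculations in this post; the variable names and the dictionary layout are illustrative assumptions, not part of the original setup.

```python
# Inputs and targets as stated in the text.
x = [1.0, 4.0, 5.0]   # x_1, x_2, x_3
t = [0.1, 0.05]       # t_1, t_2

# Initial parameters, matching the values used in the Step 2 calculations.
# The key names are hypothetical labels chosen for this sketch.
params = {
    "w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4, "w5": 0.5, "w6": 0.6,  # input -> hidden
    "w7": 0.7, "w8": 0.8, "w9": 0.9, "w10": 0.1,                        # hidden -> output
    "b1": 0.5,  # bias shared by both hidden neurons
    "b2": 0.5,  # bias shared by both output neurons
}

print(f"{len(params)} trainable parameters")
```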


Step 2

Mathematically, we have the following relationships between nodes in the network. For the hidden and output layers, I will use the somewhat strange convention of writing z_{h_1}, z_{h_2}, z_{o_1}, and z_{o_2} for the values before the activation function is applied, and h_1, h_2, o_1, and o_2 for the values after the activation function is applied.

Input to hidden layer

w_1x_1+w_3x_2+w_5x_3+b_1=z_{h_1}

w_2x_1+w_4x_2+w_6x_3+b_1=z_{h_2}

h_1=\sigma(z_{h_1})

h_2=\sigma(z_{h_2})

Hidden layer to output layer

w_7h_1 + w_9h_2 + b_2=z_{o_1}

w_8h_1 + w_{10}h_2 + b_2=z_{o_2}

o_1=\sigma(z_{o_1})

o_2=\sigma(z_{o_2})

We can use the formulas above to forward propagate through the network. I’ve shown up to four decimal places below but maintained all decimals in actual calculations.

w_1x_1 + w_3x_2 + w_5x_3 + b_1 = z_{h_1} = 0.1(1) + 0.3(4) + 0.5(5) + 0.5 = 4.3
h_1 = \sigma(z_{h_1}) = \sigma(4.3) = 0.9866

w_2x_1 + w_4x_2 + w_6x_3 + b_1 = z_{h_2} = 0.2(1) + 0.4(4) + 0.6(5) + 0.5 = 5.3
h_2 = \sigma(z_{h_2}) = \sigma(5.3) = 0.9950

w_7h_1 + w_9h_2 + b_2 = z_{o_1} = 0.7(0.9866) + 0.9(0.9950) + 0.5 = 2.0862
o_1 = \sigma(z_{o_1}) = \sigma(2.0862) = 0.8896

w_8h_1 + w_{10}h_2 + b_2 = z_{o_2} = 0.8(0.9866) + 0.1(0.9950) + 0.5 = 1.3888
o_2 = \sigma(z_{o_2}) = \sigma(1.3888) = 0.8004
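The forward pass above can be sketched in a few lines of Python. This is only an illustration under the assumption that \sigma is the logistic sigmoid and that the initial parameters are the ones plugged into the equations above; it should reproduce the printed values up to rounding.

```python
import math

def sigmoid(z):
    """Logistic sigmoid activation."""
    return 1.0 / (1.0 + math.exp(-z))

# Inputs and initial parameters used in the worked example
x1, x2, x3 = 1.0, 4.0, 5.0
w1, w2, w3, w4, w5, w6 = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6
w7, w8, w9, w10 = 0.7, 0.8, 0.9, 0.1
b1, b2 = 0.5, 0.5

# Input to hidden layer
z_h1 = w1 * x1 + w3 * x2 + w5 * x3 + b1
z_h2 = w2 * x1 + w4 * x2 + w6 * x3 + b1
h1, h2 = sigmoid(z_h1), sigmoid(z_h2)

# Hidden layer to output layer
z_o1 = w7 * h1 + w9 * h2 + b2
z_o2 = w8 * h1 + w10 * h2 + b2
o1, o2 = sigmoid(z_o1), sigmoid(z_o2)

print(f"z_h1 = {z_h1:.4f}, h1 = {h1:.4f}")
print(f"z_h2 = {z_h2:.4f}, h2 = {h2:.4f}")
print(f"z_o1 = {z_o1:.4f}, o1 = {o1:.4f}")
print(f"z_o2 = {z_o2:.4f}, o2 = {o2:.4f}")
```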


Step 3

We now define the sum of squares error using the target values and the results from the last layer from forward propagation.

E = \frac{1}{2}[(o_1-t_1)^2+(o_2-t_2)^2]

\frac{dE}{do_1} = o_1 - t_1

\frac{dE}{do_2} = o_2 - t_2
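To make the error function concrete, here is a small Python sketch that evaluates E and its two derivatives. It uses the outputs from Step 2 rounded to four decimals, so the results are approximate. The function names `sse` and `sse_grad` are hypothetical helpers introduced for this illustration. The two derivatives come out to 0.7896 and 0.7504, which are the values used in Step 4.

```python
def sse(outputs, targets):
    """Half the sum of squared differences, matching E in the text."""
    return 0.5 * sum((o - t) ** 2 for o, t in zip(outputs, targets))

def sse_grad(outputs, targets):
    """dE/do_i = o_i - t_i for each output neuron."""
    return [o - t for o, t in zip(outputs, targets)]

outputs = [0.8896, 0.8004]  # o_1, o_2 from Step 2, rounded to four decimals
targets = [0.1, 0.05]       # t_1, t_2

E = sse(outputs, targets)
dE_do1, dE_do2 = sse_grad(outputs, targets)
print(f"E = {E:.4f}, dE/do1 = {dE_do1:.4f}, dE/do2 = {dE_do2:.4f}")
```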


Step 4

We are now ready to backpropagate through the network to compute all the error derivatives with respect to the parameters. Note that although there will be many long formulas, we are not doing anything fancy here. We are just using the basic principles of calculus such as the chain rule.

First we go over some derivatives we will need in this step. The derivative of the sigmoid function is \sigma'(z) = \sigma(z)(1 - \sigma(z)), so \frac{do_1}{dz_{o_1}} = o_1(1 - o_1) and \frac{do_2}{dz_{o_2}} = o_2(1 - o_2). Also, given that w_7h_1 + w_9h_2 + b_2=z_{o_1} and w_8h_1 + w_{10}h_2 + b_2=z_{o_2}, we have \frac{dz_{o_1}}{dw_7} = h_1, \frac{dz_{o_2}}{dw_8} = h_1, \frac{dz_{o_1}}{dw_9} = h_2, \frac{dz_{o_2}}{dw_{10}} = h_2, \frac{dz_{o_1}}{db_2} = 1, and \frac{dz_{o_2}}{db_2} = 1.
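The sigmoid derivative identity \sigma'(z) = \sigma(z)(1 - \sigma(z)) is easy to sanity-check numerically. The sketch below compares it against a central finite difference at the pre-activation values from Step 2; the helper `central_difference` and the step size are assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """Analytic derivative: sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def central_difference(f, z, eps=1e-6):
    """Numerical derivative of f at z using a symmetric difference."""
    return (f(z + eps) - f(z - eps)) / (2.0 * eps)

# Pre-activation values from the forward pass (rounded)
for z in (4.3, 5.3, 2.0862, 1.3888):
    analytic = sigmoid_prime(z)
    numeric = central_difference(sigmoid, z)
    print(f"z = {z}: analytic = {analytic:.6f}, numeric = {numeric:.6f}")
```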

We are now ready to calculate \frac{dE}{dw_7}, \frac{dE}{dw_8}, \frac{dE}{dw_9}, and \frac{dE}{dw_{10}} using the derivatives we have already discussed.

\frac{dE}{dw_7} = \frac{dE}{do_1} \frac{do_1}{dz_{o_1}} \frac{dz_{o_1}}{dw_7}

\frac{dE}{dw_7} = (o_1 - t_1)(o_1(1 - o_1))h_1

\frac{dE}{dw_7} = (0.8896 - 0.1)(0.8896(1 - 0.8896))(0.9866)

\frac{dE}{dw_7} = 0.0765

I will omit the details on the next three computations since they are very similar to the one above. Feel free to leave a comment if you are unable to replicate the numbers below.

\frac{dE}{dw_8} = \frac{dE}{do_2} \frac{do_2}{dz_{o_2}} \frac{dz_{o_2}}{dw_8}

\frac{dE}{dw_8} = (0.7504)(0.1598)(0.9866)

\frac{dE}{dw_8} = 0.1183

\frac{dE}{dw_9} = \frac{dE}{do_1} \frac{do_1}{dz_{o_1}} \frac{dz_{o_1}}{dw_9}

\frac{dE}{dw_9} = (0.7896)(0.0983)(0.9950)

\frac{dE}{dw_9} = 0.0772

\frac{dE}{dw_{10}} = \frac{dE}{do_2} \frac{do_2}{dz_{o_2}} \frac{dz_{o_2}}{dw_{10}}

\frac{dE}{dw_{10}} = (0.7504)(0.1598)(0.9950)

\frac{dE}{dw_{10}} = 0.1193

The error derivative of b_2 is a little bit more involved since changes to b_2 affect the error through both o_1 and o_2.

\frac{dE}{db_2} = \frac{dE}{do_1} \frac{do_1}{dz_{o_1}} \frac{dz_{o_1}}{db_2} + \frac{dE}{do_2} \frac{do_2}{dz_{o_2}} \frac{dz_{o_2}}{db_2}

\frac{dE}{db_2} = (0.7896)(0.0983)(1) + (0.7504)(0.1598)(1)

\frac{dE}{db_2} = 0.1975
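The output-layer derivatives above can be reproduced with a short Python sketch. It reruns the forward pass with the initial parameters and then applies the chain rule exactly as written. The names `delta_o1` and `delta_o2` (the product of the first two chain-rule factors) are shorthand introduced here, not notation from the post.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward pass with the initial parameters
x1, x2, x3 = 1.0, 4.0, 5.0
t1, t2 = 0.1, 0.05
h1 = sigmoid(0.1 * x1 + 0.3 * x2 + 0.5 * x3 + 0.5)
h2 = sigmoid(0.2 * x1 + 0.4 * x2 + 0.6 * x3 + 0.5)
o1 = sigmoid(0.7 * h1 + 0.9 * h2 + 0.5)
o2 = sigmoid(0.8 * h1 + 0.1 * h2 + 0.5)

# dE/do * do/dz_o for each output neuron
delta_o1 = (o1 - t1) * o1 * (1 - o1)
delta_o2 = (o2 - t2) * o2 * (1 - o2)

grads = {
    "w7": delta_o1 * h1,        # dz_o1/dw7 = h1
    "w8": delta_o2 * h1,        # dz_o2/dw8 = h1
    "w9": delta_o1 * h2,        # dz_o1/dw9 = h2
    "w10": delta_o2 * h2,       # dz_o2/dw10 = h2
    "b2": delta_o1 + delta_o2,  # b2 feeds both output neurons
}

for name, g in grads.items():
    print(f"dE/d{name} = {g:.4f}")
```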

To summarize, we have computed numerical values for the error derivatives with respect to w_7, w_8, w_9, w_{10}, and b_2. We will now backpropagate one layer to compute the error derivatives of the parameters connecting the input layer to the hidden layer. These error derivatives are \frac{dE}{dw_1}, \frac{dE}{dw_2}, \frac{dE}{dw_3}, \frac{dE}{dw_4}, \frac{dE}{dw_5}, \frac{dE}{dw_6}, and \frac{dE}{db_1}.

I will calculate \frac{dE}{dw_1}, \frac{dE}{dw_3}, and \frac{dE}{dw_5} first since they all flow through the h_1 node.

\frac{dE}{dw_1} = \frac{dE}{dh_1} \frac{dh_1}{dz_{h_1}} \frac{dz_{h_1}}{dw_1}

The calculation of the first term on the right hand side of the equation above is a bit more involved than previous calculations since h_1 affects the error through both o_1 and o_2.

\frac{dE}{dh_1} = \frac{dE}{do_1} \frac{do_1}{dz_{o_1}} \frac{dz_{o_1}}{dh_1} + \frac{dE}{do_2} \frac{do_2}{dz_{o_2}} \frac{dz_{o_2}}{dh_1}

Now I will proceed with the numerical values for the error derivatives above. These derivatives have already been calculated above or are similar in style to those calculated above. If anything is unclear, please leave a comment.

\frac{dE}{dh_1} = (0.7896)(0.0983)(0.7) + (0.7504)(0.1598)(0.8) = 0.1502

Plugging the above into the formula for \frac{dE}{dw_1}, we get

\frac{dE}{dw_1} = (0.1502)(0.0132)(1) = 0.0020

The calculations for \frac{dE}{dw_3} and \frac{dE}{dw_5} are below

\frac{dE}{dw_3} = \frac{dE}{dh_1} \frac{dh_1}{dz_{h_1}} \frac{dz_{h_1}}{dw_3}

\frac{dE}{dw_3} = (0.1502)(0.0132)(4) = 0.0079

\frac{dE}{dw_5} = \frac{dE}{dh_1} \frac{dh_1}{dz_{h_1}} \frac{dz_{h_1}}{dw_5}

\frac{dE}{dw_5} = (0.1502)(0.0132)(5) = 0.0099

I will now calculate \frac{dE}{dw_2}, \frac{dE}{dw_4}, and \frac{dE}{dw_6} since they all flow through the h_2 node.

\frac{dE}{dw_2} = \frac{dE}{dh_2} \frac{dh_2}{dz_{h_2}} \frac{dz_{h_2}}{dw_2}

The calculation of the first term on the right hand side of the equation above is a bit more involved since h_2 affects the error through both o_1 and o_2.

\frac{dE}{dh_2} = \frac{dE}{do_1} \frac{do_1}{dz_{o_1}} \frac{dz_{o_1}}{dh_2} + \frac{dE}{do_2} \frac{do_2}{dz_{o_2}} \frac{dz_{o_2}}{dh_2}

\frac{dE}{dh_2} = (0.7896)(0.0983)(0.9) + (0.7504)(0.1598)(0.1) = 0.0818

Plugging the above into the formula for \frac{dE}{dw_2}, we get

\frac{dE}{dw_2} = (0.0818)(0.0049)(1) = 0.0004

The calculations for \frac{dE}{dw_4} and \frac{dE}{dw_6} are below

\frac{dE}{dw_4} = \frac{dE}{dh_2} \frac{dh_2}{dz_{h_2}} \frac{dz_{h_2}}{dw_4}

\frac{dE}{dw_4} = (0.0818)(0.0049)(4) = 0.0016

\frac{dE}{dw_6} = \frac{dE}{dh_2} \frac{dh_2}{dz_{h_2}} \frac{dz_{h_2}}{dw_6}

\frac{dE}{dw_6} = (0.0818)(0.0049)(5) = 0.0020
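The hidden-layer derivatives can be checked the same way. The sketch below is an illustration that carries the error back through both output neurons to each hidden neuron and then on to the six input weights. The `delta_*` names are shorthand assumed for this example.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

x1, x2, x3 = 1.0, 4.0, 5.0
t1, t2 = 0.1, 0.05
w7, w8, w9, w10 = 0.7, 0.8, 0.9, 0.1

# Forward pass with the initial parameters
h1 = sigmoid(0.1 * x1 + 0.3 * x2 + 0.5 * x3 + 0.5)
h2 = sigmoid(0.2 * x1 + 0.4 * x2 + 0.6 * x3 + 0.5)
o1 = sigmoid(w7 * h1 + w9 * h2 + 0.5)
o2 = sigmoid(w8 * h1 + w10 * h2 + 0.5)

# Output-layer terms: dE/do * do/dz_o
delta_o1 = (o1 - t1) * o1 * (1 - o1)
delta_o2 = (o2 - t2) * o2 * (1 - o2)

# Each hidden neuron affects the error through both outputs
dE_dh1 = delta_o1 * w7 + delta_o2 * w8
dE_dh2 = delta_o1 * w9 + delta_o2 * w10

# Multiply by the sigmoid derivative at each hidden neuron
delta_h1 = dE_dh1 * h1 * (1 - h1)
delta_h2 = dE_dh2 * h2 * (1 - h2)

grads = {
    "w1": delta_h1 * x1, "w3": delta_h1 * x2, "w5": delta_h1 * x3,  # through h1
    "w2": delta_h2 * x1, "w4": delta_h2 * x2, "w6": delta_h2 * x3,  # through h2
}

print(f"dE/dh1 = {dE_dh1:.4f}, dE/dh2 = {dE_dh2:.4f}")
for name in ("w1", "w2", "w3", "w4", "w5", "w6"):
    print(f"dE/d{name} = {grads[name]:.4f}")
```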

The final error derivative we have to calculate is \frac{dE}{db_1}. Since b_1 feeds both hidden neurons, and each hidden neuron affects the error through both o_1 and o_2, the derivative sums over both hidden nodes and reuses \frac{dE}{dh_1} and \frac{dE}{dh_2} from above.

\frac{dE}{db_1} = \frac{dE}{dh_1} \frac{dh_1}{dz_{h_1}} \frac{dz_{h_1}}{db_1} + \frac{dE}{dh_2} \frac{dh_2}{dz_{h_2}} \frac{dz_{h_2}}{db_1}

\frac{dE}{db_1} = (0.1502)(0.0132)(1) + (0.0818)(0.0049)(1) = 0.0024


Step 5

We now have all the error derivatives and we’re ready to make the parameter updates after the first iteration of backpropagation. We will use a learning rate of \alpha = 0.01.

w_1 := w_1 - \alpha \frac{dE}{dw_1} = 0.1 - (0.01)(0.0020) = 0.1000

w_2 := w_2 - \alpha \frac{dE}{dw_2} = 0.2 - (0.01)(0.0004) = 0.2000

w_3 := w_3 - \alpha \frac{dE}{dw_3} = 0.3 - (0.01)(0.0079) = 0.2999

w_4 := w_4 - \alpha \frac{dE}{dw_4} = 0.4 - (0.01)(0.0016) = 0.4000

w_5 := w_5 - \alpha \frac{dE}{dw_5} = 0.5 - (0.01)(0.0099) = 0.4999

w_6 := w_6 - \alpha \frac{dE}{dw_6} = 0.6 - (0.01)(0.0020) = 0.6000

w_7 := w_7 - \alpha \frac{dE}{dw_7} = 0.7 - (0.01)(0.0765) = 0.6992

w_8 := w_8 - \alpha \frac{dE}{dw_8} = 0.8 - (0.01)(0.1183) = 0.7988

w_9 := w_9 - \alpha \frac{dE}{dw_9} = 0.9 - (0.01)(0.0772) = 0.8992

w_{10} := w_{10} - \alpha \frac{dE}{dw_{10}} = 0.1 - (0.01)(0.1193) = 0.0988

b_1 := b_1 - \alpha \frac{dE}{db_1} = 0.5 - (0.01)(0.0024) = 0.5000

b_2 := b_2 - \alpha \frac{dE}{db_2} = 0.5 - (0.01)(0.1975) = 0.4980
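A useful way to double-check hand-derived gradients and the resulting updates is a finite-difference check: nudge each parameter slightly, rerun the forward pass, and see how the error changes. The sketch below is one possible version. The function names `error` and `numeric_gradient`, the dictionary layout, and the step size `eps` are all assumptions made for this illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

X = (1.0, 4.0, 5.0)  # inputs x_1, x_2, x_3
T = (0.1, 0.05)      # targets t_1, t_2

def error(p):
    """Forward pass followed by the sum of squares error E."""
    h1 = sigmoid(p["w1"] * X[0] + p["w3"] * X[1] + p["w5"] * X[2] + p["b1"])
    h2 = sigmoid(p["w2"] * X[0] + p["w4"] * X[1] + p["w6"] * X[2] + p["b1"])
    o1 = sigmoid(p["w7"] * h1 + p["w9"] * h2 + p["b2"])
    o2 = sigmoid(p["w8"] * h1 + p["w10"] * h2 + p["b2"])
    return 0.5 * ((o1 - T[0]) ** 2 + (o2 - T[1]) ** 2)

def numeric_gradient(p, eps=1e-6):
    """Central-difference estimate of dE/dparam for every parameter."""
    grads = {}
    for name in p:
        up = dict(p)
        up[name] += eps
        down = dict(p)
        down[name] -= eps
        grads[name] = (error(up) - error(down)) / (2.0 * eps)
    return grads

params = {
    "w1": 0.1, "w2": 0.2, "w3": 0.3, "w4": 0.4, "w5": 0.5, "w6": 0.6,
    "w7": 0.7, "w8": 0.8, "w9": 0.9, "w10": 0.1, "b1": 0.5, "b2": 0.5,
}
alpha = 0.01  # learning rate from the text

grads = numeric_gradient(params)
updated = {name: value - alpha * grads[name] for name, value in params.items()}

for name in params:
    print(f"{name}: gradient = {grads[name]:.4f}, updated value = {updated[name]:.4f}")
```

Because the check only needs the forward pass, it is independent of the chain-rule algebra, which makes it a handy guard against mistakes in long derivations.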

So what do we do now? We repeat these steps many times until the error goes down and the parameter estimates stabilize or converge to some values. We obviously won’t be going through all these calculations manually. I’ve provided Python code below that codifies the calculations above. In practice, we would use a readily available machine learning package rather than coding this by hand.

I ran 10,000 iterations and we see below that the sum of squares error has dropped significantly after the first thousand or so iterations.
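For readers who want to experiment, here is a minimal sketch of such a training loop in plain Python. It assumes the initial parameters, learning rate, and single training example from this post, and it is not necessarily the same code that produced the results described above. The function name `train` and its arguments are hypothetical.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(n_iter=10_000, alpha=0.01):
    """Run gradient descent on the single example and return the error history."""
    x1, x2, x3 = 1.0, 4.0, 5.0
    t1, t2 = 0.1, 0.05
    w1, w2, w3, w4, w5, w6 = 0.1, 0.2, 0.3, 0.4, 0.5, 0.6
    w7, w8, w9, w10 = 0.7, 0.8, 0.9, 0.1
    b1, b2 = 0.5, 0.5
    errors = []
    for _ in range(n_iter):
        # Forward pass
        h1 = sigmoid(w1 * x1 + w3 * x2 + w5 * x3 + b1)
        h2 = sigmoid(w2 * x1 + w4 * x2 + w6 * x3 + b1)
        o1 = sigmoid(w7 * h1 + w9 * h2 + b2)
        o2 = sigmoid(w8 * h1 + w10 * h2 + b2)
        errors.append(0.5 * ((o1 - t1) ** 2 + (o2 - t2) ** 2))

        # Backward pass: all terms use the pre-update weights
        d_o1 = (o1 - t1) * o1 * (1 - o1)
        d_o2 = (o2 - t2) * o2 * (1 - o2)
        d_h1 = (d_o1 * w7 + d_o2 * w8) * h1 * (1 - h1)
        d_h2 = (d_o1 * w9 + d_o2 * w10) * h2 * (1 - h2)

        # Parameter updates: hidden -> output
        w7, w9 = w7 - alpha * d_o1 * h1, w9 - alpha * d_o1 * h2
        w8, w10 = w8 - alpha * d_o2 * h1, w10 - alpha * d_o2 * h2
        b2 -= alpha * (d_o1 + d_o2)

        # Parameter updates: input -> hidden
        w1, w3, w5 = w1 - alpha * d_h1 * x1, w3 - alpha * d_h1 * x2, w5 - alpha * d_h1 * x3
        w2, w4, w6 = w2 - alpha * d_h2 * x1, w4 - alpha * d_h2 * x2, w6 - alpha * d_h2 * x3
        b1 -= alpha * (d_h1 + d_h2)  # b1 feeds both hidden neurons
    return errors

errors = train()
for i in (0, 1000, 5000, len(errors) - 1):
    print(f"iteration {i}: E = {errors[i]:.6f}")
```

Plotting `errors` against the iteration number gives the kind of error curve described above.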

 

12 thoughts on “Backpropagation Example With Numbers Step by Step”

  1. Almost all of whatever you state is surprisingly appropriate and that makes me ponder the reason why I had not looked at this in this light before. This article really did turn the light on for me personally as far as this specific topic goes. Nonetheless at this time there is one issue I am not necessarily too comfy with and while I attempt to reconcile that with the actual central theme of the position, permit me see what the rest of the subscribers have to point out. Very well done.

  2. how are you computing dE/do2 = .7504?

    you state:
    dE/do2 = o2 – t2
    o2 = .8004
    t2 = .5

    therefore:
    dE/do2 = (.8004) – (.5) = .3004 (not .7504)

    am i missing something?
