What is Backpropagation?

At its most basic, a neural network takes input data and maps it to an output value. When training a neural network, we aim to adjust its weights and biases so that the predictions improve. In this post, we discuss how backpropagation works and explain it in detail for three simple examples. The first two examples will contain all the calculations; for the last one we will only illustrate the equations that need to be calculated. We will not go into the general formulation of the backpropagation algorithm, but we will give some further reading at the end. The chain rule is essential for calculating derivatives in neural networks, because each neuron's activation is a function of the activations of neurons in previous layers.
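To make the chain rule concrete, here is a minimal sketch of a two-neuron composition `y = sigmoid(w2 * sigmoid(w1 * x))`; the names `w1`, `w2`, `x` and the concrete values are illustrative, not taken from the post's examples. The analytic chain-rule derivative is checked against a finite-difference estimate.

```python
# Chain rule on a toy composition: y = sigmoid(w2 * sigmoid(w1 * x)).
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def dsigmoid(z):
    s = sigmoid(z)
    return s * (1.0 - s)

x, w1, w2 = 0.5, 0.8, -1.2
z1 = w1 * x          # pre-activation of the hidden neuron
a1 = sigmoid(z1)     # hidden activation
z2 = w2 * a1         # pre-activation of the output neuron
y = sigmoid(z2)

# dy/dw1 via the chain rule: dy/dz2 * dz2/da1 * da1/dz1 * dz1/dw1
dy_dw1 = dsigmoid(z2) * w2 * dsigmoid(z1) * x

# Sanity check against a numerical (finite-difference) derivative.
eps = 1e-6
y_plus = sigmoid(w2 * sigmoid((w1 + eps) * x))
numeric = (y_plus - y) / eps
```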

A simplified model is used to illustrate the concepts without overcomplicating the process. The network, shown in figure 1, has 2 inputs, 2 outputs, and 2 hidden layers. The output nodes are denoted e, indicating the error, though you may also commonly see them denoted C for the cost function. This would typically be a function such as mean squared error (MSE) or binary cross entropy.
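For reference, here is a minimal sketch of the two loss functions just mentioned; the function names and signatures are illustrative, not from the post.

```python
# Two common choices for the error/cost node e: mean squared error and
# binary cross entropy.
import math

def mse(y_pred, y_true):
    # Mean of squared differences between predictions and targets.
    return sum((p - t) ** 2 for p, t in zip(y_pred, y_true)) / len(y_true)

def binary_cross_entropy(p, t, eps=1e-12):
    # p: predicted probability, t: target label (0 or 1).
    p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
    return -(t * math.log(p) + (1 - t) * math.log(1 - p))
```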

  • While the loss might fluctuate from epoch to epoch, it quickly converges to the minimum over many updates.
  • By adding all these desired effects, you can get a list of the nudges you want to happen to this second-to-last layer.
  • Each neuron is configured to perform a mathematical operation, called an “activation function”, on the sum of varyingly weighted inputs it receives from nodes in the previous layer.
  • A neural network consists of a set of parameters – the weights and biases – which define the outcome of the network, that is the predictions.

How neural networks work

Such a computation graph could represent an MLP, for example, which we will see in the next section.

Vanishing/Exploding Gradient Problem

  • The full set of operations for a pointwise layer is shown next in Figure 14.13.
  • Visualizations like this are a useful way to figure out what visual features a given neuron is sensitive to.
  • In this post we calculated the backpropagation algorithm for some simplified examples in detail.
  • We’ve also discussed gradient descent, so you should know that when people describe a network as “learning,” they mean finding the weights and biases that minimize a certain cost function.

Notice that all these operations are simple expressions, mainly involving matrix multiplies. Forward and backward for a linear layer are also very easy to write in code, using any library that provides matrix multiplication (matmul) as a primitive. In the remaining sections, we will still focus only on the case of backpropagation for the loss at a single datapoint. As you read on, keep in mind that doing the same for batches simply requires applying Equation 14.3. The gradient of a sum of terms is the sum of the gradients of each term.
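As a sketch of how simple these expressions are, here is a forward and backward pass for a linear layer written against NumPy (an assumption; any matmul-providing library works). Shapes follow one common convention, `x: (d_in,)`, `W: (d_out, d_in)`, `b: (d_out,)`; the function names are illustrative.

```python
# Forward and backward for a linear layer y = W @ x + b, using only
# matrix operations.
import numpy as np

def linear_forward(W, b, x):
    return W @ x + b

def linear_backward(W, x, grad_out):
    # grad_out is dL/dy for this layer's output y.
    grad_W = np.outer(grad_out, x)   # dL/dW
    grad_b = grad_out                # dL/db (dz/db = 1)
    grad_x = W.T @ grad_out          # dL/dx, passed to the previous layer
    return grad_W, grad_b, grad_x

rng = np.random.default_rng(0)
W = rng.standard_normal((3, 2))
b = rng.standard_normal(3)
x = rng.standard_normal(2)
y = linear_forward(W, b, x)
# With grad_out of all ones, the "loss" is sum(y), so dL/dW[i, j] = x[j].
grad_W, grad_b, grad_x = linear_backward(W, x, np.ones(3))
```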

This map will visually guide us through the derivation and deliver us to our final destination, the formulas of backpropagation. To be clear, we will still end up with many formulas that look intimidating on their own, but after seeing the process by which they evolve, each equation should make sense and things become very systematic. With merge and branch, we can construct any DAG computation graph by simply inserting these layers wherever we want a layer to have multiple inputs or multiple outputs. Backpropagation is an algorithm that efficiently calculates the gradient of the loss with respect to each and every parameter in a computation graph. It relies on a special new operation, called backward, that, just like forward, can be defined for each layer and acts in isolation from the rest of the graph. But first, before we get to defining backward, we will build up some intuition about the key trick backpropagation will exploit.

Each training example has its own desire for how the weights and biases should be adjusted, and with what relative strengths. By averaging together the desires of all training examples, we get the final result for how a given weight or bias should be changed in a single gradient descent step. The output of Lc’s activation function depends on the contributions that it receives from neurons in the penultimate layer, which we’ll call layer L-1. One way to change Lc’s output is to change the weights between the neurons in L-1 and Lc. By calculating the partial derivative of the cost with respect to each weight between L-1 and Lc, we can see how increasing or decreasing any of them will bring the output of Lc closer to (or further from) 1. In a well-trained network, this model will consistently output a high probability value for the correct classification and output low probability values for the other, incorrect classifications.
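The averaging of per-example "desires" can be sketched with a one-parameter model `y = w * x` and squared error; the data values here are illustrative, not from the post.

```python
# Each example's gradient is its "desire"; the descent step uses the mean.
data = [(1.0, 2.0), (2.0, 3.9), (3.0, 6.1)]  # (input, target) pairs
w = 0.5

def grad_single(w, x, t):
    # d/dw of (w*x - t)^2 for one training example
    return 2.0 * (w * x - t) * x

grads = [grad_single(w, x, t) for x, t in data]
avg_grad = sum(grads) / len(grads)   # average the per-example desires

lr = 0.01
w_new = w - lr * avg_grad            # one gradient descent step
```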

Backpropagation in Neural Network

Because the network is not yet well trained, the activations in that output layer are effectively random. That’s no good; we want to change these activations so that they properly identify the digit 2. The way to read this is that the cost function is 32 times more sensitive to changes to that first weight. So if you were to wiggle the value of that weight a bit, it’ll cause a change to the cost function 32 times greater than what the same wiggle to the second weight would cause.
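The "32 times more sensitive" reading can be made concrete with hypothetical gradient values chosen to mirror that ratio (the numbers are illustrative, not computed from the post's network):

```python
# Reading a gradient as sensitivity: a small wiggle to a weight changes
# the cost roughly in proportion to that weight's partial derivative.
grad = {"w1": 3.2, "w2": 0.1}        # hypothetical dC/dw values
ratio = grad["w1"] / grad["w2"]      # cost is 32x more sensitive to w1

wiggle = 0.001
delta_c_w1 = grad["w1"] * wiggle     # approximate change in cost
delta_c_w2 = grad["w2"] * wiggle
```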

Figure 2 indicates the notation for nodes and weights in the example network. Illustration of backpropagation in a neural network consisting of a single neuron. If not mentioned differently, we use the following data, activation function, and loss throughout the examples of this post.
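For the single-neuron case, the full forward and backward pass fits in a few lines. This is a minimal sketch assuming a sigmoid activation and squared-error loss; the concrete numbers are illustrative, not the ones used in the post's figures.

```python
# Backpropagation for a single neuron: forward pass, then chain rule
# factor by factor on the way back.
import math

x, w, b, target = 1.0, 0.6, 0.0, 1.0

# Forward pass
z = w * x + b
a = 1.0 / (1.0 + math.exp(-z))     # prediction (sigmoid activation)
loss = (a - target) ** 2

# Backward pass
dloss_da = 2.0 * (a - target)
da_dz = a * (1.0 - a)              # sigmoid'(z)
dz_dw = x
dloss_dw = dloss_da * da_dz * dz_dw
dloss_db = dloss_da * da_dz        # dz/db = 1
```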

Backward Pass

Whether you’re looking at images, words, or raw numerical data, all the network sees is numbers, and it’s simply finding patterns in those numbers. The input data is filtered through a matrix of weights, the parameters of the network, which can number in the thousands to millions or billions. Fine-tuning these weights to recognize the patterns is obviously not a task any human wants to, or can, do, and so a method to do this was devised, several times, but most notably in 1986 [1].

The method takes a neural network's output error and propagates it backwards through the network, determining which paths have the greatest influence on the output. Backprop is often presented as a method just for training neural networks, but it is actually a much more general tool than that. Backprop is an efficient way to find partial derivatives in computation graphs.
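To see backprop as a general tool, here is a hand-worked sketch on a tiny computation graph that is not a neural net at all: `f(a, b) = (a + b) * b`. The graph and variable names are illustrative.

```python
# Backprop on a generic computation graph, not a neural network.
a, b = 2.0, 3.0

# Forward pass through the graph
s = a + b        # sum node
f = s * b        # product node

# Backward pass: propagate df/df = 1 back through each node
df_df = 1.0
df_ds = df_df * b           # d(s*b)/ds = b
df_db_via_prod = df_df * s  # d(s*b)/db = s
df_da = df_ds * 1.0         # d(a+b)/da = 1
# b reaches f along two paths (through s and directly); gradients add.
df_db = df_ds * 1.0 + df_db_via_prod
```

Note that the analytic check `f = ab + b^2` gives `df/db = a + 2b = 8`, matching the path-summed result.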

5 Backpropagation Over Data Batches

Namely, if everything connected to that digit-2 neuron with a positive weight was brighter, and if everything connected with a negative weight was dimmer, that digit-2 neuron would be more active. Because this gets quite repetitive and because I only have so much length I can cram into a GIF, the process is repeated (very) quickly in figure 9 for all remaining weights. This tracing out of the edges and nodes is done for each path from the error node to each weight in the final layer, running through it quickly in figure 4. The weight subscript indexes may appear backwards but it will make more sense when we build the matrices. Indexing in this manner allows the rows of the matrix to line up with the rows of the neural network and the weight indexes agree with the typical (row, column) matrix indexing.
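The (row, column) indexing convention can be sketched as follows, assuming NumPy; the weight values are illustrative. Row j holds the weights into receiving neuron j of the current layer, column k the weights out of sending neuron k of the previous layer, so the matrix rows line up with the network's rows and the pre-activations are just a matrix-vector product.

```python
# Weight indexing w[j][k]: row = receiving neuron, column = sending neuron.
import numpy as np

W = np.array([[0.1, 0.2],    # weights into neuron 1: w11, w12
              [0.3, 0.4]])   # weights into neuron 2: w21, w22
a_prev = np.array([1.0, 2.0])
z = W @ a_prev               # z[0] = w11*a1 + w12*a2, etc.
```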

There is a general term for this setup, where one neural net outputs values that parameterize another neural net: this is called a hypernetwork [1]. The forward network is a hypernetwork that parameterizes the backward network. Our goal is to iteratively update the weights until we have reached the minimum of the loss function. The object of gradient descent algorithms is to find the specific parameter adjustments that will move us down the gradient most efficiently. Moving down (descending) the gradient of the loss function will decrease the loss. Since the gradient we calculated during backpropagation contains the partial derivatives for every model parameter, we know in which direction to “step” each of our parameters to reduce the loss.
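The descent step itself is a one-liner per parameter. A minimal sketch, with hypothetical parameter names, gradient values, and learning rate:

```python
# One gradient descent step: move each parameter opposite its partial
# derivative, scaled by the learning rate.
params = {"w1": 0.5, "w2": -0.3, "b": 0.1}
grads = {"w1": 0.2, "w2": -0.4, "b": 0.05}  # from backpropagation
lr = 0.1

for name in params:
    params[name] -= lr * grads[name]  # step "down" the gradient
```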

We’ve now completed a forward pass and backward pass for a single training example. However, our goal is to train the model to generalize well to new inputs. To do so requires training on a large number of samples that reflect the diversity and range of inputs the model will be tasked with making predictions on post-training.
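Repeating the forward/backward/update cycle over many samples gives the full training loop. Here is a schematic sketch of that loop structure only, using the same toy `y = w * x` model with squared error; the data and hyperparameters are illustrative.

```python
# Training over many samples: each epoch averages the per-sample
# gradients, then takes one descent step.
data = [(x, 2.0 * x) for x in [0.5, 1.0, 1.5, 2.0]]  # true w is 2.0
w, lr, epochs = 0.0, 0.1, 50

for _ in range(epochs):
    grads = [2.0 * (w * x - t) * x for x, t in data]  # one grad per sample
    w -= lr * sum(grads) / len(grads)                 # average, then step
```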
