Neural Networks Cost Function, Gradient Descent and Stochastic Gradient Descent


This article continues from the previous article Neural Networks Theory Part 1. In this article I will talk about optimising the weights and bias parameters of the neural network using an algorithm called gradient descent as well as define a possible cost function to be used that will be optimised.

Quadratic Cost Function

In Neural Networks in order to train a model with an appropriate set of values for the weights and biases parameter we need a function that is parameterised by weights and biases that can be optimised to achieve our goal. 

Lets step back a bit and try to understand how we can do this. A good model should classify or output a value that is exactly or as close to the desired outcome as possible. A naive approach might be to aggregate the difference between the desired outcome (y) with the predicted output (a). In mathematical terms it could be: \(\begin{eqnarray} C(w,b) \equiv \sum_x y(x) - a \end{eqnarray}\) where a = \(\sigma(w \cdot x+b)\) and we are summing over all examples.

There is one issue with this approach in that the sign of the error matters. Lets take the case where we are using 2 examples. The difference between the first desired outcome (\(y(x_1)\)) and the first example (\(a_1\)) might be 1 unit. The difference between the second desired outcome(\(y(x_2)\)) and the second example (\(a_2\)) is -1. Using the naive cost function this would return a cost of 0 which indicates that our model is perfect even though it is clearly not.

The quadratic cost function takes into account for the sign of the difference by using a concept of distance between the desired outcome and the input and taking an average. The qudratic cost function is defined as below: 

$$\begin{eqnarray} C(w,b) \equiv \frac{1}{2n} \sum_x \| y(x) - a\|^2. \tag{1}\end{eqnarray}$$

Basically what this formule does is to average the L1 norm squared of each difference in the examples from the corresponding desired output. Statisticians will know this cost function as the mean squared error.

Now we have a cost function that is parameterised by weights and biases that we want to find optimal values for. We know that the shape of the cost function is also convex, meaning that there has to be a optima (minimum point). We can now apply the gradient descent algorithm on this cost function to try and find optimal values for our model.

Gradient Descent

Gradient descent is an algorithm used to find the global optimum of a objective/loss function. Whether or not the global minimum is found is another question, but that is beyond the scope of this article.

To explain the concept of gradient descent I will go away from the weights and biases variables used in the concept of neural networks and use more general terms. For simplicity lets work in the 3-dimension space with Cost (C) in the z-axis, v1 in the x-axis and v2 in the y-axis. Note that in more realistic problems we will often have many more input variables then just the 2.

Putting it in more mathematical terms lets define the cost function as C(v1,v2). Let us also define the cost function to be a convex function, real valued and differentiable everywhere in the support of the function.

For those with a background in linear algebra, you will all be thinking this is a trivial problem why not just use calculus. The answer to that is yes you can use calculus but imagine if the cost function contains million if not billions of parameters and even then you have to test if that stationary point is a minimum or a maximum for all of those parameters. This will be very inefficient.

The idea of gradient descent is quite simple. Imagine a valley between 2 mountains. Next, imagine that it starts raining what will happen to the water from the rain? A property of water is that it will go to the lowest point it can go. Imagine that a water droplet drops at a particulr point on one of the mountains. Assuming no friction and a smooth surface to the valey we would expect this water droplet to roll down the mountain until it reaches the lowest point of the valley. This is how gradient descent works.

Now lets try and define a relationship for the change in the cost function due to changes in v1 and v2. Calculus tells us that the Cost changes by: 

\(\begin{eqnarray} \Delta C \approx \frac{\partial C}{\partial v_1} \Delta v_1 + \frac{\partial C}{\partial v_2} \Delta v_2. \tag{2}\end{eqnarray}\)

There is a way to choose \(\Delta v_1\) and \(\Delta v_2\) such that \(\Delta C\) is negative. But before I go into that lets define the vector changes of the variables v as :

$$\Delta v \equiv (\Delta v_1, \Delta v_2)^T $$

I will also define the gradient of the cost function as:

$$\begin{eqnarray} \nabla C \equiv \left( \frac{\partial C}{\partial v_1}, \frac{\partial C}{\partial v_2} \right)^T. \tag{3}\end{eqnarray}$$

Using these definitions we can define equation (2) succintly as: 

$$\begin{eqnarray} \Delta C \approx \nabla C \cdot \Delta v. \tag{3a}\end{eqnarray}$$

Where the dot is the dot product of 2 vectors.

Now lets choose \(\Delta v\) such that \(\Delta C\) will be negative and guarantee that we go towards the minimum point (assuming correct choice of hyper parameter).

Suppose we choose \(\begin{eqnarray} \Delta v = -\eta \nabla C, \tag{4}\end{eqnarray}\)

This means that \(\Delta C \approx -\eta \nabla C \cdot \nabla C = -\eta \|\nabla C\|^2\)

Since  \( ||\nabla C|| ^2\) >= 0 this ensures that \(\Delta C\) will be negative. 

This also gives us our update rule to move the variables v1 and v2 that brings the cost closer to the optimal position in the cost function:

\(\begin{eqnarray} v \rightarrow v' = v -\eta \nabla C. \tag{5}\end{eqnarray}\)

Breaking the equation above into words means that the optimal "next step" for the variable is the original position - \(\eta\) times the derivative of the cost function wrt to the variable. This is similar to trying to find the steepest step down the mountain in the earlier example. 

Gradient descent then applies this step numerous times to try and find this optimal minimum point. Note that the choice of value for \(\eta\) will affect the algorithm. Too small and you may not reach the bottom of the function, too large and you will actually take too large a step each iteration that you get further and further away from the valley.

Note that in this example we only used two variables \(v_1\) and \(v_2\). This theory generalises into arbitrary numbers of input variables.

Linking Back to Neural Networks

The update rule defined in the previous section can be applied to the weights and biases parameters in neural networks. Specifically here are the update rules:

\(\begin{eqnarray} w_k & \rightarrow & w_k' = w_k-\eta \frac{\partial C}{\partial w_k} \tag{6}\\ b_l & \rightarrow & b_l' = b_l-\eta \frac{\partial C}{\partial b_l}. \tag{7}\end{eqnarray}\)

Where the k subscript represents the weight from the kth neuron from the previous layer connected to the current aritificial neuron and l represents the bias for the current artificial neuron at layer l.

There is 1 substantial problem with the gradient descent that I would like to discuss further.

Recall that the cost function has the form \(C = \frac{1}{n} \sum_x C_x\)

What the cost measures is the sum of the individual costs of each example in the training data set. This means that in the update rule we also need to compute the gradient of the cost function \(\nabla C\) for each individual training input as well \(\nabla C_x\) and then get an average:

$$\nabla C = \frac{1}{n} \sum_x \nabla C_x$$

This means if the size of the training data is very large, this will take a very long time and is inefficient.

One way to overcome this slowness is to use a technique called stochastic gradient descent. The idea of stochastic gradient descent is that instead of calculating the gradients over the entire training data, it will randomly select a subsample of size m and calculate the gradient of the cost function over this smaller sample. These randomy selected training input data will be labelled as

$$X_1, X_2, ..., X_m$$

This sample will be referred to as a mini-batch. As long as m is large enough the gradient of the cost function calculated over the mini-batch is a good approximation of the gradient of the cost function over the entire training data:

$$\begin{eqnarray} \frac{\sum_{j=1}^m \nabla C_{X_{j}}}{m} \approx \frac{\sum_x \nabla C_x}{n} = \nabla C, \tag{8}\end{eqnarray}$$

Where the middle summation is over the entire training data set. 

To bring all of this theory back into neural networks we can re-write the update rule using stochastic gradient descent as following:

\( w_k -> w_k' = w_k-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial w_k} \tag{9}\)

\( b_l -> b_l' = b_l-\frac{\eta}{m} \sum_j \frac{\partial C_{X_j}}{\partial b_l} \tag{10}\)

The sums are over all the training examples in the mini-batch. Next, another randomly chosen mini-batch is used and trained on those until all the data in the training dataset has been used. Once this is done the algorithm has completed one epoch of training, we then go through the same process again for another epoch until the specified number of epochs have been reached.


This concludes the theory behind Neural Networks. I have tried to limit the amount of maths in the 2 part article. Note, there are various types of neural networks that goes beyond the scope of this article and there are various aspects of Neural Networks that can be researched further on. Please leave any comments and feedback below.

Subscribe to our mailing list

* indicates required