Intro to Gradient Descent


Introduction

In a Neural Network, there are several possibilities for what the output could be. For example, let's say we are training a Neural Network to guess a person's tone of voice on a scale from 0 (chill), through 0.5 (decent mood), to 1.0 (angry).

Well, we have to train the Neural Network to guess the tone somehow! Since we don't yet know the function that accurately predicts somebody's tone, we start by assigning random values to each weight. This means in the final/output layer, our probability distribution could look like, 



This is a problem since it's just guessing and hasn't been able to come to a conclusion. So, we need something to train the Neural Network and tell it how far off it is. That's where Gradient Descent comes in. 
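To see why random initialization just produces guessing, here is a minimal sketch. The three-class output layer and the use of softmax are assumptions for illustration; the article's network details aren't specified.

```python
import math
import random

random.seed(0)

# Hypothetical 3-class output layer (chill, decent mood, angry)
# with randomly initialized raw scores -- the network hasn't
# learned anything yet.
raw_scores = [random.uniform(-1, 1) for _ in range(3)]

# Softmax turns the raw scores into a probability distribution.
exps = [math.exp(s) for s in raw_scores]
probs = [e / sum(exps) for e in exps]

print(probs)  # an arbitrary spread -- no learned preference for any class
```

The exact numbers depend on the random seed; the point is that before training, the distribution carries no information about the input.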

How Does it Work? 

First, we find the cost function, which is a fancy term for the sum, over all output neurons, of the squared difference between the actual output value and the expected output value. For our example, it would look like, 

f(x) = (0.5-1.0)^2 + (0.1-0.0)^2 + (0.4-0.0)^2
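The same calculation in code, using the example's three outputs and their expected values:

```python
# Outputs from the example above and the values we expected.
outputs  = [0.5, 0.1, 0.4]
expected = [1.0, 0.0, 0.0]

# Cost = sum of squared differences over all output neurons.
cost = sum((o - e) ** 2 for o, e in zip(outputs, expected))
print(cost)  # 0.42 = 0.25 + 0.01 + 0.16
```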

Once we have this value, we want to get it as close to its minimum/smallest value as we can. When the calculations become more complex (e.g. 100+ output values), solving these equations directly becomes impractical. So instead, we find the derivative, or the slope at our current point, and nudge the weights a little in the direction that makes the cost smaller. Repeating this over and over, the cost keeps shrinking until we settle at a local minimum. 
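The nudging process above can be sketched in one dimension. The function f(x) = (x - 3)^2 is a toy stand-in for the cost, and the learning rate of 0.1 is an arbitrary choice:

```python
# Minimal 1-D gradient descent on f(x) = (x - 3)^2.
def f(x):
    return (x - 3) ** 2

def df(x):           # derivative: f'(x) = 2(x - 3)
    return 2 * (x - 3)

x = 0.0              # arbitrary starting guess
lr = 0.1             # learning rate: how big each nudge is
for _ in range(100):
    x -= lr * df(x)  # step against the slope

print(x)  # converges close to 3, the minimum of f
```

Each iteration moves x opposite to the slope, so the cost f(x) decreases until x sits at the bottom of the curve.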

Drawback

What I have explained above is the Vanilla/Classic version of Gradient Descent. Sometimes, this version of Gradient Descent can run into a few issues. 

1. Vanishing Gradients

The changes to the weights become very small, so as you keep updating them, the network effectively stops learning and never reaches the global minimum.
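A rough sketch of why this happens: the sigmoid activation (assumed here for illustration) has a derivative of at most 0.25, and backpropagation multiplies these small factors across layers, so the gradient shrinks exponentially with depth.

```python
import math

def sigmoid(x):
    return 1 / (1 + math.exp(-x))

# Chaining the sigmoid derivative (at most 0.25) across 10 layers.
grad = 1.0
for _ in range(10):
    s = sigmoid(0.0)       # sigmoid'(0) = s * (1 - s) = 0.25, the maximum
    grad *= s * (1 - s)

print(grad)  # 0.25**10, roughly 9.5e-07 -- a vanishingly small update
```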

2. Exploding Gradients

This is the opposite of the vanishing gradient problem. The weights change constantly and move a lot, but the model doesn't actually learn. 
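The mirror-image sketch: if each layer multiplies the gradient by a factor larger than 1 (the factor of 3 here is a made-up stand-in for large weights), the gradient grows exponentially with depth and the updates become huge and erratic.

```python
# Hypothetical 10-layer network where each layer scales the
# gradient by 3 during backpropagation.
grad = 1.0
for _ in range(10):
    grad *= 3.0

print(grad)  # 3**10 = 59049 -- an explosively large update
```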

3. Local Minima

Like I said, the function can get stuck in a "local minimum," which is not always the lowest point overall (the global minimum). This means there could theoretically be a "lower" point that we never reach. 
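Getting stuck can be demonstrated with a toy function that has two valleys. The function f(x) = x^4 - 3x^2 + x is my own illustrative choice: its global minimum sits near x ≈ -1.3, but a shallower local minimum sits near x ≈ 1.13.

```python
# f(x) = x**4 - 3*x**2 + x has two minima; we start on the
# "wrong" side of the hill and descend into the shallower one.
def df(x):                  # derivative: f'(x) = 4x^3 - 6x + 1
    return 4 * x**3 - 6 * x + 1

x = 1.0                     # starting point near the local valley
for _ in range(1000):
    x -= 0.01 * df(x)

print(x)  # settles near 1.13, the local minimum -- not the global one near -1.3
```

A different starting point (say, x = -0.5) would slide into the deeper valley instead, which is why initialization matters.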

Works Referenced/Extra Resources

These were the works I referenced while researching for this article. I also included some other resources that I thought would be beneficial for everybody. 
"Gradient Descent Problems and Solutions in Neural Networks" by Shachi Kaul (published in Analytics Vidhya) on Medium 
"Gradient descent, how neural networks learn | DL2" by 3Blue1Brown on YouTube
"Gradient Descent in 3 minutes" by Visually Explained on YouTube
