What are SoftMax and ArgMax?

What is a SoftMax?

The function now called SoftMax was first formulated in 1868 by physicist Ludwig Boltzmann, long before it had that name. It was originally known as the Boltzmann or Gibbs distribution, and it was developed for gas theory: the part of physics that, as you guessed, is about gases. It later became useful in statistics and neural networks, where its purpose is to turn a set of raw scores into a probability distribution, which gives each candidate token a probability from 0 to 1 of being the next one. Before SoftMax is applied, the raw scores could look like this:

like = 10.01

love = 7.4

hate = 5.1

distaste = 3.7 

(Assume for our test case that the tokens are full words)

As you noticed, the scores are not between 0 and 1. In our case, this is not very helpful, since the scores do not add up to a total that is the same for every set of scores. This makes it hard to determine the real probability of each word being next. That's where the SoftMax function comes in:

$$\text{SoftMax}(x_i) = \frac{e^{x_i}}{\sum_{j} e^{x_j}}$$

x_i is the score of one of the possible next tokens (e.g. like, love, hate, distaste), and the sum in the denominator runs over the scores of all of the tokens.

e is Euler’s number.

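This function not only puts every token's probability between 0 and 1, but also puts it in the context of the other tokens: each token's value is divided by the total over all tokens, so the probabilities always add up to 1. So now, our new probability distribution (rounded to three decimal places) would be: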

like = 0.924

love = 0.068

hate = 0.007

distaste = 0.002

So, what's most probable in this scenario? It would be “like.” 
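To make the calculation concrete, here is a minimal Python sketch of SoftMax applied to the four example scores. The function name `softmax` and the variable names are my own choices for illustration, and subtracting the largest score before exponentiating is a common numerical-stability trick that is not part of the formula itself (it does not change the result).

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    # Subtracting the largest score leaves the result unchanged,
    # but keeps math.exp from overflowing on very large inputs.
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

tokens = ["like", "love", "hate", "distaste"]
scores = [10.01, 7.4, 5.1, 3.7]
probs = softmax(scores)

for token, p in zip(tokens, probs):
    print(f"{token} = {p:.3f}")
```

Rounded to three decimal places, this prints the same values as the list above.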

*Note: SoftMax and ArgMax are what are called "Activation Functions," which is just a fancy way of saying that they calculate the output of a node.

What is an ArgMax?

ArgMax is just like SoftMax, except that each value can only be set to 0 or 1. This keeps the output very clean and easy to read. Let's take the same set of scores again:

like = 10.01

love = 7.4

hate = 5.1

distaste = 3.7 

The highest value will be set to 1 and the rest to 0. So, 

like = 1

love = 0

hate = 0

distaste = 0

Thus, "like" is the next token.
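The same step can be sketched in a few lines of Python. The helper name `argmax_one_hot` is hypothetical, and returning the result as a list of 0s and 1s is simply how this article presents ArgMax. Many libraries return only the index of the largest value instead. If two scores tie, this sketch picks the first one.

```python
def argmax_one_hot(scores):
    """Set the highest score to 1 and every other score to 0."""
    # Index of the largest score (the first one wins on a tie).
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == best else 0 for i in range(len(scores))]

tokens = ["like", "love", "hate", "distaste"]
scores = [10.01, 7.4, 5.1, 3.7]
one_hot = argmax_one_hot(scores)

for token, value in zip(tokens, one_hot):
    print(f"{token} = {value}")

# The next token is the one whose value is 1.
next_token = tokens[one_hot.index(1)]
print("next token:", next_token)
```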

Pros and Cons

SoftMax

Pros: 

- Very easy to see the "weights" and "biases" of a neural network working

- Easy to customize, so you can adjust the values of the probability distribution to your liking.

- Can be used for Backpropagation (a way to update the weights and parameters of a neural network).

- Good during the Training Process

Cons:

- Can get hard to read and debug, since every token ends up with its own decimal value

- Not good when shipping out the actual model

ArgMax

Pros:

- Super easy to read (either a 0 or 1)

- Best for the actual shipped model

Cons:

- Hard to see the effects of the weights and biases in the probability distribution (Bad for the training process)

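- Cannot be used for Backpropagation (a small change in the input does not change its output, so its derivative is 0 and gives no signal for updating the weights)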
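The Backpropagation points in the lists above can be illustrated with a small experiment. This is only a sketch: the helper names are hypothetical, and nudging one score by 0.01 is a rough stand-in for the derivative that Backpropagation actually computes. The idea is that a small change in a score moves the SoftMax output a little, while the ArgMax output does not move at all.

```python
import math

def softmax(scores):
    """Turn a list of raw scores into probabilities that sum to 1."""
    largest = max(scores)
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def argmax_one_hot(scores):
    """Set the highest score to 1 and every other score to 0."""
    best = max(range(len(scores)), key=lambda i: scores[i])
    return [1 if i == best else 0 for i in range(len(scores))]

scores = [10.01, 7.4, 5.1, 3.7]   # like, love, hate, distaste
nudged = [10.01, 7.41, 5.1, 3.7]  # "love" raised by 0.01

# How much each output value moved after the nudge.
softmax_change = [b - a for a, b in zip(softmax(scores), softmax(nudged))]
argmax_change = [b - a for a, b in zip(argmax_one_hot(scores), argmax_one_hot(nudged))]

print("SoftMax change:", softmax_change)  # small non-zero values
print("ArgMax change:", argmax_change)    # all zeros
```

Because the ArgMax output stays exactly the same, there is nothing to tell the network which direction to adjust its weights, whereas the small SoftMax changes provide that signal.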

Works Referenced/Extra Resources

These were the works I referenced while researching for this article. I also included some other resources that I thought would be beneficial for everybody. All other resources that were used are already linked above.

"Neural Networks Part 5: ArgMax and SoftMax" by StatQuest with Josh Starmer.

"What Is Argmax in Machine Learning?" by Jason Brownlee, PhD (Machine Learning Mastery).

"Softmax Activation Function with Python" by Jason Brownlee, PhD (Machine Learning Mastery).


*edit: all the links have broken, and I have not been able to replace them. Sorry for the inconvenience.
