
Activation Functions

Activation functions are used in Neural Networks to add a non-linearity to the otherwise linear layers. This non-linearity ensures that the layers of a Neural Network are genuinely distinct rather than simply stacked linear functions. Without activations the Neural Network would collapse into a single linear model, as the composition of multiple linear functions is itself a linear function.
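To see why, consider two linear layers applied one after the other, with weights \(W_1, W_2\) and biases \(b_1, b_2\):

\( W_2(W_1 x + b_1) + b_2 = (W_2 W_1)x + (W_2 b_1 + b_2) \)

The result is just another linear layer with weights \(W_2 W_1\) and bias \(W_2 b_1 + b_2\).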

In general, an activation function should be non-linear, differentiable (at least almost everywhere) so that gradients can flow during back-propagation, and cheap to compute.

The following are the most commonly used activation functions. I have also included the PyTorch modules used to implement each one.

Sigmoid

Sigmoid is a classic activation function which takes any positive or negative x value and outputs a value between 0 and 1. As you can see from the formula below, the sigmoid value when x = 0 is 0.5. Negative x values give outputs between 0 and 0.5, and positive x values give outputs between 0.5 and 1.

\(\sigma(x) = \frac{1}{1 + e^{-x}} \)

One useful property of Sigmoid is that very large values map close to 1 and very negative values map close to 0. Inputs in the range of about -3 to 3 have much more resolution in terms of their Sigmoid output compared to values outside this range. This provides a "squishification" type effect for large positive or negative values.

Sigmoid is typically only used as the output activation function for binary classifiers as it has a major drawback when used as an activation between layers within a neural network: the "vanishing gradient problem". This means that gradients quickly decay towards zero after passing through a few Sigmoid layers due to Sigmoid's small derivative.
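The scale of the problem is clear from Sigmoid's derivative, which can be written in terms of the function itself:

\( \sigma'(x) = \sigma(x)\,(1 - \sigma(x)) \)

This peaks at 0.25 when x = 0, so each Sigmoid layer scales the gradient by at most a quarter during back-propagation.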

PyTorch Implementation:

import torch
from torch import nn

sigmoid = nn.Sigmoid()
test_values = torch.Tensor([-3.0, 0.0, 3.0])
print(sigmoid(test_values))
>> tensor([0.0474, 0.5000, 0.9526])
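To make the vanishing gradient concrete, here is a small illustrative sketch (my own example) that chains five Sigmoid activations and back-propagates through them:

import torch

# Pass x = 0 through five Sigmoid activations in a row and back-propagate.
x = torch.tensor([0.0], requires_grad=True)
out = x
for _ in range(5):
    out = torch.sigmoid(out)  # functional equivalent of nn.Sigmoid()
out.backward()

# Each Sigmoid scales the gradient by at most 0.25, so after five layers
# x.grad is already below 0.25**5 (less than 0.001).
print(x.grad)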

Tanh

The Tanh function, also known as the hyperbolic tangent, behaves in a similar manner to Sigmoid with the key difference being that its output range is -1 to 1. Tanh has similar drawbacks to Sigmoid, such as the vanishing gradient problem, and is therefore not suitable for use as an activation in deep networks.

\(\text{Tanh}(x) = \Large{\frac{e^x - e^{-x}}{e^x + e^{-x}}} \)
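Tanh is in fact just a rescaled and shifted Sigmoid, which is why the two functions share the same drawbacks:

\( \text{Tanh}(x) = 2\sigma(2x) - 1 \)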

Tanh can be used for the output activation of a network if you wish to have an output in the -1 to 1 range. This is commonly seen for image generation tasks.

PyTorch Implementation:

import torch
from torch import nn

tanh = nn.Tanh()
test_values = torch.Tensor([-3.0, 0.0, 3.0])
print(tanh(test_values))
>> tensor([-0.9951, 0.0000, 0.9951])

ReLU

The ReLU (Rectified Linear Unit) activation function provides a solution to the vanishing gradient problem. It achieves this by having a much larger derivative than Sigmoid for positive inputs. ReLU is expressed as:

\( \text{ReLU}(x) = \begin{cases} 0 & \text{if } x \lt 0 \\ x & \text{if } x \geq 0 \end{cases} \)

The derivative of ReLU(x) is equal to 1 for positive inputs and equal to 0 for negative inputs. Note that the derivative of ReLU is undefined at x = 0, however it is acceptable to set it to either 1 or 0 in practical applications. Having a derivative of 1 allows gradients to flow during back-propagation without vanishing.
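Written as a piecewise function, the derivative is:

\( \text{ReLU}'(x) = \begin{cases} 0 & \text{if } x \lt 0 \\ 1 & \text{if } x \gt 0 \end{cases} \)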

ReLU is commonly used as an activation function for the hidden layers of Neural Networks, however there are some downsides to its use. ReLU has a derivative of zero for negative inputs, so it is possible for a neuron to become disabled (stuck at zero) if it consistently receives negative inputs. Furthermore, it is also possible to get "exploding gradients" when using ReLU as its output is not bounded above. This issue can be mitigated by the use of batch normalisation.
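For consistency with the earlier examples, here is a minimal PyTorch sketch using nn.ReLU; the printed values follow directly from the definition, with negative inputs mapped to 0 and non-negative inputs passed through unchanged:

import torch
from torch import nn

relu = nn.ReLU()
test_values = torch.Tensor([-3.0, 0.0, 3.0])
print(relu(test_values))
>> tensor([0., 0., 3.])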

There are many variations of the ReLU activation which address these downsides. These include Leaky ReLU, GELU, SwiGLU and many others.

Leaky ReLU

Leaky ReLU is similar to ReLU but allows negative values to pass through at a reduced scale. The scaling is controlled by a "slope" parameter (denoted by α): negative values of the input x are multiplied by α.

\( \text{LeakyReLU}(x) = \begin{cases} \alpha x & \text{if } x \lt 0 \\ x & \text{if } x \geq 0 \end{cases} \)

Using Leaky ReLU avoids the dying neuron issue that can occur with standard ReLU, since negative inputs still receive a small non-zero gradient. Note that, like ReLU, its output is still unbounded for positive inputs. This activation function also introduces another hyperparameter (α) which can be tuned to improve model performance.
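A minimal PyTorch sketch in the same style as the earlier examples, using nn.LeakyReLU with an assumed slope of α = 0.1 (the PyTorch default for negative_slope is 0.01):

import torch
from torch import nn

leaky_relu = nn.LeakyReLU(negative_slope=0.1)
test_values = torch.Tensor([-3.0, 0.0, 3.0])
print(leaky_relu(test_values))
>> tensor([-0.3000, 0.0000, 3.0000])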