Summary of common activation functions (deep learning)

Foreword

  When learning about neural networks, we constantly hear the term activation function, and many materials list the commonly used ones, such as the Sigmoid, tanh, and ReLU functions. After studying for a while, I decided to write down my personal study notes.

1. Activation function

1. What is an activation function?
  In a neural network, after a neuron's inputs are weighted and summed, the result is passed through one more function; this function is the activation function (Activation Function).

2. What is the purpose of the activation function?
  First of all, we need to know that if no activation function is introduced, then the output of every layer in the network is a linear function of the previous layer's input, so no matter how many layers the network has, its final output is still just a linear combination of the input; such a network is essentially the most basic perceptron and can only be applied to linearly separable problems. To exploit neural networks on nonlinear problems, an activation function has to be applied to the output of each layer; the nonlinearity it introduces lets the network approximate arbitrary nonlinear functions, so that neural networks with activation functions can be used effectively in nonlinear domains.
  Furthermore, beyond introducing nonlinear expressive power, applying activation functions in a neural network can, to varying degrees, improve model robustness, alleviate the vanishing-gradient problem, map the input features into a new feature space, and accelerate convergence. The small sketch below illustrates the linear-collapse point from the previous paragraph.
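
As a quick sanity check of that point, here is a minimal NumPy sketch (hypothetical layer sizes, random weights) showing that two stacked linear layers without an activation are exactly equivalent to a single linear layer:

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(4, 3))            # a small batch of inputs

# Two "layers" without any activation: y = (x W1 + b1) W2 + b2
W1, b1 = rng.normal(size=(3, 5)), rng.normal(size=5)
W2, b2 = rng.normal(size=(5, 2)), rng.normal(size=2)
y_two_layers = (x @ W1 + b1) @ W2 + b2

# The same mapping collapses into a single linear layer: y = x W + b
W = W1 @ W2
b = b1 @ W2 + b2
y_one_layer = x @ W + b

print(np.allclose(y_two_layers, y_one_layer))   # True: stacking linear layers adds no expressive power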

2. Several common activation functions at present

  Common activation functions mainly include the following: Sigmoid, tanh, ReLU, ReLU6 and their variants (Leaky ReLU, PReLU, RReLU), ELU, SELU, Swish, Mish, Maxout, hard-sigmoid, and hard-swish.
  They are introduced and analyzed below in two groups: saturated activation functions and unsaturated activation functions.

1. Saturated activation function

   (mainly Tanh, Sigmoid and hard-Sigmoid functions)

Sigmoid

The mathematical expression of the Sigmoid function is as follows:

  sigmoid(x) = 1 / (1 + exp(-x))

  Its output values lie between 0 and 1; the advantages and disadvantages of this type of activation function are as follows:

Advantages:
  1. It squashes inputs of any magnitude into the range (0, 1), so activation magnitudes do not grow out of control in deep networks (ReLU, by contrast, does not bound the magnitude of its output);
  2. Among common activations it is the closest, in a physical sense, to a biological neuron;
  3. Because of its output range, it is well suited to models whose output is interpreted as a probability;
Disadvantages:
  1. When the input is very large or very small, the output is essentially constant, so the gradient is close to 0 (a small numeric check follows this list);
  2. The output is not zero-mean, so neurons in the next layer receive non-zero-mean signals as input; as the network gets deeper, this shifts the distribution of the data;
  3. The gradient can vanish early, which slows convergence; for example, Tanh converges faster than sigmoid because its vanishing-gradient problem is milder;
  4. The exponential is relatively expensive to compute.
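
To make disadvantage 1 concrete, here is a minimal check, assuming the sigmoid definition above, of how small the gradient becomes for large-magnitude inputs:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_grad(x):
    s = sigmoid(x)
    return s * (1 - s)              # derivative of sigmoid

for v in [0.0, 2.0, 5.0, 10.0]:
    # the gradient peaks at 0.25 for x = 0 and decays rapidly towards 0
    print(f"x = {v:5.1f}  sigmoid'(x) = {sigmoid_grad(v):.6f}")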

Tanh

The mathematical expression of the Tanh function is as follows:

  tanh(x) = (exp(x) - exp(-x)) / (exp(x) + exp(-x)) = 2*sigmoid(2x) - 1

  Like the Sigmoid function, Tanh is a saturated activation function, but its output range is between -1 and 1. Its advantages and disadvantages are summarized as follows:
Advantages:
  1. It solves the problem mentioned above that the Sigmoid output is not zero-mean;
  2. The derivative of Tanh ranges over (0, 1], which is better than sigmoid's (0, 0.25], so the vanishing-gradient problem is alleviated to some extent;
  3. Near the origin, Tanh is approximately the identity y = x, so when the activations are small the layer behaves almost like a plain matrix multiplication and training is relatively easy;
Disadvantages:
  1. Like the Sigmoid function, it still suffers from vanishing gradients;
  2. Both of its forms, 2*sigmoid(2x) - 1 and (exp(x) - exp(-x)) / (exp(x) + exp(-x)), involve exponentials, so the computational cost problem remains (the equivalence of the two forms is checked numerically below);
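
A quick numerical check of the two equivalent expressions just mentioned (a minimal NumPy sketch):

import numpy as np

x = np.linspace(-5, 5, 11)

tanh_direct = (np.exp(x) - np.exp(-x)) / (np.exp(x) + np.exp(-x))
tanh_via_sigmoid = 2.0 / (1.0 + np.exp(-2.0 * x)) - 1.0     # 2*sigmoid(2x) - 1

# both agree with np.tanh to floating-point precision
print(np.allclose(tanh_direct, np.tanh(x)), np.allclose(tanh_via_sigmoid, np.tanh(x)))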

2. Unsaturated activation function

(mainly ReLU, ReLU6 and their variants (Leaky ReLU, PReLU, RReLU), ELU, Swish, Mish, Maxout, hard-sigmoid, and hard-swish)

ReLU

The mathematical expression of the ReLU function is as follows:

  ReLU(x) = max(0, x)

  When the input is negative the output is 0, and for inputs greater than 0 the output is y = x; the function is therefore not differentiable over its whole domain (the derivative is undefined at x = 0). Its advantages and disadvantages are summarized as follows:

Advantages:
  1. Compared with sigmoid and Tanh, ReLU has no saturation problem for positive inputs, which mitigates the vanishing-gradient problem and makes deep networks trainable;
  2. It is very fast to compute: only a comparison with 0 is needed;
  3. It converges much faster than sigmoid and Tanh;
  4. ReLU sets some neurons' outputs to 0, which makes the network sparse, reduces the interdependence of parameters, and alleviates overfitting to some extent;
Disadvantages:
  1. The output of ReLU is not zero-mean;
  2. There is the Dead ReLU problem: some neurons may never be activated again, so their parameters are never updated; the main causes are poor parameter initialization and an excessively large learning rate (a small illustration follows this list);
  3. For positive inputs the derivative is 1, so the gradient does not vanish along the chain rule, but the magnitude of the backpropagated gradient then depends entirely on the product of the weights, which can lead to exploding gradients;
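
The Dead ReLU problem in disadvantage 2 can be illustrated with a minimal sketch (a hypothetical single neuron with a large negative bias, assuming standard-normal inputs):

import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=10_000)            # typical inputs roughly in [-4, 4]

w, b = 1.0, -10.0                      # a neuron whose pre-activation is almost always negative
pre_activation = w * x + b
output = np.maximum(0.0, pre_activation)           # ReLU
grad_mask = (pre_activation > 0).astype(float)     # local gradient of ReLU

# the neuron outputs 0 and receives zero gradient for (almost) every input,
# so gradient descent can never move w and b again: the neuron is "dead"
print(output.max(), grad_mask.mean())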

ReLU6

The mathematical expression of the ReLU6 function is as follows:

  ReLU6(x) = min(max(0, x), 6)

  ReLU6 behaves like ReLU except that the output is capped: when x is greater than or equal to 6, the output is limited to 6. Its advantages and disadvantages are similar to those of ReLU.

  ReLU6 is an ordinary ReLU whose maximum output is clipped to 6. It is used in the MobileNetV1 network, and the purpose is to suit low-precision float16/int8 computation.

Advantages:
  1. ReLU6 has the advantages of the ReLU function;
  2. It works well even when mobile devices run at low precision (float16/int8). If the activation range of ReLU were not limited and the activation values became very large, low-precision float16/int8 could not represent such a wide range accurately, causing a loss of precision.
Disadvantages:
  1. Similar to the disadvantages of ReLU.

LeakyReLU

The mathematical expression of the LeakyReLU function is as follows:

  LeakyReLU(x) = x        if x >= 0
  LeakyReLU(x) = α * x    if x < 0

  When x is greater than or equal to 0, y = x; when x is less than 0, y = α*x, where α is a small positive slope (the code in Section 3 implements the negative branch as x / a with a = 2, i.e. a slope of 0.5, purely for plotting). Its advantages and disadvantages are summarized as follows:

Advantages:
  1. To address the Dead ReLU problem of the ReLU function, Leaky ReLU gives negative inputs a small non-zero slope; by removing the zero-gradient region for negative inputs, it greatly alleviates the Dead ReLU problem (a small comparison with ReLU follows this list);
  2. The output ranges from negative infinity to positive infinity, i.e. Leaky ReLU extends the range of ReLU; α is usually set to a small value such as 0.01;

Disadvantages:
  1. In theory this function should outperform ReLU, but a lot of practice has shown its benefit to be inconsistent, so it is not widely used.
  2. Because different linear functions are applied on the two sides of zero, it cannot provide a consistent relationship between the predictions for positive and negative input values.
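
A minimal sketch (assuming α = 0.01, the value suggested above) of how Leaky ReLU keeps a non-zero gradient where ReLU does not:

import numpy as np

alpha = 0.01                                    # small negative-side slope
x = np.array([-3.0, -1.0, -0.1, 0.5, 2.0])

relu_grad = np.where(x > 0, 1.0, 0.0)           # ReLU: zero gradient for x < 0
leaky_grad = np.where(x > 0, 1.0, alpha)        # Leaky ReLU: small but non-zero gradient

print(relu_grad)    # [0. 0. 0. 1. 1.]
print(leaky_grad)   # [0.01 0.01 0.01 1.   1.  ]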

ELU

The mathematical expression of the ELU function is as follows:

  ELU(x) = x                   if x > 0
  ELU(x) = α * (exp(x) - 1)    if x <= 0

  ELU is another variant of ReLU: when x is greater than 0, y = x; when x is less than or equal to 0, y = α(exp(x) - 1). It can be regarded as a function that sits between ReLU and Leaky ReLU. Its advantages and disadvantages are summarized as follows:

Advantages:
  1. ELU has most of the advantages of ReLU, has no Dead ReLU problem, and its mean output is close to 0;
  2. By reducing the bias-shift effect it brings the standard gradient closer to the unit natural gradient, and the near-zero mean accelerates learning;
  3. It has a saturation region for negative inputs, which gives it some robustness to noise (illustrated below);

Disadvantages:
  1. It is computationally heavier, since it involves an exponential;
  2. In practice it has not shown a consistent advantage over ReLU, so it is not widely used.
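
A minimal check of the negative saturation region mentioned in advantage 3 (assuming α = 1): for increasingly negative inputs the output flattens out near -α, so small perturbations (noise) in that region barely change the activation.

import numpy as np

alpha = 1.0
x = np.array([-1.0, -3.0, -6.0, -10.0])

elu = np.where(x > 0, x, alpha * (np.exp(x) - 1))
print(elu)        # approaches -alpha: about [-0.632, -0.950, -0.998, -1.000]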

Swish

The mathematical expression of the Swish function is as follows:

  Swish(x) = x * sigmoid(β * x) = x / (1 + exp(-β * x))

  Swish can be seen as a combination of ReLU and Sigmoid, where β is either a constant or a trainable parameter. Swish is unbounded above, bounded below, smooth, and non-monotonic, and it has been reported to outperform ReLU on deep models. Its advantages and disadvantages are summarized as follows:

Advantages:
  1. Swish shares some of the advantages of the ReLU function;
  2. Swish shares some of the advantages of the Sigmoid function;
  3. Swish can be regarded as a smooth function that interpolates between the linear function and ReLU (illustrated below).
Disadvantages:
  1. It is more expensive to compute and therefore slower.
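
A minimal sketch of advantage 3, assuming the definition above: as β shrinks towards 0, Swish approaches x/2 (a scaled linear function), and as β grows it approaches ReLU.

import numpy as np

def stable_sigmoid(z):
    e = np.exp(-np.abs(z))            # exp is only applied to non-positive values, so it never overflows
    return np.where(z >= 0, 1.0 / (1.0 + e), e / (1.0 + e))

def swish(x, beta):
    return x * stable_sigmoid(beta * x)       # x * sigmoid(beta * x)

x = np.linspace(-4, 4, 9)
print(np.allclose(swish(x, 1e-6), x / 2, atol=1e-4))                 # beta -> 0: close to the linear map x/2
print(np.allclose(swish(x, 100.0), np.maximum(0.0, x), atol=1e-4))   # large beta: close to ReLU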

Mish

The mathematical expression of the Mish function is as follows:

  Mish(x) = x * tanh(softplus(x)) = x * tanh(ln(1 + exp(x)))

  Mish is similar to the Swish activation function: it is unbounded above, bounded below, smooth, and non-monotonic, and it has been reported to outperform ReLU on deep models. Having no upper bound avoids the saturation that would otherwise be caused by very large activation values (the algebraic form used by the code in Section 3 is checked below).

Advantages:
  1. It is unbounded for positive values (positive values can reach any height), which avoids saturation due to capping; the slight allowance for negative values permits better gradient flow instead of a hard zero boundary as in ReLU.
  2. As a smooth activation function it lets information propagate more deeply into the network, which tends to give better accuracy and generalization.

Disadvantages:
  1. It is more expensive to compute than ReLU and uses noticeably more memory;
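
The code in Section 3 evaluates Mish through the identity tanh(ln(f)) = (f^2 - 1) / (f^2 + 1) with f = 1 + exp(x); a minimal check that this matches the definition above:

import numpy as np

x = np.linspace(-5, 5, 11)

mish_definition = x * np.tanh(np.log1p(np.exp(x)))    # x * tanh(softplus(x))

f = 1.0 + np.exp(x)
mish_identity = x * (f * f - 1.0) / (f * f + 1.0)     # form used by the code in Section 3

print(np.allclose(mish_definition, mish_identity))    # True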

Maxout

The mathematical expression of the Maxout function is as follows:

  Maxout(x) = max(w_1*x + b_1, w_2*x + b_2, ..., w_k*x + b_k)

  The essence of the Maxout activation is to take a max over several linear responses. It is sometimes called a universal activation function, because a Maxout network can approximate any continuous function, and when w_2, b_2, ..., w_k, b_k are all 0 it degenerates into ReLU. Maxout alleviates vanishing gradients and avoids ReLU's dying-neuron problem, but it increases the number of parameters and the amount of computation.
  In addition, Maxout is not a fixed function: unlike Sigmoid, ReLU or Tanh, it has no fixed functional form. It is a learnable activation function, because its W parameters are learned. A Maxout unit is not just a nonlinear map from the net input to the output; it learns the overall input-to-output nonlinear mapping, which can be regarded as a piecewise-linear approximation of an arbitrary convex function and is non-differentiable only at finitely many points (a small sketch of a single Maxout unit follows the list below).

Advantages:
  1. Maxout has a very strong fitting capacity and can fit any convex function.
  2. Maxout has all the advantages of ReLU: linearity and non-saturation.
  3. At the same time it avoids some of ReLU's drawbacks, such as dying neurons.

Disadvantages:
  1. As the formula shows, each neuron carries two (or more) sets of (w, b) parameters, so the parameter count is at least doubled, which leads to a surge in the total number of parameters.
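
Maxout is the one activation in this list that the code in Section 3 does not cover, since it needs its own weights; here is a minimal NumPy sketch of a single Maxout unit with k hypothetical linear pieces:

import numpy as np

rng = np.random.default_rng(0)

def maxout_unit(x, W, b):
    # x: (d,) input, W: (k, d) one weight row per linear piece, b: (k,) biases
    return np.max(W @ x + b)           # take the largest of the k linear responses

d, k = 3, 4                            # input dimension and number of pieces (hypothetical sizes)
W = rng.normal(size=(k, d))
b = rng.normal(size=k)

x = rng.normal(size=d)
print(maxout_unit(x, W, b))

# with all but the first piece zeroed out and w1 = e1, b1 = 0, the unit reduces to ReLU of x[0]
W_relu = np.zeros((2, d)); W_relu[0, 0] = 1.0
b_relu = np.zeros(2)
print(maxout_unit(x, W_relu, b_relu), max(0.0, x[0]))   # identical values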

hard-sigmoid

The mathematical expression of the hard-sigmoid function is as follows (one common form; the exact constants differ between implementations):

  hard-sigmoid(x) = clip(0.2 * x + 0.5, 0, 1)

  Hard-sigmoid is a piecewise-linear approximation of sigmoid. Its main advantage is that it is fast to compute, since no exponential is needed, so it is an option when speed matters. The concrete form varies between implementations; for example, PyTorch and TensorFlow use slightly different constants, but all variants are approximations of sigmoid (two common forms are sketched after the list below). In general it is faster to compute than sigmoid because there is no exponential operation.
Advantages:
  1. The output lies in [0, 1] and is monotonic and continuous, so it can be interpreted as a probability.
Disadvantages:
  1. The gradient saturates hard: outside the linear region the gradient is exactly 0, which can stop those units from learning.
  2. The output is not zero-mean, which is not conducive to optimization.
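
A minimal sketch of two commonly used hard-sigmoid variants; the exact constants here are the ones I believe older Keras and PyTorch use, so treat them as an assumption rather than a definitive reference. Both are coarse piecewise-linear fits to the true sigmoid:

import numpy as np

def hard_sigmoid_slope_02(x):
    # the 0.2*x + 0.5 form (the one used by the code in Section 3)
    return np.clip(0.2 * x + 0.5, 0.0, 1.0)

def hard_sigmoid_relu6(x):
    # the ReLU6(x + 3) / 6 form (the variant I believe PyTorch's Hardsigmoid uses)
    return np.clip(x + 3.0, 0.0, 6.0) / 6.0

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x = np.linspace(-6, 6, 7)
print(np.round(sigmoid(x), 3))
print(hard_sigmoid_slope_02(x))
print(hard_sigmoid_relu6(x))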

Hard-swish

The mathematical expression of the Hard-Swish function is as follows:

  hard-swish(x) = x * ReLU6(x + 3) / 6

  Hard-Swish first appeared in MobileNetV3, where the authors used hard-Swish to replace ReLU6 and hard-Sigmoid to replace the Sigmoid layer in the SE block. ReLU6 is replaced with h-Swish only in the second half of the network, because the authors found that the Swish family only shows its advantage in the deeper layers.

Problems solved:
  1. A piecewise-linear function replaces Swish, which improves computational efficiency.
  2. In the deep layers of the MobileNetV3 network, Hard-Swish replaces ReLU6, improving the accuracy of the model while keeping the computational cost low.

3. Activation function code

import matplotlib.pyplot as plt
import numpy as np


class ActivateFunc():
    def __init__(self, x, b=1, lamb=2, alpha=1, a=2):
        super(ActivateFunc, self).__init__()
        self.x = x
        self.b = b
        self.lamb = lamb
        self.alpha = alpha
        self.a = a

    def Sigmoid(self):
        y = np.exp(self.x) / (np.exp(self.x) + 1)
        y_grad = y*(1-y)
        return [y, y_grad]

    def Tanh(self):
        y = np.tanh(self.x)
        y_grad = 1 - y * y
        return [y, y_grad]

    def Swish(self): #b is a constant, specifying b
        y = self.x * (np.exp(self.b*self.x) / (np.exp(self.b*self.x) + 1))
        y_grad = np.exp(self.b*self.x)/(1+np.exp(self.b*self.x)) + self.x * (self.b*np.exp(self.b*self.x) / ((1+np.exp(self.b*self.x))*(1+np.exp(self.b*self.x))))
        return [y, y_grad]

    def ELU(self): # alpha is a constant, specify alpha
        y = np.where(self.x > 0, self.x, self.alpha * (np.exp(self.x) - 1))
        y_grad = np.where(self.x > 0, 1, self.alpha * np.exp(self.x))
        return [y, y_grad]

    def SELU(self):  # lamb is greater than 1, specify lamb and alpha
        y = np.where(self.x > 0, self.lamb * self.x, self.lamb * self.alpha * (np.exp(self.x) - 1))
        y_grad = np.where(self.x > 0, self.lamb*1, self.lamb * self.alpha * np.exp(self.x))
        return [y, y_grad]

    def ReLU(self):
        y = np.where(self.x < 0, 0, self.x)
        y_grad = np.where(self.x < 0, 0, 1)
        return [y, y_grad]

    def PReLU(self):    # a is greater than 1, specify a
        y = np.where(self.x < 0, self.x / self.a, self.x)
        y_grad = np.where(self.x < 0, 1 / self.a, 1)
        return [y, y_grad]

    def LeakyReLU(self):   # a is greater than 1, specify a
        y = np.where(self.x < 0, self.x / self.a, self.x)
        y_grad = np.where(self.x < 0, 1 / self.a, 1)
        return [y, y_grad]

    def Mish(self):
        f = 1 + np.exp(self.x)    # f = 1 + e^x, so tanh(softplus(x)) = (f*f - 1) / (f*f + 1)
        y = self.x * ((f*f-1) / (f*f+1))
        y_grad = (f*f-1) / (f*f+1) + self.x*(4*f*(f-1)) / ((f*f+1)*(f*f+1))
        return [y, y_grad]

    def ReLU6(self):
        y = np.where(np.where(self.x < 0, 0, self.x) > 6, 6, np.where(self.x < 0, 0, self.x))
        y_grad = np.where(self.x > 6, 0, np.where(self.x < 0, 0, 1))
        return [y, y_grad]

    def Hard_Swish(self):
        f = self.x + 3            # hard-swish(x) = x * ReLU6(x + 3) / 6
        relu6 = np.where(np.where(f < 0, 0, f) > 6, 6, np.where(f < 0, 0, f))
        relu6_grad = np.where(f > 6, 0, np.where(f < 0, 0, 1))
        y = self.x * relu6 / 6
        y_grad = relu6 / 6 + self.x * relu6_grad / 6
        return [y, y_grad]

    def Hard_Sigmoid(self):
        f = (2 * self.x + 5) / 10     # 0.2 * x + 0.5
        y = np.where(np.where(f > 1, 1, f) < 0, 0, np.where(f > 1, 1, f))   # clip f to [0, 1]
        y_grad = np.where(f > 0, np.where(f >= 1, 0, 1 / 5), 0)
        return [y, y_grad]



def PlotActiFunc(x, y, title):
    plt.grid(which='minor', alpha=0.2)
    plt.grid(which='major', alpha=0.5)
    plt.plot(x, y)
    plt.title(title)
    plt.show()


def PlotMultiFunc(x, y):
    plt.grid(which='minor', alpha=0.2)
    plt.grid(which='major', alpha=0.5)
    plt.plot(x, y)


if __name__ == '__main__':
    x = np.arange(-10, 10, 0.01)
    activateFunc = ActivateFunc(x)
    activateFunc.b = 1

    PlotActiFunc(x, activateFunc.Sigmoid()[0], title='Sigmoid')
    PlotActiFunc(x, activateFunc.Tanh()[0], title='Tanh')
    PlotActiFunc(x, activateFunc.ReLU()[0], title='ReLU')
    PlotActiFunc(x, activateFunc.LeakyReLU()[0], title='LeakyReLU')
    PlotActiFunc(x, activateFunc.ReLU6()[0], title='ReLU6')
    PlotActiFunc(x, activateFunc.Swish()[0], title='Swish')
    PlotActiFunc(x, activateFunc.Mish()[0], title='Mish')
    PlotActiFunc(x, activateFunc.ELU()[0], title='ELU')
    PlotActiFunc(x, activateFunc.Hard_Swish()[0], title='Hard_Swish')
    PlotActiFunc(x, activateFunc.Hard_Sigmoid()[0], title='Hard_Sigmoid')

    plt.figure(1)
    PlotMultiFunc(x, activateFunc.Swish()[0])
    PlotMultiFunc(x, activateFunc.Mish()[0])
    plt.legend(['Swish', 'Mish'])

    plt.figure(2)
    PlotMultiFunc(x, activateFunc.Swish()[0])
    PlotMultiFunc(x, activateFunc.Hard_Swish()[0])
    plt.legend(['Swish', 'Hard-Swish'])

    plt.figure(3)
    PlotMultiFunc(x, activateFunc.Sigmoid()[0])
    PlotMultiFunc(x, activateFunc.Hard_Sigmoid()[0])
    plt.legend(['Sigmoid', 'Hard-Sigmoid'])

    plt.figure(4)
    PlotMultiFunc(x, activateFunc.ReLU()[0])
    PlotMultiFunc(x, activateFunc.ReLU6()[0])
    plt.legend(['ReLU', 'ReLU6'])

    plt.show()

References:
  1. The most comprehensive: plotting Sigmoid, Tanh, Swish, ELU, SELU, ReLU, ReLU6, Leaky ReLU, Mish, hard-Sigmoid, hard-Swish and other activation functions in Python (with source code)
  2. Commonly used activation functions: a summary of the advantages and disadvantages of Sigmoid, Tanh, ReLU, Leaky ReLU, and ELU
