In the last post, we talked about some good ways to prevent overfitting and improve the performance of a neural network. One of them was normalizing the data to speed up gradient descent. In this article, I am going to cover some other ways to improve the performance of gradient descent for DNNs.
Gradient descent is by far the most common way to optimize neural networks, so it goes without saying how much it influences a network's performance. This post summarizes the main variants, drawing mainly on the course from Deeplearning.ai and the overview paper listed in the references. Note that this post skips the basic explanation of gradient descent.
1. Basic Gradient Descent
1.1 Batch Gradient Descent
\[\theta = \theta - \alpha\cdot \nabla J(\theta;x,y)\]
This is the most straightforward method: we update the parameters once per pass over the whole training set, averaging the gradients over all training examples.
- It is guaranteed to converge to the global minimum for a convex error surface, and to a local minimum otherwise
- But if the training set is large, each update is very slow, and it is hard to handle streaming data
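As a concrete illustration, here is a minimal NumPy sketch of batch GD on a hypothetical linear-regression problem with MSE loss (the data, the `grad` function, and the learning rate are all made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))                    # made-up training data
y = X @ np.array([1.0, -2.0, 0.5])               # made-up targets

def grad(theta, Xb, yb):
    """Gradient of the MSE loss (1/m)*||Xb @ theta - yb||^2."""
    return (2.0 / len(yb)) * Xb.T @ (Xb @ theta - yb)

theta, alpha = np.zeros(3), 0.1
for _ in range(200):                             # one update per full pass over the data
    theta = theta - alpha * grad(theta, X, y)
```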
1.2 Stochastic Gradient Descent
\[\theta = \theta - \alpha \cdot \nabla J(\theta;x^{(i)},y^{(i)})\] SGD solves the slowness of batch GD by updating the parameters for every single training example.
- Much faster and can also be used to learn online
- It performs frequent updates with high variance, which causes the objective function to fluctuate heavily. These fluctuations can help SGD jump to a better local minimum, but they also prevent it from settling at the exact minimum
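Under the same hypothetical setup as the batch GD sketch above, SGD only changes the update loop, taking one step per example:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
grad = lambda th, Xb, yb: (2.0 / len(yb)) * Xb.T @ (Xb @ th - yb)

theta, alpha = np.zeros(3), 0.01
for epoch in range(20):
    for i in range(len(y)):                      # one update per training example
        theta = theta - alpha * grad(theta, X[i:i+1], y[i:i+1])
```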
1.3 Mini-Batch Gradient Descent
\[\theta = \theta - \alpha \cdot \nabla J(\theta;x^{(i:i+n)},y^{(i:i+n)})\] Mini-batch gradient descent works in between SGD and batch GD: it updates the parameters once per mini-batch of \(n\) training examples.
- This reduces the variance of the parameter updates
- It avoids redundant computation and makes use of highly optimized matrix operations
- In practice, the term SGD often refers to mini-batch GD
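A sketch of the mini-batch version under the same hypothetical setup; note that a batch size of 1 recovers SGD, and a batch size of the full training set recovers batch GD:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])
grad = lambda th, Xb, yb: (2.0 / len(yb)) * Xb.T @ (Xb @ th - yb)

theta, alpha, n = np.zeros(3), 0.05, 16          # n is the mini-batch size
for epoch in range(50):
    for s in range(0, len(y), n):                # one update per mini-batch
        theta = theta - alpha * grad(theta, X[s:s+n], y[s:s+n])
# n = 1 recovers SGD; n = len(y) recovers batch GD
```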
2. Gradient Descent Optimizations
There are two directions for improving gradient descent. The first is to adaptively change the gradient direction, for which we will introduce momentum; the second is to adaptively change the learning rate (the magnitude of the step), for which we will introduce some adaptive learning-rate algorithms.
2.1 GD with Momentum
\[v_t = \gamma v_{t-1} + \alpha \cdot \nabla J(\theta) \] \[\theta = \theta - v_t\]
The theory behind it is the exponentially weighted average. The intuition is that we give the gradients memory: the update grows for dimensions whose gradients keep pointing in the same direction and shrinks for dimensions whose gradients change direction. (The parameter vector builds up velocity in any direction that has a consistent gradient.)
- \(\gamma\) is the momentum term, which controls how much you depend on previous gradients (typically 0.9)
- Leads to faster convergence and less fluctuation
- Needs bias correction at the beginning, since \(v_0 = 0\)
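A minimal sketch of the momentum update, using a made-up quadratic objective \(J(\theta) = \lVert\theta - \theta^*\rVert^2\) so the gradient is trivial to write down:

```python
import numpy as np

theta_star = np.array([1.0, -2.0, 0.5])          # made-up optimum
grad = lambda th: 2.0 * (th - theta_star)        # gradient of ||th - theta_star||^2

theta, v = np.zeros(3), np.zeros(3)
alpha, gamma = 0.05, 0.9                         # gamma is the momentum term
for _ in range(200):
    v = gamma * v + alpha * grad(theta)          # velocity accumulates consistent gradients
    theta = theta - v
```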
2.2 NAG
There is also an improved momentum method called Nesterov Accelerated Gradient (NAG). The only difference is that the gradient is computed at the predicted next parameter values: \[v_t = \gamma v_{t-1} + \alpha \cdot \nabla J(\theta - \gamma v_{t-1}) \] \[\theta = \theta - v_t\]
- It prevents us from going too fast (overshooting) and results in significantly better performance on RNNs
- The intuition is that \(\theta - \gamma v_{t-1}\) approximates the next parameter values, so it makes sense to compute the gradient there instead of at the current position
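The NAG sketch differs from the momentum sketch above in a single line: the gradient is evaluated at the look-ahead point (same made-up quadratic objective):

```python
import numpy as np

theta_star = np.array([1.0, -2.0, 0.5])
grad = lambda th: 2.0 * (th - theta_star)

theta, v = np.zeros(3), np.zeros(3)
alpha, gamma = 0.05, 0.9
for _ in range(200):
    v = gamma * v + alpha * grad(theta - gamma * v)   # gradient at the look-ahead point
    theta = theta - v
```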
2.3 RMS Prop
\[v_t= \gamma v_{t-1}+(1-\gamma)\nabla J(\theta)^2\] \[\theta = \theta - \alpha\cdot \frac{\nabla J(\theta)}{\sqrt{v_t}+ \epsilon}\]
- RMSprop and Adadelta were both developed to resolve Adagrad's radically diminishing learning rates
- It modulates the learning rate of each weight based on the magnitudes of its gradients. The intuition is that weights receiving large gradients have their effective learning rate scaled down, while weights receiving small gradients have it scaled up, so we perform larger updates for infrequent parameters and smaller updates for frequent ones
- It works well with sparse data
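A minimal RMSprop sketch on the same made-up quadratic objective:

```python
import numpy as np

theta_star = np.array([1.0, -2.0, 0.5])
grad = lambda th: 2.0 * (th - theta_star)

theta, v = np.zeros(3), np.zeros(3)
alpha, gamma, eps = 0.01, 0.9, 1e-8
for _ in range(1000):
    g = grad(theta)
    v = gamma * v + (1 - gamma) * g ** 2              # running average of squared gradients
    theta = theta - alpha * g / (np.sqrt(v) + eps)    # per-parameter scaled step
```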
2.4 Adaptive Moment Estimation (Adam)
\[m_t = \beta_1 m_{t-1} + (1-\beta_1) \nabla J(\theta)\] \[v_t = \beta_2 v_{t-1} + (1-\beta_2) \nabla J(\theta)^2\] Here \(m_t\) and \(v_t\) are estimates of the first and second moments of the gradients, respectively. Since both are initialized at 0, they are biased toward zero early on, so we apply bias correction: \[\hat{m}_t = \frac{m_t}{1-\beta_1^t}\] \[\hat{v}_t = \frac{v_t}{1-\beta_2^t}\] Finally: \[\theta = \theta - \frac{\alpha}{\sqrt{\hat{v}_t}+ \epsilon}\hat{m}_t\]
- A combination of momentum and RMSprop: we get adaptive learning rates while also keeping an exponentially decaying average of past gradients
- Suggested default values of 0.9 for \(\beta_1\), 0.999 for \(\beta_2\)
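A minimal Adam sketch on the same made-up quadratic objective, using the suggested defaults:

```python
import numpy as np

theta_star = np.array([1.0, -2.0, 0.5])
grad = lambda th: 2.0 * (th - theta_star)

theta = np.zeros(3)
m, v = np.zeros(3), np.zeros(3)
alpha, beta1, beta2, eps = 0.01, 0.9, 0.999, 1e-8
for t in range(1, 1001):                          # t starts at 1 for bias correction
    g = grad(theta)
    m = beta1 * m + (1 - beta1) * g               # first-moment estimate
    v = beta2 * v + (1 - beta2) * g ** 2          # second-moment estimate
    m_hat = m / (1 - beta1 ** t)                  # bias-corrected moments
    v_hat = v / (1 - beta2 ** t)
    theta = theta - alpha * m_hat / (np.sqrt(v_hat) + eps)
```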
3. Other Techniques
Besides innovations in the gradient descent algorithm itself, there are several other techniques that can help improve its performance.
3.1 Learning Rate Decay
Ideally, at the start of training we want a large learning rate to take big steps, and later a smaller one, which is good for convergence. There are several methods to decay the rate (a small sketch follows the list):
- Step decay: reduce the learning rate by some factor every few epochs
- Exponential decay: \(\alpha = \alpha_0 e^{-kt}\), where \(\alpha_0, k\) are hyperparameters and \(t\) is the iteration number
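A small sketch of both decay schedules (all constants are made-up examples):

```python
import numpy as np

alpha0, k = 0.1, 0.01                             # made-up hyperparameters

def alpha_exp(t):
    """Exponential decay: alpha0 * e^(-k*t)."""
    return alpha0 * np.exp(-k * t)

def alpha_step(epoch, factor=0.5, every=10):
    """Step decay: multiply by `factor` every `every` epochs."""
    return alpha0 * factor ** (epoch // every)

print(alpha_exp(0), alpha_exp(100))               # 0.1 -> ~0.0368
print(alpha_step(0), alpha_step(25))              # 0.1 -> 0.025
```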
3.2 Shuffle and Partition
When using mini-batch gradient descent, we should shuffle the data and repartition it into mini-batches every epoch (see the sketch after this list), since:
- It prevents the model from learning the order of the training data
- It prevents any bias during training
- It helps the model converge faster
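A minimal sketch of reshuffling and repartitioning each epoch (the data and batch size are made up):

```python
import numpy as np

def shuffled_minibatches(X, y, batch_size, rng):
    """Yield mini-batches from a fresh random permutation each epoch."""
    idx = rng.permutation(len(y))
    for s in range(0, len(y), batch_size):
        part = idx[s:s + batch_size]
        yield X[part], y[part]

rng = np.random.default_rng(0)
X, y = rng.normal(size=(100, 3)), rng.normal(size=100)   # made-up data
for epoch in range(3):
    for Xb, yb in shuffled_minibatches(X, y, 16, rng):
        pass                                     # one gradient update per mini-batch goes here
```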
3.3 Epoch
Strictly speaking, an epoch is not a technique: it means one pass of the entire dataset forward and backward through the neural network. But I still want to talk about it here, first because it is a hyperparameter for gradient descent, and also because it confused me at first.
My confusion was: why do we use more than one epoch?
Intuition: gradient descent is an iterative process. There may be enough information in the training samples, but the algorithm takes time to extract it. Each epoch finds an approximate descent direction based on the current parameters; after the update, the next epoch extracts new information based on the updated parameters.
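A small made-up example shows the effect: the loss keeps dropping across epochs even though each epoch sees exactly the same data:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, -2.0, 0.5])

theta = np.zeros(3)
for epoch in range(5):                            # repeated passes over the same data
    for s in range(0, len(y), 16):
        Xb, yb = X[s:s+16], y[s:s+16]
        theta = theta - 0.05 * (2.0 / len(yb)) * Xb.T @ (Xb @ theta - yb)
    print(f"epoch {epoch}: loss {np.mean((X @ theta - y) ** 2):.4f}")
```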
3.4 Batch Normalization
We already know that feature normalization speeds up optimization, but that normalization is lost inside the hidden layers. To solve this problem, we introduce batch normalization: \[Z^{[l]}_{new} = \gamma Z_{norm}^{[l]}+ B\] where \(\gamma\) and \(B\) are learnable through gradient descent. So we use \(Z^{[l]}_{new}\) in place of \(Z^{[l]}\) and then use backprop to compute \(dW, dB, d\gamma\).
The reason we don't simply want every layer to keep \(\mu=0\) and \(\sigma = 1\) is that if all activations stay close to 0, the activation function behaves almost linearly (remember that sigmoid and tanh are nearly linear for values close to 0).
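A minimal sketch of the batch-norm forward pass described above (shapes and initial values are illustrative assumptions):

```python
import numpy as np

def batchnorm_forward(Z, gamma, B, eps=1e-5):
    """Normalize pre-activations Z (batch, units) per mini-batch,
    then rescale and shift with the learnable gamma and B."""
    mu = Z.mean(axis=0)
    var = Z.var(axis=0)
    Z_norm = (Z - mu) / np.sqrt(var + eps)
    return gamma * Z_norm + B

# made-up mini-batch of 32 examples with 4 hidden units
Z = np.random.default_rng(0).normal(size=(32, 4))
gamma, B = np.ones(4), np.zeros(4)                # initialized so Z_new starts as Z_norm
Z_new = batchnorm_forward(Z, gamma, B)
```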
Why batch normalization works:
- It reduces the problem of the input distribution of each layer changing (called covariate shift), which allows each layer to learn somewhat independently of the others. In other words, it makes the weights of deeper layers less dependent on the weights learned in shallower layers
- Each mini-batch is scaled by the mean/variance of that batch, which adds some noise, forcing the next layer not to rely too much on any one previous hidden unit (a slight regularization effect)
- Similar to feature normalization, it makes the cost function in each layer more symmetric
Reference:
- DeepLearning.ai, Deep Learning Specialization
- An overview of gradient descent optimization algorithms
- https://towardsdatascience.com/epoch-vs-iterations-vs-batch-size-4dfb9c7ce9c9
- http://cs231n.github.io/neural-networks-3/#gradcheck