Batch Normalization

Jelal Sultanov
Sep 15, 2019


Nowadays, neural networks appear in almost every aspect of our lives, and they play a significant role in both science and business. The more complex the task gets, the more complex the network gets, and training such a network can require tons of time and effort.

Batch Normalization is a big part of why neural networks have become so practical to train today. Batch Norm enables the use of higher learning rates, accelerating the training process. It helps by making the data flowing between the intermediate layers of the network look like whitened data, which is what allows the higher learning rate. Since Batch Norm also has a regularizing effect, you can often remove dropout (which is helpful, as dropout usually slows down training).

Before Batch Norm, the standard methods to speed up training were limited to options such as:

- Increasing the learning rate

- Decreasing the number of parameters

Why do we need Batch Normalization?

One of the best methods to reduce the time required for training is Batch Normalization. Batch Normalization is a technique for improving the speed and performance of a neural network. It was first introduced in 2015 by researchers at Google (Ioffe and Szegedy). It is used to normalize the inputs to a layer by adjusting and scaling the activations.
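To make that concrete, here is a minimal sketch of where a Batch Norm layer typically sits in a network, written with PyTorch's `nn.BatchNorm1d`. The layer sizes and batch size are illustrative assumptions, not from the original post:

```python
import torch
import torch.nn as nn

# A small fully connected network with Batch Norm inserted
# between the linear layer and the nonlinearity.
# The sizes (784 -> 256 -> 10) are illustrative assumptions.
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),  # normalizes each of the 256 features over the batch
    nn.ReLU(),
    nn.Linear(256, 10),
)

x = torch.randn(32, 784)  # a mini-batch of 32 samples
y = model(x)              # in training mode, BatchNorm uses batch statistics
print(y.shape)            # torch.Size([32, 10])
```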

These are the properties of Batch Normalization, in its mini-batch mean-and-variance version:

  1. Faster learning: the learning rate can be increased compared to a non-batch-normalized version.
  2. Increased accuracy: flexibility in the mean and variance of every dimension in every hidden layer provides better learning, and hence better accuracy.
  3. Normalization or whitening of the inputs to each layer: zero mean and unit variance, though not decorrelated.
  4. Removing the ill effect of internal covariate shift: successive transformations make the data too big or too small, shifting each layer's input distribution away from normalization.
  5. Not getting stuck in the saturation regime: even if ReLU is not used.
  6. Integrating whitening into the gradient descent optimization: whitening decoupled from the training steps modifies the network directly and undermines the optimization, so the model can blow up when the normalization parameters are computed outside the gradient descent step.
  7. Whitening within gradient descent: requires the inverse square root of the covariance matrix, as well as its derivatives for backpropagation.
  8. Normalization of individual dimensions: each dimension of a hidden layer is normalized independently rather than via joint covariances, so features are not decorrelated.
  9. Normalization per mini-batch: the mean and variance are estimated on each mini-batch rather than on the entire training set. The joint covariance is ignored, since it would yield singular covariance matrices for such a small number of training samples per mini-batch compared to the high dimension of the hidden layer.
  10. Learning a scale and shift for every dimension: the scaled and shifted values are passed to the next layer, while the mean and variance are calculated from all the mini-batch activations of the current layer. The forward pass of all the samples in the mini-batch therefore proceeds layer by layer, and backpropagation provides the gradients of the weights as well as of the scale (variance) and shift (mean).
  11. Inference: at inference time, the moving averages of the mean and variance accumulated during mini-batch training are used (see the sketch after this list).
  12. Convolutional neural networks: whitening the intermediate layers, before or after the nonlinearity, opens up many new innovation pathways.
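The mini-batch statistics (item 9), the learned scale and shift (item 10), and the inference-time moving averages (item 11) can all be condensed into one short forward pass. A minimal NumPy sketch follows; the function name, epsilon, and momentum values are my assumptions (common defaults), not specified in the post:

```python
import numpy as np

def batch_norm_forward(x, gamma, beta, running_mean, running_var,
                       training=True, momentum=0.9, eps=1e-5):
    """Batch Norm over a mini-batch x of shape (batch_size, features)."""
    if training:
        # Per-dimension statistics over the mini-batch (item 9 above):
        mu = x.mean(axis=0)
        var = x.var(axis=0)
        # Update the moving averages used later at inference (item 11):
        running_mean = momentum * running_mean + (1 - momentum) * mu
        running_var = momentum * running_var + (1 - momentum) * var
    else:
        # Inference: use the moving averages, not the batch statistics.
        mu, var = running_mean, running_var

    x_hat = (x - mu) / np.sqrt(var + eps)  # normalize each dimension independently
    out = gamma * x_hat + beta             # learned scale and shift (item 10)
    return out, running_mean, running_var

# Usage: a batch of 4 samples with 3 features each.
x = np.random.randn(4, 3)
gamma, beta = np.ones(3), np.zeros(3)
out, rm, rv = batch_norm_forward(x, gamma, beta, np.zeros(3), np.ones(3))
```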

How does batch normalization work?

First of all, you should take a look at what covariate shift means. Covariate shift refers to a change in the distribution of the input values to a learning algorithm. In the context of deep learning, we are concerned with the change in the distribution of the inputs to the inner nodes of a neural network. A neural network updates the weights of each layer over the course of training, which means that the activations of each layer also change. Since each layer takes its input from the previous layer, every layer in the network faces the problem that its input distribution changes with each training step.

However, after the activation outputs are shifted and scaled this way, the weights in the next layer, initialized for the old distribution, are no longer optimal. Worse, SGD (stochastic gradient descent) will simply undo the normalization if doing so is a way for it to minimize the loss function.

Consequently, batch normalization adds two trainable parameters to each layer: the normalized output is multiplied by a "standard deviation" parameter (gamma) and shifted by a "mean" parameter (beta). In other words, batch normalization lets SGD do the denormalization by changing only these two weights for each activation, instead of losing the stability of the network by changing all the weights.

Simple Algebraic Formula of Batch Norm
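For reference, this is the mini-batch transform as defined in the original Batch Normalization paper (Ioffe and Szegedy, 2015), for a mini-batch of m values x_1, ..., x_m:

```latex
\mu_B = \frac{1}{m}\sum_{i=1}^{m} x_i
\qquad
\sigma_B^2 = \frac{1}{m}\sum_{i=1}^{m} (x_i - \mu_B)^2

\hat{x}_i = \frac{x_i - \mu_B}{\sqrt{\sigma_B^2 + \epsilon}}
\qquad
y_i = \gamma \hat{x}_i + \beta
```

Here epsilon is a small constant for numerical stability, and gamma and beta are the two trainable parameters described above.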
