Batch Normalization and Dropout

Discover how batch normalization and dropout improve a model's accuracy.

Batch normalization

If you open any introductory machine learning textbook, you will find the idea of input scaling. Training a model with gradient descent on non-normalized input features is undesirable.

Let’s start with an intuitive example to understand why we want normalization inside any model.

Suppose you have an input feature $x_1$ in the range $[0, 10000]$ and another feature $x_2$ in the range $[0, 1]$. Any linear combination would effectively ignore $x_2$: $x_1 w_1 + x_2 w_2 \approx x_1 w_1$, since our weights are initialized in a very tiny range like $[-1, 1]$.
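To make this concrete, here is a minimal NumPy sketch (the feature values and weight initialization are illustrative assumptions, not from the lesson) showing how the large-scale feature dominates the weighted sum:

```python
import numpy as np

rng = np.random.default_rng(0)

x1 = rng.uniform(0, 10_000, size=1000)   # feature on a large scale, [0, 10000]
x2 = rng.uniform(0, 1, size=1000)        # feature on a small scale, [0, 1]
w1, w2 = rng.uniform(-1, 1, size=2)      # weights initialized in a tiny range

# The x1 term is typically orders of magnitude larger than the x2 term,
# so the linear combination x1*w1 + x2*w2 is dominated by x1.
print("mean |x1 * w1|:", np.abs(x1 * w1).mean())
print("mean |x2 * w2|:", np.abs(x2 * w2).mean())
```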

We encounter the same issue inside the layers of deep neural networks. In this lesson, we will extend this idea to the layers inside the NN.

If we think out of the box, any intermediate layer is conceptually the same as the input layer; it accepts features and transforms them.
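Since an intermediate layer simply accepts features and transforms them, we can normalize its activations the same way we would normalize the inputs. Below is a minimal PyTorch sketch of this idea (the channel counts and batch shape are arbitrary choices for illustration), placing a batch-normalization layer between a convolution and its nonlinearity:

```python
import torch
import torch.nn as nn

# A tiny block: conv -> batch norm -> nonlinearity.
# The channel counts (3 -> 16) are illustrative assumptions.
block = nn.Sequential(
    nn.Conv2d(3, 16, kernel_size=3, padding=1),
    nn.BatchNorm2d(16),   # normalizes the 16 feature channels over the batch
    nn.ReLU(),
)

x = torch.randn(8, 3, 32, 32)   # a batch with shape [N, C, H, W]
out = block(x)
print(out.shape)                # torch.Size([8, 16, 32, 32])
```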

Notations

Throughout this lesson, $N$ will be the batch size, $H$ will refer to the height, $W$ to the width, and $C$ to the feature channels. The Greek letter $\mu(\cdot)$ refers to the mean and the Greek letter $\sigma(\cdot)$ to the standard deviation.

The batch features are denoted by $x$ with a shape of $[N, C, H, W]$.

$x \in \mathbb{R}^{N \times C \times H \times W}$
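As a sketch of how this notation maps to code (PyTorch is assumed here; the sizes are made up), $\mu$ and $\sigma$ can be computed per channel by reducing over the batch and spatial dimensions $N$, $H$, and $W$:

```python
import torch

N, C, H, W = 8, 16, 32, 32
x = torch.randn(N, C, H, W)      # batch features, shape [N, C, H, W]

mu = x.mean(dim=(0, 2, 3))       # per-channel mean  mu(x), shape [C]
sigma = x.std(dim=(0, 2, 3))     # per-channel std   sigma(x), shape [C]
print(mu.shape, sigma.shape)     # torch.Size([16]) torch.Size([16])
```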

We will visualize the 4D activation maps $x$ by merging the spatial dimensions. Now, we have a 3D shape that looks like this:
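A minimal sketch of that merge (assuming the same $[N, C, H, W]$ tensor shape as above): flattening $H$ and $W$ into a single axis gives a 3D view of the activations.

```python
import torch

N, C, H, W = 8, 16, 32, 32
x = torch.randn(N, C, H, W)

x_3d = x.reshape(N, C, H * W)   # merge the spatial dimensions into one axis
print(x_3d.shape)               # torch.Size([8, 16, 1024])
```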
