Understand Semantic segmentation with the Fully Convolutional Network U-Net step-by-step

Pallawi
16 min read · May 31, 2019

What is semantic segmentation?

Semantic segmentation is a pixel-wise classification problem. If until now you have classified a whole image, or a set of pixels in it, as a Cat, Dog, Zebra, Human, etc., now is the time to learn how to assign a class to every single pixel in an image. This is made possible by pixel-level approaches such as semantic segmentation models and Mask R-CNN.

In this article, we will learn about semantic segmentation using a deep learning model which has performed exceedingly well in the field of biomedical image segmentation called U-Net.

The network can be trained end-to-end with very few images and outperforms the prior best methods of segmentation.

A few years back I built an application using OpenCV's GrabCut that performed semi-automatic (human-in-the-loop) segmentation, where my objective was to separate the foreground from the background in an image. It was workable then, but not a scalable solution.

Then my problem statement changed: segment multiple objects present in an image, and give the segmented objects the same label if they belong to the same class.

You can see in the above image, all the segmented buildings are assigned a red color, all the segmented trees are given a green color, all the segmented standing water is given a deep blue color. The pixels in the above image are classified into 9 classes. Every color in the above image represents a new class.

Solving the above problem statement was not feasible using GrabCut. Then I got to know about Semantic Segmentation and I tried it to solve multiple segmentation problems.

I have used it to segment handwritten blocks of words, standard documents in images taken against inconsistent backgrounds, and buildings on 2D satellite maps.

The deep learning model I used to perform semantic segmentation is U-Net, and in this blog we will understand everything about U-Net step-by-step. In my next article, which is part 2 of this article, I will help you train and test the U-Net model on your own data in Keras. Keeping that in mind, I have discussed the hyperparameters used in the original paper and how they differ from my implementation. You will find blocks of pseudocode in this article; I have added them so that you can visualize how the concepts can be coded. The complete code, data, results, analysis, and improvements are explained in detail in part 2 of this article.

What is U-Net?

U-Net was developed by Olaf Ronneberger et al. for biomedical image segmentation.

It is a fully convolutional neural network. It is named U-Net because of the shape of its architecture, which resembles the letter “U”. The architecture contains two paths: the left part of the “U” is called the encoder and the right part is called the decoder.

An encoder is a network that takes the input and outputs a feature map/vector/tensor. These feature vectors hold the information, the features, that represent the input.

The decoder is also a network; it takes the feature vector from the encoder and produces the closest match to the actual input or the intended output. It usually has the same structure as the encoder but in the opposite orientation.

As a result, your data has been compressed (encoded) into a few variables. From this hidden representation, the network tries to reconstruct (decode) the input again.

It is an end-to-end fully convolutional network (FCN), i.e. it only contains Convolutional layers and does not contain any Dense layer because of which it can accept the training and testing images of any size.

The image below is the architecture of the U-Net model. If you are finding it difficult to understand, then my dear friend, let me tell you that 90 percent of the people who see it for the first time find it difficult to understand.

I want you to stay with me. I will help you to understand this network in a very easy way.

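U-Net can take an image of any size, but common choices are 128x128x3 and 512x512x3; the original paper uses 572x572 input tiles together with unpadded convolutions. An even size is preferred because every max-pooling layer uses a 2x2 filter with stride 2, so a feature map with an odd height or width loses its boundary pixels when it is pooled. With padding='same' and four pooling steps, as in my implementation, this means the height and width should be divisible by 16.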
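A quick way to check a candidate size is to trace it through the pooling steps. The sketch below is a hypothetical helper of my own (it is not part of Keras or the paper). It assumes four 2x2 poolings with stride 2, and for padding="valid" it assumes two unpadded 3x3 convolutions before each pooling, as in the original architecture.

```python
def pooled_sizes(size, num_pools=4, padding="same"):
    """Sizes of the feature maps that enter each 2x2, stride-2 max-pooling."""
    sizes = []
    for _ in range(num_pools):
        if padding == "valid":
            # two unpadded 3x3 convolutions each trim 2 pixels
            size -= 4
        sizes.append(size)
        # pooling halves the map; an odd size loses its last row/column
        size //= 2
    return sizes


def pools_cleanly(size, num_pools=4, padding="same"):
    """True if every pooling step sees an even-sized feature map."""
    return all(s % 2 == 0 for s in pooled_sizes(size, num_pools, padding))


for size in (128, 512, 572):
    print(size, pooled_sizes(size), pools_cleanly(size))
print(572, pooled_sizes(572, padding="valid"),
      pools_cleanly(572, padding="valid"))
```

Under these assumptions 128 and 512 stay even at every pooling step with padding="same". 572 does not (it reaches 143 at the third pooling), while 572 does stay even with unpadded convolutions.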

U-Net Architecture

Left part of the architecture (left part of “U”)

Let us first understand the left part of the “U” shaped network.

In the above image, the size of the input image is 572x572 and its mask is of the same size. The input is followed by two consecutive convolution layers, each using 64 filters of size 3x3.

from keras.layers import Input, Lambda, Conv2D, Dropout

IMG_HEIGHT = 572
IMG_WIDTH = 572
IMG_CHANNELS = 3
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
# scale pixel values from [0, 255] to [0, 1]
s = Lambda(lambda x: x / 255) (inputs)
c1 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (s)
c1 = Dropout(0.1) (c1)
c1 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c1)

The right image is the input image (a 2D satellite image) and the left is its mask (the labelled buildings in the image). Both are fed to the U-Net model during training: the image as the input and the mask as the target. The above image is a binary classification problem where we have only two classes, building and background.

The activation function used by the authors is ReLU. I have used ELU (Exponential Linear Unit) as the activation function in my implementation.

The person is showing relu dance move in the left part of the image and ELU dance move in the right part of the image

In contrast to ReLUs, ELUs have negative values, which allows them to push mean unit activations closer to zero. Mean activations closer to zero speed up learning because they bring the gradient closer to the unit natural gradient. ELUs lead not only to faster learning but also to better generalization performance once networks have many layers (≥ 5). The model contains 23 convolutional layers.
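To see the difference on actual numbers, here is a minimal sketch of both activations in plain Python. The sample inputs are made up for illustration, and alpha=1.0 is an assumed default for the ELU scale.

```python
import math


def relu(x):
    """ReLU: negative inputs are clipped to zero."""
    return max(0.0, x)


def elu(x, alpha=1.0):
    """ELU: negative inputs saturate smoothly towards -alpha."""
    return x if x > 0 else alpha * (math.exp(x) - 1.0)


inputs = [-2.0, -0.5, 0.0, 0.5, 2.0]
relu_out = [relu(x) for x in inputs]
elu_out = [elu(x) for x in inputs]

print("ReLU:", relu_out, "mean =", sum(relu_out) / len(relu_out))
print("ELU: ", elu_out, "mean =", sum(elu_out) / len(elu_out))
```

For these inputs the ELU outputs keep some negative values, so their mean sits closer to zero than the mean of the ReLU outputs.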

We use a weight initialization method called He initialization. If you do not specify the weight initialization method explicitly in Keras, it uses Xavier initialization, also known as Glorot initialization. The aim of all these weight initializers is to find a good variance for the distribution from which the initial parameters are drawn. This variance is adapted to the activation function. In fact, in the Glorot paper a uniform distribution is used, whereas in the He paper a Gaussian one is chosen. Recent deep CNNs are mostly initialized with random weights drawn from Gaussian distributions.

We use a dropout layer in between these two convolution layers. It helps prevent overfitting and co-adaptation of the features. Here, Dropout(0.1) means that 10 percent of the units are dropped at random during each training pass. By dropping a unit out, I mean temporarily removing it from the network, along with all its incoming and outgoing connections. The choice of which units to drop is random. In the simplest case, each unit is retained with a fixed probability p independent of other units, where p can be chosen using a validation set or can simply be set at 0.5, which seems to be close to optimal for a wide range of networks and tasks. Note that the value passed to the Keras Dropout layer is the fraction that is dropped, i.e. 1 - p. No parameters are learned in the dropout layer. If you want to dig deep into dropout for research, I would recommend that you first read the motivation section of the dropout paper.
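A minimal sketch of what a dropout layer does at training time, in plain Python. It assumes the common "inverted dropout" variant, where the surviving values are scaled up by 1 / (1 - rate) so the expected activation stays the same. The function name and the toy activations are my own, for illustration only.

```python
import random


def dropout(values, rate, rng):
    """Inverted dropout: zero each value with probability `rate`,
    and rescale the survivors by 1 / (1 - rate)."""
    keep = 1.0 - rate
    return [v / keep if rng.random() < keep else 0.0 for v in values]


rng = random.Random(0)          # fixed seed so the run is repeatable
activations = [1.0] * 10000     # toy activations, all equal to 1
dropped = dropout(activations, rate=0.1, rng=rng)

fraction_zero = dropped.count(0.0) / len(dropped)
print("fraction of units dropped:", fraction_zero)
print("mean activation after dropout:", sum(dropped) / len(dropped))
```

With rate=0.1, roughly 10 percent of the units come out as zero, and the mean activation stays close to 1 because of the rescaling.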

Then comes the max-pooling layer. You can see a maroon arrow pointing down in the above image. That is where the feature map is downsampled using a max-pooling layer. Unlike the convolution layers, where parameters are learned, the max-pooling layer is only responsible for reducing the dimensions of the feature map in a meaningful way. In our case, we perform max-pooling with a window size of 2x2, which means we keep the maximum value in every 2x2 window while iterating through the feature maps. I would direct you to one of my blogs if you wish to understand max-pooling in more detail.
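Here is a small sketch of 2x2 max-pooling with stride 2 on a single-channel feature map, written with plain Python lists. The 4x4 values are made up for illustration.

```python
def max_pool_2x2(feature_map):
    """Keep the maximum of every non-overlapping 2x2 window."""
    rows = len(feature_map) // 2
    cols = len(feature_map[0]) // 2
    return [
        [
            max(
                feature_map[2 * r][2 * c],
                feature_map[2 * r][2 * c + 1],
                feature_map[2 * r + 1][2 * c],
                feature_map[2 * r + 1][2 * c + 1],
            )
            for c in range(cols)
        ]
        for r in range(rows)
    ]


feature_map = [
    [1, 3, 2, 4],
    [5, 6, 1, 2],
    [7, 2, 9, 1],
    [3, 4, 0, 8],
]
pooled = max_pool_2x2(feature_map)
print(pooled)  # [[6, 4], [7, 9]]
```

The 4x4 map becomes 2x2: each output value is the largest value of its 2x2 window, and no weights are involved.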

IMG_HEIGHT=572
IMG_WIDTH = 572
IMG_CHANNELS = 3
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)
c1 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (s)
c1 = Dropout(0.1) (c1)
c1 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)
c2 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p1)
c2 = Dropout(0.1) (c2)
c2 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c2)

The above image is the left part of the network. You can see that I have given numbers from 1 to 5. Call every number a block. A block such as block 1 consists of 2 convolution layers, 1 dropout layer, and 1 max-pooling layer. The encoder (left part) has 10 convolution layers to extract features. For your ease, you can count the 10 blue arrows in the above image.

One important point to keep in mind is that the number of filters doubles as we move from block 1 to block 5, i.e. 64, 128, 256, 512, 1024 respectively, but the size of the filters is always 3x3.

IMG_HEIGHT=572
IMG_WIDTH = 572
IMG_CHANNELS = 3
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)
c1 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (s)
c1 = Dropout(0.1) (c1)
c1 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)
c2 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p1)
c2 = Dropout(0.1) (c2)
c2 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)
c3 = Conv2D(256, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p2)
c3 = Dropout(0.2) (c3)
c3 = Conv2D(256, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)
c4 = Conv2D(512, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p3)
c4 = Dropout(0.2) (c4)
c4 = Conv2D(512, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)
c5 = Conv2D(1024, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p4)
c5 = Dropout(0.3) (c5)
c5 = Conv2D(1024, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c5)

To understand the right part of the architecture, which is the decoder, we must look at the whole architecture, because a very important step called concatenation happens in the decoder part of U-Net: feature maps from the encoder are concatenated to the feature maps on the decoder side for better learning. By better learning, I mean learning better contextual features. In image processing, contextual information means the relationship between two or more pixels in an image. The initial layers of a convolutional network learn low-level features such as edges, and carrying those features over to the decoder helps us preserve edges and many other details in the final output.

In the above image, you can see that output from block 1 is concatenated to the input of block 9. In the same way, the output from block 2 is concatenated to the input of block 8, and also 3 and 7, 4 and 6. These skip connections from earlier layers in the network (prior to a downsampling operation) should provide the necessary detail in order to reconstruct accurate shapes for segmentation boundaries. Indeed, we can recover more fine-grain detail with the addition of these skip connections.

If you look at the right part of the network starting from the bottom, you will find a green upward arrow. That arrow stands for a transposed convolution layer (sometimes called deconvolution).

The need for transposed convolutions generally arises from the desire to use a transformation going in the opposite direction of a normal convolution, i.e., from something that has the shape of the output of some convolution to something that has the shape of its input, while maintaining a connectivity pattern that is compatible with said convolution. In U-Net, this is what lets the decoder produce an output of similar size to the input.

Image shows how feature maps from the encoder are upsampled using Transpose convolution
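To make the upsampling concrete, here is a minimal single-channel sketch of a transposed convolution with a 2x2 kernel and stride 2, the same kernel size and stride used by the Conv2DTranspose layers in this article. It is only an illustration: the kernel values are made up (in the network they are learned), there is no bias, and real layers work on many channels at once.

```python
def conv_transpose_2x2_stride2(feature_map, kernel):
    """Each input pixel is multiplied by the 2x2 kernel and written
    into its own 2x2 block of the output, so height and width double."""
    rows, cols = len(feature_map), len(feature_map[0])
    out = [[0.0] * (2 * cols) for _ in range(2 * rows)]
    for r in range(rows):
        for c in range(cols):
            for kr in range(2):
                for kc in range(2):
                    out[2 * r + kr][2 * c + kc] += feature_map[r][c] * kernel[kr][kc]
    return out


small = [[1.0, 2.0],
         [3.0, 4.0]]
kernel = [[1.0, 0.5],
          [0.5, 0.25]]   # hypothetical weights, normally learned

up = conv_transpose_2x2_stride2(small, kernel)
for row in up:
    print(row)
```

The 2x2 input becomes a 4x4 output. Because the stride equals the kernel size, the 2x2 blocks do not overlap.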

I found this very interesting link which talks about how deconvolution helps in improving the image quality, you might want to read it.

There are four blocks in the right part of the U-Net architecture. Every block has one transposed convolution layer, one concatenate layer, two convolution layers, and one dropout layer. Let us take an example: in block 6 you can find an upward green arrow for the transposed convolution layer, then a concatenation with the feature map of block 4, then two blue arrows for the convolutions, and in between the two we have the dropout layer.

You can see there is no max-pooling layer. The reason is that we want the output to replicate the input size and as we know here, the input is an image and its mask so the expected output is also a mask. If we use max-pooling the size of the output will decrease.

In total the network has 23 convolutional layers. At the final layer, a 1x1 convolution is used to map each 64-component feature vector to the desired number of classes.
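A 1x1 convolution is just a weighted sum over the channels of each pixel. The sketch below shows this for the binary case with a sigmoid on top. To keep it short it uses 3 channels per pixel instead of 64, and the weights are hypothetical values chosen for illustration.

```python
import math


def conv_1x1_sigmoid(feature_map, weights, bias=0.0):
    """feature_map[r][c] is the list of channel values of one pixel.
    Returns one probability per pixel."""
    return [
        [
            1.0 / (1.0 + math.exp(-(sum(w * f for w, f in zip(weights, pixel)) + bias)))
            for pixel in row
        ]
        for row in feature_map
    ]


# a 1x2 "image" with 3 channels per pixel (64 in the real network)
features = [[[0.0, 0.0, 0.0], [1.0, 2.0, 3.0]]]
weights = [1.0, -1.0, 0.5]   # hypothetical learned weights

probs = conv_1x1_sigmoid(features, weights)
print(probs)
```

Every pixel is handled independently, which is why the spatial size of the mask is unchanged by this layer.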

To allow a seamless tiling of the output segmentation map, it is important to select the input tile size such that all 2x2 max-pooling operations are applied to a layer with an even x- and y-size.

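At the end of the network, we get an image mask as the output. In the original paper the convolutions are unpadded, so the output image is smaller than the input by a constant border width. In my implementation the convolutions use padding='same', so the output mask has the same height and width as the input.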

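from keras.layers import Input, Lambda, Conv2D, Dropout, MaxPooling2D, Conv2DTranspose, concatenate

# With padding='same' and four 2x2 poolings, the height and width must be
# divisible by 16; otherwise the upsampled maps do not match the encoder maps
# in concatenate(). 572 is not divisible by 16, so 512 is used here.
IMG_HEIGHT = 512
IMG_WIDTH = 512
IMG_CHANNELS = 3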
inputs = Input((IMG_HEIGHT, IMG_WIDTH, IMG_CHANNELS))
s = Lambda(lambda x: x / 255) (inputs)
c1 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (s)
c1 = Dropout(0.1) (c1)
c1 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c1)
p1 = MaxPooling2D((2, 2)) (c1)
c2 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p1)
c2 = Dropout(0.1) (c2)
c2 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c2)
p2 = MaxPooling2D((2, 2)) (c2)
c3 = Conv2D(256, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p2)
c3 = Dropout(0.2) (c3)
c3 = Conv2D(256, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c3)
p3 = MaxPooling2D((2, 2)) (c3)
c4 = Conv2D(512, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p3)
c4 = Dropout(0.2) (c4)
c4 = Conv2D(512, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c4)
p4 = MaxPooling2D(pool_size=(2, 2)) (c4)
c5 = Conv2D(1024, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (p4)
c5 = Dropout(0.3) (c5)
c5 = Conv2D(1024, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c5)
u6 = Conv2DTranspose(512,(2, 2), strides=(2, 2),padding='same') (c5)
u6 = concatenate([u6, c4])
c6 = Conv2D(512, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u6)
c6 = Dropout(0.2) (c6)
c6 = Conv2D(512, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c6)
u7 = Conv2DTranspose(256,(2, 2), strides=(2, 2),padding='same') (c6)
u7 = concatenate([u7, c3])
c7 = Conv2D(256, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u7)
c7 = Dropout(0.2) (c7)
c7 = Conv2D(256, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c7)
u8 = Conv2DTranspose(128,(2, 2), strides=(2, 2),padding='same') (c7)
u8 = concatenate([u8, c2])
c8 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u8)
c8 = Dropout(0.1) (c8)
c8 = Conv2D(128, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c8)
u9 = Conv2DTranspose(64,(2, 2), strides=(2, 2),padding='same') (c8)
u9 = concatenate([u9, c1], axis=3)
c9 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (u9)
c9 = Dropout(0.1) (c9)
c9 = Conv2D(64, (3, 3), activation='elu', kernel_initializer='he_normal', padding='same') (c9)
outputs = Conv2D(1, (1, 1), activation='sigmoid') (c9)

Loss functions

I have used the binary cross-entropy loss function in my implementation. The job of a loss function is to measure the difference between the values predicted by the model being trained and the expected values. This difference is called the loss; the lower it is, the better. The behavior of the loss tells the training procedure how the model must be adjusted so that the loss is reduced.
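As a small illustration, binary cross-entropy averaged over pixels can be sketched in plain Python as follows. The flattened masks and predictions are made-up numbers, and the eps clipping is an assumption I added to avoid log(0).

```python
import math


def binary_cross_entropy(y_true, y_pred, eps=1e-7):
    """Mean binary cross-entropy over a flat list of pixels."""
    total = 0.0
    for t, p in zip(y_true, y_pred):
        p = min(max(p, eps), 1.0 - eps)   # keep log() finite
        total += -(t * math.log(p) + (1 - t) * math.log(1.0 - p))
    return total / len(y_true)


mask = [1, 0, 1]                      # 1 = building, 0 = background
good_prediction = [0.9, 0.1, 0.8]
bad_prediction = [0.2, 0.7, 0.3]

good_loss = binary_cross_entropy(mask, good_prediction)
bad_loss = binary_cross_entropy(mask, bad_prediction)
print("good:", good_loss, "bad:", bad_loss)
```

The prediction that agrees with the mask gets the lower loss, which is exactly the signal training uses.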

Research on loss function for pixel-wise classification suggests the below types and combination of losses:

1. Distance-weighted cross-entropy, explained in the famous U-Net paper. It is strongly recommended to read this paper.

2. Using a linear combination of soft dice and distance-weighted cross-entropy. The dice coefficient performs better on class-imbalanced problems.

3. Adding a component weighted by building size (smaller buildings have greater weight) to the weighted cross-entropy, which penalizes misclassification of pixels belonging to small objects.
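To give a feel for the soft dice term mentioned in point 2, here is a minimal sketch in plain Python. The smoothing constant of 1.0 is an assumption (a common choice to avoid division by zero), and the function is my own illustration, not the code from the referenced implementations.

```python
def soft_dice_loss(y_true, y_pred, smooth=1.0):
    """1 - dice coefficient, computed on a flat list of pixels.
    y_pred may hold probabilities, which is what makes it 'soft'."""
    intersection = sum(t * p for t, p in zip(y_true, y_pred))
    dice = (2.0 * intersection + smooth) / (sum(y_true) + sum(y_pred) + smooth)
    return 1.0 - dice


mask = [1, 1, 0, 0]
perfect = soft_dice_loss(mask, [1.0, 1.0, 0.0, 0.0])
partial = soft_dice_loss(mask, [0.8, 0.6, 0.3, 0.1])
wrong = soft_dice_loss(mask, [0.0, 0.0, 1.0, 1.0])
print(perfect, partial, wrong)
```

The loss depends on the overlap between prediction and mask rather than on the count of correctly labelled background pixels, which is why it is often preferred when the foreground class is rare.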

Loss function in practice — visualization

The three points above yield a fairly complex loss function, so it can be tricky to see how it works in practice. Below is a visualization of the weights, specifically:

  • distance weights: high values correspond to pixels between buildings.
  • size weights: high values denote small buildings (the smaller the building, the darker the color). Note that no-building is fixed to black.

(for both weights: darker color denotes higher value)

The 1st column is the input image, the 2nd column shows the masks, the 3rd column visualizes the distances between buildings (darker color means a higher value), and the 4th column visualizes the weight assigned to each roof (smaller roofs are assigned higher values; the background is fixed to black).

Optimizers

I have used Adam optimizer in my implementation. Adam is different from classical stochastic gradient descent. Stochastic gradient descent maintains a single learning rate (termed alpha) for all weight updates and the learning rate does not change during training. In Adam, the learning rate is maintained for each network weight (parameter) and separately adapts as learning unfolds.

Adam combines the advantages of two other extensions of stochastic gradient descent:

  • Adaptive Gradient Algorithm (AdaGrad) that maintains a per-parameter learning rate that improves performance on problems with sparse gradients (e.g. natural language and computer vision problems).
  • Root Mean Square Propagation (RMSProp) that also maintains per-parameter learning rates that are adapted based on the average of recent magnitudes of the gradients for the weight (e.g. how quickly it is changing). This means the algorithm does well on online and non-stationary problems (e.g. noisy).
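A single Adam update for one parameter can be sketched as below. The hyperparameter defaults (beta1=0.9, beta2=0.999, eps=1e-8) are the commonly used ones. The toy problem, minimizing f(x) = x², and the learning rate of 0.05 are my own choices for illustration.

```python
import math


def adam_step(param, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single scalar parameter.
    m, v are the running averages of the gradient and squared gradient;
    t is the 1-based step count used for bias correction."""
    m = beta1 * m + (1.0 - beta1) * grad
    v = beta2 * v + (1.0 - beta2) * grad * grad
    m_hat = m / (1.0 - beta1 ** t)
    v_hat = v / (1.0 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(v_hat) + eps)
    return param, m, v


# toy problem: minimize f(x) = x^2, whose gradient is 2x
x, m, v = 5.0, 0.0, 0.0
for t in range(1, 2001):
    x, m, v = adam_step(x, 2.0 * x, m, v, t, lr=0.05)
print("x after 2000 steps:", x)
```

Because the step is divided by the running magnitude of the gradient, each parameter effectively gets its own step size, which is the per-parameter behaviour described above.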

Batch size, epochs

To minimize the overhead and make maximum use of the GPU memory, the paper favors large input tiles over a large batch size and hence reduces the batch to a single image.

I have used a batch size of 16 and 50 epochs in my implementation. I have also used early stopping. Early stopping rules provide guidance on how many iterations can be run before the model begins to overfit.

Example: you have specified Epochs = 50 and an early-stopping patience of 5. This means that your model trains for at most 50 epochs, but if the validation loss does not decrease for 5 consecutive epochs, training is stopped and the model is saved. This helps prevent overfitting.
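The rule in the example can be sketched in plain Python. The list of validation losses below is made up and stands in for real training, and the function name is hypothetical. In Keras you would normally rely on its early-stopping callback rather than writing this loop yourself.

```python
def train_with_early_stopping(val_losses, max_epochs=50, patience=5):
    """Return (last_epoch_run, best_epoch) for a sequence of validation
    losses, stopping after `patience` epochs without improvement."""
    best_loss = float("inf")
    best_epoch = 0
    wait = 0
    last_epoch = 0
    for epoch, loss in enumerate(val_losses[:max_epochs], start=1):
        last_epoch = epoch
        if loss < best_loss:
            best_loss, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                break
    return last_epoch, best_epoch


# validation loss improves for 3 epochs, then gets worse
losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.64, 0.65, 0.5]
stopped_at, best = train_with_early_stopping(losses, max_epochs=50, patience=5)
print("stopped at epoch", stopped_at, "best epoch", best)
```

Here training stops at epoch 8, five epochs after the best validation loss at epoch 3, so the later value of 0.5 is never reached.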

Weight Initialization

In deep networks with many convolutional layers and different paths through the network, a good initialization of the weights is extremely important. Otherwise, parts of the network might give excessive activations, while other parts never contribute.

Ideally, the initial weights should be adapted such that each feature map in the network has approximately unit variance.

For a network with this architecture (alternating convolution and ReLU layers), this can be achieved by drawing the initial weights from a Gaussian distribution with a standard deviation of

√(2/N)

where N denotes the number of incoming nodes of one neuron.

E.g. for a 3x3 convolution and 64 feature channels in the previous layer

N = 9 x 64 = 576
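The U-Net paper gives this standard deviation as sqrt(2 / N). A tiny sketch of the arithmetic for the example above, with a helper name of my own choosing:

```python
import math


def he_std(kernel_height, kernel_width, in_channels):
    """Standard deviation sqrt(2 / N), where N is the number of
    incoming nodes of one neuron."""
    n = kernel_height * kernel_width * in_channels
    return math.sqrt(2.0 / n)


std = he_std(3, 3, 64)   # N = 9 * 64 = 576
print(std)
```

For N = 576 this gives a standard deviation of roughly 0.059.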

Why use U-Net?

  1. Propagates contextual information

“Contextual” means that the approach is focusing on the relationship of the nearby pixels, which is also called a neighborhood. The goal of this approach is to classify the images by using contextual information.

As the image illustrates, if only a small portion of the image is shown (leftmost), it is very difficult to tell what the image is about. Even with another portion of the image, it is still difficult to classify. However, as we increase the context of the image, it becomes easier to recognize. With the full image (rightmost), almost everyone can classify it easily.

In the upsampling part, the architecture has a large number of feature channels, which allow the network to propagate context information to higher resolution layers.

Prediction of the segmentation in the yellow area requires image data within the blue area as input. Missing input data is extrapolated by mirroring

To predict the pixels in the border region of the image, the missing context is extrapolated by mirroring the input image. This tiling strategy is important to apply the network to large images since otherwise the resolution would be limited by the GPU memory.
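Mirroring can be sketched with plain Python lists. This version assumes the reflection does not repeat the edge pixel itself; the helper names are my own, and the pad width must be smaller than the image size.

```python
def mirror_pad_1d(row, pad):
    """Reflect a row at both ends, e.g. [1, 2, 3, 4] with pad=2
    becomes [3, 2, 1, 2, 3, 4, 3, 2]."""
    left = row[1:pad + 1][::-1]
    right = row[-pad - 1:-1][::-1]
    return left + row + right


def mirror_pad_2d(image, pad):
    """Mirror-pad every row, then mirror the rows themselves."""
    rows = [mirror_pad_1d(r, pad) for r in image]
    top = [list(r) for r in rows[1:pad + 1][::-1]]
    bottom = [list(r) for r in rows[-pad - 1:-1][::-1]]
    return top + rows + bottom


image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]
for row in mirror_pad_2d(image, pad=1):
    print(row)
```

The 3x3 image grows to 5x5, and the new border pixels are copies of real image content rather than zeros, so the network sees plausible context near the edges.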

2. Separation of touching objects of the same class

The authors propose the use of a weighted loss, where the separating background labels between touching cells obtain a large weight in the loss function.

3. Fully convolutional network
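In the U-Net paper this weight is w(x) = w_c(x) + w0 · exp(−(d1(x) + d2(x))² / (2σ²)), where w_c balances the class frequencies, d1 is the distance to the border of the nearest cell and d2 the distance to the border of the second nearest cell. The paper sets w0 = 10 and σ ≈ 5 pixels. A minimal sketch for a single pixel follows; the function name and the sample distances are my own.

```python
import math


def unet_pixel_weight(class_weight, d1, d2, w0=10.0, sigma=5.0):
    """Loss weight of one pixel: class-balancing weight plus a border
    term that is large when two objects are close to the pixel."""
    return class_weight + w0 * math.exp(-((d1 + d2) ** 2) / (2.0 * sigma ** 2))


between_cells = unet_pixel_weight(1.0, d1=1.0, d2=1.0)   # narrow gap
open_background = unet_pixel_weight(1.0, d1=50.0, d2=50.0)
print(between_cells, open_background)
```

A background pixel squeezed between two objects gets a weight close to class_weight + w0, while a pixel far from any object keeps just its class weight.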

It is an end-to-end fully convolutional network (FCN), i.e. it only contains Convolutional layers and does not contain any Dense layer because of which it can accept the training and testing images of any size.

Conclusion:

Understanding any deep learning model could be tough in the beginning but we must never give up on good things and deep learning models are good things.

So my dear friends when I was introduced to U-Net I had no idea of how things were working but soon when I had an excellent motive to work with it I was unstoppable.

I took small steps every day, did a lot of research on the encoder, decoder, loss functions, optimizers, fully convolution networks. I trained, tweaked the U-Net model.

No doubt I faced a lot of issues but yes I taught myself lots of things during the journey. Today, my motive is to share what I have learned. In my next blog, I will help you code this model in Keras and train and test on your own data. Kindly give your feedback if this blog helped you, that would really encourage me and help me improve my work.


Pallawi

Computer Vision contributor. Lead Data Scientist @https://www.here.com/ Love Data Science.