Image segmentation can be defined as a classification task at the pixel level. An image consists of many pixels, which are grouped together to form the different elements in the image. A method of classifying these pixels into elements is called semantic image segmentation.
Visual Representation of Semantic Segmentation
To properly understand the semantic segmentation task, let's look at the visual representation shown below, wherein the input image is mapped to an output of per-pixel semantic labels.
This is the method proposed in the paper 'Fully Convolutional Networks for Semantic Segmentation'. It is a simple encoder-decoder style convolutional network consisting of convolutional layers, pooling layers, and activation layers. However, instead of the dense layers usually placed at the end of a classification network, the paper proposes a 1x1 convolution layer, which allows the classification net to output a heat map. This gives the added advantage of accepting an input image of any size, since the network is no longer constrained to a fixed input size.
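Here is a minimal sketch of this idea, assuming PyTorch (the article names no framework); the layer sizes and the 21-class output are illustrative, not the paper's exact configuration:

```python
import torch
import torch.nn as nn

# A small convolutional feature extractor standing in for the encoder
features = nn.Sequential(
    nn.Conv2d(3, 64, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
    nn.Conv2d(64, 128, kernel_size=3, padding=1),
    nn.ReLU(inplace=True),
    nn.MaxPool2d(2),
)
num_classes = 21
# 1x1 convolution in place of fully connected layers: per-location class scores
classifier = nn.Conv2d(128, num_classes, kernel_size=1)

x = torch.randn(1, 3, 224, 224)        # one input size...
heatmap = classifier(features(x))      # -> (1, 21, 56, 56) class heat map
x2 = torch.randn(1, 3, 320, 480)       # ...and a different one also works,
heatmap2 = classifier(features(x2))    # -> (1, 21, 80, 120)
```

Because the 1x1 convolution slides over whatever spatial grid the features produce, no fixed-size flattening step is needed, which is exactly what frees the network from a fixed input size.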
The encoder part of the network downsamples the image, and the decoder part upsamples it. Downsampling provides the network with information about the image's features. In contrast, upsampling brings the image back to its original resolution, since in semantic segmentation the network's output should be a segmentation map of the same size as the input. The decoder module upsamples using learned deconvolution layers instead of simple interpolation techniques like bilinear or bicubic, as learned layers can model non-linear upsampling and provide better results.
The main network proposed by the paper is FCN-32, i.e., it downsamples the image by 32 times and then upsamples it back in a single 32x step. One of the major drawbacks of this architecture is that 32x upsampling results in a coarse (not very smooth) output due to loss of information. That is why the paper proposes two more architectures, FCN-16 and FCN-8. FCN-16 fuses the final prediction with the scores from the previous pooling layer, so only a 16x upsampling remains and less information is lost. Likewise, FCN-8 uses scores from the two previous pooling layers along with the final layer, leaving an upsampling of only eight times.
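The following is a rough sketch of this FCN-8-style fusion, again assuming PyTorch; the score maps and shapes are illustrative stand-ins, not the paper's code:

```python
import torch
import torch.nn as nn

num_classes = 21
score_final = torch.randn(1, num_classes, 7, 10)    # stride-32 prediction
score_pool4 = torch.randn(1, num_classes, 14, 20)   # stride-16 pooling-layer scores
score_pool3 = torch.randn(1, num_classes, 28, 40)   # stride-8 pooling-layer scores

# Learned 2x and 8x upsampling layers (deconvolutions)
up2_a = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
up2_b = nn.ConvTranspose2d(num_classes, num_classes, 4, stride=2, padding=1)
up8 = nn.ConvTranspose2d(num_classes, num_classes, 16, stride=8, padding=4)

fused16 = up2_a(score_final) + score_pool4  # fuse with pool4 (the FCN-16 step)
fused8 = up2_b(fused16) + score_pool3       # fuse with pool3 (the FCN-8 step)
output = up8(fused8)                        # final 8x upsampling -> (1, 21, 224, 320)
```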
U-Net is a popular semantic segmentation architecture, rightly named after its U-shaped structure. As we saw, the FCN architecture cannot reconstruct the image smoothly with fine details, so U-Net proposes an important architectural update in the form of 'skip connections.' Skip connections are direct links between every downsampling layer and the corresponding upsampling layer, propagating more information for accurate reconstruction. In the decoder, upsampling is performed with deconvolution layers, and the feature maps propagated from the corresponding encoder layers are concatenated with the upsampled feature maps. This helps to recover fine-grained features from the early encoder layers, which eventually generates segmentation maps with accurate shapes and boundaries.
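A minimal sketch of one such decoder step, assuming PyTorch and illustrative channel counts:

```python
import torch
import torch.nn as nn

up = nn.ConvTranspose2d(256, 128, kernel_size=2, stride=2)  # learned 2x upsampling
conv = nn.Sequential(
    nn.Conv2d(256, 128, kernel_size=3, padding=1),  # 128 (skip) + 128 (upsampled)
    nn.ReLU(inplace=True),
)

decoder_in = torch.randn(1, 256, 32, 32)    # feature map from the layer below
encoder_skip = torch.randn(1, 128, 64, 64)  # matching encoder feature map

x = up(decoder_in)                       # -> (1, 128, 64, 64)
x = torch.cat([x, encoder_skip], dim=1)  # skip connection by concatenation
out = conv(x)                            # -> (1, 128, 64, 64)
```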
To further modify the architecture for improved performance, residual blocks and dense blocks can be used in place of plain stacked convolutional layers. Residual blocks introduce short skip connections within the block itself, alongside the main long skip connections. This helps to address the problem of vanishing gradients, allows faster convergence, and improves information flow.
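A minimal residual block sketch, assuming PyTorch:

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    def __init__(self, channels):
        super().__init__()
        self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
        self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
        self.relu = nn.ReLU(inplace=True)

    def forward(self, x):
        out = self.conv2(self.relu(self.conv1(x)))
        return self.relu(out + x)  # short skip connection within the block

y = ResidualBlock(64)(torch.randn(1, 64, 32, 32))  # shape is preserved
```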
In dense blocks, every layer is densely connected to the others, i.e., each layer receives the concatenation of the outputs of all previous layers as its input. This structure is particularly useful for semantic segmentation, as both low-level and high-level features are fed forward simultaneously, leading to the retention of important feature information.
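A minimal dense block sketch, assuming PyTorch, with an illustrative growth rate:

```python
import torch
import torch.nn as nn

class DenseBlock(nn.Module):
    def __init__(self, in_channels, growth_rate, num_layers):
        super().__init__()
        # layer i receives all previously produced feature maps as input
        self.layers = nn.ModuleList(
            nn.Conv2d(in_channels + i * growth_rate, growth_rate, 3, padding=1)
            for i in range(num_layers)
        )

    def forward(self, x):
        features = [x]
        for layer in self.layers:
            # concatenate everything seen so far, then produce new features
            features.append(layer(torch.cat(features, dim=1)))
        return torch.cat(features, dim=1)

out = DenseBlock(64, growth_rate=32, num_layers=4)(torch.randn(1, 64, 32, 32))
# out has 64 + 4*32 = 192 channels, mixing low- and high-level features
```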
The variants of DeepLab introduced an interesting set of strategies that helped reduce computation and parameter counts while achieving better accuracy than previous models.
DeepLab v1 introduced the concept of atrous convolutions, also known as dilated convolutions. Atrous convolutions can capture larger semantic context from the image while keeping the number of parameters small. They reduce the need for excessive downsampling with multiple max-pooling layers, since the same receptive field can be covered with fewer layers. For example, a dilated convolution with a 3x3 kernel inserts gaps (zeros) between the kernel parameters so that the kernel covers a 5x5 region, although the number of parameters remains 3x3. The spacing is controlled by the dilation rate: a rate of 1 is a standard convolution, a rate of 2 inserts one gap between adjacent kernel parameters (a 3x3 kernel covering a 5x5 region), a rate of 3 inserts two gaps, and so on.
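A minimal sketch comparing a standard and an atrous convolution, assuming PyTorch:

```python
import torch
import torch.nn as nn

standard = nn.Conv2d(64, 64, kernel_size=3, padding=1)            # 3x3 receptive field
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=2, dilation=2)  # 5x5 receptive field

x = torch.randn(1, 64, 56, 56)
print(standard(x).shape, atrous(x).shape)  # same output size: (1, 64, 56, 56)
# both layers hold 64*64*3*3 weights; dilation adds context, not parameters
```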
Furthermore, DeepLab introduced an additional post-processing step with Conditional Random Fields (CRF). Usually, when multiple max-pooling layers are used, the network becomes invariant to minor input changes, which results in coarse boundaries. Using CRFs, the model detects object shapes more accurately, as the boundaries become sharper than before. The CRF classifies each pixel by considering both its own label and the labels of its neighboring pixels.
DeepLab v1 uses a series of atrous convolutions, and the image is downsampled by 8x. For upsampling, bilinear interpolation is used, as it incurs no additional parameters while still giving fair accuracy. DeepLab v2 further modified the architecture by introducing Atrous Spatial Pyramid Pooling (ASPP). This concept was first introduced in the SPPNet paper, where it helped capture multiscale information. The main advantage of Spatial Pyramid Pooling (SPP) is that multiscale information can be obtained from a single image, rather than feeding the image at different resolutions multiple times and computing feature maps for each.
The idea of capturing information at multiple scales is applied with atrous convolutions. The input feature map is convolved with 3x3 kernels at different dilation rates - 6, 12, 18, and 24 - providing image feature information at multiple scales. The outputs are concatenated, and a 1x1 convolution is applied to them. The final output comprises the dilated outputs, the 1x1 convolution on these dilated outputs, and the output of a Global Average Pooling layer applied to the feature map to capture the global context.
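A rough ASPP sketch, assuming PyTorch, with the dilation rates from the text and an illustrative channel layout (not the exact DeepLab module):

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ASPP(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(6, 12, 18, 24)):
        super().__init__()
        # a 1x1 branch plus one atrous branch per dilation rate
        self.branches = nn.ModuleList(
            [nn.Conv2d(in_ch, out_ch, 1)] +
            [nn.Conv2d(in_ch, out_ch, 3, padding=r, dilation=r) for r in rates]
        )
        self.pool_branch = nn.Conv2d(in_ch, out_ch, 1)  # applied after global pooling
        self.project = nn.Conv2d(out_ch * (len(rates) + 2), out_ch, 1)

    def forward(self, x):
        h, w = x.shape[-2:]
        outs = [b(x) for b in self.branches]
        pooled = self.pool_branch(F.adaptive_avg_pool2d(x, 1))  # global context
        outs.append(F.interpolate(pooled, size=(h, w), mode="bilinear",
                                  align_corners=False))
        # concatenate all scales, then fuse with a 1x1 convolution
        return self.project(torch.cat(outs, dim=1))

y = ASPP(256, 64)(torch.randn(1, 256, 28, 28))  # -> (1, 64, 28, 28)
```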
DeepLab v3 further aimed to capture richer semantic information and achieve sharper object boundaries. In the previous DeepLab versions, the image was reconstructed with bilinear upsampling; taking inspiration from encoder-decoder networks like U-Net, DeepLab v3+ incorporates a decoder module instead. These architectural modifications also remove the need for the CRF post-processing step used in previous versions. In DeepLab v3+, the ResNet encoder is replaced by a modified Xception architecture, and depthwise separable convolutions are used along with the atrous convolutions.
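A minimal depthwise separable convolution sketch, assuming PyTorch and illustrative channel counts:

```python
import torch
import torch.nn as nn

# depthwise: one 3x3 filter per channel (groups = number of channels)
depthwise = nn.Conv2d(128, 128, kernel_size=3, padding=1, groups=128)
# pointwise: a 1x1 convolution that mixes channels
pointwise = nn.Conv2d(128, 256, kernel_size=1)

x = torch.randn(1, 128, 56, 56)
out = pointwise(depthwise(x))  # -> (1, 256, 56, 56)
# weights: 128*3*3 + 128*256, versus 128*256*3*3 for a standard 3x3 convolution
```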
The ASPP module used in the DeepLab architecture improved results, as multiscale information could be obtained from a single image with less computation. However, such a network does not make the kernels generalizable, since each dilated kernel processes the input separately and no information is shared across the parallel branches. The major issues with ASPP were:
- Kernels with small atrous rates capture fine details but miss information about larger semantic classes. Similarly, kernels with large dilation rates are good at capturing global context but leave out the smaller details.
- During training, smaller objects contribute mainly to the small-dilation kernels and larger objects to the large-dilation ones, so each kernel effectively sees only part of the training data.
- Using parallel branches with separate kernels makes the number of parameters grow linearly with the number of branches.
Therefore, to tackle all these challenges, the method of Kernel-Sharing Atrous Convolution (KSAC) is proposed.
KSAC shares a single kernel across multiple branches with different atrous rates. The kernel thus scans the image multiple times at different scales, capturing both local and global details, and the learned information is shared across all the parallel branches.
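A rough sketch of the kernel-sharing idea, assuming PyTorch; this is not the authors' code, and the choice to fuse the branch outputs by summation (as well as the rates used) is an illustrative assumption:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class KSACBranches(nn.Module):
    def __init__(self, in_ch, out_ch, rates=(1, 6, 12, 18)):
        super().__init__()
        self.rates = rates
        # a single 3x3 kernel shared by every branch
        self.weight = nn.Parameter(torch.randn(out_ch, in_ch, 3, 3) * 0.01)

    def forward(self, x):
        # apply the same kernel at each dilation rate and fuse the outputs;
        # adding more branches adds no new parameters
        return sum(F.conv2d(x, self.weight, padding=r, dilation=r)
                   for r in self.rates)

y = KSACBranches(256, 64)(torch.randn(1, 256, 28, 28))  # -> (1, 64, 28, 28)
```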
During training, each kernel also effectively sees more samples than before, which improves generalization. Furthermore, this sharing mechanism resolves the problem of parameter growth. Employing this method for semantic segmentation resulted in a significant improvement of 3.67% mIoU over the DeepLab v3+ model.