Deep convolutional neural network (pooling functions for object detection and segmentation)
Date
2022
Publisher
UMT Lahore
Abstract
Deep Convolutional Neural Networks (DCNNs) deliver state-of-the-art performance in object detection, classification, and segmentation, and are used in a wide range of computer vision applications. Pooling is an important component of many DCNN architectures: pooling layers reduce the spatial size of the input to lower the architecture's resource consumption. The pooling functions commonly used in DCNNs are static and do not adapt to the feature maps being pooled. The goal of this work is to investigate the impact of adaptive pooling functions in a DCNN for detection and segmentation. This thesis explains the theoretical foundations of DCNNs, common modifications, and prominent architectural designs. The adaptive pooling algorithms Gated Pooling and Tree Pooling are then implemented as Keras layers: Gated Pooling learns a combination of Max and Average Pooling, while Tree Pooling learns both the pooling functions themselves and how to combine them. Additionally, it is explained how the MaskX R-CNN architecture is developed by extending an existing Mask R-CNN implementation. This entails constructing a weight transfer function from the bounding-box head to the mask head and combining class-specific and class-agnostic mask predictions. Following this approach, several MaskX R-CNN configurations are trained on all 80 categories of the Microsoft COCO dataset using these pooling methods. This thesis demonstrates that, when trained and tested on CIFAR-10 and CIFAR-100, using these adaptive pooling functions in a small CNN yields better performance than Max Pooling. Analysis of the Mask and MaskX R-CNN models reveals that the basic Mask R-CNN implementation produces somewhat better results. A MaskX R-CNN architecture trained and evaluated with Tree Pooling inside the ResNet backbone performs marginally worse than the MaskX R-CNN baseline, and the MaskX R-CNN whose heads used Gated Pooling had the worst performance of all the trained models. However, the performance of the MaskX R-CNN implementations may be explained by the fact that they have around 4.5 times as many parameters as the Mask R-CNN baseline while following the same training regimen.
Because the MaskX R-CNN architecture is more complex, it is expected that (1) a prolonged training schedule will improve the performance of both the baseline and the Tree Pooling implementation, and (2) with further training, Tree Pooling will eventually outperform the baselines, as suggested by the performance gains on CIFAR-10 and CIFAR-100.
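The gated-pooling idea the abstract describes can be sketched in a few lines of NumPy. This is a simplified single-window illustration, not the thesis's actual Keras layer: the function name `gated_pool` and the learned weight vector `w` are hypothetical, and in a real layer the gate weights would be trained by backpropagation rather than supplied by hand.

```python
import numpy as np

def gated_pool(region, w):
    """Gated pooling over one flattened pooling window.

    A sigmoid gate, computed from the window contents and a learned
    weight vector `w`, mixes the max- and average-pooled responses:
        out = gate * max(region) + (1 - gate) * mean(region)
    """
    gate = 1.0 / (1.0 + np.exp(-np.dot(w, region)))  # sigmoid gate in (0, 1)
    return gate * region.max() + (1.0 - gate) * region.mean()
```

With `w` at zero the gate is 0.5 and the result is the plain average of max and mean pooling; as training pushes the gate toward 0 or 1, the layer smoothly interpolates between average-like and max-like behaviour per window.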