Google AI researchers propose ‘CMT-DeepLab’: a transformer-based machine learning framework for panoptic segmentation designed around clustering


Panoptic segmentation (Pan = Everything and Optics = Vision) is a computer vision technique that divides an image into separate segments, one per object or region, and then classifies each segment. In visualizations, the segments are usually drawn in different colors. What sets panoptic segmentation apart from other segmentation techniques is that it offers a holistic, unified view of segmentation: a single approach is enough, rather than two separate ones. CMT-DeepLab is a framework designed by researchers at Google Research to simplify building a panoptic segmentation model by moving from proxy-based systems to end-to-end systems, with transformers improving each part of the pipeline. The main idea is to take the input image and predict each object in it, then create a binary mask prediction for each object by combining the pixel features with mask embedding vectors, which yields high-resolution masks.
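The mask prediction step described above can be illustrated with a small sketch. It assumes that each pixel is scored against every mask embedding with a dot product and then assigned to the best-scoring one; the function name `predict_masks` and the toy vectors are hypothetical and are not taken from the paper.

```python
def predict_masks(pixel_features, mask_embeddings):
    """Assign each pixel to the mask embedding it matches best.

    pixel_features:  list of N feature vectors, one per pixel
    mask_embeddings: list of K vectors, one per predicted object
    Returns K binary masks, each a list of N zeros and ones.
    """
    num_masks = len(mask_embeddings)
    masks = [[0] * len(pixel_features) for _ in range(num_masks)]
    for i, feature in enumerate(pixel_features):
        # Dot-product affinity between this pixel and every mask embedding.
        scores = [sum(f * e for f, e in zip(feature, emb))
                  for emb in mask_embeddings]
        best = max(range(num_masks), key=lambda k: scores[k])
        masks[best][i] = 1
    return masks


# Toy data (assumed for illustration): 4 pixels, 2 objects, 2-D features.
pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.2, 0.8]]
embeddings = [[1.0, 0.0], [0.0, 1.0]]
masks = predict_masks(pixels, embeddings)
print(masks)  # [[1, 1, 0, 0], [0, 0, 1, 1]]
```

Every pixel ends up in exactly one mask, which is what lets a single set of predictions cover the whole image.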

Transformers are a relatively recent architecture for sequence-to-sequence tasks that handles long-range dependencies well. In computer vision they are used either on their own or combined with CNNs (Convolutional Neural Networks), and they have greatly improved performance on many vision tasks.

Proxy-based panoptic segmentation: the task is split into surrogate sub-tasks. Instance segmentation (itself a combination of box detection and box-based segmentation) and semantic segmentation are solved by two separate networks, such as Mask R-CNN and an FCN (Fully Convolutional Network), and their outputs are merged. This tends to give inaccurate and inconsistent results.

End-to-end panoptic segmentation: with the introduction of a mask transformer, the masks and their class labels are predicted directly, which produces more accurate and reliable results.

The first stage starts with the pixel encoder, which extracts features from the image. These features are passed to the pixel decoder, which enhances them with the help of transformers and uses upsampling layers to produce high-resolution features.

The problem is that the transformer architecture was designed for object detection, not for object segmentation. To overcome this shortcoming, the authors turn to clustering, looking at softmax operations applied along different dimensions with the aim of grouping the pixels that are most similar to each object query. In cross-attention, softmax is applied over the spatial dimension of the image. For the final output, softmax is applied over the object queries, so that each pixel finds the query it is most similar to.

This approach has two issues. First, the object queries are only sparsely updated, because the softmax is taken over a very large spatial dimension. Second, the pixel features are updated only once, so pixels have a single chance to refresh their information. The cluster center update addresses these problems by updating the cluster features from the pixel features grouped according to the cluster assignment (C denotes the cluster centers), which greatly improves the performance of the framework.
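The difference between the two softmax directions, and the cluster center update built on top of them, can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions: affinities are plain dot products, the updated center is an assignment-weighted sum of pixel features, and the learned projections and normalization of the real model are left out. All function names are hypothetical.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def affinity(pixel_features, cluster_centers):
    """Dot-product logits, shaped N pixels x K clusters."""
    return [[sum(f * c for f, c in zip(feat, center))
             for center in cluster_centers]
            for feat in pixel_features]


def spatial_attention(logits):
    """Cross-attention style: softmax over the N pixels, per cluster (K x N)."""
    num_clusters = len(logits[0])
    return [softmax([row[k] for row in logits]) for k in range(num_clusters)]


def cluster_assignment(logits):
    """Clustering style: softmax over the K clusters, per pixel (N x K)."""
    return [softmax(row) for row in logits]


def update_centers(pixel_features, assignment):
    """Assumed update: center k = sum of pixel features weighted by assignment."""
    num_clusters = len(assignment[0])
    dim = len(pixel_features[0])
    centers = [[0.0] * dim for _ in range(num_clusters)]
    for feat, weights in zip(pixel_features, assignment):
        for k, weight in enumerate(weights):
            for d in range(dim):
                centers[k][d] += weight * feat[d]
    return centers


# Toy data (assumed for illustration): 3 pixels, 2 cluster centers.
pixels = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
centers = [[1.0, 0.0], [0.0, 1.0]]

logits = affinity(pixels, centers)
attention = spatial_attention(logits)    # each cluster spreads weight over pixels
assignment = cluster_assignment(logits)  # each pixel spreads weight over clusters
new_centers = update_centers(pixels, assignment)
```

With thousands of pixels, each row of `attention` spreads its weight very thinly, which is the sparse-update issue described above; the clustering direction instead guarantees that every pixel contributes a total weight of one to the centers.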



The transformer decoder is then modified to solve these problems with a clustering-based approach. One method is a residual path between cluster assignments: the transformer decoders are stacked on top of each other, and a residual connection is added between their clustering results. The first problem, the sparse query update, is solved by combining the proposed cluster center update with the original cross-attention. The second problem is solved by using the clustering result to update the pixel features with the cluster center features.
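A rough sketch of how these three pieces could fit together in one decoder layer follows. It is an illustration under simplifying assumptions, not the paper's implementation: affinities are dot products, learned projections and normalization layers are omitted, and `decoder_layer` is a hypothetical name.

```python
import math


def softmax(values):
    """Numerically stable softmax over a list of numbers."""
    peak = max(values)
    exps = [math.exp(v - peak) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def decoder_layer(pixels, centers, prev_logits=None):
    """One simplified clustering decoder layer.

    Returns updated pixel features, updated cluster centers, and the
    clustering logits that feed the next layer's residual path.
    """
    n, k, dim = len(pixels), len(centers), len(pixels[0])
    raw = [[dot(p, c) for c in centers] for p in pixels]  # N x K

    # Residual path: add the previous layer's clustering logits.
    if prev_logits is None:
        logits = raw
    else:
        logits = [[a + b for a, b in zip(row, prev)]
                  for row, prev in zip(raw, prev_logits)]

    assign = [softmax(row) for row in logits]                     # N x K
    attn = [softmax([raw[i][j] for i in range(n)]) for j in range(k)]  # K x N

    # Centers: keep the old value, add the cross-attention update and
    # the cluster center update (both assumed to be weighted sums).
    new_centers = [
        [centers[j][d]
         + sum(attn[j][i] * pixels[i][d] for i in range(n))
         + sum(assign[i][j] * pixels[i][d] for i in range(n))
         for d in range(dim)]
        for j in range(k)
    ]

    # Pixels: add the centers they are assigned to, weighted by assignment.
    new_pixels = [
        [pixels[i][d] + sum(assign[i][j] * centers[j][d] for j in range(k))
         for d in range(dim)]
        for i in range(n)
    ]
    return new_pixels, new_centers, logits


# Toy data (assumed for illustration): stack two layers.
pixels = [[1.0, 0.0], [0.8, 0.2], [0.1, 0.9]]
centers = [[0.5, 0.5], [0.2, 0.7]]
logits = None
for _ in range(2):
    pixels, centers, logits = decoder_layer(pixels, centers, logits)
```

Because the logits are passed from layer to layer, each stacked decoder refines the previous clustering result instead of starting from scratch, and the pixel features get an update at every layer rather than only once.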

In conclusion, the CMT-DeepLab framework significantly improves panoptic segmentation while simplifying the pipeline through an end-to-end design. It raises prediction quality by using mask transformers, rethinks object queries as cluster centers, and integrates the cluster center update, which significantly enriches the learned cross-attention maps and further facilitates segmentation prediction.

This article is a summary written by Marktechpost Staff based on the research paper 'CMT-DeepLab: Clustering Mask Transformers for Panoptic Segmentation'. All credit for this research goes to the researchers on this project. Check out the paper and reference article.


James G. Williams