Meta AI researchers update their machine learning-based image segmentation models for better virtual backgrounds in video calls and the metaverse

While video chatting with colleagues or family, many of us have become accustomed to using virtual backgrounds and background filters. These features offer more control over the environment, reduce distractions, preserve the privacy of those around us, and can even liven up presentations and virtual meetings. However, background filters don’t always work as expected, or work well for everyone.

Image segmentation is a computer vision technique for separating the different components of a photo or video. It is widely used to enhance background blur, virtual backgrounds, and other augmented reality (AR) effects. Even with advanced algorithms, achieving highly accurate segmentation of people remains difficult.

The model used for image segmentation tasks should be highly consistent and lag-free. Inefficient algorithms can lead to poor user experiences. For example, during a videoconference, artifacts generated by faulty segmentation output can easily confuse people using virtual background programs. More importantly, segmentation errors can lead to unwanted exposure of people’s physical environments when background effects are applied.

Recent research by Meta AI and Reality Labs presents an improved AI model for image segmentation. New segmentation models are in production for real-time video calling in Spark AR across different surfaces on Portal, Messenger, and Instagram.

These algorithms are more efficient, robust, and versatile, improving the quality and consistency of the background filter effects. The improved segmentation models can now handle multiple people and their full bodies, as well as people partially occluded by an object such as a couch, desk, or table. Beyond video calls, better segmentation can also add new dimensions to augmented and virtual reality (AR/VR) by merging virtual settings with real-world people and objects.

The researchers use FBNetV3 as the backbone of their model, within an encoder-decoder architecture that merges layers of the same spatial resolution. The resulting design is found via neural architecture search and is highly tuned for on-device performance. Pairing this heavyweight encoder with a lightweight decoder outperforms a symmetric architecture in terms of quality.
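
As a rough illustration of this asymmetric design, the sketch below pairs a stand-in heavyweight encoder with a lightweight decoder that merges features of matching spatial resolution. The real model uses an FBNetV3 backbone found by neural architecture search; the class and layer choices here are hypothetical, not Meta's architecture.

```python
# A minimal sketch of a heavyweight-encoder / lightweight-decoder segmenter
# that fuses same-resolution features via skip connections. Placeholder
# architecture, not the FBNetV3-based production model.
import torch
import torch.nn as nn
import torch.nn.functional as F

def conv_bn_relu(cin, cout, stride=1):
    return nn.Sequential(
        nn.Conv2d(cin, cout, 3, stride=stride, padding=1, bias=False),
        nn.BatchNorm2d(cout),
        nn.ReLU(inplace=True),
    )

class PersonSegmenter(nn.Module):
    def __init__(self):
        super().__init__()
        # "Heavyweight" encoder: progressively downsample and widen channels.
        self.enc1 = conv_bn_relu(3, 32, stride=2)    # 1/2 resolution
        self.enc2 = conv_bn_relu(32, 64, stride=2)   # 1/4 resolution
        self.enc3 = conv_bn_relu(64, 128, stride=2)  # 1/8 resolution
        # "Lightweight" decoder: cheap convs that fuse same-resolution features.
        self.dec2 = conv_bn_relu(128 + 64, 64)
        self.dec1 = conv_bn_relu(64 + 32, 32)
        self.head = nn.Conv2d(32, 1, 1)  # per-pixel person/background logit

    def forward(self, x):
        e1 = self.enc1(x)
        e2 = self.enc2(e1)
        e3 = self.enc3(e2)
        # Upsample and merge with the encoder feature of the same resolution.
        d2 = self.dec2(torch.cat([F.interpolate(e3, size=e2.shape[-2:]), e2], dim=1))
        d1 = self.dec1(torch.cat([F.interpolate(d2, size=e1.shape[-2:]), e1], dim=1))
        return F.interpolate(self.head(d1), size=x.shape[-2:])  # full-resolution mask logits

mask_logits = PersonSegmenter()(torch.randn(1, 3, 256, 256))  # -> (1, 1, 256, 256)
```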

The team used Meta AI’s ClusterFit model to sample a wide range of examples from the dataset, covering variation in gender, skin tone, age, body pose, movement, background complexity, and number of people, to name a few. To increase the volume of training data, they used a high-capacity offline PointRend model to generate pseudo ground-truth labels for the unannotated data.
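
A minimal sketch of this pseudo-labeling step, assuming the high-capacity offline teacher (PointRend in the original work) is available as a `teacher(images) -> logits` callable; the loader, confidence threshold, and filtering rule below are illustrative assumptions, not the team's pipeline.

```python
# Generate pseudo ground-truth masks for unannotated frames with an offline
# teacher model, keeping only frames the teacher is confident about.
import torch

@torch.no_grad()
def generate_pseudo_labels(teacher, unlabeled_loader, threshold=0.9):
    pseudo_dataset = []
    teacher.eval()
    for images in unlabeled_loader:               # images: (B, 3, H, W) tensors
        probs = torch.sigmoid(teacher(images))    # per-pixel person probability
        masks = (probs > 0.5).float()             # hard pseudo ground-truth masks
        # Fraction of pixels where the teacher is confidently "person" or
        # confidently "background"; low-confidence frames are dropped so they
        # do not pollute training.
        confident = ((probs > threshold) | (probs < 1 - threshold)).float().mean(dim=(1, 2, 3))
        for img, mask, c in zip(images, masks, confident):
            if c > 0.95:
                pseudo_dataset.append((img, mask))
    return pseudo_dataset
```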

Real-time models normally contain a tracking mode that relies on temporal information, which can introduce frame-by-frame prediction inconsistencies; metrics computed on static images therefore do not correctly reflect the quality of a real-time model. To measure quality in real time, the team created a quantitative video evaluation framework that runs model inference on every frame and calculates metrics per frame. Optimizing against this framework greatly improved temporal consistency.
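
A minimal sketch of per-frame video evaluation, assuming predicted and ground-truth binary masks exist for every frame; the two metrics shown (per-frame IoU and an IoU-based consistency score between consecutive predictions) are illustrative choices, not the exact metrics from the paper.

```python
# Evaluate a video by scoring every frame, plus a simple temporal-consistency
# measure that penalizes flickering between consecutive predictions.
import numpy as np

def iou(a, b, eps=1e-6):
    inter = np.logical_and(a, b).sum()
    union = np.logical_or(a, b).sum()
    return (inter + eps) / (union + eps)

def evaluate_video(pred_masks, gt_masks):
    """pred_masks, gt_masks: lists of HxW boolean arrays, one per frame."""
    frame_iou = [iou(p, g) for p, g in zip(pred_masks, gt_masks)]
    # Agreement between consecutive predictions: low values indicate flicker
    # even when each individual frame looks plausible.
    temporal = [iou(pred_masks[i], pred_masks[i + 1]) for i in range(len(pred_masks) - 1)]
    return {"mean_iou": float(np.mean(frame_iou)),
            "temporal_consistency": float(np.mean(temporal))}
```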

The researchers evaluated the performance of their model across different groups of people. For this, they annotated the evaluation videos with metadata from over 100 classes (across more than 30 categories), including three skin tone groups and two apparent gender categories. The results show that the model performs consistently across both perceived skin tone and apparent gender categories. Despite slight variance between categories, the model still performs well across all subcategories.
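
A minimal sketch of breaking scores down by subgroup metadata; the data structure and field names are hypothetical and only illustrate how per-subgroup comparisons of this kind can be computed.

```python
# Group per-video IoU scores by metadata tags so gaps between subgroups are
# visible at a glance.
from collections import defaultdict
from statistics import mean

def score_by_subgroup(results):
    """results: list of dicts like {"iou": 0.94, "skin_tone": "III", "gender": "F"}."""
    buckets = defaultdict(list)
    for r in results:
        for category in ("skin_tone", "gender"):
            buckets[(category, r[category])].append(r["iou"])
    return {key: mean(scores) for key, scores in buckets.items()}

print(score_by_subgroup([
    {"iou": 0.95, "skin_tone": "III", "gender": "F"},
    {"iou": 0.93, "skin_tone": "VI", "gender": "M"},
]))
```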

A standard deep learning model resamples each image into a small square as the input to the network. This resampling introduces distortions, and because the source images have varying aspect ratios, the distortions vary as well. The network consequently learns low-level features that do not hold up under different aspect ratios, and these limitations are amplified in segmentation applications. To solve this, the team adopts the aspect ratio-dependent resampling approach used in Detectron2: images with similar aspect ratios are grouped together and resized to the same size.
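
A minimal sketch of the grouping idea, assuming raw image tensors of arbitrary sizes: images are bucketed by orientation and each bucket is resized to one common size. This loosely mirrors Detectron2's aspect-ratio grouping rather than reproducing its implementation, and the target sizes are arbitrary choices.

```python
# Batch images by aspect ratio so each group is resized to a size with a
# similar ratio, instead of squashing everything into one square.
import torch
import torch.nn.functional as F

def batch_by_aspect_ratio(images, landscape_size=(288, 512), portrait_size=(512, 288)):
    """images: list of (3, H, W) tensors with arbitrary sizes."""
    groups = {"landscape": [], "portrait": []}
    for img in images:
        _, h, w = img.shape
        groups["landscape" if w >= h else "portrait"].append(img)

    batches = []
    for key, members in groups.items():
        if not members:
            continue
        size = landscape_size if key == "landscape" else portrait_size
        # Members of a group share a similar aspect ratio, so resizing them to
        # one common size introduces smaller, more uniform distortion.
        resized = [F.interpolate(img.unsqueeze(0), size=size, mode="bilinear",
                                 align_corners=False) for img in members]
        batches.append(torch.cat(resized, dim=0))
    return batches
```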

Additionally, grouping photos with comparable aspect ratios requires padding for aspect ratio-dependent resampling. However, zero padding causes problems: as the network grows deeper, padding artifacts propagate to other locations. To remove these artifacts, the team uses replicate padding instead.
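
A minimal sketch contrasting zero padding with replicate padding in PyTorch; replicate padding repeats the border pixels instead of inserting zeros, so convolutions near the image edge do not see an artificial dark border that deeper layers can spread into visible artifacts.

```python
# Same convolution, two padding modes: zeros (default) vs. replicate.
import torch
import torch.nn as nn

zero_pad_conv = nn.Conv2d(3, 16, kernel_size=3, padding=1)                        # pads with zeros
replicate_pad_conv = nn.Conv2d(3, 16, kernel_size=3, padding=1,
                               padding_mode="replicate")                           # repeats edge pixels

x = torch.rand(1, 3, 128, 128)
print(zero_pad_conv(x).shape, replicate_pad_conv(x).shape)  # same shape, different border behavior
```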

The researchers note that for AR segmentation applications, producing smooth and unambiguous boundaries is crucial. They therefore add a boundary-weighted loss on top of the standard cross-entropy loss for segmentation. To do so, they use the Boundary IoU method to extract boundary regions from both the ground truth and the prediction and compute a cross-entropy loss over these regions. The model trained with the boundary cross-entropy greatly outperforms the baseline. Their results suggest that, in addition to sharpening the boundary regions in the final mask output, the new design produces fewer false positives.
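
A minimal sketch of a boundary-weighted loss in the spirit of Boundary IoU, where the boundary band of a mask is approximated with a max-pooling dilation/erosion trick; the band width and weighting scheme are assumptions, not the paper's exact formulation.

```python
# Standard binary cross-entropy everywhere, plus extra weight inside the
# boundary bands of the ground-truth and predicted masks.
import torch
import torch.nn.functional as F

def boundary_band(mask, d=3):
    """mask: (B, 1, H, W) binary tensor -> band of pixels within d of its contour."""
    kernel = 2 * d + 1
    dilated = F.max_pool2d(mask, kernel, stride=1, padding=d)
    eroded = 1.0 - F.max_pool2d(1.0 - mask, kernel, stride=1, padding=d)
    return dilated - eroded  # 1 in the band around the contour, 0 elsewhere

def boundary_weighted_bce(logits, target, boundary_weight=2.0, d=3):
    per_pixel = F.binary_cross_entropy_with_logits(logits, target, reduction="none")
    pred_mask = (torch.sigmoid(logits) > 0.5).float()
    band = torch.clamp(boundary_band(target, d) + boundary_band(pred_mask, d), max=1.0)
    weights = 1.0 + boundary_weight * band
    return (weights * per_pixel).mean()

loss = boundary_weighted_bce(torch.randn(1, 1, 64, 64),
                             torch.randint(0, 2, (1, 1, 64, 64)).float())
```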

These models are trained offline with PyTorch before being deployed in production with the Spark AR platform. The team used PyTorch Lite to optimize deep learning model inference on the device.
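
A minimal sketch of packaging a trained model for on-device inference with the PyTorch Lite interpreter; the model and file names are placeholders, and the actual Spark AR deployment pipeline is not described in detail by the source.

```python
# Trace, optimize, and export a model for the PyTorch Lite interpreter.
import torch
from torch.utils.mobile_optimizer import optimize_for_mobile

model = PersonSegmenter().eval()            # placeholder model (e.g. the sketch above)
example = torch.rand(1, 3, 256, 256)

scripted = torch.jit.trace(model, example)  # freeze the graph for a fixed input shape
mobile_ready = optimize_for_mobile(scripted)               # fuse ops, drop training-only modules
mobile_ready._save_for_lite_interpreter("segmenter.ptl")   # loadable by the Lite interpreter on device
```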

Related document: https://arxiv.org/pdf/2103.16562.pdf

Reference: https://ai.facebook.com/blog/creating-better-virtual-backdrops-for-video-calling-remote-presence-and-ar/

James G. Williams