Artificial intelligence (AI) researchers from Cornell University propose a new neural network framework to solve the video matting problem

Image and video editing are two of the most popular applications for computer users. With the advent of Machine Learning (ML) and Deep Learning (DL), image and video editing have been increasingly studied through a range of neural network architectures. Until very recently, most DL models for image and video editing were supervised: they required training data containing pairs of input and output examples from which to learn the desired transformation. More recently, end-to-end training frameworks have been proposed that require only a single image as input to learn the mapping to the desired edited output.

Video matting is a specific video editing task. The term “matting” dates back to the 19th century, when matte paintings on glass plates were placed in front of a camera during filming to create the illusion of an environment that was not present at the filming location. Today, compositing multiple digital images follows a similar procedure: a compositing formula blends the foreground and background intensities at each pixel, expressing the result as a linear combination of the two components.
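The linear combination mentioned above is usually written as C = alpha * F + (1 - alpha) * B, where alpha is the per-pixel opacity of the foreground. The following is a minimal NumPy sketch of that formula; the function name `composite` and the toy pixel values are illustrative assumptions, not taken from the paper.

```python
import numpy as np

def composite(foreground, background, alpha):
    """Blend foreground over background: C = alpha * F + (1 - alpha) * B."""
    foreground = np.asarray(foreground, dtype=np.float64)
    background = np.asarray(background, dtype=np.float64)
    alpha = np.asarray(alpha, dtype=np.float64)
    if alpha.ndim == foreground.ndim - 1:
        # Add a channel axis so one alpha value applies to all color channels.
        alpha = alpha[..., np.newaxis]
    return alpha * foreground + (1.0 - alpha) * background

# Toy 1x2 RGB image (hypothetical values): red foreground over blue background.
fg = np.array([[[1.0, 0.0, 0.0], [1.0, 0.0, 0.0]]])
bg = np.array([[[0.0, 0.0, 1.0], [0.0, 0.0, 1.0]]])
alpha = np.array([[1.0, 0.25]])  # first pixel opaque, second mostly transparent

result = composite(fg, bg, alpha)
```

Matting is the inverse of this operation: given only the composite C, recover alpha, F, and B, which is why the problem is under-constrained without additional priors.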

Although very powerful, this process has certain limitations. It requires an unambiguous factorization of the image into foreground and background layers, which are then assumed to be independently processable. In settings such as video matting, where the input is a sequence of frames that varies in both space and time, decomposing the layers becomes a complex task.

The paper aims to clarify this process and to make the decomposition more precise. The authors propose factor matting, a variant of the matting problem that factors a video into more independent components for downstream editing tasks. To solve this problem, they present FactorMatte, an easy-to-use framework that combines classic matting priors with conditional priors based on the expected deformations in a scene. The classic Bayesian formulation, which casts matting as maximum a posteriori (MAP) estimation, is extended to remove the limiting assumption that foreground and background are independent. Most existing approaches further assume that the background layers remain static over time, which is a serious limitation for most video footage.
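For reference, the classical Bayesian matting objective can be sketched as follows. This is the standard formulation from the earlier matting literature, not FactorMatte's own objective, and the symbols (C for the observed color, F and B for the foreground and background colors, α for opacity) are notation assumed here.

```latex
\begin{aligned}
(\hat{F}, \hat{B}, \hat{\alpha})
  &= \arg\max_{F, B, \alpha} \; P(F, B, \alpha \mid C) \\
  &= \arg\max_{F, B, \alpha} \; P(C \mid F, B, \alpha)\, P(F, B, \alpha) \\
  &= \arg\max_{F, B, \alpha} \; P(C \mid F, B, \alpha)\, P(F)\, P(B)\, P(\alpha)
\end{aligned}
```

The last step, which splits the joint prior into three separate terms, is the independence assumption. It is this step that the extended formulation drops in favor of priors conditioned on how the components are expected to interact.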

To overcome these limitations, FactorMatte relies on two modules: a decomposition network that factors the input video into one or more layers for each component, and a set of patch-based discriminators that represent conditional priors on each component. The architecture pipeline is described below.

The input to the decomposition network consists of a video and a coarse per-frame segmentation mask for the object of interest (left, yellow box). With this information, the network produces color and alpha layers (middle, green and blue boxes), trained with a reconstruction loss. The foreground layer models the foreground component (right, green box), while the environment layer and residual layer together model the background component (right, blue box). The environment layer represents the static-like aspects of the background, while the residual layer captures more irregular changes in the background component caused by interactions with foreground objects (such as the deformation of the pillow in the figure). For each of these layers, a discriminator is trained to learn the respective marginal priors.
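To make the role of the layers and the reconstruction loss concrete, here is a rough NumPy sketch of how predicted color and alpha layers could be recomposed into a frame and compared with the input. It assumes the standard back-to-front "over" operator, a layer order of environment, residual, foreground, and a mean absolute error; these choices and all names (`over`, `recompose`, `reconstruction_loss`) are assumptions for illustration and may differ from the authors' implementation.

```python
import numpy as np

def over(color, alpha, canvas):
    """Place one (color, alpha) layer on top of the current canvas."""
    a = alpha[..., np.newaxis]  # share one alpha across the color channels
    return a * color + (1.0 - a) * canvas

def recompose(layers):
    """Composite a list of (color, alpha) layers ordered back to front."""
    canvas = np.zeros_like(np.asarray(layers[0][0], dtype=np.float64))
    for color, alpha in layers:
        canvas = over(color, alpha, canvas)
    return canvas

def reconstruction_loss(frame, layers):
    """Mean absolute error between the input frame and the recomposed layers
    (assumed loss; the actual loss used in the paper may differ)."""
    return float(np.mean(np.abs(frame - recompose(layers))))

# Toy 1x2 RGB frame with hypothetical values.
env_color, env_alpha = np.full((1, 2, 3), 0.2), np.ones((1, 2))       # static background
res_color, res_alpha = np.zeros((1, 2, 3)), np.array([[0.0, 0.5]])    # e.g. a shadow
fg_color, fg_alpha = np.ones((1, 2, 3)), np.array([[1.0, 0.0]])       # object of interest

layers = [(env_color, env_alpha), (res_color, res_alpha), (fg_color, fg_alpha)]
frame = recompose(layers)
loss = reconstruction_loss(frame, layers)
```

In this toy setup the residual layer darkens only the second pixel, standing in for a background change caused by the foreground object, while the environment layer stays constant. A reconstruction loss alone does not determine which layer should explain which pixel, which is where the per-layer discriminators come in.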

The matting results for some selected samples are shown in the figure below.

Although FactorMatte is not perfect, its results are significantly more accurate than those of the baseline approach (OmniMatte). In all the given samples, the background and foreground layers are sharply separated, which cannot be said of the compared solution. In addition, ablation studies were conducted to demonstrate the effectiveness of the proposed solution.

This was a summary of FactorMatte, a new framework for tackling the video matting problem. If you are interested, you can find more information in the links below.

Check out the paper, code, and project page. All credit for this research goes to the researchers on this project.

Daniele Lorenzi received his M.Sc. in ICT for Internet and Multimedia Engineering in 2021 from the University of Padova, Italy. He is a Ph.D. candidate at the Institute of Information Technology (ITEC) at the Alpen-Adria-Universität (AAU) Klagenfurt. He currently works at the Christian Doppler Laboratory ATHENA, and his research interests include adaptive video streaming, immersive media, machine learning, and QoS/QoE assessment.

James G. Williams