UC Berkeley and Adobe AI researchers propose BlobGAN, a new unsupervised, mid-level representation for powerful scene manipulation

Since the advent of computer vision, one of the fundamental questions of the research community has been how to represent the incredible richness of the visual world. A concept that emerged early on is the importance of the scene in understanding objects. Suppose we want a classifier to distinguish between a sofa and a bed. In this case, the context of the scene provides information about the environment (e.g., whether the room is a living room or a bedroom) that could be useful for classification.

However, after years of research, scene images are still mainly represented in two ways: 1) top-down, where scene classes are represented with a single label in the same way as object classes, or 2) bottom-up, with semantic labels assigned to individual pixels. The main limitation of these two approaches is that neither represents the different parts of a scene as entities: in the first case, the different components are merged into a single label; in the second, the individual elements are pixels, not entities.

From the official video presentation | Source: https://arxiv.org/pdf/2205.02837.pdf

To fill this gap, researchers from UC Berkeley and Adobe Research have proposed BlobGAN, a new unsupervised, mid-level representation for generative scene models. Mid-level means that the representation is neither per-pixel nor per-image: the components of a scene are instead modeled as spatially localized, depth-ordered Gaussian blobs. Given random noise, the layout network, an 8-layer MLP, maps it to a collection of blob parameters, which are then splatted onto a spatial grid and passed to a StyleGAN2-like decoder. The model is trained in an adversarial framework with an unmodified StyleGAN2 discriminator.
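To make this pipeline concrete, below is a minimal PyTorch sketch of what such a layout network could look like. The layer widths, blob count, feature dimensions, and parameter ordering are illustrative assumptions, not the authors' exact implementation.

```python
import torch
import torch.nn as nn

class LayoutNetwork(nn.Module):
    """Maps random noise to parameters for k blobs (illustrative sketch, not the official code).

    Each blob gets: center (x, y), scale s, aspect ratio a, rotation angle theta,
    plus a structure feature vector and a style feature vector.
    """
    def __init__(self, noise_dim=512, k_blobs=10, feat_dim=256, hidden=1024, n_layers=8):
        super().__init__()
        layers, in_dim = [], noise_dim
        for _ in range(n_layers):
            layers += [nn.Linear(in_dim, hidden), nn.LeakyReLU(0.2)]
            in_dim = hidden
        self.mlp = nn.Sequential(*layers)
        # 5 geometric parameters + structure and style features per blob
        self.head = nn.Linear(hidden, k_blobs * (5 + 2 * feat_dim))
        self.k, self.feat_dim = k_blobs, feat_dim

    def forward(self, z):
        out = self.head(self.mlp(z)).view(z.shape[0], self.k, 5 + 2 * self.feat_dim)
        geometry = out[..., :5]                      # (x, y, s, a, theta) per blob
        structure = out[..., 5:5 + self.feat_dim]    # structure features
        style = out[..., 5 + self.feat_dim:]         # style features
        return geometry, structure, style

# Example: a batch of noise vectors -> blob parameters for 10 blobs
layout = LayoutNetwork()
geometry, structure, style = layout(torch.randn(4, 512))
```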

Source: https://arxiv.org/pdf/2205.02837.pdf

Specifically, blobs are represented as ellipses with center coordinates x, scale s, aspect ratio a, and rotation angle θ. Additionally, each blob is associated with two feature vectors, one for structure and one for style.
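As a hedged illustration of how these geometric parameters could be turned into a soft 2D mask, the sketch below rasterizes a single elliptical blob onto a coordinate grid with a sigmoid falloff. The exact falloff, sharpness constant, and parameterization used in the paper may differ.

```python
import torch

def blob_opacity(x, y, s, a, theta, grid_size=64, sharpness=10.0):
    """Rasterize one elliptical blob into a soft opacity map (illustrative sketch).

    (x, y): center in [0, 1]; s: scale; a: aspect ratio; theta: rotation angle.
    """
    ys, xs = torch.meshgrid(
        torch.linspace(0, 1, grid_size),
        torch.linspace(0, 1, grid_size),
        indexing="ij",
    )
    dx, dy = xs - x, ys - y
    # Rotate coordinates into the blob's frame, then squash one axis by the aspect ratio
    cos_t, sin_t = torch.cos(theta), torch.sin(theta)
    u = cos_t * dx + sin_t * dy
    v = (-sin_t * dx + cos_t * dy) / a
    dist = torch.sqrt(u ** 2 + v ** 2)
    # Opacity is ~1 inside the ellipse (dist < s) and falls off smoothly outside
    return torch.sigmoid(sharpness * (s - dist))

alpha = blob_opacity(torch.tensor(0.5), torch.tensor(0.5),
                     torch.tensor(0.2), torch.tensor(1.5), torch.tensor(0.3))
```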

Source: https://arxiv.org/pdf/2205.02837.pdf | From the official video presentation

Thus, the layout network maps the random noise to a fixed number k of blobs (the network can also decide to remove a blob by assigning it a very low scale parameter), each represented by four parameters (actually five, since the center is defined by both x and y coordinates) and two feature vectors. All the ellipses defined by these parameters are then splatted onto a grid that also has a depth dimension, alpha-composited in 2D (to handle occlusion and spatial relationships), and filled in using the information in the feature vectors. The resulting grid is then passed to the generator. In the original StyleGAN2, the generator takes a single input tensor containing all of this information, whereas in this work the first layers are modified to take the layout and the appearance separately. This design encourages a disentangled representation, as does the authors' choice to add uniform noise to the blob parameters before feeding them to the generator.
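The depth-ordered compositing step can be illustrated with the following sketch, which paints per-blob feature vectors into a 2D grid using standard "over" alpha compositing. The tensor shapes and the convention that a larger depth score means a blob is closer to the camera are assumptions for illustration.

```python
import torch

def splat_and_composite(opacities, features, depths):
    """Alpha-composite k blob masks into a single 2D feature grid (illustrative sketch).

    opacities: (k, H, W) soft blob masks in [0, 1]
    features:  (k, d) one feature vector per blob
    depths:    (k,) scores used to order blobs; larger is assumed to mean closer
    """
    k, h, w = opacities.shape
    d = features.shape[1]
    grid = torch.zeros(d, h, w)
    transmittance = torch.ones(h, w)        # how much of each pixel is still uncovered
    # Process blobs front-to-back so nearer blobs occlude farther ones
    for i in torch.argsort(depths, descending=True):
        weight = transmittance * opacities[i]         # (H, W) contribution of this blob
        grid += features[i].view(d, 1, 1) * weight    # paint this blob's features
        transmittance = transmittance * (1 - opacities[i])
    return grid

# Example with 10 blobs, a 64x64 grid, and 256-dim features (random values, shapes only)
grid = splat_and_composite(torch.rand(10, 64, 64), torch.randn(10, 256), torch.randn(10))
```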

The network defined above was trained with the LSUN scene dataset in an unsupervised fashion.
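For readers unfamiliar with adversarial training, the sketch below shows what one non-saturating GAN step on unlabeled scene images might look like. The decoder and discriminator interfaces here are hypothetical placeholders, and regularizers used in practice (such as R1) are omitted.

```python
import torch
import torch.nn.functional as F

def adversarial_step(layout_net, decoder, discriminator, real_images,
                     opt_g, opt_d, noise_dim=512):
    """One sketch of a non-saturating GAN training step (not the authors' training code).

    layout_net + decoder together form the generator; the discriminator is a
    standard image discriminator (e.g., StyleGAN2-style) returning logits.
    """
    b = real_images.shape[0]

    # --- Discriminator update ---
    z = torch.randn(b, noise_dim)
    fake_images = decoder(*layout_net(z)).detach()   # hypothetical decoder signature
    d_loss = (F.softplus(discriminator(fake_images)).mean()
              + F.softplus(-discriminator(real_images)).mean())
    opt_d.zero_grad(); d_loss.backward(); opt_d.step()

    # --- Generator update ---
    z = torch.randn(b, noise_dim)
    fake_images = decoder(*layout_net(z))
    g_loss = F.softplus(-discriminator(fake_images)).mean()
    opt_g.zero_grad(); g_loss.backward(); opt_g.step()
    return d_loss.item(), g_loss.item()
```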

Although unsupervised, thanks to the spatial uniformity of the blobs and the locality of the convolutions, the network learns to associate different blobs with different components of the scene. This is apparent from the presented results, computed with k=10 blobs. For a detailed visualization, the project page includes animations. The results are impressive, as can be seen in the image below: manipulating the blobs allows substantial yet precise modification of the generated image. For example, it is possible to empty a room (even though the model was never trained on images of empty rooms), to add, resize, and move entities, and even to restyle individual objects.
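Conceptually, these edits amount to modifying the blob parameters and re-running the decoder. The sketch below shows one hypothetical way to delete or move a blob; the indices, shift values, and the exact scale value that removes a blob are illustrative assumptions that depend on how the scale is parameterized.

```python
import torch

def edit_blobs(geometry, structure, style, remove_idx=None, move_idx=None, shift=(0.0, 0.0)):
    """Edit the blob representation before decoding (illustrative sketch).

    geometry: (k, 5) rows of (x, y, s, a, theta); structure/style: (k, d) features.
    Removing a blob means driving its scale very low; moving an object means
    shifting its center. The decoder then re-renders the edited scene.
    """
    geometry = geometry.clone()
    if remove_idx is not None:
        geometry[remove_idx, 2] = -10.0   # very low scale effectively deletes the blob
    if move_idx is not None:
        geometry[move_idx, 0] += shift[0]  # shift x
        geometry[move_idx, 1] += shift[1]  # shift y
    return geometry, structure, style

# Example: delete blob 3 and slide blob 5 to the right, then re-decode (decoder is hypothetical)
# image = decoder(*edit_blobs(geometry, structure, style, remove_idx=3, move_idx=5, shift=(0.2, 0.0)))
```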

Source: https://arxiv.org/pdf/2205.02837.pdf

In conclusion, while diffusion models have recently eclipsed GANs, this paper presents a new and disruptive technique for controlling generated scenes with unprecedented precision. Moreover, the training is entirely unsupervised, so no time needs to be spent labeling images.

This article is written as a summary article by Marktechpost Staff based on the paper 'BlobGAN: Spatially Disentangled Scene Representations'. All credit for this research goes to the researchers on this project. Check out the paper, github, and project page.


James G. Williams