Google AI researchers propose the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-fidelity photorealistic images and supports content-rich synthesis

The human brain can construct complex scenes from descriptions, whether spoken or written. Building systems that produce visuals from such descriptions could open up creative applications in many fields, be it art, design, or multimedia content development. Recent text-to-image research, such as DALL-E and CogView, has made significant advances in producing high-fidelity images, and has demonstrated generalization to previously unseen combinations of objects and concepts. Both treat the problem as language modeling, converting textual descriptions into visual tokens and then using existing sequence-to-sequence architectures such as Transformers to learn the relationship between language inputs and visual outputs.

Visual tokenization effectively unifies the treatment of text and images: both can be represented as sequences of discrete tokens and are therefore amenable to sequence-to-sequence modeling. To this end, DALL-E and CogView train GPT-like decoder-only language models on large collections of potentially noisy text-image pairs. Make-A-Scene extends this two-stage modeling approach to support scene-conditioned text-to-image generation.
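
The tokenization idea mentioned above can be illustrated with a toy vector-quantization step. This is a minimal sketch, not the ViT-VQGAN tokenizer itself: the codebook, patch embeddings, and dimensions are all invented for illustration, but the mechanism (replace each patch embedding with the index of its nearest codebook entry) is the core of how an image becomes a sequence of discrete tokens.

```python
# Toy vector quantization: each image-patch embedding is mapped to the id of
# its nearest codebook vector, turning an image into a token sequence.
# Codebook and patches are made-up 2-D values for illustration only.

def quantize(patch, codebook):
    """Return the index of the codebook entry closest to `patch` (squared L2)."""
    def sqdist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))
    return min(range(len(codebook)), key=lambda i: sqdist(patch, codebook[i]))

def tokenize_image(patches, codebook):
    """Map a list of patch embeddings to a sequence of discrete token ids."""
    return [quantize(p, codebook) for p in patches]

codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # 4-entry toy codebook
patches = [[0.1, -0.1], [0.9, 0.2], [0.4, 0.9]]              # 3 "patch embeddings"
print(tokenize_image(patches, codebook))  # each patch becomes one token id
```

Once an image is a token sequence like this, a language model can be trained on (text tokens, image tokens) pairs exactly as in machine translation.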

Considerable prior work exists on scaling large language models, alongside advances in discretizing images and audio. Some recent models instead forgo discrete image tokens in favor of diffusion models that generate images directly. Compared to earlier work, these models improve zero-shot Fréchet Inception Distance (FID) scores on MS-COCO and produce images of notably higher quality and aesthetic appeal. Still, since inputs in other modalities can be treated as language-like tokens, autoregressive models for text-to-image generation remain appealing. This study presents the Pathways Autoregressive Text-to-Image (Parti) model, which generates high-quality images from text descriptions, including photorealistic images, paintings, sketches, and more. The researchers show that scaling autoregressive models paired with a ViT-VQGAN image tokenizer is an effective technique for improving text-to-image generation, and that such models can capture and visually render rich, global information.

Parti is a Transformer-based sequence-to-sequence model, a core architecture for applications such as machine translation, speech recognition, conversational modeling, and image captioning. Parti feeds text tokens to an autoregressive encoder-decoder that predicts discrete image tokens. The image tokens come from the Transformer-based ViT-VQGAN image tokenizer, which delivers high-fidelity reconstructions with improved codebook usage.
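
The generation loop described above can be sketched in a few lines. This is a minimal illustration of greedy autoregressive decoding over discrete image tokens, not Parti's actual decoder: the scoring function below is a deterministic stub standing in for a real Transformer, and all names and sizes are invented.

```python
# Sketch of the two-stage idea: a seq2seq model autoregressively predicts
# discrete image token ids conditioned on text tokens; a detokenizer (not
# shown) would later turn those ids back into pixels. The "model" here is a
# stub scoring function, not a trained Transformer.

def next_token_logits(text_tokens, image_tokens, vocab_size):
    """Stub for the decoder: deterministic fake logits from the context."""
    seed = sum(text_tokens) + sum(image_tokens) + len(image_tokens)
    return [(seed * (i + 3)) % 7 for i in range(vocab_size)]

def generate_image_tokens(text_tokens, length, vocab_size):
    """Greedy autoregressive decoding: pick the argmax token at each step."""
    image_tokens = []
    for _ in range(length):
        logits = next_token_logits(text_tokens, image_tokens, vocab_size)
        image_tokens.append(max(range(vocab_size), key=lambda i: logits[i]))
    return image_tokens

tokens = generate_image_tokens(text_tokens=[5, 2, 9], length=4, vocab_size=8)
print(tokens)  # a sequence of discrete image token ids
```

In the real model, each predicted token conditions every subsequent prediction, which is exactly what lets the decoder integrate global information across the whole image.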

Text-to-image generation using Parti sequence-to-sequence autoregressive model (left) and ViT-VQGAN image tokenizer (right) | https://arxiv.org/pdf/2206.10789v1.pdf

Parti is conceptually simple: all of its components – encoder, decoder, and image tokenizer – are based on standard Transformers. This simplicity allows the researchers to build their models with standard methodologies and existing infrastructure. They scale Parti models up to 20B parameters to probe the limits of the two-stage text-to-image architecture, observing consistent gains in both text-image alignment and image quality. On MS-COCO, the 20B Parti model achieves a new state-of-the-art zero-shot FID score of 7.23 and a finetuned FID score of 3.22.
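
The FID numbers above measure the distance between Gaussian fits to Inception features of real and generated images: FID = ||μ₁ − μ₂||² + Tr(Σ₁ + Σ₂ − 2(Σ₁Σ₂)^½). For diagonal covariances the trace term reduces to the sum of squared differences of per-dimension standard deviations, which makes a toy computation easy. The statistics below are invented for illustration, not taken from the paper.

```python
# Fréchet Inception Distance between two Gaussians with diagonal covariances:
# FID = ||mu1 - mu2||^2 + sum_i (s1_i - s2_i)^2, where s_i are per-dimension
# standard deviations. All statistics here are made up for illustration.

def fid_diagonal(mu1, sigma1, mu2, sigma2):
    mean_term = sum((a - b) ** 2 for a, b in zip(mu1, mu2))
    cov_term = sum((a - b) ** 2 for a, b in zip(sigma1, sigma2))
    return mean_term + cov_term

mu_real, sd_real = [0.0, 1.0], [1.0, 2.0]   # hypothetical "real image" stats
mu_fake, sd_fake = [0.5, 1.5], [1.5, 1.5]   # hypothetical "generated" stats
print(fid_diagonal(mu_real, sd_real, mu_fake, sd_fake))  # → 1.0
```

Lower FID means the generated-image feature distribution is closer to that of real images, which is why the drop to 7.23 (zero-shot) and 3.22 (finetuned) marks a substantial improvement.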

The main contributions of the researchers in this article are as follows:

  • Show that autoregressive models can achieve state-of-the-art performance, with a zero-shot FID of 7.23 and a finetuned FID of 3.22 on MS-COCO, and a zero-shot FID of 15.97 and a finetuned FID of 8.39 on Localized Narratives.
  • Scale matters: the largest Parti model is best at generating high-fidelity photorealistic images and supporting content-rich synthesis.
  • Introduce a comprehensive new benchmark, PartiPrompts (P2), that sets a new standard for probing the limitations of text-to-image generation models.

A PyTorch implementation of Parti is available on GitHub.

This article was written as a summary by Marktechpost staff based on the paper 'Scaling Autoregressive Models for Content-Rich Text-to-Image Generation'. All credit for this research goes to the researchers on this project. Check out the paper and GitHub repo.


James G. Williams