Artificial intelligence researchers develop “CogView2”, a text-to-image generation system up to 10 times faster than CogView

This article is based on the research paper 'CogView2: Faster and Better Text-to-Image Generation via Hierarchical Transformers'. All credit for this research goes to the researchers.


Natural language processing and computer vision have become popular research disciplines in recent years, thanks to the rise of artificial intelligence and deep learning. Many researchers have focused their attention on text-to-image generation as a fundamental problem in the field. Text-to-image generation is the task of creating a realistic image from a textual description, which requires handling the ambiguous and incomplete information in natural-language descriptions. It is a driving force behind cross-modal learning and cross-modal generation, with applications as diverse as cross-modal information retrieval, photo editing, and computer-aided design. Large-scale pre-trained transformers, such as DALL-E and CogView, have significantly advanced text-to-image generation. However, these models suffer from flaws such as slow generation, expensive high-resolution training, and unidirectional (one-way) token generation.

With its stunningly hyper-realistic images, the cutting-edge DALL-E 2 model recently released by OpenAI has captured the attention of mainstream media around the world. Yet slow generation and high training costs at high resolution limit high-performance autoregressive models such as DALL-E and CogView. Additionally, the unidirectional token generation mechanism of these models differs from that of vision transformers (ViTs), which limits their application to classical visual tasks such as image classification and object detection.

A pre-trained Cross-Modal General Language Model (CogLM) is used for efficient prediction of both text and image tokens. When fine-tuned for fast super-resolution, the resulting hierarchical text-to-image system, CogView2, generates images of comparable resolution and quality up to 10 times faster than CogView.

In NLP, the General Language Model (GLM) proposed replacing one-way mask prediction with blockwise autoregressive generation. Building on this idea, the Cross-Modal General Language Model (CogLM) is a simpler, general language model for both text and image data. However, part of GLM's design is unnecessary for images: the sizes of masked image regions are fixed, so there is no need to handle blocks of indeterminate length as in NLP. Additionally, GLM inserts a sentinel token for each masked region to predict its first token, which lengthens the sequence and prevents the use of 2D local attention.

The project aims to provide a simple and general language model that combines autoregressive generation with bidirectional, context-aware mask prediction for text and image data. At the pre-training stage, the researchers use a unified tokenizer (ICE Tokenizer, icetk) for images, Chinese, and English to produce bilingual text and image tokens. They scale the model up to six billion parameters, using a transformer with Sandwich LayerNorm as the CogLM backbone. A versatile masking strategy is designed and applied to improve model performance and to allow fine-tuning for various downstream tasks.
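To make the combination of autoregressive generation and bidirectional context concrete, here is a minimal, illustrative sketch (not the paper's implementation) of an attention mask in which tokens outside a masked region attend bidirectionally to all unmasked context, while tokens inside the region are generated left to right. The single contiguous region and the index conventions are assumptions for illustration only.

```python
# Illustrative attention mask: bidirectional over unmasked context,
# autoregressive inside a single masked region. Not the paper's exact scheme.
import torch

def masked_region_attention_mask(seq_len: int, mask_start: int, mask_end: int) -> torch.Tensor:
    allowed = torch.zeros(seq_len, seq_len, dtype=torch.bool)
    context = torch.ones(seq_len, dtype=torch.bool)
    context[mask_start:mask_end] = False
    # Every token may attend to the bidirectional (unmasked) context.
    allowed[:, context] = True
    # Inside the masked region, token i may also attend to itself and to
    # earlier masked tokens, giving autoregressive generation within the region.
    for i in range(mask_start, mask_end):
        allowed[i, mask_start:i + 1] = True
    return allowed
```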

The team also presents the CogView2 system, which enables text-guided interactive image editing and fast super-resolution. Its hierarchical generation mechanism is summarized as follows (a minimal sketch of the pipeline follows the list):

  1. First, the pre-trained CogLM is used to generate a batch of low-resolution images (20×20 tokens in CogView2); then, optionally, bad samples are filtered out using CogLM image-captioning perplexity, which is the post-selection approach from CogView.
  2. A direct super-resolution module, fine-tuned from the pre-trained CogLM, maps the resulting images directly to 60×60-token images. Local attention, implemented with a custom CUDA kernel, is used to reduce the training cost. This step often produces high-resolution images with inconsistent textures and a lack of detail.
  3. Another iterative super-resolution module, fine-tuned from the pre-trained CogLM, refines these high-resolution images. Most tokens are re-masked and regenerated with a local parallel autoregressive (LoPAR) method, which is significantly faster than conventional autoregressive generation.
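The sketch below outlines that three-stage pipeline, assuming the four model calls (low-resolution CogLM sampling, caption-perplexity scoring, direct super-resolution, and iterative super-resolution) are available as callables; the function names and signatures are hypothetical, not the authors' API.

```python
# A minimal sketch of the CogView2 generation pipeline described above.
# All callables are hypothetical placeholders, not the released API.
from typing import Callable, List

def cogview2_pipeline(
    text: str,
    generate_low_res: Callable[[str, int], List["Tokens"]],   # CogLM sampling of 20x20-token images
    caption_perplexity: Callable[[str, "Tokens"], float],      # post-selection score
    direct_sr: Callable[["Tokens"], "Tokens"],                 # 20x20 -> 60x60 tokens
    iterative_sr: Callable[["Tokens"], "Tokens"],              # LoPAR refinement
    batch_size: int = 8,
    keep: int = 2,
) -> List["Tokens"]:
    # 1. Sample a batch of low-resolution token images and keep the candidates
    #    with the lowest image-captioning perplexity under CogLM (post-selection).
    candidates = generate_low_res(text, batch_size)
    candidates = sorted(candidates, key=lambda toks: caption_perplexity(text, toks))[:keep]
    # 2. Map each low-resolution token image directly to a 60x60-token image.
    upsampled = [direct_sr(toks) for toks in candidates]
    # 3. Re-mask and regenerate most tokens with the iterative module to recover detail.
    return [iterative_sr(toks) for toks in upsampled]
```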

The researchers trained on 30 million text-image pairs and ran comprehensive experiments comparing the proposed CogView2 with popular baselines such as DALL-E-2 and XMC-GAN.

The results show that fine-tuning CogLM on the MS-COCO dataset significantly improves performance on the FID metric. The resulting CogView2 text-to-image system generates images of better quality and resolution roughly ten times faster than CogView, achieving results comparable to state-of-the-art baselines such as DALL-E-2.

Plug-in improved techniques for transformers

Cluster sampling

The sampling strategy applied to the predicted token distribution is essential in autoregressive generation. The most widespread strategies, top-k and top-p (nucleus) sampling, suffer from an incomplete-truncation problem.

The image-token vocabulary is learned by a VQ-VAE, in which the embeddings of some tokens are quite similar. A vocabulary of 20,000 tokens is used to express frequent patterns at a finer granularity, about three times larger than in previous work, which makes the situation worse. In icetk, for example, there are about 42 tokens that are basically “white” and show only tiny differences when combined with other tokens. Although the summed probability of these “white” tokens can be large, top-k sampling may filter most of them out individually. The problem is illustrated in Figure 5 of the paper.
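As a hedged illustration of how clustering can address this (the exact procedure in the paper may differ), the sketch below sums token probabilities within precomputed embedding clusters before truncation, so a group of near-duplicate tokens survives top-k as a whole, and then samples a concrete token inside the selected cluster. The `cluster_ids` tensor is an assumed precomputed mapping, e.g. from k-means over the VQ-VAE embeddings.

```python
# Cluster-level top-k truncation followed by in-cluster token sampling.
import torch

def clustered_top_k_sample(logits: torch.Tensor, cluster_ids: torch.Tensor, k: int = 200) -> torch.Tensor:
    probs = torch.softmax(logits, dim=-1)                                  # (vocab,)
    n_clusters = int(cluster_ids.max().item()) + 1
    # Sum the probability mass per cluster so that near-duplicate tokens
    # (e.g. the ~42 "white" tokens) are not truncated individually.
    cluster_probs = torch.zeros(n_clusters).scatter_add_(0, cluster_ids, probs)
    top_p, top_c = torch.topk(cluster_probs, k=min(k, n_clusters))
    chosen_cluster = top_c[torch.multinomial(top_p / top_p.sum(), 1)]
    # Draw a concrete token from the chosen cluster, proportional to its
    # original (un-truncated) probability.
    in_cluster = (cluster_ids == chosen_cluster)
    token_probs = probs * in_cluster
    return torch.multinomial(token_probs / token_probs.sum(), 1)
```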

Upweighting textual attention

In CogLM’s huge training data, many text-image pairs are only weakly related. Even a model that fits the data perfectly would therefore still have a non-negligible chance of producing images irrelevant to the text. To reinforce this relevance, the researchers exploit the interpretability of the attention operation: a constant c is added to all attention scores from image tokens to text tokens. This technique adds negligible computation but considerably increases the consistency between the generated images and the text; in practice, c does not noticeably affect image quality.
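A minimal sketch of this bias is shown below, assuming a single-head attention layout in which the first `num_text_tokens` positions of the sequence are text tokens; it illustrates the described idea, not the authors' implementation.

```python
# Add a constant c to the attention scores that image-token queries assign
# to text-token keys before the softmax, biasing generation toward the text.
import torch

def attend_with_text_bias(q, k, v, num_text_tokens: int, c: float = 2.0):
    # q, k, v: (seq_len, dim); the first num_text_tokens positions are text.
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)   # (seq, seq)
    scores[num_text_tokens:, :num_text_tokens] += c           # image -> text scores
    return torch.softmax(scores, dim=-1) @ v
```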

Local attention

Locality is one of the essential characteristics of visual data. Before ViTs, visual computing was dominated by local operations such as convolution. Even in ViTs, interactions between nearby tokens receive most of the attention. The researchers found that fine-tuning the pre-trained CogLM with local and textual attention is largely consistent with the global attention weights learned during pre-training. However, 2D local attention cannot be implemented efficiently with a high-level framework such as PyTorch. A custom CUDA kernel was therefore developed that supports 2D local attention, 2D autoregressive local attention, and cross-resolution local attention. Local attention with a 9×9 kernel is used in the super-resolution modules; it is about 40 times faster than global attention and uses only a small fraction of its memory on a 4,096-token sequence with a hidden size of 64 per head.
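For reference, a dense-mask sketch of 2D local attention over an H×W token grid is given below: each query attends only to keys within a 9×9 window centred on it. The authors implement this with a custom CUDA kernel, so this PyTorch version only illustrates the attention pattern and would not reproduce the reported speed or memory savings.

```python
# Dense-mask reference implementation of 2D local attention on a token grid.
import torch

def local_attention_mask(height: int, width: int, kernel: int = 9) -> torch.Tensor:
    coords = torch.stack(torch.meshgrid(
        torch.arange(height), torch.arange(width), indexing="ij"), dim=-1).reshape(-1, 2)
    # A key is visible to a query if both row and column offsets fit in the window.
    diff = (coords[:, None, :] - coords[None, :, :]).abs()
    return (diff <= kernel // 2).all(dim=-1)                   # (H*W, H*W) bool mask

def local_attention(q, k, v, height: int, width: int, kernel: int = 9):
    # q, k, v: (H*W, dim) token features laid out row-major on the grid.
    scores = q @ k.transpose(-1, -2) / (q.shape[-1] ** 0.5)
    mask = local_attention_mask(height, width, kernel).to(scores.device)
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```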

Overall, this work uses hierarchical transformers to help autoregressive text-to-image models overcome the difficulties of slow generation and high complexity, and to bridge the gap between text-to-image pre-training and vision transformers.

Article: https://arxiv.org/pdf/2204.14217.pdf

Github: https://github.com/thudm/cogview2

James G. Williams