Cambridge AI researchers propose ‘MAGIC’: a training-free framework that incorporates visual controls into the generation of a language model

This article is based on the research paper 'Language Models Can See: Plugging Visual Controls in Text Generation'. All credit for this research goes to the researchers.


The release of GPT-2 brought wide attention to generative language models (LMs), which are pre-trained on massive amounts of unstructured text and have produced strong results on a variety of NLP tasks. Given a text prompt, an LM can continue the text using next-token-prediction decoding. Meanwhile, models such as CLIP and ALIGN, pre-trained on joint image-text data, have revived the learning of multimodal representations of text and images. A natural question is how to combine the strengths of pre-trained LMs and image-text embedding models to generate visually grounded text. Traditional approaches are generally limited by object detectors trained on a fixed set of labels. The previous state of the art in zero-shot image captioning, ZeroCap, is an unsupervised technique that combines a frozen CLIP with GPT-2, but it relies on gradient updates and optimization over the context cache at inference time, which slows down inference and makes the method difficult to use in real-world scenarios.

This research addresses the challenge with a new decoding method called iMAge-Guided text generatIon with CLIP (MAGIC). MAGIC uses explicit "control knobs" to select the desired outputs, steered by the frozen GPT-2 and CLIP models, and it requires no additional training of parameters. The approach introduces a new term, the magic score, which encourages the predicted output to convey information close to a given image. Experimental results show that this framework enables zero-shot image captioning and visually grounded story generation in a simple plug-and-play manner. The method is tested on two widely used benchmarks, MS-COCO and Flickr30k, where it outperforms all unsupervised and weakly supervised baselines and achieves state-of-the-art (SOTA) results on several evaluation metrics.
Moreover, because the proposed method requires no gradient updates, its inference speed is roughly 27 times faster than that of the previous zero-shot image captioning SOTA. The approach is also evaluated on visually grounded story generation, where it produces stories of higher quality than strong baselines under both human and machine evaluations.
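The core idea above, re-ranking the language model's candidate tokens with an image-similarity term instead of running gradient updates, can be sketched as a single selection step. This is an illustrative sketch, not the authors' code: the function name, the toy inputs, and the weighting constants are assumptions; the real system derives `lm_probs` from a frozen GPT-2, `clip_sims` from CLIP, and `hist_sims` from the degeneration penalty of contrastive search.

```python
import numpy as np

def magic_select(lm_probs, cand_ids, clip_sims, hist_sims, alpha=0.6, beta=2.0):
    """Pick the next token among top-k candidates with a MAGIC-style score.

    lm_probs:  model confidence p(v | x_<t) for each candidate (shape [k])
    cand_ids:  the candidate token ids
    clip_sims: image-text similarity of each candidate continuation (CLIP-like)
    hist_sims: max similarity of each candidate to previously generated tokens
               (the degeneration penalty used by contrastive search)
    """
    # Normalise the image similarities over the candidate set so they act as
    # a probability-like "magic score".
    magic = np.exp(clip_sims) / np.exp(clip_sims).sum()
    # Balance model confidence, the degeneration penalty, and visual control.
    score = (1 - alpha) * lm_probs - alpha * hist_sims + beta * magic
    return cand_ids[int(np.argmax(score))]
```

With `beta = 0` this reduces to plain contrastive search; a positive `beta` pulls decoding toward tokens whose continuation CLIP rates as closer to the image, with no training or gradient steps involved.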


Figure 1 shows a visual comparison of the proposed technique with two other strong zero-shot baselines and the reference caption. The results show that the proposed approach produces fluent captions while being more faithful to the given image.


Figure 2 contrasts the proposed method with the strongest baseline (contrastive search); the image retrieved from the story title is shown on the left. It demonstrates that the MAGIC framework generates text grounded in the visual concepts of the image.

This research proposes MAGIC Search, a new decoding methodology that steers the language model's decoding process in the desired visual direction. The method first adapts to the textual domain by fine-tuning the language model, unsupervised, on the text corpus of the end task. It then introduces the magic score, a new scoring criterion that injects visual controls into the decoding process. For each task, GPT-2 is fine-tuned for three epochs on the training text corpus. The proposed approach is compared against top-k sampling with k = 40, nucleus sampling with p = 0.95, a CLIP-based retrieval method called CLIPRe, and the state-of-the-art ZeroCap approach. The methods are evaluated with performance measures such as BLEU-1 (B@1), BLEU-4 (B@4), METEOR (M), ROUGE-L (RL), CIDEr, and SPICE. In addition, the average inference time per test instance is used to measure decoding speed.
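The overall decoding loop described above, propose top-k candidates with the LM, re-rank them with the magic score, append the winner, and repeat, can be sketched end to end. Everything here is a toy stand-in under stated assumptions: `lm_next_probs` and `clip_sim` mimic a frozen GPT-2 and CLIP with deterministic dummy functions, the vocabulary is tiny, and the degeneration penalty of contrastive search is omitted for brevity.

```python
import numpy as np

VOCAB = 50  # toy vocabulary size

def lm_next_probs(prefix):
    """Stand-in for a frozen GPT-2: a deterministic pseudo-random
    next-token distribution conditioned on the prefix."""
    r = np.random.default_rng(hash(tuple(prefix)) % (2**32))
    p = r.random(VOCAB)
    return p / p.sum()

def clip_sim(token, image_id):
    """Stand-in for CLIP image-text similarity: reward tokens whose id
    is numerically close to the (toy) image id."""
    return -abs(token - image_id) / VOCAB

def magic_search(image_id, k=5, alpha=0.6, beta=1.0, steps=8):
    """MAGIC-style decoding: at each step, take the LM's top-k candidates
    and re-rank them with a CLIP-derived magic score. (The degeneration
    penalty of contrastive search is omitted here.)"""
    prefix = [0]  # BOS placeholder
    for _ in range(steps):
        p = lm_next_probs(prefix)
        cands = np.argsort(p)[-k:]  # top-k candidates by LM confidence
        sims = np.array([clip_sim(int(v), image_id) for v in cands])
        magic = np.exp(sims) / np.exp(sims).sum()  # normalised over candidates
        score = (1 - alpha) * p[cands] + beta * magic
        prefix.append(int(cands[np.argmax(score)]))
    return prefix[1:]
```

Because every step is a plain argmax over k candidates, the whole procedure is a single forward pass per token with no backpropagation, which is why the paper can report an inference speedup over gradient-based zero-shot captioning.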


The two main tasks of this research are image captioning and visually grounded story generation. This work proposes a new decoding approach, MAGIC, which integrates visual controls into language model generation. It is a framework that requires no training and allows the LM to tackle complex multimodal tasks without sacrificing decoding speed. Experimental results show that the proposed methodology outperforms prior state-of-the-art systems in both automatic and human evaluations. In the future, this generic architecture may be extended to modalities beyond text and image (e.g., audio and video).



James G. Williams